Windows (PowerShell), new terminal:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
python -m pip install PyYAML
macOS/Linux (bash):
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
python -m pip install PyYAML
List sources (from config.yaml):
python -m scraper_reporter sources list
Validate config file:
python -m scraper_reporter config validate
Preview a group (console only, no files):
python -m scraper_reporter run --group daily --preview
python -m scraper_reporter run --group br_daily --preview
python -m scraper_reporter run --group internships_daily --preview
python -m scraper_reporter run --group jobs_daily --preview
Run specific sources by ID (comma-separated):
python -m scraper_reporter run --sources bbc_tech_rss,hn_frontpage --preview
Save a Markdown report:
python -m scraper_reporter run --group jobs_daily --format md --out out
Save HTML and PDF reports:
python -m scraper_reporter run --group jobs_daily --format html --out out
python -m scraper_reporter run --group jobs_daily --format pdf --out out
PDF output uses ReportLab. If ReportLab raises errors on Python 3.13, either skip PDF or use Python 3.11 or 3.12.
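A pre-flight check can decide whether to request PDF output at all. The sketch below is an illustration, not part of scraper_reporter: `pdf_available` and `choose_formats` are hypothetical names, and the check only tests whether the `reportlab` package can be located, not whether rendering works on your interpreter.

```python
import importlib.util


def pdf_available(find_spec=importlib.util.find_spec):
    """Return True if the reportlab package can be located.

    Hypothetical helper: it checks importability only and does not
    prove that PDF rendering succeeds on this Python version.
    """
    return find_spec("reportlab") is not None


def choose_formats(want_pdf=True, find_spec=importlib.util.find_spec):
    """Pick report formats, dropping 'pdf' when ReportLab is missing."""
    formats = ["md", "html"]
    if want_pdf and pdf_available(find_spec):
        formats.append("pdf")
    return formats
```

The `find_spec` parameter exists only so the check can be exercised without installing or uninstalling ReportLab.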
Create an email draft (.eml) you can open in Outlook/Mail:
python -m scraper_reporter email draft --group jobs_daily --out out
Output files go to out/ and include today's date, e.g.:
out/report_YYYY-MM-DD_jobs_daily.md
out/report_YYYY-MM-DD_jobs_daily.html
out/report_YYYY-MM-DD_jobs_daily.pdf
out/email_YYYY-MM-DD_jobs_daily.eml
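If another script needs to locate today's output, the naming pattern above can be reproduced as follows. This is a minimal sketch inferred from the example file names only; `output_path` is a hypothetical helper, not a function exposed by scraper_reporter.

```python
from datetime import date
from pathlib import Path


def output_path(out_dir, group, fmt, day=None):
    """Build a path following the <prefix>_<YYYY-MM-DD>_<group>.<ext> pattern.

    Assumption: .eml drafts use the 'email' prefix and every other
    format uses 'report', as in the example file names.
    """
    day = day or date.today()
    prefix = "email" if fmt == "eml" else "report"
    return Path(out_dir) / f"{prefix}_{day.isoformat()}_{group}.{fmt}"
```

For example, `output_path("out", "jobs_daily", "md")` points at today's Markdown report for the jobs_daily group.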
To enable JSON-based (JS-free) scraping for BlackRock, ensure you have:
- scraper_reporter/sources/workday.py (WorkdaySource)
- cli.py updated to register type: workday
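Registering a source type usually amounts to mapping the `type` string from config.yaml to a class. The sketch below shows one way that could look. It is an assumption about the design, not the project's actual cli.py: `SOURCE_TYPES`, `register`, and `build_source` are hypothetical names, and the `WorkdaySource` here is a stand-in stub.

```python
# Hypothetical registry: maps a config `type` value to a source class.
SOURCE_TYPES = {}


def register(type_name):
    """Class decorator that records a source class under a type name."""
    def decorator(cls):
        SOURCE_TYPES[type_name] = cls
        return cls
    return decorator


@register("workday")
class WorkdaySource:
    """Stand-in stub; the real class lives in sources/workday.py."""

    def __init__(self, source_id, config):
        self.source_id = source_id
        self.config = config


def build_source(source_id, config):
    """Instantiate the class registered for config['type']."""
    type_name = config.get("type")
    if type_name not in SOURCE_TYPES:
        raise ValueError(f"unknown source type {type_name!r} for {source_id!r}")
    return SOURCE_TYPES[type_name](source_id, config)
```

With a registry like this, an unregistered `type: workday` fails early with a clear message rather than being silently skipped.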
Example sources in config.yaml:
sources:
  br_workday_all:
    type: workday
    host: blackrock.wd1.myworkdayjobs.com
    tenant: blackrock
    site: BlackRock
    searchText: ""  # or "software london intern"
    limit: 100
    tags: [jobs, blackrock]
  br_workday_intern_london_se:
    type: workday
    host: blackrock.wd1.myworkdayjobs.com
    tenant: blackrock
    site: BlackRock
    searchText: "intern london software"
    limit: 100
    tags: [jobs, blackrock, london, software]
groups:
  br_daily:
    include_tags: [blackrock]
    limit: 120
Run:
python -m scraper_reporter run --group br_daily --preview
python -m scraper_reporter run --group br_daily --format md --out out
When adding HTML sources, the upgraded html.py lets you apply a text filter and use absolute URLs.
Example (Bright Network):
sources:
  brightnetwork_interns:
    type: html
    url: https://www.brightnetwork.co.uk/internships/
    tags: [jobs, internships, brightnetwork]
    item:
      selector: "a[href*='intern']"
      text_contains: ["intern", "placement"]
    fields:
      title: { selector: "a[href*='intern']" }
      link: { selector: "a[href*='intern']", attr: "href" }
groups:
  internships_daily:
    include_tags: [internships]
    limit: 150
Run:
python -m scraper_reporter run --group internships_daily --preview
python -m scraper_reporter run --group internships_daily --format md --out out
Windows Task Scheduler (daily at 08:30):
- Program/script: python
- Add arguments: -m scraper_reporter run --group jobs_daily --format md --out out
- Start in: your project folder (where config.yaml lives)
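The "Start in" setting matters because the command is run relative to the folder holding config.yaml. If you prefer to point the scheduler at a single script, a wrapper along these lines sets the working directory itself. It is a sketch under assumptions: `build_command` and `run_daily` are hypothetical names, and the script would be a file you create yourself.

```python
import subprocess
import sys
from pathlib import Path


def build_command(group="jobs_daily", fmt="md", out_dir="out"):
    """Assemble the same command the Task Scheduler entry runs."""
    return [sys.executable, "-m", "scraper_reporter", "run",
            "--group", group, "--format", fmt, "--out", out_dir]


def run_daily(project_dir, group="jobs_daily"):
    """Run the report from project_dir and return the exit code."""
    project_dir = Path(project_dir)
    # Fail early if the working directory is not the project folder.
    if not (project_dir / "config.yaml").is_file():
        raise FileNotFoundError(f"config.yaml not found in {project_dir}")
    result = subprocess.run(build_command(group), cwd=project_dir, check=False)
    return result.returncode
```

Using `sys.executable` keeps the run on the interpreter that launched the wrapper, so starting it with the virtual environment's Python also uses the packages installed there.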