kmzer06/Scraper-Reporter

Scraper Reporter

Setup (one‑time per machine)

Windows (PowerShell, in a new terminal):

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
python -m pip install PyYAML

macOS/Linux (bash):

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
python -m pip install PyYAML

Inspect your sources & config

List sources (from config.yaml):

python -m scraper_reporter sources list

Validate config file:

python -m scraper_reporter config validate
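
The README does not say what config validate checks. As a rough sketch only, here are two checks a validator could make, assuming config.yaml uses the sources/groups layout from this README's examples and that a group selects any source sharing at least one of its include_tags. The function name find_config_problems is hypothetical and not part of scraper_reporter.

```python
def find_config_problems(config: dict) -> list:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    sources = config.get("sources") or {}
    groups = config.get("groups") or {}

    # Check 1: every source declares a type (e.g. workday, html).
    for source_id, source in sources.items():
        if "type" not in source:
            problems.append(f"source '{source_id}' has no type")

    # Check 2 (assumed semantics): a group should match at least one source by tag.
    for group_id, group in groups.items():
        wanted = set(group.get("include_tags", []))
        matched = [
            sid for sid, src in sources.items()
            if wanted & set(src.get("tags", []))
        ]
        if not matched:
            problems.append(f"group '{group_id}' matches no sources")
    return problems


# Already-parsed config, shaped like the examples in this README.
example = {
    "sources": {
        "br_workday_all": {"type": "workday", "tags": ["jobs", "blackrock"]},
    },
    "groups": {
        "br_daily": {"include_tags": ["blackrock"]},
        "internships_daily": {"include_tags": ["internships"]},
    },
}
print(find_config_problems(example))
```

Here internships_daily is reported because no source in the example carries the internships tag.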

Run scrapes

Preview a group (console only, no files):

python -m scraper_reporter run --group daily --preview
python -m scraper_reporter run --group br_daily --preview
python -m scraper_reporter run --group internships_daily --preview
python -m scraper_reporter run --group jobs_daily --preview

Run specific sources by ID (comma‑separated):

python -m scraper_reporter run --sources bbc_tech_rss,hn_frontpage --preview

Save a Markdown report:

python -m scraper_reporter run --group jobs_daily --format md --out out

Save HTML and PDF reports:

python -m scraper_reporter run --group jobs_daily --format html --out out
python -m scraper_reporter run --group jobs_daily --format pdf --out out

PDF output uses ReportLab. If ReportLab raises errors on Python 3.13, either skip the PDF format or switch to Python 3.11 or 3.12.

Create an email draft (.eml) you can open in Outlook/Mail:

python -m scraper_reporter email draft --group jobs_daily --out out
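
For readers curious what an .eml draft looks like, here is a minimal sketch using Python's standard email package. It is not the tool's actual implementation: build_draft and save_draft are hypothetical names, and the X-Unsent header is an assumption about how to make Outlook open the file as an editable draft rather than a received message.

```python
from email.message import EmailMessage
from pathlib import Path


def build_draft(subject: str, body: str) -> EmailMessage:
    """Build a plain-text message with no sender or recipients filled in."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # Assumption: Outlook treats "X-Unsent: 1" as "open in compose mode".
    msg["X-Unsent"] = "1"
    msg.set_content(body)
    return msg


def save_draft(msg: EmailMessage, path: Path) -> Path:
    """Write the message to disk as an .eml file, creating the folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(msg.as_bytes())
    return path


draft = build_draft("Daily jobs report", "Report text goes here.")
print(draft["Subject"])
```

Calling save_draft(draft, Path("out") / "example.eml") would write a file that a mail client can open.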

Output files go to out/ and include today's date, e.g.:

  • out/report_YYYY-MM-DD_jobs_daily.md
  • out/report_YYYY-MM-DD_jobs_daily.html
  • out/report_YYYY-MM-DD_jobs_daily.pdf
  • out/email_YYYY-MM-DD_jobs_daily.eml
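
The naming pattern above can be reproduced in a few lines, which is handy if another script needs to locate today's report. This is a sketch inferred from the listed filenames; output_path is a hypothetical helper, not a function from scraper_reporter.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def output_path(out_dir: str, kind: str, group: str, ext: str,
                day: Optional[date] = None) -> Path:
    """Build <out_dir>/<kind>_<YYYY-MM-DD>_<group>.<ext>, defaulting to today."""
    day = day or date.today()
    return Path(out_dir) / f"{kind}_{day.isoformat()}_{group}.{ext}"


print(output_path("out", "report", "jobs_daily", "md", date(2025, 1, 31)).as_posix())
# out/report_2025-01-31_jobs_daily.md
```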

BlackRock (Workday) sources

To enable JSON-based (JS‑free) scraping for BlackRock, ensure you have:

  • scraper_reporter/sources/workday.py (WorkdaySource)
  • cli.py updated to register type: workday

Example sources in config.yaml:

sources:
  br_workday_all:
    type: workday
    host: blackrock.wd1.myworkdayjobs.com
    tenant: blackrock
    site: BlackRock
    searchText: ""                  # or "software london intern"
    limit: 100
    tags: [jobs, blackrock]

  br_workday_intern_london_se:
    type: workday
    host: blackrock.wd1.myworkdayjobs.com
    tenant: blackrock
    site: BlackRock
    searchText: "intern london software"
    limit: 100
    tags: [jobs, blackrock, london, software]

groups:
  br_daily:
    include_tags: [blackrock]
    limit: 120

Run:

python -m scraper_reporter run --group br_daily --preview
python -m scraper_reporter run --group br_daily --format md --out out
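
To show how the host, tenant, site, searchText and limit fields could map onto a JSON request, here is a sketch that only builds the URL and request body. It assumes the undocumented /wday/cxs/<tenant>/<site>/jobs path commonly seen on Workday career sites, a POST with a JSON body, and a page size of 20. None of this is confirmed as what workday.py does, and build_workday_request is a hypothetical name.

```python
def build_workday_request(source: dict, offset: int = 0, page_size: int = 20):
    """Return (url, json_payload) for one page of job postings."""
    # Assumed endpoint shape; not taken from scraper_reporter's code.
    url = f"https://{source['host']}/wday/cxs/{source['tenant']}/{source['site']}/jobs"
    payload = {
        "appliedFacets": {},
        # Never request more per page than the source's overall limit.
        "limit": min(page_size, source.get("limit", page_size)),
        "offset": offset,
        "searchText": source.get("searchText", ""),
    }
    return url, payload


source = {
    "host": "blackrock.wd1.myworkdayjobs.com",
    "tenant": "blackrock",
    "site": "BlackRock",
    "searchText": "intern london software",
    "limit": 100,
}
url, payload = build_workday_request(source)
print(url)
print(payload)
```

Reaching the configured limit of 100 would then mean repeating the request with increasing offset values.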

Internship boards (HTML)

HTML sources can filter items by text (text_contains) and produce absolute URLs, provided you use the upgraded html.py.

Example (Bright Network):

sources:
  brightnetwork_interns:
    type: html
    url: https://www.brightnetwork.co.uk/internships/
    tags: [jobs, internships, brightnetwork]
    item:
      selector: "a[href*='intern']"
      text_contains: ["intern", "placement"]
      fields:
        title: { selector: "a[href*='intern']" }
        link:  { selector: "a[href*='intern']", attr: "href" }

groups:
  internships_daily:
    include_tags: [internships]
    limit: 150

Run:

python -m scraper_reporter run --group internships_daily --preview
python -m scraper_reporter run --group internships_daily --format md --out out
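
The text filter and absolute-URL handling described in this section can be pictured with the sketch below. It assumes text_contains is a case-insensitive "any of these substrings" match and that relative links are resolved against the source url. filter_links and the sample links are hypothetical and not taken from html.py.

```python
from urllib.parse import urljoin


def filter_links(base_url, links, text_contains):
    """Keep (title, href) pairs whose title contains any needle; absolutize hrefs."""
    needles = [n.lower() for n in text_contains]
    results = []
    for title, href in links:
        if any(n in title.lower() for n in needles):
            results.append({"title": title.strip(), "link": urljoin(base_url, href)})
    return results


# Made-up anchors standing in for what the item selector might match.
links = [
    ("Summer Internship 2025", "/internships/summer-2025/"),
    ("About us", "/about/"),
    ("Industrial Placement", "https://example.com/placement"),
]
found = filter_links(
    "https://www.brightnetwork.co.uk/internships/", links, ["intern", "placement"]
)
for item in found:
    print(item["title"], "->", item["link"])
```

urljoin leaves links that are already absolute unchanged, so mixed relative and absolute hrefs on the same page are handled the same way.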

Scheduling (optional)

Windows Task Scheduler (daily at 08:30):

  • Program/script: python
  • Add arguments: -m scraper_reporter run --group jobs_daily --format md --out out
  • Start in: your project folder (where config.yaml lives)

About

Scraper Reporter is a lightweight tool that automates web data collection and transforms it into structured, easy-to-read reports. It’s designed to fetch information from websites, clean and organize the results, and then generate outputs in formats like CSV, PDF, or Markdown.
