Agents Guide for Scraper

This repo builds static HTML reports for TV shows and movie rentals. The current system is small, fragile, and scraper-driven, so future agents should optimize for accurate diagnosis, minimal churn, and end-to-end verification.

Current Architecture

TV shows

  • eztv.rss.js: maintains eztv.rss.json from the EZTV RSS feed at https://myrss.org/eztv.
  • tvshows.puppeteer.js: reads eztv.rss.json, enriches each show from IMDb, caches details in shows.json, and writes tvshows.html.

Important invariant:

  • eztv.rss.json now contains one entry per show, not one entry per episode.
  • The stored episode should be the most recently seen episode for that show.

Movie rentals

  • rentals.puppeteer.js: scrapes Official Charts, enriches from IMDb, and writes movierentals.html.

Publishing / orchestration

  • telescuff: shell entrypoint used for cron-style runs and optional Neocities upload.

Current modes:

  • ./telescuff rss: refreshes only eztv.rss.json
  • ./telescuff tv: builds tvshows.html from the existing RSS cache
  • ./telescuff movies: builds movierentals.html
  • ./telescuff all: builds movies + tv, but does not run the high-frequency RSS refresh

Intended cron split:

  • run rss frequently
  • run tv less frequently

Files That Matter

  • eztv.rss.js: RSS ingestion and cache maintenance.
  • eztv.rss.json: local TV source-of-truth cache, one row per show.
  • tvshows.puppeteer.js: IMDb search + title scraping for TV.
  • shows.json: local IMDb enrichment cache for TV titles.
  • rentals.puppeteer.js: movie rental scraper.
  • body.ejs: shared HTML template.
  • telescuff: main wrapper script for cron/manual runs.

Legacy files exist but should generally not be used for new work:

  • tvshows.js
  • scrape.js
  • rentals.casper.js

How TV Scraping Works Now

RSS cache phase

eztv.rss.js:

  • fetches RSS entries
  • parses titles with episode-parser
  • keys by normalized show name
  • keeps the highest season/episode seen for that show
  • updates last_seen
  • evicts entries older than --max-days
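
The steps above can be sketched roughly as follows. This is an illustration, not the actual code in eztv.rss.js: the field names (show, season, episode, last_seen), the array-of-rows file shape, the normalization rule, and the choice to refresh last_seen whenever a show reappears in the feed are all assumptions.

```javascript
// Hypothetical sketch of the cache update step; names and shapes are assumptions.

// Collapse case and punctuation so "The.Show" and "the show" share one key.
function normalizeShowName(name) {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// True when episode `a` is later than episode `b`.
function isNewerEpisode(a, b) {
  if (a.season !== b.season) return a.season > b.season;
  return a.episode > b.episode;
}

// entries: current cache rows; parsed: [{ show, season, episode }] from the feed.
function updateCache(entries, parsed, nowMs, maxDays) {
  const byShow = new Map(entries.map((e) => [normalizeShowName(e.show), e]));
  for (const item of parsed) {
    const key = normalizeShowName(item.show);
    const existing = byShow.get(key);
    // Keep the highest season/episode seen, and refresh last_seen either way.
    const winner = !existing || isNewerEpisode(item, existing) ? item : existing;
    byShow.set(key, { ...winner, last_seen: nowMs });
  }
  // Evict rows that have not been seen within the retention window.
  const cutoff = nowMs - maxDays * 24 * 60 * 60 * 1000;
  return [...byShow.values()].filter((e) => e.last_seen >= cutoff);
}
```

Keying a Map by the normalized name is what enforces the one-entry-per-show invariant here: a second episode of the same show can only replace or refresh the existing row, never add a new one.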

If you change this script:

  • preserve the one-entry-per-show invariant
  • keep the JSON shape stable unless you also update tvshows.puppeteer.js

IMDb enrichment phase

tvshows.puppeteer.js:

  • loads eztv.rss.json
  • merges any known data from shows.json
  • searches IMDb for shows that have no url yet
  • opens IMDb title pages for shows that are missing details
  • writes shows.json
  • renders tvshows.html
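
The merge-then-fetch decision above might look something like this. It is a sketch under assumptions: the field names (show, url, rating, description, duration) and the convention that undefined means "never fetched" while null means "fetched, but IMDb had no value" are hypothetical, not confirmed behavior of tvshows.puppeteer.js.

```javascript
// Hypothetical sketch: decide per show which IMDb steps are still needed.
const DETAIL_FIELDS = ["rating", "description", "duration"]; // assumed field names

// rssEntries: rows from eztv.rss.json; cachedShows: rows from shows.json.
function planEnrichment(rssEntries, cachedShows) {
  const cacheByShow = new Map(cachedShows.map((s) => [s.show, s]));
  return rssEntries.map((entry) => {
    // Cached IMDb data first, then the RSS row so episode info stays fresh.
    const merged = { ...cacheByShow.get(entry.show), ...entry };
    return {
      show: merged,
      needsSearch: !merged.url,
      // Only undefined triggers a fetch; null is treated as a confirmed absence.
      needsDetails: DETAIL_FIELDS.some((f) => merged[f] === undefined),
    };
  });
}
```

Treating null as a real answer keeps genuinely unrated shows from being re-scraped on every run, at the cost of never retrying a value that IMDb adds later.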

IMDb-Specific Notes

IMDb is currently the hardest moving part in the repo.

Observed behavior:

  • search pages sometimes render normally
  • search pages sometimes show a broken shell with the browser title: Application error: a client-side exception has occurred
  • even in that broken state, script#__NEXT_DATA__ often still contains usable search/title data
  • the page console emits a lot of noisy client-side errors from IMDb itself

Current scraper strategy:

  • navigate with waitUntil: "domcontentloaded"
  • wait for either:
    • a visible page element, or
    • parseable __NEXT_DATA__
  • prefer structured JSON extraction over fragile DOM-only scraping
  • keep DOM/meta/ld+json fallbacks where they add value
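
A minimal sketch of the "either visible element or parseable JSON" wait, assuming a Puppeteer page object. The selector and timeout are placeholders rather than the values the scraper actually uses; the returned labels mirror the visible-selector / json-ready readiness names mentioned later in this guide.

```javascript
// Hypothetical sketch: resolve as soon as either readiness signal fires.
const VISIBLE_SELECTOR = "main"; // placeholder selector, an assumption

async function waitForReady(page, timeoutMs = 15000) {
  const visible = page
    .waitForSelector(VISIBLE_SELECTOR, { visible: true, timeout: timeoutMs })
    .then(() => "visible-selector");

  const json = page
    .waitForFunction(
      () => {
        // Runs in the browser: ready only when __NEXT_DATA__ exists and parses.
        const el = document.querySelector("script#__NEXT_DATA__");
        if (!el) return false;
        try {
          JSON.parse(el.textContent);
          return true;
        } catch {
          return false;
        }
      },
      { timeout: timeoutMs }
    )
    .then(() => "json-ready");

  // Promise.any ignores one side timing out as long as the other succeeds.
  return Promise.any([visible, json]);
}

// Usage (inside an async function, after launching Puppeteer):
//   await page.goto(url, { waitUntil: "domcontentloaded" });
//   const readiness = await waitForReady(page);
```

Promise.any is used instead of Promise.race so that a timeout on the visible-selector side (the broken shell case) does not fail the wait while the JSON side can still succeed.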

Do not assume:

  • visible DOM means the best data is there
  • __NEXT_DATA__ appears immediately
  • console errors imply scraper failure

Common Failure Modes

1. IMDb search is slow

First suspects:

  • overly conservative wait strategy
  • waiting for visible UI when parseable JSON is already available
  • IMDb serving the broken shell page

What to check:

  • Navigate timing ...
  • IMDB search timing ...
  • whether readiness completed via visible-selector or json-ready

2. Missing rating / description / duration

Not every null is a selector bug.

What we have already confirmed:

  • many rating: null cases are real unrated IMDb pages
  • some description and duration gaps can be recovered from:
    • application/ld+json
    • meta[name="description"]
    • meta[property="og:description"]

Before changing selectors:

  • inspect the live page
  • determine whether IMDb actually has the value
  • avoid using generic IMDb boilerplate descriptions as show descriptions
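
One way to express the fallback order and the boilerplate guard is a small selector over already-extracted candidates. The source labels and the boilerplate pattern below are assumptions for illustration; check the pattern against live pages before relying on it.

```javascript
// Hypothetical sketch: pick the first usable description from ordered candidates.

// Assumed example of generic IMDb site copy; extend after inspecting live pages.
const BOILERPLATE_PATTERNS = [/^imdb is the world's most popular/i];

function isUsableDescription(text) {
  if (typeof text !== "string") return false;
  const trimmed = text.trim();
  return trimmed.length > 0 && !BOILERPLATE_PATTERNS.some((re) => re.test(trimmed));
}

// candidates: [{ source, value }] in priority order, e.g. structured JSON first,
// then application/ld+json, then the meta tags.
function pickDescription(candidates) {
  const hit = candidates.find((c) => isUsableDescription(c.value));
  return hit
    ? { source: hit.source, value: hit.value.trim() }
    : { source: null, value: null };
}
```

Returning the source alongside the value makes it possible to log which data source won, which helps separate real source-data absence from selector breakage.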

3. Wrong IMDb match

This is a search quality issue, not a title-page selector issue.

Example class of problem:

  • a UK title resolving to a US series with the same/similar name

If this becomes frequent:

  • improve search scoring/matching logic
  • do not paper over it with title-page selector changes
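
If search matching does need work, a scoring pass over the search candidates is one possible shape. Everything here is hypothetical: the candidate fields (title, year, country), the idea that a country hint is available from the RSS title, and the weights are illustrative assumptions, not the repo's current logic.

```javascript
// Hypothetical sketch: score IMDb search candidates against the wanted show.

function normalizeTitle(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// wanted: { title, year?, country? }; candidate: { title, year?, country? }
function scoreCandidate(wanted, candidate) {
  const a = normalizeTitle(wanted.title);
  const b = normalizeTitle(candidate.title);
  let score = 0;
  if (a === b) score += 10; // exact title match
  else if (b.startsWith(a) || a.startsWith(b)) score += 4; // partial match
  if (wanted.year && candidate.year === wanted.year) score += 3;
  if (wanted.country && candidate.country === wanted.country) score += 5;
  return score;
}

// Returns the highest-scoring candidate; ties keep IMDb's own search order.
function bestMatch(wanted, candidates) {
  let best = null;
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const score = scoreCandidate(wanted, candidate);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

With a country hint, two same-named series no longer tie on title alone, which is the UK-versus-US case described above.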

4. eztv.rss.json grows duplicate shows

That is a bug in eztv.rss.js.

Expected state:

  • one JSON entry per normalized show name
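
A quick check for this invariant could look like the sketch below. It assumes eztv.rss.json is an array of rows with a show field and that normalization means lowercasing and collapsing punctuation; adjust both to match what eztv.rss.js actually does.

```javascript
// Hypothetical sketch: report show names that collapse to the same normalized key.
function findDuplicateShows(entries) {
  const seen = new Map(); // normalized key -> original names
  for (const entry of entries) {
    const key = entry.show.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
    seen.set(key, [...(seen.get(key) || []), entry.show]);
  }
  return [...seen.entries()]
    .filter(([, names]) => names.length > 1)
    .map(([key, names]) => ({ key, names }));
}

// Usage against the real file (assumed to be a JSON array):
//   const fs = require("node:fs");
//   const entries = JSON.parse(fs.readFileSync("eztv.rss.json", "utf8"));
//   console.log(findDuplicateShows(entries));
```

An empty result means the invariant holds; any row in the output points at a keying or normalization bug in the RSS phase.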

Practical Debugging Workflow

For TV issues:

  1. Verify eztv.rss.json shape first.
  2. Confirm whether the problem is:
    • RSS parsing
    • IMDb search match
    • IMDb title extraction
    • cache reuse
  3. Use focused live probes rather than whole-pipeline runs when possible.
  4. If changing IMDb waits, measure timing before and after.
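
For step 4, a small timing wrapper is usually enough to compare waits before and after a change. This is a sketch: the helper name, the label, and the log format are assumptions and do not reproduce the scraper's existing timing log lines.

```javascript
// Hypothetical sketch: time an async step and log the elapsed milliseconds.
const { performance } = require("node:perf_hooks");

async function timed(label, fn, log = console.log) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    // Logged even when fn throws, so slow failures are still measured.
    log(`${label} ${(performance.now() - start).toFixed(0)}ms`);
  }
}

// Usage (inside an async function, with a Puppeteer page):
//   await timed("navigate", () => page.goto(url, { waitUntil: "domcontentloaded" }));
```

Run the same probe several times before and after the change, since a single IMDb page load is too noisy to compare on its own.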

Useful commands:

node eztv.rss.js --max-days 7
node tvshows.puppeteer.js
./telescuff rss
./telescuff tv

Low-cost checks:

node --check eztv.rss.js
node --check tvshows.puppeteer.js
bash -n telescuff

Expectations For Future Agents

  • Prefer Puppeteer over CasperJS for new work.
  • Keep changes small and behaviorally justified.
  • Preserve the cron split between RSS refresh and TV HTML generation.
  • Treat IMDb as unstable and verify live behavior before “fixing” selectors.
  • Add debug logging only when it helps isolate timing, readiness, or data-source choice.
  • When debugging scraper correctness, distinguish:
    • real source-data absence
    • selector breakage
    • wrong-title matching

Quick Mental Model

  • eztv.rss.js decides which episode is the current representative for a show.
  • tvshows.puppeteer.js decides which IMDb title that show maps to and what metadata is usable.
  • shows.json is a performance cache, not the source of truth.
  • eztv.rss.json is the TV input source of truth.