Agents Guide for Scraper

This repo builds static HTML reports for TV shows and movie rentals. The current system is small, fragile, and scraper-driven, so future agents should optimize for accurate diagnosis, minimal churn, and end-to-end verification.

Current Architecture

TV shows

eztv.rss.js: Maintains eztv.rss.json from the EZTV RSS feed at https://myrss.org/eztv.
tvshows.puppeteer.js: Reads eztv.rss.json, enriches each show from IMDb, caches details in shows.json, and writes tvshows.html.

Important invariant:

eztv.rss.json now contains one entry per show, not one entry per episode.
The stored episode should be the most recently seen episode for that show.

Movie rentals

rentals.puppeteer.js: Scrapes Official Charts, enriches from IMDb, and writes movierentals.html.

Publishing / orchestration

telescuff: Shell entrypoint used for cron-style runs and optional Neocities upload.

Current modes:

./telescuff rss: Refresh only eztv.rss.json
./telescuff tv: Build tvshows.html from the existing RSS cache
./telescuff movies: Build movierentals.html
./telescuff all: Build movies + tv; does not run the high-frequency RSS refresh

Intended cron split:

run rss frequently
run tv less frequently

Files That Matter

eztv.rss.js: RSS ingestion and cache maintenance.
eztv.rss.json: Local TV source-of-truth cache, one row per show.
tvshows.puppeteer.js: IMDb search + title scraping for TV.
shows.json: Local IMDb enrichment cache for TV titles.
rentals.puppeteer.js: Movie rental scraper.
body.ejs: Shared HTML template.
telescuff: Main wrapper script for cron/manual runs.

Legacy files exist but should generally not be used for new work:

tvshows.js
scrape.js
rentals.casper.js

How TV Scraping Works Now

RSS cache phase

eztv.rss.js:

fetches RSS entries
parses titles with episode-parser
keys by normalized show name
keeps the highest season/episode seen for that show
updates last_seen
evicts entries older than --max-days
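
The steps above can be sketched as a pure merge function. This is a minimal illustration, not the code in eztv.rss.js: the field names (show, season, episode, last_seen), the object-keyed cache, and the choice to refresh last_seen even when an older episode is seen are all assumptions made for the example.

```javascript
// Hypothetical sketch of the one-entry-per-show merge.
// Field names and cache shape are assumptions, not the real eztv.rss.json schema.

// Collapse punctuation and case so "Some.Show" and "some show" share one key.
function normalizeShowName(name) {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// True when episode a is later than episode b.
function isNewer(a, b) {
  return a.season > b.season || (a.season === b.season && a.episode > b.episode);
}

// cache: object keyed by normalized show name.
// parsedEntries: already-parsed RSS items ({ show, season, episode }).
// now: current time in milliseconds; maxDays: eviction window in days.
function mergeEntries(cache, parsedEntries, now, maxDays) {
  const next = { ...cache };
  for (const entry of parsedEntries) {
    const key = normalizeShowName(entry.show);
    const existing = next[key];
    if (!existing || isNewer(entry, existing)) {
      next[key] = { ...entry, last_seen: now }; // newer episode wins
    } else {
      next[key] = { ...existing, last_seen: now }; // keep episode, refresh last_seen
    }
  }
  // Evict shows that have not been seen within the window.
  const cutoff = now - maxDays * 24 * 60 * 60 * 1000;
  for (const key of Object.keys(next)) {
    if (next[key].last_seen < cutoff) delete next[key];
  }
  return next;
}
```

Keeping the merge free of network and file access makes the invariant easy to check in isolation before running the real script.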

If you change this script:

preserve the one-entry-per-show invariant
keep the JSON shape stable unless you also update tvshows.puppeteer.js

IMDb enrichment phase

tvshows.puppeteer.js:

loads eztv.rss.json
merges any known data from shows.json
searches IMDb for entries that have no url yet
opens IMDb title pages for missing details
writes shows.json
renders tvshows.html

IMDb-Specific Notes

IMDb is currently the hardest moving part in the repo.

Observed behavior:

search pages sometimes render normally
search pages sometimes show a broken shell with the browser title "Application error: a client-side exception has occurred"
even in that broken state, script#__NEXT_DATA__ often still contains usable search/title data
the page console emits a lot of noisy client-side errors from IMDb itself

Current scraper strategy:

navigate with waitUntil: "domcontentloaded"
wait for either:
- a visible page element, or
- parseable __NEXT_DATA__
prefer structured JSON extraction over fragile DOM-only scraping
keep DOM/meta/ld+json fallbacks where they add value
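
As an illustration of the "parseable __NEXT_DATA__" check, here is a minimal sketch that works on a raw HTML string. The function name extractNextData is hypothetical, and the regex assumes the script tag uses a double-quoted id attribute; the real scraper may perform this check inside the page instead.

```javascript
// Hypothetical helper: returns the parsed __NEXT_DATA__ payload, or null when the
// script tag is missing or its contents are not yet valid JSON.
function extractNextData(html) {
  const match = html.match(
    /<script[^>]*\bid="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) return null; // tag not present (yet)
  try {
    return JSON.parse(match[1]);
  } catch {
    return null; // tag present but not parseable
  }
}
```

In a Puppeteer run, the same function could be applied to the string returned by page.content(), which makes it usable even when the visible page is the broken shell.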

Do not assume:

visible DOM means the best data is there
__NEXT_DATA__ appears immediately
console errors imply scraper failure

Common Failure Modes

1. IMDb search is slow

First suspects:

overly conservative wait strategy
waiting for visible UI when parseable JSON is already available
IMDb serving the broken shell page

What to check:

Navigate timing ...
IMDB search timing ...
whether readiness completed via visible-selector or json-ready

2. Missing rating / description / duration

Not every null is a selector bug.

What we have already confirmed:

many rating: null cases are real unrated IMDb pages
some description and duration gaps can be recovered from:
- application/ld+json
- meta[name="description"]
- meta[property="og:description"]

Before changing selectors:

inspect the live page
determine whether IMDb actually has the value
avoid using generic IMDb boilerplate descriptions as show descriptions
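
A fallback chain over those sources might look like the sketch below. It assumes the three candidate strings have already been pulled from the page, and the names pickDescription, ldJson, metaDescription, ogDescription, and isBoilerplate are hypothetical. The boilerplate test is passed in by the caller because the exact wording of IMDb's generic descriptions is not recorded here.

```javascript
// Hypothetical sketch: pick the first usable description, in the order
// ld+json, meta description, og:description. Returns null when none qualifies.
function pickDescription(candidates, isBoilerplate) {
  const order = ["ldJson", "metaDescription", "ogDescription"];
  for (const source of order) {
    const value = candidates[source];
    const usable =
      typeof value === "string" && value.trim() !== "" && !isBoilerplate(value);
    if (usable) {
      // Report the source too, so debug logs can show which fallback fired.
      return { source, value: value.trim() };
    }
  }
  return null; // treat as real source-data absence, not a selector bug
}

// Usage with a placeholder boilerplate rule (the prefix is made up for the example).
const picked = pickDescription(
  { ldJson: "", metaDescription: "PLACEHOLDER generic text", ogDescription: " A drama. " },
  (text) => text.startsWith("PLACEHOLDER")
);
console.log(picked);
```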

3. Wrong IMDb match

This is a search quality issue, not a title-page selector issue.

Example class of problem:

a UK title resolving to a US series with the same/similar name

If this becomes frequent:

improve search scoring/matching logic
do not paper over it with title-page selector changes
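
One possible shape for such scoring logic is sketched below. Everything in it is an assumption for illustration: the candidate fields (title, year, isSeries), the weights, and the idea that a year hint is available from the RSS side. It is not the matching logic currently in tvshows.puppeteer.js.

```javascript
// Hypothetical scoring sketch for choosing among IMDb search candidates.
function scoreCandidate(query, candidate) {
  const norm = (s) => s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
  let score = 0;
  if (norm(candidate.title) === norm(query.title)) score += 2; // exact title match
  if (candidate.isSeries) score += 1; // prefer TV series over films
  if (query.year && candidate.year === query.year) score += 1; // separates same-name titles
  return score;
}

// Returns the highest-scoring candidate; ties keep the earlier search result.
function bestMatch(query, candidates) {
  let best = null;
  let bestScore = -1;
  for (const candidate of candidates) {
    const score = scoreCandidate(query, candidate);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

With two same-named series, a year hint is enough to separate them in this sketch; without one, the earlier search result wins, which mirrors the failure described above.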

4. `eztv.rss.json` grows duplicate shows

That is a bug in eztv.rss.js.

Expected state:

one JSON entry per normalized show name
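
A quick duplicate check could be sketched as follows. It assumes the cache can be viewed as an array of entries with a show field; if the file is an object keyed by name, Object.values() of the parsed JSON would be passed instead. Both the field name and the normalization rule are hypothetical.

```javascript
// Hypothetical check: list normalized show names that appear more than once.
function normalizeKey(name) {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

function findDuplicateShows(entries) {
  const counts = new Map();
  for (const entry of entries) {
    const key = normalizeKey(entry.show);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Keep only the keys seen more than once.
  return [...counts].filter(([, n]) => n > 1).map(([key]) => key);
}
```

An empty result means the one-entry-per-show invariant holds for the given entries.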

Practical Debugging Workflow

For TV issues:

Verify eztv.rss.json shape first.
Confirm whether the problem is:
- RSS parsing
- IMDb search match
- IMDb title extraction
- cache reuse
Use focused live probes rather than whole-pipeline runs when possible.
If changing IMDb waits, measure timing before and after.

Useful commands:

node eztv.rss.js --max-days 7
node tvshows.puppeteer.js
./telescuff rss
./telescuff tv

Low-cost checks:

node --check eztv.rss.js
node --check tvshows.puppeteer.js
bash -n telescuff

Expectations For Future Agents

Prefer Puppeteer over CasperJS for new work.
Keep changes small and behaviorally justified.
Preserve the cron split between RSS refresh and TV HTML generation.
Treat IMDb as unstable and verify live behavior before “fixing” selectors.
Add debug logging only when it helps isolate timing, readiness, or data-source choice.
When debugging scraper correctness, distinguish:
- real source-data absence
- selector breakage
- wrong-title matching

Quick Mental Model

eztv.rss.js decides which episode is the current representative for a show.
tvshows.puppeteer.js decides which IMDb title that show maps to and what metadata is usable.
shows.json is a performance cache, not the source of truth.
eztv.rss.json is the TV input source of truth.
