feat(snapshots): reader-view + Markdown snapshot generator #11
Merged
For every built page, emit a clean READER extraction (the @mozilla/readability engine that powers Firefox/Safari Reader, via linkedom: headless, no browser) as both <page>.reader.html and an analysis-friendly <page>.reader.md (YAML front-matter + Markdown via turndown).

The Markdown is the durable, diffable twin of the page: machine-readable and far easier to run NLP/LLM analysis over than scraping live HTML, and it doubles as the AI-readable Markdown sibling (semantic.ai-readability). A non-empty extraction is also the proof of the "reader survivability" the structure-audit grades (readerOk).

Pure (readability + linkedom + turndown, all npm, no browser), so the e2e is deterministic and runs in CI. Config-driven ($SNAPSHOT_DIST/$SNAPSHOT_PAGES/$SNAPSHOT_BASE_URL/$SNAPSHOT_SUFFIX); ck-gen-snapshots bin; fixture + test. node test/run.mjs → 17/0.

The PRINTED/PDF view (needs a print-CSS renderer; tezcatl --pdf locally) is the separate half of prx-rb3s, deferred since it can't be browser-free in Linux CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
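For reviewers who have not opened the generator, here is a minimal sketch of how the four environment variables named in the commit message might be resolved. Only the variable names come from the source. The function names (`resolveSnapshotConfig`, `outputPaths`), every default value, and the idea that `$SNAPSHOT_PAGES` is a comma-separated list are assumptions made for illustration, not a description of what gen-snapshots.mjs actually does.

```javascript
// Hypothetical sketch: resolve snapshot settings from the environment.
// Variable names are from the commit message; all defaults and the
// comma-separated format of SNAPSHOT_PAGES are ASSUMPTIONS.
function resolveSnapshotConfig(env = process.env) {
  const pages = (env.SNAPSHOT_PAGES ?? 'index.html')
    .split(',')
    .map((p) => p.trim())
    .filter(Boolean);

  // Normalise the base URL so it always ends with exactly one slash.
  const baseUrl =
    (env.SNAPSHOT_BASE_URL ?? 'https://example.com/').replace(/\/+$/, '') + '/';

  return {
    dist: env.SNAPSHOT_DIST ?? 'dist',
    pages,
    baseUrl,
    suffix: env.SNAPSHOT_SUFFIX ?? '.reader',
  };
}

// Map one built page to the two files the generator is described as writing:
// <page><suffix>.html and <page><suffix>.md.
function outputPaths(page, { dist, suffix }) {
  const stem = page.replace(/\.html?$/i, '');
  return {
    html: `${dist}/${stem}${suffix}.html`,
    md: `${dist}/${stem}${suffix}.md`,
  };
}
```

Passing the environment in as an argument, rather than reading `process.env` inside the function body, keeps the resolution step pure, which matches the deterministic, CI-friendly goal stated above.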
The reader/markdown half of prx-rb3s — and the part you flagged: a clean Markdown twin of each page is far easier to run analysis over than scraping live HTML.
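To make the "easier to run analysis over" point concrete, here is a minimal sketch of reading a `.reader.md` snapshot back in. It assumes the file starts with a front-matter block fenced by `---` lines and holding flat `key: value` pairs. The helper name `splitSnapshot` and the sample content are hypothetical and not part of this PR, and anything richer than flat scalars would need a real YAML parser.

```javascript
// Hypothetical helper (not part of this PR): split a .reader.md snapshot into
// its front-matter fields and its Markdown body.
// ASSUMES a leading block fenced by `---` lines with flat `key: value` pairs.
function splitSnapshot(text) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(text);
  if (!match) return { meta: {}, body: text };

  const meta = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(':'); // split on the first colon only (URLs contain more)
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim();
    let value = line.slice(idx + 1).trim();
    // Strip one pair of surrounding quotes, if present.
    if (/^(".*"|'.*')$/.test(value)) value = value.slice(1, -1);
    meta[key] = value;
  }
  return { meta, body: match[2].trim() };
}

// Illustrative snapshot content, invented for this example.
const sample = [
  '---',
  'title: "Hello"',
  'source: https://example.com/hello.html',
  '---',
  '',
  '# Hello',
  '',
  'A short paragraph.',
].join('\n');

const { meta, body } = splitSnapshot(sample);
const wordCount = body.split(/\s+/).filter(Boolean).length;
console.log(meta.title, meta.source, wordCount);
// prints: Hello https://example.com/hello.html 5
```

No DOM parsing or chrome stripping is needed at analysis time, because that work was done once when the snapshot was generated.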
What
generators/gen-snapshots.mjs (+ ck-gen-snapshots bin): for every built page, run @mozilla/readability (the Firefox/Safari Reader engine, via linkedom: headless, no browser) and write:

- <page>.reader.html: the clean reader extraction (nav/footer chrome stripped).
- <page>.reader.md: YAML front-matter (title, byline, excerpt, source) + Markdown (via turndown).

The Markdown is the durable, diffable, analysis-friendly twin of the page. It also doubles as the AI-readable Markdown sibling (semantic.ai-readability), and a non-empty extraction is the proof of the "reader survivability" the structure-audit already grades (readerOk).

Why it's clean
readability + linkedom + turndown are all pure npm, with no browser, so the e2e is deterministic and runs in CI. Pure extractReader/toMarkdown core, config-driven CLI, fixture + test.

Verification
node test/run.mjs → 17 passed, 0 failed. Tests reader extraction (chrome stripped, headings/lists/emphasis preserved, front-matter records the source URL) + graceful null on a contentless page.

Deferred (the other half of prx-rb3s)
The printed/PDF view needs a print-CSS renderer (tezcatl --pdf locally). It can't be browser-free in Linux CI, so it's a separate generator (local/deploy artifact).

🤖 Generated with Claude Code