feat(snapshots): reader-view + Markdown snapshot generator #11

Merged
bdelanghe merged 1 commit into main from conformance/reader-snapshots
Jun 29, 2026

@bdelanghe

Contributor

The reader/markdown half of prx-rb3s — and the part you flagged: a clean Markdown twin of each page is far easier to run analysis over than scraping live HTML.

What

generators/gen-snapshots.mjs (+ ck-gen-snapshots bin): for every built page, run @mozilla/readability (the engine behind Firefox Reader View) on a linkedom DOM (headless, no browser) and write:

  • <page>.reader.html — the clean reader extraction (nav/footer chrome stripped).
  • <page>.reader.md — YAML front-matter (title, byline, excerpt, source) + Markdown (via turndown).

The Markdown is the durable, diffable, analysis-friendly twin of the page. It also doubles as the AI-readable Markdown sibling (semantic.ai-readability), and a non-empty extraction is the proof of the "reader survivability" the structure-audit already grades (readerOk).
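
As a rough illustration of the `.reader.md` layout described above, the sketch below assembles YAML front-matter and a Markdown body into one document. The helper name `renderReaderMarkdown` and the choice to quote values with `JSON.stringify` are assumptions for this example, not taken from the PR; only the four field names (title, byline, excerpt, source) come from the description.

```javascript
// Hypothetical helper, not the PR's actual code. The field list mirrors the
// front-matter keys named in the description: title, byline, excerpt, source.
function renderReaderMarkdown(meta, bodyMarkdown) {
  const fields = ["title", "byline", "excerpt", "source"];
  const lines = fields
    // Skip fields the extraction did not produce (e.g. a page with no byline).
    .filter((key) => meta[key] != null && meta[key] !== "")
    // A JSON string literal is also a valid YAML 1.2 double-quoted scalar,
    // so JSON.stringify safely escapes colons, quotes, and newlines.
    .map((key) => `${key}: ${JSON.stringify(String(meta[key]))}`);
  return `---\n${lines.join("\n")}\n---\n\n${bodyMarkdown.trim()}\n`;
}

const doc = renderReaderMarkdown(
  {
    title: "Hello: world",
    byline: null,
    excerpt: "A short page.",
    source: "https://example.com/hello/",
  },
  "# Hello\n\nBody text.\n"
);
console.log(doc);
```

Quoting every value keeps the front-matter parseable even when a title contains a colon, at the cost of slightly noisier diffs than bare scalars.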

Why it's clean

readability + linkedom + turndown are all pure npm — no browser — so the e2e is deterministic and runs in CI. Pure extractReader/toMarkdown core, config-driven CLI, fixture + test.

Verification

node test/run.mjs → 17 passed, 0 failed. The tests cover reader extraction (chrome stripped, headings/lists/emphasis preserved, front-matter records the source URL) and a graceful null on a contentless page.

Deferred (the other half of prx-rb3s)

The printed/PDF view needs a print-CSS renderer (tezcatl --pdf locally) — it can't be browser-free in Linux CI, so it's a separate generator (local/deploy artifact).

🤖 Generated with Claude Code

For every built page, emit a clean READER extraction (the @mozilla/readability
engine behind Firefox Reader View, run on a linkedom DOM: headless, no browser) as
both <page>.reader.html and an analysis-friendly <page>.reader.md (YAML
front-matter + Markdown via turndown).

The Markdown is the durable, diffable twin of the page: machine-readable and far
easier to run NLP/LLM analysis over than scraping live HTML, and it doubles as the
AI-readable Markdown sibling (semantic.ai-readability). A non-empty extraction is
also the proof of the "reader survivability" the structure-audit grades (readerOk).

Pure (readability + linkedom + turndown — all npm, no browser), so the e2e is
deterministic and runs in CI. Config-driven ($SNAPSHOT_DIST/$SNAPSHOT_PAGES/
$SNAPSHOT_BASE_URL/$SNAPSHOT_SUFFIX); ck-gen-snapshots bin; fixture + test.
node test/run.mjs → 17/0.
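
One way the environment-driven configuration might be read is sketched below. Only the four variable names come from the commit message; the defaults, the comma-separated format for `SNAPSHOT_PAGES`, and the helper names `readSnapshotConfig` and `snapshotTargets` are assumptions made for illustration.

```javascript
// Hypothetical sketch: defaults and value formats are assumed, not confirmed.
function readSnapshotConfig(env) {
  // Assumption: SNAPSHOT_PAGES is a comma-separated list of built pages.
  const pages = (env.SNAPSHOT_PAGES ?? "")
    .split(",")
    .map((page) => page.trim())
    .filter(Boolean);
  return {
    dist: env.SNAPSHOT_DIST ?? "dist",
    pages,
    // Drop trailing slashes so URLs can be joined with a single "/".
    baseUrl: (env.SNAPSHOT_BASE_URL ?? "").replace(/\/+$/, ""),
    suffix: env.SNAPSHOT_SUFFIX ?? ".reader",
  };
}

// Derive the source URL and the two output paths for one page.
function snapshotTargets(config, page) {
  const stem = page.replace(/\.html$/, "");
  return {
    source: `${config.baseUrl}/${page}`,
    html: `${config.dist}/${stem}${config.suffix}.html`,
    md: `${config.dist}/${stem}${config.suffix}.md`,
  };
}

const config = readSnapshotConfig({
  SNAPSHOT_DIST: "out",
  SNAPSHOT_PAGES: "index.html, docs/intro.html",
  SNAPSHOT_BASE_URL: "https://example.com/",
});
const targets = snapshotTargets(config, config.pages[1]);
console.log(targets);
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the config logic pure and easy to exercise from a fixture-based test.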

The PRINTED/PDF view (needs a print-CSS renderer — tezcatl --pdf locally) is the
separate half of prx-rb3s, deferred since it can't be browser-free in Linux CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bdelanghe bdelanghe merged commit c9bff77 into main Jun 29, 2026
1 check passed
@bdelanghe bdelanghe deleted the conformance/reader-snapshots branch June 29, 2026 01:15