feat(snapshots): reader-view + Markdown snapshot generator by bdelanghe · Pull Request #11 · bounded-systems/conformance-kit

bdelanghe · 2026-06-29T01:14:25Z

The reader/markdown half of prx-rb3s — and the part you flagged: a clean Markdown twin of each page is far easier to run analysis over than scraping live HTML.

What

generators/gen-snapshots.mjs (+ ck-gen-snapshots bin): for every built page, run @mozilla/readability (the engine behind Firefox Reader View) on a linkedom DOM, headless with no browser, and write:

<page>.reader.html — the clean reader extraction (nav/footer chrome stripped).
<page>.reader.md — YAML front-matter (title, byline, excerpt, source) + Markdown (via turndown).

The Markdown is the durable, diffable, analysis-friendly twin of the page. It also doubles as the AI-readable Markdown sibling (semantic.ai-readability), and a non-empty extraction is the proof of the "reader survivability" the structure-audit already grades (readerOk).
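
As a rough illustration of the two outputs, the sketch below derives the snapshot paths and builds the front-matter block. The helper names (snapshotPaths, frontMatter, renderReaderMarkdown) are hypothetical and not taken from gen-snapshots.mjs, and it assumes that <page> means the page path without its .html extension.

```javascript
// Hypothetical helper: derive the two snapshot paths for a built page.
// Assumption: <page> is the page path minus its ".html" extension.
function snapshotPaths(pagePath) {
  const stem = pagePath.replace(/\.html?$/i, "");
  return { html: `${stem}.reader.html`, md: `${stem}.reader.md` };
}

// Hypothetical helper: serialize the four front-matter fields the PR names.
// JSON.stringify produces double-quoted scalars, which YAML also accepts,
// so colons or quotes inside a title do not break the block.
function frontMatter({ title, byline, excerpt, source }) {
  const fields = { title, byline, excerpt, source };
  const lines = Object.entries(fields).map(
    ([key, value]) => `${key}: ${JSON.stringify(value ?? null)}`
  );
  return ["---", ...lines, "---", ""].join("\n");
}

// Front-matter, a blank line, then the Markdown body with one trailing newline.
function renderReaderMarkdown(meta, markdownBody) {
  return frontMatter(meta) + "\n" + markdownBody.trim() + "\n";
}
```

Quoting every value is a conservative choice: it keeps the file diffable and avoids YAML surprises with titles such as "Intro: part 1", at the cost of slightly noisier front-matter.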

Why it's clean

readability + linkedom + turndown are all pure npm — no browser — so the e2e is deterministic and runs in CI. Pure extractReader/toMarkdown core, config-driven CLI, fixture + test.
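
The commit message names four environment variables ($SNAPSHOT_DIST, $SNAPSHOT_PAGES, $SNAPSHOT_BASE_URL, $SNAPSHOT_SUFFIX) but not their formats or defaults. The sketch below shows one way such a config could be resolved. Every default, the comma-separated page list, and the function names are assumptions for illustration only.

```javascript
// Hypothetical config resolver. The variable names come from the commit
// message; the defaults and the comma-separated SNAPSHOT_PAGES format
// are assumptions, not the generator's documented behavior.
function resolveSnapshotConfig(env = process.env) {
  return {
    dist: env.SNAPSHOT_DIST || "dist",
    pages: (env.SNAPSHOT_PAGES || "index.html")
      .split(",")
      .map((page) => page.trim())
      .filter(Boolean),
    // Normalize to exactly one trailing slash so relative pages resolve cleanly.
    baseUrl: (env.SNAPSHOT_BASE_URL || "http://localhost/").replace(/\/+$/, "") + "/",
    suffix: env.SNAPSHOT_SUFFIX || ".reader",
  };
}

// Hypothetical: the source URL recorded in front-matter, built from the
// base URL and the page's path relative to the dist directory.
function sourceUrl(config, page) {
  return new URL(page, config.baseUrl).href;
}
```

Passing env as a parameter keeps the resolver pure, so a test can supply a plain object instead of mutating process.env.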

Verification

node test/run.mjs → 17 passed, 0 failed. Tests reader extraction (chrome stripped, headings/lists/emphasis preserved, front-matter records the source URL) + graceful null on a contentless page.
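
The "graceful null" behavior could look roughly like the sketch below. The extractor is injected as a stand-in for extractReader, whose real signature is not shown in this PR. Both the (html) => article | null shape and the planSnapshot name are assumptions.

```javascript
// Hypothetical: `extract` stands in for the PR's extractReader. Its real
// signature is not shown here, so (html) => article | null is assumed.
function planSnapshot(page, html, extract) {
  const article = extract(html);
  if (!article || !article.content || article.content.trim() === "") {
    // Contentless page: plan no output files and report the miss
    // instead of throwing, so one empty page does not abort the run.
    return { page, readerOk: false, outputs: [] };
  }
  return {
    page,
    readerOk: true,
    outputs: [`${page}.reader.html`, `${page}.reader.md`],
  };
}
```

Returning a result object rather than throwing lets the caller count empty extractions and still finish the remaining pages.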

Deferred (the other half of prx-rb3s)

The printed/PDF view needs a print-CSS renderer (tezcatl --pdf locally) — it can't be browser-free in Linux CI, so it's a separate generator (local/deploy artifact).

🤖 Generated with Claude Code

For every built page, emit a clean READER extraction (the @mozilla/readability engine behind Firefox Reader View, run on linkedom: headless, no browser) as both <page>.reader.html and an analysis-friendly <page>.reader.md (YAML front-matter + Markdown via turndown).

The Markdown is the durable, diffable twin of the page: machine-readable and far easier to run NLP/LLM analysis over than scraping live HTML, and it doubles as the AI-readable Markdown sibling (semantic.ai-readability). A non-empty extraction is also the proof of the "reader survivability" the structure-audit grades (readerOk).

Pure (readability + linkedom + turndown, all npm, no browser), so the e2e is deterministic and runs in CI. Config-driven ($SNAPSHOT_DIST/$SNAPSHOT_PAGES/$SNAPSHOT_BASE_URL/$SNAPSHOT_SUFFIX); ck-gen-snapshots bin; fixture + test. node test/run.mjs → 17/0.

The PRINTED/PDF view (needs a print-CSS renderer, tezcatl --pdf locally) is the separate half of prx-rb3s, deferred since it can't be browser-free in Linux CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bdelanghe merged commit c9bff77 into main Jun 29, 2026
1 check passed

bdelanghe deleted the conformance/reader-snapshots branch June 29, 2026 01:15

bdelanghe merged 1 commit into main from conformance/reader-snapshots