feat: lightpanda headless browser for web ingest #2
Merged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a best-effort lightpanda install step to the setup wizard, between the reveal step and skills summary. Uses DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD=1 as a CI escape hatch to keep tests hermetic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
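The escape hatch and the best-effort behaviour could look roughly like the sketch below. Only the DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD variable and the Windows skip come from this PR; the function names, the `downloadImpl` parameter, and the `reason` strings are hypothetical and chosen for illustration.

```javascript
// Hypothetical helper: decides whether the setup step should attempt a download.
// Assumption: the env var is compared against the literal string "1".
function shouldDownloadLightpanda(env = process.env, platform = process.platform) {
  if (env.DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD === '1') {
    return { download: false, reason: 'skipped-by-env' };
  }
  if (platform === 'win32') {
    return { download: false, reason: 'unsupported-platform' };
  }
  return { download: true, reason: null };
}

// Best-effort wrapper: a failed download is reported, never thrown,
// so the rest of the wizard keeps running.
async function installLightpandaStep({ downloadImpl, env = process.env, platform = process.platform }) {
  const decision = shouldDownloadLightpanda(env, platform);
  if (!decision.download) return { installed: false, reason: decision.reason };
  try {
    await downloadImpl();
    return { installed: true, reason: null };
  } catch {
    return { installed: false, reason: 'download-failed' };
  }
}
```

Passing `env` and `downloadImpl` as parameters keeps the step testable without touching the network or the real environment.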
… tests Existing ingestUrl tests passed fetchImpl but not resolveLightpandaImpl. After Task 5 added lightpanda dispatch, the real resolver hit ~/.dotaios/bin/lightpanda on dev machines that ran Task 6's setup once, spawning lightpanda against fake URLs and hanging the suite. Inject async () => null to keep tests hermetic.
Lightpanda returns raw binary bytes for PDF URLs. With exit status 0 and non-empty stdout, fetchHtml returned via="lightpanda", so the PDF content-type branch (only entered for via="plain") was skipped and the PDF bytes were handed to Readability, producing READABILITY_NULL. Guard the lightpanda spawn (and the hint-flag write) behind a looksLikePdfUrl check so .pdf URLs always fall through to plain fetch and enter the PDF ingestPdfResponse path. Regression test added to tests/cli/ingest_routing.test.mjs: it resolves lightpanda to a fake path, asserts spawn is never called, and verifies result.kind === "pdf" and result.parser === "unpdf". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
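A minimal sketch of the guard follows. The name looksLikePdfUrl is from the commit message, but its body here is an assumption (a case-insensitive `.pdf` suffix test on the URL pathname), and `pickFetcher` is a hypothetical helper added only to show where the check sits.

```javascript
// Assumption: a URL "looks like" a PDF when its pathname ends in .pdf,
// ignoring case, query string, and fragment.
function looksLikePdfUrl(rawUrl) {
  try {
    return new URL(rawUrl).pathname.toLowerCase().endsWith('.pdf');
  } catch {
    return false; // not a parseable absolute URL
  }
}

// Hypothetical helper: PDF-looking URLs always take the plain path,
// even when a lightpanda binary has been resolved.
function pickFetcher(url, lightpandaPath) {
  if (looksLikePdfUrl(url)) return 'plain';
  return lightpandaPath ? 'lightpanda' : 'plain';
}
```

A suffix test only catches URLs that advertise themselves as PDFs; a PDF served from an extensionless URL would still reach lightpanda under this sketch.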
- downloadLightpanda: idempotency check (skip re-download if executable), streaming write via stream/promises (avoid buffering 66MB), freeze PLATFORM_BINARIES lookup table
- resolveLightpanda: fs.access with X_OK constant catches non-exec files, memoize defaultWhich result to avoid spawn per ingestUrl call
- paths.mjs: extract dotaiosDir() + lightpandaHintFlagPath() helpers
- web.mjs: extract PARSER_LIGHTPANDA/PARSER_PLAIN constants, derive hintFlagPath default from new helper, collapse PDF-URL short-circuit into single conditional, use pathExists from files.mjs
- setup.mjs: pass platformBinary through to avoid double lookup, neutralize test-env skip message, drop task-plan numbering comment
- v1_4_0.test.mjs: extract NO_LIGHTPANDA constant (was repeated 9x)
- render.test.mjs, lightpanda.test.mjs: hoist mid-file imports to top
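The executable check and the memoized lookup from the resolveLightpanda item might be sketched as below. `isExecutable`, `memoizeAsync`, `makeResolver`, and the `whichImpl` parameter are hypothetical names, and the lookup order (installed path first, then PATH) is an assumption rather than something the commit states.

```javascript
import { constants } from 'node:fs';
import { access } from 'node:fs/promises';

// True only if the path exists AND the current user may execute it.
// A plain existence check would accept a non-executable file.
async function isExecutable(filePath) {
  try {
    await access(filePath, constants.X_OK);
    return true;
  } catch {
    return false;
  }
}

// Caches the first promise (including a rejection), so the wrapped
// lookup runs at most once per process.
function memoizeAsync(fn) {
  let cached;
  return () => (cached ??= fn());
}

// Hypothetical resolver: prefer the installed binary, otherwise fall back
// to a PATH lookup that is performed only once.
function makeResolver({ binPath, whichImpl }) {
  const whichOnce = memoizeAsync(whichImpl);
  return async () => ((await isExecutable(binPath)) ? binPath : whichOnce());
}
```

Memoizing only the PATH lookup keeps the cheap `access` call live, so a binary installed mid-session is still picked up.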
Critical bug found during real smoke test on macOS 25.3.0 arm64:
1. Release URL 'releases/latest/download' on this repo redirects to the 'nightly' tag, which historically ships broken builds (silent startup hangs, zero output). Pin to stable tag 0.3.0.
2. spawn args were 'lightpanda fetch --dump <url>', which is wrong. Actual CLI syntax is 'lightpanda fetch --dump html <url>'. The --dump flag requires an explicit format argument (html|markdown).
Also UX polish (drop 'Lightpanda' jargon from user-facing strings, say 'web browsing engine' instead).
End-to-end smoke verified:
- dotaios setup → installs 0.3.0 binary at ~/.dotaios/bin/lightpanda
- dotaios ingest https://example.com → 0.4s, parser=lightpanda+...
- dotaios ingest https://JS-rendered-demo → 0.56s, real content extracted
272 tests / 271 pass / 0 fail / 1 skipped.
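The corrected invocation could be wrapped as in the sketch below. The argument order comes from the fix above; `lightpandaArgs`, `runLightpanda`, the 15-second timeout, and the 32 MB buffer limit are illustrative assumptions, not values taken from the PR.

```javascript
import { execFile } from 'node:child_process';

// --dump needs an explicit format argument before the URL.
function lightpandaArgs(url, format = 'html') {
  return ['fetch', '--dump', format, url];
}

// Resolves with stdout on success, or null on spawn error, non-zero exit,
// timeout, or empty output, so the caller can fall back to plain fetch.
// Assumption: timeoutMs and maxBuffer defaults are placeholders.
function runLightpanda(binPath, url, { timeoutMs = 15000 } = {}) {
  return new Promise((resolve) => {
    execFile(
      binPath,
      lightpandaArgs(url),
      { timeout: timeoutMs, maxBuffer: 32 * 1024 * 1024 },
      (err, stdout) => {
        if (err || !stdout || stdout.trim() === '') return resolve(null);
        resolve(stdout);
      },
    );
  });
}
```

Returning null instead of rejecting means a hung or broken binary degrades to the plain-fetch path rather than failing the ingest.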
Summary
- Adds lightpanda to dotaios ingest <url> so JavaScript-heavy pages render correctly.
- dotaios setup downloads the binary silently to ~/.dotaios/bin/lightpanda on Mac/Linux (Windows skipped: no upstream binary).
- ingestUrl() dispatches to lightpanda when available; falls back to plain fetch on any failure (spawn error, non-zero exit, empty stdout, timeout). PDF URLs bypass lightpanda to preserve existing Path B routing.
- AGENTS.md.hbs gains a rule instructing all agents to route URLs through dotaios ingest instead of their own web tools.
- The parser field reflects which fetcher ran: lightpanda+readability+turndown vs readability+turndown.
Test Plan
- node --test tests/**/*.test.mjs: 271 pass / 0 fail / 1 skipped (was 249 baseline; +22 new tests across paths, lightpanda, ingest_routing, setup, render)
- Setup test skips the real download (DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD=1 test escape hatch)
- Lightpanda path reports parser lightpanda+readability+turndown
- Plain-fetch fallback reports parser readability+turndown
- ~/.dotaios/.lightpanda_hint_shown flag
- PDF URLs (*.pdf) skip lightpanda → still route through Path B (unpdf)
- npx dotaios setup on a fresh Mac, confirm ~/.dotaios/bin/lightpanda exists and is executable
- dotaios ingest https://react-spa-example.com → rendered content saved, frontmatter shows lightpanda parser
Files Changed
- packages/core/src/paths.mjs: dotaiosBinDir(), lightpandaBinPath()
- packages/core/src/lightpanda.mjs
- packages/cli/src/ingest/web.mjs: fetchHtml() dispatcher with lightpanda + plain-fetch + PDF bypass
- packages/cli/src/commands/setup.mjs
- templates/AGENTS.md.hbs
- README.md
- tests/core/{paths,lightpanda}.test.mjs
- tests/cli/{ingest_routing,setup,v1_4_0}.test.mjs
- docs/superpowers/plans/2026-05-18-lightpanda-ingest.md
Notes
- The DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD=1 env var skips the real download (used by setup test).
- The ingestUrl tests in v1_4_0.test.mjs were updated to pass resolveLightpandaImpl: async () => null so the suite stays hermetic regardless of whether lightpanda is installed on the dev machine.
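To show how injecting resolveLightpandaImpl keeps a test hermetic, here is a rough sketch of a dispatcher with fallback. The real fetchHtml signature is not shown in this PR, so the options object, `runLightpandaImpl`, and the returned shape are assumptions, and the PDF bypass is left out for brevity. The parser strings come from the Summary; binding them to PARSER_LIGHTPANDA and PARSER_PLAIN is a guess at what those constants hold.

```javascript
// Assumed values for the constants named in the refactor commit.
const PARSER_LIGHTPANDA = 'lightpanda+readability+turndown';
const PARSER_PLAIN = 'readability+turndown';

// Hypothetical dispatcher: try lightpanda when a binary resolves,
// and fall back to plain fetch on any failure or empty output.
async function fetchHtml(url, { resolveLightpandaImpl, runLightpandaImpl, fetchImpl }) {
  try {
    const binPath = await resolveLightpandaImpl();
    if (binPath) {
      const html = await runLightpandaImpl(binPath, url);
      if (html) return { html, via: 'lightpanda', parser: PARSER_LIGHTPANDA };
    }
  } catch {
    // Any lightpanda failure falls through to the plain path below.
  }
  const response = await fetchImpl(url);
  return { html: await response.text(), via: 'plain', parser: PARSER_PLAIN };
}

// Hermetic usage: the resolver stub returns null, so no binary is looked up
// or spawned, and the fetch stub avoids the network.
fetchHtml('https://example.com', {
  resolveLightpandaImpl: async () => null,
  runLightpandaImpl: async () => { throw new Error('must not be called'); },
  fetchImpl: async () => ({ text: async () => '<p>hi</p>' }),
}).then((result) => console.log(result.via, result.parser));
```

With the stubbed resolver the result is the same on every machine, whether or not ~/.dotaios/bin/lightpanda exists.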