feat: lightpanda headless browser for web ingest #2

Merged
filocosta46 merged 14 commits into main from feat/lightpanda-ingest on May 19, 2026

@filocosta46
Owner

Summary

  • Bundles Lightpanda (zero-config headless browser) as the default fetcher for dotaios ingest <url> so JavaScript-heavy pages render correctly.
  • dotaios setup downloads the binary silently to ~/.dotaios/bin/lightpanda on Mac/Linux (Windows skipped — no upstream binary).
  • ingestUrl() dispatches to lightpanda when available; falls back to plain fetch on any failure (spawn error, non-zero exit, empty stdout, timeout). PDF URLs bypass lightpanda to preserve existing Path B routing.
  • AGENTS.md.hbs gains a rule instructing all agents to route URLs through dotaios ingest instead of their own web tools.
  • Frontmatter parser field reflects which fetcher ran: lightpanda+readability+turndown vs readability+turndown.
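
The dispatch-and-fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in web.mjs: `runLightpandaImpl` and the shape of the returned object are assumptions made for the example, and the PDF bypass is left out for brevity. Only `fetchImpl` and `resolveLightpandaImpl` are override names that appear in this PR.

```javascript
// Parser labels as they appear in the frontmatter, per the summary above.
const PARSER_LIGHTPANDA = "lightpanda+readability+turndown";
const PARSER_PLAIN = "readability+turndown";

// Sketch of a fetchHtml-style dispatcher with injected dependencies.
// runLightpandaImpl is a hypothetical helper that resolves to { status, stdout }
// and rejects on spawn error or timeout.
async function fetchHtml(url, { resolveLightpandaImpl, runLightpandaImpl, fetchImpl }) {
  const binPath = await resolveLightpandaImpl(); // null when lightpanda is not installed
  if (binPath) {
    try {
      const { status, stdout } = await runLightpandaImpl(binPath, url);
      // Accept the result only on a clean exit with non-empty output.
      if (status === 0 && stdout.trim() !== "") {
        return { html: stdout, via: "lightpanda", parser: PARSER_LIGHTPANDA };
      }
    } catch {
      // Spawn error or timeout: fall through to plain fetch.
    }
  }
  const res = await fetchImpl(url);
  return { html: await res.text(), via: "plain", parser: PARSER_PLAIN };
}
```

Injecting the resolver and runner keeps every failure mode (missing binary, crash, non-zero exit, empty stdout) testable without a real browser on the machine.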

Test Plan

  • node --test tests/**/*.test.mjs — 271 pass / 0 fail / 1 skipped (was 249 baseline; +22 new tests across paths, lightpanda, ingest_routing, setup, render)
  • Setup wizard prints lightpanda install line (verified via DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD=1 test escape hatch)
  • Lightpanda success path → frontmatter shows lightpanda+readability+turndown
  • Lightpanda crash → fallback to plain fetch with readability+turndown
  • Lightpanda missing → plain fetch + one-time ~/.dotaios/.lightpanda_hint_shown flag
  • PDF URLs (*.pdf) skip lightpanda → still route through Path B (unpdf)
  • Manual: run npx dotaios setup on a fresh Mac, confirm ~/.dotaios/bin/lightpanda exists and is executable
  • Manual: dotaios ingest https://react-spa-example.com → rendered content saved, frontmatter shows lightpanda parser

Files Changed

| File | What |
| --- | --- |
| packages/core/src/paths.mjs | dotaiosBinDir(), lightpandaBinPath() |
| packages/core/src/lightpanda.mjs | New: platform detection, download, resolve (zero npm deps) |
| packages/cli/src/ingest/web.mjs | fetchHtml() dispatcher with lightpanda + plain-fetch + PDF bypass |
| packages/cli/src/commands/setup.mjs | Best-effort lightpanda install step |
| templates/AGENTS.md.hbs | URL routing rule under `## Rules` |
| README.md | Lightpanda mention in ingest section |
| tests/core/{paths,lightpanda}.test.mjs | New + extended |
| tests/cli/{ingest_routing,setup,v1_4_0}.test.mjs | New regression tests + DI overrides |
| docs/superpowers/plans/2026-05-18-lightpanda-ingest.md | Implementation plan |

Notes

  • Lightpanda download is best-effort: setup never fails because of it.
  • DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD=1 env var skips the real download (used by setup test).
  • Existing ingestUrl tests in v1_4_0.test.mjs were updated to pass resolveLightpandaImpl: async () => null so the suite stays hermetic regardless of whether lightpanda is installed on the dev machine.
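
To illustrate the "best-effort" contract in the notes above, here is a rough sketch of a setup step that never throws. The function name `installBrowserStep`, the injected `downloadImpl` and `log`, and the message strings are all hypothetical; only the `DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD=1` variable comes from this PR.

```javascript
// Sketch of a best-effort install step: every outcome is reported via the
// return value, and a failed download never propagates to the caller.
async function installBrowserStep({ env, downloadImpl, log }) {
  // Escape hatch used by the setup test: skip the real download entirely.
  if (env.DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD === "1") {
    log("web browsing engine: skipped");
    return { installed: false, reason: "skipped" };
  }
  try {
    const binPath = await downloadImpl(); // hypothetical: resolves to the installed path
    log(`web browsing engine: installed at ${binPath}`);
    return { installed: true, binPath };
  } catch (err) {
    // Swallow the error so setup continues; ingest falls back to plain fetch.
    log(`web browsing engine: not installed (${err.message})`);
    return { installed: false, reason: "error" };
  }
}
```

In a real wizard the caller would pass `process.env` and the actual downloader; the tests can pass plain objects and fakes instead.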

filocosta46 and others added 12 commits May 18, 2026 11:28
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a best-effort lightpanda install step to the setup wizard, between
the reveal step and skills summary. Uses DOTAIOS_SKIP_LIGHTPANDA_DOWNLOAD=1
as a CI escape hatch to keep tests hermetic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… tests

Existing ingestUrl tests passed fetchImpl but not resolveLightpandaImpl.
After Task 5 added lightpanda dispatch, the real resolver hit ~/.dotaios/bin/lightpanda
on dev machines that ran Task 6's setup once, spawning lightpanda against fake URLs
and hanging the suite. Inject async () => null to keep tests hermetic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Lightpanda returns raw binary bytes for PDF URLs; with exit status 0 and
non-empty stdout, fetchHtml was returning via="lightpanda" causing the
PDF content-type branch (only entered for via="plain") to be skipped and
handing PDF bytes to Readability, producing READABILITY_NULL.

Guard the lightpanda spawn (and the hint-flag write) behind a
looksLikePdfUrl check so .pdf URLs always fall through to plain fetch
and enter the PDF ingestPdfResponse path.

Regression test added to tests/cli/ingest_routing.test.mjs: resolves
lightpanda to a fake path, asserts spawn is never called, and verifies
result.kind === "pdf" and result.parser === "unpdf".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
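
For reference, a guard along the lines of the `looksLikePdfUrl` check named in this commit could look like the sketch below. The body is an assumption (the actual check in web.mjs may differ); it tests only the URL path, so query strings and fragments do not affect the result.

```javascript
// Sketch: decide from the URL alone whether to skip the headless browser.
// Only the pathname is inspected, case-insensitively; unparseable input is
// treated as "not a PDF" so the caller's normal error handling still applies.
function looksLikePdfUrl(url) {
  try {
    return new URL(url).pathname.toLowerCase().endsWith(".pdf");
  } catch {
    return false;
  }
}
```

A URL-based check cannot catch PDFs served from extensionless paths; those would still depend on the content-type branch of the plain-fetch path.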
@vercel

vercel Bot commented May 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dotaios | Ready | Preview, Comment | May 18, 2026 2:18pm |

- downloadLightpanda: idempotency check (skip re-download if executable),
  streaming write via stream/promises (avoid buffering 66MB), freeze
  PLATFORM_BINARIES lookup table
- resolveLightpanda: fs.access with X_OK constant catches non-exec files,
  memoize defaultWhich result to avoid spawn per ingestUrl call
- paths.mjs: extract dotaiosDir() + lightpandaHintFlagPath() helpers
- web.mjs: extract PARSER_LIGHTPANDA/PARSER_PLAIN constants, derive
  hintFlagPath default from new helper, collapse PDF-URL short-circuit
  into single conditional, use pathExists from files.mjs
- setup.mjs: pass platformBinary through to avoid double lookup, neutralize
  test-env skip message, drop task-plan numbering comment
- v1_4_0.test.mjs: extract NO_LIGHTPANDA constant (was repeated 9x)
- render.test.mjs, lightpanda.test.mjs: hoist mid-file imports to top
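
The memoization mentioned for `defaultWhich` can be sketched with a small helper like the one below. `memoizeAsync` is a hypothetical name used for illustration and is not taken from the codebase.

```javascript
// Sketch: run an async lookup at most once and share its promise afterwards,
// so repeated ingestUrl calls do not each spawn a `which`-style process.
function memoizeAsync(fn) {
  let cached; // the in-flight or settled promise after the first call
  return () => {
    if (cached === undefined) cached = fn();
    return cached;
  };
}
```

One trade-off of this simple form: a rejected first call stays cached for the life of the process, which is acceptable for a lookup whose answer is not expected to change mid-run.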

Critical bugs found during a real smoke test on macOS 25.3.0 arm64:

1. Release URL 'releases/latest/download' on this repo redirects to the
   'nightly' tag, which historically ships broken builds (silent startup
   hangs, zero output). Pin to stable tag 0.3.0.

2. spawn args were 'lightpanda fetch --dump <url>' — wrong. Actual CLI
   syntax is 'lightpanda fetch --dump html <url>'. The --dump flag
   requires an explicit format argument (html|markdown).

Also UX polish (drop 'Lightpanda' jargon from user-facing strings,
say 'web browsing engine' instead).

End-to-end smoke verified:
- dotaios setup → installs 0.3.0 binary at ~/.dotaios/bin/lightpanda
- dotaios ingest https://example.com → 0.4s, parser=lightpanda+...
- dotaios ingest https://JS-rendered-demo → 0.56s, real content extracted

272 tests / 271 pass / 0 fail / 1 skipped.
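
The two fixes in this commit (a pinned release tag and the corrected `--dump html` arguments) could be expressed along these lines. Treat this as a sketch: the asset names in the table and the repository path in the URL are assumptions to be checked against the upstream release page, and the helper names are illustrative. The tag 0.3.0, the `PLATFORM_BINARIES` name, and the argument order come from this PR.

```javascript
// Pinned stable tag instead of releases/latest/download.
const PINNED_TAG = "0.3.0";

// Frozen lookup table keyed by `${process.platform}-${process.arch}`.
// Asset names are assumed for illustration; Windows has no entry on purpose.
const PLATFORM_BINARIES = Object.freeze({
  "darwin-arm64": "lightpanda-aarch64-macos",
  "darwin-x64": "lightpanda-x86_64-macos",
  "linux-arm64": "lightpanda-aarch64-linux",
  "linux-x64": "lightpanda-x86_64-linux",
});

// Returns the asset name for a platform, or null when none is available.
function platformBinary(platform, arch) {
  return PLATFORM_BINARIES[`${platform}-${arch}`] ?? null;
}

// Assumed URL layout: standard GitHub release-asset download path.
function releaseUrl(asset, tag = PINNED_TAG) {
  return `https://github.com/lightpanda-io/browser/releases/download/${tag}/${asset}`;
}

// --dump takes an explicit format argument, so "html" precedes the URL.
function lightpandaArgs(url) {
  return ["fetch", "--dump", "html", url];
}
```

Keeping the argument list in one function makes a regression like the missing format argument easy to pin with a unit test.
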

filocosta46 merged commit dad589c into main on May 19, 2026
4 checks passed