GhostReader

A token-frugal web reader for AI agents. Staged extraction skips the browser when it can — raw HTML first, then framework-embedded JSON, and only as a last resort a real Chromium render.

Status: v0.2.0 — local / single-user, with a benchmark harness. v0.1.0 was the proof of the idea; v0.2.0 adds a regression net and the numbers below. Still not hardened for shared or production use. See the roadmap.

Why it exists

The standard way agents read the web is to launch a headless browser, render every page, hand the resulting HTML blob to an LLM and ask it to figure out the content. That works, and it also wastes tokens, money and seconds on every URL.

A lot of modern pages already contain the article text, product details or post body inside the initial HTML response — embedded as JSON in things like __NEXT_DATA__, JSON-LD scripts, Apollo / Nuxt / Redux hydration blobs. The browser exists to paint that data onto a screen for a human. Your agent doesn't need a screen, it just needs the data.

GhostReader is the small tool that exploits that:

Stage 1 — raw HTML. Fetch with httpx, extract main text with trafilatura. Most articles, blogs, docs and Wikipedia pages return cleanly here in 200–500 ms.
Stage 2 — embedded data. Parse JSON-LD, __NEXT_DATA__ and common hydration blobs. Walk the JSON for the actual content. Catches a huge slice of modern SPAs without ever launching a browser.
Stage 3 — browser fallback. Only when the cheap paths fail, render with Playwright/Chromium (images, media and fonts blocked).

A confidence score from a heuristic over the extracted text decides when to escalate.

What you get in practice

~20× fewer input tokens than handing raw HTML to a model (measured on a Wikipedia article: ~500 KB raw HTML vs. ~25 KB clean markdown).
~5–10× lower latency on the cheap path vs. browser-everything approaches (~300 ms vs. 2–5 s).
Zero per-call cost, no rate limits, no third party in your URL stream — unlike paid readers like Firecrawl or Jina.
SQLite response cache with TTL, so repeated reads of the same URL are free.

Benchmark

v0.2.0 ships a small benchmark harness (benchmarks/) that runs the extractor against two fixture sets and scores quality and efficiency as separate questions. The committed numbers below describe v0.1.0 extractor behavior, measured by the v0.2.0 harness — no extractor logic or thresholds changed in this release; the harness exists so future changes can be measured rather than guessed.

Two modes:

Frozen mode is the regression gate. It runs against stored HTML snapshots in benchmarks/corpus/ so results are deterministic — no network, no Playwright. It exercises Stages 1 and 2 only; Stage 3 against a static snapshot would measure nothing real.
Live mode is a reality check. It fetches current URLs through the full pipeline including Stage 3, and is allowed to drift; it never fails the build.

Frozen baseline (27 fixtures across 10 categories)

approach	quality pass	mean chars	p50 latency
raw HTML dump	14.8%	91,086	7 ms
GhostReader stages 1+2	100.0%	15,795	67 ms

Hand the model the raw HTML and 85% of fixtures fail quality (cookie banners, nav, "Create account" all leak through forbidden_phrases). Hand it GhostReader's frozen stages-1+2 output and quality is clean — at ~5.8× smaller payload. Method mix on the corpus: 88.9% raw_html, 11.1% best_effort, 0% embedded_data — the best_effort cases are all edge/partial fixtures, not happy-path articles. Stage 2 is present in the pipeline but did not materially contribute on the current corpus; the threshold sweep shows it would fire if the Stage-1 acceptance bar were raised, which is a v0.3 follow-up.

Live reality check (4 fixtures, including 2 fetch-failure edges)

The Playwright-only baseline column compares apples-to-apples: 2 expected_fetch_failure fixtures (a PDF and a real 404) are skipped from that column, because Chromium has no concept of refusing to render — it would download the PDF and render the 404 wrapper as if it were content. Both fixtures still run under GhostReader, which refuses them at the fetch boundary.

approach	quality pass	p50 latency	p95 latency	browser launches
Playwright-only baseline	100.0% (2/2)	4151 ms	4188 ms	2
GhostReader full	100.0% (4/4)	561 ms	717 ms	0

Both happy-path fixtures (a long-form Wikipedia article and the Next.js marketing homepage) were sufficient at Stage 1 — Stage 3 never fired. The PDF was refused with HTTPException 415 in 580 ms; the dead Wikipedia URL surfaced as httpx.HTTPStatusError 404 in 314 ms. Single snapshot, not a universal claim — but on this set, Playwright-only spent ~4 seconds doing what GhostReader handled in ~500 ms with near-identical extracted text size and the same quality on the comparable set.

Run it locally

python -m benchmarks.run --mode frozen           # regression gate, ~6s, no network
python -m benchmarks.run --mode live             # reality check, fetches current URLs
python -m benchmarks.run --baselines             # comparison tables vs raw-html / Playwright
python -m benchmarks.run --tune-thresholds       # post-hoc threshold sweep (frozen only)

Frozen mode is CPU-only and completes in ~6 seconds on a typical laptop. Full harness docs in benchmarks/README.md.

Run locally

python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt
playwright install chromium
uvicorn app:app --reload

Example

curl "http://127.0.0.1:8000/read?url=https://en.wikipedia.org/wiki/Web_scraping"

Response (truncated):

{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "text_markdown": "**Web scraping**, **web harvesting**, ...",
  "method": "raw_html",
  "confidence": 0.95,
  "from_cache": false,
  "hints": { "stage": 1, "stats": { "word_count": 4093, ... } }
}

Bypass the cache:

curl -H "X-Cache-Bypass: true" "http://127.0.0.1:8000/read?url=..."

Docker

docker build -t ghostreader:0.2.0 .
docker run --rm -p 8000:8000 ghostreader:0.2.0

Configuration

All env vars are optional.

Variable	Default	Purpose
`GHOSTREADER_ENABLE_BROWSER_FALLBACK`	`true`	Set to `false` to disable Playwright stage.
`GHOSTREADER_ALLOW_DOMAINS`	(empty)	Comma-separated allowlist. Set this if any URL source is untrusted — otherwise this is an SSRF amplifier.
`GHOSTREADER_BLOCK_DOMAINS`	(empty)	Comma-separated blocklist.
`GHOSTREADER_CACHE_TTL_SECONDS`	`21600` (6h)	Cache lifetime per entry.
`GHOSTREADER_CACHE_DB`	`/tmp/ghostreader_cache.sqlite3`	SQLite cache path.
`GHOSTREADER_FETCH_RETRIES`	`2`	Retries for transient HTTP errors.
`GHOSTREADER_TIMEOUT_SECONDS`	`12`	Per-request timeout.
`GHOSTREADER_SMALL_STATIC_HTML_BYTES`	`50000`	HTML size below which a short page is accepted at stage 1 without escalation.

Notes

respect_robots=true is the default on the /read endpoint.
The cache key includes the normalized URL, the user-agent string, and the browser_fallback flag.
Browser fallback uses Playwright with service workers blocked and image/media/font requests aborted to keep render cost down.

Roadmap

The plan is to evolve this from "a script you run locally" into a library + a service, so the same core extraction logic can be called either in-process by a Python agent or over HTTP by anything else:

ghost_reader/ — pure-Python core package. No FastAPI dependency. Exports read_text(url, ...) -> ReadResult. Importable directly into your agent code for the lowest possible latency and zero network hop.
ghost_reader_service/ — thin FastAPI wrapper around the core. Same logic, exposed over HTTP. Useful when the agent isn't Python, or when you want a single shared cache across many agents.

Other things on the near-term list:

Threshold tuning informed by the v0.2.0 sweep — the curve has measurable headroom; v0.3 will move on it.
Smarter walking of JSON-LD shapes (articleBody, text, description) instead of the current "long string" heuristic. The frozen corpus showed Stage 2 idle on every fixture; this is the change most likely to wake it up.
Content-driven Playwright wait strategy (selector / network-idle with cap) instead of the fixed sleep.
Concurrency / queue layer in the service flavor.
PyPI release alongside the v0.3.0 library split, so pip install ghost-reader gives you the in-process reader directly.

License

Apache 2.0.

Maintained by Linkedin @fmsalvadori GitHub MadeInPluto

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
benchmarks		benchmarks
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
app.py		app.py
benchmark_target.py		benchmark_target.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GhostReader

Why it exists

What you get in practice

Benchmark

Frozen baseline (27 fixtures across 10 categories)

Live reality check (4 fixtures, including 2 fetch-failure edges)

Run it locally

Run locally

Example

Docker

Configuration

Notes

Roadmap

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GhostReader

Why it exists

What you get in practice

Benchmark

Frozen baseline (27 fixtures across 10 categories)

Live reality check (4 fixtures, including 2 fetch-failure edges)

Run it locally

Run locally

Example

Docker

Configuration

Notes

Roadmap

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages