
Research: scrape-via-agent extraction quality on dense doc sites (truncation + verifier) #64

@laradji

Description


The scrape-via-agent path shipped in #57 has two design choices that combine badly on real-world doc sites with heavy chrome (mkdocs-material, GitBook, ReadTheDocs, etc.):

  1. Byte-blind input truncation at 48 KiB. Pages are chopped at byte 49152 with no awareness of section boundaries, code blocks, or where the actual content lives in the document.
  2. Strict-by-default verifyCodeBlocks. Every fenced code block in the LLM output must appear as a literal substring of the source content. Whitespace differences cause rejection.

Together, these produce a ~66% verification failure rate on FastAPI (mkdocs-material) during the #58 smoke test — and the failures are likely caused by the truncation cutting code blocks in half, which the LLM then reconstructs as a syntactically clean version that no longer byte-matches the truncated source.

Surfaced by: #58 (FastAPI smoke test, 2026-04-11)

Empirical evidence from #58

10 FastAPI tutorial URLs scraped via Qwen3.5-4B-MLX-4bit. Of the 6 URLs processed before an unrelated timeout (#62 tracks that):

| URL | Original size | Truncated to | Result |
| --- | --- | --- | --- |
| /first-steps/ | 154 KB | 49 KB | ❌ verification_failed |
| /path-params/ | 159 KB | 49 KB | ❌ verification_failed |
| /query-params/ | 131 KB | 49 KB | ❌ verification_failed |
| /body/ | 200 KB | 49 KB | ❌ verification_failed |
| /response-model/ | 200 KB | 49 KB | ✅ docs_extracted=2 (1 inserted, 1 lost to embedder bug) |
| /extra-models/ | 148 KB | 49 KB | ✅ docs_extracted=4 |

Pattern: 4 failures, 2 successes. The split is not random; it depends on which content happens to survive the truncation. A 33% success rate on mkdocs-material is not viable for a real corpus.

Note that every page is truncated — the smallest is 116 KB (security/), still 2.4× the 48 KiB cap. Mkdocs-material pages are 80% chrome (sidebar nav, breadcrumbs, search index, JS bundle preloads, footer, ToC widget) and 20% actual content. The byte-blind truncation has no way to know that the chrome is mostly at the top and the content is in the middle, so it cuts content out at random.

Problem 1 — Byte-blind truncation

internal/scraper/agent.go:

```go
if len(content) > agentInputMaxChars {
    slog.Warn("agent.input_truncated", ...)
    content = content[:agentInputMaxChars]
}
```

This is the simplest possible truncation: chop at byte N. It works for clean markdown sources (which deadzone tested with in #57) but fails on dense HTML where the actual content is buried in 80% chrome.
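Even before choosing a strategy from the table below, the byte-blind cut could be made boundary-aware with a few lines of stdlib Go. A minimal sketch (`truncateAtBoundary` is a hypothetical helper, not in the repo): back up from the cap to the last blank line, so no paragraph or fenced code block is split mid-way.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateAtBoundary caps content at max bytes, but backs up to the last
// blank line inside the window so that no paragraph or fenced code block
// is cut in half. If no boundary exists in the window, it falls back to
// the byte-blind cut. Hypothetical helper, not repo code.
func truncateAtBoundary(content string, max int) string {
	if len(content) <= max {
		return content
	}
	cut := strings.LastIndex(content[:max], "\n\n")
	if cut <= 0 {
		return content[:max] // no boundary found: fall back to byte-blind
	}
	return content[:cut]
}

func main() {
	src := "intro paragraph\n\n```python\ncode block\n```\n\ntail"
	fmt.Println(truncateAtBoundary(src, 20)) // prints "intro paragraph"
}
```

This does not solve the chrome problem (the dropped tail may still be the real content), but it eliminates the half-a-code-block failure mode on its own.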

Truncation strategies to evaluate

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| A. Byte-blind (current) | `content[:N]` | Trivial, deterministic | Cuts code blocks, drops content randomly |
| B. HTML pre-stripping | Use `golang.org/x/net/html` to strip `<nav>`, `<header>`, `<footer>`, `<aside>`, `<script>`, `<style>` before truncation | Removes 60-80% of mkdocs-material chrome upfront, content fits in 48 KiB | Requires an HTML parser dep, depends on the site using semantic HTML |
| C. Reader-mode extraction | Use a Mozilla Readability port (e.g. `go-shiori/go-readability`) to extract main content before sending to the LLM | Best content extraction, works on most doc sites | New dep, may over-strip on technical docs with multiple `<article>` blocks |
| D. Section-aware truncation | Parse markdown-after-LLM (or HTML-before-LLM) by section headers, drop low-priority sections (ToC, links list, footer) | Preserves complete content of high-priority sections | Complex, requires hand-tuning per source family |
| E. Sliding window with overlap | Make multiple LLM calls on overlapping chunks, merge | Handles arbitrary input size | Expensive (N× LLM calls), merge logic is non-trivial |
| F. Increase the cap to fit big pages | `agentInputMaxChars = 200 * 1024` | Trivial, no new code | Larger input → slower LLM, more wasted tokens, cap still arbitrary |
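For scale, strategy E's chunking step is the easy half and fits in a few lines (`chunksWithOverlap` is an illustrative name; the genuinely hard part, merging the N LLM outputs, is deliberately not shown):

```go
package main

import "fmt"

// chunksWithOverlap splits content into windows of at most max bytes,
// with overlap bytes shared between consecutive windows so a code block
// straddling a window edge appears whole in at least one chunk.
// Assumes max > overlap; illustrative sketch, not repo code.
func chunksWithOverlap(content string, max, overlap int) []string {
	var out []string
	step := max - overlap
	for start := 0; ; start += step {
		end := start + max
		if end >= len(content) {
			out = append(out, content[start:])
			return out
		}
		out = append(out, content[start:end])
	}
}

func main() {
	fmt.Println(chunksWithOverlap("abcdefghij", 4, 2)) // prints [abcd cdef efgh ghij]
}
```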

My instinct: B (HTML pre-stripping) is the right v1. Concrete, scoped, single-pass, and golang.org/x/net/html is a small, pure-Go, well-maintained dependency. It should cut mkdocs-material pages from 154 KB to roughly 30-50 KB without losing any content. C (Readability) is the right v2 if B isn't enough.
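As a feasibility check for B, the size savings can even be estimated with the stdlib alone. A real implementation would tokenize with golang.org/x/net/html; this naive scan (everything here is an assumption, not repo code) just deletes everything between an opening chrome tag and its closing tag, and does not handle nested same-name tags:

```go
package main

import (
	"fmt"
	"strings"
)

// stripChrome is a naive stdlib-only sketch of strategy B: delete
// everything between <tag ...> and </tag> for a fixed list of chrome
// elements. It ignores nested same-name tags, which is fine for a
// size-savings estimate but not for production use.
func stripChrome(html string) string {
	for _, tag := range []string{"script", "style", "nav", "header", "footer", "aside"} {
		openTag, closeTag := "<"+tag, "</"+tag+">"
		for {
			i := strings.Index(html, openTag)
			if i < 0 {
				break
			}
			j := strings.Index(html[i:], closeTag)
			if j < 0 {
				html = html[:i] // unclosed chrome tag: drop the rest
				break
			}
			html = html[:i] + html[i+j+len(closeTag):]
		}
	}
	return html
}

func main() {
	page := `<nav class="md-nav">sidebar</nav><article>real content</article><footer>links</footer>`
	fmt.Println(stripChrome(page)) // prints <article>real content</article>
}
```

Running this over the 10 FastAPI pages and comparing post-strip sizes against the 48 KiB cap would answer "does B alone make truncation unnecessary?" before any dependency is added.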

Problem 2 — Strict-byte verifyCodeBlocks

internal/scraper/agent.go:

```go
func verifyCodeBlocks(md, source string) bool {
    for _, block := range extractFencedBlocks(md) {
        if !strings.Contains(source, block) {
            return false
        }
    }
    return true
}
```

strings.Contains is a literal byte substring match. This fails when:

  • The LLM normalizes whitespace (tab → 4 spaces, trailing whitespace removal, line ending normalization)
  • The HTML source has <span>-wrapped syntax-highlighted code (<span class="kw">def</span>) that the LLM unwraps
  • The source has zero-width characters or weird Unicode that the LLM cleans up
  • The LLM reformats blank lines inside code blocks
  • The LLM converts smart quotes back to straight quotes

The current behavior is "strict by default" per the locked-in #57 design — but the test in agent_test.go (TestVerifyCodeBlocks_StrictWhitespace) intentionally validates that "4-space-indented source vs tab-indented md → reject", which is exactly the failure mode that bites in production.

Verifier loosening strategies to evaluate

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| A. Strict (current) | `strings.Contains(source, block)` | Strongest hallucination protection | Rejects valid extractions where the LLM only normalized formatting |
| B. Whitespace-normalized | Compare both md and source after `strings.Join(strings.Fields(s), " ")` | Tolerates any whitespace difference, still catches real hallucinations | Allows whitespace-only "hallucinations" (probably fine: code that differs only in whitespace is the same code) |
| C. Token-overlap threshold | For each md block, check that >X% of its tokens appear in the source; X=0.9 catches LLMs that add a word or two but rejects fully fabricated blocks | Tolerates minor edits, still strong | Tunable threshold is a parameter we have to defend |
| D. Embedding similarity | Embed the md block and its source neighborhood, accept if cosine similarity exceeds a threshold | Most robust, language-aware | Expensive (extra embed calls per verification), requires more code |
| E. Opt-out flag | Per-source field in libraries_sources.yaml, e.g. `verify: false`, to skip the check entirely | Trivial, lets users escape on a per-source basis | Easy to abuse, no protection on opted-out sources |
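Strategy C needs no new dependencies either. A rough sketch (`tokenOverlap` is an illustrative name, not repo code): treat whitespace-separated tokens as the unit and measure what fraction of the block's tokens appear anywhere in the source.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenOverlap returns the fraction of whitespace-separated tokens in an
// extracted block that also appear somewhere in the source. Illustrative
// sketch of strategy C; an empty block trivially passes.
func tokenOverlap(block, source string) float64 {
	tokens := strings.Fields(block)
	if len(tokens) == 0 {
		return 1.0
	}
	srcSet := make(map[string]bool)
	for _, t := range strings.Fields(source) {
		srcSet[t] = true
	}
	hits := 0
	for _, t := range tokens {
		if srcSet[t] {
			hits++
		}
	}
	return float64(hits) / float64(len(tokens))
}

func main() {
	src := `def read_root(): return {"Hello": "World"}`
	fake := "def totally_new(): pass"
	fmt.Printf("%.2f %.2f\n", tokenOverlap(src, src), tokenOverlap(fake, src)) // prints 1.00 0.25
}
```

Note the token-set comparison deliberately ignores token order, which is part of why the threshold X has to be defended: a reordered-but-same-tokens block would pass.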

My instinct: B (whitespace-normalized) is the right v1. It's a 10-line change in verifyCodeBlocks and addresses the most common failure mode (LLM normalizes whitespace) without giving up too much hallucination protection. Code that differs only in whitespace is functionally identical. C (token overlap) is the right v2 if B over-permits.
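A sketch of what the strategy-B change could look like (`verifyCodeBlocksLoose` and its `[]string` signature are illustrative; the real function in internal/scraper/agent.go extracts fenced blocks from the markdown first via `extractFencedBlocks`):

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses every run of whitespace to a single space, so tab
// vs. space indentation, trailing whitespace, and line-ending differences
// all compare equal.
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// verifyCodeBlocksLoose is an illustrative sketch of strategy B: each
// extracted block must appear in the source after both sides are
// whitespace-normalized. Fabricated blocks still fail.
func verifyCodeBlocksLoose(blocks []string, source string) bool {
	normSource := normalize(source)
	for _, block := range blocks {
		if !strings.Contains(normSource, normalize(block)) {
			return false
		}
	}
	return true
}

func main() {
	source := "def hello():\n    return 1"         // 4-space-indented source
	llmBlock := "def hello():\n\treturn 1"         // tab-indented LLM output
	fmt.Println(strings.Contains(source, llmBlock), // strict: fails
		verifyCodeBlocksLoose([]string{llmBlock}, source)) // loose: passes
	// prints "false true"
}
```

This is exactly the 4-space-vs-tab case that TestVerifyCodeBlocks_StrictWhitespace currently asserts must be rejected.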

The current TestVerifyCodeBlocks_StrictWhitespace test would need to be inverted (or split into two: one that asserts B, one that asserts strict mode if we keep it as opt-in).

Composite question: are these two problems linked?

Almost certainly yes. The hypothesis is:

  1. The 154 KB HTML page contains a code block at byte 60000.
  2. Byte-blind truncation cuts the page at byte 49152 — the code block is split.
  3. The LLM sees half a code block. It produces clean Markdown with what appears to be a complete code block — either by reconstructing it from context, or by emitting a different (smaller) code block from earlier in the page.
  4. verifyCodeBlocks looks for the LLM's clean block in the truncated source. It's not there as a byte substring — fail.

If this hypothesis is right, fixing problem 1 alone (HTML pre-stripping) would dramatically reduce the verification failure rate, because content + code blocks would fit comfortably in 48 KiB without truncation. Problem 2 (loosening verifier) would still help on edge cases but wouldn't be the dominant factor.

This needs to be tested empirically: run #58 again after fixing problem 1 only, measure the verification success rate, then decide if problem 2's fix is also needed.

Scope (research issue, not implementation yet)

This issue produces:

  • A short benchmark of the four options for problem 1 (A, B, C, F) on the 10 FastAPI tutorial URLs from the #58 smoke test, measuring verification success rate
  • A short benchmark of the four options for problem 2 (A, B, C, D) on the same corpus, measuring false-positive rate (real hallucinations missed) and false-negative rate (valid extractions rejected)
  • A decision document at docs/research/scrape-via-agent-extraction-quality.md with the chosen approach
  • One or more follow-up implementation issues to ship the fixes

Out of scope

  • Implementing any fix. This issue is about picking the right approach.
  • Adding more doc sources to test against. Stick with FastAPI's 10 URLs as the canonical bench until the design lands.
  • Switching to an entirely new extraction approach (e.g. JSON API + structured parser). That's the scope of #1 (Research: JSON-based source kind for structured-API doc sites).
  • Per-source verifier configuration. Premature.

Acceptance criteria

  • docs/research/scrape-via-agent-extraction-quality.md exists, contains the benchmark numbers, the chosen approach, and the rationale
  • At least one follow-up implementation issue is filed (or "decision: ship as-is, accept the failure rate" with reasoning)
  • After the implementation lands, re-running smoke test #58 shows a >80% verification success rate on the 10 FastAPI URLs

Labels: P3 Low (nice-to-have, when time allows), research (Research / spike)