The scrape-via-agent path shipped in #57 has two design choices that combine badly on real-world doc sites with heavy chrome (mkdocs-material, GitBook, ReadTheDocs, etc.):
1. **Byte-blind input truncation at 48 KiB.** Pages are chopped at byte 49152 with no awareness of section boundaries, code blocks, or where the actual content lives in the document.
2. **Strict-by-default `verifyCodeBlocks`.** Every fenced code block in the LLM output must appear as a literal substring of the source content. Whitespace differences cause rejection.
Together, these produce a ~66% verification failure rate on FastAPI (mkdocs-material) during the #58 smoke test — and the failures are likely caused by the truncation cutting code blocks in half, which the LLM then reconstructs as a syntactically clean version that no longer byte-matches the truncated source.
## Empirical evidence from #58

10 FastAPI tutorial URLs were scraped via Qwen3.5-4B-MLX-4bit. Of the 6 URLs processed before an unrelated timeout (#62 tracks that):
| URL | Original size | Truncated to | Result |
| --- | --- | --- | --- |
| `/first-steps/` | 154 KB | 49 KB | ❌ verification_failed |
| `/path-params/` | 159 KB | 49 KB | ❌ verification_failed |
| `/query-params/` | 131 KB | 49 KB | ❌ verification_failed |
| `/body/` | 200 KB | 49 KB | ❌ verification_failed |
| `/response-model/` | 200 KB | 49 KB | ✅ docs_extracted=2 (1 inserted, 1 lost to embedder bug) |
| `/extra-models/` | 148 KB | 49 KB | ✅ docs_extracted=4 |
Pattern: 4 fails → 2 successes. Not random — depends on what content survives the truncation. 33% success rate on mkdocs-material is not viable for a real corpus.
Note that every page is truncated — the smallest is 116 KB (security/), still 2.4× the 48 KiB cap. Mkdocs-material pages are 80% chrome (sidebar nav, breadcrumbs, search index, JS bundle preloads, footer, ToC widget) and 20% actual content. The byte-blind truncation has no way to know that the chrome is mostly at the top and the content is in the middle, so it cuts content out at random.
## Problem 1 — Byte-blind truncation

In `internal/scraper/agent.go`, the truncation is the simplest possible: chop at byte N. It works for clean markdown sources (which deadzone tested with in #57) but fails on dense HTML where the actual content is buried in 80% chrome.
### Truncation strategies to evaluate

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| A. Byte-blind (current) | `content[:N]` | Trivial, deterministic | Cuts code blocks, drops content randomly |
| B. HTML pre-stripping | Use `golang.org/x/net/html` to strip `<nav>`, `<header>`, `<footer>`, `<aside>`, `<script>`, `<style>` before truncation | Removes 60-80% of mkdocs-material chrome upfront; content fits in 48 KiB | Requires an HTML parser dep; depends on the site using semantic HTML |
| C. Reader-mode extraction | Use a Mozilla Readability port (e.g. `go-shiori/go-readability`) to extract main content before sending to the LLM | Best content extraction; works on most doc sites | New dep; may over-strip on technical docs with multiple `<article>` blocks |
| D. Section-aware truncation | Parse markdown-after-LLM (or HTML-before-LLM) by section headers; drop low-priority sections (ToC, links list, footer) | Preserves complete content of high-priority sections | Complex; requires hand-tuning per source family |
| E. Sliding window with overlap | Make multiple LLM calls on overlapping chunks, then merge | Handles arbitrary input size | Expensive (N× LLM calls); merge logic is non-trivial |
| F. Increase the cap to fit big pages | `agentInputMaxChars = 200 * 1024` | Trivial, no new code | Larger input → slower LLM, more wasted tokens; cap still arbitrary |
My instinct: B (HTML pre-stripping) is the right v1. Concrete, scoped, single-pass, and `golang.org/x/net/html` is a lightweight dep. It cuts mkdocs-material from 154 KB to maybe 30-50 KB without losing any content. C (Readability) is the right v2 if B isn't enough.
## Problem 2 — Strict-byte `verifyCodeBlocks`

In `internal/scraper/agent.go`, `strings.Contains` is a literal byte substring match. This fails when:
- The LLM normalizes whitespace (tab → 4 spaces, trailing-whitespace removal, line-ending normalization)
- The HTML source has `<span>`-wrapped syntax-highlighted code (`<span class="kw">def</span>`) that the LLM unwraps
- The source has zero-width characters or weird Unicode that the LLM cleans up
- The LLM reformats blank lines inside code blocks
- The LLM converts smart quotes back to straight quotes
The current behavior is "strict by default" per the locked-in #57 design — but the test in `agent_test.go` (`TestVerifyCodeBlocks_StrictWhitespace`) intentionally validates that "4-space-indented source vs tab-indented md → reject", which is exactly the failure mode that bites in production.
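A minimal reproduction of that failure mode (function name illustrative; the real check lives in `verifyCodeBlocks`):

```go
package main

import (
	"fmt"
	"strings"
)

// verifyStrict is the current check as described in the issue: the LLM's
// fenced block must be a literal byte substring of the source.
func verifyStrict(source, block string) bool {
	return strings.Contains(source, block)
}

func main() {
	source := "    def read_root():\n        return {}\n" // 4-space indented source
	block := "\tdef read_root():\n\t\treturn {}\n"         // LLM re-emitted with tabs

	fmt.Println(verifyStrict(source, block)) // false — rejected despite identical code
}
```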
### Verifier loosening strategies to evaluate

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| A. Strict (current) | `strings.Contains(source, block)` | Strongest hallucination protection | Rejects valid extractions where the LLM only normalized formatting |
| B. Whitespace-normalized | Compare both md and source after `strings.Join(strings.Fields(s), " ")` | Tolerates any whitespace difference; still catches real hallucinations | Allows whitespace-only "hallucinations" (probably fine — code that differs only in whitespace is the same code) |
| C. Token-overlap threshold | For each md block, check that >X% of its tokens appear in the source; X=0.9 catches LLMs that add a word or two but rejects fully fabricated blocks | Tolerates minor edits; still strong | Tunable threshold = a parameter we have to defend |
| D. Embedding similarity | Embed the md block + source neighborhood; accept if cosine similarity > threshold | Most robust, language-aware | Expensive (extra embed calls per verification); requires more code |
| E. Opt-out flag | Per-source field in `libraries_sources.yaml` like `verify: false` to skip the check entirely | Trivial; lets users escape on a per-source basis | Easy to abuse; no protection on opted-out sources |
My instinct: B (whitespace-normalized) is the right v1. It's a 10-line change in `verifyCodeBlocks` and addresses the most common failure mode (the LLM normalizes whitespace) without giving up much hallucination protection. Code that differs only in whitespace is functionally identical. C (token overlap) is the right v2 if B over-permits.
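Option B, sketched (names illustrative; this is the `strings.Fields`/`strings.Join` idea from the table, not the shipped code):

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses every run of whitespace to a single space.
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// verifyNormalized accepts a block if it appears in the source once both
// sides are whitespace-normalized.
func verifyNormalized(source, block string) bool {
	return strings.Contains(normalize(source), normalize(block))
}

func main() {
	source := "    def read_root():\n        return {}\n"
	tabbed := "\tdef read_root():\n\t\treturn {}\n"
	fabricated := "def delete_everything(): ...\n"

	fmt.Println(verifyNormalized(source, tabbed))     // true — whitespace-only difference
	fmt.Println(verifyNormalized(source, fabricated)) // false — real hallucination still caught
}
```

Note the trade-off the table flags: any block that differs from the source only in whitespace now passes, which is the intended loosening.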
The current `TestVerifyCodeBlocks_StrictWhitespace` test would need to be inverted (or split into two: one that asserts B, one that asserts strict mode if we keep it as opt-in).
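If B does over-permit, option C could look like this sketch. The whitespace tokenization and the 0.9 threshold are exactly the tunables the table warns about:

```go
package main

import (
	"fmt"
	"strings"
)

// tokenOverlap returns the fraction of whitespace-separated tokens in
// block that also occur anywhere in source (sketch of option C; a real
// implementation might tokenize more carefully).
func tokenOverlap(source, block string) float64 {
	have := make(map[string]bool)
	for _, t := range strings.Fields(source) {
		have[t] = true
	}
	toks := strings.Fields(block)
	if len(toks) == 0 {
		return 1
	}
	hits := 0
	for _, t := range toks {
		if have[t] {
			hits++
		}
	}
	return float64(hits) / float64(len(toks))
}

func main() {
	source := `from fastapi import FastAPI
app = FastAPI()
@app.get("/")
def read_root():
    return {"Hello": "World"}`
	minorEdit := source + "\n# FastAPI"                    // LLM appended a comment
	fabricated := "class Database: def drop(self): pass"   // not in the source at all

	fmt.Println(tokenOverlap(source, minorEdit) >= 0.9)  // true — small addition tolerated
	fmt.Println(tokenOverlap(source, fabricated) >= 0.9) // false — fabricated block rejected
}
```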
## Composite question: are these two problems linked?
Almost certainly yes. The hypothesis is:
1. The 154 KB HTML page contains a code block at byte 60000.
2. Byte-blind truncation cuts the page at byte 49152 — the code block is split.
3. The LLM sees half a code block. It produces clean Markdown with what appears to be a complete code block — either by reconstructing it from context, or by emitting a different (smaller) code block from earlier in the page.
4. `verifyCodeBlocks` looks for the LLM's clean block in the truncated source. It's not there as a byte substring — fail.
If this hypothesis is right, fixing problem 1 alone (HTML pre-stripping) would dramatically reduce the verification failure rate, because content + code blocks would fit comfortably in 48 KiB without truncation. Problem 2 (loosening verifier) would still help on edge cases but wouldn't be the dominant factor.
This needs to be tested empirically: run #58 again after fixing problem 1 only, measure the verification success rate, then decide if problem 2's fix is also needed.
## Scope (research issue, not implementation yet)

This issue produces:

- A short benchmark of the four options for problem 2 (A, B, C, D) on the same corpus, measuring the false-positive rate (real hallucinations missed) and the false-negative rate (valid extractions rejected)
- A decision document at `docs/research/scrape-via-agent-extraction-quality.md` with the chosen approach
- One or more follow-up implementation issues to ship the fixes
### Out of scope

- Implementing any fix. This issue is about picking the right approach.
- Adding more doc sources to test against. Stick with FastAPI's 10 URLs as the canonical bench until the design lands.
Surfaced by: #58 (FastAPI smoke test, 2026-04-11)

## Acceptance criteria

- `docs/research/scrape-via-agent-extraction-quality.md` exists, contains the benchmark numbers, the chosen approach, and the rationale

## Related

- The scrape-via-agent PR (feat: add scrape-via-agent source kind for LLM-backed doc extraction, #57). The strict-byte verifier and byte-blind truncation are #57 design decisions.
- harden scrape-via-agent error handling. Sister follow-up to #57. Independent fix, can land in either order.