
Research: scrape-via-agent extraction quality on dense doc sites (truncation + verifier) #64

@laradji

Description


The scrape-via-agent path shipped in #57 has two design choices that combine badly on real-world doc sites with heavy chrome (mkdocs-material, GitBook, ReadTheDocs, etc.):

  1. Byte-blind input truncation at 48 KiB. Pages are chopped at byte 49152 with no awareness of section boundaries, code blocks, or where the actual content lives in the document.
  2. Strict-by-default verifyCodeBlocks. Every fenced code block in the LLM output must appear as a literal substring of the source content. Whitespace differences cause rejection.

Together, these produce a ~66% verification failure rate on FastAPI (mkdocs-material) during the #58 smoke test — and the failures are likely caused by the truncation cutting code blocks in half, which the LLM then reconstructs as a syntactically clean version that no longer byte-matches the truncated source.

Surfaced by: #58 (FastAPI smoke test, 2026-04-11)

Empirical evidence from #58

10 FastAPI tutorial URLs scraped via Qwen3.5-4B-MLX-4bit. Of the 6 URLs processed before an unrelated timeout (#62 tracks that):

| URL | Original size | Truncated to | Result |
| --- | --- | --- | --- |
| /first-steps/ | 154 KB | 49 KB | ❌ verification_failed |
| /path-params/ | 159 KB | 49 KB | ❌ verification_failed |
| /query-params/ | 131 KB | 49 KB | ❌ verification_failed |
| /body/ | 200 KB | 49 KB | ❌ verification_failed |
| /response-model/ | 200 KB | 49 KB | ✅ docs_extracted=2 (1 inserted, 1 lost to embedder bug) |
| /extra-models/ | 148 KB | 49 KB | ✅ docs_extracted=4 |

Pattern: 4 failures, 2 successes. The split is not random; it depends on which content happens to survive the truncation. A 33% success rate on mkdocs-material is not viable for a real corpus.

Note that every page is truncated — the smallest is 116 KB (security/), still 2.4× the 48 KiB cap. Mkdocs-material pages are 80% chrome (sidebar nav, breadcrumbs, search index, JS bundle preloads, footer, ToC widget) and 20% actual content. The byte-blind truncation has no way to know that the chrome is mostly at the top and the content is in the middle, so it cuts content out at random.

Problem 1 — Byte-blind truncation

internal/scraper/agent.go:

```go
if len(content) > agentInputMaxChars {
    slog.Warn("agent.input_truncated", ...)
    content = content[:agentInputMaxChars]
}
```

This is the simplest possible truncation: chop at byte N. It works for clean markdown sources (which deadzone tested with in #57) but fails on dense HTML where the actual content is buried in 80% chrome.
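Even before choosing a strategy from the table below, the byte-blind cut could be made boundary-aware with a few lines of stdlib Go. A minimal sketch (`truncateAtBoundary` is a hypothetical helper, not in the repo): back up from the cap to the last blank line, so no paragraph or fenced code block is split mid-way.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateAtBoundary caps content at max bytes, but backs up to the last
// blank line inside the window so that no paragraph or fenced code block
// is cut in half. If no boundary exists in the window, it falls back to
// the byte-blind cut. Hypothetical helper, not repo code.
func truncateAtBoundary(content string, max int) string {
	if len(content) <= max {
		return content
	}
	cut := strings.LastIndex(content[:max], "\n\n")
	if cut <= 0 {
		return content[:max] // no boundary found: fall back to byte-blind
	}
	return content[:cut]
}

func main() {
	src := "intro paragraph\n\n```python\ncode block\n```\n\ntail"
	fmt.Println(truncateAtBoundary(src, 20)) // prints "intro paragraph"
}
```

This does not solve the chrome problem (the dropped tail may still be the real content), but it eliminates the half-a-code-block failure mode on its own.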

Truncation strategies to evaluate

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| A. Byte-blind (current) | `content[:N]` | Trivial, deterministic | Cuts code blocks, drops content randomly |
| B. HTML pre-stripping | Use `golang.org/x/net/html` to strip `<nav>`, `<header>`, `<footer>`, `<aside>`, `<script>`, `<style>` before truncation | Removes 60-80% of mkdocs-material chrome upfront, content fits in 48 KiB | Requires an HTML parser dep, depends on the site using semantic HTML |
| C. Reader-mode extraction | Use a Mozilla Readability port (e.g. `go-shiori/go-readability`) to extract main content before sending to the LLM | Best content extraction, works on most doc sites | New dep, may over-strip on technical docs with multiple `<article>` blocks |
| D. Section-aware truncation | Parse markdown-after-LLM (or HTML-before-LLM) by section headers, drop low-priority sections (ToC, links list, footer) | Preserves complete content of high-priority sections | Complex, requires hand-tuning per source family |
| E. Sliding window with overlap | Make multiple LLM calls on overlapping chunks, merge | Handles arbitrary input size | Expensive (N× LLM calls), merge logic is non-trivial |
| F. Increase the cap to fit big pages | `agentInputMaxChars = 200 * 1024` | Trivial, no new code | Larger input → slower LLM, more wasted tokens, cap still arbitrary |
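For scale, strategy E's chunking step is the easy half and fits in a few lines (`chunksWithOverlap` is an illustrative name; the genuinely hard part, merging the N LLM outputs, is deliberately not shown):

```go
package main

import "fmt"

// chunksWithOverlap splits content into windows of at most max bytes,
// with overlap bytes shared between consecutive windows so a code block
// straddling a window edge appears whole in at least one chunk.
// Assumes max > overlap; illustrative sketch, not repo code.
func chunksWithOverlap(content string, max, overlap int) []string {
	var out []string
	step := max - overlap
	for start := 0; ; start += step {
		end := start + max
		if end >= len(content) {
			out = append(out, content[start:])
			return out
		}
		out = append(out, content[start:end])
	}
}

func main() {
	fmt.Println(chunksWithOverlap("abcdefghij", 4, 2)) // prints [abcd cdef efgh ghij]
}
```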

My instinct: B (HTML pre-stripping) is the right v1. Concrete, scoped, single-pass, and golang.org/x/net/html is a small, pure-Go, well-maintained dependency. It should cut mkdocs-material pages from 154 KB to roughly 30-50 KB without losing any content. C (Readability) is the right v2 if B isn't enough.
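As a feasibility check for B, the size savings can even be estimated with the stdlib alone. A real implementation would tokenize with golang.org/x/net/html; this naive scan (everything here is an assumption, not repo code) just deletes everything between an opening chrome tag and its closing tag, and does not handle nested same-name tags:

```go
package main

import (
	"fmt"
	"strings"
)

// stripChrome is a naive stdlib-only sketch of strategy B: delete
// everything between <tag ...> and </tag> for a fixed list of chrome
// elements. It ignores nested same-name tags, which is fine for a
// size-savings estimate but not for production use.
func stripChrome(html string) string {
	for _, tag := range []string{"script", "style", "nav", "header", "footer", "aside"} {
		openTag, closeTag := "<"+tag, "</"+tag+">"
		for {
			i := strings.Index(html, openTag)
			if i < 0 {
				break
			}
			j := strings.Index(html[i:], closeTag)
			if j < 0 {
				html = html[:i] // unclosed chrome tag: drop the rest
				break
			}
			html = html[:i] + html[i+j+len(closeTag):]
		}
	}
	return html
}

func main() {
	page := `<nav class="md-nav">sidebar</nav><article>real content</article><footer>links</footer>`
	fmt.Println(stripChrome(page)) // prints <article>real content</article>
}
```

Running this over the 10 FastAPI pages and comparing post-strip sizes against the 48 KiB cap would answer "does B alone make truncation unnecessary?" before any dependency is added.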

Problem 2 — Strict-byte verifyCodeBlocks

internal/scraper/agent.go:

```go
func verifyCodeBlocks(md, source string) bool {
    for _, block := range extractFencedBlocks(md) {
        if !strings.Contains(source, block) {
            return false
        }
    }
    return true
}
```

strings.Contains is a literal byte substring match. This fails when:

  • The LLM normalizes whitespace (tab → 4 spaces, trailing whitespace removal, line ending normalization)
  • The HTML source has <span>-wrapped syntax-highlighted code (<span class="kw">def</span>) that the LLM unwraps
  • The source has zero-width characters or weird Unicode that the LLM cleans up
  • The LLM reformats blank lines inside code blocks
  • The LLM converts smart quotes back to straight quotes

The current behavior is "strict by default" per the locked-in #57 design — but the test in agent_test.go (TestVerifyCodeBlocks_StrictWhitespace) intentionally validates that "4-space-indented source vs tab-indented md → reject", which is exactly the failure mode that bites in production.

Verifier loosening strategies to evaluate

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| A. Strict (current) | `strings.Contains(source, block)` | Strongest hallucination protection | Rejects valid extractions where the LLM only normalized formatting |
| B. Whitespace-normalized | Compare both md and source after `strings.Join(strings.Fields(s), " ")` | Tolerates any whitespace difference, still catches real hallucinations | Allows whitespace-only "hallucinations" (probably fine: code that differs only in whitespace is the same code) |
| C. Token-overlap threshold | For each md block, check that >X% of its tokens appear in the source; X=0.9 catches LLMs that add a word or two but rejects fully fabricated blocks | Tolerates minor edits, still strong | Tunable threshold is a parameter we have to defend |
| D. Embedding similarity | Embed the md block and its source neighborhood, accept if cosine similarity exceeds a threshold | Most robust, language-aware | Expensive (extra embed calls per verification), requires more code |
| E. Opt-out flag | Per-source field in libraries_sources.yaml, e.g. `verify: false`, to skip the check entirely | Trivial, lets users escape on a per-source basis | Easy to abuse, no protection on opted-out sources |
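Strategy C needs no new dependencies either. A rough sketch (`tokenOverlap` is an illustrative name, not repo code): treat whitespace-separated tokens as the unit and measure what fraction of the block's tokens appear anywhere in the source.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenOverlap returns the fraction of whitespace-separated tokens in an
// extracted block that also appear somewhere in the source. Illustrative
// sketch of strategy C; an empty block trivially passes.
func tokenOverlap(block, source string) float64 {
	tokens := strings.Fields(block)
	if len(tokens) == 0 {
		return 1.0
	}
	srcSet := make(map[string]bool)
	for _, t := range strings.Fields(source) {
		srcSet[t] = true
	}
	hits := 0
	for _, t := range tokens {
		if srcSet[t] {
			hits++
		}
	}
	return float64(hits) / float64(len(tokens))
}

func main() {
	src := `def read_root(): return {"Hello": "World"}`
	fake := "def totally_new(): pass"
	fmt.Printf("%.2f %.2f\n", tokenOverlap(src, src), tokenOverlap(fake, src)) // prints 1.00 0.25
}
```

Note the token-set comparison deliberately ignores token order, which is part of why the threshold X has to be defended: a reordered-but-same-tokens block would pass.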

My instinct: B (whitespace-normalized) is the right v1. It's a 10-line change in verifyCodeBlocks and addresses the most common failure mode (LLM normalizes whitespace) without giving up too much hallucination protection. Code that differs only in whitespace is functionally identical. C (token overlap) is the right v2 if B over-permits.
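A sketch of what the strategy-B change could look like (`verifyCodeBlocksLoose` and its `[]string` signature are illustrative; the real function in internal/scraper/agent.go extracts fenced blocks from the markdown first via `extractFencedBlocks`):

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses every run of whitespace to a single space, so tab
// vs. space indentation, trailing whitespace, and line-ending differences
// all compare equal.
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// verifyCodeBlocksLoose is an illustrative sketch of strategy B: each
// extracted block must appear in the source after both sides are
// whitespace-normalized. Fabricated blocks still fail.
func verifyCodeBlocksLoose(blocks []string, source string) bool {
	normSource := normalize(source)
	for _, block := range blocks {
		if !strings.Contains(normSource, normalize(block)) {
			return false
		}
	}
	return true
}

func main() {
	source := "def hello():\n    return 1"         // 4-space-indented source
	llmBlock := "def hello():\n\treturn 1"         // tab-indented LLM output
	fmt.Println(strings.Contains(source, llmBlock), // strict: fails
		verifyCodeBlocksLoose([]string{llmBlock}, source)) // loose: passes
	// prints "false true"
}
```

This is exactly the 4-space-vs-tab case that TestVerifyCodeBlocks_StrictWhitespace currently asserts must be rejected.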

The current TestVerifyCodeBlocks_StrictWhitespace test would need to be inverted (or split into two: one that asserts B, one that asserts strict mode if we keep it as opt-in).

Composite question: are these two problems linked?

Almost certainly yes. The hypothesis is:

  1. The 154 KB HTML page contains a code block at byte 60000.
  2. Byte-blind truncation cuts the page at byte 49152 — the code block is split.
  3. The LLM sees half a code block. It produces clean Markdown with what appears to be a complete code block — either by reconstructing it from context, or by emitting a different (smaller) code block from earlier in the page.
  4. verifyCodeBlocks looks for the LLM's clean block in the truncated source. It's not there as a byte substring — fail.

If this hypothesis is right, fixing problem 1 alone (HTML pre-stripping) would dramatically reduce the verification failure rate, because content + code blocks would fit comfortably in 48 KiB without truncation. Problem 2 (loosening verifier) would still help on edge cases but wouldn't be the dominant factor.

This needs to be tested empirically: run #58 again after fixing problem 1 only, measure the verification success rate, then decide if problem 2's fix is also needed.

Scope (research issue, not implementation yet)

This issue produces:

  • A short benchmark of the four options for problem 1 (A, B, C, F) on the 10 FastAPI tutorial URLs from the #58 smoke test, measuring verification success rate
  • A short benchmark of the four options for problem 2 (A, B, C, D) on the same corpus, measuring false-positive rate (real hallucinations missed) and false-negative rate (valid extractions rejected)
  • A decision document at docs/research/scrape-via-agent-extraction-quality.md with the chosen approach
  • One or more follow-up implementation issues to ship the fixes

Out of scope

  • Implementing any fix. This issue is about picking the right approach.
  • Adding more doc sources to test against. Stick with FastAPI's 10 URLs as the canonical bench until the design lands.
  • Switching to an entirely new extraction approach (e.g. JSON API + structured parser). That's the scope of #1 (Research: JSON-based source kind for structured-API doc sites).
  • Per-source verifier configuration. Premature.

Acceptance criteria

  • docs/research/scrape-via-agent-extraction-quality.md exists, contains the benchmark numbers, the chosen approach, and the rationale
  • At least one follow-up implementation issue is filed (or "decision: ship as-is, accept the failure rate" with reasoning)
  • After the implementation lands, re-running smoke test #58 shows a >80% verification success rate on the 10 FastAPI URLs

Labels: P3 Low (nice-to-have, when time allows), research (Research / spike)