Parent: none
Depends on: none (research, no implementation gate)
Problem
Go is the language that deadzone itself is written in, and the natural candidate for the first "fully self-referential" lib in the registry. But none of the three existing source kinds fits the Go stdlib cleanly:
- `github-md` (raw markdown over HTTP): golang/go ships only 2 `.md` files in `/doc/` (confirmed 2026-04-17: `README.md` + `godebug.md`). The Go language spec is `/doc/go_spec.html`, not markdown. Thin coverage.
- `github-rst` (reStructuredText): Go doesn't ship RST. Miss.
- `scrape-via-agent` (LLM extraction on HTML): works against go.dev/doc/, but burns LLM tokens on every re-scrape, and pkg.go.dev pages are highly redundant (sidebar, nav, repeated code blocks → token bloat).
The Go stdlib's real documentation source is godoc: comments embedded in `.go` source files under `src/**`, rendered by pkg.go.dev. Neither `github-md` nor `scrape-via-agent` maps onto that cleanly.
Goal
Pick one ingestion approach for Go stdlib + popular third-party Go libraries (context7-scale target: hundreds of Go modules eventually), document the decision in docs/research/, and file follow-up feature issues for implementation.
Areas to investigate
1. New kind: `godoc` that walks source
Parses `.go` files directly, extracts package/type/func comments via the `go/doc` stdlib package, and produces one chunk per exported identifier (plus one for the package overview). Pros: precise, semantic chunks; no dependence on any external service; stable across Go versions. Cons: a new scraper code path, probably larger than #27's scrape-via-agent was; vendoring `go/doc` is trivial, but orchestrating "which packages" is not.
2. `scrape-via-agent` against pkg.go.dev HTML
Pros: no new code; reuses #27's infrastructure; the agent can summarize or re-chunk. Cons: each re-scrape costs LLM tokens × N pages; pkg.go.dev has nav/sidebar noise the agent must strip; the page-per-package model inflates the URL count for the stdlib (150+ packages).
3. Fetch raw godoc JSON from `pkg.go.dev/<pkg>?format=json`, if such a thing exists
Short spike: does pkg.go.dev expose a structured data endpoint? If yes: zero LLM, zero HTML scraping. If no: dead option.
4. Hybrid: `github-md` for the handful of Go-team narrative docs + a `godoc` kind for packages
Two sources per lib_id. Keeps narrative content (language-spec.md, effective-go.md — if they ever land in .md form) separable from API docs.
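The two-source shape of option 4 might look like the sketch below. The field names and registry schema are invented for illustration only; this issue does not fix a format:

```yaml
# Hypothetical registry entry -- one lib_id, two sources.
# All field names are assumptions, not an existing deadzone schema.
lib_id: go-stdlib
sources:
  - kind: github-md        # narrative docs, existing source kind
    repo: golang/go
    paths: ["doc/*.md"]
  - kind: godoc            # API docs, proposed new kind (option 1)
    module: std
    packages: ["..."]      # which packages to walk is the open question
```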
Deliverables
- `docs/research/go-stdlib-ingestion.md` — decision log following the pattern of `batch-scrape-actions.md` and `embedder-choice.md`: options-considered table, sanity-check-at-scale section, picked option with "holds at 200 Go modules" and "holds at 2000 Go modules" columns
- Update to `docs/research/ingestion-architecture.md` decision 3 (source kinds) noting that Go stdlib is tracked separately and linking to the new research doc
Acceptance criteria
Out of scope (fenced)
- No implementation — this is research-only. The follow-up issue does implementation
- No single-lib scope — if the picked option requires new scraper infrastructure, it targets Go stdlib AND third-party Go libs with the same shape, not just golang/go
- No pkg.go.dev scraping heuristics in this issue — that's the follow-up
- No comparison with context7's Go coverage — if relevant, reference it in prose but don't benchmark
- No embedder changes — the embedder is orthogonal; Go docs are regular text
Related
- `scrape-via-agent` source kind: LLM-backed extraction for any non-raw doc source #27 — defines option 2's infrastructure
- `docs/research/ingestion-architecture.md` decision 3 — the three existing kinds