Research: ingest Go stdlib + pkg.go.dev docs into the deadzone registry #133

@laradji

Parent: none
Depends on: none (research, no implementation gate)

Problem

Go is the language deadzone itself is written in, and the natural candidate for the first "fully self-referential" lib in the registry. But none of the three existing source kinds fit the Go stdlib cleanly:

  • github-md (raw markdown HTTP): golang/go ships only 2 .md files in /doc/ (confirmed 2026-04-17: README.md + godebug.md). The Go language spec is /doc/go_spec.html, not markdown. Thin coverage
  • github-rst (ReST): Go doesn't ship RST. Miss
  • scrape-via-agent (LLM extraction on HTML): works against go.dev/doc/ but burns LLM tokens on every re-scrape, and the structure of pkg.go.dev pages is highly redundant (sidebar, nav, code blocks → token bloat)

The Go stdlib's real documentation source is godoc: comments embedded in .go source files in src/**, rendered by pkg.go.dev. Neither github-md nor scrape-via-agent maps onto that cleanly.

Goal

Pick one ingestion approach for Go stdlib + popular third-party Go libraries (context7-scale target: hundreds of Go modules eventually), document the decision in docs/research/, and file follow-up feature issues for implementation.

Areas to investigate

1. New kind: godoc that walks source

Parses .go files directly, extracts package/type/func comments via the stdlib go/doc package, and produces one chunk per exported identifier (plus one for the package overview). Pros: precise, semantic chunks; no dependency on any external service; stable across Go versions. Cons: a new scraper code path, probably larger than #27 (scrape-via-agent) was; importing go/doc is trivial (it ships with the stdlib, so nothing needs vendoring), but the orchestration of "which packages" is not.
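To make the chunking idea concrete, here is a minimal sketch of option 1. It assumes a hypothetical `Chunk` type and an `ExtractChunks` helper (neither exists in deadzone today) and parses a single in-memory file rather than walking `src/**`. The "which packages" orchestration is deliberately left out.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/doc"
	"go/parser"
	"go/token"
	"strings"
)

// Chunk is a hypothetical unit of documentation: one per exported
// identifier, plus one for the package overview.
type Chunk struct {
	ID   string // e.g. "example.com/demo.Greeter.Greet"
	Text string // the doc comment, trimmed
}

// demoSrc stands in for a real .go file read from disk.
const demoSrc = `// Package demo greets people.
package demo

// Greeter holds a greeting prefix.
type Greeter struct{ Prefix string }

// NewGreeter returns a Greeter with the given prefix.
func NewGreeter(prefix string) *Greeter { return &Greeter{Prefix: prefix} }

// Greet returns the greeting for name.
func (g *Greeter) Greet(name string) string { return g.Prefix + name }

// Version reports the demo version.
func Version() string { return "v0" }

func hidden() {}
`

// ExtractChunks parses one Go source file held in memory and returns the
// package overview followed by one chunk per exported identifier.
func ExtractChunks(importPath, src string) ([]Chunk, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	// With no Mode flags, go/doc keeps only exported declarations.
	pkg, err := doc.NewFromFiles(fset, []*ast.File{file}, importPath)
	if err != nil {
		return nil, err
	}

	var chunks []Chunk
	add := func(suffix, text string) {
		chunks = append(chunks, Chunk{ID: importPath + suffix, Text: strings.TrimSpace(text)})
	}
	addValues := func(values []*doc.Value) {
		for _, v := range values {
			for _, name := range v.Names {
				add("."+name, v.Doc)
			}
		}
	}

	add("", pkg.Doc) // package overview
	addValues(pkg.Consts)
	addValues(pkg.Vars)
	for _, t := range pkg.Types {
		add("."+t.Name, t.Doc)
		addValues(t.Consts)
		addValues(t.Vars)
		for _, f := range t.Funcs { // constructors that go/doc groups under the type
			add("."+f.Name, f.Doc)
		}
		for _, m := range t.Methods {
			add("."+t.Name+"."+m.Name, m.Doc)
		}
	}
	for _, f := range pkg.Funcs {
		add("."+f.Name, f.Doc)
	}
	return chunks, nil
}

func main() {
	chunks, err := ExtractChunks("example.com/demo", demoSrc)
	if err != nil {
		fmt.Println("extract failed:", err)
		return
	}
	for _, c := range chunks {
		fmt.Printf("%s: %s\n", c.ID, c.Text)
	}
}
```

A real godoc kind would also need build-constraint handling, multi-file packages, and a stable chunk ID scheme; those belong in the follow-up issue rather than this sketch.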

2. scrape-via-agent against pkg.go.dev HTML

Pros: no new code, reuses #27's infrastructure, agent can summarize or re-chunk. Cons: each re-scrape = LLM tokens × N pages; pkg.go.dev has nav/sidebar noise that the agent needs to strip; the page-per-package model inflates URL count for the stdlib (150+ packages)

3. Fetch raw godoc JSON from pkg.go.dev/<pkg>?format=json if such a thing exists

Short spike: does pkg.go.dev expose a structured data endpoint? If yes, zero-LLM, zero-HTML-scraping. If no, dead option.
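The spike can be as small as one HTTP request plus a check that the answer is JSON rather than an HTML page. The sketch below makes no claim that the endpoint exists: the default URL is just the candidate named in this option, and `looksLikeJSON` and `probe` are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"mime"
	"net/http"
	"os"
	"strings"
	"time"
)

// looksLikeJSON reports whether a response is structured JSON: the media
// type must be application/json (or a +json variant) and the body must parse.
func looksLikeJSON(contentType string, body []byte) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	if mediaType != "application/json" && !strings.HasSuffix(mediaType, "+json") {
		return false
	}
	return json.Valid(body)
}

// probe fetches url and reports whether it answered 200 with a JSON body.
// Bodies larger than 1 MiB are truncated and will therefore fail validation.
func probe(client *http.Client, url string) (bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return false, err
	}
	ok := resp.StatusCode == http.StatusOK &&
		looksLikeJSON(resp.Header.Get("Content-Type"), body)
	return ok, nil
}

func main() {
	// Candidate URL taken from the option text above; whether it returns
	// JSON is exactly what the spike has to find out.
	url := "https://pkg.go.dev/net/http?format=json"
	if len(os.Args) > 1 {
		url = os.Args[1]
	}
	client := &http.Client{Timeout: 10 * time.Second}
	ok, err := probe(client, url)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Printf("%s returns structured JSON: %v\n", url, ok)
}
```

A "false" result for one URL shape does not rule out other endpoints, so the research doc should record which URLs were actually tried.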

4. Hybrid: github-md for the handful of Go-team narrative docs + godoc kind for packages

Two sources per lib_id. Keeps narrative content (language-spec.md, effective-go.md — if they ever land in .md form) separable from API docs.
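As a rough illustration of "two sources per lib_id", the sketch below models one lib carrying both kinds and groups its sources by kind so that narrative docs and API docs can be re-ingested separately. The `Source` and `Lib` types, the `golang/go` lib ID, and the listed refs are all assumptions for illustration, not deadzone's actual registry schema.

```go
package main

import "fmt"

// Source is a hypothetical registry entry: one ingestion kind plus a location.
type Source struct {
	Kind string // "github-md" today, "godoc" if option 1 or 4 is picked
	Ref  string // file path for github-md, import path for godoc
}

// Lib groups every source that feeds a single lib_id.
type Lib struct {
	ID      string
	Sources []Source
}

// exampleLib returns an illustrative hybrid entry for the Go stdlib.
func exampleLib() Lib {
	return Lib{
		ID: "golang/go", // assumed lib_id
		Sources: []Source{
			{Kind: "github-md", Ref: "doc/README.md"},
			{Kind: "github-md", Ref: "doc/godebug.md"},
			{Kind: "godoc", Ref: "net/http"},
			{Kind: "godoc", Ref: "encoding/json"},
		},
	}
}

// refsByKind splits a lib's sources by kind, preserving declaration order
// within each kind.
func refsByKind(lib Lib) map[string][]string {
	out := make(map[string][]string)
	for _, s := range lib.Sources {
		out[s.Kind] = append(out[s.Kind], s.Ref)
	}
	return out
}

func main() {
	lib := exampleLib()
	grouped := refsByKind(lib)
	// Iterate over a fixed kind list so the output order is deterministic.
	for _, kind := range []string{"github-md", "godoc"} {
		fmt.Println(lib.ID, kind, grouped[kind])
	}
}
```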

Deliverables

  • docs/research/go-stdlib-ingestion.md — decision log following the pattern of batch-scrape-actions.md and embedder-choice.md: options-considered table, sanity-check-at-scale section, picked option with "holds at 200 Go modules" and "holds at 2000 Go modules" columns
  • Follow-up feature issue filed against the picked option (new kind? new scraper flag? config bump?)
  • If the picked option is option 3 (pkg.go.dev JSON spike) and the spike returns "yes", file a trivial feature issue for the config addition; if "no", file whichever option 1/2/4 was second-best
  • One-line addition to docs/research/ingestion-architecture.md decision 3 (source kinds) noting that Go stdlib is tracked separately and linking to the new research doc

Acceptance criteria

  • All 4 areas above evaluated in the research doc (table form, not prose)
  • Explicit "picked option" with reasoning, dated and locked
  • Sanity checks: "holds at 1 Go module", "holds at 150 stdlib packages", "holds at 2000 third-party Go modules (context7-scale)" — each column ✅ / ⚠️ / ❌
  • At least one follow-up issue filed

Out of scope (fenced)

  • No implementation — this is research-only. The follow-up issue does implementation
  • No single-lib scope — if the picked option requires new scraper infrastructure, it targets Go stdlib AND third-party Go libs with the same shape, not just golang/go
  • No pkg.go.dev scraping heuristics in this issue — that's the follow-up
  • No comparison with context7's Go coverage — if relevant, reference it in prose but don't benchmark
  • No embedder changes — the embedder is orthogonal; Go docs are regular text

Labels: P2 (Normal: clear value, not urgent), research (Research / spike)
