Investigate ways to discover the URL set for a documentation source automatically, instead of hand-curating per-lib URL lists in libraries_sources.yaml. At the target Context7-scale corpus (~33k libs eventually, 2-3k near term), manual curation is unsustainable — the project cannot ship if every new lib needs a maintainer to hand-pick a list of files.
Parent: #15
Why now
The current libraries_sources.yaml flow (#51) asks the maintainer to write something like:
```yaml
- lib_id: /modelcontextprotocol/go-sdk
  kind: github-md
  urls:
    - https://raw.githubusercontent.com/.../README.md
    - https://raw.githubusercontent.com/.../docs/quick_start.md
    - ...
```
This works at 1 lib, gets old at 50, and becomes impossible at 3000. The maintainer has to:
- Find the lib's documentation source
- Read its directory layout
- Decide which files belong in the corpus (skip examples? skip drafts? include guides?)
- Hand-paste 10-30 URLs into the YAML
- Re-do the same work every time the upstream layout changes
At Context7's scale (~33k libs), the answer is some combination of automation, conventions, and crawling. This issue is the research that picks the right combination for Deadzone.
Areas to investigate
1. llms.txt — first-class support if adoption justifies it
llms.txt is a proposed convention: a file at https://example.com/llms.txt that points an LLM at the human-friendly documentation, plus an extended llms-full.txt that inlines the docs themselves. If a project ships one, scraping becomes trivial: fetch the file, follow the references, done.
Open questions:
- 2026 adoption: which target libs actually publish one? Spike a check across the top 100 libs we'd want to index.
- Format spec stability: is it stable enough to commit to?
- Fallback story: what % of libs we care about have one? If <30%, llms.txt is a "first try" but not the main mechanism.
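The adoption spike could start with something as small as the sketch below. The helper names are illustrative, not part of any existing tooling, and the HEAD-probe heuristic assumes sites that publish llms.txt serve it with a 200 at the conventional root path:

```python
import urllib.request

def llms_txt_url(base_url: str) -> str:
    """Build the conventional llms.txt location for a doc-site root."""
    return base_url.rstrip("/") + "/llms.txt"

def has_llms_txt(base_url: str, timeout: float = 10.0) -> bool:
    """HEAD the conventional location; treat any network error as 'not published'."""
    req = urllib.request.Request(llms_txt_url(base_url), method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# Usage over the top-100 candidate list (network call, so not run here):
#   adoption = {site: has_llms_txt(site) for site in candidate_doc_sites}
```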
2. GitHub tree API + glob patterns
For sources hosted on GitHub (the majority of OSS docs), the GitHub tree API gives us a recursive file listing for free:
```shell
gh api "repos/owner/repo/git/trees/HEAD?recursive=1" | jq '.tree[].path'
```
Combined with glob patterns in libraries_sources.yaml:
```yaml
- lib_id: /modelcontextprotocol/go-sdk
  kind: github-glob
  repo: modelcontextprotocol/go-sdk
  include:
    - "README.md"
    - "docs/**/*.md"
  exclude:
    - "docs/internal/**"
    - "docs/draft-*.md"
```
This is the most concrete win — it covers a huge fraction of OSS doc sources (anything with markdown in a public GitHub repo) and only requires per-lib config of include/exclude patterns, not a full URL list.
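The filtering step could be sketched like this, using Python's fnmatch as a stand-in for real gitignore-style globbing (note that fnmatch's `*` crosses `/`, so `*` and `**` behave alike here; a production version would want a proper glob library). The sample tree listing is hypothetical, not a live API call:

```python
from fnmatch import fnmatch

def select_paths(paths: list[str], include: list[str], exclude: list[str]) -> list[str]:
    """Keep paths matching at least one include pattern and no exclude pattern."""
    def matches(path: str, patterns: list[str]) -> bool:
        return any(fnmatch(path, pat) for pat in patterns)
    return [p for p in paths if matches(p, include) and not matches(p, exclude)]

# Paths as the tree API would return them (sample data, not a live call)
tree = ["README.md", "docs/guide.md", "docs/internal/notes.md",
        "docs/draft-v2.md", "main.go"]
urls = select_paths(tree,
                    include=["README.md", "docs/*.md"],
                    exclude=["docs/internal/*", "docs/draft-*.md"])
# urls == ["README.md", "docs/guide.md"]
```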
3. Sitemap.xml
For hosted doc sites (MkDocs, Docusaurus, Hugo, and VitePress all emit one), sitemap.xml lists every page on the site. Combined with scrape-via-agent (#27), this gives us automated discovery of HTML doc pages too.
```yaml
- lib_id: /facebook/react
  kind: sitemap-crawl
  sitemap: https://react.dev/sitemap.xml
  include:
    - "/reference/**"
    - "/learn/**"
  exclude:
    - "/blog/**"
```
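A minimal sketch of that pipeline, assuming the standard sitemaps.org namespace and again using fnmatch as an approximation of the glob patterns (the inline sample stands in for fetching the real sitemap):

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatch
from urllib.parse import urlparse

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def discover_from_sitemap(xml_text: str, include: list[str], exclude: list[str]) -> list[str]:
    """Pull <loc> entries out of a sitemap and filter them by URL-path globs."""
    urls = [loc.text for loc in ET.fromstring(xml_text).iter(SITEMAP_NS + "loc")]
    def keep(url: str) -> bool:
        path = urlparse(url).path
        return (any(fnmatch(path, pat) for pat in include)
                and not any(fnmatch(path, pat) for pat in exclude))
    return [u for u in urls if keep(u)]

# Tiny inline sitemap instead of fetching https://react.dev/sitemap.xml
sample = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://react.dev/reference/react/useState</loc></url>
  <url><loc>https://react.dev/blog/2024/whats-new</loc></url>
</urlset>"""
pages = discover_from_sitemap(sample, include=["/reference/*", "/learn/*"],
                              exclude=["/blog/*"])
# pages == ["https://react.dev/reference/react/useState"]
```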
4. Recursive crawl from a root URL
Follow relative .md or HTML links starting from a README. Risk: link sprawl, off-corpus drift, infinite loops. Probably the least desirable option — too much heuristic risk for the value.
5. Hybrid declarative descriptor
The end state is probably a per-lib YAML that picks one of the discovery strategies above:
```yaml
- lib_id: /hashicorp/terraform-provider-aws
  discovery:
    method: sitemap-crawl
    url: https://registry.terraform.io/providers/hashicorp/aws/latest/sitemap.xml
    include: ["**/docs/resources/**"]
  kind: scrape-via-agent  # how to extract content once discovered

- lib_id: /modelcontextprotocol/go-sdk
  discovery:
    method: github-glob
    repo: modelcontextprotocol/go-sdk
    include: ["README.md", "docs/**/*.md"]
  kind: github-md
```
The split between discovery (how to find URLs) and kind (how to extract content) is the key abstraction. They're orthogonal — you can discover via GitHub tree and extract via raw markdown, or discover via sitemap and extract via the agent.
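One way the orthogonality could look in code, as a hypothetical dispatcher where each discovery strategy is a stub (the names mirror the YAML sketch above and are not an existing API):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SourceSpec:
    lib_id: str
    kind: str    # extraction: github-md, scrape-via-agent, ...
    method: str  # discovery: github-glob, sitemap-crawl, llms-txt, ...
    opts: dict = field(default_factory=dict)

# Each strategy maps discovery options to a flat URL list; stubbed out here.
STRATEGIES: dict[str, Callable[[dict], list[str]]] = {
    "github-glob": lambda opts: [],    # tree API + globs (area 2)
    "sitemap-crawl": lambda opts: [],  # sitemap.xml + globs (area 3)
    "llms-txt": lambda opts: [],       # follow llms.txt references (area 1)
}

def discover_urls(spec: SourceSpec) -> list[str]:
    """Discovery yields URLs; extraction (spec.kind) runs separately on each."""
    strategy = STRATEGIES.get(spec.method)
    if strategy is None:
        raise ValueError(f"unknown discovery method: {spec.method}")
    return strategy(spec.opts)
```

Because the two axes never touch, adding a new extraction kind never requires touching discovery code, and vice versa.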
Output
A research note in docs/research/source-discovery.md that:
- Surveys real adoption of llms.txt and the other formats across the kinds of libs we want to index
- Picks a v1 mechanism (or hybrid) and justifies it
- Documents the YAML schema split between discovery and kind
- Files concrete follow-up issues for each discovery method we adopt
Scope
- Spike the chosen mechanism against modelcontextprotocol/go-sdk and confirm we can replace the current hardcoded URL list
- Write the research note at docs/research/source-discovery.md
Acceptance criteria
- A research note is committed under docs/research/
Out of scope
- Implementation. This is research → ADR → followup issues. No code in this issue.
- Cross-source crawling and link-following. That's the option-4 rabbit hole. Probably never viable for our use case.
- AI-driven source discovery (asking an LLM "find the docs for $lib"). Too unreliable for the indexing path.
Related
- #27 — Add scrape-via-agent source kind: LLM-backed extraction for any non-raw doc source. Discovery and extraction are orthogonal: this issue handles "what URLs to fetch"; #27 handles "how to extract content from each URL".
- #51 — the libraries_sources.yaml stop-gap, the schema this issue extends with discovery descriptors.