
Research: automated source discovery (llms.txt, GitHub tree, sitemap) #46

@laradji

Description

Investigate ways to discover the URL set for a documentation source automatically, instead of hand-curating per-lib URL lists in libraries_sources.yaml. At the target Context7-scale corpus (~33k libs eventually, 2-3k near term), manual curation is unsustainable — the project cannot ship if every new lib needs a maintainer to hand-pick a list of files.

Parent: #15

Why now

The current libraries_sources.yaml flow (#51) asks the maintainer to write something like:

- lib_id: /modelcontextprotocol/go-sdk
  kind: github-md
  urls:
    - https://raw.githubusercontent.com/.../README.md
    - https://raw.githubusercontent.com/.../docs/quick_start.md
    - ...

This works at 1 lib, gets old at 50, and becomes impossible at 3000. The maintainer has to:

  1. Find the lib's documentation source
  2. Read its directory layout
  3. Decide which files belong in the corpus (skip examples? skip drafts? include guides?)
  4. Hand-paste 10-30 URLs into the YAML
  5. Re-do the same work every time the upstream layout changes

At Context7's scale (~33k libs), the answer is some combination of automation, conventions, and crawling. This issue is the research that picks the right combination for Deadzone.

Areas to investigate

1. llms.txt — first-class support if adoption justifies it

llms.txt is a proposed convention: a file at https://example.com/llms.txt that points an LLM at the human-friendly documentation, plus an extended variant, llms-full.txt, that inlines the docs themselves. If a project ships one, scraping becomes trivial — fetch the file, follow the references, done.
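To make "follow the references" concrete: the proposed format is markdown (an H1 title, an optional blockquote summary, then H2 sections whose bullet lists link to the actual doc pages), so consuming one is a few lines. A minimal sketch — the sample content and function name are invented for illustration:

```python
import re

# Matches one llms.txt bullet: "- [name](url)" with an optional ": notes" tail.
LINK = re.compile(r"^-\s*\[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\)")

def parse_llms_txt(text: str) -> list[tuple[str, str]]:
    """Return (name, url) pairs for every linked doc page."""
    links = []
    for line in text.splitlines():
        m = LINK.match(line.strip())
        if m:
            links.append((m.group("name"), m.group("url")))
    return links

sample = """# Example Lib
> Example docs.

## Docs
- [Quick start](https://example.com/docs/quickstart.md): getting started
- [API](https://example.com/docs/api.md)
"""
```

If this holds up in the adoption spike, discovery for an llms.txt-publishing lib collapses to "fetch one file, parse bullets".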

Open questions:

  • 2026 adoption: which target libs actually publish one? Spike a check across the top 100 libs we'd want to index.
  • Format spec stability: is it stable enough to commit to?
  • Fallback story: what % of libs we care about have one? If <30%, llms.txt is a "first try" but not the main mechanism.
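The adoption spike itself is mostly an HTTP probe loop (HEAD each candidate URL, count 200s). The pure part worth pinning down is where to look: the site root by convention, plus the docs subpath as a fallback. A sketch — the helper name and the subpath fallback are assumptions, not part of the proposal:

```python
from urllib.parse import urlsplit, urlunsplit

def llms_txt_candidates(docs_url: str) -> list[str]:
    """Candidate llms.txt locations for a docs site: the site root
    first (the convention), then the docs subpath as a fallback."""
    parts = urlsplit(docs_url)
    candidates = [urlunsplit((parts.scheme, parts.netloc, "/llms.txt", "", ""))]
    subpath = parts.path.rstrip("/")
    if subpath:
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc, subpath + "/llms.txt", "", ""))
        )
    return candidates
```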

2. GitHub tree API + glob patterns

For sources hosted on GitHub (the majority of OSS docs), the GitHub tree API gives us a recursive file listing for free:

gh api "repos/owner/repo/git/trees/HEAD?recursive=1" | jq -r '.tree[].path'

Combined with glob patterns in libraries_sources.yaml:

- lib_id: /modelcontextprotocol/go-sdk
  kind: github-glob
  repo: modelcontextprotocol/go-sdk
  include:
    - "README.md"
    - "docs/**/*.md"
  exclude:
    - "docs/internal/**"
    - "docs/draft-*.md"

This is the most concrete win — it covers a huge fraction of OSS doc sources (anything with markdown in a public GitHub repo) and only requires per-lib config of include/exclude patterns, not a full URL list.
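The filter on top of the tree listing is small. A sketch assuming gitignore-style semantics (`**` spans directories, `*` stops at a slash) — the function names are illustrative, and a real implementation would want the full pattern spec rather than this partial translation:

```python
import re

def glob_to_re(pat: str) -> re.Pattern:
    """Translate a gitignore-style glob to a regex. Partial sketch:
    '**/' spans zero or more directories, a bare '**' spans anything,
    '*' stops at a path separator."""
    out, i = [], 0
    while i < len(pat):
        if pat.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")
            i += 3
        elif pat.startswith("**", i):
            out.append(r".*")
            i += 2
        elif pat[i] == "*":
            out.append(r"[^/]*")
            i += 1
        else:
            out.append(re.escape(pat[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")

def select_paths(paths, include, exclude):
    """Keep paths matching any include pattern and no exclude pattern."""
    inc = [glob_to_re(p) for p in include]
    exc = [glob_to_re(p) for p in exclude]
    return [p for p in paths
            if any(r.match(p) for r in inc)
            and not any(r.match(p) for r in exc)]
```

Fed the tree API output for the go-sdk example above, `select_paths(tree, ["README.md", "docs/**/*.md"], ["docs/internal/**", "docs/draft-*.md"])` would yield the URL set that is currently hand-pasted.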

3. Sitemap.xml

For hosted doc sites (mkdocs, docusaurus, hugo, and vitepress all emit one), sitemap.xml lists every page on the site. Combined with scrape-via-agent (#27), this gives us automated discovery of HTML doc pages too.

- lib_id: /facebook/react
  kind: sitemap-crawl
  sitemap: https://react.dev/sitemap.xml
  include:
    - "/reference/**"
    - "/learn/**"
  exclude:
    - "/blog/**"
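The sitemap half of this is stdlib-only. A sketch of extracting `<loc>` entries and filtering by path — prefix matching stands in here for the `/reference/**` globs above, and the sample payload is invented:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> from a sitemap.xml payload."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def filter_by_path(urls, include_prefixes, exclude_prefixes):
    """Keep URLs whose path starts with an include prefix and no exclude prefix."""
    keep = []
    for url in urls:
        path = urlsplit(url).path
        if any(path.startswith(p) for p in include_prefixes) and not any(
            path.startswith(p) for p in exclude_prefixes
        ):
            keep.append(url)
    return keep

sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://react.dev/reference/react/useState</loc></url>
  <url><loc>https://react.dev/learn/thinking-in-react</loc></url>
  <url><loc>https://react.dev/blog/2024/04/25/react-19</loc></url>
</urlset>"""
```

A real crawl would also need to handle sitemap index files (sitemaps that point at other sitemaps), which the sketch ignores.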

4. Recursive crawl from a root URL

Follow relative .md or HTML links starting from a README. Risk: link sprawl, off-corpus drift, infinite loops. Probably the least desirable option — too much heuristic risk for the value.

5. Hybrid declarative descriptor

The end state is probably a per-lib YAML that picks one of the discovery strategies above:

- lib_id: /hashicorp/terraform-provider-aws
  discovery:
    method: sitemap-crawl
    url: https://registry.terraform.io/providers/hashicorp/aws/latest/sitemap.xml
    include: ["**/docs/resources/**"]
  kind: scrape-via-agent  # how to extract content once discovered

- lib_id: /modelcontextprotocol/go-sdk
  discovery:
    method: github-glob
    repo: modelcontextprotocol/go-sdk
    include: ["README.md", "docs/**/*.md"]
  kind: github-md

The split between discovery (how to find URLs) and kind (how to extract content) is the key abstraction. They're orthogonal — you can discover via GitHub tree and extract via raw markdown, or discover via sitemap and extract via the agent.
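One way to see the orthogonality is a strategy registry keyed by discovery method: discovery produces URLs, and kind is consumed elsewhere by the extraction side. Everything below (names, the stubbed github-glob body) is illustrative, not a committed design:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SourceSpec:
    """One libraries_sources.yaml entry under the proposed split."""
    lib_id: str
    discovery_method: str   # e.g. github-glob | sitemap-crawl
    discovery_args: dict
    kind: str               # e.g. github-md | scrape-via-agent

# Each discovery strategy maps its args to a list of URLs.
DISCOVERERS: dict[str, Callable[[dict], list[str]]] = {}

def discoverer(name: str):
    def register(fn):
        DISCOVERERS[name] = fn
        return fn
    return register

@discoverer("github-glob")
def discover_github_glob(args: dict) -> list[str]:
    # Would call the tree API and apply include/exclude globs;
    # stubbed with a pre-resolved "paths" arg to stay self-contained.
    return [f"https://raw.githubusercontent.com/{args['repo']}/HEAD/{p}"
            for p in args.get("paths", [])]

def discover(spec: SourceSpec) -> list[str]:
    return DISCOVERERS[spec.discovery_method](spec.discovery_args)
```

Adding a sitemap-crawl or llms-txt method is then one more registered function; the extraction side never needs to know which one produced the URLs.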

Output

A research note in docs/research/source-discovery.md that:

  1. Surveys real adoption of llms.txt and the other formats across the kinds of libs we want to index
  2. Picks a v1 mechanism (or hybrid) and justifies it
  3. Documents the YAML schema split between discovery and kind
  4. Files concrete follow-up issues for each discovery method we adopt

Scope

  • Spike: scan ~100 representative libs and check llms.txt adoption rate
  • Spike: prototype the GitHub tree API + glob filter on modelcontextprotocol/go-sdk and confirm we can replace the current hardcoded URL list
  • Spike: prototype sitemap parsing on a hosted doc site (e.g. react.dev or vitejs.dev)
  • Document the decision in docs/research/source-discovery.md
  • File implementation issues for the adopted methods

Acceptance criteria

  • Research note exists in docs/research/
  • At least one concrete discovery method has been spiked end-to-end against a real source
  • Implementation issue(s) filed for the adopted method(s), referenced from this issue
  • The proposed YAML schema is documented and reviewable

Out of scope

  • Implementation. This is research → ADR → followup issues. No code in this issue.
  • Cross-source crawling and link-following. That's the option-4 rabbit hole. Probably never viable for our use case.
  • AI-driven source discovery (asking an LLM "find the docs for $lib"). Too unreliable for the indexing path.

Metadata

Labels

  • P2 (Normal — clear value, not urgent)
  • research (Research / spike)
