
Research: automated source discovery (llms.txt, GitHub tree, sitemap) #46

@laradji

Description

Investigate ways to discover the URL set for a documentation source automatically, instead of hand-curating per-lib URL lists in libraries_sources.yaml. At the target Context7-scale corpus (~33k libs eventually, 2-3k near term), manual curation is unsustainable — the project cannot ship if every new lib needs a maintainer to hand-pick a list of files.

Parent: #15

Why now

The current libraries_sources.yaml flow (#51) asks the maintainer to write something like:

- lib_id: /modelcontextprotocol/go-sdk
  kind: github-md
  urls:
    - https://raw.githubusercontent.com/.../README.md
    - https://raw.githubusercontent.com/.../docs/quick_start.md
    - ...

This works at 1 lib, gets old at 50, and becomes impossible at 3000. The maintainer has to:

  1. Find the lib's documentation source
  2. Read its directory layout
  3. Decide which files belong in the corpus (skip examples? skip drafts? include guides?)
  4. Hand-paste 10-30 URLs into the YAML
  5. Re-do the same work every time the upstream layout changes

At Context7's scale (~33k libs), the answer is some combination of automation, conventions, and crawling. This issue is the research that picks the right combination for Deadzone.

Areas to investigate

1. llms.txt — first-class support if adoption justifies it

llms.txt is a proposed convention: a file at https://example.com/llms.txt that points an LLM at the human-friendly documentation, plus an extended variant, llms-full.txt, that inlines the docs themselves. If a project ships one, scraping becomes trivial — fetch the file, follow the references, done.
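To make "follow the references" concrete: the proposed format is markdown (an H1 title, an optional blockquote summary, then H2 sections whose bullet lists link to the actual doc pages), so consuming one is a few lines. A minimal sketch — the sample content and function name are invented for illustration:

```python
import re

# Matches one llms.txt bullet: "- [name](url)" with an optional ": notes" tail.
LINK = re.compile(r"^-\s*\[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\)")

def parse_llms_txt(text: str) -> list[tuple[str, str]]:
    """Return (name, url) pairs for every linked doc page."""
    links = []
    for line in text.splitlines():
        m = LINK.match(line.strip())
        if m:
            links.append((m.group("name"), m.group("url")))
    return links

sample = """# Example Lib
> Example docs.

## Docs
- [Quick start](https://example.com/docs/quickstart.md): getting started
- [API](https://example.com/docs/api.md)
"""
```

If this holds up in the adoption spike, discovery for an llms.txt-publishing lib collapses to "fetch one file, parse bullets".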

Open questions:

  • 2026 adoption: which target libs actually publish one? Spike a check across the top 100 libs we'd want to index.
  • Format spec stability: is it stable enough to commit to?
  • Fallback story: what % of libs we care about have one? If <30%, llms.txt is a "first try" but not the main mechanism.
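The adoption spike itself is mostly an HTTP probe loop (HEAD each candidate URL, count 200s). The pure part worth pinning down is where to look: the site root by convention, plus the docs subpath as a fallback. A sketch — the helper name and the subpath fallback are assumptions, not part of the proposal:

```python
from urllib.parse import urlsplit, urlunsplit

def llms_txt_candidates(docs_url: str) -> list[str]:
    """Candidate llms.txt locations for a docs site: the site root
    first (the convention), then the docs subpath as a fallback."""
    parts = urlsplit(docs_url)
    candidates = [urlunsplit((parts.scheme, parts.netloc, "/llms.txt", "", ""))]
    subpath = parts.path.rstrip("/")
    if subpath:
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc, subpath + "/llms.txt", "", ""))
        )
    return candidates
```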

2. GitHub tree API + glob patterns

For sources hosted on GitHub (the majority of OSS docs), the GitHub tree API gives us a recursive file listing for free:

gh api "repos/owner/repo/git/trees/HEAD?recursive=1" | jq -r '.tree[].path'

Combined with glob patterns in libraries_sources.yaml:

- lib_id: /modelcontextprotocol/go-sdk
  kind: github-glob
  repo: modelcontextprotocol/go-sdk
  include:
    - "README.md"
    - "docs/**/*.md"
  exclude:
    - "docs/internal/**"
    - "docs/draft-*.md"

This is the most concrete win — it covers a huge fraction of OSS doc sources (anything with markdown in a public GitHub repo) and only requires per-lib config of include/exclude patterns, not a full URL list.
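The filter on top of the tree listing is small. A sketch assuming gitignore-style semantics (`**` spans directories, `*` stops at a slash) — the function names are illustrative, and a real implementation would want the full pattern spec rather than this partial translation:

```python
import re

def glob_to_re(pat: str) -> re.Pattern:
    """Translate a gitignore-style glob to a regex. Partial sketch:
    '**/' spans zero or more directories, a bare '**' spans anything,
    '*' stops at a path separator."""
    out, i = [], 0
    while i < len(pat):
        if pat.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")
            i += 3
        elif pat.startswith("**", i):
            out.append(r".*")
            i += 2
        elif pat[i] == "*":
            out.append(r"[^/]*")
            i += 1
        else:
            out.append(re.escape(pat[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")

def select_paths(paths, include, exclude):
    """Keep paths matching any include pattern and no exclude pattern."""
    inc = [glob_to_re(p) for p in include]
    exc = [glob_to_re(p) for p in exclude]
    return [p for p in paths
            if any(r.match(p) for r in inc)
            and not any(r.match(p) for r in exc)]
```

Fed the tree API output for the go-sdk example above, `select_paths(tree, ["README.md", "docs/**/*.md"], ["docs/internal/**", "docs/draft-*.md"])` would yield the URL set that is currently hand-pasted.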

3. Sitemap.xml

For hosted doc sites (mkdocs, docusaurus, hugo, and vitepress all emit one), sitemap.xml lists every page on the site. Combined with scrape-via-agent (#27), this gives us automated discovery of HTML doc pages too.

- lib_id: /facebook/react
  kind: sitemap-crawl
  sitemap: https://react.dev/sitemap.xml
  include:
    - "/reference/**"
    - "/learn/**"
  exclude:
    - "/blog/**"
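The sitemap half of this is stdlib-only. A sketch of extracting `<loc>` entries and filtering by path — prefix matching stands in here for the `/reference/**` globs above, and the sample payload is invented:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> from a sitemap.xml payload."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def filter_by_path(urls, include_prefixes, exclude_prefixes):
    """Keep URLs whose path starts with an include prefix and no exclude prefix."""
    keep = []
    for url in urls:
        path = urlsplit(url).path
        if any(path.startswith(p) for p in include_prefixes) and not any(
            path.startswith(p) for p in exclude_prefixes
        ):
            keep.append(url)
    return keep

sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://react.dev/reference/react/useState</loc></url>
  <url><loc>https://react.dev/learn/thinking-in-react</loc></url>
  <url><loc>https://react.dev/blog/2024/04/25/react-19</loc></url>
</urlset>"""
```

A real crawl would also need to handle sitemap index files (sitemaps that point at other sitemaps), which the sketch ignores.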

4. Recursive crawl from a root URL

Follow relative .md or HTML links starting from a README. Risk: link sprawl, off-corpus drift, infinite loops. Probably the least desirable option — too much heuristic risk for the value.

5. Hybrid declarative descriptor

The end state is probably a per-lib YAML that picks one of the discovery strategies above:

- lib_id: /hashicorp/terraform-provider-aws
  discovery:
    method: sitemap-crawl
    url: https://registry.terraform.io/providers/hashicorp/aws/latest/sitemap.xml
    include: ["**/docs/resources/**"]
  kind: scrape-via-agent  # how to extract content once discovered

- lib_id: /modelcontextprotocol/go-sdk
  discovery:
    method: github-glob
    repo: modelcontextprotocol/go-sdk
    include: ["README.md", "docs/**/*.md"]
  kind: github-md

The split between discovery (how to find URLs) and kind (how to extract content) is the key abstraction. They're orthogonal — you can discover via GitHub tree and extract via raw markdown, or discover via sitemap and extract via the agent.
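One way to see the orthogonality is a strategy registry keyed by discovery method: discovery produces URLs, and kind is consumed elsewhere by the extraction side. Everything below (names, the stubbed github-glob body) is illustrative, not a committed design:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SourceSpec:
    """One libraries_sources.yaml entry under the proposed split."""
    lib_id: str
    discovery_method: str   # e.g. github-glob | sitemap-crawl
    discovery_args: dict
    kind: str               # e.g. github-md | scrape-via-agent

# Each discovery strategy maps its args to a list of URLs.
DISCOVERERS: dict[str, Callable[[dict], list[str]]] = {}

def discoverer(name: str):
    def register(fn):
        DISCOVERERS[name] = fn
        return fn
    return register

@discoverer("github-glob")
def discover_github_glob(args: dict) -> list[str]:
    # Would call the tree API and apply include/exclude globs;
    # stubbed with a pre-resolved "paths" arg to stay self-contained.
    return [f"https://raw.githubusercontent.com/{args['repo']}/HEAD/{p}"
            for p in args.get("paths", [])]

def discover(spec: SourceSpec) -> list[str]:
    return DISCOVERERS[spec.discovery_method](spec.discovery_args)
```

Adding a sitemap-crawl or llms-txt method is then one more registered function; the extraction side never needs to know which one produced the URLs.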

Output

A research note in docs/research/source-discovery.md that:

  1. Surveys real adoption of llms.txt and the other formats across the kinds of libs we want to index
  2. Picks a v1 mechanism (or hybrid) and justifies it
  3. Documents the YAML schema split between discovery and kind
  4. Files concrete follow-up issues for each discovery method we adopt

Scope

  • Spike: scan ~100 representative libs and check llms.txt adoption rate
  • Spike: prototype the GitHub tree API + glob filter on modelcontextprotocol/go-sdk and confirm we can replace the current hardcoded URL list
  • Spike: prototype sitemap parsing on a hosted doc site (e.g. react.dev or vitejs.dev)
  • Document the decision in docs/research/source-discovery.md
  • File implementation issues for the adopted methods

Acceptance criteria

  • Research note exists in docs/research/
  • At least one concrete discovery method has been spiked end-to-end against a real source
  • Implementation issue(s) filed for the adopted method(s), referenced from this issue
  • The proposed YAML schema is documented and reviewable

Out of scope

  • Implementation. This is research → ADR → followup issues. No code in this issue.
  • Cross-source crawling and link-following. That's the option-4 rabbit hole. Probably never viable for our use case.
  • AI-driven source discovery (asking an LLM "find the docs for $lib"). Too unreliable for the indexing path.

Metadata

Labels

  • P2 (Normal — clear value, not urgent)
  • research (Research / spike)
