
Research: automated freshness detection and refresh triggers at Context7-scale #47

@laradji

Description

Investigate how Deadzone should detect that an indexed library is stale (its upstream documentation has changed) and decide when to re-scrape it. Today the answer is "the maintainer notices and runs the scraper manually", which works at one lib and breaks at 50.

Parent: #15

Why now

Deadzone targets a Context7-scale corpus (~33k libs eventually, 2-3k near term). At that size:

  • A maintainer cannot personally monitor 3000 doc sources for changes
  • "Wait for a user to file an issue" doesn't scale either — most stale docs are silently wrong
  • Re-scraping everything periodically is wasteful (most libs don't change between runs) and slow when scrape-via-agent is involved, since each agent-backed scrape costs LLM tokens (#27: scrape-via-agent source kind, LLM-backed extraction for any non-raw doc source)
  • Without a freshness signal, the corpus drifts from upstream over time and the search results become unreliable

Context7 refreshes every 10-15 days as a baseline policy. Deadzone needs an equivalent — or better, a signal-driven refresh that only re-scrapes what actually changed.

This issue is the research that picks the mechanism.

Areas to investigate

1. Per-source change signal — the cheap detection layer

Different sources have different "is this stale" signals. The question is which ones are cheap enough to check at the cadence we want.

For kind: github-md and kind: github-glob (#46):

  • gh api "repos/owner/repo/commits?path=docs/&per_page=1" returns the latest commit touching the indexed paths (quote the URL so the shell doesn't treat & as a command separator)
  • Compare its SHA against the SHA we recorded at last scrape time
  • Single API call per lib; the endpoint works without auth for public repos, though the unauthenticated rate limit (60 requests/hour) means a token is needed at corpus scale
  • Cheapest signal in the entire system. Likely the right primary mechanism for any GitHub-hosted source.

For kind: scrape-via-agent against HTML doc sites:

  • HTTP HEAD request → check Last-Modified and ETag headers
  • Many doc sites don't set these reliably; fall back to a content hash on the rendered page
  • Per-URL check, more expensive than the GitHub case

For sitemap-crawl sources (post-#46):

  • sitemap.xml typically has <lastmod> per URL
  • Check the sitemap once, diff against the previous one, only re-fetch URLs whose lastmod changed
  • Best of both worlds for HTML sources that publish a proper sitemap
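The sitemap diff reduces to parsing <loc>/<lastmod> pairs and comparing against the previous snapshot. A sketch with hypothetical function names:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> dict[str, str]:
    """Map each <loc> URL to its <lastmod> (empty string if absent)."""
    root = ET.fromstring(xml_text)
    out: dict[str, str] = {}
    for url in root.iter(f"{SITEMAP_NS}url"):
        loc = url.findtext(f"{SITEMAP_NS}loc")
        lastmod = url.findtext(f"{SITEMAP_NS}lastmod", default="")
        if loc:
            out[loc.strip()] = lastmod.strip()
    return out


def changed_urls(previous: dict[str, str], current: dict[str, str]) -> set[str]:
    """URLs that are new, or whose lastmod moved since the last scrape."""
    return {u for u, lm in current.items() if previous.get(u) != lm}
```

URLs that disappear from the sitemap are a separate question (deletion handling), which this sketch deliberately leaves out.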

2. Where to store the per-lib freshness state

Options:

  • meta_libs table in main DB — extends the table introduced in #29 (artifact checksum metadata for skip-unchanged consolidation) with an upstream_signature column (commit SHA, ETag set, content hash, etc.)
  • Per-artifact meta — each artifacts/<lib>.db carries its last-known signature. Good cohesion, no separate state table.
  • External freshness file: freshness.yaml alongside libraries_sources.yaml, committed or generated. Simpler, but separate from the artifact lifecycle.

The artifact-meta path is probably cleanest — it travels with the artifact and has no separate failure mode.

3. Refresh trigger — when is "now" for re-scraping

Three options, increasing in automation:

  1. CLI subcommand: deadzone freshness check lists libs that have drifted, maintainer decides what to refresh manually. No automation, just visibility. Sufficient at small scale.
  2. CLI subcommand with --refresh: same check, plus automatically re-scrape everything stale. Still requires the maintainer to run it.
  3. Scheduled CI workflow: GitHub Actions cron runs the check daily, opens a PR with refreshed artifacts for any drifted libs. Fully automated.

Option 3 is the end state. Options 1-2 are stepping stones. We should ship option 1 first, then layer on 2 and 3 as the corpus grows. (#53 already implements a naive version of option 3 — weekly cron — that #47 will eventually replace with smarter per-lib triggers.)
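Option 3 could be as small as a cron workflow that runs the check and hands refreshed artifacts to a PR action. A sketch only: the deadzone freshness CLI surface doesn't exist yet, and the workflow shape is an assumption:

```yaml
# Hypothetical workflow sketch; the `deadzone freshness` subcommand
# and the daily cadence are placeholders pending this research.
name: freshness-check
on:
  schedule:
    - cron: "0 6 * * *"   # daily check
  workflow_dispatch: {}    # manual escape hatch
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: deadzone freshness check --refresh
      - uses: peter-evans/create-pull-request@v6
        with:
          title: "chore: refresh stale artifacts"
```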

4. Refresh policy — re-scrape everything vs. only-stale

Once the signal exists, the refresh loop should be incremental by default: only re-scrape libs whose signature has actually changed. Combined with #29 (skip-unchanged consolidation), the typical refresh becomes "check 3000 libs for drift, find that 12 are stale, re-scrape those 12, consolidate". Cost is bounded by the change rate, not the corpus size.
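The only-stale loop is trivial once the per-kind signal functions exist. A sketch of the planning step, with the signature check injected as a callable so the same loop serves commit SHAs, ETags, and sitemap hashes (all names hypothetical):

```python
from typing import Callable, Optional


def refresh_plan(
    recorded: dict[str, Optional[str]],        # lib -> signature at last scrape
    check: Callable[[str], Optional[str]],     # lib -> current upstream signature
) -> list[str]:
    """Return only the libs whose upstream signature has drifted."""
    stale = []
    for lib, old_sig in recorded.items():
        new_sig = check(lib)   # one cheap call per lib (SHA, ETag, lastmod, ...)
        if new_sig is not None and new_sig != old_sig:
            stale.append(lib)
    return stale
```

Everything downstream (re-scrape, consolidate) then operates on the returned list, so the expensive work scales with the change rate, not the corpus size.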

5. Manual override — when the user knows better

Always keep the manual escape hatch: deadzone scrape -lib /x/y --force skips the freshness check and re-scrapes regardless. Useful when the signature didn't change but the content somehow did (e.g. the page was edited in place without a commit, or the maintainer wants to re-scrape with an updated extraction prompt).

Output

A research note in docs/research/freshness-detection.md that:

  1. Surveys the cheap-signal options for each source kind
  2. Picks a primary mechanism per kind and a fallback
  3. Documents the schema for storing upstream_signature (in artifact meta or a sibling table)
  4. Files concrete follow-up issues for the implementation

Scope

  • Spike: gh api repos/.../commits?path=docs/ against modelcontextprotocol/go-sdk and confirm the SHA changes when docs change
  • Spike: HEAD request + ETag inspection on a few representative HTML doc sites (react.dev, the Terraform Registry, an MkDocs site) to see which actually emit ETag/Last-Modified
  • Spike: parse a sitemap.xml <lastmod> and reason about the diff strategy
  • Design the upstream_signature storage shape — in the artifact meta or a separate table
  • Document the decision in docs/research/freshness-detection.md
  • File implementation issues for the adopted mechanism per kind

Acceptance criteria

  • Research note exists in docs/research/
  • At least one signal mechanism spiked end-to-end against a real source
  • Storage schema for the per-lib upstream signature is documented
  • Implementation follow-up issue(s) filed
  • The CLI subcommand surface (deadzone freshness check, etc.) is sketched

Out of scope

  • Implementation. This is research → ADR → followup issues.
  • Diffing the doc content semantically to detect "this changed but not in a way that matters". Too clever, brittle.
  • Real-time webhooks from upstream sources — most don't expose them, and polling is enough at our cadence.


Labels

  • P2: Normal — clear value, not urgent
  • research: Research / spike
