
Research: automated freshness detection and refresh triggers at Context7-scale #47

@laradji

Description

Investigate how Deadzone should detect that an indexed library is stale (its upstream documentation has changed) and decide when to re-scrape it. Today the answer is "the maintainer notices and runs the scraper manually", which works at one lib and breaks at 50.

Parent: #15

Why now

Deadzone targets a Context7-scale corpus (~33k libs eventually, 2-3k near term). At that size:

  • A maintainer cannot personally monitor 3000 doc sources for changes
  • "Wait for a user to file an issue" doesn't scale either — most stale docs are silently wrong
  • Re-scraping everything periodically is wasteful (most libs don't change between runs) and slow when scrape-via-agent is involved, since each agent-backed scrape costs LLM tokens (#27: scrape-via-agent source kind, LLM-backed extraction for any non-raw doc source)
  • Without a freshness signal, the corpus drifts from upstream over time and the search results become unreliable

Context7 refreshes every 10-15 days as a baseline policy. Deadzone needs an equivalent — or better, a signal-driven refresh that only re-scrapes what actually changed.

This issue is the research that picks the mechanism.

Areas to investigate

1. Per-source change signal — the cheap detection layer

Different sources have different "is this stale" signals. The question is which ones are cheap enough to check at the cadence we want.

For kind: github-md and kind: github-glob (#46):

  • gh api "repos/owner/repo/commits?path=docs/&per_page=1" returns the latest commit touching the indexed paths (quote the URL so the shell doesn't treat & as a command separator)
  • Compare its SHA against the SHA we recorded at last scrape time
  • Single API call per lib; the endpoint works without auth for public repos, though the unauthenticated rate limit (60 requests/hour) means a token is needed at corpus scale
  • Cheapest signal in the entire system. Likely the right primary mechanism for any GitHub-hosted source.

For kind: scrape-via-agent against HTML doc sites:

  • HTTP HEAD request → check Last-Modified and ETag headers
  • Many doc sites don't set these reliably; fall back to a content hash on the rendered page
  • Per-URL check, more expensive than the GitHub case

For sitemap-crawl sources (post-#46):

  • sitemap.xml typically has <lastmod> per URL
  • Check the sitemap once, diff against the previous one, only re-fetch URLs whose lastmod changed
  • Best of both worlds for HTML sources that publish a proper sitemap
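The sitemap diff reduces to parsing <loc>/<lastmod> pairs and comparing against the previous snapshot. A sketch with hypothetical function names:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> dict[str, str]:
    """Map each <loc> URL to its <lastmod> (empty string if absent)."""
    root = ET.fromstring(xml_text)
    out: dict[str, str] = {}
    for url in root.iter(f"{SITEMAP_NS}url"):
        loc = url.findtext(f"{SITEMAP_NS}loc")
        lastmod = url.findtext(f"{SITEMAP_NS}lastmod", default="")
        if loc:
            out[loc.strip()] = lastmod.strip()
    return out


def changed_urls(previous: dict[str, str], current: dict[str, str]) -> set[str]:
    """URLs that are new, or whose lastmod moved since the last scrape."""
    return {u for u, lm in current.items() if previous.get(u) != lm}
```

URLs that disappear from the sitemap are a separate question (deletion handling), which this sketch deliberately leaves out.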

2. Where to store the per-lib freshness state

Options:

  • meta_libs table in main DB — extends the table introduced in #29 (artifact checksum metadata for skip-unchanged consolidation) with an upstream_signature column (commit SHA, ETag set, content hash, etc.)
  • Per-artifact meta — each artifacts/<lib>.db carries its last-known signature. Good cohesion, no separate state table.
  • External freshness file: freshness.yaml alongside libraries_sources.yaml, committed or generated. Simpler, but separate from the artifact lifecycle.

The artifact-meta path is probably cleanest — it travels with the artifact and has no separate failure mode.

3. Refresh trigger — when is "now" for re-scraping

Three options, increasing in automation:

  1. CLI subcommand: deadzone freshness check lists libs that have drifted, maintainer decides what to refresh manually. No automation, just visibility. Sufficient at small scale.
  2. CLI subcommand with --refresh: same check, plus automatically re-scrape everything stale. Still requires the maintainer to run it.
  3. Scheduled CI workflow: GitHub Actions cron runs the check daily, opens a PR with refreshed artifacts for any drifted libs. Fully automated.

Option 3 is the end state. Options 1-2 are stepping stones. We should ship option 1 first, then layer on 2 and 3 as the corpus grows. (#53 already implements a naive version of option 3 — weekly cron — that #47 will eventually replace with smarter per-lib triggers.)
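Option 3 could be as small as a cron workflow that runs the check and hands refreshed artifacts to a PR action. A sketch only: the deadzone freshness CLI surface doesn't exist yet, and the workflow shape is an assumption:

```yaml
# Hypothetical workflow sketch; the `deadzone freshness` subcommand
# and the daily cadence are placeholders pending this research.
name: freshness-check
on:
  schedule:
    - cron: "0 6 * * *"   # daily check
  workflow_dispatch: {}    # manual escape hatch
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: deadzone freshness check --refresh
      - uses: peter-evans/create-pull-request@v6
        with:
          title: "chore: refresh stale artifacts"
```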

4. Refresh policy — re-scrape everything vs. only-stale

Once the signal exists, the refresh loop should be incremental by default: only re-scrape libs whose signature has actually changed. Combined with #29 (skip-unchanged consolidation), the typical refresh becomes "check 3000 libs for drift, find that 12 are stale, re-scrape those 12, consolidate". Cost is bounded by the change rate, not the corpus size.
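The only-stale loop is trivial once the per-kind signal functions exist. A sketch of the planning step, with the signature check injected as a callable so the same loop serves commit SHAs, ETags, and sitemap hashes (all names hypothetical):

```python
from typing import Callable, Optional


def refresh_plan(
    recorded: dict[str, Optional[str]],        # lib -> signature at last scrape
    check: Callable[[str], Optional[str]],     # lib -> current upstream signature
) -> list[str]:
    """Return only the libs whose upstream signature has drifted."""
    stale = []
    for lib, old_sig in recorded.items():
        new_sig = check(lib)   # one cheap call per lib (SHA, ETag, lastmod, ...)
        if new_sig is not None and new_sig != old_sig:
            stale.append(lib)
    return stale
```

Everything downstream (re-scrape, consolidate) then operates on the returned list, so the expensive work scales with the change rate, not the corpus size.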

5. Manual override — when the user knows better

Always keep the manual escape hatch: deadzone scrape -lib /x/y --force skips the freshness check and re-scrapes regardless. Useful when the signature didn't change but the content somehow did (e.g. the page was edited in place without a commit, or the maintainer wants to re-scrape with an updated extraction prompt).

Output

A research note in docs/research/freshness-detection.md that:

  1. Surveys the cheap-signal options for each source kind
  2. Picks a primary mechanism per kind and a fallback
  3. Documents the schema for storing upstream_signature (in artifact meta or a sibling table)
  4. Files concrete follow-up issues for the implementation

Scope

  • Spike: gh api repos/.../commits?path=docs/ against modelcontextprotocol/go-sdk and confirm the SHA changes when docs change
  • Spike: HEAD request + ETag inspection on a few representative HTML doc sites (react.dev, the Terraform Registry, an MkDocs site) to see which actually emit ETag/Last-Modified
  • Spike: parse a sitemap.xml <lastmod> and reason about the diff strategy
  • Design the upstream_signature storage shape — in the artifact meta or a separate table
  • Document the decision in docs/research/freshness-detection.md
  • File implementation issues for the adopted mechanism per kind

Acceptance criteria

  • Research note exists in docs/research/
  • At least one signal mechanism spiked end-to-end against a real source
  • Storage schema for the per-lib upstream signature is documented
  • Implementation follow-up issue(s) filed
  • The CLI subcommand surface (deadzone freshness check, etc.) is sketched

Out of scope

  • Implementation. This is research → ADR → followup issues.
  • Diffing the doc content semantically to detect "this changed but not in a way that matters". Too clever, brittle.
  • Real-time webhooks from upstream sources — most don't expose them, and polling is enough at our cadence.


Labels

  • P2: Normal — clear value, not urgent
  • research: Research / spike
