Investigate how Deadzone detects that an indexed library is stale (its upstream documentation has changed) and decides when to re-scrape it. Today the answer is "the maintainer notices and runs the scraper manually", which works at 1 lib and breaks at 50.
Without a freshness signal, the corpus drifts from upstream over time and the search results become unreliable.
Context7 refreshes every 10-15 days as a baseline policy. Deadzone needs an equivalent — or better, a signal-driven refresh that only re-scrapes what actually changed.
This issue is the research that picks the mechanism.
Areas to investigate
1. Per-source change signal — the cheap detection layer
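Different sources have different "is this stale" signals. The question is which ones are cheap enough to check at the cadence we want.
For kind: github-md and kind: github-glob (#46): gh api repos/owner/repo/commits?path=docs/&per_page=1 returns the latest commit SHA touching the indexed paths
For kind: scrape-via-agent against HTML doc sites: HEAD request → check Last-Modified and ETag headers
For sitemap-crawl sources (post-#46): sitemap.xml typically has <lastmod> per URL
2. Where to store the per-lib freshness state
Options:
meta_libs table in main DB: extends the table introduced in #29 (Artifact checksum metadata for skip-unchanged consolidation), adds an upstream_signature column (commit SHA, ETag set, content hash, etc.)
Per-artifact meta: each artifacts/<lib>.db carries its last-known signature. Good cohesion, no separate state table.
External freshness file: freshness.yaml alongside libraries_sources.yaml, committed or generated. Simpler but separate from the artifact lifecycle.
The artifact-meta path is probably cleanest: it travels with the artifact and has no separate failure mode.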
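A minimal sketch of the per-artifact option, assuming each artifacts/<lib>.db is a SQLite file and that a small key/value table is acceptable. The table name artifact_meta, the key upstream_signature, and both helper functions are hypothetical and only illustrate the shape; they are not Deadzone's actual schema.

```python
import sqlite3
from typing import Optional

META_KEY = "upstream_signature"  # hypothetical key name


def write_signature(conn: sqlite3.Connection, signature: str) -> None:
    """Store the last-known upstream signature inside the artifact itself."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS artifact_meta "
        "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    # INSERT OR REPLACE overwrites the previous signature for the same key.
    conn.execute(
        "INSERT OR REPLACE INTO artifact_meta (key, value) VALUES (?, ?)",
        (META_KEY, signature),
    )
    conn.commit()


def read_signature(conn: sqlite3.Connection) -> Optional[str]:
    """Return the stored signature, or None if none was ever recorded."""
    try:
        row = conn.execute(
            "SELECT value FROM artifact_meta WHERE key = ?", (META_KEY,)
        ).fetchone()
    except sqlite3.OperationalError:
        # Table missing: the artifact predates signature tracking.
        return None
    return row[0] if row else None
```

Because the signature lives in the same file as the indexed content, copying or deleting an artifact carries its freshness state along with it, which is the cohesion argument made above.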
3. Refresh trigger — when is "now" for re-scraping
Three options, increasing in automation:
CLI subcommand: deadzone freshness check lists libs that have drifted, and the maintainer decides what to refresh manually. No automation, just visibility. Sufficient at small scale.
CLI subcommand with --refresh: same check, plus automatically re-scrape everything stale. Still requires the maintainer to run it.
Scheduled CI workflow: GitHub Actions cron runs the check daily, opens a PR with refreshed artifacts for any drifted libs. Fully automated.
Option 3 is the end state. Options 1-2 are stepping stones. We should ship option 1 first, then layer on 2 and 3 as the corpus grows. (#53 already implements a naive version of option 3 — weekly cron — that #47 will eventually replace with smarter per-lib triggers.)
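To make options 1 and 2 concrete, here is a rough sketch of the subcommand surface. It uses Python's argparse purely for illustration; the issue does not say what language Deadzone's CLI is written in, and build_parser and describe are hypothetical names. Only the freshness check subcommand and the --refresh flag come from the text above.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="deadzone")
    commands = parser.add_subparsers(dest="command", required=True)

    freshness = commands.add_parser("freshness", help="inspect upstream drift")
    actions = freshness.add_subparsers(dest="action", required=True)

    # Option 1: report only. Option 2: same check plus --refresh.
    check = actions.add_parser("check", help="list libs whose upstream changed")
    check.add_argument(
        "--refresh",
        action="store_true",
        help="also re-scrape every stale lib",
    )
    return parser


def describe(argv: Sequence[str]) -> str:
    """Summarize what a given command line would do (no side effects)."""
    args = build_parser().parse_args(argv)
    if args.refresh:
        return "check and re-scrape stale libs"
    return "check only"
```

Under this shape, the scheduled workflow in option 3 would not need a new command: a cron job could invoke the same check with --refresh and open a PR with the result.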
4. Refresh policy — re-scrape everything vs. only-stale
Once the signal exists, the refresh loop should be incremental by default: only re-scrape libs whose signature has actually changed. Combined with #29 (skip-unchanged consolidation), the typical refresh becomes "check 3000 libs for drift, find that 12 are stale, re-scrape those 12, consolidate". Cost is bounded by the change rate, not the corpus size.
5. Manual override — when the user knows better
Always keep the manual escape hatch: deadzone scrape -lib /x/y --force skips the freshness check and re-scrapes regardless. Useful when the signature didn't change but the content somehow did (e.g. the page was edited in place without a commit, or the maintainer wants to re-scrape with an updated extraction prompt).
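The incremental policy and the manual override can be sketched together as one selection step. This assumes signatures are opaque strings compared for equality; plan_refresh and its parameters are hypothetical names, not part of Deadzone.

```python
from typing import Dict, Iterable, List, Optional


def plan_refresh(
    stored: Dict[str, Optional[str]],
    upstream: Dict[str, str],
    forced: Iterable[str] = (),
) -> List[str]:
    """Return the libs to re-scrape, sorted for stable output.

    A lib is selected when it is forced, has no stored signature yet,
    or its stored signature differs from the current upstream one.
    """
    stale = {
        lib for lib, current in upstream.items() if stored.get(lib) != current
    }
    # Forced libs bypass the signature comparison entirely.
    return sorted(stale | set(forced))
```

The comparison touches every lib, but only the returned list is re-scraped, so the expensive work scales with the number of changed libs and not with corpus size.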
Output
A research note in docs/research/freshness-detection.md that:
Surveys the cheap-signal options for each source kind
Picks a primary mechanism per kind and a fallback
Documents the schema for storing upstream_signature (in artifact meta or a sibling table)
Files concrete follow-up issues for the implementation
Scope
Spike: gh api repos/.../commits?path=docs/ against modelcontextprotocol/go-sdk and confirm the SHA changes when docs change
Spike: HEAD request + ETag inspection on a few representative HTML doc sites (react.dev, terraform registry, mkdocs site) to see who actually emits ETag/Last-Modified
Spike: parse a sitemap.xml <lastmod> and reason about the diff strategy
Design the upstream_signature storage shape — in the artifact meta or a separate table
Document the decision in docs/research/freshness-detection.md
File implementation issues for the adopted mechanism per kind
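For the three spikes, a small sketch of how each raw response could be reduced to a comparable signature string. These helpers parse responses that have already been fetched, so nothing here touches the network. All function names are hypothetical, preferring ETag over Last-Modified is an assumption, and hashing the (loc, lastmod) pairs is just one possible diff strategy for sitemaps.

```python
import hashlib
import json
import xml.etree.ElementTree as ET
from typing import Mapping, Optional

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def github_signature(commits_json: str) -> Optional[str]:
    """Latest commit SHA from a commits?path=...&per_page=1 response body."""
    commits = json.loads(commits_json)
    return commits[0]["sha"] if commits else None


def http_signature(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer ETag, fall back to Last-Modified; None if neither is sent."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("etag") or lowered.get("last-modified")


def sitemap_signature(sitemap_xml: str) -> str:
    """Hash the sorted (loc, lastmod) pairs so any changed URL changes it."""
    root = ET.fromstring(sitemap_xml)
    pairs = []
    for url in root.iter(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc", default="")
        lastmod = url.findtext(SITEMAP_NS + "lastmod", default="")
        pairs.append((loc.strip(), lastmod.strip()))
    payload = json.dumps(sorted(pairs)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

A None result from http_signature is itself useful spike output: it marks a site that emits neither header and therefore needs a fallback mechanism.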
Acceptance criteria
Research note exists in docs/research/
At least one signal mechanism spiked end-to-end against a real source
Storage schema for the per-lib upstream signature is documented
Implementation follow-up issue(s) filed
The CLI subcommand surface (deadzone freshness check, etc.) is sketched
Out of scope
Implementation. This is research → ADR → follow-up issues.
Diffing the doc content semantically to detect "this changed but not in a way that matters". Too clever, brittle.
Real-time webhooks from upstream sources — most don't expose them, and polling is enough at our cadence.
Parent: #15
Why now
Deadzone targets a Context7-scale corpus (~33k libs eventually, 2-3k near term). At that size, and when scrape-via-agent is involved, each scrape costs LLM tokens (#27, "Add scrape-via-agent source kind: LLM-backed extraction for any non-raw doc source").
Related
docs/research/context7-analysis.md: Context7 refreshes every 10-15 days; we should match or beat that.