Parent: #53
Depends on: none currently blocking
Supersedes: the implementation sketch in #53's original body (which assumed the packs per-artifact release flow paused by #101)

Decision (locked 2026-04-16)
1. Ship deadzone.db via deadzone dbrelease — aligned with #101 (folder-per-lib layout, retire per-artifact release, ship deadzone.db only); no revival of the paused per-artifact flow.
2. workflow_dispatch only — no cron until #47 (automated freshness detection and refresh triggers) lands.
3. Inter-job transport: actions/cache@v5, not actions/upload-artifact. Each matrix slot caches artifacts/<slug>/; the consolidate job restores all lib caches.
4. Publish is conditional on inputs.tag: with a tag, consolidate chains deadzone dbrelease; without a tag, the workflow stops at a consolidated deadzone.db cache.
5. Do NOT touch internal/packs/upload.go — per-artifact distribution stays paused.

Full reasoning and architecture: docs/research/batch-scrape-actions.md.

Why
Scraping the full registry on a laptop costs minutes of wall time and blocks the operator; GH-hosted Linux runners do it for free and in parallel. The cache layer (decision #3) turns the matrix into a de-facto freshness shim — libs whose config hash hasn't changed are skipped instantly on re-runs.
Acceptance criteria
.github/workflows/scrape-pack.yml exists and triggers on workflow_dispatch with inputs:
lib (optional string) — filter to a single lib_id (empty = all)
tag (optional string) — if non-empty, triggers deadzone dbrelease at the end
Job expand-libs emits a JSON array of {lib_id, version, slug} resolved from libraries_sources.yaml, consumable by matrix: via fromJSON. This requires a minimal -list flag on deadzone scrape that calls scraper.LoadConfig + scraper.Resolve (in internal/scraper/config.go), prints the JSON to stdout, and os.Exit(0). This is the only new Go surface in the issue.
Job scrape runs strategy.matrix over the expanded list with max-parallel: 20 and fail-fast: false
Each scrape slot reuses the pre-existing cache keys from .github/workflows/ci.yml lines 114–130 verbatim — do NOT introduce new keys for the embedder or ORT library:
hugot-model-${{ runner.os }}-${{ hashFiles('internal/embed/hugot.go') }}
ort-lib-${{ runner.os }}-${{ hashFiles('internal/ort/ort.go') }}
Each scrape slot adds a new cache entry keyed on: artifact-<slug>-<version>-${{ hashFiles('libraries_sources.yaml') }}-${{ hashFiles('internal/embed/hugot.go') }}
with path artifacts/<slug>/
Cache hit → slot skips deadzone scrape via if: steps.cache.outputs.cache-hit != 'true' guard on the scrape step
Cache miss → slot runs just scrape lib=<id> (add -version <v> flag wiring to the justfile recipe or invoke go run directly, whichever is cleaner — implementer's call)
Job consolidate (needs: scrape) restores all N lib caches into artifacts/ and runs just consolidate. Fan-in mechanism: see docs/research/batch-scrape-actions.md §4 — preferred: Pattern C (REST cache API); fallback: Pattern B (consolidate as matrix + staging upload-artifact). If Pattern B is chosen, document the deviation from decision #3 in the PR body.
If inputs.tag is non-empty, the consolidate job chains mise exec -- go run -tags ORT ./cmd/deadzone dbrelease -db deadzone.db -tag <tag>. Reuse cmd/deadzone/dbrelease.go verbatim — do NOT reimplement the upload.
A final summary step (can be in the consolidate job) writes a markdown table to $GITHUB_STEP_SUMMARY with columns: lib, version, status (scraped / cached / failed)
README.md → Build from source section gains one line: "The full registry can also be scraped from GitHub Actions via the scrape-pack workflow (see .github/workflows/scrape-pack.yml)."
CLAUDE.md → Build & run section gains one line: "Batch rescrape: gh workflow run scrape-pack.yml -f tag=<tag> scrapes + consolidates + dbreleases. Omit -f tag=… to stop at the consolidated-db cache."
Code skeleton (sketch, not prescriptive — finalize in implementation)
```yaml
name: scrape-pack

on:
  workflow_dispatch:
    inputs:
      lib: { description: 'Filter lib_id (empty = all)', required: false }
      tag: { description: 'Release tag (empty = no publish)', required: false }

permissions:
  contents: write # dbrelease needs this when tag != ''

concurrency:
  group: scrape-pack
  cancel-in-progress: false

jobs:
  expand-libs:
    runs-on: ubuntu-latest
    outputs:
      libs: ${{ steps.list.outputs.libs }}
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - id: list
        run: |
          libs=$(mise exec -- go run -tags ORT ./cmd/deadzone scrape -config libraries_sources.yaml -list)
          echo "libs=$libs" >> "$GITHUB_OUTPUT"

  scrape:
    needs: expand-libs
    runs-on: ubuntu-latest
    strategy:
      matrix:
        entry: ${{ fromJSON(needs.expand-libs.outputs.libs) }}
      fail-fast: false
      max-parallel: 20
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - uses: ./.github/actions/install-native-deps
      - uses: actions/cache@v5 # hugot model — verbatim from ci.yml L114-119
        with:
          path: ${{ env.DEADZONE_HUGOT_CACHE }}
          key: hugot-model-${{ runner.os }}-${{ hashFiles('internal/embed/hugot.go') }}
      - uses: actions/cache@v5 # ORT lib — verbatim from ci.yml L124-129
        with:
          path: ${{ env.DEADZONE_ORT_CACHE }}
          key: ort-lib-${{ runner.os }}-${{ hashFiles('internal/ort/ort.go') }}
      - uses: actions/cache@v5 # per-lib artifact cache
        id: artifact-cache
        with:
          path: artifacts/${{ matrix.entry.slug }}
          key: artifact-${{ matrix.entry.slug }}-${{ matrix.entry.version }}-${{ hashFiles('libraries_sources.yaml') }}-${{ hashFiles('internal/embed/hugot.go') }}
      - if: steps.artifact-cache.outputs.cache-hit != 'true'
        run: just scrape lib=${{ matrix.entry.lib_id }} # -version wiring TBD

  consolidate:
    needs: [expand-libs, scrape] # expand-libs needed so its outputs stay visible here
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - uses: ./.github/actions/install-native-deps
      # Pattern C: fan-in via REST cache API — see docs/research/batch-scrape-actions.md §4
      - name: Restore all lib caches
        env: { GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} }
        run: |
          # read needs.expand-libs.outputs.libs, loop, fetch each cache archive by key
          # fallback to Pattern B if this proves unworkable — document in PR
      - run: just consolidate db=deadzone.db
      - if: inputs.tag != ''
        run: mise exec -- go run -tags ORT ./cmd/deadzone dbrelease -db deadzone.db -tag ${{ inputs.tag }}
      - name: Summary
        run: echo "| lib | version | status |" >> $GITHUB_STEP_SUMMARY
```
Concrete file pointers
Files to create / modify:
.github/workflows/scrape-pack.yml (new)
cmd/deadzone/scrape.go — add minimal -list flag that calls scraper.Resolve and prints JSON
README.md — 1 line in Build from source
CLAUDE.md — 1 line in Build & run
Files to read as reference (do NOT refactor):
cmd/deadzone/scrape.go — flag surface: -lib, -version, -config, -artifacts, -parallel-github-md, -parallel-scrape-via-agent
cmd/deadzone/consolidate.go — flag surface: -db, -artifacts
cmd/deadzone/dbrelease.go — invocation pattern for publish step
internal/scraper/config.go — LoadConfig + Resolve (called by the new -list flag)
internal/packs/paths.go — packs.Slug(libID) for constructing cache-key slugs
internal/packs/releaser.go — GHReleaser (already wired into dbrelease.go)
.github/workflows/ci.yml L114–130 — cache keys to copy verbatim
justfile — recipes scrape, consolidate, dbrelease
docs/research/batch-scrape-actions.md — decision log with fan-in pattern analysis
Test commands (literal, for agent self-check)
mise exec -- go build -tags ORT ./... — compiles
just test -short — passes
mise exec -- go run -tags ORT ./cmd/deadzone scrape -list — emits JSON array on stdout, exits 0 (after -list flag is added)
Dry-run sans tag: gh workflow run scrape-pack.yml --ref <branch> -f lib=/modelcontextprotocol/go-sdk → run complete, no release pushed, summary shows 1 scraped, 0 cached, 0 failed
Dry-run cache hit (re-run same dispatch): summary shows 0 scraped, 1 cached, 0 failed
Full E2E with publish on a scratch tag: gh workflow run scrape-pack.yml --ref <branch> -f tag=v0.0.0-testpack → gh release view v0.0.0-testpack contains deadzone.db + deadzone.db.sha256. Delete the release afterwards.
Out of scope (fenced)
Do NOT revive internal/packs/upload.go — stays errPerArtifactDisabled
No self-hosted runner support — separate issue if ever needed
No per-PR ephemeral packs — different shape, file separately
No changes to release.yml — existing binary release flow is orthogonal
No refactor of deadzone scrape / consolidate / dbrelease beyond the minimal -list addition
No new Go dependency
No upload-artifact between jobs — transport is cache only. If Pattern B (fan-in staging) proves necessary, document the deviation in the PR body; do not make it the default path
No new cache key for hugot model or ORT lib — reuse verbatim from ci.yml L114–130
Open sub-decisions for the implementer
Fan-in pattern: Pattern C (REST cache API) preferred; Pattern B (matrix + staging) acceptable with PR body justification. Pattern A collapses into B — do not build it.
-version wiring in the scrape slot: either extend the justfile recipe to accept a version= kwarg, or invoke go run directly. Implementer picks based on which makes the workflow YAML cleanest.
-list flag default output format: JSON array of {lib_id, version, slug}. Add fields only if the matrix consumer needs them.