
feat: .github/workflows/scrape-pack.yml — matrix scrape + cache + consolidate producing deadzone.db #126

@laradji

Description


Parent: #53
Depends on: none currently blocking
Supersedes: the implementation sketch in #53's original body (which assumed the packs per-artifact release flow paused by #101)

Decision (locked 2026-04-16)

  1. Publish target: single deadzone.db via deadzone dbrelease — aligned with #101 (scraper/packs: folder-per-lib layout + retire per-artifact release, ship deadzone.db only); no revival of the paused per-artifact flow.
  2. Trigger: workflow_dispatch only — no cron until #47 (Research: automated freshness detection and refresh triggers at Context7-scale) lands.
  3. Inter-job transport: actions/cache@v5, not actions/upload-artifact. Each matrix slot caches artifacts/<slug>/; the consolidate job restores all lib caches.
  4. Publish conditional on inputs.tag: with a tag, consolidate chains deadzone dbrelease; without a tag, the workflow stops at a consolidated deadzone.db cache.
  5. Do NOT touch internal/packs/upload.go — per-artifact distribution stays paused.

Full reasoning and architecture: docs/research/batch-scrape-actions.md.

Why

Scraping the full registry on a laptop costs minutes of wall time and blocks the operator; GH-hosted Linux runners do it free and in parallel. The cache layer (decision #3) turns the matrix into a de-facto freshness shim — libs whose config hash hasn't changed are skipped instantly on re-runs.

Acceptance criteria

  • .github/workflows/scrape-pack.yml exists and triggers on workflow_dispatch with inputs:
    • lib (optional string) — filter to a single lib_id (empty = all)
    • tag (optional string) — if non-empty, triggers deadzone dbrelease at the end
  • Job expand-libs emits a JSON array of {lib_id, version, slug} resolved from libraries_sources.yaml, consumable by matrix: via fromJSON. This requires a minimal -list flag on deadzone scrape that calls scraper.LoadConfig + scraper.Resolve (in internal/scraper/config.go), prints the JSON to stdout, and os.Exit(0). This is the only new Go surface in the issue.
  • Job scrape runs strategy.matrix over the expanded list with max-parallel: 20 and fail-fast: false
  • Each scrape slot reuses the pre-existing cache keys from .github/workflows/ci.yml lines 114–130 verbatim — do NOT introduce new keys for the embedder or ORT library:
    • hugot-model-${{ runner.os }}-${{ hashFiles('internal/embed/hugot.go') }}
    • ort-lib-${{ runner.os }}-${{ hashFiles('internal/ort/ort.go') }}
  • Each scrape slot adds a new cache entry keyed on:
    artifact-<slug>-<version>-${{ hashFiles('libraries_sources.yaml') }}-${{ hashFiles('internal/embed/hugot.go') }}
    with path artifacts/<slug>/
  • Cache hit → slot skips deadzone scrape via an if: steps.artifact-cache.outputs.cache-hit != 'true' guard on the scrape step (step id matching the cache step's id: artifact-cache)
  • Cache miss → slot runs just scrape lib=<id> (add -version <v> flag wiring to the justfile recipe or invoke go run directly, whichever is cleaner — implementer's call)
  • Job consolidate (needs: scrape) restores all N lib caches into artifacts/ and runs just consolidate
  • If inputs.tag is non-empty, the consolidate job chains mise exec -- go run -tags ORT ./cmd/deadzone dbrelease -db deadzone.db -tag <tag>. Reuse cmd/deadzone/dbrelease.go verbatim — do NOT reimplement the upload.
  • A final summary step (can be in the consolidate job) writes a markdown table to $GITHUB_STEP_SUMMARY with columns: lib, version, status (scraped / cached / failed)
  • README.md → Build from source section gains one line: "The full registry can also be scraped from GitHub Actions via the scrape-pack workflow (see .github/workflows/scrape-pack.yml)."
  • CLAUDE.md → Build & run section gains one line: "Batch rescrape: gh workflow run scrape-pack.yml -f tag=<tag> scrapes + consolidates + dbreleases. Omit -f tag=… to stop at the consolidated-db cache."

Code skeleton (sketch, not prescriptive — finalize in implementation)

name: scrape-pack
on:
  workflow_dispatch:
    inputs:
      lib: { description: 'Filter lib_id (empty = all)', required: false }
      tag: { description: 'Release tag (empty = no publish)', required: false }
permissions:
  contents: write  # dbrelease needs this when tag != ''
concurrency:
  group: scrape-pack
  cancel-in-progress: false
jobs:
  expand-libs:
    runs-on: ubuntu-latest
    outputs:
      libs: ${{ steps.list.outputs.libs }}
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - id: list
        run: |
          libs=$(mise exec -- go run -tags ORT ./cmd/deadzone scrape -config libraries_sources.yaml -list)
          echo "libs=$libs" >> "$GITHUB_OUTPUT"
  scrape:
    needs: expand-libs
    runs-on: ubuntu-latest
    strategy:
      matrix:
        entry: ${{ fromJSON(needs.expand-libs.outputs.libs) }}
      fail-fast: false
      max-parallel: 20
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - uses: ./.github/actions/install-native-deps
      - uses: actions/cache@v5  # hugot model — verbatim from ci.yml L114-119
        with:
          path: ${{ env.DEADZONE_HUGOT_CACHE }}
          key: hugot-model-${{ runner.os }}-${{ hashFiles('internal/embed/hugot.go') }}
      - uses: actions/cache@v5  # ORT lib — verbatim from ci.yml L124-129
        with:
          path: ${{ env.DEADZONE_ORT_CACHE }}
          key: ort-lib-${{ runner.os }}-${{ hashFiles('internal/ort/ort.go') }}
      - uses: actions/cache@v5  # per-lib artifact cache
        id: artifact-cache
        with:
          path: artifacts/${{ matrix.entry.slug }}
          key: artifact-${{ matrix.entry.slug }}-${{ matrix.entry.version }}-${{ hashFiles('libraries_sources.yaml') }}-${{ hashFiles('internal/embed/hugot.go') }}
      - if: steps.artifact-cache.outputs.cache-hit != 'true'
        run: just scrape lib=${{ matrix.entry.lib_id }}  # -version wiring TBD
  consolidate:
    needs: scrape
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - uses: ./.github/actions/install-native-deps
      # Pattern C: fan-in via REST cache API — see docs/research/batch-scrape-actions.md §4
      - name: Restore all lib caches
        env: { GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} }
        run: |
          # read needs.expand-libs.outputs.libs, loop, fetch each cache archive by key
          # fallback to Pattern B if this proves unworkable — document in PR
      - run: just consolidate db=deadzone.db
      - if: inputs.tag != ''
        run: mise exec -- go run -tags ORT ./cmd/deadzone dbrelease -db deadzone.db -tag ${{ inputs.tag }}
      - name: Summary
        run: echo "| lib | version | status |" >> $GITHUB_STEP_SUMMARY

Concrete file pointers

Files to create / modify:

  • .github/workflows/scrape-pack.yml (new)
  • cmd/deadzone/scrape.go — add minimal -list flag that calls scraper.Resolve and prints JSON
  • README.md — 1 line in Build from source
  • CLAUDE.md — 1 line in Build & run

Files to read as reference (do NOT refactor):

  • cmd/deadzone/scrape.go — flag surface: -lib, -version, -config, -artifacts, -parallel-github-md, -parallel-scrape-via-agent
  • cmd/deadzone/consolidate.go — flag surface: -db, -artifacts
  • cmd/deadzone/dbrelease.go — invocation pattern for publish step
  • internal/scraper/config.go — LoadConfig + Resolve (called by the new -list flag)
  • internal/packs/paths.go — packs.Slug(libID) for constructing cache-key slugs
  • internal/packs/releaser.go — GHReleaser (already wired into dbrelease.go)
  • .github/workflows/ci.yml L114–130 — cache keys to copy verbatim
  • justfile — recipes scrape, consolidate, dbrelease
  • docs/research/batch-scrape-actions.md — decision log with fan-in pattern analysis

Test commands (literal, for agent self-check)

  • mise exec -- go build -tags ORT ./... — compiles
  • just test -short — passes
  • mise exec -- go run -tags ORT ./cmd/deadzone scrape -list — emits JSON array on stdout, exits 0 (after -list flag is added)
  • Dry-run sans tag: gh workflow run scrape-pack.yml --ref <branch> -f lib=/modelcontextprotocol/go-sdk → run completes, no release pushed, summary shows 1 scraped, 0 cached, 0 failed
  • Dry-run cache hit (re-run same dispatch): summary shows 0 scraped, 1 cached, 0 failed
  • Full E2E with publish on a scratch tag: gh workflow run scrape-pack.yml --ref <branch> -f tag=v0.0.0-testpack → gh release view v0.0.0-testpack shows deadzone.db + deadzone.db.sha256. Delete the release afterwards.

Out of scope (fenced)

  • No cron trigger — deferred until #47 (Research: automated freshness detection and refresh triggers at Context7-scale) ships
  • Do NOT revive internal/packs/upload.go — it stays behind errPerArtifactDisabled
  • No self-hosted runner support — separate issue if ever needed
  • No per-PR ephemeral packs — different shape, file separately
  • No changes to release.yml — existing binary release flow is orthogonal
  • No refactor of deadzone scrape / consolidate / dbrelease beyond the minimal -list addition
  • No new Go dependency
  • No upload-artifact between jobs — transport is cache only. If Pattern B (fan-in staging) proves necessary, document the deviation in the PR body; do not make it the default path
  • No new cache key for hugot model or ORT lib — reuse verbatim from ci.yml L114–130

Open sub-decisions for the implementer

  • Fan-in pattern: Pattern C (REST cache API) preferred; Pattern B (matrix + staging) acceptable with PR body justification. Pattern A collapses into B — do not build it.
  • -version wiring in the scrape slot: either extend the justfile recipe to accept a version= kwarg, or invoke go run directly. Implementer picks based on which makes the workflow YAML cleanest.
  • -list flag default output format: JSON array of {lib_id, version, slug}. Add fields only if the matrix consumer needs them.

Labels: P3 Low (nice-to-have, when time allows), feature (New feature)
