
feat: .github/workflows/scrape-pack.yml — matrix scrape + cache + consolidate producing deadzone.db #126

@laradji

Description


Parent: #53
Depends on: none currently blocking
Supersedes: the implementation sketch in #53's original body (which assumed the packs per-artifact release flow paused by #101)

Decision (locked 2026-04-16)

  1. Publish target: single deadzone.db via deadzone dbrelease — aligned with #101 (scraper/packs: folder-per-lib layout + retire per-artifact release, ship deadzone.db only); no revival of the paused per-artifact flow.
  2. Trigger: workflow_dispatch only — no cron until #47 (Research: automated freshness detection and refresh triggers at Context7-scale) lands.
  3. Inter-job transport: actions/cache@v5, not actions/upload-artifact. Each matrix slot caches artifacts/<slug>/; the consolidate job restores all lib caches.
  4. Publish conditional on inputs.tag: with a tag, consolidate chains deadzone dbrelease; without a tag, the workflow stops at a consolidated deadzone.db cache.
  5. Do NOT touch internal/packs/upload.go — per-artifact distribution stays paused.

Full reasoning and architecture: docs/research/batch-scrape-actions.md.

Why

Scraping the full registry on a laptop costs minutes of wall time and blocks the operator; GH-hosted Linux runners do it free and in parallel. The cache layer (decision #3) turns the matrix into a de-facto freshness shim — libs whose config hash hasn't changed are skipped instantly on re-runs.

Acceptance criteria

  • .github/workflows/scrape-pack.yml exists and triggers on workflow_dispatch with inputs:
    • lib (optional string) — filter to a single lib_id (empty = all)
    • tag (optional string) — if non-empty, triggers deadzone dbrelease at the end
  • Job expand-libs emits a JSON array of {lib_id, version, slug} resolved from libraries_sources.yaml, consumable by matrix: via fromJSON. This requires a minimal -list flag on deadzone scrape that calls scraper.LoadConfig + scraper.Resolve (in internal/scraper/config.go), prints the JSON to stdout, and os.Exit(0). This is the only new Go surface in the issue.
  • Job scrape runs strategy.matrix over the expanded list with max-parallel: 20 and fail-fast: false
  • Each scrape slot reuses the pre-existing cache keys from .github/workflows/ci.yml lines 114–130 verbatim — do NOT introduce new keys for the embedder or ORT library:
    • hugot-model-${{ runner.os }}-${{ hashFiles('internal/embed/hugot.go') }}
    • ort-lib-${{ runner.os }}-${{ hashFiles('internal/ort/ort.go') }}
  • Each scrape slot adds a new cache entry keyed on:
    artifact-<slug>-<version>-${{ hashFiles('libraries_sources.yaml') }}-${{ hashFiles('internal/embed/hugot.go') }}
    with path artifacts/<slug>/
  • Cache hit → slot skips deadzone scrape via an if: steps.artifact-cache.outputs.cache-hit != 'true' guard on the scrape step (step id matching the cache step's id: artifact-cache)
  • Cache miss → slot runs just scrape lib=<id> (add -version <v> flag wiring to the justfile recipe or invoke go run directly, whichever is cleaner — implementer's call)
  • Job consolidate (needs: scrape) restores all N lib caches into artifacts/ and runs just consolidate
  • If inputs.tag is non-empty, the consolidate job chains mise exec -- go run -tags ORT ./cmd/deadzone dbrelease -db deadzone.db -tag <tag>. Reuse cmd/deadzone/dbrelease.go verbatim — do NOT reimplement the upload.
  • A final summary step (can be in the consolidate job) writes a markdown table to $GITHUB_STEP_SUMMARY with columns: lib, version, status (scraped / cached / failed)
  • README.md → Build from source section gains one line: "The full registry can also be scraped from GitHub Actions via the scrape-pack workflow (see .github/workflows/scrape-pack.yml)."
  • CLAUDE.md → Build & run section gains one line: "Batch rescrape: gh workflow run scrape-pack.yml -f tag=<tag> scrapes + consolidates + dbreleases. Omit -f tag=… to stop at the consolidated-db cache."

Code skeleton (sketch, not prescriptive — finalize in implementation)

name: scrape-pack
on:
  workflow_dispatch:
    inputs:
      lib: { description: 'Filter lib_id (empty = all)', required: false }
      tag: { description: 'Release tag (empty = no publish)', required: false }
permissions:
  contents: write  # dbrelease needs this when tag != ''
concurrency:
  group: scrape-pack
  cancel-in-progress: false
jobs:
  expand-libs:
    runs-on: ubuntu-latest
    outputs:
      libs: ${{ steps.list.outputs.libs }}
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - id: list
        run: |
          libs=$(mise exec -- go run -tags ORT ./cmd/deadzone scrape -config libraries_sources.yaml -list)
          echo "libs=$libs" >> "$GITHUB_OUTPUT"
  scrape:
    needs: expand-libs
    runs-on: ubuntu-latest
    strategy:
      matrix:
        entry: ${{ fromJSON(needs.expand-libs.outputs.libs) }}
      fail-fast: false
      max-parallel: 20
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - uses: ./.github/actions/install-native-deps
      - uses: actions/cache@v5  # hugot model — verbatim from ci.yml L114-119
        with:
          path: ${{ env.DEADZONE_HUGOT_CACHE }}
          key: hugot-model-${{ runner.os }}-${{ hashFiles('internal/embed/hugot.go') }}
      - uses: actions/cache@v5  # ORT lib — verbatim from ci.yml L124-129
        with:
          path: ${{ env.DEADZONE_ORT_CACHE }}
          key: ort-lib-${{ runner.os }}-${{ hashFiles('internal/ort/ort.go') }}
      - uses: actions/cache@v5  # per-lib artifact cache
        id: artifact-cache
        with:
          path: artifacts/${{ matrix.entry.slug }}
          key: artifact-${{ matrix.entry.slug }}-${{ matrix.entry.version }}-${{ hashFiles('libraries_sources.yaml') }}-${{ hashFiles('internal/embed/hugot.go') }}
      - if: steps.artifact-cache.outputs.cache-hit != 'true'
        run: just scrape lib=${{ matrix.entry.lib_id }}  # -version wiring TBD
  consolidate:
    needs: scrape
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with: { go-version-file: go.mod }
      - uses: ./.github/actions/install-native-deps
      # Pattern C: fan-in via REST cache API — see docs/research/batch-scrape-actions.md §4
      - name: Restore all lib caches
        env: { GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} }
        run: |
          # read needs.expand-libs.outputs.libs, loop, fetch each cache archive by key
          # fallback to Pattern B if this proves unworkable — document in PR
      - run: just consolidate db=deadzone.db
      - if: inputs.tag != ''
        run: mise exec -- go run -tags ORT ./cmd/deadzone dbrelease -db deadzone.db -tag ${{ inputs.tag }}
      - name: Summary
        run: echo "| lib | version | status |" >> $GITHUB_STEP_SUMMARY

Concrete file pointers

Files to create / modify:

  • .github/workflows/scrape-pack.yml (new)
  • cmd/deadzone/scrape.go — add minimal -list flag that calls scraper.Resolve and prints JSON
  • README.md — 1 line in Build from source
  • CLAUDE.md — 1 line in Build & run

Files to read as reference (do NOT refactor):

  • cmd/deadzone/scrape.go — flag surface: -lib, -version, -config, -artifacts, -parallel-github-md, -parallel-scrape-via-agent
  • cmd/deadzone/consolidate.go — flag surface: -db, -artifacts
  • cmd/deadzone/dbrelease.go — invocation pattern for publish step
  • internal/scraper/config.go — LoadConfig + Resolve (called by the new -list flag)
  • internal/packs/paths.go — packs.Slug(libID) for constructing cache-key slugs
  • internal/packs/releaser.go — GHReleaser (already wired into dbrelease.go)
  • .github/workflows/ci.yml L114–130 — cache keys to copy verbatim
  • justfile — recipes scrape, consolidate, dbrelease
  • docs/research/batch-scrape-actions.md — decision log with fan-in pattern analysis

Test commands (literal, for agent self-check)

  • mise exec -- go build -tags ORT ./... — compiles
  • just test -short — passes
  • mise exec -- go run -tags ORT ./cmd/deadzone scrape -list — emits JSON array on stdout, exits 0 (after -list flag is added)
  • Dry-run sans tag: gh workflow run scrape-pack.yml --ref <branch> -f lib=/modelcontextprotocol/go-sdk → run completes, no release pushed, summary shows 1 scraped, 0 cached, 0 failed
  • Dry-run cache hit (re-run same dispatch): summary shows 0 scraped, 1 cached, 0 failed
  • Full E2E with publish on a scratch tag: gh workflow run scrape-pack.yml --ref <branch> -f tag=v0.0.0-testpack → gh release view v0.0.0-testpack shows deadzone.db + deadzone.db.sha256. Delete the release afterwards.

Out of scope (fenced)

  • No cron trigger — deferred until #47 (Research: automated freshness detection and refresh triggers at Context7-scale) ships
  • Do NOT revive internal/packs/upload.go — it stays behind errPerArtifactDisabled
  • No self-hosted runner support — separate issue if ever needed
  • No per-PR ephemeral packs — different shape, file separately
  • No changes to release.yml — existing binary release flow is orthogonal
  • No refactor of deadzone scrape / consolidate / dbrelease beyond the minimal -list addition
  • No new Go dependency
  • No upload-artifact between jobs — transport is cache only. If Pattern B (fan-in staging) proves necessary, document the deviation in the PR body; do not make it the default path
  • No new cache key for hugot model or ORT lib — reuse verbatim from ci.yml L114–130

Open sub-decisions for the implementer

  • Fan-in pattern: Pattern C (REST cache API) preferred; Pattern B (matrix + staging) acceptable with PR body justification. Pattern A collapses into B — do not build it.
  • -version wiring in the scrape slot: either extend the justfile recipe to accept a version= kwarg, or invoke go run directly. Implementer picks based on which makes the workflow YAML cleanest.
  • -list flag default output format: JSON array of {lib_id, version, slug}. Add fields only if the matrix consumer needs them.

Labels: P3 Low (nice-to-have, when time allows), feature (New feature)
