
feat(ci): add scrape-pack workflow for batch registry scrapes #127

Merged
laradji merged 2 commits into main from code/feat-feat-githubworkflowsscrape-pack-5zk on Apr 16, 2026

Conversation

@laradji laradji commented Apr 16, 2026

Summary

Adds a workflow_dispatch-only GitHub Actions workflow that scrapes every resolved lib in parallel on GH-hosted runners, consolidates the artifacts into a single deadzone.db, and optionally publishes it to a GitHub Release.

Design decisions are pinned in docs/research/batch-scrape-actions.md and tracked in #126.

Changes

  • .github/workflows/scrape-pack.yml — three-job pipeline:
    • expand-libs — resolves libraries_sources.yaml into a JSON matrix via the new scrape -list flag
    • scrape — matrix job (max-parallel: 20, fail-fast: false) with per-lib artifact cache keyed on libraries_sources.yaml + internal/embed/hugot.go hashes; uses upload-artifact as inter-job scratch transport (Pattern B — Pattern C via REST cache API is not buildable, see research doc §4)
    • consolidate — runs always() on partial matrix failures, fetches staged artifacts, runs deadzone consolidate, fires deadzone dbrelease only when inputs.tag is non-empty, and writes a per-slot status table to $GITHUB_STEP_SUMMARY
  • cmd/deadzone/scrape.go — new -list flag emits the resolved {lib_id, version, slug} matrix as single-line JSON and short-circuits before embedder/agent setup (no model cache or network needed for listing)
  • justfile — scrape recipe now accepts version=X to pin to a single expanded version
  • README.md — documents the gh workflow run scrape-pack.yml -f tag=<tag> entry point
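
The three-job wiring above might look like the following sketch (the job graph, matrix settings, and the -list flag are from the PR description; step details, input wiring, and the exact scrape/consolidate invocations are assumptions):

```yaml
name: scrape-pack
on:
  workflow_dispatch:
    inputs:
      tag:
        description: release tag to attach deadzone.db to (empty skips publishing)
        required: false
        default: ""

jobs:
  expand-libs:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.list.outputs.matrix }}
    steps:
      - uses: actions/checkout@v4
      - id: list
        # scrape -list prints the resolved {lib_id, version, slug} matrix as single-line JSON
        run: echo "matrix=$(go run ./cmd/deadzone scrape -list -config libraries_sources.yaml)" >> "$GITHUB_OUTPUT"

  scrape:
    needs: expand-libs
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: 20
      matrix:
        include: ${{ fromJSON(needs.expand-libs.outputs.matrix) }}
    steps:
      - uses: actions/checkout@v4
      - run: go run ./cmd/deadzone scrape -lib "${{ matrix.lib_id }}"   # exact flags assumed

  consolidate:
    needs: scrape
    if: always()   # run even when some matrix slots failed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: go run ./cmd/deadzone consolidate   # exact invocation assumed
```

The per-lib caching and artifact staging steps described in the scrape bullet are elided here; only the job graph and matrix plumbing are shown.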

Concurrency & safety

  • concurrency: scrape-pack queues dispatches serially; parallel runs would contend for the same cache keys, and dbrelease --clobber would overwrite the same release asset names
  • Empty tag input stops at the consolidated-db cache (no side effects on the releases page)
  • permissions: contents: write is scoped only to enable the release-asset write path
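
A sketch of the top-level keys implied by the two workflow-level bullets above (the group name is an assumption):

```yaml
concurrency:
  group: scrape-pack        # all dispatches share one group
  cancel-in-progress: false # queue new runs instead of cancelling the active one

permissions:
  contents: write           # only needed for the release-asset write path
```

One caveat worth knowing: GitHub keeps at most one pending run per concurrency group, so rapid-fire dispatches collapse into the latest pending run rather than forming a long queue.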

Test plan

  • gh workflow run scrape-pack.yml (no tag) completes and produces a consolidated-db cache + summary table
  • gh workflow run scrape-pack.yml -f lib=/hashicorp/terraform restricts the matrix to a single lib
  • gh workflow run scrape-pack.yml -f tag=vX.Y.Z uploads deadzone.db to the existing release vX.Y.Z
  • Induced scrape failure in one matrix slot still lets consolidate run and surfaces the slot as failed in the summary
  • just scrape lib=/org/project version=1.14 runs locally against the new justfile signature
  • go run ./cmd/deadzone scrape -list -config libraries_sources.yaml emits valid single-line JSON

Fixes #126

Nacer Laradji added 2 commits April 16, 2026 21:06
upload-artifact@v4 computes the archive root as the LCA of matched
files. With path: artifacts/<slug>, the LCA collapsed to that dir and
stripped the slug prefix from every entry, so every slot's artifact.db
would have collided at the same root after download-artifact
merge-multiple in consolidate — db.Consolidate would have seen only
the last slot's payload.

Upload/download now use path: artifacts/, and a per-slot
artifacts/.pack-root sentinel pins the LCA explicitly so the anchor
does not rely on artifacts/manifest.yaml happening to be present.
db.Consolidate's <dir>/*/artifact.db glob ignores the sentinel.
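
Under the fix described in the commit message, the upload/download pair might look like this sketch (step names and artifact naming are assumptions; path: artifacts/, the .pack-root sentinel, and merge-multiple are from the commit message):

```yaml
# per-slot scrape job: stage the db under a slug dir, pin the archive root
- name: Stage artifact
  run: |
    mkdir -p "artifacts/${{ matrix.slug }}"
    cp artifact.db "artifacts/${{ matrix.slug }}/artifact.db"
    touch artifacts/.pack-root   # sentinel keeps the archive LCA at artifacts/
- uses: actions/upload-artifact@v4
  with:
    name: scrape-${{ matrix.slug }}
    path: artifacts/

# consolidate job: merge every slot back into one tree
- uses: actions/download-artifact@v4
  with:
    path: artifacts
    merge-multiple: true   # slug dirs stay distinct; db.Consolidate globs */artifact.db
```

Because every upload now matches at least two entries rooted at artifacts/ (the slug dir and the sentinel), the computed archive root can no longer collapse into the slug dir itself.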
@laradji laradji merged commit 0e12bc2 into main Apr 16, 2026
4 checks passed
@laradji laradji deleted the code/feat-feat-githubworkflowsscrape-pack-5zk branch April 16, 2026 20:27
