Research: Batch scrape pipeline via GitHub Actions matrix #53

@laradji

Description

Revised 2026-04-16 (post-#101)

The original sketch below is historical context. #101 (merged 2026-04-13) paused per-artifact distribution — the packs rolling release this workflow was designed to push into no longer exists. The research conclusion has shifted: the matrix still makes sense, but it now produces per-lib caches (via actions/cache, not upload-artifact) that are consolidated into a single deadzone.db and published via deadzone dbrelease only when the operator passes a tag input.
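A minimal sketch of what the tag-gated publish could look like, assuming a `workflow_dispatch` input named `tag` and a hypothetical `deadzone dbrelease --tag` flag (neither is shipped syntax):

```yaml
on:
  workflow_dispatch:
    inputs:
      tag:
        description: 'Tag to publish the consolidated deadzone.db under (empty = build only)'
        required: false

jobs:
  publish:
    # Skip publishing entirely unless the operator passed a tag.
    if: ${{ inputs.tag != '' }}
    runs-on: ubuntu-latest
    steps:
      - run: deadzone dbrelease --tag '${{ inputs.tag }}'
```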

Leave this issue open until #126 ships; close as completed at that point.


Use GitHub Actions as the execution environment for the per-lib scrape pipeline, replacing "developer runs the scraper on their laptop" with "the workflow runs on GH-hosted Linux runners and uploads artifacts to GitHub Releases".

Parent: #15
Composes with: #27, #28 (✅ done in #56), #30, #47, #51 (✅ done in #54)

Why GH Actions is a near-perfect fit

For a public repo, GitHub gives us for free:

  • Linux runner compute — unlimited minutes, no quota, no card on file
  • Cache — actions/cache@v5 already keys the MiniLM ONNX weights on internal/embed/hugot.go, so warm starts are ~5s
  • Storage — GH Releases assets up to ~2 GB each, hundreds of GB per release before GitHub asks questions
  • Orchestration — strategy.matrix natively fans jobs out across libs
  • Cron triggers — built-in scheduled workflows
  • Secrets — for the scrape-via-agent (#27) LLM endpoint API key
  • Distribution — GitHub Releases CDN, no extra infrastructure

The result: a fully serverless, fully free batch scrape + index + publish pipeline that runs on a schedule and produces the per-lib artifacts users consume.

How it composes with existing issues

.github/workflows/scrape-pack.yml         (cron + workflow_dispatch)
│
├── job: load-sources
│   └── reads libraries_sources.yaml (#51 ✅ done)  ← input
│
├── job: scrape (matrix: per lib)
│   ├── checks out the repo
│   ├── caches the embedding model
│   ├── go run ./cmd/scraper -lib /org/project    ← #27 (scrape-via-agent) + #51 ✅
│   └── uploads .db as a workflow artifact ← #28 ✅ per-lib artifact
│
└── job: publish
    ├── downloads all matrix artifacts
    ├── gh release upload packs *.db              ← #30
    └── updates artifacts/manifest.yaml

Each existing issue plays its role:

The workflow doesn't introduce new architecture; it just wires the existing pieces together inside a runner.

Sketch of the workflow

name: scrape-pack

on:
  schedule:
    - cron: '0 4 * * 1'  # weekly Monday 04:00 UTC
  workflow_dispatch:
    inputs:
      lib:
        description: 'Single lib_id to scrape (empty = all)'
        required: false

permissions:
  contents: write  # to push to releases

jobs:
  load-sources:
    runs-on: ubuntu-latest
    outputs:
      libs: ${{ steps.parse.outputs.libs }}
    steps:
      - uses: actions/checkout@v6
      - id: parse
        run: |
          # Reads libraries_sources.yaml, emits a JSON list of lib_ids for the matrix
          libs=$(yq -o=json '[.libraries[].lib_id]' libraries_sources.yaml)
          echo "libs=$libs" >> "$GITHUB_OUTPUT"

  scrape:
    needs: load-sources
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: 20
      matrix:
        lib: ${{ fromJSON(needs.load-sources.outputs.libs) }}
    env:
      DEADZONE_HUGOT_CACHE: ${{ github.workspace }}/.deadzone-cache/models
      DEADZONE_AGENT_ENDPOINT: ${{ secrets.DEADZONE_AGENT_ENDPOINT }}
      DEADZONE_AGENT_ENDPOINT_MODEL: ${{ secrets.DEADZONE_AGENT_ENDPOINT_MODEL }}
      DEADZONE_AGENT_ENDPOINT_API_KEY: ${{ secrets.DEADZONE_AGENT_ENDPOINT_API_KEY }}
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with:
          go-version-file: go.mod
      - uses: actions/cache@v5
        with:
          path: ${{ env.DEADZONE_HUGOT_CACHE }}
          key: hugot-model-${{ runner.os }}-${{ hashFiles('internal/embed/hugot.go') }}
      - name: Scrape ${{ matrix.lib }}
        run: go run ./cmd/scraper -config libraries_sources.yaml -lib '${{ matrix.lib }}' -artifacts artifacts/
      - uses: actions/upload-artifact@v4
        with:
          name: db-${{ matrix.lib }}
          path: artifacts/

  publish:
    needs: scrape
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/download-artifact@v4
        with:
          path: artifacts/
          merge-multiple: true
      - name: Upload to packs release
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh release upload packs artifacts/*.db --clobber
          # Optionally regenerate artifacts/manifest.yaml here and commit

This is a sketch — the real implementation needs more error handling, manifest regeneration, and probably splits the publish job into "release upload" + "manifest commit" so the manifest PR is reviewable. The -config and -artifacts flags shown above are the ones shipped by #54 (#51) and #56 (#28) respectively.
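The load-sources parse step's yq one-liner can be mirrored in plain Python to show the shape strategy.matrix expects. The inline `sources` dict stands in for libraries_sources.yaml; its exact schema beyond the `lib_id` field is an assumption here:

```python
import json

# Equivalent of: yq -o=json '[.libraries[].lib_id]' libraries_sources.yaml
# (the parsed YAML is inlined here; in the workflow it comes from the file)
sources = {
    "libraries": [
        {"lib_id": "/org/project", "kind": "raw"},
        {"lib_id": "/org/other", "kind": "scrape-via-agent"},
    ]
}

libs = [lib["lib_id"] for lib in sources["libraries"]]

# strategy.matrix consumes this via fromJSON(), so emit a compact JSON array.
print(json.dumps(libs))  # → ["/org/project", "/org/other"]
```

Whatever shape the real file has, the contract with the matrix is just "a JSON array of lib_ids on the job output".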

Sizing / constraints

  • Embedding throughput: MiniLM-L6 on a 4 vCPU ubuntu-latest runner does ~50–100 docs/second. A 50-doc lib finishes in ~1s of inference; the matrix wall time is dominated by checkout + setup-go + cache restore (~30s) + the scrape itself.
  • Matrix concurrency: a single matrix can generate at most 256 jobs per workflow run, and the free-plan concurrent-job cap plus the org-wide queue make far fewer practical. 20 parallel jobs is a sensible default.
  • Wall time per job: max 6h enforced by GitHub. With one lib per matrix job, this is plenty (a 1000-doc lib finishes in ~30s of inference).
  • Total wall time: with max-parallel: 20, scraping 200 libs takes ~10 min if each lib is fast. Heavier libs scale linearly.
  • Rate limits: the bottleneck is not GH Actions. It's the LLM endpoint behind scrape-via-agent (#27) if used. Self-hosted Ollama removes this limit entirely; paid APIs add a cost dimension that the matrix should respect.
  • Cache persistence: actions/cache is keyed on hugot.go, so the model download happens once per change to the embedder pin and is reused across runs. ~90 MB cache.
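The ~10 min total wall time follows from simple arithmetic; a quick sanity check (all numbers are the estimates from the list above, not measurements):

```python
import math

libs = 200            # libs to scrape
max_parallel = 20     # matrix concurrency cap from the workflow
per_job_s = 60        # ~30s checkout + setup-go + cache restore, plus a fast scrape

# With max-parallel set, jobs run in waves of at most max_parallel.
waves = math.ceil(libs / max_parallel)
total_min = waves * per_job_s / 60

print(waves, total_min)  # → 10 waves, ~10 minutes of wall time
```

Heavier libs stretch `per_job_s` for their wave only, which is why the section says they scale linearly rather than serializing the whole run.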

Open questions to resolve before implementing

Acceptance criteria

Dependencies (must land first)

So this issue is now gated by 2 other features all landing first (down from 3 — #28 has shipped). It's a "post-foundation" feature, not a v1 thing.

Out of scope

  • GitHub Pages hosting for serving the artifacts as a static site — different distribution model, not the GH Releases approach
  • Per-PR scrape preview — interesting but adds complexity (one ephemeral pack per PR); file separately if useful
  • Cost reporting / runtime budgets — public repo Linux Actions are free, no budget needed
  • Self-hosted GPU runners — unrealistic for a free OSS project; maybe a separate research issue if anyone ever proposes it

Related

Metadata

Labels

P3 (Low — nice-to-have, when time allows) · research (Research / spike)
