Research: Batch scrape pipeline via GitHub Actions matrix #53

@laradji

Description

Revised 2026-04-16 (post-#101)

The original sketch below is historical context. #101 (merged 2026-04-13) paused per-artifact distribution — the packs rolling release this workflow was designed to push into no longer exists. The research conclusion has shifted: the matrix still makes sense, but it now produces per-lib caches (via actions/cache, not upload-artifact) that are consolidated into a single deadzone.db and published via deadzone dbrelease only when the operator passes a tag input.
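A minimal sketch of what the tag-gated publish could look like, assuming a `workflow_dispatch` input named `tag` and a hypothetical `deadzone dbrelease --tag` flag (neither is shipped syntax):

```yaml
on:
  workflow_dispatch:
    inputs:
      tag:
        description: 'Tag to publish the consolidated deadzone.db under (empty = build only)'
        required: false

jobs:
  publish:
    # Skip publishing entirely unless the operator passed a tag.
    if: ${{ inputs.tag != '' }}
    runs-on: ubuntu-latest
    steps:
      - run: deadzone dbrelease --tag '${{ inputs.tag }}'
```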

Leave this issue open until #126 ships; close as completed at that point.


Use GitHub Actions as the execution environment for the per-lib scrape pipeline, replacing "developer runs the scraper on their laptop" with "the workflow runs on GH-hosted Linux runners and uploads artifacts to GitHub Releases".

Parent: #15
Composes with: #27, #28 (✅ done in #56), #30, #47, #51 (✅ done in #54)

Why GH Actions is a near-perfect fit

For a public repo, GitHub gives us for free:

  • Linux runner compute — unlimited minutes, no quota, no card on file
  • Cache — actions/cache@v5 already keys the MiniLM ONNX weights on internal/embed/hugot.go, so warm starts are ~5s
  • Storage — GH Releases assets up to ~2 GB each, hundreds of GB per release before GitHub asks questions
  • Orchestration — strategy.matrix natively fans jobs out across libs
  • Cron triggers — built-in scheduled workflows
  • Secrets — for the scrape-via-agent (#27) LLM endpoint API key
  • Distribution — GitHub Releases CDN, no extra infrastructure

The result: a fully serverless, fully free batch scrape + index + publish pipeline that runs on a schedule and produces the per-lib artifacts users consume.

How it composes with existing issues

.github/workflows/scrape-pack.yml         (cron + workflow_dispatch)
│
├── job: load-sources
│   └── reads libraries_sources.yaml (#51 ✅ done)  ← input
│
├── job: scrape (matrix: per lib)
│   ├── checks out the repo
│   ├── caches the embedding model
│   ├── go run ./cmd/scraper -lib /org/project    ← #27 (scrape-via-agent) + #51 ✅
│   └── uploads .db as a workflow artifact ← #28 ✅ per-lib artifact
│
└── job: publish
    ├── downloads all matrix artifacts
    ├── gh release upload packs *.db              ← #30
    └── updates artifacts/manifest.yaml

Each existing issue plays its role:

The workflow doesn't introduce new architecture; it just wires the existing pieces together inside a runner.

Sketch of the workflow

name: scrape-pack

on:
  schedule:
    - cron: '0 4 * * 1'  # weekly Monday 04:00 UTC
  workflow_dispatch:
    inputs:
      lib:
        description: 'Single lib_id to scrape (empty = all)'
        required: false

permissions:
  contents: write  # to push to releases

jobs:
  load-sources:
    runs-on: ubuntu-latest
    outputs:
      libs: ${{ steps.parse.outputs.libs }}
    steps:
      - uses: actions/checkout@v6
      - id: parse
        run: |
          # Reads libraries_sources.yaml, emits a JSON list of lib_ids for the matrix
          libs=$(yq -o=json '[.libraries[].lib_id]' libraries_sources.yaml)
          echo "libs=$libs" >> "$GITHUB_OUTPUT"

  scrape:
    needs: load-sources
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: 20
      matrix:
        lib: ${{ fromJSON(needs.load-sources.outputs.libs) }}
    env:
      DEADZONE_HUGOT_CACHE: ${{ github.workspace }}/.deadzone-cache/models
      DEADZONE_AGENT_ENDPOINT: ${{ secrets.DEADZONE_AGENT_ENDPOINT }}
      DEADZONE_AGENT_ENDPOINT_MODEL: ${{ secrets.DEADZONE_AGENT_ENDPOINT_MODEL }}
      DEADZONE_AGENT_ENDPOINT_API_KEY: ${{ secrets.DEADZONE_AGENT_ENDPOINT_API_KEY }}
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v6
        with:
          go-version-file: go.mod
      - uses: actions/cache@v5
        with:
          path: ${{ env.DEADZONE_HUGOT_CACHE }}
          key: hugot-model-${{ runner.os }}-${{ hashFiles('internal/embed/hugot.go') }}
      - name: Scrape ${{ matrix.lib }}
        run: go run ./cmd/scraper -config libraries_sources.yaml -lib '${{ matrix.lib }}' -artifacts artifacts/
      - uses: actions/upload-artifact@v4
        with:
          name: db-${{ matrix.lib }}
          path: artifacts/

  publish:
    needs: scrape
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/download-artifact@v4
        with:
          path: artifacts/
          merge-multiple: true
      - name: Upload to packs release
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh release upload packs artifacts/*.db --clobber
          # Optionally regenerate artifacts/manifest.yaml here and commit

This is a sketch — the real implementation needs more error handling, manifest regeneration, and probably splits the publish job into "release upload" + "manifest commit" so the manifest PR is reviewable. The -config and -artifacts flags shown above are the ones shipped by #54 (#51) and #56 (#28) respectively.
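The load-sources parse step's yq one-liner can be mirrored in plain Python to show the shape strategy.matrix expects. The inline `sources` dict stands in for libraries_sources.yaml; its exact schema beyond the `lib_id` field is an assumption here:

```python
import json

# Equivalent of: yq -o=json '[.libraries[].lib_id]' libraries_sources.yaml
# (the parsed YAML is inlined here; in the workflow it comes from the file)
sources = {
    "libraries": [
        {"lib_id": "/org/project", "kind": "raw"},
        {"lib_id": "/org/other", "kind": "scrape-via-agent"},
    ]
}

libs = [lib["lib_id"] for lib in sources["libraries"]]

# strategy.matrix consumes this via fromJSON(), so emit a compact JSON array.
print(json.dumps(libs))  # → ["/org/project", "/org/other"]
```

Whatever shape the real file has, the contract with the matrix is just "a JSON array of lib_ids on the job output".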

Sizing / constraints

  • Embedding throughput: MiniLM-L6 on a 4 vCPU ubuntu-latest runner does ~50–100 docs/second. A 50-doc lib finishes in ~1s of inference; the matrix wall time is dominated by checkout + setup-go + cache restore (~30s) + the scrape itself.
  • Matrix concurrency: a single matrix can generate at most 256 jobs per workflow run, and the free-plan concurrent-job cap plus the org-wide queue make far fewer practical. 20 parallel jobs is a sensible default.
  • Wall time per job: max 6h enforced by GitHub. With one lib per matrix job, this is plenty (a 1000-doc lib finishes in ~30s of inference).
  • Total wall time: with max-parallel: 20, scraping 200 libs takes ~10 min if each lib is fast. Heavier libs scale linearly.
  • Rate limits: the bottleneck is not GH Actions. It's the LLM endpoint behind scrape-via-agent (#27) if used. Self-hosted Ollama removes this limit entirely; paid APIs add a cost dimension that the matrix should respect.
  • Cache persistence: actions/cache is keyed on hugot.go, so the model download happens once per change to the embedder pin and is reused across runs. ~90 MB cache.
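The ~10 min total wall time follows from simple arithmetic; a quick sanity check (all numbers are the estimates from the list above, not measurements):

```python
import math

libs = 200            # libs to scrape
max_parallel = 20     # matrix concurrency cap from the workflow
per_job_s = 60        # ~30s checkout + setup-go + cache restore, plus a fast scrape

# With max-parallel set, jobs run in waves of at most max_parallel.
waves = math.ceil(libs / max_parallel)
total_min = waves * per_job_s / 60

print(waves, total_min)  # → 10 waves, ~10 minutes of wall time
```

Heavier libs stretch `per_job_s` for their wave only, which is why the section says they scale linearly rather than serializing the whole run.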

Open questions to resolve before implementing

Acceptance criteria

Dependencies (must land first)

So this issue is now gated by 2 other features all landing first (down from 3 — #28 has shipped). It's a "post-foundation" feature, not a v1 thing.

Out of scope

  • GitHub Pages hosting for serving the artifacts as a static site — different distribution model, not the GH Releases approach
  • Per-PR scrape preview — interesting but adds complexity (one ephemeral pack per PR); file separately if useful
  • Cost reporting / runtime budgets — public repo Linux Actions are free, no budget needed
  • Self-hosted GPU runners — unrealistic for a free OSS project; maybe a separate research issue if anyone ever proposes it

Related

Metadata

Labels

P3 (Low — nice-to-have, when time allows) · research (Research / spike)
