The original sketch below is historical context. #101 (merged 2026-04-13) paused per-artifact distribution — the `packs` rolling release this workflow was designed to push into no longer exists. The research conclusion has shifted: the matrix still makes sense, but it now produces per-lib caches (via actions/cache, not upload-artifact) that are consolidated into a single `deadzone.db` and published via `deadzone db release` only when the operator passes a tag input.
Leave this issue open until #126 ships; close as completed at that point.
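Under the updated design, the tag gate could be wired as a workflow_dispatch input. This is a hedged sketch, not the shipped workflow: the input name `tag`, the cron schedule, and the placeholder release step are all illustrative.

```yaml
on:
  schedule:
    - cron: "0 4 * * 1"   # illustrative: weekly cache refresh, no release
  workflow_dispatch:
    inputs:
      tag:
        description: "Release tag; leave empty to only refresh caches"
        required: false
        default: ""

jobs:
  publish:
    # Runs only when the operator passed a tag, matching the
    # "publish only on explicit tag input" rule described above.
    # Scheduled runs have no inputs, so this job is skipped for them.
    if: ${{ inputs.tag != '' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "would run the release step for ${{ inputs.tag }}"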
Use GitHub Actions as the execution environment for the per-lib scrape pipeline, replacing "developer runs the scraper on their laptop" with "the workflow runs on GH-hosted Linux runners and uploads artifacts to GitHub Releases".
Why GH Actions is a near-perfect fit
For a public repo, GitHub gives us for free:
Compute — GH-hosted Linux runners
Caching — actions/cache@v5 already keys the MiniLM ONNX weights on internal/embed/hugot.go, so warm starts are ~5s
Fan-out — strategy.matrix natively fans jobs out across libs
Secrets — an LLM endpoint API key for scrape-via-agent (#27)
Distribution — GitHub Releases CDN, no extra infrastructure
The result: a fully serverless, fully free batch scrape + index + publish pipeline that runs on a schedule and produces the per-lib artifacts users consume.
How it composes with existing issues
Each existing issue plays its role:
#51 (libraries_sources.yaml config file) ✅ done — libraries_sources.yaml is the input the workflow reads to know what to scrape (already exists in main since #54)
#27 (scrape-via-agent) — handles HTML sources via the LLM endpoint (Ollama, vLLM, OpenAI-compatible)
#28 (per-lib artifacts) ✅ done — per-lib .db files land in artifacts/, exactly the format the workflow needs (shipped in #56)
#30 — the packs release the publish job uploads into
The workflow doesn't introduce new architecture, it just wires the existing pieces together inside a runner.
Sketch of the workflow
.github/workflows/scrape-pack.yml (cron + workflow_dispatch)
│
├── job: load-sources
│ └── reads libraries_sources.yaml (#51 ✅ done) ← input
│
├── job: scrape (matrix: per lib)
│ ├── checks out the repo
│ ├── caches the embedding model
│ ├── go run ./cmd/scraper -config libraries_sources.yaml -artifacts artifacts/ -lib /org/project ← #27 (scrape-via-agent) + #51 ✅
│ └── uploads .db as a workflow artifact ← #28 ✅ per-lib artifact
│
└── job: publish
├── downloads all matrix artifacts
├── gh release upload packs *.db ← #30
└── updates artifacts/manifest.yaml
This is a sketch — the real implementation needs more error handling, manifest regeneration, and probably splits the publish job into "release upload" + "manifest commit" so the manifest PR is reviewable. The -config and -artifacts flags shown above are the ones shipped by #54 (#51) and #56 (#28) respectively.
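A minimal skeleton consistent with the tree above, hedged: the `libraries:`/`name:` schema assumed by the yq step, the cache path, and the artifact naming are illustrative guesses; the -config/-artifacts flags and the packs release name come from the sketch itself.

```yaml
name: scrape-pack
on:
  schedule:
    - cron: "0 4 * * 1"
  workflow_dispatch: {}

jobs:
  load-sources:
    runs-on: ubuntu-latest
    outputs:
      libs: ${{ steps.list.outputs.libs }}
    steps:
      - uses: actions/checkout@v4
      # Assumes a top-level `libraries:` list with `name:` keys;
      # yq is preinstalled on GH-hosted Ubuntu runners.
      - id: list
        run: echo "libs=$(yq -o=json -I0 '[.libraries[].name]' libraries_sources.yaml)" >> "$GITHUB_OUTPUT"

  scrape:
    needs: load-sources
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      max-parallel: 20
      matrix:
        lib: ${{ fromJSON(needs.load-sources.outputs.libs) }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
      - uses: actions/cache@v5
        with:
          path: ~/.cache/deadzone-models   # illustrative cache path
          key: embed-model-${{ hashFiles('internal/embed/hugot.go') }}
      - run: go run ./cmd/scraper -config libraries_sources.yaml -artifacts artifacts/ -lib "${{ matrix.lib }}"
      # Artifact names cannot contain slashes, so index by matrix position.
      - uses: actions/upload-artifact@v4
        with:
          name: db-${{ strategy.job-index }}
          path: artifacts/*.db

  publish:
    needs: scrape
    runs-on: ubuntu-latest
    steps:
      # Each artifact lands in its own subdirectory of artifacts/.
      - uses: actions/download-artifact@v4
        with:
          path: artifacts
      - run: gh release upload packs artifacts/*/*.db --clobber -R "${{ github.repository }}"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

The manifest-commit half of publish is left out here on purpose, since the text above suggests splitting it into its own reviewable step.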
Sizing / constraints
Embedding throughput: MiniLM-L6 on a 4 vCPU ubuntu-latest runner does ~50–100 docs/second. A 50-doc lib finishes in ~1s of inference; the matrix wall time is dominated by checkout + setup-go + cache restore (~30s) + the scrape itself.
Matrix concurrency: a matrix can generate at most 256 jobs per workflow run, and free-plan runners cap out at 20 concurrent jobs anyway (fewer in practice due to the org-wide queue), so max-parallel: 20 is a sensible default.
Wall time per job: max 6h enforced by GitHub. With one lib per matrix job, this is plenty (a 1000-doc lib finishes in ~30s of inference).
Total wall time: with max-parallel: 20, scraping 200 libs takes ~10 min if each lib is fast. Heavier libs scale linearly.
LLM throughput: scrape-via-agent (#27) jobs are bounded by the endpoint's rate limits if used. Self-hosted Ollama removes this limit entirely; paid APIs add a cost dimension that the matrix should respect.
Cache persistence: actions/cache is keyed on hugot.go, so the model download happens once per change to the embedder pin and is reused across runs. ~90 MB cache.
Open questions to resolve before implementing
Where does manifest.yaml get regenerated — in the same workflow or a downstream PR-creating workflow? (Affects review surface for #30.)
How to handle failed libs — fail-fast: false lets other libs proceed, but a failed lib should still be visible. Probably a summary job that posts an issue comment or fails the run if too many libs failed.
Concurrency policy — concurrency: scrape-pack to prevent two cron runs from racing. Probably needed.
Self-hosted runner support — for very large corpora or non-public libs, self-hosted runners would let users run the same workflow on their own infra. Worth supporting via a runs-on matrix or label switch.
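The concurrency guard and the failure-summary idea could be sketched as follows; the group name and job name are illustrative:

```yaml
# Top-level guard: a second cron tick queues behind an in-flight
# run instead of racing it (cancel-in-progress stays false so a
# running scrape is never killed mid-matrix).
concurrency:
  group: scrape-pack
  cancel-in-progress: false

jobs:
  summarize:
    needs: scrape
    if: ${{ always() }}   # run even when matrix legs failed
    runs-on: ubuntu-latest
    steps:
      # With fail-fast: false the matrix keeps going, so surface
      # failures here by failing the run as a whole.
      - run: |
          if [ "${{ needs.scrape.result }}" != "success" ]; then
            echo "::error::one or more libs failed to scrape"
            exit 1
          fi
```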
Acceptance criteria
.github/workflows/scrape-pack.yml exists, runs on cron + workflow_dispatch
Reads libraries_sources.yaml (#51 ✅ done) and fans out via matrix
Each matrix job uploads a per-lib .db artifact (#28 ✅ done)
The publish job uploads to the packs rolling release (#30)
The output matches what just scrape would produce locally
Required secrets are documented (docs/contributing.md) so future maintainers can configure them
Parent: #15
Composes with: #27, #28 (✅ done in #56), #30, #47, #51 (✅ done in #54)
Dependencies (must land first)
#51 (libraries_sources.yaml config file) — ✅ done in #54; the input file now exists in main
#28 (per-lib artifacts) — ✅ done in #56; cmd/scraper -artifacts and cmd/consolidate both exist in main
#27 (scrape-via-agent) — only required for non-raw-markdown sources; without it the matrix can only handle github-md sources
So this issue is now gated by 2 other features all landing first (down from 3 — #28 has shipped). It's a "post-foundation" feature, not a v1 thing.
Out of scope
Related
#27 (scrape-via-agent), #30 (GitHub Releases distribution) — the two remaining features that must land before this can be implemented
#51 (libraries_sources.yaml config file) ✅ done in #54 — provided the input file format this workflow consumes
#56 — the cmd/consolidate step the workflow chains into
#51 + #52 (Research: library registry — format, schema, maintenance, sharing) — using GH Actions free Linux compute as the batch scrape execution environment is too well-aligned with the existing architecture to ignore