
feat(website): python seeder with pango lineages and test suite #1203

Open
fhennig wants to merge 13 commits into main from feat/pango-lineage-seeder


@fhennig fhennig commented May 7, 2026

Dependencies

⚠️ This PR depends on #1201 (backend support for POST /users/sync and collection upsert via PUT /collections/{id}). Merge that first.

Also, I (Felix) haven't reviewed this myself yet. It's just vibe-coded so far.

Summary

  • Replaces the JS seed.mjs with a unified Python seeder in collection-seeding/ (renamed from example-data/)
  • Modular source architecture — new data sources can be added as modules in sources/
  • Two sources implemented:
    • covid-resistance-mutations — port of the original JS resistance mutation data (3CLpro, RdRp, Spike mAb)
    • covid-pango-lineages — fetches ~4,976 pango lineage definitions from corneliusroemer/pango-sequences, one collection per lineage with nucleotide substitutions as variants
  • CLI uses argparse subcommands (covid-pango-lineages, covid-resistance-mutations); no subcommand runs all sources
  • Upserts collections (create or update by name) via POST /collections and PUT /collections/{id}
  • Calls POST /users/sync before any backend interaction to obtain the internal user ID (using the genspectrum-bot account, GitHub ID 218605180)
  • Uses pixi for dependency management with a multi-stage Docker build (pixi builder → python:3.13-slim)
  • Typed throughout with TypedDict (Collection, Variant, FilterObject, ExistingCollection)
  • 38 tests covering HTTP interactions, mutation name math, collection building, and upsert orchestration
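As a rough illustration of the TypedDict typing mentioned above, the four types might look something like the sketch below. Only the type names (Collection, Variant, FilterObject, ExistingCollection) come from this PR. Every field name is an assumption for illustration, not the actual schema.

```python
from typing import TypedDict


class FilterObject(TypedDict, total=False):
    # Hypothetical filter keys; the real keys are not shown in this PR.
    nucleotideMutations: list[str]
    aminoAcidMutations: list[str]


class Variant(TypedDict):
    # Assumed shape: a display name plus the filter that selects the variant.
    name: str
    filterObject: FilterObject


class Collection(TypedDict):
    # Assumed payload for creating or updating a collection.
    name: str
    description: str
    variants: list[Variant]


class ExistingCollection(Collection):
    # Assumed shape of a collection returned by the backend: same fields plus an id.
    id: int


# Example value; a type checker would flag a missing or misspelled key here.
EXAMPLE: ExistingCollection = {
    "id": 1,
    "name": "B.1",
    "description": "Illustrative collection with a single variant",
    "variants": [
        {"name": "A23403G", "filterObject": {"nucleotideMutations": ["A23403G"]}}
    ],
}
```

TypedDicts are plain dicts at runtime, so payloads like this can be passed straight to a JSON encoder while still being checked statically.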

Test plan

  • pixi run -e test test — all 38 tests pass
  • pixi run seed — seeds resistance mutations + first 10 lineages against a local backend
  • pixi run seed again — all collections updated (upsert)
  • pixi run seed-all-lineages — seeds all ~4,976 lineages
  • docker build -t collection-seeder . && docker run --rm -e BACKEND_URL=http://host.docker.internal:8080 collection-seeder

🤖 Generated with Claude Code

fhennig and others added 10 commits May 7, 2026 11:07
Adds example-data/lineages/seed.py, a Python script that fetches pango
lineage definitions from the upstream summary JSON and creates one
backend collection per lineage (nucleotide substitutions as variants).

Mirrors the patterns of seed.mjs: idempotent, supports --wait,
--url, --user-id, and --limit (default 10 for testing, 0 for all).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
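The `--limit` semantics described in this commit (default 10, 0 for all) could be sketched roughly as follows. The shape of the summary JSON (a mapping from lineage name to an entry with a `nucSubstitutions` list), the one-variant-per-substitution layout, and the function name `build_lineage_collections` are assumptions for illustration.

```python
from typing import Any


def build_lineage_collections(
    summary: dict[str, dict[str, Any]], limit: int = 10
) -> list[dict[str, Any]]:
    """Turn a lineage summary into collection payloads, honoring the limit."""
    lineages = sorted(summary)  # deterministic order so repeated runs match
    if limit > 0:  # 0 means "no limit", as in the commit message
        lineages = lineages[:limit]

    collections = []
    for lineage in lineages:
        # "nucSubstitutions" is an assumed key in the upstream JSON.
        substitutions = summary[lineage].get("nucSubstitutions", [])
        collections.append(
            {
                "name": lineage,
                # Assumed layout: one variant per nucleotide substitution.
                "variants": [
                    {"name": sub, "filterObject": {"nucleotideMutations": [sub]}}
                    for sub in substitutions
                ],
            }
        )
    return collections
```

Sorting before slicing keeps the limited subset stable between runs, which matters if the script is meant to be idempotent.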
…seeder

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the split JS/Python approach with a single Python codebase:

- seed.py: main entry point with argparse subcommands
  (covid-resistance-mutations, covid-pango-lineages)
- backend.py: shared BackendClient (wait, fetch, create)
- sources/resistance_mutations.py: port of seed.mjs resistance data
- sources/pango_lineages.py: pango lineage fetcher
- Dockerfile updated to run python3 seed.py

Running without a subcommand seeds all sources. --limit only applies
to the covid-pango-lineages subcommand (default: 10, 0 = all).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
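A minimal sketch of how the subcommand layout described in this commit could be wired up with argparse. The subcommand names and the `--limit` default come from the commit message. The `--url` default, treating `--wait` as a boolean flag, and the helper names are assumptions.

```python
import argparse

SOURCES = ("covid-resistance-mutations", "covid-pango-lineages")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="seed.py")
    parser.add_argument("--url", default="http://localhost:8080")  # assumed default
    parser.add_argument("--wait", action="store_true")  # assumed to be a flag

    # Subparsers are optional by default, so omitting the subcommand leaves
    # args.source as None, which is read as "run all sources".
    sub = parser.add_subparsers(dest="source")
    sub.add_parser("covid-resistance-mutations")
    lineages = sub.add_parser("covid-pango-lineages")
    lineages.add_argument("--limit", type=int, default=10, help="0 = all")
    return parser


def selected_sources(args: argparse.Namespace) -> tuple[str, ...]:
    """Return the sources to seed: the chosen subcommand, or all of them."""
    return (args.source,) if args.source else SOURCES
```

Note that in this layout `args.limit` only exists when the `covid-pango-lineages` subcommand is given, so a run-all code path would need its own default.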
- pixi.toml with [workspace] config, python 3.13, requests via PyPI
- pixi.lock committed for reproducibility
- Dockerfile updated to multi-stage: pixi builder copies site-packages
  into python:3.13-slim final image
- Defines tasks: seed, seed-lineages, seed-all-lineages, seed-resistance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BackendClient now calls POST /users/sync (githubId=9999999999,
name="GenSpectrum Team") to obtain the internal user id before any
collection API calls. wait_for_backend() uses this call for polling.
Removes the --user-id CLI flag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
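The sync-then-poll behaviour described in this commit might look roughly like the sketch below. The HTTP call is passed in as a plain callable so the sketch stays independent of any particular client library. The response key `id`, the retry parameters, and the function signatures are assumptions.

```python
import time
from typing import Any, Callable

# (path, json_body) -> decoded JSON response; a stand-in for a real HTTP client.
PostJson = Callable[[str, dict[str, Any]], dict[str, Any]]


def sync_user(post: PostJson, github_id: int, name: str) -> int:
    """Call POST /users/sync and return the backend's internal user id."""
    body = post("/users/sync", {"githubId": github_id, "name": name})
    return int(body["id"])  # the response key "id" is an assumption


def wait_for_backend(
    post: PostJson,
    github_id: int,
    name: str,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the sync endpoint until it answers, then return the user id."""
    for attempt in range(1, attempts + 1):
        try:
            return sync_user(post, github_id, name)
        except OSError:
            # Connection failures are OSError subclasses (this also covers
            # requests' exceptions). Re-raise once the attempts are used up.
            if attempt == attempts:
                raise
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

Using the sync call itself as the readiness probe means the first successful poll already yields the user id, so no separate health endpoint is needed.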
Collections are now always created or updated (matched by name).
Adds BackendClient.update_collection() using PUT /collections/{id}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
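The create-or-update matching by name could be sketched as below. `seed_source` and `update_collection` are names that appear in this PR. `fetch_collections`, `create_collection`, the `Backend` protocol, and the return value are assumptions made for the sketch.

```python
from typing import Any, Protocol


class Backend(Protocol):
    # Assumed client interface; only update_collection is named in the PR.
    def fetch_collections(self) -> list[dict[str, Any]]: ...
    def create_collection(self, collection: dict[str, Any]) -> None: ...
    def update_collection(
        self, collection_id: int, collection: dict[str, Any]
    ) -> None: ...


def seed_source(
    client: Backend, collections: list[dict[str, Any]]
) -> tuple[int, int]:
    """Upsert collections by name and return (created, updated) counts."""
    # Index existing collections once so each lookup is a dict access.
    existing_ids = {c["name"]: c["id"] for c in client.fetch_collections()}
    created = updated = 0
    for collection in collections:
        collection_id = existing_ids.get(collection["name"])
        if collection_id is None:
            client.create_collection(collection)  # POST /collections
            created += 1
        else:
            client.update_collection(collection_id, collection)  # PUT /collections/{id}
            updated += 1
    return created, updated
```

Under this scheme a second run against the same backend reports every collection as updated and none as created.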
…ackend

38 tests across 4 files:
- test_backend.py: BackendClient (responses library for HTTP mocking)
- test_resistance_mutations.py: mature_name offset math, collection structure
- test_pango_lineages.py: collection building, variant filtering, HTTP fetch
- test_seed.py: seed_source create/update/mixed upsert logic

Run with: pixi run -e test test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
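The "mature_name offset math" is not spelled out in this PR. One plausible reading is a conversion from polyprotein coordinates to mature-protein coordinates, sketched below under that assumption. The signature, the direction of the conversion, and the output format are guesses. The offsets follow standard SARS-CoV-2 numbering, in which 3CLpro (nsp5) starts at ORF1a residue 3264 and RdRp (nsp12) positions are ORF1b positions plus 9.

```python
import re

# mature_position = polyprotein_position + offset
OFFSETS = {
    ("ORF1a", "3CLpro"): -3263,  # nsp5 begins at ORF1a residue 3264
    ("ORF1b", "RdRp"): 9,  # nsp12 position = ORF1b position + 9
}

# Reference residue, position, optional alternate residue (or * / - for stop/deletion).
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z*-]?)$")


def mature_name(gene: str, protein: str, mutation: str) -> str:
    """Rename a polyprotein mutation to mature-protein numbering (hypothetical)."""
    match = _MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    ref, position, alt = match.groups()
    mature_position = int(position) + OFFSETS[(gene, protein)]
    return f"{protein}:{ref}{mature_position}{alt}"
```

For example, under these offsets `mature_name("ORF1a", "3CLpro", "P3395H")` gives `3CLpro:P132H`, and `mature_name("ORF1b", "RdRp", "P314L")` gives `RdRp:P323L`.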

vercel Bot commented May 7, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dashboards | Ready | Preview, Comment | May 7, 2026 11:18am |


fhennig changed the title from "feat(collection-seeding): Python seeder with pango lineages and test suite" to "feat(website): Python seeder with pango lineages and test suite" on May 7, 2026
fhennig changed the title from "feat(website): Python seeder with pango lineages and test suite" to "feat(website): python seeder with pango lineages and test suite" on May 7, 2026
1 participant