perf(vcf-to-proteindb): per-chromosome multiprocessing (~3.4× on 4-chrom test) #104
Conversation
Adds --profile-out PATH and --print-top N flags. When given, the run is wrapped in cProfile, the .prof file is written to PATH, and the top N functions are printed sorted by both cumulative time and own time. Used to identify the remaining hot path after the issue-#99 fix before designing a parallelization strategy.
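A minimal sketch of what that wrapper amounts to (function and parameter names here are illustrative, not the actual CLI plumbing):

```python
import cProfile
import pstats

def run_with_profile(run, profile_out, print_top):
    """Wrap run() in cProfile, dump the raw .prof file to profile_out, then
    print the top-N functions sorted by cumulative time and by own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        run()
    finally:
        profiler.disable()
        profiler.dump_stats(profile_out)
        stats = pstats.Stats(profiler)
        stats.sort_stats("cumulative").print_stats(print_top)
        stats.sort_stats("tottime").print_stats(print_top)
```

The dumped .prof file can be explored afterwards with the interactive `python -m pstats <file>` browser.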
Profile of 50K gnomAD chr22 variants showed ~36s one-time setup and ~14s per-variant work, with Biopython translate the dominant CPU cost. Implements a per-chromosome `multiprocessing.Pool` to fan out the per-variant loop across cores.

- New `--workers N` CLI flag (default 1, sequential, backward-compatible).
- Internally: pre-annotate the VCF in the main process (avoids the bedtools-cwd race), split the VCF into per-chromosome temp files, dispatch to a spawn-context Pool, concatenate per-chunk outputs.
- Existing per-variant loop moved verbatim into `_vcf_to_proteindb_chunk`; the public `vcf_to_proteindb` now dispatches sequential or parallel.
- Workers re-construct EnsemblDataService fresh per process (no shared gffutils SQLite handles or SeqIO indices); each pays its own setup, but the per-variant CPU work parallelises near-linearly.
- Equivalence test verifies workers=2 produces the same set of output sequences as workers=1 on the existing test.vcf fixture.
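A condensed sketch of that fan-out; the real `_vcf_to_proteindb_worker` takes `default_params` / `pipeline_args` and more, so this strips the plumbing down to the Pool mechanics:

```python
import multiprocessing as mp

def _vcf_to_proteindb_worker(chunk_path):
    # Placeholder for the PR's real worker, which re-creates EnsemblDataService
    # inside the child process and runs the per-variant loop on one chromosome
    # chunk; it returns the path of the chunk's protein FASTA.
    out_path = chunk_path + ".proteindb.fa"
    open(out_path, "w").close()  # the real worker writes FASTA records here
    return out_path

def run_parallel(chunk_paths, output_file, workers):
    # One task per chromosome chunk; spawn context so children start clean
    # (no inherited gffutils SQLite handles or SeqIO index handles).
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=min(workers, len(chunk_paths))) as pool:
        chunk_outputs = pool.map(_vcf_to_proteindb_worker, sorted(chunk_paths))
    # Concatenate per-chunk outputs in a stable (sorted-chromosome) order.
    with open(output_file, "w") as out:
        for chunk_out in chunk_outputs:
            with open(chunk_out) as fh:
                out.write(fh.read())
```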
…apping

Three focused improvements stacking on the per-chromosome multiprocessing PR:

1. Replace `SeqIO.index` with `SeqIO.index_db`. `SeqIO.index` rebuilds an in-memory FASTA offset index every invocation (~14 s for the 434 MB Ensembl cDNA FASTA); `SeqIO.index_db` persists the index in a SQLite .idx file next to the FASTA, reused across runs and shared across workers. Pre-built once in the main process before fan-out, so the first chunk pays the build cost and the rest just open the same .idx. Stale detection: a .idx older than the FASTA triggers a rebuild (see the sketch after this list).
2. `_split_vcf_by_chrom` now streams the input VCF and writes per-chrom chunks directly to disk in a single pass instead of buffering all data lines in memory. Constant memory regardless of input size — makes the whole-genome path actually feasible.
3. `transcript_id_mapping` (versionless -> versioned dict of 207k FASTA keys) is now built lazily on the first KeyError fallback. Most real workloads never need it; saves ~200ms per worker startup.

No public API change. All existing tests pass; the equivalence test still confirms sequential == parallel output.
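A minimal sketch of the persistent-index-plus-stale-check pattern described in point 1 (the helper name is illustrative):

```python
import os
from Bio import SeqIO

def open_cdna_index(fasta_path):
    # Keep a SQLite offset index next to the FASTA and rebuild it only when
    # it is missing or older than the FASTA itself (stale detection).
    idx_path = fasta_path + ".idx"
    if os.path.exists(idx_path) and os.path.getmtime(idx_path) < os.path.getmtime(fasta_path):
        os.remove(idx_path)  # stale index: force index_db to rebuild it
    return SeqIO.index_db(idx_path, fasta_path, "fasta")
```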
The previous commit (254e35c) inadvertently captured a re-ordering of `proteindb_from_custom_VCF.fa` from a single test run. The file is the OUTPUT of `test_vcf_to_proteindb_notannotated` (which only asserts exit_code == 0, not content); the ordering shifts non-deterministically between runs because `annoate_vcf` uses Python set iteration with hash randomization. Restore the prior content to keep the working tree stable across consecutive test invocations. Also gitignore `*.fa.idx` / `*.fasta.idx`: Biopython `SeqIO.index_db` now materialises a SQLite index next to each FASTA. These are build artifacts, rebuilt automatically when the source FASTA changes.
Tier-1 follow-up benchmark —
| Workers | Before Tier-1 (9235f7d) | After Tier-1 (d50f8a0) | Wall-clock Δ | vs ORIGINAL seq |
|---|---|---|---|---|
| 1 | 69.08 s | 20.86 s | −69.8% (3.3× faster) | 3.3× |
| 4 | 20.61 s | 12.75 s | −38.1% (1.6× faster) | 5.4× |
| 8 (caps to 4) | 20.41 s | 12.86 s | −37.0% (1.6× faster) | 5.4× |
Output MD5 (7c5403192b5acb62bd2841ae3c39d62a) identical across all three runs and matches the pre-Tier-1 baseline — semantically zero change.
Why sequential improved 3.3×: SeqIO.index_db persists the 434 MB FASTA offset index to disk (12 MB SQLite file), so the per-invocation 14 s scan is now a sub-second open after the first build. The previous baseline rebuilt it every run.
Why parallel still improves on top: each worker also opens the pre-built .idx instead of scanning, saving 13 s × N workers of redundant setup. The compounding 5.4× is the realistic end-user speedup vs the unoptimised sequential code path.
Extrapolation update for whole-genome (~12 h originally sequential):
| Mode | Estimated wall-clock |
|---|---|
| `--workers 1` post-Tier-1 | ~3.6 h |
| `--workers 8` post-Tier-1 | ~1 h (bounded by chr1) |
| `--workers 24` post-Tier-1 | ~30 min (bounded by chr1's per-chrom time) |
Follow-up to #99 (closed by PR #102). #102 brought vcf-to-proteindb from ~12 h on chr22 down to ~30 min by killing redundant SQLite queries; this PR addresses the next ceiling — sequential per-chromosome processing — by fanning the per-variant loop across `multiprocessing` workers.

Dependency
This branch is cut from `feat/bug-fixed-iteration` (PR #102). Once #102 merges into `dev`, the diff here will narrow to just the 2 perf commits below.

Strategy
The profile (see below) showed ~36 s of one-time setup per `vcf_to_proteindb` invocation plus per-variant work dominated by Biopython `Seq.translate` — CPU-bound, no shared state across chromosomes. `multiprocessing.Pool` with one task per chromosome was the obvious next move.

Profile-driven design (50 K gnomAD chr22 slice, 49.7 s total)
- `parse_gtf` → `gffutils.create_db`
- `SeqIO.index` initial FASTA scan
- `vcf_from_file`
- `get_orfs_vcf` → Biopython translate
- `vcf_to_proteindb` loop body
- `str.partition` (16 M calls, info_kv parser)

Conclusion: the bottleneck is per-variant work in the inner loop. Per-chromosome parallelism turns ∑(per-chrom-work) sequential into max(per-chrom-work) wall-clock.
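Stated as a formula (ideal case with at least one worker per chromosome; per-worker setup omitted):

$$
T_{\text{sequential}} \approx \sum_{c \in \text{chroms}} t_c
\qquad\longrightarrow\qquad
T_{\text{parallel}} \approx \max_{c \in \text{chroms}} t_c
$$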
Implementation
- New helpers `_split_vcf_by_chrom(vcf_file)` and `_vcf_to_proteindb_worker(default_params, pipeline_args, ...)`. The worker is module-level so `multiprocessing.Pool` can pickle it.
- The per-variant loop moved into `_vcf_to_proteindb_chunk()` taking an explicit output path. No semantic changes; the public `vcf_to_proteindb()` now dispatches sequential vs parallel based on `workers` and the number of chromosomes.
- `annoate_vcf` (the bedtools-intersect path for VCFs without CSQ) writes to `Path.cwd()`. Running it once in the parent avoids per-worker cwd races. Workers run only on already-annotated chunks.
- Workers are started with `mp.get_context('spawn')` explicitly. The default is fork on Linux, which would inherit gffutils SQLite handles across forked processes — unsafe. Spawn starts clean Python interpreters.
- New `-w/--workers N` CLI flag (default 1, sequential, backward-compatible). Also exposed as the `workers` config field.
Benchmark — 4-chromosome gnomAD v4.1 exomes slice

Dataset: 4 chromosomes (chr19/20/21/22) × first 15K-50K variants each = 95,000 variants total, VEP-annotated (`vep` INFO field). Reference: Ensembl release 113 GRCh38, GTF filtered to those 4 chromosomes (180 MB), full cDNA FASTA (434 MB, 207K transcripts).

Machine: EBI `pride-linux-vm` (Linux x86_64, 8 cores, 100 GB disk).

Output equivalence: MD5 of sorted unique sequence content is identical across all three runs (`7c5403192b5acb62bd2841ae3c39d62a`), confirming the parallel path produces the same protein set as sequential.
Why workers=8 doesn't beat workers=4: by design, `min(workers, num_chromosomes)` workers are spawned. The 4-chrom dataset caps useful parallelism at 4. On a whole-genome VCF (~24 autosomes + sex + MT), `--workers 8` would run 8 chromosome chunks at a time over 3 batches.

Extrapolation to whole-genome
For Husen's reported single-thread baseline of ~30 min on chr22 (post-#102):
Real numbers will deviate from this linear projection because chromosomes are unequal sizes (chr1 is ~5× chr22); whole-genome wall-clock will be approximately `ceil(24/N) × max_chrom_time + per_worker_setup`. The setup overhead (~36 s) is amortised better as variants-per-chromosome grows, so larger workloads benefit more.
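That estimate translated literally into Python (a sketch only; `chrom_times_s` would come from measured per-chromosome wall-clocks, nothing is assumed here beyond the formula above):

```python
import math

def estimate_whole_genome_wall_clock(n_workers, chrom_times_s, per_worker_setup_s):
    """ceil(n_chroms / n_workers) batches, each bounded by the slowest
    chromosome in flight, plus each worker's one-time setup cost."""
    n_batches = math.ceil(len(chrom_times_s) / n_workers)
    return n_batches * max(chrom_times_s) + per_worker_setup_s
```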
What the profile says is left on the table

- Per-worker sequential `SeqIO.index` rebuild inside `vcf_to_proteindb` (~14 s × N workers); `SeqIO.index_db` (SQLite-backed FASTA index, persistent) would save ~14 s × N. Modest, deferred.
- `gffutils.create_db` first run (~19 s)
- Biopython `Seq.translate` per variant
- The `vep` field parser fails on records with commas inside pipe-separated cells (e.g. PHYLOCSF_TOO_SHORT, GERP_DIST:…)

Test plan
- New `pgatk/tests/test_vcf_to_proteindb_parallel.py` — runs `workers=1` and `workers=2` against the in-repo tiny `test.vcf` fixture and asserts identical sequence sets (passes locally + on remote)
- Default `workers=None` path exercised by the existing tests:
  - `test_vcf_to_proteindb` (CSQ-annotated path)
  - `test_vcf_to_proteindb_notannotated` (bedtools-intersect path)
  - `test_vcf_gnomad_to_proteindb` (gnomAD-style VCF)
- Benchmark `workers=1`, `workers=4`, `workers=8` on the gnomAD 4-chrom 95K-variant slice
- Run `pgatk vcf-to-proteindb --workers 8 ...` on a whole-genome VEP-annotated VCF and report wall-clock vs single-thread
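The equivalence test is roughly of this shape (a sketch, not the actual file; the `run_vcf_to_proteindb` callable stands in for however the suite invokes the tool):

```python
from Bio import SeqIO

def fasta_sequence_set(path):
    # Order-insensitive comparison: chunk concatenation may reorder records,
    # but the set of protein sequences must not change.
    return {str(rec.seq) for rec in SeqIO.parse(path, "fasta")}

def test_parallel_matches_sequential(tmp_path, run_vcf_to_proteindb):
    # run_vcf_to_proteindb is a hypothetical fixture wrapping the CLI or the
    # EnsemblDataService call; the real test file may differ.
    seq_out = tmp_path / "workers1.fa"
    par_out = tmp_path / "workers2.fa"
    run_vcf_to_proteindb("test.vcf", seq_out, workers=1)
    run_vcf_to_proteindb("test.vcf", par_out, workers=2)
    assert fasta_sequence_set(seq_out) == fasta_sequence_set(par_out)
```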