perf(vcf-to-proteindb): per-chromosome multiprocessing (~3.4× on 4-chrom test) #104
Conversation
Adds --profile-out PATH and --print-top N flags. When given, the run is wrapped in cProfile, the .prof file is written to PATH, and the top N functions are printed sorted by both cumulative time and own time. Used to identify the remaining hot path after the issue-#99 fix before designing a parallelization strategy.
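A minimal sketch of what that wrapper amounts to (function and parameter names here are illustrative, not the actual CLI plumbing):

```python
import cProfile
import pstats

def run_with_profile(run, profile_out, print_top):
    """Wrap run() in cProfile, dump the raw .prof file to profile_out, then
    print the top-N functions sorted by cumulative time and by own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        run()
    finally:
        profiler.disable()
        profiler.dump_stats(profile_out)
        stats = pstats.Stats(profiler)
        stats.sort_stats("cumulative").print_stats(print_top)
        stats.sort_stats("tottime").print_stats(print_top)
```

The dumped .prof file can be explored afterwards with the interactive `python -m pstats <file>` browser.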
Profile of 50K gnomAD chr22 variants showed ~36s one-time setup and ~14s per-variant work, with Biopython translate the dominant CPU cost. Implements a per-chromosome `multiprocessing.Pool` to fan out the per-variant loop across cores.

- New `--workers N` CLI flag (default 1, sequential, backward-compatible).
- Internally: pre-annotate the VCF in the main process (avoids the bedtools-cwd race), split the VCF into per-chromosome temp files, dispatch to a spawn-context Pool, concatenate per-chunk outputs.
- Existing per-variant loop moved verbatim into `_vcf_to_proteindb_chunk`; the public `vcf_to_proteindb` now dispatches sequential or parallel.
- Workers re-construct EnsemblDataService fresh per process (no shared gffutils SQLite handles or SeqIO indices); each pays its own setup, but the per-variant CPU work parallelises near-linearly.
- Equivalence test verifies workers=2 produces the same set of output sequences as workers=1 on the existing test.vcf fixture.
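A condensed sketch of that fan-out; the real `_vcf_to_proteindb_worker` takes `default_params` / `pipeline_args` and more, so this strips the plumbing down to the Pool mechanics:

```python
import multiprocessing as mp

def _vcf_to_proteindb_worker(chunk_path):
    # Placeholder for the PR's real worker, which re-creates EnsemblDataService
    # inside the child process and runs the per-variant loop on one chromosome
    # chunk; it returns the path of the chunk's protein FASTA.
    out_path = chunk_path + ".proteindb.fa"
    open(out_path, "w").close()  # the real worker writes FASTA records here
    return out_path

def run_parallel(chunk_paths, output_file, workers):
    # One task per chromosome chunk; spawn context so children start clean
    # (no inherited gffutils SQLite handles or SeqIO index handles).
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=min(workers, len(chunk_paths))) as pool:
        chunk_outputs = pool.map(_vcf_to_proteindb_worker, sorted(chunk_paths))
    # Concatenate per-chunk outputs in a stable (sorted-chromosome) order.
    with open(output_file, "w") as out:
        for chunk_out in chunk_outputs:
            with open(chunk_out) as fh:
                out.write(fh.read())
```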
…apping

Three focused improvements stacking on the per-chromosome multiprocessing PR:

1. Replace `SeqIO.index` with `SeqIO.index_db`. `SeqIO.index` rebuilds an in-memory FASTA offset index every invocation (~14 s for the 434 MB Ensembl cDNA FASTA); `SeqIO.index_db` persists the index in a SQLite .idx file next to the FASTA, reused across runs and shared across workers. Pre-built once in the main process before fan-out, so the first chunk pays the build cost and the rest just open the same .idx. Stale detection: a .idx older than the FASTA triggers a rebuild (see the sketch after this list).
2. `_split_vcf_by_chrom` now streams the input VCF and writes per-chrom chunks directly to disk in a single pass instead of buffering all data lines in memory. Constant memory regardless of input size — makes the whole-genome path actually feasible.
3. `transcript_id_mapping` (versionless -> versioned dict of 207k FASTA keys) is now built lazily on the first KeyError fallback. Most real workloads never need it; saves ~200ms per worker startup.

No public API change. All existing tests pass; the equivalence test still confirms sequential == parallel output.
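A minimal sketch of the persistent-index-plus-stale-check pattern described in point 1 (the helper name is illustrative):

```python
import os
from Bio import SeqIO

def open_cdna_index(fasta_path):
    # Keep a SQLite offset index next to the FASTA and rebuild it only when
    # it is missing or older than the FASTA itself (stale detection).
    idx_path = fasta_path + ".idx"
    if os.path.exists(idx_path) and os.path.getmtime(idx_path) < os.path.getmtime(fasta_path):
        os.remove(idx_path)  # stale index: force index_db to rebuild it
    return SeqIO.index_db(idx_path, fasta_path, "fasta")
```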
The previous commit (254e35c) inadvertently captured a re-ordering of `proteindb_from_custom_VCF.fa` from a single test run. The file is the OUTPUT of `test_vcf_to_proteindb_notannotated` (which only asserts exit_code == 0, not content); the ordering shifts non-deterministically between runs because `annoate_vcf` uses Python set iteration with hash randomization. Restore the prior content to keep the working tree stable across consecutive test invocations. Also gitignore `*.fa.idx` / `*.fasta.idx`: Biopython `SeqIO.index_db` now materialises a SQLite index next to each FASTA. These are build artifacts, rebuilt automatically when the source FASTA changes.
Tier-1 follow-up benchmark —
| Workers | Before Tier-1 (9235f7d) | After Tier-1 (d50f8a0) | Wall-clock Δ | vs ORIGINAL seq |
|---|---|---|---|---|
| 1 | 69.08 s | 20.86 s | −69.8% (3.3× faster) | 3.3× |
| 4 | 20.61 s | 12.75 s | −38.1% (1.6× faster) | 5.4× |
| 8 (caps to 4) | 20.41 s | 12.86 s | −37.0% (1.6× faster) | 5.4× |
Output MD5 (7c5403192b5acb62bd2841ae3c39d62a) identical across all three runs and matches the pre-Tier-1 baseline — semantically zero change.
Why sequential improved 3.3×: SeqIO.index_db persists the 434 MB FASTA offset index to disk (12 MB SQLite file), so the per-invocation 14 s scan is now a sub-second open after the first build. The previous baseline rebuilt it every run.
Why parallel still improves on top: each worker also opens the pre-built .idx instead of scanning, saving 13 s × N workers of redundant setup. The compounding 5.4× is the realistic end-user speedup vs the unoptimised sequential code path.
Extrapolation update for whole-genome (~12 h originally sequential):
| Mode | Estimated wall-clock |
|---|---|
| `--workers 1` post-Tier-1 | ~3.6 h |
| `--workers 8` post-Tier-1 | ~1 h (bounded by chr1) |
| `--workers 24` post-Tier-1 | ~30 min (bounded by chr1's per-chrom time) |
Follow-up to #99 (closed by PR #102). #102 brought vcf-to-proteindb from ~12 h on chr22 down to ~30 min by killing redundant SQLite queries; this PR addresses the next ceiling — sequential per-chromosome processing — by fanning the per-variant loop across `multiprocessing` workers.

Dependency
This branch is cut from `feat/bug-fixed-iteration` (PR #102). Once #102 merges into `dev`, the diff here will narrow to just the 2 perf commits below.

Strategy
The profile (see below) showed ~36 s of one-time setup per `vcf_to_proteindb` invocation plus per-variant work dominated by Biopython `Seq.translate` — CPU-bound, no shared state across chromosomes. `multiprocessing.Pool` with one task per chromosome was the obvious next move.

Profile-driven design (50 K gnomAD chr22 slice, 49.7 s total)
- `parse_gtf` → `gffutils.create_db`
- `SeqIO.index` initial FASTA scan
- `vcf_from_file`
- `get_orfs_vcf` → Biopython translate
- `vcf_to_proteindb` loop body
- `str.partition` (16 M calls, info_kv parser)

Conclusion: the bottleneck is per-variant work in the inner loop. Per-chromosome parallelism turns ∑(per-chrom-work) sequential into max(per-chrom-work) wall-clock.
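Stated as a formula (ideal case with at least one worker per chromosome; per-worker setup omitted):

$$
T_{\text{sequential}} \approx \sum_{c \in \text{chroms}} t_c
\qquad\longrightarrow\qquad
T_{\text{parallel}} \approx \max_{c \in \text{chroms}} t_c
$$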
Implementation
- New helpers `_split_vcf_by_chrom(vcf_file)` and `_vcf_to_proteindb_worker(default_params, pipeline_args, ...)`. The worker is module-level so `multiprocessing.Pool` can pickle it.
- The per-variant loop moved into `_vcf_to_proteindb_chunk()` taking an explicit output path. No semantic changes; the public `vcf_to_proteindb()` now dispatches sequential vs parallel based on `workers` and the number of chromosomes.
- `annoate_vcf` (the bedtools-intersect path for VCFs without CSQ) writes to `Path.cwd()`. Running it once in the parent avoids per-worker cwd races. Workers run only on already-annotated chunks.
- Workers are started with `mp.get_context('spawn')` explicitly. The default is fork on Linux, which would inherit gffutils SQLite handles across forked processes — unsafe. Spawn starts clean Python interpreters.
- New `-w/--workers N` CLI flag (default 1, sequential, backward-compatible). Also exposed as the `workers` config field.
Benchmark — 4-chromosome gnomAD v4.1 exomes slice

Dataset: 4 chromosomes (chr19/20/21/22) × first 15K-50K variants each = 95,000 variants total, VEP-annotated (`vep` INFO field). Reference: Ensembl release 113 GRCh38, GTF filtered to those 4 chromosomes (180 MB), full cDNA FASTA (434 MB, 207K transcripts).

Machine: EBI `pride-linux-vm` (Linux x86_64, 8 cores, 100 GB disk).

Output equivalence: MD5 of sorted unique sequence content is identical across all three runs (`7c5403192b5acb62bd2841ae3c39d62a`), confirming the parallel path produces the same protein set as sequential.
Why workers=8 doesn't beat workers=4: by design, `min(workers, num_chromosomes)` workers are spawned. The 4-chrom dataset caps useful parallelism at 4. On a whole-genome VCF (~24 autosomes + sex + MT), `--workers 8` would run 8 chromosome chunks at a time over 3 batches.

Extrapolation to whole-genome
For Husen's reported single-thread baseline of ~30 min on chr22 (post-#102):
Real numbers will deviate from this linear projection because chromosomes are unequal sizes (chr1 is ~5× chr22); whole-genome wall-clock will be approximately `ceil(24/N) × max_chrom_time + per_worker_setup`. The setup overhead (~36 s) is amortised better as variants-per-chromosome grows, so larger workloads benefit more.
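That estimate translated literally into Python (a sketch only; `chrom_times_s` would come from measured per-chromosome wall-clocks, nothing is assumed here beyond the formula above):

```python
import math

def estimate_whole_genome_wall_clock(n_workers, chrom_times_s, per_worker_setup_s):
    """ceil(n_chroms / n_workers) batches, each bounded by the slowest
    chromosome in flight, plus each worker's one-time setup cost."""
    n_batches = math.ceil(len(chrom_times_s) / n_workers)
    return n_batches * max(chrom_times_s) + per_worker_setup_s
```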
What the profile says is left on the table

- Per-worker sequential `SeqIO.index` rebuild inside `vcf_to_proteindb` (~14 s × N workers); `SeqIO.index_db` (SQLite-backed FASTA index, persistent) would save ~14 s × N. Modest, deferred.
- `gffutils.create_db` first run (~19 s)
- Biopython `Seq.translate` per variant
- The `vep` field parser fails on records with commas inside pipe-separated cells (e.g. PHYLOCSF_TOO_SHORT, GERP_DIST:…)

Test plan
- New `pgatk/tests/test_vcf_to_proteindb_parallel.py` — runs `workers=1` and `workers=2` against the in-repo tiny `test.vcf` fixture and asserts identical sequence sets (passes locally + on remote)
- Default `workers=None` path exercised by the existing tests:
  - `test_vcf_to_proteindb` (CSQ-annotated path)
  - `test_vcf_to_proteindb_notannotated` (bedtools-intersect path)
  - `test_vcf_gnomad_to_proteindb` (gnomAD-style VCF)
- Benchmark `workers=1`, `workers=4`, `workers=8` on the gnomAD 4-chrom 95K-variant slice
- Run `pgatk vcf-to-proteindb --workers 8 ...` on a whole-genome VEP-annotated VCF and report wall-clock vs single-thread
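The equivalence test is roughly of this shape (a sketch, not the actual file; the `run_vcf_to_proteindb` callable stands in for however the suite invokes the tool):

```python
from Bio import SeqIO

def fasta_sequence_set(path):
    # Order-insensitive comparison: chunk concatenation may reorder records,
    # but the set of protein sequences must not change.
    return {str(rec.seq) for rec in SeqIO.parse(path, "fasta")}

def test_parallel_matches_sequential(tmp_path, run_vcf_to_proteindb):
    # run_vcf_to_proteindb is a hypothetical fixture wrapping the CLI or the
    # EnsemblDataService call; the real test file may differ.
    seq_out = tmp_path / "workers1.fa"
    par_out = tmp_path / "workers2.fa"
    run_vcf_to_proteindb("test.vcf", seq_out, workers=1)
    run_vcf_to_proteindb("test.vcf", par_out, workers=2)
    assert fasta_sequence_set(seq_out) == fasta_sequence_set(par_out)
```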