
perf(vcf-to-proteindb): per-chromosome multiprocessing (~3.4× on 4-chrom test) #104

Merged
ypriverol merged 4 commits into feat/bug-fixed-iteration from perf/vcf-to-proteindb-parallel on May 14, 2026

Conversation

@ypriverol
Member

Follow-up to #99 (closed by PR #102). #102 brought vcf-to-proteindb from ~12 h on chr22 down to ~30 min by killing redundant SQLite queries; this PR addresses the next ceiling — sequential per-chromosome processing — by fanning the per-variant loop across multiprocessing workers.

Dependency

This branch is cut from feat/bug-fixed-iteration (PR #102). Once #102 merges into dev the diff here will narrow to just the 2 perf commits below.

Strategy

The profile (see below) showed ~36 s of one-time setup per vcf_to_proteindb invocation plus per-variant work dominated by Biopython Seq.translate — CPU-bound, no shared state across chromosomes. multiprocessing.Pool with one task per chromosome was the obvious next move.

Profile-driven design (50 K gnomAD chr22 slice, 49.7 s total)

| Time | Category | Component |
|------|----------|-----------|
| 19.5 s (39%) | one-time setup | parse_gtf → gffutils.create_db |
| 13.7 s (27%) | one-time setup | SeqIO.index initial FASTA scan |
| 2.2 s (4%) | one-time setup | vcf_from_file |
| 5.2 s (10%) | per-variant CPU | get_orfs_vcf → Biopython translate |
| 6.4 s (13%) | per-variant | vcf_to_proteindb loop body |
| 1.6 s (3%) | per-variant | str.partition (16 M calls, info_kv parser) |
| 1.1 s (2%) | per-variant | sqlite3 cursor.execute (_FeatureCache from #102 is working) |

Conclusion: the bottleneck is per-variant work in the inner loop. Per-chromosome parallelism turns a sequential cost of ∑(per-chrom work) into a wall-clock cost of roughly max(per-chrom work).

Implementation

  • New module-level helpers: _split_vcf_by_chrom(vcf_file) and _vcf_to_proteindb_worker(default_params, pipeline_args, ...). Worker is module-level so multiprocessing.Pool can pickle it.
  • Existing per-variant loop preserved: moved verbatim into _vcf_to_proteindb_chunk() taking an explicit output path. No semantic changes; the public vcf_to_proteindb() now dispatches sequential or parallel based on workers and the number of chromosomes (see the sketch after this list).
  • Pre-annotation done in the main process: annoate_vcf (the bedtools-intersect path for VCFs without CSQ) writes to Path.cwd(). Running it once in the parent avoids per-worker cwd races. Workers run only on already-annotated chunks.
  • mp.get_context('spawn') explicitly. Default is fork on Linux, which would inherit gffutils SQLite handles across forked processes — unsafe. Spawn starts clean Python interpreters.
  • -w/--workers N CLI flag (default 1, sequential, backward-compatible). Also exposed as the workers config field.
  • Output ordering: chunks concatenated in chromosome sort order. Within a chromosome, order is identical to sequential.
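
The moving parts above reduce to a small dispatch shape. The sketch below is runnable but illustrative: `_split_vcf_by_chrom` mirrors the helper named above, while `dispatch()`, `_worker()`, and the stub worker body are stand-ins, not the actual code in pgatk/ensembl/ensembl.py.

```python
import multiprocessing as mp
import tempfile
from pathlib import Path

def _split_vcf_by_chrom(vcf_file):
    """Stream the VCF once, appending each data line to a per-CHROM temp file."""
    header, writers = [], {}
    with open(vcf_file) as fh:
        for line in fh:
            if line.startswith("#"):
                header.append(line)
                continue
            chrom = line.split("\t", 1)[0]
            if chrom not in writers:
                tmp = tempfile.NamedTemporaryFile(
                    mode="w", suffix=f".{chrom}.vcf", delete=False)
                tmp.writelines(header)      # every chunk carries the full header
                writers[chrom] = tmp
            writers[chrom].write(line)
    for tmp in writers.values():
        tmp.close()
    return {chrom: tmp.name for chrom, tmp in writers.items()}

def _worker(item):
    """Stub: the real worker rebuilds EnsemblDataService in the spawned
    process and runs the unchanged per-variant loop on its chunk."""
    chrom, chunk_path = item
    out_path = chunk_path + ".proteindb.fa"
    Path(out_path).touch()
    return out_path

def dispatch(vcf_file, output_file, workers=1):
    chunks = _split_vcf_by_chrom(vcf_file)
    n_procs = min(workers, len(chunks))  # requesting more workers than chunks caps
    ctx = mp.get_context("spawn")        # clean interpreters; no inherited SQLite handles
    with ctx.Pool(n_procs) as pool:      # needs a __main__ guard when run as a script
        outs = pool.map(_worker, sorted(chunks.items()))
    with open(output_file, "w") as out:
        for path in outs:                # concatenate in chromosome sort order
            out.write(Path(path).read_text())
```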

Benchmark — 4-chromosome gnomAD v4.1 exomes slice

Dataset: 4 chromosomes (chr19/20/21/22) × first 15K-50K variants each = 95,000 variants total, VEP-annotated (vep INFO field). Reference: Ensembl release 113 GRCh38, GTF filtered to those 4 chromosomes (180 MB), full cDNA FASTA (434 MB, 207K transcripts).

Machine: EBI pride-linux-vm (Linux x86_64, 8 cores, 100 GB disk).

| Workers | Wall-clock | User CPU | Sys CPU | Sequences output | Speedup |
|---------|------------|----------|---------|------------------|---------|
| 1 (sequential baseline) | 69.08 s | 63.66 s | 5.64 s | 23,499 | 1.0× |
| 4 (= number of chromosomes) | 20.61 s | 44.93 s | 13.35 s | 23,499 | 3.35× |
| 8 (request > chunks; caps at 4) | 20.41 s | 44.88 s | 13.20 s | 23,499 | 3.38× |

Output equivalence: MD5 of sorted unique sequence content is identical across all three runs (`7c5403192b5acb62bd2841ae3c39d62a` — confirms the parallel path produces the same protein set as sequential).
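
One way to compute such a digest (a sketch; the exact command used for the MD5 above may differ, and the output file names here are placeholders). Hashing the sorted, de-duplicated sequence strings makes the digest independent of chunk concatenation order:

```python
import hashlib
from Bio import SeqIO

def sequence_set_md5(fasta_path):
    """MD5 over sorted unique sequence content, independent of record order."""
    seqs = sorted({str(rec.seq) for rec in SeqIO.parse(fasta_path, "fasta")})
    return hashlib.md5("\n".join(seqs).encode()).hexdigest()

assert sequence_set_md5("out_workers1.fa") == sequence_set_md5("out_workers4.fa")
```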

Why workers=8 doesn't beat workers=4: by design, min(workers, num_chromosomes) workers are spawned. The 4-chrom dataset caps useful parallelism at 4. On a whole-genome VCF (~24 autosomes + sex + MT), --workers 8 would run 8 chromosome chunks at a time over 3 batches.

Extrapolation to whole-genome

For Husen's reported single-thread baseline of ~30 min on chr22 (post-#102):

| Workers | Whole-genome wall-clock (est.) | Notes |
|---------|--------------------------------|-------|
| 1 | ~12 h | 24 × 30 min, current behaviour |
| 4 | ~3 h | 4 chunks at a time × 6 waves |
| 8 | ~1.5 h | 3 waves of 8 |
| 24+ | ~30 min | bounded by chr1 (largest); a worker per chromosome |

Real numbers will deviate from this linear projection because chromosomes are of unequal size (chr1 is ~5× chr22); whole-genome wall-clock will be approximately ceil(24/N) × max_chrom_time + per_worker_setup. The setup overhead (~36 s) is amortised better as the number of variants per chromosome grows, so larger workloads benefit more.
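
As a sanity check on the table, the projection formula is easy to evaluate. The helper below is illustrative and, like the table, treats every chromosome as costing the chr22 baseline of 30 min:

```python
import math

def est_wall_clock_h(n_chroms=24, workers=1, chrom_time_min=30, setup_s=36):
    waves = math.ceil(n_chroms / workers)               # chunks run in waves of N
    return (waves * chrom_time_min * 60 + setup_s) / 3600

est_wall_clock_h(workers=1)    # ~12.0 h: 24 waves of one chromosome each
est_wall_clock_h(workers=8)    # ~1.5 h:  3 waves of 8
est_wall_clock_h(workers=24)   # ~0.5 h:  a single wave, bounded by the slowest chunk
```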

What the profile says is left on the table

| Cost | Status |
|------|--------|
| gffutils SQLite queries | ✅ Killed by #102's cache (now 2% of time) |
| Per-chromosome vcf_to_proteindb sequential | ✅ This PR (3.4× empirical) |
| Per-worker SeqIO.index rebuild (~14 s × N workers) | ⏳ Switch to SeqIO.index_db (SQLite-backed, persistent FASTA index), saving ~14 s × N. Modest; deferred. |
| Per-worker gffutils.create_db first run (~19 s) | ✅ Already cached on disk between runs; only the very first run pays |
| Biopython Seq.translate per variant | ⏳ Could batch translation within a chromosome chunk; bigger refactor, deferred |
| vep field parser fails on records with commas inside pipe-separated cells (e.g. PHYLOCSF_TOO_SHORT,GERP_DIST:…) | ❌ Pre-existing parser bug; ~8% of gnomAD variants are currently dropped as "invalid". Worth a separate fix. |
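
The last row deserves a concrete illustration. VEP joins per-transcript annotation blocks with commas and the fields inside each block with pipes, so a comma inside a field value defeats naive comma-splitting. The record below is made up for illustration, not taken from the dataset:

```python
# Hypothetical vep INFO value: consequence | transcript | LoF flags
ann = "missense_variant|ENST00000000001|PHYLOCSF_TOO_SHORT,GERP_DIST:363.93"

blocks = ann.split(",")   # naive split on the block separator
# ['missense_variant|ENST00000000001|PHYLOCSF_TOO_SHORT', 'GERP_DIST:363.93']
# The comma sits *inside* the third pipe-separated cell, so the tail arrives
# as a pipe-less pseudo-block and the record is rejected as "invalid",
# producing the ~8% drop noted above.
```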

Test plan

  • Equivalence: pgatk/tests/test_vcf_to_proteindb_parallel.py — runs workers=1 and workers=2 against the in-repo tiny test.vcf fixture and asserts identical sequence sets (passes locally + on remote)
  • Backward-compat: all three existing vcf-to-proteindb tests still pass with default workers=None
    • test_vcf_to_proteindb (CSQ-annotated path)
    • test_vcf_to_proteindb_notannotated (bedtools-intersect path)
    • test_vcf_gnomad_to_proteindb (gnomAD-style VCF)
  • Multi-chromosome equivalence: MD5 of sorted unique sequence content identical between workers=1, workers=4, workers=8 on the gnomAD 4-chrom 95K-variant slice
  • Reviewer: run pgatk vcf-to-proteindb --workers 8 ... on a whole-genome VEP-annotated VCF and report wall-clock vs single-thread

Files

pgatk/commands/vcf_to_proteindb.py        (+9, -1)   — new -w/--workers flag
pgatk/ensembl/ensembl.py                  (+124, -3) — _split_vcf_by_chrom,
                                                       _vcf_to_proteindb_worker,
                                                       _vcf_to_proteindb_chunk,
                                                       parallel dispatch in
                                                       vcf_to_proteindb()
pgatk/tests/test_vcf_to_proteindb_parallel.py  (+ new) — equivalence test
scripts/benchmark_vcf_to_proteindb.py     (+51, -6)  — cProfile support

Commits

  • `7c08c50` — Add cProfile support to vcf-to-proteindb benchmark script
  • `9235f7d` — Parallelise vcf-to-proteindb per chromosome via multiprocessing

ypriverol added 2 commits May 13, 2026 19:24
Adds --profile-out PATH and --print-top N flags. When given, the run is
wrapped in cProfile, the .prof file is written to PATH, and the top N
functions are printed sorted by both cumulative time and own time. Used
to identify the remaining hot path after the issue-#99 fix before
designing a parallelization strategy.
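
A plausible shape for that wrapper, using only the standard library (the flag names come from the message; the actual benchmark script may differ):

```python
import cProfile
import pstats

def run_profiled(fn, profile_out, print_top=20):
    prof = cProfile.Profile()
    prof.enable()
    try:
        fn()                                    # the vcf-to-proteindb run
    finally:
        prof.disable()
    prof.dump_stats(profile_out)                # .prof file for later analysis
    for sort_key in ("cumulative", "tottime"):  # cumulative time, then own time
        pstats.Stats(prof).sort_stats(sort_key).print_stats(print_top)
```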
Profile of 50K gnomAD chr22 variants showed ~36s one-time setup and
~14s per-variant work, with Biopython translate the dominant CPU cost.
Implements per-chromosome multiprocessing.Pool to fan out the
per-variant loop across cores.

- New `--workers N` CLI flag (default 1, sequential, backward-compatible).
- Internally: pre-annotate VCF in the main process (avoids bedtools-cwd
  race), split the VCF into per-chromosome temp files, dispatch to a
  spawn-context Pool, concatenate per-chunk outputs.
- Existing per-variant loop moved verbatim into `_vcf_to_proteindb_chunk`;
  the public `vcf_to_proteindb` now dispatches sequential or parallel.
- Workers re-construct EnsemblDataService fresh per process (no shared
  gffutils SQLite handles or SeqIO indices); each pays its own setup
  but the per-variant CPU work parallelises near-linearly.
- Equivalence test verifies workers=2 produces the same set of output
  sequences as workers=1 on the existing test.vcf fixture.

@coderabbitai
coderabbitai Bot commented May 13, 2026

Review skipped: auto reviews are disabled on base/target branches other than the default branch. To trigger a single review, invoke the @coderabbitai review command.

@ypriverol ypriverol changed the base branch from dev to feat/bug-fixed-iteration May 13, 2026 19:39
ypriverol added 2 commits May 13, 2026 20:49
…apping

Three focused improvements stacking on the per-chromosome multiprocessing PR:

1. Replace SeqIO.index with SeqIO.index_db. SeqIO.index rebuilds an
   in-memory FASTA offset index every invocation (~14 s for the 434 MB
   Ensembl cDNA FASTA); SeqIO.index_db persists the index in a SQLite
   .idx file next to the FASTA, reused across runs and shared across
   workers. Pre-built once in the main process before fan-out so the
   first chunk pays the build cost and the rest just open the same .idx.
   Stale detection: .idx older than FASTA triggers a rebuild.

2. _split_vcf_by_chrom now streams the input VCF and writes per-chrom
   chunks directly to disk in a single pass instead of buffering all
   data lines in memory. Constant memory regardless of input size —
   makes the whole-genome path actually feasible.

3. transcript_id_mapping (versionless -> versioned dict of 207k FASTA
   keys) is now built lazily on the first KeyError fallback. Most
   real workloads never need it; saves ~200ms per worker startup.

No public API change. All existing tests pass; equivalence test still
confirms sequential == parallel output.
The previous commit (254e35c) inadvertently captured a re-ordering of
proteindb_from_custom_VCF.fa from a single test run. The file is the
OUTPUT of test_vcf_to_proteindb_notannotated (which only asserts
exit_code == 0, not content); the ordering shifts non-deterministically
between runs because annoate_vcf uses Python set iteration with hash
randomization. Restore the prior content to keep the working tree
stable across consecutive test invocations.

Also gitignore *.fa.idx / *.fasta.idx: BioPython SeqIO.index_db now
materialises a SQLite index next to each FASTA. These are build
artifacts, rebuilt automatically when the source FASTA changes.
@ypriverol
Member Author

Tier-1 follow-up benchmark — d50f8a0 vs 9235f7d

Same workload as the PR body (4-chrom gnomAD v4.1 exomes, 95K variants, Ensembl 113 GTF/cDNA on pride-linux-vm, 8 cores). One-time .idx warm-up: 7.91 s, then:

| Workers | Before Tier-1 (9235f7d) | After Tier-1 (d50f8a0) | Wall-clock Δ | vs original sequential |
|---------|-------------------------|------------------------|--------------|------------------------|
| 1 | 69.08 s | 20.86 s | −69.8% (3.3× faster) | 3.3× |
| 4 | 20.61 s | 12.75 s | −38.1% (1.6× faster) | 5.4× |
| 8 (caps to 4) | 20.41 s | 12.86 s | −37.0% (1.6× faster) | 5.4× |

Output MD5 (7c5403192b5acb62bd2841ae3c39d62a) identical across all three runs and matches the pre-Tier-1 baseline — semantically zero change.

Why sequential improved 3.3×: SeqIO.index_db persists the 434 MB FASTA offset index to disk (12 MB SQLite file), so the per-invocation 14 s scan is now a sub-second open after the first build. The previous baseline rebuilt it every run.

Why parallel still improves on top: each worker also opens the pre-built .idx instead of scanning, saving 13 s × N workers of redundant setup. The compounding 5.4× is the realistic end-user speedup vs the unoptimised sequential code path.

Extrapolation update for whole-genome (~12 h originally sequential):

| Mode | Estimated wall-clock |
|------|----------------------|
| --workers 1 post-Tier-1 | ~3.6 h |
| --workers 8 post-Tier-1 | ~1 h (bounded by chr1) |
| --workers 24 post-Tier-1 | ~30 min (bounded by chr1's per-chrom time) |

@ypriverol ypriverol merged commit 6c02214 into feat/bug-fixed-iteration May 14, 2026
1 of 2 checks passed