Skip to content

Docs and examples about how to use the library in different use cases#94

Merged
ypriverol merged 13 commits into
masterfrom
dev
Mar 3, 2026
Merged

Docs and examples about how to use the library in different use cases#94
ypriverol merged 13 commits into
masterfrom
dev

Conversation

@ypriverol
Copy link
Copy Markdown
Member

@ypriverol ypriverol commented Mar 3, 2026

Summary by CodeRabbit

  • New Features

    • Added comprehensive "Use Cases" documentation and navigation entry.
    • Documented new CLI commands: clinvar-to-proteindb, digest-mutant-protein, map-peptide2genome, ncbi-downloader.
  • Documentation

    • Expanded CLI reference with download, ClinVar-to-proteindb, digest/map examples and options.
  • Improvements

    • Safer VCF annotation and exon-aware alt-sequence handling.
    • Better handling of compressed/encoded inputs and refined ClinVar classification filtering.
  • Tests

    • Added unit tests covering ClinVar CLNSIG parsing and exon-clipping edge cases.
  • Chores

    • Removed several planning/design documents.

@qodo-code-review
Copy link
Copy Markdown

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Review Summary by Qodo

Bug fixes, enhancements, and comprehensive use case documentation

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add utility functions for gzip file handling and amino acid code conversion
• Improve mutation parsing with better null checks and 3-letter amino acid support
• Fix VCF annotation parsing and chromosome name normalization
• Enhance error handling for missing COSMIC file columns and malformed rows
• Add comprehensive documentation with 13 end-to-end proteogenomics use case workflows
Diagram
flowchart LR
  A["Input Files<br/>COSMIC/VCF/FASTA"] -->|"Parse & Validate"| B["Enhanced Mutation<br/>Processing"]
  B -->|"Handle Gzip<br/>& Amino Acids"| C["Improved Protein<br/>Translation"]
  C -->|"Filter & Normalize"| D["Output Protein<br/>Database"]
  E["Documentation"] -->|"13 Workflows"| F["Use Cases Guide"]
Loading

Grey Divider

File Changes

1. pgatk/cgenomes/cgenomes_proteindb.py Bug fix, enhancement, error handling +132/-36

Add gzip support and improve mutation parsing logic

pgatk/cgenomes/cgenomes_proteindb.py


2. pgatk/clinvar/clinvar_service.py Bug fix, enhancement +17/-5

Fix CLNSIG filter for compound values and CDS parsing

pgatk/clinvar/clinvar_service.py


3. pgatk/clinvar/data_downloader.py ⚙️ Configuration changes +1/-1

Update RefSeq file from protein to RNA sequences

pgatk/clinvar/data_downloader.py


View more (11)
4. pgatk/ensembl/ensembl.py Bug fix, enhancement +42/-24

Fix VCF annotation key matching and improve robustness

pgatk/ensembl/ensembl.py


5. pgatk/toolbox/vcf_utils.py 🐞 Bug fix +17/-12

Fix variant allele handling for multi-base variants

pgatk/toolbox/vcf_utils.py


6. pgatk/tests/test_clinvar/test_clinvar_service.py 🧪 Tests +17/-0

Add tests for compound CLNSIG filter values

pgatk/tests/test_clinvar/test_clinvar_service.py


7. pgatk/tests/test_clinvar/test_data_downloader.py 🧪 Tests +1/-1

Update test for RNA file instead of protein

pgatk/tests/test_clinvar/test_data_downloader.py


8. docs/index.md 📝 Documentation +1/-0

Add use cases documentation link to navigation

docs/index.md


9. docs/pgatk-cli.md 📝 Documentation +163/-0

Add documentation for new CLI commands and workflows

docs/pgatk-cli.md


10. docs/use-cases.md 📝 Documentation +1078/-0

Add 13 comprehensive end-to-end proteogenomics workflows

docs/use-cases.md


11. mkdocs.yml 📝 Documentation +1/-0

Add use cases page to documentation navigation

mkdocs.yml


12. pgatk/testdata/clinvar/mini_refseq_protein.faa Miscellaneous +2/-2

Update test data sequences

pgatk/testdata/clinvar/mini_refseq_protein.faa


13. pgatk/testdata/proteindb_from_custom_VCF.fa Miscellaneous +4/-4

Reorder FASTA entries in test data

pgatk/testdata/proteindb_from_custom_VCF.fa


14. pgatk/testdata/proteindb_from_gnomad_VCF.fa Miscellaneous +0/-2

Remove one test sequence entry

pgatk/testdata/proteindb_from_gnomad_VCF.fa


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Mar 3, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. ALT not exon-clipped🐞 Bug ✓ Correctness
Description
In get_altseq(), the ref allele length is clipped when a multi-base variant extends past an exon
boundary, but the ALT allele is still inserted in full. For variants where only part of REF is
exonic, this can inject extra ALT bases into the transcript and produce incorrect proteins silently.
Code

pgatk/toolbox/vcf_utils.py[R109-123]

+    for feature in features_info:
+        if var_pos in range(feature[0], feature[1] + 1):
+            var_index_in_cds = nc_index + (var_pos - feature[0])
+            # Clip the ref allele length to the portion that falls within
+            # this feature.  When a multi-base variant extends from an exon
+            # into an intron, the intronic bases are absent from the
+            # transcript and must not be counted.
+            var_end_genomic = var_pos + len(ref_allele) - 1
+            exonic_ref_len = min(var_end_genomic, feature[1]) - var_pos + 1
+            c = max(exonic_ref_len, 0)
+            alt_seq = ref_seq[0:var_index_in_cds] + var_allele + ref_seq[var_index_in_cds + c::]
+            if strand == '-':
+                return ref_seq[::-1], alt_seq[::-1]
+            else:
+                return ref_seq, alt_seq
Evidence
The new logic explicitly acknowledges that intronic bases are absent from the transcript and clips
only the REF length (c), but then concatenates the full ALT allele regardless of c. Both Ensembl and
ClinVar pipelines compute var_end using the full genomic REF length and pass the full VCF ALT allele
into get_altseq(), so exon→intron spanning multi-base variants are plausible and will be
mis-applied.

pgatk/toolbox/vcf_utils.py[108-123]
pgatk/ensembl/ensembl.py[639-654]
pgatk/clinvar/clinvar_service.py[519-533]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`pgatk.toolbox.vcf_utils.get_altseq()` clips the REF allele length to the exon-overlapping portion (`c`), but still inserts the full ALT allele. For multi-base variants that extend from exon into intron, this can inject extra ALT bases into the transcript and produce incorrect variant proteins.
### Issue Context
Both Ensembl and ClinVar pipelines compute variant end positions using `len(REF)` and call `get_altseq()` with full VCF `REF/ALT` alleles, so exon-boundary spanning variants are possible.
### Fix Focus Areas
- pgatk/toolbox/vcf_utils.py[108-123]
- pgatk/tests/test_vcf_utils.py[62-113]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. COSMIC ins/delins accepts non-ATCG 🐞 Bug ⛯ Reliability
Description
COSMIC DNA insertion handling (ins/delins) validates inserted sequence with isalpha() only, unlike
SNVs which enforce A/T/C/G. This can silently create incorrect mutated CDS (e.g., if insert contains
IUPAC ambiguity like N) and yield wrong proteins rather than rejecting the record.
Code

pgatk/cgenomes/cgenomes_proteindb.py[R146-166]

+            elif "delins" in snp.dna_mut:
+                # Deletion-insertion: delete range then insert new bases
+                insert_dna = snp.dna_mut.split("delins")[1]
+                if insert_dna.isalpha() and len(positions) >= 2:
+                    del_index1 = int(positions[0]) - 1
+                    del_index2 = int(positions[1])
+                    seq_mut = seq[:del_index1] + insert_dna + seq[del_index2:]
+                    mut_pro_seq = str(seq_mut.translate(to_stop=False))
+                elif insert_dna.isalpha() and len(positions) == 1:
+                    # Single-position delins: replace one base
+                    del_index1 = int(positions[0]) - 1
+                    seq_mut = seq[:del_index1] + insert_dna + seq[del_index1 + 1:]
+                    mut_pro_seq = str(seq_mut.translate(to_stop=False))
+
           elif "ins" in snp.dna_mut:
               index = snp.dna_mut.index("ins")
               insert_dna = snp.dna_mut[index + 3:]
               if insert_dna.isalpha():
                   ins_index1 = int(positions[0])
                   seq_mut = seq[:ins_index1] + insert_dna + seq[ins_index1:]
-                    mut_pro_seq = seq_mut.translate(to_stop=False)
+                    mut_pro_seq = str(seq_mut.translate(to_stop=False))
Evidence
The function defines a nucleotide whitelist and applies it for substitutions, but the newly added
delins/ins branches accept any alphabetic string as DNA and translate it. The COSMIC pipeline only
retries on IndexError/ValueError; incorrect-but-translatable sequences will pass through and be
written as proteins.

pgatk/cgenomes/cgenomes_proteindb.py[132-166]
pgatk/cgenomes/cgenomes_proteindb.py[327-333]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`CancerGenomesService.get_mut_pro_seq()` validates SNVs against a strict A/T/C/G whitelist but allows `ins`/`delins` inserted sequences using `isalpha()` only. This can admit non-ATCG bases and silently generate incorrect mutant proteins.
### Issue Context
The COSMIC pipeline writes any non-empty returned protein to output; invalid inserted DNA that still translates won’t be rejected.
### Fix Focus Areas
- pgatk/cgenomes/cgenomes_proteindb.py[132-177]
- pgatk/cgenomes/cgenomes_proteindb.py[327-351]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

Adds a new "Use Cases" docs page and CLI docs; updates ClinVar/RefSeq downloader to expect RNA FASTA; hardens ClinVar filtering and VCF/translation parsing (I/O, amino-acid conversions, exon-boundary handling); updates related test data and unit tests.

Changes

Cohort / File(s) Summary
Documentation & Navigation
docs/index.md, docs/use-cases.md, docs/pgatk-cli.md, mkdocs.yml, docs/plans/*
Adds a new use-cases page, extends CLI docs (new commands and NCBI downloader section), updates mkdocs nav, and removes several planning documents.
ClinVar Downloader & Data
pgatk/clinvar/data_downloader.py, pgatk/testdata/clinvar/mini_refseq_protein.faa
Replaces expected RefSeq protein FASTA with RNA FASTA in download list and updates test expectations and test FASTA sequences accordingly.
ClinVar Service & Filtering
pgatk/clinvar/clinvar_service.py, pgatk/tests/test_clinvar/test_clinvar_service.py
Enhances CLNSIG filtering to validate compound values component-wise, normalizes delimiters, adjusts translation-table selection for mitochondrial chr names, and adds unit tests for compound/delimiter edge cases.
Cancer Genomes / Protein DB
pgatk/cgenomes/cgenomes_proteindb.py
Adds encoding- and compression-aware I/O helpers (_open_text), 3-to-1 AA conversion (_three_to_one), new CONFIG_* constants, changes get_mut_pro_seq to accept Bio.Seq, and tightens parsing/error handling for COSMIC/cBioPortal inputs.
VCF Utilities
pgatk/toolbox/vcf_utils.py, pgatk/tests/test_vcf_utils.py
Refines get_altseq to clip REF/ALT by exon boundaries for variants spanning exon–intron junctions and adds tests covering clipping and non-clipping edge cases.
Ensembl / VCF → proteindb
pgatk/ensembl/ensembl.py
Switches to canonical mutation key (CHROM/POS/REF/ALT) for transcript mapping, uses sets for deduplication, safely parses INFO/CDS tokens, and guards against out-of-range indices.
Test data: proteindb outputs
pgatk/testdata/proteindb_from_custom_VCF.fa, pgatk/testdata/proteindb_from_gnomad_VCF.fa
Updates and expands variant FASTA entries (headers and sequences) to reflect translation and VCF parsing changes.
CLI docs additions
docs/pgatk-cli.md
Adds documentation for new commands: clinvar-to-proteindb, digest-mutant-protein, map-peptide2genome, and ncbi-downloader (note: file contains duplicated sections).

Sequence Diagram(s)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Poem

🐰
I nibbled bytes beneath the moonlight bright,
Compressed files sighed and found their right,
CLNSIGs sorted, exons clipped with care,
Sequences hummed in tidy, trimmed pairs,
A hopping cheer — docs, tests, and code take flight!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main objective of the PR: adding comprehensive documentation and examples for using the library in various proteogenomics use cases, which is reflected in the new use-cases.md file and supporting documentation updates.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread pgatk/toolbox/vcf_utils.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
pgatk/ensembl/ensembl.py (1)

634-637: Minor edge case in deduplication key construction.

The deduplication key concatenates transcript_id + REF + ALT without a separator. While collisions are extremely unlikely given that transcript IDs follow structured naming conventions (e.g., ENST00000...) and REF/ALT are nucleotide sequences, adding a separator would be slightly more defensive.

🔧 Optional: Add separator for clarity
 for alt in alts:
-    dedup_key = transcript_id + str(record.REF) + str(alt)
+    dedup_key = f"{transcript_id}:{record.REF}:{alt}"
     if dedup_key in processed_transcript_allele:
         continue
     processed_transcript_allele.add(dedup_key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pgatk/ensembl/ensembl.py` around lines 634 - 637, The deduplication key
creation (dedup_key) currently concatenates transcript_id, record.REF and alt
without a separator which can theoretically collide; update the construction in
the block using dedup_key and processed_transcript_allele to include an explicit
delimiter (e.g., "|" or "\t") between transcript_id, record.REF and alt (for
example build dedup_key by joining [transcript_id, record.REF, alt] with a
separator) so the check in processed_transcript_allele and the add call use the
new separator-based key.
pgatk/cgenomes/cgenomes_proteindb.py (2)

17-28: Consider simplifying the encoding branches.

The explicit checks for utf-8 and latin-1 encodings (lines 23-26) are redundant since they all use the same open() pattern with the respective encoding. The fallback on line 28 already handles the general case.

♻️ Suggested simplification
 def _open_text(path: str, mode: str = 'r', encoding: str = 'utf-8', **kwargs):
     """Open a file for text I/O, decompressing automatically if path ends with .gz."""
     enc = encoding or "utf-8"
     if path.endswith('.gz'):
         text_mode = mode if 't' in mode else mode + 't'
         return gzip.open(path, text_mode, encoding=enc, **kwargs)
-    if enc == "utf-8":
-        return open(path, mode, encoding="utf-8", **kwargs)
-    if enc == "latin-1":
-        return open(path, mode, encoding="latin-1", **kwargs)
-    # Only utf-8 and latin-1 are used; fallback to utf-8 for any other value
-    return open(path, mode, encoding="utf-8", **kwargs)
+    # Validate against allowed encodings; fallback to utf-8 for unsupported values
+    if enc not in ("utf-8", "latin-1"):
+        enc = "utf-8"
+    return open(path, mode, encoding=enc, **kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pgatk/cgenomes/cgenomes_proteindb.py` around lines 17 - 28, The _open_text
function has redundant branches for encodings; after handling .gz files, replace
the explicit utf-8/latin-1 conditionals by using the computed enc variable
directly in a single open(...) call (i.e., return open(path, mode, encoding=enc,
**kwargs)) so all non-gz cases are handled uniformly and the fallback behavior
remains the same; keep the gzip handling (text_mode) intact and only simplify
the subsequent open calls in _open_text.

317-317: Consider renaming to avoid variable shadowing.

The variable header on line 317 (FASTA entry header) shadows the header variable from line 268 (column headers list). While functionally correct since the column indices are already extracted, this could cause confusion during maintenance.

♻️ Suggested rename for clarity
-                header = "COSMIC:%s:%s:%s" % (snp.gene, snp.aa_mut, snp.mutation_type.replace(" ", ""))
+                fasta_header = "COSMIC:%s:%s:%s" % (snp.gene, snp.aa_mut, snp.mutation_type.replace(" ", ""))

Then update subsequent references to header (lines 338-346) to use fasta_header.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pgatk/cgenomes/cgenomes_proteindb.py` at line 317, The FASTA entry variable
named header (assigned as "COSMIC:%s:%s:%s" % (snp.gene, snp.aa_mut, ...))
shadows an earlier variable named header used for column headers; rename this
FASTA variable to fasta_header and update all subsequent references in the
FASTA-generation block (the lines that build/write the FASTA header and sequence
for each snp) to use fasta_header to avoid confusion and variable shadowing.
pgatk/tests/test_clinvar/test_clinvar_service.py (1)

68-72: Test intent and expectation are inconsistent.

At Line 68-70, the name/doc imply exclusion, but Line 72 asserts the variant should pass. Please rename/reword for clarity.

🧹 Suggested cleanup
-    def test_compound_risk_factor_excluded(self):
-        """Benign with risk_factor — all components benign → excluded."""
+    def test_compound_risk_factor_passes(self):
+        """Benign with _risk_factor should pass because not all components are excluded."""
         exclude = ["Benign", "Likely_benign"]
-        # '_risk_factor' is not in exclude list, so it's not an excluded component
+        # '_risk_factor' is not in exclude list, so the compound value should pass
         assert ClinVarService.passes_clnsig_filter("Benign,_risk_factor", exclude) is True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pgatk/tests/test_clinvar/test_clinvar_service.py` around lines 68 - 72, The
test name/docstring and assertion conflict: in
test_compound_risk_factor_excluded the docstring says "excluded" but the
assertion expects True from ClinVarService.passes_clnsig_filter for input
"Benign,_risk_factor"; either rename the test or change the docstring to reflect
that the compound with '_risk_factor' should pass. Update the test name to
something like test_compound_risk_factor_not_excluded or change the docstring to
"Benign with risk_factor — all components benign except _risk_factor → not
excluded" so the test name/doc and the call to
ClinVarService.passes_clnsig_filter("Benign,_risk_factor", exclude) are
consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/use-cases.md`:
- Around line 181-188: The docs show running "pgatk ensembl-downloader" with the
flag "--skip_dna" but later steps call "gffread" and require "genome.fa", which
will fail; update the recipes that list "--skip_dna" (the ensembl-downloader
example) to either remove "--skip_dna" or add an explicit note and step to
supply "genome.fa" before running gffread; specifically reference the
ensembl-downloader invocation and the gffread steps so the user either downloads
DNA (omit "--skip_dna") or is instructed to provide a genome.fa path and where
to place it for gffread to use.

In `@pgatk/clinvar/clinvar_service.py`:
- Around line 124-127: The current logic splits clnsig into parts and then
treats an all(...) over an empty generator as True, causing delimiter-only
inputs (e.g., "///") to be treated as excluded; change the flow in the code
handling parts (after parts = re.split(r"[/,]", clnsig)) to first build a list
of non-empty trimmed tokens (e.g., tokens = [p.strip() for p in parts if
p.strip()]) and if tokens is empty return True (i.e., treat empty/missing CLNSIG
as passing), otherwise evaluate the existing exclusion check against tokens and
exclude_list as before.

In `@pgatk/toolbox/vcf_utils.py`:
- Around line 109-119: The current code only matches when var_pos falls inside a
single feature and only clips against that feature, so variants starting in
introns or spanning multiple exons are handled incorrectly; replace the
single-feature check with an overlap-based approach: compute var_end_genomic =
var_pos + len(ref_allele) - 1, collect all features in features_info where
feature[0] <= var_end_genomic and feature[1] >= var_pos (i.e., any overlap),
compute exonic_ref_len as the sum over those overlapping features of
(min(var_end_genomic, feature[1]) - max(var_pos, feature[0]) + 1), determine
var_index_in_cds using nc_index plus the cumulative CDS length up to the first
overlapping base (i.e., sum lengths of earlier features plus (max(var_pos,
first_feature[0]) - first_feature[0]) ), set c = max(exonic_ref_len, 0), and
then build alt_seq = ref_seq[0:var_index_in_cds] + var_allele +
ref_seq[var_index_in_cds + c:] using ref_seq, var_allele and ref_allele.

---

Nitpick comments:
In `@pgatk/cgenomes/cgenomes_proteindb.py`:
- Around line 17-28: The _open_text function has redundant branches for
encodings; after handling .gz files, replace the explicit utf-8/latin-1
conditionals by using the computed enc variable directly in a single open(...)
call (i.e., return open(path, mode, encoding=enc, **kwargs)) so all non-gz cases
are handled uniformly and the fallback behavior remains the same; keep the gzip
handling (text_mode) intact and only simplify the subsequent open calls in
_open_text.
- Line 317: The FASTA entry variable named header (assigned as "COSMIC:%s:%s:%s"
% (snp.gene, snp.aa_mut, ...)) shadows an earlier variable named header used for
column headers; rename this FASTA variable to fasta_header and update all
subsequent references in the FASTA-generation block (the lines that build/write
the FASTA header and sequence for each snp) to use fasta_header to avoid
confusion and variable shadowing.

In `@pgatk/ensembl/ensembl.py`:
- Around line 634-637: The deduplication key creation (dedup_key) currently
concatenates transcript_id, record.REF and alt without a separator which can
theoretically collide; update the construction in the block using dedup_key and
processed_transcript_allele to include an explicit delimiter (e.g., "|" or "\t")
between transcript_id, record.REF and alt (for example build dedup_key by
joining [transcript_id, record.REF, alt] with a separator) so the check in
processed_transcript_allele and the add call use the new separator-based key.

In `@pgatk/tests/test_clinvar/test_clinvar_service.py`:
- Around line 68-72: The test name/docstring and assertion conflict: in
test_compound_risk_factor_excluded the docstring says "excluded" but the
assertion expects True from ClinVarService.passes_clnsig_filter for input
"Benign,_risk_factor"; either rename the test or change the docstring to reflect
that the compound with '_risk_factor' should pass. Update the test name to
something like test_compound_risk_factor_not_excluded or change the docstring to
"Benign with risk_factor — all components benign except _risk_factor → not
excluded" so the test name/doc and the call to
ClinVarService.passes_clnsig_filter("Benign,_risk_factor", exclude) are
consistent.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27fd1ab and b4374ad.

📒 Files selected for processing (14)
  • docs/index.md
  • docs/pgatk-cli.md
  • docs/use-cases.md
  • mkdocs.yml
  • pgatk/cgenomes/cgenomes_proteindb.py
  • pgatk/clinvar/clinvar_service.py
  • pgatk/clinvar/data_downloader.py
  • pgatk/ensembl/ensembl.py
  • pgatk/testdata/clinvar/mini_refseq_protein.faa
  • pgatk/testdata/proteindb_from_custom_VCF.fa
  • pgatk/testdata/proteindb_from_gnomad_VCF.fa
  • pgatk/tests/test_clinvar/test_clinvar_service.py
  • pgatk/tests/test_clinvar/test_data_downloader.py
  • pgatk/toolbox/vcf_utils.py
💤 Files with no reviewable changes (1)
  • pgatk/testdata/proteindb_from_gnomad_VCF.fa

Comment thread docs/use-cases.md
Comment thread pgatk/clinvar/clinvar_service.py Outdated
Comment thread pgatk/toolbox/vcf_utils.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
pgatk/toolbox/vcf_utils.py (1)

109-147: ⚠️ Potential issue | 🟠 Major

Overlap handling is still incomplete for intron-start and multi-exon spans.

Line 110 still gates on var_pos being inside one feature, so variants that start in introns but overlap exons are dropped. Also, Line 117 computes exonic clipping from only the first matched feature, which under-deletes when REF spans multiple exons.

Proposed fix (overlap-based across all intersecting features)
-    nc_index = 0
-    for feature in features_info:
-        if var_pos in range(feature[0], feature[1] + 1):
-            var_index_in_cds = nc_index + (var_pos - feature[0])
-            # Clip the ref allele length to the portion that falls within
-            # this feature.  When a multi-base variant extends from an exon
-            # into an intron, the intronic bases are absent from the
-            # transcript and must not be counted.
-            var_end_genomic = var_pos + len(ref_allele) - 1
-            exonic_ref_len = min(var_end_genomic, feature[1]) - var_pos + 1
-            c = max(exonic_ref_len, 0)
-            # Clip the ALT allele by the same number of trailing bases that
-            # were removed from REF.  VCF variants are left-aligned, so the
-            # trailing bases correspond to the intronic portion.
-            intronic_ref_len = len(ref_allele) - c
-            if intronic_ref_len > 0:
-                exonic_alt_len = max(len(var_allele) - intronic_ref_len, 0)
-                var_allele = var_allele[:exonic_alt_len]
-                logger.debug(
-                    "Variant at %d extends %d bases past exon boundary "
-                    "(feature %d-%d); clipped REF to %d and ALT to %d bases",
-                    var_pos, intronic_ref_len, feature[0], feature[1],
-                    c, exonic_alt_len,
-                )
-            alt_seq = ref_seq[0:var_index_in_cds] + var_allele + ref_seq[var_index_in_cds + c::]
-            if strand == '-':
-                return ref_seq[::-1], alt_seq[::-1]
-            else:
-                return ref_seq, alt_seq
-        else:
-            # Log when variant starts outside this feature (intron-start)
-            var_end_genomic = var_pos + len(ref_allele) - 1
-            if var_end_genomic >= feature[0] and var_pos < feature[0]:
-                logger.debug(
-                    "Variant at %d-%d starts before feature %d-%d "
-                    "(intron-start); skipping this feature",
-                    var_pos, var_end_genomic, feature[0], feature[1],
-                )
-
-        nc_index += (feature[1] - feature[0] + 1)
+    nc_index = 0
+    var_end_genomic = var_pos + len(ref_allele) - 1
+    var_index_in_cds = None
+    exonic_ref_len = 0
+
+    for feature in features_info:
+        feature_start, feature_end = feature[0], feature[1]
+        feature_len = feature_end - feature_start + 1
+
+        # no overlap with this feature
+        if var_end_genomic < feature_start or var_pos > feature_end:
+            nc_index += feature_len
+            continue
+
+        overlap_start = max(var_pos, feature_start)
+        overlap_end = min(var_end_genomic, feature_end)
+
+        if var_index_in_cds is None:
+            var_index_in_cds = nc_index + (overlap_start - feature_start)
+
+        exonic_ref_len += max(0, overlap_end - overlap_start + 1)
+        nc_index += feature_len
+
+    if var_index_in_cds is None:
+        return ref_seq, alt_seq
+
+    c = max(exonic_ref_len, 0)
+    intronic_ref_len = len(ref_allele) - c
+    if intronic_ref_len > 0:
+        exonic_alt_len = max(len(var_allele) - intronic_ref_len, 0)
+        var_allele = var_allele[:exonic_alt_len]
+
+    alt_seq = ref_seq[:var_index_in_cds] + var_allele + ref_seq[var_index_in_cds + c:]
+    if strand == '-':
+        return ref_seq[::-1], alt_seq[::-1]
+    return ref_seq, alt_seq
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pgatk/toolbox/vcf_utils.py` around lines 109 - 147, The loop in the
features_info iteration incorrectly assumes the variant start (var_pos) must lie
inside a single feature and only clips REF based on that first feature; instead
compute the variant genomic span (var_end_genomic = var_pos + len(ref_allele) -
1), find all features that intersect that span, sum the total exonic portion of
REF across those intersecting features to get the total exonic_ref_len (rather
than using just one feature), determine var_index_in_cds from nc_index at the
first intersecting feature, and then clip var_allele by the total intronic
portion (len(ref_allele) - total_exonic_ref_len); also update the intron-start
debug logging to trigger when any intersection exists (not only when var_pos is
inside), and keep the existing strand reversal and ref/alt sequence assembly but
use the cumulative c when slicing ref_seq and building alt_seq.
🧹 Nitpick comments (1)
pgatk/tests/test_vcf_utils.py (1)

113-160: Good additions; consider adding intron-start + multi-exon regression tests.

These new tests validate single-exon clipping well, but they don’t cover variants that start before an exon or REF spans multiple exons. Adding those would protect the boundary logic more comprehensively.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pgatk/tests/test_vcf_utils.py` around lines 113 - 160, Add regression tests
exercising variants that start in introns (intron-start) and REF/ALT spanning
multiple exons to cover boundary logic in get_altseq: create new test cases in
pgatk/tests/test_vcf_utils.py that (1) place a variant start before an exon
(e.g., variant pos within intron leading into exon) and assert REF/ALT are
clipped correctly against the exon start, and (2) define features_info with two
or more exon ranges so REF spans an exon->intron->exon region and verify
get_altseq returns the expected coding_ref and coding_alt after
clipping/concatenation across exons; use the same pattern as existing tests (Seq
for ref_seq, Seq for REF/ALT, call get_altseq) and assert exact string outputs
to lock down multi-exon behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/use-cases.md`:
- Around line 9-15: Change the compound modifier "Cell-Type Specific" and the
inline "cell-type specific" to the hyphenated form "cell-type-specific" in the
heading "Cell-Type Specific Non-Canonical Peptide Discovery" and the workflow
description paragraph that references "cell-type specific databases"; update
both occurrences to ensure consistent compound-adjective hyphenation.
- Around line 613-622: The tip admonition currently uses an indented code block;
replace that indented block with a fenced triple-backtick code block so
markdownlint (MD046) is satisfied — locate the admonition containing the pgatk
digest-mutant-protein example and change the indented shell snippet to use
```bash on the opening line and ``` on the closing line, keeping the same
content and indentation inside the fence.

In `@pgatk/tests/test_clinvar/test_clinvar_service.py`:
- Around line 68-73: The test name/docstring claims the compound case with
'_risk_factor' is excluded, but the assertion expects it to pass; update the
test to reflect intended behavior by either renaming the test from
test_compound_risk_factor_excluded to test_compound_risk_factor_not_excluded (or
similar) and adjust the docstring to say "Benign with risk_factor — all
components benign or risk_factor → not excluded", or if exclusion was intended
change the assertion to assert False; locate this in the test function
referencing ClinVarService.passes_clnsig_filter and make the name/docstring and
assertion consistent.

---

Duplicate comments:
In `@pgatk/toolbox/vcf_utils.py`:
- Around line 109-147: The loop in the features_info iteration incorrectly
assumes the variant start (var_pos) must lie inside a single feature and only
clips REF based on that first feature; instead compute the variant genomic span
(var_end_genomic = var_pos + len(ref_allele) - 1), find all features that
intersect that span, sum the total exonic portion of REF across those
intersecting features to get the total exonic_ref_len (rather than using just
one feature), determine var_index_in_cds from nc_index at the first intersecting
feature, and then clip var_allele by the total intronic portion (len(ref_allele)
- total_exonic_ref_len); also update the intron-start debug logging to trigger
when any intersection exists (not only when var_pos is inside), and keep the
existing strand reversal and ref/alt sequence assembly but use the cumulative c
when slicing ref_seq and building alt_seq.

---

Nitpick comments:
In `@pgatk/tests/test_vcf_utils.py`:
- Around line 113-160: Add regression tests exercising variants that start in
introns (intron-start) and REF/ALT spanning multiple exons to cover boundary
logic in get_altseq: create new test cases in pgatk/tests/test_vcf_utils.py that
(1) place a variant start before an exon (e.g., variant pos within intron
leading into exon) and assert REF/ALT are clipped correctly against the exon
start, and (2) define features_info with two or more exon ranges so REF spans an
exon->intron->exon region and verify get_altseq returns the expected coding_ref
and coding_alt after clipping/concatenation across exons; use the same pattern
as existing tests (Seq for ref_seq, Seq for REF/ALT, call get_altseq) and assert
exact string outputs to lock down multi-exon behavior.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4374ad and eddcc45.

📒 Files selected for processing (11)
  • docs/plans/2026-03-01-pgatk-rename-refactoring-design.md
  • docs/plans/2026-03-01-pgatk-rename-refactoring-plan.md
  • docs/plans/2026-03-01-phase0-infrastructure.md
  • docs/plans/2026-03-02-clinvar-ncbi-implementation.md
  • docs/plans/2026-03-02-clinvar-ncbi-support-design.md
  • docs/plans/2026-03-03-clinvar-pipeline-hardening.md
  • docs/use-cases.md
  • pgatk/clinvar/clinvar_service.py
  • pgatk/tests/test_clinvar/test_clinvar_service.py
  • pgatk/tests/test_vcf_utils.py
  • pgatk/toolbox/vcf_utils.py
💤 Files with no reviewable changes (6)
  • docs/plans/2026-03-01-pgatk-rename-refactoring-plan.md
  • docs/plans/2026-03-01-phase0-infrastructure.md
  • docs/plans/2026-03-02-clinvar-ncbi-implementation.md
  • docs/plans/2026-03-02-clinvar-ncbi-support-design.md
  • docs/plans/2026-03-01-pgatk-rename-refactoring-design.md
  • docs/plans/2026-03-03-clinvar-pipeline-hardening.md

Comment thread docs/use-cases.md
Comment on lines +9 to +15
## Cell-Type Specific Non-Canonical Peptide Discovery

!!! abstract "Featured workflow"
This workflow reproduces the analysis from [Umer et al., *Bioinformatics* 2022](https://doi.org/10.1093/bioinformatics/btab838),
where cell-type specific databases were generated for 65 cell lines,
revealing non-canonical and cryptic peptides amounting to **>5% of total
peptide identifications**.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Hyphenate compound modifiers consistently.

Please update terms like “Cell-Type Specific” / “cell-type specific” to “cell-type-specific” in Line 9 and Line 13 for standard compound-adjective usage.

🧰 Tools
🪛 LanguageTool

[grammar] ~9-~9: Use a hyphen to join words.
Context: ...ady protein database. --- ## Cell-Type Specific Non-Canonical Peptide Discovery...

(QB_NEW_EN_HYPHEN)


[grammar] ~13-~13: Use a hyphen to join words.
Context: ...nformatics/btab838), where cell-type specific databases were generated for 65...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/use-cases.md` around lines 9 - 15, Change the compound modifier
"Cell-Type Specific" and the inline "cell-type specific" to the hyphenated form
"cell-type-specific" in the heading "Cell-Type Specific Non-Canonical Peptide
Discovery" and the workflow description paragraph that references "cell-type
specific databases"; update both occurrences to ensure consistent
compound-adjective hyphenation.

Comment thread docs/use-cases.md
Comment thread pgatk/tests/test_clinvar/test_clinvar_service.py
@ypriverol ypriverol merged commit 39fee83 into master Mar 3, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant