Conversation
Refactor/rename
Refactor/rename: remove and improve the docs
Review Summary by Qodo: Bug fixes, enhancements, and comprehensive use case documentation
Walkthrough
• Add utility functions for gzip file handling and amino acid code conversion
• Improve mutation parsing with better null checks and 3-letter amino acid support
• Fix VCF annotation parsing and chromosome name normalization
• Enhance error handling for missing COSMIC file columns and malformed rows
• Add comprehensive documentation with 13 end-to-end proteogenomics use case workflows

Diagram

```mermaid
flowchart LR
    A["Input Files<br/>COSMIC/VCF/FASTA"] -->|"Parse & Validate"| B["Enhanced Mutation<br/>Processing"]
    B -->|"Handle Gzip<br/>& Amino Acids"| C["Improved Protein<br/>Translation"]
    C -->|"Filter & Normalize"| D["Output Protein<br/>Database"]
    E["Documentation"] -->|"13 Workflows"| F["Use Cases Guide"]
```
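The walkthrough mentions 3-letter amino-acid support in mutation parsing. As a rough illustration of what such a conversion involves, here is a minimal, self-contained sketch (a dict-based helper, not pgatk's actual implementation — `aa3_to_aa1` and `normalize_aa_mut` are hypothetical names):

```python
import re

# Hypothetical sketch of 3-letter -> 1-letter amino acid conversion;
# pgatk's actual helper may differ in name and behavior.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",  # stop codon in HGVS protein notation
}

def aa3_to_aa1(code: str) -> str:
    """Convert a 3-letter amino acid code (e.g. 'Val') to 1-letter ('V')."""
    normalized = code.capitalize()
    if normalized not in THREE_TO_ONE:
        raise ValueError(f"Unknown amino acid code: {code!r}")
    return THREE_TO_ONE[normalized]

def normalize_aa_mut(mut: str) -> str:
    """Rewrite an HGVS-style substitution like 'p.Val600Glu' as 'p.V600E'."""
    m = re.fullmatch(r"p\.([A-Za-z]{3})(\d+)([A-Za-z]{3})", mut)
    if not m:
        return mut  # already 1-letter or not a simple substitution
    return f"p.{aa3_to_aa1(m.group(1))}{m.group(2)}{aa3_to_aa1(m.group(3))}"
```

A mutation written either way then normalizes to the same key, e.g. `normalize_aa_mut("p.Val600Glu")` and `normalize_aa_mut("p.V600E")` both yield `"p.V600E"`.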
File Changes
1. pgatk/cgenomes/cgenomes_proteindb.py

Code Review by Qodo
📝 Walkthrough
Adds a new "Use Cases" docs page and CLI docs; updates the ClinVar/RefSeq downloader to expect RNA FASTA; hardens ClinVar filtering and VCF/translation parsing (I/O, amino-acid conversions, exon-boundary handling); updates related test data and unit tests.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🧹 Nitpick comments (4)
pgatk/ensembl/ensembl.py (1)

634-637: Minor edge case in deduplication key construction.

The deduplication key concatenates `transcript_id + REF + ALT` without a separator. While collisions are extremely unlikely given that transcript IDs follow structured naming conventions (e.g., `ENST00000...`) and REF/ALT are nucleotide sequences, adding a separator would be slightly more defensive.

🔧 Optional: Add separator for clarity

```diff
 for alt in alts:
-    dedup_key = transcript_id + str(record.REF) + str(alt)
+    dedup_key = f"{transcript_id}:{record.REF}:{alt}"
     if dedup_key in processed_transcript_allele:
         continue
     processed_transcript_allele.add(dedup_key)
```
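To make the collision risk concrete: without a separator, two distinct (REF, ALT) pairs can produce the same key. A standalone illustration (hypothetical values, not pgatk code):

```python
# Hypothetical illustration: why separator-less concatenation of key
# parts can collide (transcript ID and alleles are made up).
transcript_id = "ENST00000001"

# Two different variants at the same transcript:
ref1, alt1 = "AT", "G"   # REF "AT" -> ALT "G"
ref2, alt2 = "A", "TG"   # REF "A"  -> ALT "TG"

# Without a separator, both produce the same key:
key1 = transcript_id + ref1 + alt1  # "ENST00000001ATG"
key2 = transcript_id + ref2 + alt2  # "ENST00000001ATG"
assert key1 == key2  # collision: the second variant would be skipped

# With an explicit delimiter the keys stay distinct:
sep1 = f"{transcript_id}:{ref1}:{alt1}"  # "ENST00000001:AT:G"
sep2 = f"{transcript_id}:{ref2}:{alt2}"  # "ENST00000001:A:TG"
assert sep1 != sep2
```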
pgatk/cgenomes/cgenomes_proteindb.py (2)
17-28: Consider simplifying the encoding branches.

The explicit checks for `utf-8` and `latin-1` encodings (lines 23-26) are redundant since they all use the same `open()` pattern with the respective encoding. The fallback on line 28 already handles the general case.

♻️ Suggested simplification

```diff
 def _open_text(path: str, mode: str = 'r', encoding: str = 'utf-8', **kwargs):
     """Open a file for text I/O, decompressing automatically if path ends with .gz."""
     enc = encoding or "utf-8"
     if path.endswith('.gz'):
         text_mode = mode if 't' in mode else mode + 't'
         return gzip.open(path, text_mode, encoding=enc, **kwargs)
-    if enc == "utf-8":
-        return open(path, mode, encoding="utf-8", **kwargs)
-    if enc == "latin-1":
-        return open(path, mode, encoding="latin-1", **kwargs)
-    # Only utf-8 and latin-1 are used; fallback to utf-8 for any other value
-    return open(path, mode, encoding="utf-8", **kwargs)
+    # Validate against allowed encodings; fallback to utf-8 for unsupported values
+    if enc not in ("utf-8", "latin-1"):
+        enc = "utf-8"
+    return open(path, mode, encoding=enc, **kwargs)
```
317: Consider renaming to avoid variable shadowing.

The variable `header` on line 317 (FASTA entry header) shadows the `header` variable from line 268 (column headers list). While functionally correct since the column indices are already extracted, this could cause confusion during maintenance.

♻️ Suggested rename for clarity

```diff
-header = "COSMIC:%s:%s:%s" % (snp.gene, snp.aa_mut, snp.mutation_type.replace(" ", ""))
+fasta_header = "COSMIC:%s:%s:%s" % (snp.gene, snp.aa_mut, snp.mutation_type.replace(" ", ""))
```

Then update subsequent references to `header` (lines 338-346) to use `fasta_header`.

pgatk/tests/test_clinvar/test_clinvar_service.py (1)
68-72: Test intent and expectation are inconsistent.

At lines 68-70, the name and docstring imply exclusion, but line 72 asserts the variant should pass. Please rename/reword for clarity.

🧹 Suggested cleanup

```diff
-    def test_compound_risk_factor_excluded(self):
-        """Benign with risk_factor — all components benign → excluded."""
+    def test_compound_risk_factor_passes(self):
+        """Benign with _risk_factor should pass because not all components are excluded."""
         exclude = ["Benign", "Likely_benign"]
-        # '_risk_factor' is not in exclude list, so it's not an excluded component
+        # '_risk_factor' is not in exclude list, so the compound value should pass
         assert ClinVarService.passes_clnsig_filter("Benign,_risk_factor", exclude) is True
```
Inline comments (actionable):

In docs/use-cases.md (lines 181-188): The recipe runs `pgatk ensembl-downloader` with `--skip_dna`, but later steps call `gffread`, which requires `genome.fa` and will fail. Either remove `--skip_dna` from the ensembl-downloader invocation, or add an explicit note and step telling the user where to obtain `genome.fa` and where to place it before running `gffread`.

In pgatk/clinvar/clinvar_service.py (lines 124-127): After `parts = re.split(r"[/,]", clnsig)`, an `all(...)` over an empty generator evaluates to True, so delimiter-only inputs (e.g., "///") are treated as excluded. First build a list of non-empty trimmed tokens (`tokens = [p.strip() for p in parts if p.strip()]`); if `tokens` is empty, return True (treat empty/missing CLNSIG as passing); otherwise evaluate the existing exclusion check against `tokens` and `exclude_list` as before.

In pgatk/toolbox/vcf_utils.py (lines 109-119): The code only matches when `var_pos` falls inside a single feature and only clips against that feature, so variants starting in introns or spanning multiple exons are handled incorrectly. Replace the single-feature check with an overlap-based approach: compute `var_end_genomic = var_pos + len(ref_allele) - 1`, collect all features where `feature[0] <= var_end_genomic and feature[1] >= var_pos` (i.e., any overlap), sum the exonic REF length across the overlapping features, derive `var_index_in_cds` from `nc_index` plus the offset of the first overlapping base, set `c = max(exonic_ref_len, 0)`, and build `alt_seq = ref_seq[0:var_index_in_cds] + var_allele + ref_seq[var_index_in_cds + c:]`.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- docs/index.md
- docs/pgatk-cli.md
- docs/use-cases.md
- mkdocs.yml
- pgatk/cgenomes/cgenomes_proteindb.py
- pgatk/clinvar/clinvar_service.py
- pgatk/clinvar/data_downloader.py
- pgatk/ensembl/ensembl.py
- pgatk/testdata/clinvar/mini_refseq_protein.faa
- pgatk/testdata/proteindb_from_custom_VCF.fa
- pgatk/testdata/proteindb_from_gnomad_VCF.fa
- pgatk/tests/test_clinvar/test_clinvar_service.py
- pgatk/tests/test_clinvar/test_data_downloader.py
- pgatk/toolbox/vcf_utils.py
💤 Files with no reviewable changes (1)
- pgatk/testdata/proteindb_from_gnomad_VCF.fa
Minor changes to the algorithms and bug fixes
Actionable comments posted: 3
♻️ Duplicate comments (1)
pgatk/toolbox/vcf_utils.py (1)

109-147: ⚠️ Potential issue | 🟠 Major — Overlap handling is still incomplete for intron-start and multi-exon spans.

Line 110 still gates on `var_pos` being inside one feature, so variants that start in introns but overlap exons are dropped. Also, line 117 computes exonic clipping from only the first matched feature, which under-deletes when REF spans multiple exons.

Proposed fix (overlap-based across all intersecting features)

```diff
-nc_index = 0
-for feature in features_info:
-    if var_pos in range(feature[0], feature[1] + 1):
-        var_index_in_cds = nc_index + (var_pos - feature[0])
-        # Clip the ref allele length to the portion that falls within
-        # this feature. When a multi-base variant extends from an exon
-        # into an intron, the intronic bases are absent from the
-        # transcript and must not be counted.
-        var_end_genomic = var_pos + len(ref_allele) - 1
-        exonic_ref_len = min(var_end_genomic, feature[1]) - var_pos + 1
-        c = max(exonic_ref_len, 0)
-        # Clip the ALT allele by the same number of trailing bases that
-        # were removed from REF. VCF variants are left-aligned, so the
-        # trailing bases correspond to the intronic portion.
-        intronic_ref_len = len(ref_allele) - c
-        if intronic_ref_len > 0:
-            exonic_alt_len = max(len(var_allele) - intronic_ref_len, 0)
-            var_allele = var_allele[:exonic_alt_len]
-            logger.debug(
-                "Variant at %d extends %d bases past exon boundary "
-                "(feature %d-%d); clipped REF to %d and ALT to %d bases",
-                var_pos, intronic_ref_len, feature[0], feature[1],
-                c, exonic_alt_len,
-            )
-        alt_seq = ref_seq[0:var_index_in_cds] + var_allele + ref_seq[var_index_in_cds + c::]
-        if strand == '-':
-            return ref_seq[::-1], alt_seq[::-1]
-        else:
-            return ref_seq, alt_seq
-    else:
-        # Log when variant starts outside this feature (intron-start)
-        var_end_genomic = var_pos + len(ref_allele) - 1
-        if var_end_genomic >= feature[0] and var_pos < feature[0]:
-            logger.debug(
-                "Variant at %d-%d starts before feature %d-%d "
-                "(intron-start); skipping this feature",
-                var_pos, var_end_genomic, feature[0], feature[1],
-            )
-
-    nc_index += (feature[1] - feature[0] + 1)
+nc_index = 0
+var_end_genomic = var_pos + len(ref_allele) - 1
+var_index_in_cds = None
+exonic_ref_len = 0
+
+for feature in features_info:
+    feature_start, feature_end = feature[0], feature[1]
+    feature_len = feature_end - feature_start + 1
+
+    # no overlap with this feature
+    if var_end_genomic < feature_start or var_pos > feature_end:
+        nc_index += feature_len
+        continue
+
+    overlap_start = max(var_pos, feature_start)
+    overlap_end = min(var_end_genomic, feature_end)
+
+    if var_index_in_cds is None:
+        var_index_in_cds = nc_index + (overlap_start - feature_start)
+
+    exonic_ref_len += max(0, overlap_end - overlap_start + 1)
+    nc_index += feature_len
+
+if var_index_in_cds is None:
+    return ref_seq, alt_seq
+
+c = max(exonic_ref_len, 0)
+intronic_ref_len = len(ref_allele) - c
+if intronic_ref_len > 0:
+    exonic_alt_len = max(len(var_allele) - intronic_ref_len, 0)
+    var_allele = var_allele[:exonic_alt_len]
+
+alt_seq = ref_seq[:var_index_in_cds] + var_allele + ref_seq[var_index_in_cds + c:]
+if strand == '-':
+    return ref_seq[::-1], alt_seq[::-1]
+return ref_seq, alt_seq
```
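The overlap arithmetic at the heart of this fix can be checked in isolation. Below is a standalone sketch (a hypothetical `exonic_overlap` helper, not pgatk code) that computes the CDS index of the first overlapping base and the total exonic REF length across features, assuming 1-based inclusive `(start, end)` exon coordinates ordered along the transcript:

```python
def exonic_overlap(var_pos, ref_allele, features_info):
    """Return (var_index_in_cds, exonic_ref_len) for a variant vs. exon features.

    Standalone sketch of the review's suggestion. features_info is a list of
    (start, end) pairs, 1-based inclusive, ordered along the transcript.
    var_index_in_cds is None when the variant touches no exon.
    """
    var_end = var_pos + len(ref_allele) - 1
    nc_index = 0            # cumulative CDS length of features seen so far
    var_index_in_cds = None
    exonic_ref_len = 0
    for start, end in features_info:
        if not (var_end < start or var_pos > end):  # any overlap with this exon
            overlap_start = max(var_pos, start)
            overlap_end = min(var_end, end)
            if var_index_in_cds is None:
                # 0-based CDS position of the first overlapping base
                var_index_in_cds = nc_index + (overlap_start - start)
            exonic_ref_len += overlap_end - overlap_start + 1
        nc_index += end - start + 1
    return var_index_in_cds, exonic_ref_len

# REF of length 10 starting at position 98; exon1 = 1-100, exon2 = 201-300.
# Only positions 98-100 (3 bases) are exonic; 101-107 fall in the intron.
assert exonic_overlap(98, "A" * 10, [(1, 100), (201, 300)]) == (97, 3)

# Variant starting in the intron (pos 150) but reaching into exon2:
assert exonic_overlap(150, "A" * 60, [(1, 100), (201, 300)]) == (100, 9)
```

The second assertion covers exactly the intron-start case the old single-feature check dropped: the first overlapping base is the start of exon2, so the CDS index equals exon1's length (100) and only the 9 bases inside exon2 count toward the exonic REF length.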
🧹 Nitpick comments (1)
pgatk/tests/test_vcf_utils.py (1)

113-160: Good additions; consider adding intron-start and multi-exon regression tests.

These new tests validate single-exon clipping well, but they don't cover variants that start before an exon or a REF that spans multiple exons. Adding those would protect the boundary logic more comprehensively.
Inline comments (actionable):

In docs/use-cases.md (lines 9-15): Change the compound modifiers "Cell-Type Specific" and "cell-type specific" to the hyphenated form "cell-type-specific" in the heading "Cell-Type Specific Non-Canonical Peptide Discovery" and in the workflow description that references "cell-type specific databases".

In docs/use-cases.md (lines 613-622): The tip admonition uses an indented code block; replace it with a fenced code block so markdownlint (MD046) is satisfied. In the admonition containing the `pgatk digest-mutant-protein` example, wrap the shell snippet in a triple-backtick `bash` fence, keeping the same content and indentation inside the fence.

In pgatk/tests/test_clinvar/test_clinvar_service.py (lines 68-73): The test name and docstring claim the compound case with '_risk_factor' is excluded, but the assertion expects it to pass. Either rename `test_compound_risk_factor_excluded` to `test_compound_risk_factor_not_excluded` (or similar) and reword the docstring accordingly, or, if exclusion was intended, change the assertion on `ClinVarService.passes_clnsig_filter` to expect False.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- docs/plans/2026-03-01-pgatk-rename-refactoring-design.md
- docs/plans/2026-03-01-pgatk-rename-refactoring-plan.md
- docs/plans/2026-03-01-phase0-infrastructure.md
- docs/plans/2026-03-02-clinvar-ncbi-implementation.md
- docs/plans/2026-03-02-clinvar-ncbi-support-design.md
- docs/plans/2026-03-03-clinvar-pipeline-hardening.md
- docs/use-cases.md
- pgatk/clinvar/clinvar_service.py
- pgatk/tests/test_clinvar/test_clinvar_service.py
- pgatk/tests/test_vcf_utils.py
- pgatk/toolbox/vcf_utils.py
💤 Files with no reviewable changes (6)
- docs/plans/2026-03-01-pgatk-rename-refactoring-plan.md
- docs/plans/2026-03-01-phase0-infrastructure.md
- docs/plans/2026-03-02-clinvar-ncbi-implementation.md
- docs/plans/2026-03-02-clinvar-ncbi-support-design.md
- docs/plans/2026-03-01-pgatk-rename-refactoring-design.md
- docs/plans/2026-03-03-clinvar-pipeline-hardening.md
docs/use-cases.md:

```markdown
## Cell-Type Specific Non-Canonical Peptide Discovery

!!! abstract "Featured workflow"
    This workflow reproduces the analysis from [Umer et al., *Bioinformatics* 2022](https://doi.org/10.1093/bioinformatics/btab838),
    where cell-type specific databases were generated for 65 cell lines,
    revealing non-canonical and cryptic peptides amounting to **>5% of total
    peptide identifications**.
```

Hyphenate compound modifiers consistently.

Please update terms like "Cell-Type Specific" / "cell-type specific" to "cell-type-specific" in lines 9 and 13 for standard compound-adjective usage.