Skip to content

minor changes to the algorithms and bugs#95

Merged
ypriverol merged 1 commit into
devfrom
refactor/rename
Mar 3, 2026
Merged

minor changes to the algorithms and bugs#95
ypriverol merged 1 commit into
devfrom
refactor/rename

Conversation

@ypriverol
Copy link
Copy Markdown
Member

No description provided.

@ypriverol ypriverol merged commit eddcc45 into dev Mar 3, 2026
3 of 4 checks passed
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 3, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch refactor/rename

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Copy link
Copy Markdown

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Review Summary by Qodo

Fix ALT clipping for variants extending past exons and improve edge case handling

🐞 Bug fix 🧪 Tests ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Fix ALT allele clipping when REF extends past exon boundaries
  - Properly calculate intronic portion and clip ALT accordingly
  - Add debug logging for variant boundary conditions
• Add comprehensive test cases for multibase REF clipping scenarios
• Fix CLNSIG filter to handle delimiter-only values correctly
• Update documentation to reflect genome FASTA requirement
Diagram
flowchart LR
  A["Variant extends past exon"] --> B["Calculate intronic REF length"]
  B --> C["Clip ALT by intronic bases"]
  C --> D["Return clipped sequences"]
  E["CLNSIG with delimiters only"] --> F["Filter empty components"]
  F --> G["Return pass result"]
Loading

Grey Divider

File Changes

1. pgatk/toolbox/vcf_utils.py 🐞 Bug fix +22/-0

Fix ALT clipping for variants extending past exons

pgatk/toolbox/vcf_utils.py


2. pgatk/tests/test_vcf_utils.py 🧪 Tests +48/-0

Add tests for multibase REF clipping edge cases

pgatk/tests/test_vcf_utils.py


3. pgatk/clinvar/clinvar_service.py 🐞 Bug fix +5/-2

Fix CLNSIG filter for delimiter-only values

pgatk/clinvar/clinvar_service.py


View more (8)
4. pgatk/tests/test_clinvar/test_clinvar_service.py 🧪 Tests +7/-0

Add test for delimiter-only CLNSIG handling

pgatk/tests/test_clinvar/test_clinvar_service.py


5. docs/use-cases.md 📝 Documentation +5/-6

Update documentation for genome FASTA requirement

docs/use-cases.md


6. docs/plans/2026-03-01-pgatk-rename-refactoring-design.md Additional files +0/-127

...

docs/plans/2026-03-01-pgatk-rename-refactoring-design.md


7. docs/plans/2026-03-01-pgatk-rename-refactoring-plan.md Additional files +0/-817

...

docs/plans/2026-03-01-pgatk-rename-refactoring-plan.md


8. docs/plans/2026-03-01-phase0-infrastructure.md Additional files +0/-1199

...

docs/plans/2026-03-01-phase0-infrastructure.md


9. docs/plans/2026-03-02-clinvar-ncbi-implementation.md Additional files +0/-1652

...

docs/plans/2026-03-02-clinvar-ncbi-implementation.md


10. docs/plans/2026-03-02-clinvar-ncbi-support-design.md Additional files +0/-208

...

docs/plans/2026-03-02-clinvar-ncbi-support-design.md


11. docs/plans/2026-03-03-clinvar-pipeline-hardening.md Additional files +0/-618

...

docs/plans/2026-03-03-clinvar-pipeline-hardening.md


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Mar 3, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Insertion past exon end 🐞 Bug ✓ Correctness
Description
get_altseq() clips ALT only when REF extends past the exon boundary; insertions at the last exonic
base (ALT longer than REF) can still incorrectly retain inserted intronic bases in the transcript
sequence. This can create false transcript/protein changes for boundary insertions.
Code

pgatk/toolbox/vcf_utils.py[R119-126]

+            # Clip the ALT allele by the same number of trailing bases that
+            # were removed from REF.  VCF variants are left-aligned, so the
+            # trailing bases correspond to the intronic portion.
+            intronic_ref_len = len(ref_allele) - c
+            if intronic_ref_len > 0:
+                exonic_alt_len = max(len(var_allele) - intronic_ref_len, 0)
+                var_allele = var_allele[:exonic_alt_len]
+                logger.debug(
Evidence
The new clipping is gated by intronic_ref_len = len(ref_allele) - c, so if REF is fully exonic
(intronic_ref_len == 0) no ALT clipping occurs even when ALT is longer than REF (insertion case).
The repo’s tests describe insertions as REF='G', ALT='GAA' (bases added after the anchor at POS),
so if POS is the exon end, the added bases are beyond the exon and should not appear in the
exon-only transcript sequence.

pgatk/toolbox/vcf_utils.py[106-136]
pgatk/toolbox/vcf_utils.py[112-115]
pgatk/tests/test_ensembl_core.py[131-148]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`get_altseq()` currently clips ALT only when REF extends beyond the exon end. Insertions at the last exonic base (ALT longer than REF) can therefore introduce intronic inserted bases into the exon-only transcript sequence.

### Issue Context
In VCF, insertions are represented with an anchor base (REF length 1) and ALT = anchor + inserted bases. If the anchor base is the last base of an exon, the inserted bases occur after the exon boundary and should not appear in the spliced transcript sequence.

### Fix Focus Areas
- pgatk/toolbox/vcf_utils.py[116-136]
- pgatk/tests/test_vcf_utils.py[66-160]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Docs genome.fa not created 🐞 Bug ✓ Correctness
Description
Docs now state the genome FASTA is downloaded/required, but the examples still reference
ensembl_*/genome.fa, which the downloader does not create. Users following the docs will hit a
file-not-found when running gffread -g .../genome.fa.
Code

docs/use-cases.md[R189-193]

+This downloads the gene annotation GTF, VCF file with known variants, and the
+genome FASTA (needed by gffread to extract transcript sequences).

### Step 2 -- Generate transcript sequences
Evidence
The docs instruct gffread to use ensembl_human/genome.fa. The downloader actually writes a
species/assembly-specific filename *.dna_sm.toplevel.fa(.gz) and the download helper only strips
.gz (no rename to genome.fa). Therefore genome.fa won’t exist unless the user manually
renames/symlinks it.

docs/use-cases.md[189-200]
pgatk/ensembl/data_downloader.py[461-483]
pgatk/toolbox/general.py[237-244]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The docs instruct using `ensembl_*/genome.fa` with `gffread`, but the downloader does not produce a file with that name; it produces a species/assembly-specific `*.dna_sm.toplevel.fa`.

### Issue Context
This PR removes `--skip_dna` in examples and explicitly states the genome FASTA is downloaded/needed, increasing the likelihood users follow the `gffread -g .../genome.fa` command and fail.

### Fix Focus Areas
- docs/use-cases.md[176-205]
- pgatk/ensembl/data_downloader.py[461-483]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. Empty ALT handling brittle 🐞 Bug ⛯ Reliability
Description
The new clipping can reduce var_allele to length 0, producing an ALT sequence that is “empty” but
not consistently represented/checked across pipelines. Current callers use string-equality checks
(== "" / != ""), which is a fragile emptiness contract when get_altseq() may return Bio.Seq
objects.
Code

pgatk/toolbox/vcf_utils.py[R122-136]

+            intronic_ref_len = len(ref_allele) - c
+            if intronic_ref_len > 0:
+                exonic_alt_len = max(len(var_allele) - intronic_ref_len, 0)
+                var_allele = var_allele[:exonic_alt_len]
+                logger.debug(
+                    "Variant at %d extends %d bases past exon boundary "
+                    "(feature %d-%d); clipped REF to %d and ALT to %d bases",
+                    var_pos, intronic_ref_len, feature[0], feature[1],
+                    c, exonic_alt_len,
+                )
            alt_seq = ref_seq[0:var_index_in_cds] + var_allele + ref_seq[var_index_in_cds + c::]
            if strand == '-':
                return ref_seq[::-1], alt_seq[::-1]
            else:
                return ref_seq, alt_seq
Evidence
The PR introduces ALT clipping that can set exonic_alt_len to 0 and slice ALT down to empty; this
flows into alt_seq = ... + var_allele + .... Downstream code in both ClinVar and Ensembl pipelines
uses string equality checks to decide whether to continue, which is brittle when get_altseq() can
return sequence objects (and can now more plausibly return an empty one).

pgatk/toolbox/vcf_utils.py[122-136]
pgatk/clinvar/clinvar_service.py[527-540]
pgatk/ensembl/ensembl.py[650-656]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
ALT clipping can yield an empty ALT allele/sequence, but downstream logic relies on comparing `coding_alt_seq` to the empty string. With `get_altseq()` returning Bio.Seq-derived objects in many paths, this emptiness check is brittle.

### Issue Context
The PR added ALT clipping (`var_allele = var_allele[:exonic_alt_len]`) which can produce a zero-length ALT. Callers in Ensembl and ClinVar pipelines gate translation using `== ""` / `!= ""`.

### Fix Focus Areas
- pgatk/toolbox/vcf_utils.py[122-149]
- pgatk/clinvar/clinvar_service.py[538-540]
- pgatk/ensembl/ensembl.py[655-656]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment on lines +119 to +126
# Clip the ALT allele by the same number of trailing bases that
# were removed from REF. VCF variants are left-aligned, so the
# trailing bases correspond to the intronic portion.
intronic_ref_len = len(ref_allele) - c
if intronic_ref_len > 0:
exonic_alt_len = max(len(var_allele) - intronic_ref_len, 0)
var_allele = var_allele[:exonic_alt_len]
logger.debug(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Insertion past exon end 🐞 Bug ✓ Correctness

get_altseq() clips ALT only when REF extends past the exon boundary; insertions at the last exonic
base (ALT longer than REF) can still incorrectly retain inserted intronic bases in the transcript
sequence. This can create false transcript/protein changes for boundary insertions.
Agent Prompt
### Issue description
`get_altseq()` currently clips ALT only when REF extends beyond the exon end. Insertions at the last exonic base (ALT longer than REF) can therefore introduce intronic inserted bases into the exon-only transcript sequence.

### Issue Context
In VCF, insertions are represented with an anchor base (REF length 1) and ALT = anchor + inserted bases. If the anchor base is the last base of an exon, the inserted bases occur after the exon boundary and should not appear in the spliced transcript sequence.

### Fix Focus Areas
- pgatk/toolbox/vcf_utils.py[116-136]
- pgatk/tests/test_vcf_utils.py[66-160]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment thread docs/use-cases.md
Comment on lines +189 to 193
This downloads the gene annotation GTF, VCF file with known variants, and the
genome FASTA (needed by gffread to extract transcript sequences).

### Step 2 -- Generate transcript sequences

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

2. Docs genome.fa not created 🐞 Bug ✓ Correctness

Docs now state the genome FASTA is downloaded/required, but the examples still reference
ensembl_*/genome.fa, which the downloader does not create. Users following the docs will hit a
file-not-found when running gffread -g .../genome.fa.
Agent Prompt
### Issue description
The docs instruct using `ensembl_*/genome.fa` with `gffread`, but the downloader does not produce a file with that name; it produces a species/assembly-specific `*.dna_sm.toplevel.fa`.

### Issue Context
This PR removes `--skip_dna` in examples and explicitly states the genome FASTA is downloaded/needed, increasing the likelihood users follow the `gffread -g .../genome.fa` command and fail.

### Fix Focus Areas
- docs/use-cases.md[176-205]
- pgatk/ensembl/data_downloader.py[461-483]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@ypriverol ypriverol deleted the refactor/rename branch March 3, 2026 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant