minor changes to the algorithms and bugs #95
Review Summary by Qodo
Fix ALT clipping for variants extending past exons and improve edge case handling
Walkthroughs
Description
• Fix ALT allele clipping when REF extends past exon boundaries
  - Properly calculate intronic portion and clip ALT accordingly
  - Add debug logging for variant boundary conditions
• Add comprehensive test cases for multibase REF clipping scenarios
• Fix CLNSIG filter to handle delimiter-only values correctly
• Update documentation to reflect genome FASTA requirement
Diagram
flowchart LR
A["Variant extends past exon"] --> B["Calculate intronic REF length"]
B --> C["Clip ALT by intronic bases"]
C --> D["Return clipped sequences"]
E["CLNSIG with delimiters only"] --> F["Filter empty components"]
F --> G["Return pass result"]
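The CLNSIG branch of the diagram can be sketched roughly as follows. This is an illustration only, not the code from this PR: the helper names `clnsig_components` and `passes_clnsig_filter`, the delimiter set (`|`, `/`, `,`), and the "exclude if any component matches" rule are all assumptions.

```python
import re

# Assumed delimiter set; the PR's actual delimiters may differ.
_CLNSIG_DELIMITERS = re.compile(r"[|/,]")


def clnsig_components(value: str) -> list[str]:
    """Split a CLNSIG value on the assumed delimiters and drop empty parts."""
    parts = _CLNSIG_DELIMITERS.split(value)
    return [part.strip().lower() for part in parts if part.strip()]


def passes_clnsig_filter(value: str, excluded: set[str]) -> bool:
    """Hypothetical filter: True means the variant is kept."""
    components = clnsig_components(value)
    if not components:
        # Delimiter-only or empty value: no component can match, so pass.
        return True
    return not any(component in excluded for component in components)
```

With this shape, a value such as `"|"` yields no components and passes, instead of comparing an empty string against the excluded terms.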
File Changes
1. pgatk/toolbox/vcf_utils.py
Code Review by Qodo
1. Insertion past exon end
# Clip the ALT allele by the same number of trailing bases that
# were removed from REF. VCF variants are left-aligned, so the
# trailing bases correspond to the intronic portion.
intronic_ref_len = len(ref_allele) - c
if intronic_ref_len > 0:
    exonic_alt_len = max(len(var_allele) - intronic_ref_len, 0)
    var_allele = var_allele[:exonic_alt_len]
    logger.debug(
1. Insertion past exon end 🐞 Bug ✓ Correctness
get_altseq() clips ALT only when REF extends past the exon boundary; insertions at the last exonic base (ALT longer than REF) can still incorrectly retain inserted intronic bases in the transcript sequence. This can create false transcript/protein changes for boundary insertions.
Agent Prompt
### Issue description
`get_altseq()` currently clips ALT only when REF extends beyond the exon end. Insertions at the last exonic base (ALT longer than REF) can therefore introduce intronic inserted bases into the exon-only transcript sequence.
### Issue Context
In VCF, insertions are represented with an anchor base (REF length 1) and ALT = anchor + inserted bases. If the anchor base is the last base of an exon, the inserted bases occur after the exon boundary and should not appear in the spliced transcript sequence.
### Fix Focus Areas
- pgatk/toolbox/vcf_utils.py[116-136]
- pgatk/tests/test_vcf_utils.py[66-160]
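One possible reading of this suggestion, as a minimal sketch rather than the actual `get_altseq()` implementation: the function name `clip_alleles_at_exon_end`, its parameters, and the choice of 1-based inclusive forward-strand coordinates are all hypothetical. The first branch mirrors the trailing-base rule from the diff; the second adds the boundary-insertion case described above.

```python
def clip_alleles_at_exon_end(
    ref: str, alt: str, var_start: int, exon_end: int
) -> tuple[str, str]:
    """Clip REF/ALT so neither extends past exon_end (hypothetical helper).

    Coordinates are assumed to be 1-based and inclusive, forward strand.
    """
    # Number of REF bases that lie inside the exon.
    exonic_ref_len = min(len(ref), max(exon_end - var_start + 1, 0))
    intronic_ref_len = len(ref) - exonic_ref_len

    if intronic_ref_len > 0:
        # Same rule as the diff: drop as many trailing ALT bases
        # as were dropped from REF.
        exonic_alt_len = max(len(alt) - intronic_ref_len, 0)
        return ref[:exonic_ref_len], alt[:exonic_alt_len]

    ref_end = var_start + len(ref) - 1
    if ref_end == exon_end and len(alt) > len(ref):
        # Boundary insertion: REF ends on the last exonic base, so ALT
        # bases beyond len(ref) would fall after the exon end.
        return ref, alt[: len(ref)]

    return ref, alt
```

For an insertion anchored on the last exonic base (`ref="A"`, `alt="ATTT"`, `var_start == exon_end`), this keeps only the anchor base, while the same insertion in the middle of an exon is left untouched. Whether inserted bases at the boundary should always be treated as intronic is a design decision for the maintainers.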
This downloads the gene annotation GTF, VCF file with known variants, and the
genome FASTA (needed by gffread to extract transcript sequences).

### Step 2 -- Generate transcript sequences
2. Docs genome.fa not created 🐞 Bug ✓ Correctness
Docs now state the genome FASTA is downloaded/required, but the examples still reference ensembl_*/genome.fa, which the downloader does not create. Users following the docs will hit a file-not-found when running gffread -g .../genome.fa.
Agent Prompt
### Issue description
The docs instruct using `ensembl_*/genome.fa` with `gffread`, but the downloader does not produce a file with that name; it produces a species/assembly-specific `*.dna_sm.toplevel.fa`.
### Issue Context
This PR removes `--skip_dna` in examples and explicitly states the genome FASTA is downloaded/needed, increasing the likelihood users follow the `gffread -g .../genome.fa` command and fail.
### Fix Focus Areas
- docs/use-cases.md[176-205]
- pgatk/ensembl/data_downloader.py[461-483]
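Until the docs and the downloader agree on a file name, a small helper could locate the downloaded FASTA instead of hard-coding `genome.fa`. This is a sketch under assumptions: `find_genome_fasta` and `gffread_command` are hypothetical names, and the glob pattern assumes an uncompressed `*.dna_sm.toplevel.fa` sits directly in the download directory. The `-w` and `-g` options are standard gffread flags.

```python
from pathlib import Path


def find_genome_fasta(download_dir: str) -> Path:
    """Return the single *.dna_sm.toplevel.fa file in download_dir."""
    matches = sorted(Path(download_dir).glob("*.dna_sm.toplevel.fa"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one *.dna_sm.toplevel.fa in {download_dir}, "
            f"found {len(matches)}"
        )
    return matches[0]


def gffread_command(genome_fasta: Path, gtf: Path, out_fasta: Path) -> list[str]:
    """Build the gffread argument list without running it."""
    # -w writes spliced transcript sequences; -g names the genome FASTA.
    return ["gffread", "-w", str(out_fasta), "-g", str(genome_fasta), str(gtf)]
```

The returned list can be passed to `subprocess.run(..., check=True)`. If the downloader leaves the FASTA gzipped, it would need to be decompressed first, which this sketch does not handle.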
No description provided.