Improve Protein Variant Transformation by folded · Pull Request #3 · folded/hgvs-weaver

folded · 2026-03-01T22:45:36Z

This PR introduces comprehensive improvements to variant handling (especially protein variants), refactors core mapping logic for better type safety, and addresses
several critical bugs in normalization. It also adds infrastructure for large-scale validation and architectural tracking.

Key Features

Protein Variant Transformation: Added VariantTransformSettings and StartCodonConvention to control protein notation. The Variant.transform(settings) method now allows converting specific predictions (e.g., p.(Met1Val)) into standard HGVS notations (e.g., p.Met1?) via the HgvsQuestion convention.
Improved Equivalence: Enhanced logic to treat p.Met1? as equivalent to p.Met1Xaa, ensuring consistent identity across different notations.
Protein-to-cDNA Mapping: Added mapper.p_to_c to support back-conversion of protein substitutions to cDNA.
Enhanced Test Coverage: Added new integration tests for intronic variants (tests/test_intronic_cp.py), p_unknown cases (tests/test_p_unknown.py), and transformation logic (tests/test_transform.py).

Bug Fixes

Normalization Stability: Fixed a bug in normalize_variant where 3'-shifts crossing transcript boundaries could result in invalid c.0 positions.
Mapping Correctness: Fixed incorrect reference/target advancement in CigarMapper and corrected unsafe i32-as-usize casts that caused silent failures on negative coordinates.
Entrypoint Cleanup: Removed the incorrect Variant.to_spdi (which lacked strand and CDS-relative context) in favor of the robust VariantMapper.to_spdi.
AltSeq Generation: Fixed altseq generation for Repeat variants and added guards for N-terminal insertions.

Refactoring and Quality

Strand Type Safety: Replaced raw i32 strand indicators with a Strand enum (Plus, Minus) to improve code clarity and leverage exhaustive matching.
Normalization Dispatch: Refactored normalize_variant into discrete, type-specific methods (e.g., normalize_coding_variant) to improve maintainability.
DRY Improvements: Extracted over 10 repeated patterns into reusable helpers for sequence extraction, strand complementation, and position construction.
Documentation: Added CODE_REVIEW.md to track architectural findings, open issues, and future type-system opportunities.

Tools and Infrastructure

Normalization Validation: Added scripts/validate_normalization.py for large-scale verification of normalization logic.
PyO3 Stubs: Finalized and updated Python type stubs (_weaver.pyi) to reflect the new APIs and settings.

- Add VariantTransformSettings / StartCodonConvention for post-mapping protein notation control: Variant.transform(settings) converts p.(Met1Val) -> p.Met1? under HgvsQuestion convention while keeping specific predictions as the internal default (option b architecture) - Fix p.Met1? ≡ p.Met1Xaa equivalence: AaEdit::Special{"?"} now projects to ResidueToken::Any in analogous_edit.rs, and normalize_format maps '?' -> 'X' (same as Xaa) so both notations compare equal - Add mapper.p_to_c for back-converting protein substitutions to cDNA Code quality improvements: Remove the broken Variant.to_spdi entrypoint (wrong position for c./n. variants due to CDS-relative vs transcript-relative mismatch; no strand complementation). VariantMapper.to_spdi is the sole correct entrypoint. Fix unsafe i32-as-usize casts in get_c_indices and p_to_c that silently wrapped on negative TranscriptPos/ProteinPos values, producing garbage sequence indices. Fix normalize_variant Coding arm: direct arithmetic on HgvsTranscriptPos produced the invalid position c.0 when a 3'-shift crossed the 5'UTR/CDS boundary. Positions are now re-derived via TranscriptMapper::n_to_c. Refactor: extract n_to_c_position() helper to deduplicate the repeated pattern of converting a 0-based n. index to a BaseOffsetPosition. Remove get_n_indices() (identical to get_c_indices()). Replace redundant get_transcript() calls in the shift and Ins->Dup blocks with dyn_clone. - transcript_mapper: sort exons in transcript order per strand build CigarMapper per exon and consult it in g_to_n/n_to_g separate best_dist/best_offset in intronic search - mapper: add checked_usize helper and apply throughout); implement cyclic-rotation comparison for multi-base ins shift; g_to_c_all now propagates errors when all mappings fail - altseq: use max (not min) for Repeat alt-seq generation - altseq_to_hgvsp: guard N-terminal insertion (start_idx=0) - cigar: fix CigarMapper ref/tgt advancement ops (D advances ref, I advances tgt); update test expectations - data: clarify c_to_g doc comment refactor: eliminate 10 DRY violations across hgvs-weaver Extract repeated logic into focused helpers: - utils: translate_cds delegates to translate_codon; aa3_chunk_to_residue extracted from decompose_aa - analogous_edit: push_dna_tokens / push_aa_tokens replace 7 inline token-push patterns - equivalence: strand_aware_edit free fn (4 sites); normalize_ins_to_dup_boi collapses identical Coding/NonCoding arms - mapper: extract_edit_sequences (shift_3_prime + shift_5_prime); apply_strand_complement (6 sites); make_base_offset_position + make_simple_position; ins_anchor_and_end (2 sites); update_del_dup_ref method (3 sites) refactor: split normalize_variant into per-variant-type methods Extract normalize_coding_variant, normalize_genomic_variant, and normalize_noncoding_variant from the monolithic normalize_variant, which is now a clean 10-line dispatch. Each arm can now be read and tested in isolation without constructing the full SequenceVariant enum.

… warnings

gemini-code-assist · 2026-03-01T22:46:51Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the hgvs-weaver library's capabilities for handling and normalizing HGVS variants, particularly focusing on protein variants. It introduces new mechanisms for transforming protein notations, enables back-conversion from protein to cDNA, and refines the core variant normalization and equivalence logic. Several critical bugs related to coordinate mapping and sequence generation have been resolved, and the codebase benefits from improved type safety and maintainability. The addition of extensive validation infrastructure ensures the robustness and accuracy of these changes.

Highlights

Protein Variant Transformation: Introduced VariantTransformSettings and StartCodonConvention to control protein notation, allowing conversion of specific predictions (e.g., p.(Met1Val)) to standard HGVS p.Met1? notation.
Improved Variant Equivalence: Enhanced logic to treat p.Met1? as equivalent to p.Met1Xaa and refactored sequence tokenization for more robust comparison.
Protein-to-cDNA Mapping (p_to_c): Added functionality to back-convert protein substitutions to cDNA variants, including handling of ambiguous codon changes.
Enhanced Normalization Stability: Fixed bugs in normalize_variant related to 3'-shifts crossing transcript boundaries and improved Ins to Dup conversion logic.
Refactored Type Safety: Replaced raw i32 strand indicators with a Strand enum for improved code clarity and exhaustiveness in data models.
CIGAR Mapping Corrections: Addressed incorrect reference/target advancement in CigarMapper and corrected unsafe integer casts that caused silent failures.
Comprehensive Test Coverage & Validation: Added new integration tests for intronic variants, p_unknown cases, and transformation logic, alongside a new large-scale normalization validation script.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

CODE_REVIEW.md
- Added a new document outlining architectural findings, open issues, and future type system opportunities.
commits.txt
- Removed an outdated commit log file.
hgvs-weaver/src/altseq.rs
- Updated repeat variant altseq generation to use the max repeat count.
hgvs-weaver/src/altseq_to_hgvsp.rs
- Added a guard to prevent unsupported N-terminal protein insertions and included a corresponding test.
hgvs-weaver/src/analogous_edit.rs
- Refactored sequence tokenization helpers and introduced ResidueToken::Any for p.Met1? equivalence.
- Implemented logic to reconcile p.Met1? with p.Met1Xaa in equivalence checks.
hgvs-weaver/src/cigar.rs
- Corrected the logic for advancing reference and target positions in CigarMapper for D (deletion) and I (insertion) operations.
hgvs-weaver/src/coords.rs
- Removed the to_spdi method from the Variant trait, centralizing SPDI generation in VariantMapper.
hgvs-weaver/src/data.rs
- Implemented a new Strand enum to replace i32 for representing strand information, enhancing type safety and clarity.
hgvs-weaver/src/equivalence.rs
- Updated variant equivalence logic to use the new Strand enum.
- Refactored normalization to duplication into a dedicated helper function.
- Modified protein format normalization to treat ? as X for equivalence.
- Corrected get_n_indices to get_c_indices for non-coding/RNA variants.
hgvs-weaver/src/lib.rs
- Exposed new transform module, StartCodonConvention, VariantTransformSettings, and transform_variant for Python bindings.
hgvs-weaver/src/mapper.rs
- Implemented p_to_c for protein to cDNA back-conversion, including handling of ambiguous codons.
- Refactored normalize_variant into type-specific functions (normalize_coding_variant, normalize_genomic_variant, normalize_noncoding_variant).
- Added logic to c_to_p to return p.? for intronic or 5'UTR variants.
- Corrected 3'-shifting logic for multi-base insertions.
- Added validation for negative genomic positions and improved g_to_c_all error handling.
hgvs-weaver/src/structs.rs
- Removed to_spdi related traits and implementations for protein variants, aligning with the centralization of SPDI generation.
hgvs-weaver/src/transcript_mapper.rs
- Enhanced TranscriptMapper to correctly sort exons for minus-strand transcripts and integrate CIGAR mappers for accurate coordinate transformations.
hgvs-weaver/src/transform.rs
- Added new module defining StartCodonConvention and VariantTransformSettings to manage protein variant display, specifically for p.Met1? notation.
hgvs-weaver/src/utils.rs
- Refactored codon translation logic into helper functions (translate_codon, codons_for_aa) and introduced aa3_chunk_to_residue.
hgvs-weaver/tests/analogous_test.rs
- Updated tests to reflect Strand enum usage.
hgvs-weaver/tests/equivalence_test.rs
- Updated tests to reflect Strand enum usage.
hgvs-weaver/tests/mapping_test.rs
- Updated tests to reflect Strand enum usage and added new c_to_p tests.
hgvs-weaver/tests/normalization_test.rs
- Updated tests to reflect Strand enum usage.
hgvs-weaver/tests/regression_test.rs
- Updated tests to reflect Strand enum usage.
hgvs-weaver/tests/test_regressions.rs
- Updated tests to reflect Strand enum usage.
hgvs-weaver/tests/test_shift.rs
- Updated tests to reflect Strand enum usage and added a new test for multi-base insertion 3'-shifting.
hgvs-weaver/tests/toy_data_test.rs
- Added new tests for c_to_p missense variants.
scripts/validate_normalization.py
- Added a new Python script for large-scale validation of normalization and equivalence, including VCF normalization and HGVS string comparisons.
src/bin/stub_gen.rs
- Modified Python stub generation to improve type hints, remove redundant __repr__/__str__ methods, and format output with ruff.
src/lib.rs
- Updated Python bindings to expose new error types, StartCodonConvention, VariantTransformSettings, and transform method for Variant objects.
tests/test_api.py
- Added tests for c_to_p missense variants.
tests/test_equivalence.py
- Added tests for c_vs_p and g_vs_p equivalence, and p.Met1? vs p.Met1Xaa equivalence.
tests/test_intronic_cp.py
- Added a new test for c_to_p handling of intronic variants (expecting p.?).
tests/test_mavehgvs_compliance.py
- Updated error handling in tests to catch the new HGVSError types.
tests/test_p_unknown.py
- Added a new test for parsing p.? and p.(=) protein variants.
tests/test_transform.py
- Added a new test file to verify the VariantTransformSettings and transform functionality.
weaver/init.py
- Updated the Python __init__.py to expose new classes and error types.
weaver/_weaver.pyi
- Updated the Python type stub file to reflect the new API, including VariantTransformSettings and new error classes.
weaver/cli/provider.py
- Enhanced RefSeqDataProvider to improve transcript lookup robustness, handle versionless accessions, and load MANE Select data.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a comprehensive set of improvements to the hgvs-weaver library, significantly enhancing its functionality, correctness, and maintainability. The changes include the introduction of configurable protein variant transformations, robust handling of start-codon and intronic variants, and a new protein-to-cDNA back-conversion feature.

Key bug fixes have been implemented, such as correcting the CIGAR mapping logic for reference/target advancement and ensuring type safety by replacing raw integer strand indicators with a Strand enum. The normalization logic has been substantially refactored for clarity and correctness, particularly in handling 3' shifts across transcript boundaries.

The codebase has been improved through extensive refactoring to reduce duplication and improve type safety. The Python API has also been enhanced with a better exception hierarchy and cleaner type stubs.

The addition of new integration tests and a large-scale validation script demonstrates a strong commitment to quality and correctness.

Overall, this is an excellent and well-executed pull request. The changes are thorough, well-tested, and significantly improve the quality of the codebase. I have reviewed the changes in detail and have not found any issues of medium or higher severity.

…n table

…porting

folded added 5 commits February 25, 2026 09:51

clean up pre-commit issues

5b232b4

chore: Update docstrings, finalize PyO3 generated stubs, and fix lint…

6cd4e73

… warnings

refactor: replace integer strand with Strand enum

22c573f

feat: add missing intronic and p_unknown integration tests

eda7ad2

gemini-code-assist bot reviewed Mar 1, 2026

View reviewed changes

folded added 13 commits March 2, 2026 10:20

address pre-commit issues

10c48fd

bump version

6cc84b2

fix: resolve subtraction overflow panic in shift_5_prime for insertions

7995799

feat: consolidate README table and graph update into weaver-analyze

0a67263

feat: merge protein identity and spdi columns into a single validatio…

57d36b4

…n table

style: fix lints in analyze.py

a7ed6ee

docs: update README with 0.4.0 validation results

4fb60f8

docs: update benchmark history for 0.3.0 and 0.4.0

1aee01b

bench: update validation results for 0.4.0 and fix analogous match re…

ba62ea0

…porting

refactor: remove redundant validation scripts

1a8f7da

format

c850bbf

format

011c1bd

lint cleanup

a35ab4c

folded merged commit e92b430 into main Mar 2, 2026
4 checks passed

folded deleted the feat/transform-settings branch March 2, 2026 02:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Protein Variant Transformation#3

Improve Protein Variant Transformation#3
folded merged 18 commits intomainfrom
feat/transform-settings

folded commented Mar 1, 2026

Uh oh!

gemini-code-assist bot commented Mar 1, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

folded commented Mar 1, 2026

Uh oh!

gemini-code-assist bot commented Mar 1, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant