Skip to content

Improve Protein Variant Transformation#3

Merged
folded merged 18 commits intomainfrom
feat/transform-settings
Mar 2, 2026
Merged

Improve Protein Variant Transformation#3
folded merged 18 commits intomainfrom
feat/transform-settings

Conversation

@folded
Copy link
Copy Markdown
Owner

@folded folded commented Mar 1, 2026

This PR introduces comprehensive improvements to variant handling (especially protein variants), refactors core mapping logic for better type safety, and addresses
several critical bugs in normalization. It also adds infrastructure for large-scale validation and architectural tracking.

Key Features

  • Protein Variant Transformation: Added VariantTransformSettings and StartCodonConvention to control protein notation. The Variant.transform(settings) method now allows converting specific predictions (e.g., p.(Met1Val)) into standard HGVS notations (e.g., p.Met1?) via the HgvsQuestion convention.
  • Improved Equivalence: Enhanced logic to treat p.Met1? as equivalent to p.Met1Xaa, ensuring consistent identity across different notations.
  • Protein-to-cDNA Mapping: Added mapper.p_to_c to support back-conversion of protein substitutions to cDNA.
  • Enhanced Test Coverage: Added new integration tests for intronic variants (tests/test_intronic_cp.py), p_unknown cases (tests/test_p_unknown.py), and transformation logic (tests/test_transform.py).

Bug Fixes

  • Normalization Stability: Fixed a bug in normalize_variant where 3'-shifts crossing transcript boundaries could result in invalid c.0 positions.
  • Mapping Correctness: Fixed incorrect reference/target advancement in CigarMapper and corrected unsafe i32-as-usize casts that caused silent failures on negative coordinates.
  • Entrypoint Cleanup: Removed the incorrect Variant.to_spdi (which lacked strand and CDS-relative context) in favor of the robust VariantMapper.to_spdi.
  • AltSeq Generation: Fixed altseq generation for Repeat variants and added guards for N-terminal insertions.

Refactoring and Quality

  • Strand Type Safety: Replaced raw i32 strand indicators with a Strand enum (Plus, Minus) to improve code clarity and leverage exhaustive matching.
  • Normalization Dispatch: Refactored normalize_variant into discrete, type-specific methods (e.g., normalize_coding_variant) to improve maintainability.
  • DRY Improvements: Extracted over 10 repeated patterns into reusable helpers for sequence extraction, strand complementation, and position construction.
  • Documentation: Added CODE_REVIEW.md to track architectural findings, open issues, and future type-system opportunities.

Tools and Infrastructure

  • Normalization Validation: Added scripts/validate_normalization.py for large-scale verification of normalization logic.
  • PyO3 Stubs: Finalized and updated Python type stubs (_weaver.pyi) to reflect the new APIs and settings.

- Add VariantTransformSettings / StartCodonConvention for post-mapping
  protein notation control: Variant.transform(settings) converts
  p.(Met1Val) -> p.Met1? under HgvsQuestion convention while keeping
  specific predictions as the internal default (option b architecture)

- Fix p.Met1? ≡ p.Met1Xaa equivalence: AaEdit::Special{"?"} now
  projects to ResidueToken::Any in analogous_edit.rs, and normalize_format
  maps '?' -> 'X' (same as Xaa) so both notations compare equal

- Add mapper.p_to_c for back-converting protein substitutions to cDNA

Code quality improvements:

Remove the broken Variant.to_spdi entrypoint (wrong position for c./n.
variants due to CDS-relative vs transcript-relative mismatch; no strand
complementation). VariantMapper.to_spdi is the sole correct entrypoint.

Fix unsafe i32-as-usize casts in get_c_indices and p_to_c that silently
wrapped on negative TranscriptPos/ProteinPos values, producing garbage
sequence indices.

Fix normalize_variant Coding arm: direct arithmetic on HgvsTranscriptPos
produced the invalid position c.0 when a 3'-shift crossed the 5'UTR/CDS
boundary. Positions are now re-derived via TranscriptMapper::n_to_c.

Refactor: extract n_to_c_position() helper to deduplicate the repeated
pattern of converting a 0-based n. index to a BaseOffsetPosition. Remove
get_n_indices() (identical to get_c_indices()). Replace redundant
get_transcript() calls in the shift and Ins->Dup blocks with dyn_clone.

- transcript_mapper: sort exons in transcript order per strand
  build CigarMapper per exon and consult it in g_to_n/n_to_g
  separate best_dist/best_offset in intronic search
- mapper: add checked_usize helper and apply throughout);
  implement cyclic-rotation comparison for multi-base ins shift;
  g_to_c_all now propagates errors when all mappings fail
- altseq: use max (not min) for Repeat alt-seq generation
- altseq_to_hgvsp: guard N-terminal insertion (start_idx=0)
- cigar: fix CigarMapper ref/tgt advancement ops (D advances ref,
  I advances tgt); update test expectations
- data: clarify c_to_g doc comment

refactor: eliminate 10 DRY violations across hgvs-weaver

Extract repeated logic into focused helpers:

- utils: translate_cds delegates to translate_codon; aa3_chunk_to_residue
  extracted from decompose_aa
- analogous_edit: push_dna_tokens / push_aa_tokens replace 7 inline
  token-push patterns
- equivalence: strand_aware_edit free fn (4 sites); normalize_ins_to_dup_boi
  collapses identical Coding/NonCoding arms
- mapper: extract_edit_sequences (shift_3_prime + shift_5_prime);
  apply_strand_complement (6 sites); make_base_offset_position +
  make_simple_position; ins_anchor_and_end (2 sites);
  update_del_dup_ref method (3 sites)

refactor: split normalize_variant into per-variant-type methods

Extract normalize_coding_variant, normalize_genomic_variant, and
normalize_noncoding_variant from the monolithic normalize_variant, which
is now a clean 10-line dispatch. Each arm can now be read and tested in
isolation without constructing the full SequenceVariant enum.
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the hgvs-weaver library's capabilities for handling and normalizing HGVS variants, particularly focusing on protein variants. It introduces new mechanisms for transforming protein notations, enables back-conversion from protein to cDNA, and refines the core variant normalization and equivalence logic. Several critical bugs related to coordinate mapping and sequence generation have been resolved, and the codebase benefits from improved type safety and maintainability. The addition of extensive validation infrastructure ensures the robustness and accuracy of these changes.

Highlights

  • Protein Variant Transformation: Introduced VariantTransformSettings and StartCodonConvention to control protein notation, allowing conversion of specific predictions (e.g., p.(Met1Val)) to standard HGVS p.Met1? notation.
  • Improved Variant Equivalence: Enhanced logic to treat p.Met1? as equivalent to p.Met1Xaa and refactored sequence tokenization for more robust comparison.
  • Protein-to-cDNA Mapping (p_to_c): Added functionality to back-convert protein substitutions to cDNA variants, including handling of ambiguous codon changes.
  • Enhanced Normalization Stability: Fixed bugs in normalize_variant related to 3'-shifts crossing transcript boundaries and improved Ins to Dup conversion logic.
  • Refactored Type Safety: Replaced raw i32 strand indicators with a Strand enum for improved code clarity and exhaustiveness in data models.
  • CIGAR Mapping Corrections: Addressed incorrect reference/target advancement in CigarMapper and corrected unsafe integer casts that caused silent failures.
  • Comprehensive Test Coverage & Validation: Added new integration tests for intronic variants, p_unknown cases, and transformation logic, alongside a new large-scale normalization validation script.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • CODE_REVIEW.md
    • Added a new document outlining architectural findings, open issues, and future type system opportunities.
  • commits.txt
    • Removed an outdated commit log file.
  • hgvs-weaver/src/altseq.rs
    • Updated repeat variant altseq generation to use the max repeat count.
  • hgvs-weaver/src/altseq_to_hgvsp.rs
    • Added a guard to prevent unsupported N-terminal protein insertions and included a corresponding test.
  • hgvs-weaver/src/analogous_edit.rs
    • Refactored sequence tokenization helpers and introduced ResidueToken::Any for p.Met1? equivalence.
    • Implemented logic to reconcile p.Met1? with p.Met1Xaa in equivalence checks.
  • hgvs-weaver/src/cigar.rs
    • Corrected the logic for advancing reference and target positions in CigarMapper for D (deletion) and I (insertion) operations.
  • hgvs-weaver/src/coords.rs
    • Removed the to_spdi method from the Variant trait, centralizing SPDI generation in VariantMapper.
  • hgvs-weaver/src/data.rs
    • Implemented a new Strand enum to replace i32 for representing strand information, enhancing type safety and clarity.
  • hgvs-weaver/src/equivalence.rs
    • Updated variant equivalence logic to use the new Strand enum.
    • Refactored normalization to duplication into a dedicated helper function.
    • Modified protein format normalization to treat ? as X for equivalence.
    • Corrected get_n_indices to get_c_indices for non-coding/RNA variants.
  • hgvs-weaver/src/lib.rs
    • Exposed new transform module, StartCodonConvention, VariantTransformSettings, and transform_variant for Python bindings.
  • hgvs-weaver/src/mapper.rs
    • Implemented p_to_c for protein to cDNA back-conversion, including handling of ambiguous codons.
    • Refactored normalize_variant into type-specific functions (normalize_coding_variant, normalize_genomic_variant, normalize_noncoding_variant).
    • Added logic to c_to_p to return p.? for intronic or 5'UTR variants.
    • Corrected 3'-shifting logic for multi-base insertions.
    • Added validation for negative genomic positions and improved g_to_c_all error handling.
  • hgvs-weaver/src/structs.rs
    • Removed to_spdi related traits and implementations for protein variants, aligning with the centralization of SPDI generation.
  • hgvs-weaver/src/transcript_mapper.rs
    • Enhanced TranscriptMapper to correctly sort exons for minus-strand transcripts and integrate CIGAR mappers for accurate coordinate transformations.
  • hgvs-weaver/src/transform.rs
    • Added new module defining StartCodonConvention and VariantTransformSettings to manage protein variant display, specifically for p.Met1? notation.
  • hgvs-weaver/src/utils.rs
    • Refactored codon translation logic into helper functions (translate_codon, codons_for_aa) and introduced aa3_chunk_to_residue.
  • hgvs-weaver/tests/analogous_test.rs
    • Updated tests to reflect Strand enum usage.
  • hgvs-weaver/tests/equivalence_test.rs
    • Updated tests to reflect Strand enum usage.
  • hgvs-weaver/tests/mapping_test.rs
    • Updated tests to reflect Strand enum usage and added new c_to_p tests.
  • hgvs-weaver/tests/normalization_test.rs
    • Updated tests to reflect Strand enum usage.
  • hgvs-weaver/tests/regression_test.rs
    • Updated tests to reflect Strand enum usage.
  • hgvs-weaver/tests/test_regressions.rs
    • Updated tests to reflect Strand enum usage.
  • hgvs-weaver/tests/test_shift.rs
    • Updated tests to reflect Strand enum usage and added a new test for multi-base insertion 3'-shifting.
  • hgvs-weaver/tests/toy_data_test.rs
    • Added new tests for c_to_p missense variants.
  • scripts/validate_normalization.py
    • Added a new Python script for large-scale validation of normalization and equivalence, including VCF normalization and HGVS string comparisons.
  • src/bin/stub_gen.rs
    • Modified Python stub generation to improve type hints, remove redundant __repr__/__str__ methods, and format output with ruff.
  • src/lib.rs
    • Updated Python bindings to expose new error types, StartCodonConvention, VariantTransformSettings, and transform method for Variant objects.
  • tests/test_api.py
    • Added tests for c_to_p missense variants.
  • tests/test_equivalence.py
    • Added tests for c_vs_p and g_vs_p equivalence, and p.Met1? vs p.Met1Xaa equivalence.
  • tests/test_intronic_cp.py
    • Added a new test for c_to_p handling of intronic variants (expecting p.?).
  • tests/test_mavehgvs_compliance.py
    • Updated error handling in tests to catch the new HGVSError types.
  • tests/test_p_unknown.py
    • Added a new test for parsing p.? and p.(=) protein variants.
  • tests/test_transform.py
    • Added a new test file to verify the VariantTransformSettings and transform functionality.
  • weaver/init.py
    • Updated the Python __init__.py to expose new classes and error types.
  • weaver/_weaver.pyi
    • Updated the Python type stub file to reflect the new API, including VariantTransformSettings and new error classes.
  • weaver/cli/provider.py
    • Enhanced RefSeqDataProvider to improve transcript lookup robustness, handle versionless accessions, and load MANE Select data.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive set of improvements to the hgvs-weaver library, significantly enhancing its functionality, correctness, and maintainability. The changes include the introduction of configurable protein variant transformations, robust handling of start-codon and intronic variants, and a new protein-to-cDNA back-conversion feature.

Key bug fixes have been implemented, such as correcting the CIGAR mapping logic for reference/target advancement and ensuring type safety by replacing raw integer strand indicators with a Strand enum. The normalization logic has been substantially refactored for clarity and correctness, particularly in handling 3' shifts across transcript boundaries.

The codebase has been improved through extensive refactoring to reduce duplication and improve type safety. The Python API has also been enhanced with a better exception hierarchy and cleaner type stubs.

The addition of new integration tests and a large-scale validation script demonstrates a strong commitment to quality and correctness.

Overall, this is an excellent and well-executed pull request. The changes are thorough, well-tested, and significantly improve the quality of the codebase. I have reviewed the changes in detail and have not found any issues of medium or higher severity.

@folded folded merged commit e92b430 into main Mar 2, 2026
4 checks passed
@folded folded deleted the feat/transform-settings branch March 2, 2026 02:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant