mitochondrial support #19

@ofrei

Problem

Current genomatch metadata uses genome_build=GRCh37, but that label does not unambiguously identify the mitochondrial reference.

For nuclear primary contigs (1-22, X, Y), treating UCSC hg19 as GRCh37 is acceptable for current workflows. For the mitochondrial sequence, the GRCh37 label alone is not enough:

| Reference surface | Naming | MT length | Notes |
| --- | --- | --- | --- |
| UCSC hg19 | chrM | 16571 | Matches UCSC hg19 chain files |
| 1000G/GATK b37 / rCRS-style | MT | 16569 | Common GRCh37/b37 reference surface |
| GRCh38 | chrM | 16569 | Configured FASTA has chrM |
| T2T-CHM13v2.0 | chrM | 16569 | Configured FASTA has chrM |

The hg19 chrM and b37/rCRS MT sequences are not interchangeable. They differ by substitutions and indels across the sequence, so the same numeric MT coordinate does not generally refer to the same base in both.
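One way to make the distinction explicit is to classify a GRCh37-labelled mitochondrial contig by its name and length. This is a minimal sketch, not genomatch code: the function name `classify_grch37_mt` and the labels `hg19-chrM` and `b37-rCRS` are hypothetical, and the lengths and contig names are the ones reported in this issue.

```python
from typing import Optional

# Lengths reported in this issue.
MT_LENGTH_HG19 = 16571  # UCSC hg19 chrM
MT_LENGTH_RCRS = 16569  # b37/rCRS-style MT

# Mitochondrial contig names that appear in this issue.
MT_NAMES = {"chrM", "chrMT", "MT"}


def classify_grch37_mt(contig: str, length: int) -> Optional[str]:
    """Return a hypothetical surface label for a GRCh37-labelled MT contig.

    Returns None for non-mitochondrial contigs and raises ValueError when
    the length matches neither known surface.
    """
    if contig not in MT_NAMES:
        return None
    if length == MT_LENGTH_HG19:
        return "hg19-chrM"  # hypothetical label
    if length == MT_LENGTH_RCRS:
        return "b37-rCRS"  # hypothetical label
    raise ValueError(f"unexpected MT length {length} for contig {contig!r}")
```

Length alone separates the two GRCh37 surfaces, but it does not separate b37/rCRS from GRCh38 or T2T, which also report 16569, so a check like this only helps when the build is already declared as GRCh37.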

Current Behavior

After switching the configured GRCh37 FASTA to UCSC hg19.fa, genomatch currently supports:

  • GRCh37/hg19 MT: chrM, length 16571
  • GRCh38 MT: chrM, length 16569
  • T2T FASTA MT: chrM, length 16569

Known limitations:

  • The b37/1000G/GATK-style MT (length 16569) is not supported as a distinct GRCh37 reference surface.
  • T2T chain files currently omit chrM, so liftover involving T2T does not support MT.
  • guess_build.py only guesses GRCh37 vs GRCh38; T2T must be declared explicitly.
  • genome_build=GRCh37 alone is insufficient to distinguish hg19 MT from b37/rCRS MT.

Evidence

Local comparison of configured GRCh37 mitochondrial candidates:

  • ref/ucsc/GRCh37/hg19.fa: chrM, length 16571
  • ref/ucsc/GRCh37/hg19.p13.plusMT.no_alt_analysis_set.fa: chrMT, length 16569
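Contig names and lengths like the ones above can be read straight from a FASTA file. The following is a minimal sketch, assuming plain uncompressed FASTA text; `contig_lengths` is a hypothetical helper and not part of genomatch.

```python
from typing import Dict, Iterable


def contig_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA contig name to its sequence length.

    The contig name is the first whitespace-delimited token after '>'.
    """
    lengths: Dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: count its bases toward the current contig.
            lengths[name] += len(line)
    return lengths
```

It can be called on an open file handle, for example `contig_lengths(open(path))`, and then queried for whichever mitochondrial name the file uses (`chrM`, `chrMT`, or `MT`).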

Aligned comparison showed the sequences differ by multiple substitutions and indels, not only a terminal length difference. Direct same-coordinate comparison is therefore unsafe for MT.
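A toy example illustrates why an indel makes same-coordinate comparison unsafe. The sequences below are invented for illustration and are not mitochondrial data. A single inserted base shifts every downstream position, so a position-by-position comparison reports many mismatches where an alignment reports one edit.

```python
from difflib import SequenceMatcher


def same_coordinate_mismatches(a: str, b: str) -> int:
    """Count positions where a and b differ when compared index by index."""
    return sum(1 for x, y in zip(a, b) if x != y)


def aligned_edit_count(a: str, b: str) -> int:
    """Count non-matching blocks in a difflib alignment of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


# Toy sequences (invented): toy_b is toy_a with one extra G inserted.
toy_a = "ACGTACGTAC"
toy_b = "ACGGTACGTAC"

naive = same_coordinate_mismatches(toy_a, toy_b)
aligned = aligned_edit_count(toy_a, toy_b)
```

Here `naive` is much larger than `aligned`, which is the same effect that makes raw MT coordinates unsafe to compare between hg19 chrM and b37/rCRS MT.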

Inspection of the configured T2T chain files showed that they include 1-22, X, and Y but omit chrM.
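A chain file can be checked for mitochondrial coverage by collecting the contig names from its header lines. This is a minimal sketch, assuming the standard UCSC chain format, in which each header line starts with `chain` and carries the target name in the third field and the query name in the eighth. `chain_contigs` is a hypothetical helper, not part of genomatch.

```python
from typing import Iterable, Set, Tuple


def chain_contigs(lines: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (target contigs, query contigs) named in chain header lines."""
    targets: Set[str] = set()
    queries: Set[str] = set()
    for line in lines:
        fields = line.split()
        # Header layout: chain score tName tSize tStrand tStart tEnd
        #                qName qSize qStrand qStart qEnd id
        if len(fields) >= 12 and fields[0] == "chain":
            targets.add(fields[2])
            queries.add(fields[7])
    return targets, queries
```

Passing an open chain file to `chain_contigs` and testing whether `chrM` appears in either returned set would reproduce the inspection described above.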
