Problem
Current genomatch metadata uses genome_build=GRCh37, but that label is ambiguous for the mitochondrial reference.
For nuclear primary contigs (1-22, X, Y), treating UCSC hg19 as GRCh37 is acceptable for current workflows. For mitochondrial sequence, it is not enough:
| Reference surface | Naming | MT length | Notes |
|---|---|---|---|
| UCSC hg19 | chrM | 16571 | Matches UCSC hg19 chain files |
| 1000G/GATK b37 / rCRS-style | MT | 16569 | Common GRCh37/b37 reference surface |
| GRCh38 | chrM | 16569 | Configured FASTA has chrM |
| T2T-CHM13v2.0 | chrM | 16569 | Configured FASTA has chrM |
The hg19 chrM and b37/rCRS MT sequences are not interchangeable. They differ by substitutions and indels across the sequence, so the same numeric MT coordinate does not generally refer to the same base in both.
Current Behavior
After switching the configured GRCh37 FASTA to UCSC hg19.fa, genomatch currently supports:
- GRCh37/hg19 MT: chrM, length 16571
- GRCh38 MT: chrM, length 16569
- T2T FASTA MT: chrM, length 16569
Known limitations:
- b37/1000G/GATK-style MT length 16569 is not supported as a distinct GRCh37 reference surface.
- T2T chain files currently omit chrM, so liftover involving T2T does not support MT.
- guess_build.py only guesses GRCh37 vs GRCh38; T2T must be declared explicitly.
- genome_build=GRCh37 alone is insufficient to distinguish hg19 MT from b37/rCRS MT.
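
One way to make the hg19 vs b37/rCRS distinction explicit is to look at the MT contig name and length rather than the build label. The sketch below is only an illustration: `classify_mt_surface` and its return labels are hypothetical and not part of genomatch, and the input is assumed to be a name-to-length mapping such as one read from a FASTA index. A length check alone cannot separate the three 16569-length surfaces in the table above; it only flags the hg19-style 16571 contig.

```python
# Hypothetical helper, not part of genomatch.
# Contig names seen for the mitochondrial sequence in this issue.
MT_NAMES = ("chrM", "MT", "chrMT")


def classify_mt_surface(contig_lengths):
    """Return (contig_name, label) for the MT contig in a {name: length} mapping.

    Labels are illustrative only:
      "hg19-style"     -> length 16571 (UCSC hg19 chrM)
      "16569-style"    -> length 16569 (b37/rCRS, GRCh38, or T2T per the table)
      "unknown-length" -> MT contig present with some other length
    Returns (None, "no-MT-contig") if no known MT name is present.
    """
    for name in MT_NAMES:
        if name in contig_lengths:
            length = contig_lengths[name]
            if length == 16571:
                return name, "hg19-style"
            if length == 16569:
                return name, "16569-style"
            return name, "unknown-length"
    return None, "no-MT-contig"


# Example with toy nuclear contig lengths (made up for illustration).
print(classify_mt_surface({"chr1": 1000, "chrM": 16571}))
print(classify_mt_surface({"1": 1000, "MT": 16569}))
```

A check like this could feed an explicit MT-surface field alongside genome_build, but that design is an assumption, not something the current metadata provides.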
Evidence
Local comparison of configured GRCh37 mitochondrial candidates:
- ref/ucsc/GRCh37/hg19.fa: chrM, length 16571
- ref/ucsc/GRCh37/hg19.p13.plusMT.no_alt_analysis_set.fa: chrMT, length 16569
Aligned comparison showed the sequences differ by multiple substitutions and indels, not only a terminal length difference. Direct same-coordinate comparison is therefore unsafe for MT.
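
For anyone wanting to reproduce this kind of check, the following is a rough sketch of how substitution and indel events between two sequences can be counted. It uses Python's standard-library `difflib.SequenceMatcher` as a stand-in; it is not a biological aligner and not necessarily the tool used for the comparison above. The function name `count_differences` and the toy sequences are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_differences(seq_a, seq_b):
    """Count non-matching edit events between two sequences.

    Returns a Counter keyed by difflib opcode tag:
    "replace" (substitution-like), "insert", and "delete".
    """
    # autojunk=False matters for long DNA strings: the default heuristic
    # treats very frequent elements as junk once a sequence has 200+ items,
    # which would affect all four bases.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    events = Counter()
    for tag, _a_start, _a_end, _b_start, _b_end in matcher.get_opcodes():
        if tag != "equal":
            events[tag] += 1
    return events


# Toy sequences, not real mitochondrial data.
print(count_differences("GATTACA", "GACTACA"))  # one substitution-like event
print(count_differences("ACGT", "ACGTA"))       # one insertion at the end
```

If the two MT sequences differed only by a terminal length change, a comparison like this would report a single insert or delete at the end; a mix of replace, insert, and delete events spread along the sequence is what makes same-coordinate comparison unsafe.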
T2T chain inspection showed configured T2T chain files include 1-22, X, Y, but omit chrM.
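
A chain inspection of this kind can be sketched by reading only the header lines of a chain file. In the UCSC chain format, each header line starts with `chain` and carries the target name in the third field and the query name in the eighth. The helper name `chain_contigs` and the sample records below are hypothetical; sizes and scores are toy values, not taken from the configured files.

```python
def chain_contigs(lines):
    """Collect target and query contig names from chain header lines.

    Header layout (UCSC chain format):
    chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
    """
    targets, queries = set(), set()
    for line in lines:
        if line.startswith("chain "):
            fields = line.split()
            targets.add(fields[2])  # tName
            queries.add(fields[7])  # qName
    return targets, queries


# Toy chain records for illustration only.
sample = [
    "chain 1000 chr1 5000 + 0 100 chr1 5200 + 0 100 1\n",
    "100\n",
    "\n",
    "chain 500 chrX 3000 + 0 50 chrX 3100 + 0 50 2\n",
    "50\n",
]
targets, queries = chain_contigs(sample)
print(sorted(targets), sorted(queries))
print("chrM" in targets or "chrM" in queries)  # False for this toy input
```

For a real file, the same function could be passed an open text handle (or a `gzip.open(path, "rt")` handle for compressed chains), since it only iterates over lines.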