MatchY estimates Y-STR haplotype match probabilities by Monte Carlo simulation with importance sampling over a user-supplied pedigree. This document describes the statistical model, simulation algorithm, convergence criterion, and Rust implementation.
Given:
- A Y-chromosome pedigree (directed acyclic graph of individuals related by paternal descent)
- Known Y-STR haplotypes for a subset of individuals
- A Y-STR mutation model (per-marker rates, step-size distribution)
Compute:
- Pedigree probability — P(all observed haplotypes | pedigree structure, mutation model)
- Inside-pedigree match probability — P(at least one unknown non-excluded pedigree member matches the person of interest's haplotype)
- Outside-pedigree match probability — P(a random individual just outside this pedigree matches the PoI haplotype)
- Pedigree odds and likelihood ratios derived from the above
Each generation, an allele mutates with probability μ (the per-marker mutation rate). When a mutation occurs:
- With probability
1 − f_2: the allele changes by ±1 repeat unit (one-step mutation). - With probability
f_2: the allele changes by ±2 repeat units (two-step mutation).
Direction (up or down) is equally likely for both step sizes.
The parameter f_2 is the two-step mutation fraction, defaulting to 0.03.
For markers with n copies (e.g. DYF387S1 with 2 copies), each copy mutates independently with the same per-copy mutation rate. The observed haplotype is the multi-set of allele values across all copies.
Two haplotypes match at a marker if their allele multi-sets are equal. A full haplotype match requires matching at every marker in the active set.
Naive Monte Carlo estimates P(match) by simulating many random haplotypes for unknown individuals and counting how often the haplotype configuration is consistent with the observations. For rare events (low mutation rates, many generations), the naive estimator has very high variance — billions of samples would be needed.
MatchY uses importance sampling to reduce variance. Instead of sampling haplotypes uniformly, the simulation biases the sampling distribution toward configurations that are more likely to produce the observed outcome, then corrects for this bias using importance weights.
The bias parameter b ∈ [0, 1) controls the strength of the bias:
b = 0→ no bias (uniform sampling, equivalent to naive Monte Carlo)- Higher
b→ stronger bias toward observed alleles, lower variance for rare events
The adaptive bias option (off by default) differentiates the bias parameter independently across the three ensemble models based on their relative performance in previous batches. It does not tune the global bias value; rather, it assigns different bias levels to the three models to increase ensemble diversity. Leave it off unless you have a specific reason to enable it.
For each simulated haplotype configuration, the importance weight corrects for the sampling bias:
w = P(configuration | true mutation model) / P(configuration | biased model)
The estimator for probability P is:
P̂ = (1/N) Σ [ I(match) × w ]
where I(match) = 1 if the simulated configuration produces a match and 0 otherwise.
To estimate convergence without a ground truth, MatchY runs three independent simulation models in parallel. Each model uses a different random seed and (slightly) different importance-sampling parameters.
The convergence criterion is met when the relative spread between the three models' current running means falls below the threshold ε:
spread = (max(P̂₀, P̂₁, P̂₂) − min(P̂₀, P̂₁, P̂₂)) / mean(P̂₀, P̂₁, P̂₂) ≤ ε
The default ε = 0.02 (2%). The final reported probability is the mean of the three model estimates.
This approach is robust to local stochastic fluctuations: a single model might temporarily appear stable while still drifting, but three independent models drifting in the same direction simultaneously is rare.
The pedigree probability P(pedigree) is computed by summing over all possible haplotype assignments to unknown individuals, weighting each by its probability under the mutation model.
For a pedigree with root r, the probability factorises over edges:
P(all haplotypes) = P(root alleles) × ∏_{(parent,child) edges} P(child | parent)
where P(child | parent) is the probability of the child's haplotype given the parent's haplotype under the mutation model.
In importance sampling, the simulation draws haplotypes for unknown individuals from the biased distribution and accumulates weighted probability estimates.
The inside-pedigree match probability asks: given the PoI's haplotype, what is the probability that at least one other unknown, non-excluded individual in the pedigree has the same haplotype?
This is computed as:
P(inside match) = 1 − P(no other individual matches)
= Σ_{k≥1} P(exactly k individuals match)
The simulation accumulates a count distribution {k → P(k matches)} and the reported probability is Σ_{k≥1} P(k).
Odds_pedigree = P(PoI is unique | observed) / P(at least one other matches | observed)
= (1 − P_inside) / P_inside
This is a true odds (a ratio of two complementary probabilities), not a likelihood ratio. It answers: how many times more probable is it that the PoI is the only matching individual versus at least one other pedigree member also matching?
The outside-pedigree match probability estimates P(random individual just outside this pedigree matches PoI), where "just outside" means: related to the pedigree but not typed or included.
The simulation extends the pedigree by adding one virtual individual as a child of each leaf node, then estimates the probability that this virtual individual matches the PoI haplotype. The result is averaged across all extended positions.
LR_outside = 1 / P_outside
This is a likelihood ratio: how many times more probable is the evidence if the PoI is the trace donor than if a random outside individual is the donor?
In trace mode, there is no prior PoI — instead, the simulation identifies which pedigree member is the most likely donor of a crime-scene profile (the TRACE profile).
For each unknown individual i, the simulation estimates P_i(TRACE match) — the probability that individual i matches the TRACE haplotype. The results are reported as:
- Absolute match probabilities for each individual
- Relative ratios scaled to the most likely donor (100%)
- The outside-pedigree probability as a baseline (likelihood that the donor is not in the pedigree)
| Parameter | Symbol | Default | Description |
|---|---|---|---|
batch_length |
N | 10,000 | Monte Carlo samples evaluated per batch. Each batch produces one data point on the convergence chart. |
convergence_criterion |
ε | 0.02 | Maximum relative spread between three model estimates. Simulation stops when this threshold is met. |
two_step_mutation_fraction |
f₂ | 0.03 | Fraction of mutations that are two-step (±2 repeat units). |
number_of_threads |
— | all CPUs | Rayon thread pool size. The three models run in parallel; within each model, per-batch parallelism uses the remaining threads. |
bias |
b | auto | Importance-sampling bias. Omit for automatic tuning. |
seed |
— | 0 | RNG seed. Default (0) gives fully reproducible results. Specify a different integer for a different reproducible sequence. |
A lower value of ε is stricter: the three models must agree more closely before the simulation stops, producing a more precise estimate at the cost of more batches. The default of 0.02 (2%) is appropriate for most use cases.
Larger batch lengths:
- Smoother convergence charts (less noise per data point)
- Fewer chart updates during a run
- More stable running means at each convergence check — all samples within a batch contribute to the estimate before the criterion is evaluated, so larger batches reduce the risk of premature convergence due to a temporarily favourable batch
The default of 10,000 is appropriate for most cases.
matchy/
├── crates/
│ ├── matchy-core/ # Core algorithm: simulation, mutation model, pedigree ops
│ ├── matchy-io/ # Config loading, pedigree/haplotype file formats, kits
│ ├── matchy-report/ # HTML report rendering (Minijinja templates)
│ ├── matchy-cli/ # CLI binary (clap argument parsing, batch runner)
│ └── matchy-app/ # Tauri desktop application + React/TypeScript frontend
└── Cargo.toml # Workspace manifest (single version source)
The core crate contains:
simulation/mod.rs— Top-levelrun_simulation()entry point. Creates a Rayon thread pool sized tonumber_of_threads, then delegates to the convergence loop.simulation/convergence.rs— Convergence loop: runs batches until all three models meet the criterion. EmitsProgressEventstructs over anmpscchannel for live chart updates.simulation/importance.rs— Per-batch importance sampling. Splitsbatch_lengthsamples across CPU threads (one RNG per thread to avoid per-iteration RNG initialisation overhead), accumulates weighted estimates.simulation/mutation.rs— Mutation model: generates child haplotypes from parent haplotypes under the single/two-step model.simulation/probability.rs— Pedigree probability accumulator.graph.rs— Petgraph-based DAG utilities: topological sort, BFS layers, ancestor/descendant queries, MRCA computation.models.rs— All public types:SimulationParameters,SimulationResult,MatchProbabilities,StageStats,ProgressEvent, etc.
config.rs— TOML and legacy INI config loader. ConvertsConfig→SimulationParameters.tgf.rs/ped.rs— Pedigree file parsers.haplotypes_json.rs— JSON haplotype loader; assigns haplotypes to pedigree nodes and returns the optional TRACE profile.kits.rs— Loads embedded kit definitions fromassets/kits.json.marker_csv.rs— Custom CSV marker set parser.mutation_rates.rs— Reads per-marker default mutation rates fromassets/mutation_rates.csv.
Uses Minijinja (a Rust Jinja2 port) to render HTML templates. Both render_report() and render_trace_report() accept a SimulationResult and optional JSON-serialised progress events, and return an HTML string. Templates are compiled into the binary via include_str!.
Thin binary that:
- Parses CLI arguments with
clap. - Detects single-file vs directory mode.
- Calls
run_config()for each config file. - Spawns a progress-collector thread to capture
ProgressEvents for report charts. - Writes the HTML report.
Uses mimalloc as the global allocator, which is critical for performance with the musl target — musl's default allocator serialises all allocations on a global mutex, causing severe slowdowns in parallel workloads.
Tauri v2 desktop application. The Rust backend exposes Tauri commands (defined in src/commands/) that the React frontend calls via invoke(). Key commands:
| Command | Description |
|---|---|
run_simulation |
Start a simulation; progress events streamed via Tauri events |
cancel_simulation |
Set the cancel flag |
generate_report |
Render and return HTML |
save_and_open_report |
Write HTML to disk and open in browser |
parse_tgf / parse_ped |
Parse uploaded pedigree files |
export_tgf |
Serialize current pedigree to TGF |
validate_pedigree |
Check DAG, connectivity |
build_extended_pedigree |
Extend pedigree for outside calculation |
parse_haplotypes_json / export_haplotypes_json |
Haplotype serialisation |
list_kits / load_kit / load_custom_csv |
Marker set management |
All parallelism uses Rayon. The thread pool is created explicitly with ThreadPoolBuilder so its size is controlled by the user parameter, not Rayon's global default.
Within each batch:
n_chunks = rayon::current_num_threads()
chunk_size = batch_length / n_chunks
Each chunk runs sequentially with a single RNG (seeded once per chunk). This avoids the cost of initialising a ChaCha20 RNG state for every iteration — an initialisation that was the root cause of a 10–16× performance regression when per-iteration RNGs were used.
The three convergence models themselves also run in parallel, so the effective thread utilisation is:
3 models × (batch_length / chunk_size) chunks each
The simulation is fully deterministic: the same seed with the same parameters on the same binary produces bit-identical results. The default seed is 0, so results are reproducible even without specifying a seed. Specify a different integer to obtain a different reproducible sequence.
| Factor | Impact | Recommendation |
|---|---|---|
| Thread count | Linear speedup up to ~16 threads for typical pedigrees | Use all available cores |
| Batch length | Affects stability of running means at each convergence check; larger reduces risk of premature convergence | 10,000 is appropriate for most cases |
| Convergence criterion | Quadratic: halving ε ≈ 4× more batches | Use 0.02 for casework |
| Allocator (Linux/musl) | musl's default allocator serialises parallel allocations | Use provided musl binary (mimalloc) |
| Pedigree size | Larger pedigrees = more unknowns to simulate = slower per batch | No workaround; unavoidable |
| Marker count | Linear: more markers = more per-marker work | Use targeted panels for speed |
| Format | Example | Meaning |
|---|---|---|
| Integer | 14 |
Standard STR repeat count |
| Decimal | 9.3 |
Intermediate allele (e.g. 9 repeats + partial) |
| Multiple values | ["35", "37"] |
Multi-copy marker (two distinct copies) |
Intermediate alleles are treated as distinct allele values — a mutation from 9.3 goes to 8.3 or 10.3 (one-step), not to 9 or 10.
| Field | Description |
|---|---|
iterations_per_model |
Total MC samples used by each of the three models |
model_probabilities |
Final probability estimate from each model |
total_iterations() |
Sum across all models |
formatted_runtime() |
Wall-clock time for this stage (s / m s / h m / d h) |
| Field | Description |
|---|---|
inside_match_probabilities |
MatchProbabilities struct with count distribution and average pedigree probability |
outside_match_probability |
Scalar Decimal |
per_individual_probabilities |
Map of individual ID → match probability (trace mode only) |
trials |
Number of batches until convergence |
converged |
Whether the convergence criterion was met |
total_runtime_secs |
Total wall-clock time in seconds |