Skip to content

uwaiseibna/PBML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PBML

Scaling the PBWT for Long-Range Shared Ancestry Detection in Large Haplotype Panels

PBML finds all set maximal exact matches (SMEMs) in a query haplotype against a reference panel, subject to user-specified constraints on minimum match length (L) and minimum panel occurrences (k). It is the first PBWT-based tool to support kL-SMEM queries — identifying shared haplotype segments of length ≥ L that occur in at least k panel haplotypes and cannot be extended without losing matches. This is particularly useful for identity-by-descent (IBD) segment analysis in large cohorts.

PBML adapts the BML (Boyer-Moore-Li) algorithm, originally proposed by T. Gagie for BWT-based text indexing, to the Positional Burrows-Wheeler Transform (PBWT). It operates on run-length encoded forward and reverse PBWTs, using LCP/LCS queries with Boyer-Moore skip logic to enumerate SMEMs efficiently. See the paper for details.

Key properties

  • kL-SMEMs — SMEMs of length ≥ L with at least k occurrences. Unique to PBML; competing tools are limited to L=1.
  • 4.6× faster than μ-PBWT, 2.4× faster than Durbin's PBWT, using 1.3–25.6× less memory (1KGP, single-threaded).
  • O(r) index size — Run-length compressed PBWTs scale with panel compressibility, not raw size.
  • Parallel queries — 8.2× speedup at 16 threads with near-constant memory overhead.

Installation

Dependencies

Building from source

git clone https://github.com/uwaiseibna/PBML.git
cd PBML && mkdir build && cd build
cmake .. && make -j

Input files must be .bcf or .vcf.

Quick start

./pbml run -p panel.bcf -q query.bcf -L 5 -k 1 -o smems.tsv

Usage

./pbml <mode> [options]

Modes

Mode Description
index Build and save index from panel
query Query from a saved index
run Build and query in one step (no index saved)

Options

Flag Description Default
-p, --panel Panel BCF/VCF file (index/run)
-q, --query Query BCF/VCF file (query/run)
-i, --index Index file path
-o, --output Output file smems.tsv
-L, --length Minimum SMEM length 1
-k, --min-occ Minimum occurrences in panel 1
-t, --threads Number of threads all available

Examples

# Build index once, query many times
./pbml index  -p panel.bcf -i panel.pbml
./pbml query  -i panel.pbml -q query.bcf

# One-shot build + query
./pbml run -p panel.bcf -q query.bcf

# Multi-threaded
OMP_NUM_THREADS=8 ./pbml run -p panel.bcf -q query.bcf -L 5

Output format

Tab-separated, one line per (query, panel haplotype, SMEM) triple:

query_id    panel_haplotype    start_site    end_site    length

Benchmarks

Median across chromosomes 1–22 on 1000 Genomes Project Phase 3 (4,008 panel haplotypes, 1,000 queries, k=1, L=1, single-threaded). Build and query times normalized per million variant sites.

Method Build (s/M sites) Query (s/M sites) Peak Memory (GB)
PBML 45.19 (43.86–46.12) 49.08 (44.86–67.42) 6.4 (2.4–12.4)
μ-PBWT 63.76 (62.72–65.88) 233.90 (200.48–276.75) 9.1 (2.8–17.4)
Dynamic μ-PBWT 212.45 (203.46–351.44) 582.28 (485.38–756.82) 24.2 (9.4–46.0)
PBWT 20.20 (14.95–20.60) 121.75 (113.36–135.03) 176.2 (52.3–336.2)

Values are median (min–max). All methods report identical SMEM sets. With 16 threads, PBML achieves an 8.2× query speedup with < 1% memory increase, while μ-PBWT sees 15.9% memory growth for a 2.4× speedup.

Citation

@article{islam2025scalable,
  title   = {Scalable PBWT Queries with Minimum-Length SMEM Constraints},
  author  = {Islam, Uwaise Ibna and Cozzi, Davide and Gagie, Travis and
             Varki, Rahul and Colonna, Vincenza and Garrison, Erik and
             Bonizzoni, Paola and Boucher, Christina},
  journal = {bioRxiv},
  pages   = {2025--12},
  year    = {2025},
  publisher = {Cold Spring Harbor Laboratory}
}

About

Scaling the PBWT for Long-Range Shared Ancestry Detection in Large Haplotype Panels.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages