A modern Python implementation of the D2R statistic for efficient detection of repetitive sequences in DNA.
Based on the paper:
A new statistic for efficient detection of repetitive sequences Sijie Chen, Yixin Chen, Fengzhu Sun, Michael S. Waterman, Xuegong Zhang Bioinformatics, 35(22), 2019, pp. 4596–4606 doi:10.1093/bioinformatics/btz262
This repository has been restructured. The original Java, C++, and SWIG-based implementations have been moved to the legacy/ directory and replaced by pyd2r, a pure-Python package that reimplements all functionality with a modern, clean API.
| Legacy | pyd2r | |
|---|---|---|
| Language | Java (read filter), C++ (genome scanner), C++/SWIG (Python binding) | Pure Python + NumPy |
| Platform | Windows/Linux binaries, no macOS | Any platform with Python >= 3.9 |
| Install | Manual compilation or pre-built binaries | pip install . |
| CLI | java -jar D2RReadFilter.jar ... |
d2r filter ... / d2r scan ... |
| API | SWIG-generated Python wrapper | Native Python package with type hints |
| Tests | None | 28 tests including real genome validation |
The original implementations are preserved in legacy/ for reference:
legacy/D2RReadFilter/— Java package for filtering sequencing reads by D2R scorelegacy/D2RGenomeScanner/— C++ sliding-window genome scannerlegacy/D2RLibrary/— SWIG-based Python bindings (Win64/Linux64 only)legacy/README.md— Original documentation
Repetitive regions are ubiquitous in genomes — tandem repeats, interspersed repeats, and clustered spaced repeats like CRISPR. Detecting them is fundamental to genome assembly, annotation, and functional genomics. Existing tools either require curated repeat databases (RepeatMasker) or are computationally expensive for large-scale screening of raw sequencing reads.
The D2 statistic measures similarity between two sequences by counting shared k-mer words. Given sequences A and B, the original D2 is defined as:
D2 = sum_w ( Xw * Yw )
where Xw and Yw are the counts of k-mer word w in A and B, respectively. More shared k-mers means higher D2, indicating greater similarity.
D2R extends this idea to a single sequence: instead of comparing two sequences, it measures self-similarity — how much a sequence matches itself. The key insight is that repetitive sequences must contain duplicated k-mers, so counting k-mer self-matches reveals repetitiveness regardless of repeat type (tandem, interspaced, or noisy).
For a sequence S of length n, the D2R statistic is:
D2R = [ sum_w { Xw(Xw - 1) - E[Xw(Xw - 1)] } ] / [ n_tilde * (n_tilde - 1) ]
where:
- Xw is the count of k-mer w in S
- Xw(Xw - 1) / 2 is the number of self-matches for word w (combinatorial pairs)
- E[Xw(Xw - 1)] is the expected number under a null (i.i.d.) background model, approximated by [n_tilde * lambda(w)]^2 using a Poisson assumption
- n_tilde = n - k + 1 is the number of k-mer positions
- lambda(w) is the probability of word w under the i.i.d. model: the product of its constituent base probabilities
The normalization by n_tilde * (n_tilde - 1) bounds D2R between 0 and 1, making it comparable across sequences of different lengths.
Interpretation: D2R is high when a sequence contains more k-mer self-matches than expected by chance — i.e., when it has repetitive regions. D2R is near zero for random sequences.
Unlike methods designed for specific repeat structures (e.g., TRF for tandem repeats, CRT for CRISPR), D2R is agnostic to repeat architecture. Whether repeats are:
- Tandem (ABCABC...) — the same k-mers recur in adjacent copies
- Interspaced (ABC...spacer...ABC...spacer...) — k-mers recur across spacers
- Noisy/inexact — most k-mers within mutated copies still match
D2R detects them all, because the statistic only counts k-mer occurrences, not their spatial arrangement.
The k-mer size k controls sensitivity:
- k too small (e.g., 3–4): short k-mers occur frequently by chance in any sequence, reducing discrimination power
- k too large (e.g., > repeat unit length): k-mers spanning repeat boundaries are unique, losing the repeat signal
- k slightly smaller than the repeat unit length: optimal, as it tolerates mutations in inexact repeats while maintaining specificity
In practice, k = 5–10 works well for most applications. The statistic is robust across a wide range of k values (see paper Figure 3d).
The package provides two variants:
d2r()— uses the Poisson approximation E[Xw(Xw-1)] = [n_tilde * lambda(w)]^2 (Equation 7 in the paper). Fast and works well in practice.d2r_exact_mean()— uses the exact expected value from Equation 5, which includes the overlapping coefficient delta(w) that accounts for a word's ability to overlap itself. Slightly more accurate but slower.
Experiments show both variants perform comparably (paper Supplementary Section 4.2).
# From source
pip install .
# Editable install for development
pip install -e ".[dev]"Requires Python >= 3.9 and NumPy.
from pyd2r import d2r
# A repetitive sequence
score = d2r("ACGTACGT" * 10, k=7)
print(f"Repetitive: {score:.6f}") # high value
# A random-ish sequence
score = d2r("ACGATCGTACGATCGTAGCTAGCTGA", k=7)
print(f"Random: {score:.6f}") # near zerofrom pyd2r import scan_genome, estimate_base_freq
from pyd2r.io import read_genome_fasta
header, genome = read_genome_fasta("genome.fa")
base_freq = estimate_base_freq(genome)
# Returns a numpy array of D2R values, one per window position
values = scan_genome(genome, window_size=1000, k=7, base_freq=base_freq)from pyd2r import filter_reads, ReadRecord
reads = [
ReadRecord(header=">read1", sequence="ACGTACGTACGTACGT" * 10),
ReadRecord(header=">read2", sequence="ACGATCGTACGATCGTAGCTAGCTGA" * 5),
]
passed = filter_reads(reads, k=7, threshold=0.001)
# Only reads with D2R > threshold are returned# Scan a genome
d2r scan genome.fa -w 1000 -k 7 -o result.txt
# Filter reads
d2r filter reads.fa -k 7 -t 0.001 -o filtered.fa
# With custom base frequencies
d2r scan genome.fa --freq 0.258 0.242 0.244 0.256src/pyd2r/
├── _types.py Data models and constants
├── _kmer.py K-mer counting and probability utilities
├── core.py D2R statistic functions
├── scanner.py Sliding-window genome scanner
├── filter.py Read filtering with optional parallelism
├── io.py FASTA I/O
└── cli.py Command-line interface
Dependency graph (clean DAG, no circular dependencies):
_types.py ReadRecord, BaseFreq, BASE_UNIFORM
|
_kmer.py word_prob, count_kmers, estimate_base_freq
|
core.py d2r, d2r_exact_mean, d2r_from_counts
|
+-- scanner.py scan_genome (sliding window)
+-- filter.py filter_reads (parallel scoring)
+-- io.py read_fasta, write_fasta
|
cli.py d2r filter / d2r scan
Separation of statistic from algorithm. core.py implements the pure D2R formula. scanner.py implements the sliding-window optimization (Algorithm 2 from the paper). This separation allows using the statistic in custom workflows without the scanning machinery.
d2r_from_counts() as the shared primitive. Both d2r() (for single sequences) and scan_genome() (for the first window) call the same underlying function that operates on pre-counted k-mers. This eliminates code duplication and ensures consistent results.
ReadRecord in _types.py, not filter.py. The data model is shared between I/O and filtering. Placing it in a dedicated types module prevents the I/O layer from depending on the filtering layer.
Case-insensitive throughout. Sequences are uppercased at input boundaries (count_kmers, scan_genome). Base frequency dicts accept both uppercase and lowercase keys. Internally, everything is uppercase.
Single-sequence D2R (Algorithm 1 from the paper):
- Count all k-mers in the sequence using a hash map — O(n)
- For each k-mer w with count Xw, compute Xw(Xw-1) - [n_tilde * lambda(w)]^2
- Sum and normalize by n_tilde * (n_tilde - 1)
- Total: O(n) time, O(n) space
Sliding-window scan (Algorithm 2 from the paper):
- Compute D2R for the first window using Algorithm 1
- For each subsequent position, only the head k-mer (removed) and tail k-mer (added) change
- Update D2R incrementally in O(1) per step using the update rules from the paper
- Total: O(G) time for a genome of length G
Read filtering:
- Each read is scored independently — trivially parallelizable
- Optional multiprocessing via
ProcessPoolExecutor
Three approaches (from Section 2.5 of the paper):
- Supervised — choose a threshold that minimizes classification error on labeled training data
- Unsupervised — if the D2R distribution is bimodal (background peak + repeat peak), set the threshold at the midpoint between peaks
- Q-value — simulate null sequences, compute empirical p-values, and control FDR
The package reproduces key results from the paper:
| Experiment | Paper | pyd2r |
|---|---|---|
| Classification AUC (k=7, interspaced repeats) | ~0.99 | 0.99 |
| Classification AUC (k=4, same setting) | ~0.84 | 0.87 |
| A. fulgidus CRISPR regions found | 3 | 3 |
| CRISPR signal-to-noise ratio | >>10x | 45–49x |
| Genome scan time (2.2 Mbp) | 563 ms (C++) | ~2s (pure Python) |
The sliding-window scanner correctly identifies all three CRISPR regions in the Archaeoglobus fulgidus genome (NC_000917) using the paper's unsupervised threshold method, with the bimodal D2R distribution matching Figure 5 of the paper.
| Function | Description |
|---|---|
d2r(seq, k, base_freq=None) |
Compute D2R for a single sequence |
d2r_exact_mean(seq, k, base_freq=None) |
D2R with exact (non-Poisson) expected value |
d2r_from_counts(kmer_counts, n_tilde, base_freq=None) |
D2R from pre-counted k-mers |
| Function | Description |
|---|---|
scan_genome(genome, window_size, k, base_freq=None) |
Sliding-window D2R scan, returns numpy array |
| Function | Description |
|---|---|
filter_reads(reads, k, threshold, base_freq=None, n_workers=1) |
Filter reads by D2R score |
| Function | Description |
|---|---|
word_prob(word, base_freq) |
K-mer probability under i.i.d. model |
count_kmers(seq, k) |
Count k-mer occurrences (returns Counter) |
estimate_base_freq(seq) |
Estimate base frequencies from a sequence |
| Function | Description |
|---|---|
read_fasta(path) |
Iterator of ReadRecord from FASTA file |
write_fasta(records, path) |
Write ReadRecords to FASTA |
read_genome_fasta(path) |
Read single-sequence FASTA as (header, seq) |
If you use this software, please cite:
@article{chen2019d2r,
title={A new statistic for efficient detection of repetitive sequences},
author={Chen, Sijie and Chen, Yixin and Sun, Fengzhu and Waterman, Michael S and Zhang, Xuegong},
journal={Bioinformatics},
volume={35},
number={22},
pages={4596--4606},
year={2019},
doi={10.1093/bioinformatics/btz262}
}GPL-3.0-or-later