D2R

A modern Python implementation of the D2R statistic for efficient detection of repetitive sequences in DNA.

Based on the paper:

A new statistic for efficient detection of repetitive sequences Sijie Chen, Yixin Chen, Fengzhu Sun, Michael S. Waterman, Xuegong Zhang Bioinformatics, 35(22), 2019, pp. 4596–4606 doi:10.1093/bioinformatics/btz262

Migration Notice

This repository has been restructured. The original Java, C++, and SWIG-based implementations have been moved to the legacy/ directory and replaced by pyd2r, a pure-Python package that reimplements all functionality with a modern, clean API.

What changed

	Legacy	pyd2r
Language	Java (read filter), C++ (genome scanner), C++/SWIG (Python binding)	Pure Python + NumPy
Platform	Windows/Linux binaries, no macOS	Any platform with Python >= 3.9
Install	Manual compilation or pre-built binaries	`pip install .`
CLI	`java -jar D2RReadFilter.jar ...`	`d2r filter ...` / `d2r scan ...`
API	SWIG-generated Python wrapper	Native Python package with type hints
Tests	None	28 tests including real genome validation

Legacy code

The original implementations are preserved in legacy/ for reference:

legacy/D2RReadFilter/ — Java package for filtering sequencing reads by D2R score
legacy/D2RGenomeScanner/ — C++ sliding-window genome scanner
legacy/D2RLibrary/ — SWIG-based Python bindings (Win64/Linux64 only)
legacy/README.md — Original documentation

Theory

The problem

Repetitive regions are ubiquitous in genomes — tandem repeats, interspersed repeats, and clustered spaced repeats like CRISPR. Detecting them is fundamental to genome assembly, annotation, and functional genomics. Existing tools either require curated repeat databases (RepeatMasker) or are computationally expensive for large-scale screening of raw sequencing reads.

The D2 family of statistics

The D2 statistic measures similarity between two sequences by counting shared k-mer words. Given sequences A and B, the original D2 is defined as:

D2 = sum_w ( Xw * Yw )

where Xw and Yw are the counts of k-mer word w in A and B, respectively. More shared k-mers means higher D2, indicating greater similarity.

From similarity to repetitiveness: D2R

D2R extends this idea to a single sequence: instead of comparing two sequences, it measures self-similarity — how much a sequence matches itself. The key insight is that repetitive sequences must contain duplicated k-mers, so counting k-mer self-matches reveals repetitiveness regardless of repeat type (tandem, interspaced, or noisy).

For a sequence S of length n, the D2R statistic is:

D2R = [ sum_w { Xw(Xw - 1) - E[Xw(Xw - 1)] } ] / [ n_tilde * (n_tilde - 1) ]

where:

Xw is the count of k-mer w in S
Xw(Xw - 1) / 2 is the number of self-matches for word w (combinatorial pairs)
E[Xw(Xw - 1)] is the expected number under a null (i.i.d.) background model, approximated by [n_tilde * lambda(w)]^2 using a Poisson assumption
n_tilde = n - k + 1 is the number of k-mer positions
lambda(w) is the probability of word w under the i.i.d. model: the product of its constituent base probabilities

The normalization by n_tilde * (n_tilde - 1) bounds D2R between 0 and 1, making it comparable across sequences of different lengths.

Interpretation: D2R is high when a sequence contains more k-mer self-matches than expected by chance — i.e., when it has repetitive regions. D2R is near zero for random sequences.

Why it works for all repeat types

Unlike methods designed for specific repeat structures (e.g., TRF for tandem repeats, CRT for CRISPR), D2R is agnostic to repeat architecture. Whether repeats are:

Tandem (ABCABC...) — the same k-mers recur in adjacent copies
Interspaced (ABC...spacer...ABC...spacer...) — k-mers recur across spacers
Noisy/inexact — most k-mers within mutated copies still match

D2R detects them all, because the statistic only counts k-mer occurrences, not their spatial arrangement.

Choosing k-mer size

The k-mer size k controls sensitivity:

k too small (e.g., 3–4): short k-mers occur frequently by chance in any sequence, reducing discrimination power
k too large (e.g., > repeat unit length): k-mers spanning repeat boundaries are unique, losing the repeat signal
k slightly smaller than the repeat unit length: optimal, as it tolerates mutations in inexact repeats while maintaining specificity

In practice, k = 5–10 works well for most applications. The statistic is robust across a wide range of k values (see paper Figure 3d).

Exact vs. approximate expected value

The package provides two variants:

d2r() — uses the Poisson approximation E[Xw(Xw-1)] = [n_tilde * lambda(w)]^2 (Equation 7 in the paper). Fast and works well in practice.
d2r_exact_mean() — uses the exact expected value from Equation 5, which includes the overlapping coefficient delta(w) that accounts for a word's ability to overlap itself. Slightly more accurate but slower.

Experiments show both variants perform comparably (paper Supplementary Section 4.2).

Installation

# From source
pip install .

# Editable install for development
pip install -e ".[dev]"

Requires Python >= 3.9 and NumPy.

Quick Start

Compute D2R for a sequence

from pyd2r import d2r

# A repetitive sequence
score = d2r("ACGTACGT" * 10, k=7)
print(f"Repetitive: {score:.6f}")   # high value

# A random-ish sequence
score = d2r("ACGATCGTACGATCGTAGCTAGCTGA", k=7)
print(f"Random: {score:.6f}")       # near zero

Scan a genome for repetitive regions

from pyd2r import scan_genome, estimate_base_freq
from pyd2r.io import read_genome_fasta

header, genome = read_genome_fasta("genome.fa")
base_freq = estimate_base_freq(genome)

# Returns a numpy array of D2R values, one per window position
values = scan_genome(genome, window_size=1000, k=7, base_freq=base_freq)

Filter sequencing reads

from pyd2r import filter_reads, ReadRecord

reads = [
    ReadRecord(header=">read1", sequence="ACGTACGTACGTACGT" * 10),
    ReadRecord(header=">read2", sequence="ACGATCGTACGATCGTAGCTAGCTGA" * 5),
]

passed = filter_reads(reads, k=7, threshold=0.001)
# Only reads with D2R > threshold are returned

Command-line interface

# Scan a genome
d2r scan genome.fa -w 1000 -k 7 -o result.txt

# Filter reads
d2r filter reads.fa -k 7 -t 0.001 -o filtered.fa

# With custom base frequencies
d2r scan genome.fa --freq 0.258 0.242 0.244 0.256

Design

Architecture

src/pyd2r/
├── _types.py      Data models and constants
├── _kmer.py       K-mer counting and probability utilities
├── core.py        D2R statistic functions
├── scanner.py     Sliding-window genome scanner
├── filter.py      Read filtering with optional parallelism
├── io.py          FASTA I/O
└── cli.py         Command-line interface

Dependency graph (clean DAG, no circular dependencies):

_types.py       ReadRecord, BaseFreq, BASE_UNIFORM
    |
_kmer.py        word_prob, count_kmers, estimate_base_freq
    |
core.py         d2r, d2r_exact_mean, d2r_from_counts
    |
    +-- scanner.py    scan_genome (sliding window)
    +-- filter.py     filter_reads (parallel scoring)
    +-- io.py         read_fasta, write_fasta
            |
         cli.py       d2r filter / d2r scan

Design decisions

Separation of statistic from algorithm. core.py implements the pure D2R formula. scanner.py implements the sliding-window optimization (Algorithm 2 from the paper). This separation allows using the statistic in custom workflows without the scanning machinery.

d2r_from_counts() as the shared primitive. Both d2r() (for single sequences) and scan_genome() (for the first window) call the same underlying function that operates on pre-counted k-mers. This eliminates code duplication and ensures consistent results.

ReadRecord in _types.py, not filter.py. The data model is shared between I/O and filtering. Placing it in a dedicated types module prevents the I/O layer from depending on the filtering layer.

Case-insensitive throughout. Sequences are uppercased at input boundaries (count_kmers, scan_genome). Base frequency dicts accept both uppercase and lowercase keys. Internally, everything is uppercase.

Algorithms

Single-sequence D2R (Algorithm 1 from the paper):

Count all k-mers in the sequence using a hash map — O(n)
For each k-mer w with count Xw, compute Xw(Xw-1) - [n_tilde * lambda(w)]^2
Sum and normalize by n_tilde * (n_tilde - 1)
Total: O(n) time, O(n) space

Sliding-window scan (Algorithm 2 from the paper):

Compute D2R for the first window using Algorithm 1
For each subsequent position, only the head k-mer (removed) and tail k-mer (added) change
Update D2R incrementally in O(1) per step using the update rules from the paper
Total: O(G) time for a genome of length G

Read filtering:

Each read is scored independently — trivially parallelizable
Optional multiprocessing via ProcessPoolExecutor

Threshold selection

Three approaches (from Section 2.5 of the paper):

Supervised — choose a threshold that minimizes classification error on labeled training data
Unsupervised — if the D2R distribution is bimodal (background peak + repeat peak), set the threshold at the midpoint between peaks
Q-value — simulate null sequences, compute empirical p-values, and control FDR

Validation

The package reproduces key results from the paper:

Experiment	Paper	pyd2r
Classification AUC (k=7, interspaced repeats)	~0.99	0.99
Classification AUC (k=4, same setting)	~0.84	0.87
A. fulgidus CRISPR regions found	3	3
CRISPR signal-to-noise ratio	>>10x	45–49x
Genome scan time (2.2 Mbp)	563 ms (C++)	~2s (pure Python)

The sliding-window scanner correctly identifies all three CRISPR regions in the Archaeoglobus fulgidus genome (NC_000917) using the paper's unsupervised threshold method, with the bimodal D2R distribution matching Figure 5 of the paper.

API Reference

Core functions

Function	Description
`d2r(seq, k, base_freq=None)`	Compute D2R for a single sequence
`d2r_exact_mean(seq, k, base_freq=None)`	D2R with exact (non-Poisson) expected value
`d2r_from_counts(kmer_counts, n_tilde, base_freq=None)`	D2R from pre-counted k-mers

Genome scanning

Function	Description
`scan_genome(genome, window_size, k, base_freq=None)`	Sliding-window D2R scan, returns numpy array

Read filtering

Function	Description
`filter_reads(reads, k, threshold, base_freq=None, n_workers=1)`	Filter reads by D2R score

Utilities

Function	Description
`word_prob(word, base_freq)`	K-mer probability under i.i.d. model
`count_kmers(seq, k)`	Count k-mer occurrences (returns Counter)
`estimate_base_freq(seq)`	Estimate base frequencies from a sequence

I/O

Function	Description
`read_fasta(path)`	Iterator of ReadRecord from FASTA file
`write_fasta(records, path)`	Write ReadRecords to FASTA
`read_genome_fasta(path)`	Read single-sequence FASTA as (header, seq)

Citation

If you use this software, please cite:

@article{chen2019d2r,
  title={A new statistic for efficient detection of repetitive sequences},
  author={Chen, Sijie and Chen, Yixin and Sun, Fengzhu and Waterman, Michael S and Zhang, Xuegong},
  journal={Bioinformatics},
  volume={35},
  number={22},
  pages={4596--4606},
  year={2019},
  doi={10.1093/bioinformatics/btz262}
}

License

GPL-3.0-or-later

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
legacy		legacy
src/pyd2r		src/pyd2r
tests		tests
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

D2R

Migration Notice

What changed

Legacy code

Theory

The problem

The D2 family of statistics

From similarity to repetitiveness: D2R

Why it works for all repeat types

Choosing k-mer size

Exact vs. approximate expected value

Installation

Quick Start

Compute D2R for a sequence

Scan a genome for repetitive regions

Filter sequencing reads

Command-line interface

Design

Architecture

Design decisions

Algorithms

Threshold selection

Validation

API Reference

Core functions

Genome scanning

Read filtering

Utilities

I/O

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages