Skip to content

XuegongLab/D2R_codes

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

D2R

A modern Python implementation of the D2R statistic for efficient detection of repetitive sequences in DNA.

Based on the paper:

A new statistic for efficient detection of repetitive sequences Sijie Chen, Yixin Chen, Fengzhu Sun, Michael S. Waterman, Xuegong Zhang Bioinformatics, 35(22), 2019, pp. 4596–4606 doi:10.1093/bioinformatics/btz262


Migration Notice

This repository has been restructured. The original Java, C++, and SWIG-based implementations have been moved to the legacy/ directory and replaced by pyd2r, a pure-Python package that reimplements all functionality with a modern, clean API.

What changed

Legacy pyd2r
Language Java (read filter), C++ (genome scanner), C++/SWIG (Python binding) Pure Python + NumPy
Platform Windows/Linux binaries, no macOS Any platform with Python >= 3.9
Install Manual compilation or pre-built binaries pip install .
CLI java -jar D2RReadFilter.jar ... d2r filter ... / d2r scan ...
API SWIG-generated Python wrapper Native Python package with type hints
Tests None 28 tests including real genome validation

Legacy code

The original implementations are preserved in legacy/ for reference:

  • legacy/D2RReadFilter/ — Java package for filtering sequencing reads by D2R score
  • legacy/D2RGenomeScanner/ — C++ sliding-window genome scanner
  • legacy/D2RLibrary/ — SWIG-based Python bindings (Win64/Linux64 only)
  • legacy/README.md — Original documentation

Theory

The problem

Repetitive regions are ubiquitous in genomes — tandem repeats, interspersed repeats, and clustered spaced repeats like CRISPR. Detecting them is fundamental to genome assembly, annotation, and functional genomics. Existing tools either require curated repeat databases (RepeatMasker) or are computationally expensive for large-scale screening of raw sequencing reads.

The D2 family of statistics

The D2 statistic measures similarity between two sequences by counting shared k-mer words. Given sequences A and B, the original D2 is defined as:

D2 = sum_w ( Xw * Yw )

where Xw and Yw are the counts of k-mer word w in A and B, respectively. More shared k-mers means higher D2, indicating greater similarity.

From similarity to repetitiveness: D2R

D2R extends this idea to a single sequence: instead of comparing two sequences, it measures self-similarity — how much a sequence matches itself. The key insight is that repetitive sequences must contain duplicated k-mers, so counting k-mer self-matches reveals repetitiveness regardless of repeat type (tandem, interspaced, or noisy).

For a sequence S of length n, the D2R statistic is:

D2R = [ sum_w { Xw(Xw - 1) - E[Xw(Xw - 1)] } ] / [ n_tilde * (n_tilde - 1) ]

where:

  • Xw is the count of k-mer w in S
  • Xw(Xw - 1) / 2 is the number of self-matches for word w (combinatorial pairs)
  • E[Xw(Xw - 1)] is the expected number under a null (i.i.d.) background model, approximated by [n_tilde * lambda(w)]^2 using a Poisson assumption
  • n_tilde = n - k + 1 is the number of k-mer positions
  • lambda(w) is the probability of word w under the i.i.d. model: the product of its constituent base probabilities

The normalization by n_tilde * (n_tilde - 1) bounds D2R between 0 and 1, making it comparable across sequences of different lengths.

Interpretation: D2R is high when a sequence contains more k-mer self-matches than expected by chance — i.e., when it has repetitive regions. D2R is near zero for random sequences.

Why it works for all repeat types

Unlike methods designed for specific repeat structures (e.g., TRF for tandem repeats, CRT for CRISPR), D2R is agnostic to repeat architecture. Whether repeats are:

  • Tandem (ABCABC...) — the same k-mers recur in adjacent copies
  • Interspaced (ABC...spacer...ABC...spacer...) — k-mers recur across spacers
  • Noisy/inexact — most k-mers within mutated copies still match

D2R detects them all, because the statistic only counts k-mer occurrences, not their spatial arrangement.

Choosing k-mer size

The k-mer size k controls sensitivity:

  • k too small (e.g., 3–4): short k-mers occur frequently by chance in any sequence, reducing discrimination power
  • k too large (e.g., > repeat unit length): k-mers spanning repeat boundaries are unique, losing the repeat signal
  • k slightly smaller than the repeat unit length: optimal, as it tolerates mutations in inexact repeats while maintaining specificity

In practice, k = 5–10 works well for most applications. The statistic is robust across a wide range of k values (see paper Figure 3d).

Exact vs. approximate expected value

The package provides two variants:

  • d2r() — uses the Poisson approximation E[Xw(Xw-1)] = [n_tilde * lambda(w)]^2 (Equation 7 in the paper). Fast and works well in practice.
  • d2r_exact_mean() — uses the exact expected value from Equation 5, which includes the overlapping coefficient delta(w) that accounts for a word's ability to overlap itself. Slightly more accurate but slower.

Experiments show both variants perform comparably (paper Supplementary Section 4.2).


Installation

# From source
pip install .

# Editable install for development
pip install -e ".[dev]"

Requires Python >= 3.9 and NumPy.


Quick Start

Compute D2R for a sequence

from pyd2r import d2r

# A repetitive sequence
score = d2r("ACGTACGT" * 10, k=7)
print(f"Repetitive: {score:.6f}")   # high value

# A random-ish sequence
score = d2r("ACGATCGTACGATCGTAGCTAGCTGA", k=7)
print(f"Random: {score:.6f}")       # near zero

Scan a genome for repetitive regions

from pyd2r import scan_genome, estimate_base_freq
from pyd2r.io import read_genome_fasta

header, genome = read_genome_fasta("genome.fa")
base_freq = estimate_base_freq(genome)

# Returns a numpy array of D2R values, one per window position
values = scan_genome(genome, window_size=1000, k=7, base_freq=base_freq)

Filter sequencing reads

from pyd2r import filter_reads, ReadRecord

reads = [
    ReadRecord(header=">read1", sequence="ACGTACGTACGTACGT" * 10),
    ReadRecord(header=">read2", sequence="ACGATCGTACGATCGTAGCTAGCTGA" * 5),
]

passed = filter_reads(reads, k=7, threshold=0.001)
# Only reads with D2R > threshold are returned

Command-line interface

# Scan a genome
d2r scan genome.fa -w 1000 -k 7 -o result.txt

# Filter reads
d2r filter reads.fa -k 7 -t 0.001 -o filtered.fa

# With custom base frequencies
d2r scan genome.fa --freq 0.258 0.242 0.244 0.256

Design

Architecture

src/pyd2r/
├── _types.py      Data models and constants
├── _kmer.py       K-mer counting and probability utilities
├── core.py        D2R statistic functions
├── scanner.py     Sliding-window genome scanner
├── filter.py      Read filtering with optional parallelism
├── io.py          FASTA I/O
└── cli.py         Command-line interface

Dependency graph (clean DAG, no circular dependencies):

_types.py       ReadRecord, BaseFreq, BASE_UNIFORM
    |
_kmer.py        word_prob, count_kmers, estimate_base_freq
    |
core.py         d2r, d2r_exact_mean, d2r_from_counts
    |
    +-- scanner.py    scan_genome (sliding window)
    +-- filter.py     filter_reads (parallel scoring)
    +-- io.py         read_fasta, write_fasta
            |
         cli.py       d2r filter / d2r scan

Design decisions

Separation of statistic from algorithm. core.py implements the pure D2R formula. scanner.py implements the sliding-window optimization (Algorithm 2 from the paper). This separation allows using the statistic in custom workflows without the scanning machinery.

d2r_from_counts() as the shared primitive. Both d2r() (for single sequences) and scan_genome() (for the first window) call the same underlying function that operates on pre-counted k-mers. This eliminates code duplication and ensures consistent results.

ReadRecord in _types.py, not filter.py. The data model is shared between I/O and filtering. Placing it in a dedicated types module prevents the I/O layer from depending on the filtering layer.

Case-insensitive throughout. Sequences are uppercased at input boundaries (count_kmers, scan_genome). Base frequency dicts accept both uppercase and lowercase keys. Internally, everything is uppercase.

Algorithms

Single-sequence D2R (Algorithm 1 from the paper):

  1. Count all k-mers in the sequence using a hash map — O(n)
  2. For each k-mer w with count Xw, compute Xw(Xw-1) - [n_tilde * lambda(w)]^2
  3. Sum and normalize by n_tilde * (n_tilde - 1)
  4. Total: O(n) time, O(n) space

Sliding-window scan (Algorithm 2 from the paper):

  1. Compute D2R for the first window using Algorithm 1
  2. For each subsequent position, only the head k-mer (removed) and tail k-mer (added) change
  3. Update D2R incrementally in O(1) per step using the update rules from the paper
  4. Total: O(G) time for a genome of length G

Read filtering:

  • Each read is scored independently — trivially parallelizable
  • Optional multiprocessing via ProcessPoolExecutor

Threshold selection

Three approaches (from Section 2.5 of the paper):

  1. Supervised — choose a threshold that minimizes classification error on labeled training data
  2. Unsupervised — if the D2R distribution is bimodal (background peak + repeat peak), set the threshold at the midpoint between peaks
  3. Q-value — simulate null sequences, compute empirical p-values, and control FDR

Validation

The package reproduces key results from the paper:

Experiment Paper pyd2r
Classification AUC (k=7, interspaced repeats) ~0.99 0.99
Classification AUC (k=4, same setting) ~0.84 0.87
A. fulgidus CRISPR regions found 3 3
CRISPR signal-to-noise ratio >>10x 45–49x
Genome scan time (2.2 Mbp) 563 ms (C++) ~2s (pure Python)

The sliding-window scanner correctly identifies all three CRISPR regions in the Archaeoglobus fulgidus genome (NC_000917) using the paper's unsupervised threshold method, with the bimodal D2R distribution matching Figure 5 of the paper.


API Reference

Core functions

Function Description
d2r(seq, k, base_freq=None) Compute D2R for a single sequence
d2r_exact_mean(seq, k, base_freq=None) D2R with exact (non-Poisson) expected value
d2r_from_counts(kmer_counts, n_tilde, base_freq=None) D2R from pre-counted k-mers

Genome scanning

Function Description
scan_genome(genome, window_size, k, base_freq=None) Sliding-window D2R scan, returns numpy array

Read filtering

Function Description
filter_reads(reads, k, threshold, base_freq=None, n_workers=1) Filter reads by D2R score

Utilities

Function Description
word_prob(word, base_freq) K-mer probability under i.i.d. model
count_kmers(seq, k) Count k-mer occurrences (returns Counter)
estimate_base_freq(seq) Estimate base frequencies from a sequence

I/O

Function Description
read_fasta(path) Iterator of ReadRecord from FASTA file
write_fasta(records, path) Write ReadRecords to FASTA
read_genome_fasta(path) Read single-sequence FASTA as (header, seq)

Citation

If you use this software, please cite:

@article{chen2019d2r,
  title={A new statistic for efficient detection of repetitive sequences},
  author={Chen, Sijie and Chen, Yixin and Sun, Fengzhu and Waterman, Michael S and Zhang, Xuegong},
  journal={Bioinformatics},
  volume={35},
  number={22},
  pages={4596--4606},
  year={2019},
  doi={10.1093/bioinformatics/btz262}
}

License

GPL-3.0-or-later

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • C++ 81.2%
  • Python 16.0%
  • Java 2.3%
  • Other 0.5%