ChromaFormer

Predicting Chromatin Accessibility from DNA Sequence with a From-Scratch Transformer

ChromaFormer is a lightweight transformer model, built from scratch in NumPy, that learns to predict whether a DNA sequence region is in an "open" (accessible) chromatin state, a key indicator of active gene regulation. It is trained on synthetic ENCODE-style ATAC-seq data that replicates known motif distributions.

Why This Problem Matters

Chromatin accessibility determines which genomic regions are available for transcription factor binding and gene activation. ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a standard experimental method for measuring it. A model that predicts accessibility from sequence alone is valuable for:

Annotating genomes of organisms without ATAC-seq data
Predicting the regulatory impact of genetic variants (eQTLs, GWAS hits)
Discovering novel regulatory elements

Model Architecture

DNA Sequence (200 bp)
      │
      ▼
 K-mer Tokenizer (k=6, stride=1)
      │  ~195 tokens per sequence
      ▼
 Learned Embedding (dim=64)
      │
      ▼
 Rotary Positional Embedding (RoPE)
      │
      ▼
 ┌─────────────────────────────┐
 │  Transformer Encoder Block  │  × 2 layers
 │  ┌───────────────────────┐  │
 │  │  Multi-Head Attention │  │  4 heads
 │  │  (with RoPE keys/     │  │
 │  │   queries)            │  │
 │  └───────────────────────┘  │
 │  ┌───────────────────────┐  │
 │  │  Feed-Forward Network │  │
 │  │  (ReLU, dim=128)      │  │
 │  └───────────────────────┘  │
 └─────────────────────────────┘
      │
      ▼
 [CLS] token pooling
      │
      ▼
 Linear → Sigmoid → P(open chromatin)
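
The attention stage in the diagram can be illustrated with a minimal single-head sketch. It assumes the interleaved RoPE convention (adjacent feature pairs are rotated together) and a frequency base of 10000; the repository's models/embeddings.py and models/attention.py may use a different layout. The function names rope and attention_head are hypothetical. The head dimension of 16 follows from the diagram (64-dim embedding split across 4 heads).

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate adjacent feature pairs of x (seq_len, head_dim) by position-dependent angles."""
    seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "RoPE needs an even head dimension"
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)   # (head_dim / 2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]     # (seq_len, head_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def attention_head(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Scaled dot-product attention with RoPE applied to queries and keys only."""
    d = q.shape[-1]
    scores = rope(q) @ rope(k).T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, head_dim = 195, 16  # 195 six-mers per 200 bp sequence; 64 / 4 heads = 16
q, k, v = (rng.standard_normal((seq_len, head_dim)) for _ in range(3))
out, weights = attention_head(q, k, v)
print(out.shape, weights.shape)

# Relative-position check: with the same query and key vector at every position,
# the score should depend only on the offset between the two positions.
same_q = np.tile(q[0], (10, 1))
same_k = np.tile(k[0], (10, 1))
scores = rope(same_q) @ rope(same_k).T
print(np.isclose(scores[2, 5], scores[4, 7]))  # both pairs are 3 positions apart
```

Values are left unrotated in this sketch because RoPE only needs to affect the query-key dot product.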

Design choices:

K-mer tokenization: overlapping 6-mers serve as vocabulary tokens (4^6 = 4096 possible tokens), analogous to subword tokenization in NLP but motivated by short biological sequence motifs
Rotary Positional Embeddings (RoPE): encode relative position directly in the attention scores, which tends to be more robust than learned absolute positional embeddings for variable-length sequences
Focal loss: addresses the severe class imbalance (~15% open chromatin in real ATAC-seq data) by down-weighting easy, well-classified examples, most of which are negatives
From-scratch NumPy implementation: no deep learning framework is used; every forward pass, backward pass, and optimizer step is explicit
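
To make the focal loss choice concrete, here is a minimal sketch of a binary focal loss with its hand-derived gradient with respect to the logits, in the spirit of the explicit backward passes described above. The function name focal_loss and the defaults gamma=2.0 and alpha=0.25 (the values reported as working well by Lin et al.) are assumptions; the settings used in models/losses.py are not stated here.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Return (mean focal loss, gradient of the mean loss w.r.t. logits).

    Per example: FL = -alpha_t * (1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    p = 1.0 / (1.0 + np.exp(-logits))                      # sigmoid
    p_t = np.where(targets == 1, p, 1.0 - p)               # prob. of the true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)   # class weighting
    log_p_t = np.log(np.clip(p_t, eps, 1.0))
    modulator = (1.0 - p_t) ** gamma                       # near 0 for easy examples
    loss = -alpha_t * modulator * log_p_t

    # d(loss)/d(logit), derived by hand via the chain rule through the sigmoid.
    # sign is +1 for positives and -1 for negatives because p_t = sigmoid(-z) when y = 0.
    sign = 2.0 * targets - 1.0
    grad = sign * alpha_t * modulator * (gamma * p_t * log_p_t + p_t - 1.0)
    return loss.mean(), grad / logits.size

# Three confidently rejected negatives and one misclassified positive.
logits = np.array([-4.0, -3.5, -5.0, -1.0])
targets = np.array([0, 0, 0, 1])
loss, grad = focal_loss(logits, targets)
print(loss)
print(grad)  # the hard positive (last entry) should dominate the gradient
```

With gamma=0 and alpha=0.5 the expression reduces to half of ordinary binary cross-entropy, which is a convenient sanity check when wiring it into a training loop.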

Data

The project ships with a synthetic data generator (data/generate_data.py) that mimics real ENCODE ATAC-seq patterns:

Open chromatin regions are seeded with consensus sequences of known regulatory motifs, written in IUPAC notation (S = C or G, N = any base): CTCF (CCGCGNGGNGGCAG), AP-1 (TGASTCA), SP1 (GGGCGG), and TATA-box (TATAAA)
Closed regions are random background sequences
Class ratio mirrors real ATAC-seq (≈15% open)
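
The generation scheme listed above can be sketched as follows. This is an illustration under stated assumptions, not the contents of data/generate_data.py: the helper names (sample_motif, make_example, generate), the uniform base composition, and the choice of one motif per open sequence are all hypothetical.

```python
import numpy as np

# IUPAC codes used by the motif consensus strings.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "CG", "N": "ACGT"}
MOTIFS = {
    "CTCF": "CCGCGNGGNGGCAG",
    "AP-1": "TGASTCA",
    "SP1": "GGGCGG",
    "TATA-box": "TATAAA",
}

def sample_motif(consensus: str, rng: np.random.Generator) -> str:
    """Resolve each IUPAC code in the consensus to one concrete base."""
    return "".join(str(rng.choice(list(IUPAC[c]))) for c in consensus)

def random_sequence(length: int, rng: np.random.Generator) -> str:
    """Uniform random background (real genomes are not uniform; this is a simplification)."""
    return "".join(rng.choice(list("ACGT"), size=length))

def make_example(is_open: bool, rng: np.random.Generator, length: int = 200) -> str:
    seq = random_sequence(length, rng)
    if is_open:
        names = list(MOTIFS)
        motif = sample_motif(MOTIFS[names[int(rng.integers(len(names)))]], rng)
        start = int(rng.integers(0, length - len(motif) + 1))
        seq = seq[:start] + motif + seq[start + len(motif):]  # overwrite, keep length
    return seq

def generate(n: int, open_fraction: float = 0.15, seed: int = 0):
    """Return (sequences, labels) with roughly open_fraction positives."""
    rng = np.random.default_rng(seed)
    labels = (rng.random(n) < open_fraction).astype(int)
    seqs = [make_example(bool(y), rng) for y in labels]
    return seqs, labels

seqs, labels = generate(500)
print(len(seqs), labels.mean())
```

One caveat of any scheme like this: short motifs such as TATAAA also occur by chance in random background, so a small share of "closed" sequences will contain a motif match.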

To use real ENCODE data instead, see scripts/download_encode.sh.

Project Structure

chromaformer/
├── README.md
├── requirements.txt
├── data/
│   └── generate_data.py       # Synthetic ATAC-seq data generator
├── models/
│   ├── tokenizer.py           # K-mer DNA tokenizer
│   ├── embeddings.py          # RoPE positional embeddings
│   ├── attention.py           # Multi-head self-attention
│   ├── transformer.py         # Full transformer encoder + classifier
│   └── losses.py              # Focal loss implementation
├── utils/
│   ├── metrics.py             # AUROC, AUPRC, F1
│   └── viz.py                 # Attention map & training curve plots
├── scripts/
│   └── download_encode.sh     # Script to fetch real ENCODE ATAC-seq data
├── train.py                   # Training entrypoint
└── evaluate.py                # Evaluation & visualization entrypoint

Quickstart

git clone https://github.com/YOUR_USERNAME/chromaformer
cd chromaformer
pip install -r requirements.txt

# Generate synthetic training data
python data/generate_data.py --n_sequences 5000 --output_dir data/

# Train the model
python train.py --data_dir data/ --epochs 20 --lr 0.001 --batch_size 32

# Evaluate and visualize attention maps
python evaluate.py --data_dir data/ --checkpoint results/best_model.npz

Using ENCODE Data

bash scripts/download_encode.sh

This downloads ATAC-seq peak BED files and the genome FASTA for K562 cells from the ENCODE portal (https://www.encodeproject.org). Positive sequences are extracted as peak summits ±100 bp, giving 200 bp windows that match the model's input length; negatives are sampled from non-peak regions with matched GC content.
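
The extraction step described above can be sketched on an in-memory chromosome string. This assumes narrowPeak-style peaks, where the summit is stored as an offset from the peak start, and uses rejection sampling with a GC tolerance for the negatives. The function names, the tolerance of 0.05, and the toy chromosome are hypothetical; the repository's own extraction logic may differ.

```python
import numpy as np

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

def summit_window(chrom_seq: str, peak_start: int, summit_offset: int, flank: int = 100):
    """Return the summit ± flank window, or None if it runs off the chromosome."""
    summit = peak_start + summit_offset
    lo, hi = summit - flank, summit + flank
    if lo < 0 or hi > len(chrom_seq):
        return None
    return chrom_seq[lo:hi]

def sample_gc_matched_negative(chrom_seq, peak_intervals, target_gc, rng,
                               flank=100, tol=0.05, max_tries=1000):
    """Draw random windows until one avoids all peaks and is within tol of target_gc."""
    width = 2 * flank
    for _ in range(max_tries):
        lo = int(rng.integers(0, len(chrom_seq) - width + 1))
        hi = lo + width
        # Half-open interval overlap test against every peak.
        overlaps = any(lo < end and start < hi for start, end in peak_intervals)
        window = chrom_seq[lo:hi]
        if not overlaps and abs(gc_content(window) - target_gc) <= tol:
            return window
    return None  # no match found within max_tries

# Toy data: a random 5 kb "chromosome" with one peak at [1000, 1300), summit offset 150.
rng = np.random.default_rng(0)
chrom = "".join(rng.choice(list("ACGT"), size=5000))
peaks = [(1000, 1300)]
positive = summit_window(chrom, peak_start=1000, summit_offset=150)
negative = sample_gc_matched_negative(chrom, peaks, gc_content(positive), rng)
print(len(positive), gc_content(positive))
```

Matching GC content matters because open chromatin tends to be GC-rich; without it, a classifier can score well by learning base composition instead of regulatory motifs.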

Requirements

numpy>=1.24
pandas>=1.5
scikit-learn>=1.2
matplotlib>=3.6
seaborn>=0.12
tqdm>=4.64

No PyTorch or TensorFlow required. The entire model runs on NumPy.

Citation & Data Sources

ENCODE Project: https://www.encodeproject.org
ATAC-seq methodology: Buenrostro et al., Nature Methods 2013
RoPE: Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", 2021
Focal Loss: Lin et al., "Focal Loss for Dense Object Detection", ICCV 2017
