Enformer Embeddings is a Python framework for extracting embeddings from DNA sequences using the Enformer model. This framework allows you to process FASTA files and extract high-dimensional embeddings that can be used for downstream analysis tasks.
The Enformer Embeddings framework provides tools to:
- Extract embeddings from FASTA sequences using the Enformer model
- Process sequences with automatic centering and padding
- Apply mean pooling for dimensionality reduction
- Save embeddings in compressed NumPy format for easy loading
- Python 3.11 or higher
- uv package manager (install from https://github.com/astral-sh/uv)
- CUDA-capable GPU (optional, but recommended for faster inference)
# Or using pip
pip install uv
Run the environment setup command:
uv sync
source .venv/bin/activate
This will install all required dependencies, including:
- enformer-pytorch: Enformer model implementation
- biopython: FASTA file parsing
- pandas: Data processing utilities
- numpy: Numerical operations
Run the main embedding-extraction script as follows:
python -m retrieve_embeddings.cli \
--input-file test_files/test.fasta \
--output-file output/embeddings.npz
python -m retrieve_embeddings.cli \
--input-file <path-to-input.fasta> \
--output-file <path-to-output.npz> \
--window-size 196608 \
--pad-value N \
--mean-pool \
--no-center
- --input-file (required): Path to input FASTA file containing DNA sequences
- --output-file (required): Path to output .npz file where embeddings will be saved
- --window-size (optional): Window size for sequence centering. Defaults to 196608 (Enformer requirement)
- --pad-value (optional): Padding value when sequences are shorter than window size. Options:
  - N: Pad with 'N' characters (default)
  - -: Pad with gap character '-'
  - -1: Pad with -1 index value
- --mean-pool (optional): Apply mean pooling across the embedding dimension. Reduces shape from (N, 896, 3072) to (N, 896)
- --no-center (optional): Disable sequence centering. Sequences must be exactly window_size in length
The script outputs a compressed NumPy archive (.npz) file containing:
- ids: Array of sequence IDs from the FASTA file
- embeddings: Array of embeddings with shape:
  - (num_sequences, 896, 3072) without mean pooling
  - (num_sequences, 896) with mean pooling (--mean-pool flag)
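The shape change produced by --mean-pool can be illustrated with plain NumPy. This is a minimal sketch that uses small stand-in dimensions instead of the real 896 x 3072 output, and it assumes pooling is a simple mean over the last (embedding) axis; the framework's internal implementation may differ.

```python
import numpy as np

# Stand-in for model output: (num_sequences, positions, embedding_dim).
# Real outputs are (N, 896, 3072); smaller sizes keep the example light.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 8, 16)).astype(np.float32)

# Mean over the last axis: (N, positions, dim) -> (N, positions).
pooled = embeddings.mean(axis=-1)

print(embeddings.shape)  # (2, 8, 16)
print(pooled.shape)      # (2, 8)
```

Pooling this way keeps one value per output position, so the result is far smaller on disk but no longer carries the per-feature detail of the full embedding.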
# Basic usage with default settings
python -m retrieve_embeddings.cli \
--input-file test_files/test.fasta \
--output-file output/embeddings.npz
# With mean pooling for reduced dimensionality
python -m retrieve_embeddings.cli \
--input-file test_files/test.fasta \
--output-file output/embeddings.npz \
--mean-pool
# Using gap character padding
python -m retrieve_embeddings.cli \
--input-file test_files/test.fasta \
--output-file output/embeddings.npz \
--pad-value -
# Disable centering (sequences must be exactly 196608 bp)
python -m retrieve_embeddings.cli \
--input-file test_files/test.fasta \
--output-file output/embeddings.npz \
--no-center
You can load the saved embeddings in Python:
import numpy as np
# Load embeddings
data = np.load('output/embeddings.npz')
sequence_ids = data['ids']
embeddings = data['embeddings']
print(f"Loaded {len(sequence_ids)} sequences")
print(f"Embeddings shape: {embeddings.shape}") # (num_sequences, 896, 3072) or (num_sequences, 896)
print(f"Sequence IDs: {sequence_ids}")
You can also use the framework programmatically in Python:
from retrieve_embeddings import retrieve_embeddings_from_fasta, create_enformer_model
# Create model
model = create_enformer_model()
# Retrieve embeddings
embeddings, _, sequence_ids = retrieve_embeddings_from_fasta(
    fasta_path="test_files/test.fasta",
    model=model,
    center_sequences=True,
    window_size=196608,
    pad_value="N",
    mean_pool=False,
    save_path="output/embeddings.npz",
)
print(f"Processed {len(sequence_ids)} sequences")
print(f"Embeddings shape: {embeddings.shape}")
enformer-embeddings/
├── retrieve_embeddings/ # Main package
│ ├── __init__.py # Package initialization
│ ├── retrieve_embeddings.py # Core embedding functions
│ ├── cli.py # Command-line interface
│ └── util.py # Utility functions (FASTA parsing, validation)
├── test/ # Unit tests
│ ├── test_pre_embeddings.py # Tests for preprocessing functions
│ └── test_get_embeddings.py # Tests for embedding extraction
├── test_files/ # Example files
│ ├── test.fasta # Example FASTA input
│ └── embeddings.npz # Example output
├── pyproject.toml # Project dependencies and configuration
└── README.md # This file
The framework automatically validates sequences:
- Length check: Sequences longer than window_size will raise a ValueError
- Character validation: Only valid nucleotide characters (A, C, G, T, N, -) are allowed
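The two checks above can be sketched as a small standalone function. This is an illustrative approximation, not the framework's own validator: the function name validate_sequence, the exact error messages, and the decision to accept lowercase input are all assumptions.

```python
VALID_CHARS = set("ACGTN-")


def validate_sequence(seq: str, window_size: int = 196608) -> None:
    """Raise ValueError if seq is too long or contains invalid characters."""
    # Length check: anything longer than the window is rejected.
    if len(seq) > window_size:
        raise ValueError(
            f"Sequence length {len(seq)} exceeds window size {window_size}"
        )
    # Character check; upper-casing first is an assumption of this sketch.
    invalid = set(seq.upper()) - VALID_CHARS
    if invalid:
        raise ValueError(f"Invalid character(s): {sorted(invalid)}")


validate_sequence("ACGTN-", window_size=10)  # passes silently
```

Running a check like this over a FASTA file before inference can surface bad records early, before the model is loaded.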
By default, sequences are centered in a window of size 196608:
- Shorter sequences: Padded on both sides with the specified padding value
- Exact length: No padding needed
- Longer sequences: Raises an error (use --no-center only if sequences are already pre-processed to exact length)
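The centering behavior described above can be sketched in a few lines. This is a hypothetical helper (center_pad is not a name from the package), and placing the extra pad character on the right when the padding is odd is an assumption; the framework may split it differently.

```python
def center_pad(seq: str, window_size: int = 196608, pad_char: str = "N") -> str:
    """Center seq in a window of window_size, padding both sides."""
    if len(seq) > window_size:
        raise ValueError(
            f"Sequence length {len(seq)} exceeds window size {window_size}"
        )
    total = window_size - len(seq)
    left = total // 2      # floor half on the left
    right = total - left   # remainder on the right (assumption)
    return pad_char * left + seq + pad_char * right


print(center_pad("ACGT", window_size=10))  # NNNACGTNNN
```

A sequence that already matches the window size passes through unchanged, since both pad widths are zero.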
- N (default): Pads with 'N' characters, converted to index 4
- -: Pads with gap character '-', converted to index -1
- -1: Directly pads tensor with -1 index value
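To make the index conversion concrete, here is a minimal sketch of mapping a padded sequence to integer indices. The N to 4 and '-' to -1 mappings come from the padding options above; the A/C/G/T to 0..3 mapping and the helper name encode are assumptions, chosen to match the convention commonly used with Enformer tooling.

```python
import numpy as np

# N -> 4 and '-' -> -1 follow the padding options described above.
# A/C/G/T -> 0..3 is an assumption of this sketch.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "-": -1}


def encode(seq: str) -> np.ndarray:
    """Convert a nucleotide string into an array of integer indices."""
    return np.array([BASE_TO_INDEX[base] for base in seq.upper()], dtype=np.int64)


print(encode("NNACGT--").tolist())  # [4, 4, 0, 1, 2, 3, -1, -1]
```

Under this scheme, padding with '-' and padding directly with -1 yield the same index values; the difference is only whether padding happens on the string or on the tensor.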
Run the test suite to verify your installation:
pytest test/
Or run specific test files:
pytest test/test_pre_embeddings.py -v
If you encounter ValueError: Sequence length exceeds window size, ensure that:
- Sequences are at most 196608 base pairs long (or your specified window_size)
- Or use the --no-center flag if sequences are already pre-processed to exact length
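One way to find the offending records before running the CLI is to scan the FASTA file for over-length sequences. The sketch below uses a small pure-Python parser so it has no dependencies; iter_fasta and find_too_long are hypothetical helpers, not part of the package, and Biopython's SeqIO.parse could be used instead.

```python
from typing import Iterable, Iterator


def iter_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            seq_id, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def find_too_long(lines: Iterable[str], window_size: int = 196608) -> list[tuple[str, int]]:
    """Return (id, length) for every sequence longer than window_size."""
    return [(sid, len(seq)) for sid, seq in iter_fasta(lines) if len(seq) > window_size]


example = [">seq1 description", "ACGT", "ACGT", ">seq2", "AC"]
print(find_too_long(example, window_size=4))  # [('seq1', 8)]
```

Because the functions accept any iterable of lines, an open file handle can be passed directly, for example the handle returned by open("test_files/test.fasta").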
If you encounter ValueError: Invalid character, ensure that:
- FASTA sequences contain only valid nucleotides: A, C, G, T, N, or -
- Sequences are properly formatted (no special characters)
If you encounter import errors, make sure:
- The package is installed: uv pip install -e .
- You're in the correct virtual environment
- All dependencies are installed: uv sync
The framework will automatically use GPU if available (CUDA), otherwise it will fall back to CPU. Processing is slower on CPU but will work correctly.
- enformer-pytorch>=0.8.11: Enformer model implementation
- biopython>=1.83: FASTA file parsing
- pandas>=2.3.1: Data processing
- numpy: Numerical operations (included with PyTorch)
- torch: Deep learning framework (included with enformer-pytorch)