SEI Framework

SEI (Sequence-to-Effect Inference) is a deep learning framework for predicting chromatin profiles and sequence classes from DNA sequences. This framework allows you to extract embeddings from DNA sequences using the pre-trained SEI model.

Overview

The SEI framework provides tools to:

Extract embeddings from FASTA sequences using the pre-trained SEI model
Predict chromatin profiles and sequence classes from DNA sequences
Process sequences in batches for efficient inference
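
Batched processing can be pictured with a short sketch. The helpers below are illustrative only: read_fasta and batched are hypothetical names, not functions from this repository, and the actual script may parse and batch sequences differently.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the ID is the first whitespace-delimited header token.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Group items into lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With a file on disk, this could be used as `for batch in batched(read_fasta(open(path)), 32): ...`, so only one batch of sequences is held in memory at a time.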

Prerequisites

Python 3.9 or higher
uv package manager (install from https://github.com/astral-sh/uv)
CUDA-capable GPU (optional, but recommended for faster inference)

Environment Setup

1. Install uv (if not already installed)

# Install uv using pip
pip install uv

2. Create Virtual Environment and Install Dependencies

Run the following commands to create the virtual environment, install the dependencies, and activate the environment:

uv sync
source .venv/bin/activate

3. Download the SEI Model

The trained SEI model needs to be downloaded from Zenodo. You can use the provided download script:

bash download_data.sh

This will download:

The trained SEI model (sei_model.tar.gz) from Zenodo
SEI framework resources (FASTA files) from Zenodo

Note: The model files should be extracted to the model/ directory. The main model file should be at model/sei.pth.

Alternatively, you can manually download from:

Model: https://zenodo.org/record/4906996 (DOI: 10.5281/zenodo.4906996)
Extract the model files to the model/ directory

Or just run the environment setup script:

bash env_setup.sh
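
For the manual download route, extraction and the final check can be sketched in Python. This is a minimal sketch under stated assumptions: the archive name sei_model.tar.gz comes from the list above, but the layout inside the archive is not documented here, so the check searches model/ recursively for sei.pth. The function names extract_archive and find_weights are hypothetical.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_archive(archive: Path, dest: Path) -> None:
    """Extract a .tar.gz archive into dest after a basic path-safety check."""
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            target = (root / member.name).resolve()
            # Refuse members that would be written outside the destination.
            if target != root and root not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(dest)


def find_weights(model_dir: Path, filename: str = "sei.pth") -> List[Path]:
    """Return every file named `filename` found below model_dir."""
    return sorted(model_dir.rglob(filename))


if __name__ == "__main__":
    archive = Path("sei_model.tar.gz")  # assumed download location
    if archive.exists():
        extract_archive(archive, Path("model"))
    matches = find_weights(Path("model"))
    print("Found weights:" if matches else "No sei.pth found", *matches)
```

If the weights turn up in a subdirectory rather than at model/sei.pth, either move the file or pass its location with --model-path.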

Usage

Extract Embeddings from FASTA Sequences

The main script for extracting embeddings is retrieve_embeddings/retrieve_embeddings.py.

Basic Usage

python -m retrieve_embeddings.retrieve_embeddings \
    --input-file retrieve_embeddings/test.fasta \
    --output-file output/embeddings.npz

Full Command with All Options

python retrieve_embeddings/retrieve_embeddings.py \
    --input-file <path-to-input.fasta> \
    --output-file <path-to-output.npz> \
    --model-path model/sei.pth \
    --batch-size 32 \
    --sequence-length 4096 \
    --use-hooks

Command-Line Arguments

--input-file (required): Path to input FASTA file containing DNA sequences
--output-file (required): Path to output .npz file where embeddings will be saved
--model-path (optional): Path to SEI model file (default: model/sei.pth)
--batch-size (optional): Batch size for processing sequences (default: 32)
--sequence-length (optional): Target sequence length for encoding (default: 4096)
--use-hooks (optional, enabled by default): Use the register_hooks method for embedding extraction (recommended)
--no-use-hooks (optional): Use the manual layer-by-layer method instead of hooks
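
The options above map naturally onto Python's argparse module. The parser below is a sketch that reproduces the documented flags and defaults; it is not copied from retrieve_embeddings.py, and build_parser is a hypothetical name. The paired --use-hooks / --no-use-hooks flags are modelled with argparse.BooleanOptionalAction, which is available from Python 3.9.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line interface."""
    parser = argparse.ArgumentParser(
        description="Extract SEI embeddings from a FASTA file."
    )
    parser.add_argument("--input-file", required=True,
                        help="Path to input FASTA file")
    parser.add_argument("--output-file", required=True,
                        help="Path to output .npz file")
    parser.add_argument("--model-path", default="model/sei.pth",
                        help="Path to SEI model file")
    parser.add_argument("--batch-size", type=int, default=32,
                        help="Batch size for processing sequences")
    parser.add_argument("--sequence-length", type=int, default=4096,
                        help="Target sequence length for encoding")
    # Generates both --use-hooks and --no-use-hooks; hooks are on by default.
    parser.add_argument("--use-hooks", action=argparse.BooleanOptionalAction,
                        default=True,
                        help="Use hooks for embedding extraction")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```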

Output Format

The script outputs a compressed NumPy archive (.npz) file containing:

ids: Array of sequence IDs from the FASTA file
embeddings: Array of embeddings with shape (num_sequences, 960, 16)
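
To make the layout concrete, the sketch below writes an archive with the same two keys using np.savez_compressed. It illustrates the file format only and is not the script's own writer: save_embeddings is a hypothetical helper, and the float32 dtype is an assumption.

```python
import numpy as np


def save_embeddings(path, ids, embeddings):
    """Write ids and embeddings to a compressed .npz archive."""
    ids = np.asarray(ids, dtype=str)
    # Assumption: embeddings are stored as float32.
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if embeddings.ndim != 3 or embeddings.shape[0] != ids.shape[0]:
        raise ValueError("expected one (channels, bins) embedding per sequence ID")
    np.savez_compressed(path, ids=ids, embeddings=embeddings)


if __name__ == "__main__":
    # Placeholder data with the documented per-sequence shape (960, 16).
    save_embeddings("example_embeddings.npz", ["seq1", "seq2"],
                    np.zeros((2, 960, 16)))
```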

Example

# Extract embeddings using the default method (hooks)
python retrieve_embeddings/retrieve_embeddings.py \
    --input-file retrieve_embeddings/test.fasta \
    --output-file output/embeddings.npz

# Extract embeddings using manual method
python retrieve_embeddings/retrieve_embeddings.py \
    --input-file retrieve_embeddings/test.fasta \
    --output-file output/embeddings.npz \
    --no-use-hooks

Loading Embeddings

You can load the saved embeddings in Python:

import numpy as np

# Load embeddings
data = np.load('output/embeddings.npz')
sequence_ids = data['ids']
embeddings = data['embeddings']

print(f"Loaded {len(sequence_ids)} sequences")
print(f"Embeddings shape: {embeddings.shape}")  # (num_sequences, 960, 16)
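
Many downstream tools, such as PCA, expect one vector per sequence rather than a (960, 16) matrix. The helpers below sketch two common reductions and an ID lookup. They are illustrative and not part of this repository; the function names are hypothetical, and averaging over the last axis is just one possible choice.

```python
import numpy as np


def flatten_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Reshape (n, 960, 16) embeddings to (n, 15360) feature vectors."""
    return embeddings.reshape(embeddings.shape[0], -1)


def pool_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Average over the last axis, giving one (960,) vector per sequence."""
    return embeddings.mean(axis=-1)


def index_by_id(ids, embeddings) -> dict:
    """Map each sequence ID to its embedding for lookup by name."""
    return {str(seq_id): emb for seq_id, emb in zip(ids, embeddings)}


if __name__ == "__main__":
    # Random placeholder data standing in for a loaded archive.
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(3, 960, 16)).astype(np.float32)
    print(flatten_embeddings(demo).shape, pool_embeddings(demo).shape)
```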

Project Structure

sei-framework/
├── model/                # SEI model files
│   ├── sei.py            # Model architecture
│   ├── sei.pth           # Trained model weights (download required)
│   └── *.names           # Target and sequence class names
├── retrieve_embeddings/  # Embedding extraction scripts
│   ├── retrieve_embeddings.py  # Main embedding extraction script
│   ├── util.py           # Utility functions for embeddings
│   └── test.fasta        # Example input file
├── encode/               # Sequence encoding utilities
├── pca/                  # PCA analysis tools
├── tests/                # Unit tests
├── env_setup.sh          # Environment setup script
├── download_data.sh      # Model download script
└── pyproject.toml        # Project dependencies

Testing

Run the test suite to verify your installation:

pytest tests/

Troubleshooting

Model File Not Found

If you encounter FileNotFoundError: Model file not found: model/sei.pth, make sure you have:

Downloaded the model using bash download_data.sh
Extracted the model files to the model/ directory
Verified that model/sei.pth exists

The framework will automatically use CPU if CUDA is not available.
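
The CPU fallback mentioned above is commonly written as follows in PyTorch code. This is a generic sketch, not the framework's own code: select_device is a hypothetical helper, and the guarded import exists only so the snippet also runs where PyTorch is not installed.

```python
def select_device() -> str:
    """Return "cuda" when a CUDA device is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Inference would run on: {select_device()}")
```

Running this before a long extraction job is a quick way to confirm whether the GPU is visible to PyTorch.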
