A PyTorch project, organized as a reusable package, for predicting local RNA 3D geometry from sequence and multiple-sequence-alignment (MSA) features.
The core model, DistTransformer, predicts a 21 × 21 pairwise distance matrix for local RNA residue windows. Each input window contains 21 nucleotides and each nucleotide is represented by an 8-dimensional feature vector: 4 nucleotide one-hot features and 4 MSA profile-frequency features.
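The 8-dimensional per-residue encoding described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `encode_window`, the A/C/G/U channel order, and the all-zero one-hot row for unknown symbols are assumptions.

```python
import numpy as np

NUCLEOTIDES = "ACGU"  # assumed channel order; the project may use another


def encode_window(window_seq, window_profile):
    """Return an (L, 8) array: 4 one-hot channels followed by 4 MSA frequency channels.

    window_seq: nucleotide string of length L (21 in this project)
    window_profile: (L, 4) array of per-position A/C/G/U frequencies
    """
    profile = np.asarray(window_profile, dtype=np.float32)
    if profile.shape != (len(window_seq), 4):
        raise ValueError("profile must have shape (len(window_seq), 4)")
    one_hot = np.zeros((len(window_seq), 4), dtype=np.float32)
    for i, base in enumerate(window_seq.upper()):
        j = NUCLEOTIDES.find(base)
        if j >= 0:  # unknown or padding symbols keep an all-zero one-hot row
            one_hot[i, j] = 1.0
    return np.concatenate([one_hot, profile], axis=1)


seq = "GGGAAACUUCGGUUUCCCAAG"  # 21 nucleotides
uniform = np.full((21, 4), 0.25, dtype=np.float32)  # placeholder MSA profile
features = encode_window(seq, uniform)
print(features.shape)  # (21, 8)
```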
RNA molecules fold into 3D structures that determine their biological function. This project converts the original experimental notebook into a reusable machine-learning pipeline for local RNA 3D distance prediction.
The pipeline:
- Loads RNA sequences, residue-level 3D coordinates, and MSA FASTA files.
- Builds local 21-residue windows around each residue.
- Encodes every position as one-hot nucleotide identity plus MSA profile frequencies.
- Normalizes 3D coordinates and converts them into pairwise distance matrices.
- Trains a Transformer encoder with a pairwise MLP head.
- Reports RMSE, MAE, per-window errors, Pearson correlation, and diagnostic plots.
Fig. 1. End-to-end project architecture from RNA sequence, MSA profile, and 3D coordinate labels to local pairwise distance prediction and evaluation.
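The windowing and distance-matrix steps of the pipeline can be sketched like this. The function names are hypothetical, and two details are assumptions rather than facts from the project: coordinates are z-scored per axis over all residues, and windows that run past the chain ends are rejected rather than padded.

```python
import numpy as np


def standardize(coords):
    """Z-score coordinates per axis using statistics over all residues (assumed scheme)."""
    coords = np.asarray(coords, dtype=np.float64)
    return (coords - coords.mean(axis=0)) / coords.std(axis=0)


def window_distance_matrix(coords, center, radius=10):
    """Pairwise Euclidean distances for the window centered on `center`.

    Assumes the full window fits inside the chain; padding at chain ends
    is not handled in this sketch.
    """
    start, stop = center - radius, center + radius + 1
    if start < 0 or stop > len(coords):
        raise IndexError("window extends past the chain ends")
    w = coords[start:stop]                      # (2r+1, 3)
    diff = w[:, None, :] - w[None, :, :]        # (2r+1, 2r+1, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))    # (2r+1, 2r+1)


rng = np.random.default_rng(0)
coords = standardize(rng.normal(size=(50, 3)) * 20.0)  # synthetic stand-in for real labels
dist = window_distance_matrix(coords, center=25)
print(dist.shape)  # (21, 21)
```

With a radius of 10 the window covers 2 × 10 + 1 = 21 residues, which gives the 21 × 21 target matrix.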
rna-3d-folding-disttransformer/
├── configs/
│ └── default.yaml
├── data/
│ └── README.md
├── docs/
│ ├── github_push_guide.md
│ └── methodology.md
├── notebooks/
│ └── original_rna_3d_folding_experiment.ipynb
├── reports/
│ └── figures/ # README figures and architecture diagram
├── results/
│ ├── metrics_summary.csv
│ ├── metrics_summary.json
│ └── training_history_from_notebook.csv
├── scripts/
│ ├── evaluate.py
│ ├── plot_nussinov_example.py
│ └── train.py
├── src/
│ └── rna3d_folding/
├── tests/
├── requirements.txt
├── pyproject.toml
└── README.md
The dataset is not included because the files are large and should not be committed to GitHub.
Use the Kaggle dataset here:
Stanford RNA 3D Folding
https://www.kaggle.com/competitions/stanford-rna-3d-folding/data
Expected local layout:
data/
├── train_labels.v2.csv
├── train_sequences.xlsx # train_sequences.csv is also supported
└── MSA/
    ├── <target_id>.MSA.fasta
    └── ...
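As a rough sketch of how an MSA FASTA file could be turned into the four profile-frequency features, the snippet below parses aligned records and counts A/C/G/U per column. The helper names are hypothetical, and it assumes every record has the same aligned length, that gaps and other symbols are ignored, and that T is treated as U; the project's own loader may differ.

```python
from collections import Counter
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def msa_profile(aligned_seqs, alphabet="ACGU"):
    """Per-column frequencies over `alphabet`; gaps and other symbols are ignored."""
    length = len(aligned_seqs[0])
    profile = []
    for col in range(length):
        counts = Counter(s[col].upper().replace("T", "U") for s in aligned_seqs)
        total = sum(counts[b] for b in alphabet)
        profile.append([counts[b] / total if total else 0.0 for b in alphabet])
    return profile


# In-memory stand-in for data/MSA/<target_id>.MSA.fasta
demo = io.StringIO(">query\nGAUC\n>hit1\nGA-C\n>hit2\nGCUC\n")
seqs = [seq for _, seq in read_fasta(demo)]
profile = msa_profile(seqs)
print(profile[1])  # frequencies of A, C, G, U in the second column
```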
On Kaggle, you can also pass the mounted input directory directly:
python -m rna3d_folding.train \
--data-dir /kaggle/input/stanford-rna-3d-folding/stanford-rna-3d-folding \
--output-dir outputs
Create and activate a virtual environment:
python -m venv .venv
Windows:
.venv\Scripts\activate
macOS/Linux:
source .venv/bin/activate
Install the project:
pip install -e .
For development tools:
pip install -e ".[dev]"
Default configuration is stored in configs/default.yaml.
Run training:
python -m rna3d_folding.train --data-dir data --output-dir outputs
Run a faster smoke experiment:
python -m rna3d_folding.train \
--data-dir data \
--output-dir outputs/smoke_test \
--epochs 5 \
--sample-size 500
The training command saves:
outputs/
├── checkpoints/
│ └── best_model.pt
├── figures/
│ ├── training_curves.png
│ ├── rmse_distribution.png
│ └── true_vs_predicted.png
├── metrics.json
└── training_history.csv
Evaluate a saved checkpoint:
python -m rna3d_folding.evaluate \
--checkpoint outputs/checkpoints/best_model.pt \
--data-dir data \
--output-dir outputs/evaluation
Input: 21 × 8 feature matrix
↓
Linear projection: 8 → 64
↓
Learned positional embedding
↓
2-layer Transformer encoder
↓
Pairwise residue embedding concatenation
↓
MLP distance head
↓
Output: symmetric 21 × 21 pairwise distance matrix
Default hyperparameters:
| Parameter | Value |
|---|---|
| Window radius | 10 |
| Window length | 21 |
| Input features per residue | 8 |
| Transformer dimension | 64 |
| Attention heads | 8 |
| Transformer layers | 2 |
| Feed-forward dimension | 128 |
| Dropout | 0.1 |
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Weight decay | 1e-5 |
| Batch size | 64 |
| Sampled training windows | 8,000 |
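The architecture and default hyperparameters above can be sketched in PyTorch roughly as follows. This is an illustrative reconstruction, not the project's actual implementation: the class name `DistTransformerSketch`, the two-layer MLP head with ReLU, and the averaging step used to make the output symmetric are all assumptions.

```python
import torch
import torch.nn as nn


class DistTransformerSketch(nn.Module):
    """Hypothetical sketch: encoder over residues, then an MLP over residue pairs."""

    def __init__(self, window=21, in_dim=8, d_model=64, heads=8,
                 layers=2, ff_dim=128, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)                    # 8 -> 64
        self.pos = nn.Parameter(torch.zeros(1, window, d_model))  # learned positional embedding
        layer = nn.TransformerEncoderLayer(
            d_model, heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Assumed head shape: concatenated pair embedding -> hidden -> scalar distance
        self.head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, x):                                # x: (B, L, 8)
        h = self.encoder(self.proj(x) + self.pos)        # (B, L, d)
        length = h.size(1)
        hi = h.unsqueeze(2).expand(-1, -1, length, -1)   # (B, L, L, d)
        hj = h.unsqueeze(1).expand(-1, length, -1, -1)   # (B, L, L, d)
        d = self.head(torch.cat([hi, hj], dim=-1)).squeeze(-1)  # (B, L, L)
        return 0.5 * (d + d.transpose(1, 2))             # average with transpose (assumed)


model = DistTransformerSketch()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-5)
batch = torch.randn(64, 21, 8)   # one batch of random stand-in features
pred = model(batch)
print(pred.shape)  # torch.Size([64, 21, 21])
```

Averaging the raw head output with its transpose is one simple way to guarantee a symmetric matrix. It does not force a zero diagonal or non-negative values, so a real implementation may add further constraints.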
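The following results are preserved from the original notebook run. The experiment used an 8,000-window sampled subset, 21-residue windows, globally standardized coordinates, and an 80/20 train-test split. Because distances are computed from standardized coordinates, RMSE and MAE are reported in normalized units rather than ångströms.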
| Experiment | Final Train RMSE | Final Test RMSE | Final Train MAE | Final Test MAE | Test Window RMSE Mean | Test Window MAE Mean | Pearson Corr. |
|---|---|---|---|---|---|---|---|
| DistTransformer, 100 epochs | 0.0600 | 0.0611 | 0.0411 | 0.0401 | 0.0574 | 0.0401 | 0.8253 |
| DistTransformer, 500 epochs | 0.0469 | 0.0612 | 0.0335 | 0.0394 | 0.0569 | 0.0394 | 0.8248 |
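The 500-epoch run lowers training RMSE from 0.0600 to 0.0469, but test RMSE is essentially unchanged (0.0611 vs. 0.0612) and the Pearson correlation does not improve (0.8253 vs. 0.8248). This suggests the model had largely converged by 100 epochs and that further training mainly overfits the training windows.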
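For reference, the reported metrics could be computed along the lines below. The definitions are assumptions about what the notebook does: "window" metrics are taken to be per-window errors averaged over windows, and the Pearson correlation is taken over all flattened distance entries. The function name and the synthetic data are illustrative only.

```python
import numpy as np


def distance_metrics(pred, true):
    """pred, true: arrays of shape (n_windows, L, L)."""
    err = pred - true
    per_window_rmse = np.sqrt((err ** 2).mean(axis=(1, 2)))
    per_window_mae = np.abs(err).mean(axis=(1, 2))
    return {
        "rmse": float(np.sqrt((err ** 2).mean())),
        "mae": float(np.abs(err).mean()),
        "window_rmse_mean": float(per_window_rmse.mean()),
        "window_mae_mean": float(per_window_mae.mean()),
        "pearson": float(np.corrcoef(pred.ravel(), true.ravel())[0, 1]),
    }


rng = np.random.default_rng(0)
true = rng.random((5, 21, 21))                          # synthetic targets
pred = true + rng.normal(scale=0.05, size=true.shape)   # synthetic predictions
metrics = distance_metrics(pred, true)
print(metrics)
```

Under these definitions, the mean per-window MAE equals the overall MAE when every window has the same size, and the mean per-window RMSE can never exceed the overall RMSE. Both patterns are consistent with the table above.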
The README directly displays the main figures exported from the original notebook. These images are stored under reports/figures/, so they render automatically on GitHub.
Fig. 2. RMSE and MAE curves for the 500-epoch DistTransformer experiment.
Fig. 3. Test-window RMSE distribution for local RNA distance-matrix prediction.
Fig. 4. Predicted pairwise distances plotted against true normalized pairwise distances.
Fig. 5. Nussinov dynamic-programming arc diagram included as a lightweight secondary-structure baseline visualization.
Historical 100-epoch versions are also included for comparison:
- training_curves_100_epoch.png
- rmse_distribution_100_epoch.png
- true_vs_predicted_100_epoch.png
- nussinov_arc_diagram_100_epoch.png
The project includes a simple Nussinov dynamic-programming baseline for dot-bracket prediction and arc-diagram visualization.
Run:
python scripts/plot_nussinov_example.py
This saves:
outputs/figures/nussinov_example.png
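For readers unfamiliar with the algorithm, here is a compact sketch of Nussinov dynamic programming that returns a dot-bracket string. It is not the project's implementation: the minimum loop length of 3 and the inclusion of G-U wobble pairs are assumptions, and the function name is illustrative.

```python
def nussinov(seq, min_loop=3):
    """Maximize the number of nested base pairs; return a dot-bracket string."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]  # dp[i][j] = max pairs within seq[i..j]

    # Fill by increasing span; spans shorter than min_loop + 1 cannot hold a pair.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case 1: residue i is unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
                if (seq[i], seq[k]) in pairs:
                    right = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + dp[i + 1][k - 1] + right)
            dp[i][j] = best

    # Trace back one optimal structure with an explicit stack.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + min_loop + 1, j + 1):
            if (seq[i], seq[k]) in pairs:
                right = dp[k + 1][j] if k + 1 <= j else 0
                if dp[i][j] == 1 + dp[i + 1][k - 1] + right:
                    structure[i], structure[k] = "(", ")"
                    stack.append((i + 1, k - 1))
                    if k + 1 <= j:
                        stack.append((k + 1, j))
                    break
    return "".join(structure)


print(nussinov("GGGAAAUCC"))  # (((...)))
```

The table fill is O(n³) in sequence length, which is fine for short examples like the one plotted here.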
Run the included tests:
pytest -q
After unzipping this folder inside your empty repository directory:
git init
git branch -M main
git add .
git commit -m "Initial commit: add RNA 3D folding DistTransformer pipeline"
git remote add origin https://github.com/<your-username>/<your-repo-name>.git
git push -u origin main
Recommended repository description:
MSA-enhanced Transformer pipeline for RNA 3D local pairwise distance prediction using the Stanford RNA 3D Folding dataset.
Suggested GitHub topics:
rna-3d-folding deep-learning pytorch transformer bioinformatics computational-biology kaggle
- The repository intentionally excludes dataset files, trained checkpoints, and generated outputs through .gitignore.
- The original notebook is preserved under notebooks/ for traceability.
- The reusable implementation lives under src/rna3d_folding/.
- The project is suitable as a research-style GitHub portfolio project because it includes methodology, modular code, reproducible configuration, tests, figures, and reported results.




