Skip to content

jsilter/premval

Repository files navigation

PREMVAL — Protein Ensemble Evaluation

A neutral, reproducible benchmark and leaderboard for protein conformational ensemble generators (AlphaFlow/ESMFlow, BioEmu, ESMDiff, ConfDiff, Str2Str, …), scored on the ATLAS molecular-dynamics dataset with a single fixed metric panel.

Live leaderboard: https://premval--web.modal.run/

The field moves fast and every paper benchmarks its own model, on its own systems, with its own metrics; there is no neutral cross-model comparison. PREMVAL is that comparison: one metric panel, one submission format, one reference dataset, applied identically to every model.

Its differentiator is per-model held-out labels. AlphaFlow-MD and ESMFlow-MD fine-tune on the ATLAS train split and are scored on the held-out test split, but that split is temporal only (no sequence-homology filter), so the guarantee is weak; BioEmu never trains on ATLAS at all; ESMDiff's training relationship to ATLAS is unclear. Models trained on ATLAS can score better partly because the test split resembles their training data, and no individual paper surfaces this. PREMVAL tags every model held_out / weak_holdout / uncertain so the leaderboard is read with that context, not without it.

The scoring library is CPU-only and free of GPU dependencies: it ingests already-generated ensembles (downloaded or produced by the GPU harnesses in inference/) and re-runs the same metrics on all of them.

Models & contamination labels

Every evaluated model is tagged by how strong its held-out guarantee is on the ATLAS evaluation, because a model whose test data resembles its training set holds an advantage that raw scores hide:

  • held_out (green) — never trained on ATLAS or MD data (PDB-only, zero-shot, or training filtered away from the test proteins). A genuine held-out evaluation.
  • weak_holdout (amber) — fine-tuned on the ATLAS train split and scored on the held-out test split, but the split is temporal only (no sequence-homology filter), so test homologs may resemble training data. A real but weak held-out guarantee; read these scores with that caveat.
  • uncertain (grey) — the training corpus's relationship to ATLAS is not established (a broad MD/structure corpus, or a fine-tune of a model whose pretraining overlap is unclear).

Models currently on the leaderboard:

Model What it is Label
AlphaFlow-MD (base, distilled) AlphaFold2 fine-tuned with flow matching on ATLAS MD (Jing et al., ICML 2024) weak_holdout
ESMFlow-MD (base, distilled) ESMFold fine-tuned with flow matching on ATLAS MD (Jing et al., ICML 2024) weak_holdout
BioEmu Equilibrium-ensemble emulator; not trained on ATLAS, training filtered to <40% sequence identity to test proteins (Lewis et al., 2024) held_out
ESMDiff ESM3 fine-tuned with masked diffusion over discrete structure tokens (Lu et al., ICLR 2025) uncertain

The leaderboard's badge labels are defined in src/premval/models.py; per-model evidence and labels for additional run-yourself models (Str2Str, ConfDiff, the PDB-trained flow variants, which are held_out) are recorded in data/contamination_labels.yaml and wired up in inference/.

The metric panel

Ported from AlphaFlow's evaluation scripts (Jing et al., ICML 2024) so numbers line up with the literature, plus the raw RMWD components. Every metric compares a 250-frame submission ensemble against the ATLAS MD reference for the same chain:

Metric (JSON key) What it measures
rmsf_pearson Per-residue Cα flexibility (RMSF) correlation with MD (higher better)
rmwd Root-mean Wasserstein distance between per-atom Gaussian fits (lower)
emd_mean_rms / emd_var_rms RMWD split into mean-displacement and covariance-mismatch components
md_pca_w2 2-Wasserstein distance in the MD-fit PCA basis (lower better)
weak_contacts_jaccard Jaccard overlap of weak (transiently-broken) Cα–Cα contacts (higher)
transient_contacts_jaccard Jaccard overlap of transiently-formed Cα–Cα contacts (higher)

The leaderboard summarizes each quality metric by its mean across a split, and per-chain inference wall time by its median (runtime is dominated by sequence length, so it is heavily right-skewed).

Submission format

One ensemble per chain, as a single multi-model PDB with exactly 250 frames, named {chain}.pdb (e.g. 6cka_B.pdb). The reference splits are ATLAS: 39 val chains and 82 test chains.

Install

Python 3.12+. Editable install with dev tools:

pip install -e ".[dev]"

Optional extras: viz (matplotlib), viz-pymol (PyMOL renderer), web (FastAPI dashboard). The core scoring path needs none of them.

Quickstart

# 1. Download ATLAS reference bundles into the local cache (~/.cache/premval).
premval fetch                      # val split; --chains ... for specific chains

# 2a. Ingest a model that publishes its ATLAS ensembles (no GPU).
premval ingest --model alphaflow_pdb_base
# 2b. ...or generate one yourself on a GPU; see inference/README.md.

# 3. Precompute the reference-observables cache, then batch-score a split.
premval prepare-refs
premval score-all --split val      # writes results/{model}.json

# 4. Score a single submission ad hoc.
premval score --chain 6cka_B --submission ensemble.pdb

# 5. Serve the dashboard locally (requires the [web] extra).
premval serve --port 8000

premval --help (and premval <command> --help) documents every subcommand.

Repository layout

Path What it holds
src/premval/ The CPU-only scoring library + CLI (metrics/, data/, scoring.py, leaderboard.py, web/)
inference/ Run-yourself GPU harnesses for models PREMVAL can't just download (see its README)
results/ Committed per-model scores ({model}.json); the leaderboard reads these
data/ ATLAS split lists and contamination_labels.yaml
inference/web_modal.py Deploys the dashboard as a Modal ASGI app (the live premval--web.modal.run)
tests/ Pytest suite (run before every change)

The dependency arrow only points one way: inference/ imports premval, never the reverse, which keeps the installable package GPU-free.

Development

pytest               # full suite
ruff check . && ruff format .
mypy                 # strict mode over src + tests

mypy --strict and the lint rules (E, F, I, W, B, UP, line length 100) must pass. See CODING_STANDARDS.md for the conventions.

Contributing a model

Score your ensemble into results/{model}.json (via premval ingest + premval score-all, or a GPU harness in inference/), add a row to data/contamination_labels.yaml and a display entry to src/premval/models.py, and open a PR. The leaderboard auto-discovers any model with a committed results file.

License & attribution

PREMVAL is released under the MIT License. The metric panel is ported from AlphaFlow (Jing et al., "AlphaFold Meets Flow Matching for Generating Protein Ensembles," ICML 2024). Each model evaluated here carries its own upstream weights license (documented in inference/README.md); ESMDiff in particular depends on the non-commercial, gated ESM3 weights. ATLAS reference data is distributed by its authors under their own terms.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors