Zero-Shot Protein-Ligand Binding Site Detection

Abstract

Accurate identification of protein–ligand binding residues is critical for mechanistic biology and drug discovery, yet performance varies widely across ligand families and data regimes. We present a systematic evaluation framework that stratifies ligands into three settings, overrepresented (many examples), underrepresented (tens of examples), and zero-shot (unseen at training). We developed a three-stage, sequence-based modeling suite that progressively adds ligand conditioning and zero-shot capability, and used an evaluation framework to assess the suite. Stage 1 trains per-ligand predictors using a pretrained protein language model (PLM). Stage 2 introduces ligand-aware conditioning via an embedding table, enabling a single multi-ligand model. Stage 3 replaces the table with a pretrained chemical language model (CLM) operating on SMILES, enabling zero-shot generalization. We show Stage 2 improves Macro F1 on the overrepresented test set from 0.4769 (Stage 1) to 0.5832 and outperforms sequence- and structure-based baselines. Stage 3 attains zero-shot performance (F1 = 0.3109) on 5612 previously unseen ligands while remaining competitive on represented ligands. Ablations across five PLM scales and multiple CLMs reveal larger PLM backbones consistently increase Macro F1 across all regimes, whereas scaling the CLM yields modest or inconsistent gains, which need further investigation.

News

🗓️ 25 Sept 2025 — 🎉 Our paper was accepted to the NeurIPS 2025 AI4Science workshop!
🗓️ 30 Sept 2025 — Our preprint has been published in bioRxiv.

Model Architecture

Stage 2: Multi-Ligand Model with Embedding Table

Stage 3: Zero-Shot Model with Chemical Language Model

Installation

Follow these steps on Ubuntu to set up a clean Python environment and run the installer script.

Prerequisites: Ubuntu 20.04+ with Python 3.9–3.11 and pip. GPU is optional.
Recommended: use a virtual environment to avoid conflicting packages.

# 1) Create and activate a virtual environment (in project root)
python3 -m venv .venv
source .venv/bin/activate

# 2) Make the installer executable and run it
chmod +x install.sh
./install.sh

Downloading Datasets

We provide datasets for the three evaluation stages described in the manuscript: overrepresented (ligands with >=100 samples), underrepresented (20-99 samples), and zero-shot (<20 samples). These datasets have been slightly modified for compatibility with the UNIMOL 2 chemical encoder (see manuscript Appendix for details). We also release the original unmodified datasets for future research.

Original unmodified datasets: link
Modified datasets: link

Inference

Run residue-level binding-site predictions from a trained checkpoint.

1. Prepare the inference workspace

cd inference
Download checkpoints into this directory:
- Stage 2 (embedding table): download
  Note: This checkpoint only supports 166 ligands, not the full 167 reported in the paper. This is because it was trained on an older version of the dataset.
- Stage 3 (chemical encoder): download
Place your input CSV under inference/data/. An example file is provided at inference/data/example_data.csv.

2. Configure inputs

Update inference/configs/config.yaml:

input_csv_path: path to your CSV (e.g., ./data/example_data.csv).
output_csv_path: output file for predictions (e.g., ./data/stage2_predictions.csv).
checkpoint_path: choose either ./stage2_checkpoint.pth or ./stage3_checkpoint.pth.
stage_3: set to false for Stage 2, true for Stage 3.
device_type: cuda or cpu.

Additional expectations:

The CSV must contain columns named ligand_name and protein_sequence (configurable via lig_name_col / prot_seq_col).
Stage 2 accepts only the 166 ligands seen during training; ensure ligand names match those in example_data.csv.
Stage 3 requires a SMILES column with valid strings for every row (zero-shot ligands are allowed).

3. Run inference

From the inference directory (with the environment activated):

python inference.py

The script loads the checkpoint, runs predictions with automatic mixed precision by default, and writes a CSV with:

predictions: list of 0/1 residue labels (prediction_threshold controls the cutoff).
positive_indices: residue indices predicted as binding.
binding_probabilities: per-residue probabilities (enabled by output_binding_probs).

4. Working with outputs

For the supplied example_data.csv, Stage 2 and Stage 3 can be run back-to-back by toggling stage_3 and checkpoint_path as described above.
To use custom data, ensure sequences are plain amino-acid strings and, for Stage 3, SMILES strings are present. Adjust prediction_threshold or post-process the probability column to suit your application.

Note: Inference has only been validated with mixed-precision enabled on NVIDIA GPUs. If you need pure CPU or full-precision runs, test carefully before relying on the outputs. The first run downloads the ESM backbone weights (used by both Stage 2 and Stage 3) into inference/hf_cache/; Stage 3 additionally fetches the MoLFormer chemical encoder. Subsequent runs reuse those files.

Results

Benchmarking the three Stages on the test sets. Macro F1 scores are based on the average of F1 for each type of ligand.

Method	Training Dataset	Overrepresented Macro F1	Underrepresented Macro F1	Zero-shot Macro F1	Zero-shot F1
Stage 1	Overrepresented (separated)	0.4769	-	-	-
Stage 2	Overrepresented	0.5826 ±0.0035	-	-	-
Stage 2	Overrepresented + Underrepresented	0.5832 ±0.0014	0.3752 ±0.0049	-	-
Stage 3	Overrepresented + Underrepresented	0.5526 ±0.0012	0.3603 ±0.0029	0.2338 ±0.0051	0.3109 ±0.0087

Comparison of our method’s best performance for each ligand with other available methods on selected ligands in the overrepresented test set based on F1 score. The main values are taken from the original papers, and ∗ indicates methods evaluated on our test sets.

Ligand	Stage 2 (Our)	Prot2Token	TargetS	LMetalSite	ZinCap	MIB2	Boltz-2x
Ca²⁺	0.6958 ±0.0011	0.6566∗	0.392∗	0.526 (0.7370∗)	-	-	0.380∗
Mg²⁺	0.5637 ±0.0036	0.4603∗	0.433∗	0.367 (0.5560∗)	-	-	0.339∗
Zn²⁺	0.8180 ±0.0017	0.7594∗	0.660∗	0.760 (0.8299∗)	0.451∗	-	0.557∗
Mn²⁺	0.7663 ±0.0113	0.7376∗	0.579∗	0.662 (0.8048∗)	-	-	0.419∗

📜 Citation

If you use this code or the pretrained models, please cite the following paper:

@article {Pourmirzaei2025.09.28.679103,
	author = {Pourmirzaei, Mahdi and Alqarghuli, Salhuldin and Chen, Kai and Pourmirzaei, Mohammadreza and Xu, Dong},
	title = {Zero-Shot Protein-Ligand Binding Site Prediction from Protein Sequence and SMILES},
	year = {2025},
	doi = {10.1101/2025.09.28.679103},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/09/30/2025.09.28.679103},
	journal = {bioRxiv}
}

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
data_processing		data_processing
inference		inference
single_ligand		single_ligand
src		src
zero_shot		zero_shot
.gitignore		.gitignore
README.md		README.md
install.sh		install.sh
test_clm.py		test_clm.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Zero-Shot Protein-Ligand Binding Site Detection

Abstract

News

Model Architecture

Stage 2: Multi-Ligand Model with Embedding Table

Stage 3: Zero-Shot Model with Chemical Language Model

Installation

Downloading Datasets

Inference

1. Prepare the inference workspace

2. Configure inputs

3. Run inference

4. Working with outputs

Results

📜 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Zero-Shot Protein-Ligand Binding Site Detection

Abstract

News

Model Architecture

Stage 2: Multi-Ligand Model with Embedding Table

Stage 3: Zero-Shot Model with Chemical Language Model

Installation

Downloading Datasets

Inference

1. Prepare the inference workspace

2. Configure inputs

3. Run inference

4. Working with outputs

Results

📜 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages