Seq2Pocket: From Sequence-Level pLM Predictions to 3D Structural Pockets

Seq2Pocket is a framework for protein ligand binding site (LBS) prediction that maps sequence-level predictions to 3D structural pockets. The method utilizes finetuned protein language model (pLM) to identify binding residues and restores pocket continuity through a two-step structural refinement process:

Embedding-Supported Smoothing: An additional classifier leverages latent embeddings of neighboring residues to fill in gaps and resolve spatial incompletion inherent in independent residue-wise predictions.
Structure-based Clustering: A surface-based clustering that utilizes Solvent Accessible Surface (SAS) points to group refined predictions into distinct, biologically relevant 3D pockets.

The Seq2Pocket Pipeline. Our framework takes residue-level probabilities from a finetuned pLM, applies an embedding-supported smoothing classifier to restore pocket continuity, and utilizes a surface-based clustering to define final 3D binding regions. Here, the finetuned pLM correctly identifies the Staurosporine inhibitor binding site on the human C-terminal Src kinase (PDB ID = 1bygA).

Availability

You can check out the model for cryptic binding site prediction using on this website: https://cryptoshow.cz/

Run Locally

Refer to the /tutorial folder for running the model locally.

Framework Components

The Seq2Pocket workflow is divided into three main stages:

pLM Fine-tuning: Scripts for fine-tuning ESM2-3B on ligand binding tasks.
Smoothing Classifier: A module that uses latent embeddings of a residue and its neighbors to decide if additional residues should be included in a pocket, improving structural completeness.
SAS Clustering: An optimized clustering approach using Solvent Accessible Surface (SAS) points to group predicted residues into distinct 3D pockets.

Repository Structure

The source code is located in the src/ folder, which contains the following subfolders:

data-extraction: Scripts for processing dataset files.
pLM-training: Training scripts for finetuning the ESM2-3B models for General Binding Site (GBS) prediction.
smoothing-classifier: Implementation of the embedding-supported smoothing classifier, including training logic.
evaluation: Benchmarking scripts that utilize out-of-the-box scikit-learn clustering algorithms (e.g., DBSCAN, Mean Shift).
clustering: Implementation of the proposed clustering methodology: MeanShift applied to Solvent Accessible Surface (SAS) points.
stats: Calculation of data and figures for the manuscript.
visualizations: Scripts for visualizing pockets in pyMOL.

The enhancement of the sc-PDB dataset is implemented in separate branch named 'scPDB_enhancement' in the CryptoBench repository.

Installation

To run the scripts, it might be useful to install the packages specified in requirements.txt.

To generate SAS points, update your biopython's SASA.py with the updated version from this repository (SASA.py).

Data and Materials

scPDB-Enhanced: Zenodo repository
- An updated version of the scPDB dataset containing experimentally observed binding sites and ions omitted in the original release.
Data and model weights: Storage link

Contact us

If you have any questions regarding the usage of the framework, comparing your method against the benchmark, or if you have any suggestions, please feel free to contact us by raising an issue!

How to cite

If you find our work useful, please cite the paper:

Vít Škrhák, Lukáš Polák, Marian Novotný, and David Hoksza. 2026. Seq2Pocket: Augmenting protein language models for spatially consistent binding site prediction. bioRxiv. https://doi.org/10.64898/2026.01.28.702257

or, if you prefer the BibTeX format:

@article {2026.01.28.702257,
	author = {Škrhák, Vít and Polák, Lukáš and Novotný, Marian and Hoksza, David},
	title = {Seq2Pocket: Augmenting protein language models for spatially consistent binding site prediction},
	elocation-id = {2026.01.28.702257},
	year = {2026},
	doi = {10.64898/2026.01.28.702257},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {Protein-ligand binding site prediction (LBS) is important for many domains including computational drug discovery, where, as in other tasks, protein language models (pLMs) have shown a great promise. In their application to LBS, the pLM classifies each amino acid as binding or not. Subsequently, for the purposes of downstream analysis, these predictions are mapped onto the structure, forming structure-continuous pockets. However, their residue-oriented nature often results in spatially fragmented predictions. We present a comprehensive framework (Seq2Pocket) that addresses this by combining finetuned pLM with an embedding-supported smoothing classifier and an optimized clustering strategy. While finetuning on our enhanced scPDB dataset yields state-of-the-art results, outperforming existing predictors by up to 11\% in DCC recall, the smoothing classifier restores pocket continuity. Next, we introduce the Pocket Fragmentation Index (PFI) and use it to select a clustering approach that preserves a consistent mapping between predictions and ground-truth pockets. Validated on the LIGYSIS and CryptoBench benchmarks, our approach ensures that pLM-based predictions are not only statistically accurate but also useful for downstream drug discovery, while maintaining state-of-the-art performance.Competing Interest StatementThe authors have declared no competing interest.},
	URL = {https://www.biorxiv.org/content/early/2026/01/31/2026.01.28.702257},
	eprint = {https://www.biorxiv.org/content/early/2026/01/31/2026.01.28.702257.full.pdf},
	journal = {bioRxiv}
}

License

This source code is licensed under the MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
img		img
src		src
tutorial		tutorial
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SASA.py		SASA.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Seq2Pocket: From Sequence-Level pLM Predictions to 3D Structural Pockets

Availability

Run Locally

Framework Components

Repository Structure

Installation

Data and Materials

Contact us

How to cite

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Seq2Pocket: From Sequence-Level pLM Predictions to 3D Structural Pockets

Availability

Run Locally

Framework Components

Repository Structure

Installation

Data and Materials

Contact us

How to cite

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages