This repository contains the scripts and analysis described in:
Predicting Experimental Success in De Novo Binder Design: A Meta-Analysis of 3,766 Experimentally Characterised Binders
- OS: Tested on AlmaLinux 9.6 (Sage Margay). Other modern x86_64 Linux distributions may work but are untested.
- CPU: All analysis scripts in this repo run on CPU. PyRosetta-based relaxation is CPU-bound and benefits from multiple cores for parallelization.
- GPU (required for structure prediction): Structure prediction workflows (AF2 initial guess, ColabFold, Boltz, AF3) require a compatible GPU. Please see the respective repositories for exact GPU/CUDA requirements (linked below).
- Environment: A Conda environment is provided (
environment.yml).
Note: PyRosetta is used in compute_rosetta_metrics.py and rmsd.py, which requires a license.
Clone and set up the environment (~30min):
git clone https://github.com/DigBioLab/de_novo_binder_scoring.git
cd de_novo_binder_scoring
conda env create -f environment.yml
conda activate binder_scoring_env
chmod +x ./functions/DAlphaBall.gccNote: The used structure predictions tools (AF2 initial guess, ColabFold, Boltz and AF3) require seperate installations.
Convert input PDBs into standardized inputs (run.csv, cleaned PDBs, and MSA FASTAs):
python ./scripts/process_inputs.py \
--input_pdbs ./example_input/input_pdbs \
--output_dir ./example_output- Binder is expected as chain A (
A:no_msaby default). - Non-A chains are merged into B in PDB for downstream analysis (also works if target chains of input are already merged).
- Unique target sequences will get target IDs (
target_1,target_2, …). - Outputs:
run.csv,Binder_seq.fasta, andunique_msa/.
All names are automatically sanitized: lowercase letters, underscores, and numbers are allowed; all non-alphanumerics are replaced with _.
Be careful that your input files do not sanitize to the same name (e.g., abc.pdb, AbC.pdb, and abC.pdb all become abc.pdb).
You can overwrite columns in run.csv for customization using the --mode flag:
python ./scripts/process_inputs.py --mode {pdb_only, seq_only_csv, hybrid}pdb_only(default): sequences and other columns are automatically inferred from input PDBs.seq_only_csv: sequences and other columns are taken only from a CSV. Useful if you do not have input PDBs.
Example:
python ./scripts/process_inputs.py \
--mode seq_only_csv \
--input_csv ./example_input/input_sequence_only.csv \
--output_dir ./example_output_seq-
hybrid: sequences are extracted from PDBs by default, but can be overwritten with CSV-specified sequences. To overwrite, the CSV must specify:target_chains– all target chains (including those not overwritten)target_subchain_X_seq– one column per chain to overwrite (e.g.,target_subchain_D_seq)
Example:
python ./scripts/process_inputs.py \
--mode hybrid \
--input_pdbs ./example_input/input_pdbs \
--input_csv ./example_input/input_overwrite.csv \
--output_dir ./example_output_overwrite- Logging behavior: overwritten sequences are stored in
pdb_extracted_trg_subch_{X}_not_usedto preserve original PDB info.
It is possible to specify ions in the target structure which will only be modelled by AF3.
This is done by running the process inputs in the hybrid mode, and having a column called ions_in_target with the following syntax: "[""CA""]" or "[""CA"",""CA"",""MG""]" ect.
Generate MSAs using the ColabFold server (MMseqs2). Requires a separate ColabFold installation. Only use the API if you only running a couple of sequences (i.e. only target sequences):
colabfold_batch ./example_output/unique_msa ./example_output/unique_msa/msa --msa-onlyGenerate inputs for structure prediction models (AF3, Boltz, ColabFold):
python ./scripts/generate_model_inputs.py \
--run-csv ./example_output/run.csv \
--out-dir ./example_output4. Relax input structures & compute Rosetta metrics (inspired by the Bindcraft repo)
python ./scripts/compute_rosetta_metrics.py \
--run-csv ./example_output/run.csv \
--out-csv ./example_output/input_rosetta_metrics.csv \
--folder input:./example_output/input_pdbsRun AF2 prediction on relaxed PDBs. Requires a separate AF2 initial guess installation:
predict.py \
-pdbdir ./example_output/input_pdbs/relaxed_pdbs \
-scorefilename out.sc \
-outsilent af2.silentRequires a separate ColabFold installation:
colabfold_batch ./example_output/ColabFold/input_folder ./example_output/ColabFold/ptm_output \
--calc-extra-ptm --num-recycle 3 --num-models 3Requires a separate Boltz installation:
boltz predict ./example_output/Boltz/input_folder \
--recycling_steps 10 \
--diffusion_samples 3 \
--write_full_pae \
--out_dir ./example_output/BoltzRequires a separate AF3 installation:
python run_alphafold.py \
--input_dir=./example_output/AF3/input_folder \
--model_dir=/path/to/alphafold3_weights \
--db_dir=/path/to/alphafold3_database \
--run_data_pipeline=False \
--num_diffusion_samples=3 \
--output_dir=./example_output/AF3/outputspython ./scripts/extract_confidence_metrics.py \
--run-csv ./example_output/run.csv \
--out-dir ./example_outputpython ./scripts/run_ipsae_batch.py \
--run-csv ./example_output/run.csv \
--out-csv ./example_output/ipsae_and_ipae.csv \
--af3-dir ./example_output/AF3 \
--boltz-dir ./example_output/Boltz \
--colab-dir ./example_output/ColabFold \
--ipsae-script-path ./scripts/ipsae_w_ipae.pyThere is a possibility to extract specific chain pair ipSAE values:
use the argument
--specific-chainpair-ipsae "A:D,A:B,A:C"
which takes a string formated like the above.
It is also possible to specify several thressholds for the AF3 contact prob metrics --confidence-threshold "0.5,0.6,0.7,0.8,0.9" \
python ./scripts/dockQ.py \
--run-csv ./example_output/run.csv \
--input-pdbs ./example_output/input_pdbs/ \
--folder af3:./example_output/AF3/pdbs/ \
--folder af2:./example_output/AF2/pdbs/ \
--folder boltz:./example_output/Boltz/pdbs/ \
--folder colab:./example_output/ColabFold/pdbs/ \
--out-csv ./example_output/dockQ.csv 12. Compute Rosetta metrics for model PDBs (inspired by the Bindcraft repo)
python ./scripts/compute_rosetta_metrics.py \
--run-csv ./example_output/run.csv \
--out-csv ./example_output/rosetta_metrics.csv \
--folder af3:./example_output/AF3/pdbs/ \
--folder af2:./example_output/AF2/pdbs/ \
--folder boltz1:./example_output/Boltz/pdbs/ \
--folder colab:./example_output/ColabFold/pdbs/ \python ./scripts/rmsd.py \
--folder input:./example_output/input_pdbs/ \
--folder af3:./example_output/AF3/pdbs/ \
--folder af2:./example_output/AF2/pdbs/ \
--folder boltz:./example_output/Boltz/pdbs/ \
--folder colab:./example_output/ColabFold/pdbs/ \
--out-csv ./example_output/rmsd.csvRequires the open-source PyMOL installation:
OUTPUT_DIR="$(pwd)/outputs"
PYMOL_DIR=$OUTPUT_DIR/pymol_files
mkdir -p "${PYMOL_DIR}"
# Create JSON file listing PDB directories
echo '{"input": "'$OUTPUT_DIR/input_pdbs'", "af2": "'$OUTPUT_DIR/AF2/pdbs'", "colab": "'$OUTPUT_DIR/ColabFold/pdbs'", "boltz1": "'$OUTPUT_DIR/Boltz/pdbs'", "af3": "'$OUTPUT_DIR/AF3/pdbs'"}' > "${PYMOL_DIR}/pdb_dirs.json"
# Run PyMOL analysis script
cd $OUTPUT_DIR
python -m pymol -c -d "run ../scripts/pymol_metrics.py"This script will merge all the individual outputs into a merged csv:
all of the scripts that generate csv outputs can alternatively take the flag --update-runcsv which will merge results directly into the main csv by the "binder_id," but this can be anoying if reruning the script
source "$CONDA_PATH"
conda activate binder_scoring_env
echo -e "\nMerging all metrics into merged_run.csv" >> "${LOG_DIR}/log.txt"
START_TIME=$(date +%s)
python ./scripts/merge_csvs.py \
--run-csv "${OUTPUT_DIR}/run.csv" \
--metric-csvs \
"${OUTPUT_DIR}/input_rosetta_metrics.csv" \
"${OUTPUT_DIR}/ipsae_and_ipae.csv" \
"${OUTPUT_DIR}/dockQ.csv" \
"${OUTPUT_DIR}/rosetta_metrics.csv" \
"${OUTPUT_DIR}/rmsd.csv" \
"${OUTPUT_DIR}/pymol_files/pymol_metrics_af3.csv" \
"${OUTPUT_DIR}/pymol_files/pymol_metrics_input.csv" \
"${OUTPUT_DIR}/pymol_files/pymol_metrics_boltz1.csv" \
"${OUTPUT_DIR}/pymol_files/pymol_metrics_colab.csv" \
"${OUTPUT_DIR}/pymol_files/pymol_metrics_af2.csv" \
"${OUTPUT_DIR}/AF2/AF2_metrics.csv" \
"${OUTPUT_DIR}/AF3/af3_metrics.csv" \
"${OUTPUT_DIR}/Boltz/boltz_metrics.csv" \
"${OUTPUT_DIR}/ColabFold/colab_metrics.csv" \
--out-dir "${OUTPUT_DIR}"
END_TIME=$(date +%s)
echo "CSV merging completed in $((END_TIME - START_TIME)) seconds" >> "${LOG_DIR}/log.txt"See example_run.sh for a complete pipeline example including environment loading/unloading.
Running the full example with 3 structures; including all structure prediction models and relaxation of all input and output structures takes ~40min on a L40S.
The ./analysis folder contains all scripts and notebooks used to generate the analyses described in the paper, provided here for reproducibility.
If you use this code, please cite:
Predicting Experimental Success in De Novo Binder Design: A Meta-Analysis of 3,766 Experimentally Characterised Binders. bioRxiv (2025).
DOI: 10.1101/2025.08.14.670059v1
If you use any of the following tools or methods, please also cite:
-
ColabFold (MSA generation and/or structure prediction)
10.1038/s41592-022-01488-1 -
AF2 initial guess
https://doi.org/10.1038/s41467-023-38328-5 -
Boltz-1
https://www.biorxiv.org/content/10.1101/2024.11.19.624167v4 -
Boltz-2
https://www.biorxiv.org/content/10.1101/2025.06.14.659707v1