de_novo_binder_scoring

This repository contains the scripts and analysis described in:
Predicting Experimental Success in De Novo Binder Design: A Meta-Analysis of 3,766 Experimentally Characterised Binders

System requirements

OS: Tested on AlmaLinux 9.6 (Sage Margay). Other modern x86_64 Linux distributions may work but are untested.
CPU: All analysis scripts in this repo run on CPU. PyRosetta-based relaxation is CPU-bound and benefits from multiple cores for parallelization.
GPU (required for structure prediction): Structure prediction workflows (AF2 initial guess, ColabFold, Boltz, AF3) require a compatible GPU. Please see the respective repositories for exact GPU/CUDA requirements (linked below).
Environment: A Conda environment is provided (environment.yml).

Note: compute_rosetta_metrics.py and rmsd.py use PyRosetta, which requires a license.

Installation

Clone the repository and set up the environment (~30 min):

git clone https://github.com/DigBioLab/de_novo_binder_scoring.git
cd de_novo_binder_scoring

conda env create -f environment.yml
conda activate binder_scoring_env
chmod +x ./functions/DAlphaBall.gcc

Note: The structure prediction tools used here (AF2 initial guess, ColabFold, Boltz, and AF3) require separate installations.

Usage

1. Process inputs

Convert input PDBs into standardized inputs (run.csv, cleaned PDBs, and MSA FASTAs):

python ./scripts/process_inputs.py \
  --input_pdbs ./example_input/input_pdbs \
  --output_dir ./example_output

Binder is expected as chain A (A:no_msa by default).
All non-A chains are merged into chain B in the cleaned PDB for downstream analysis (inputs whose target chains are already merged also work).
Unique target sequences will get target IDs (target_1, target_2, …).
Outputs: run.csv, Binder_seq.fasta, and unique_msa/.
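
Because the binder is expected as chain A, it can help to check the chain IDs of an input PDB before running process_inputs.py. The following is a minimal sketch, not part of this repository: the helper names (ca_line, chain_ids, has_binder_chain) are hypothetical, and it only relies on the fixed-column PDB format, where the chain ID sits in column 22 of ATOM/HETATM records.

```python
def ca_line(serial: int, res: str, chain: str) -> str:
    """Build a minimal, column-aligned ATOM record for one CA atom (demo data only)."""
    return f"ATOM  {serial:>5}  CA  {res} {chain}{serial:>4}"


def chain_ids(pdb_text: str) -> list[str]:
    """Return chain IDs from ATOM/HETATM records, in order of first appearance."""
    seen: list[str] = []
    for line in pdb_text.splitlines():
        # In the fixed-column PDB format the chain ID is column 22 (index 21).
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chain = line[21]
            if chain not in seen:
                seen.append(chain)
    return seen


def has_binder_chain(pdb_text: str) -> bool:
    """True if chain A is present, matching the binder convention described above."""
    return "A" in chain_ids(pdb_text)


# Hypothetical three-chain complex: binder A plus two target chains.
example = "\n".join(
    [ca_line(1, "GLY", "A"), ca_line(2, "ALA", "B"), ca_line(3, "SER", "C")]
)

print(chain_ids(example))
print(has_binder_chain(example))
```

For a real file, the text can be read with pathlib.Path(path).read_text() and passed to the same functions.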

Sanitization of names

All names are sanitized automatically: uppercase letters are converted to lowercase, and every character other than a letter, digit, or underscore is replaced with _. Make sure that no two input files sanitize to the same name (e.g., abc.pdb, AbC.pdb, and abC.pdb all become abc.pdb).
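
Name collisions can be detected before processing. The sketch below re-implements the rule as stated above (lowercase, then replace anything outside a-z, 0-9, and _ with _); it is an assumption that this matches the repository's sanitization exactly, and the function names are hypothetical. It operates on file stems, i.e. names without the .pdb extension.

```python
import re
from collections import defaultdict


def sanitize(name: str) -> str:
    """Lowercase the name and replace every character outside [a-z0-9_] with '_'."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())


def find_collisions(stems: list[str]) -> dict[str, list[str]]:
    """Group input names by sanitized form and keep only groups with more than one member."""
    groups: dict[str, list[str]] = defaultdict(list)
    for stem in stems:
        groups[sanitize(stem)].append(stem)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical file stems; the first three collide after sanitization.
print(find_collisions(["abc", "AbC", "abC", "binder-1", "binder 2"]))
```

An empty result means every input keeps a unique name after sanitization under this rule.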

Input Modes

You can overwrite columns in run.csv for customization using the --mode flag:

python ./scripts/process_inputs.py --mode {pdb_only, seq_only_csv, hybrid}

pdb_only (default): sequences and other columns are automatically inferred from input PDBs.
seq_only_csv: sequences and other columns are taken only from a CSV. Useful if you do not have input PDBs.

Example:

python ./scripts/process_inputs.py \
  --mode seq_only_csv \
  --input_csv ./example_input/input_sequence_only.csv \
  --output_dir ./example_output_seq

hybrid: sequences are extracted from PDBs by default, but can be overwritten with CSV-specified sequences. To overwrite, the CSV must specify:
- target_chains – all target chains (including those not overwritten)
- target_subchain_X_seq – one column per chain to overwrite (e.g., target_subchain_D_seq)

Example:

python ./scripts/process_inputs.py \
  --mode hybrid \
  --input_pdbs ./example_input/input_pdbs \
  --input_csv ./example_input/input_overwrite.csv \
  --output_dir ./example_output_overwrite

Logging behavior: overwritten sequences are stored in pdb_extracted_trg_subch_{X}_not_used to preserve original PDB info.

Incorporating ions in target structure

Ions can be specified in the target structure; they are modelled only by AF3. To do this, run process_inputs.py in hybrid mode with a CSV column called ions_in_target that uses the following syntax: "[""CA""]" or "[""CA"",""CA"",""MG""]", etc.
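
The doubled quotes in the syntax above are what a CSV writer produces when a cell contains a JSON-style list. The sketch below shows one way to generate such a cell with the Python standard library rather than typing the escapes by hand. It assumes the cell is a JSON list of ion names, which is consistent with the syntax shown but not confirmed here; the binder_id value is hypothetical, and a real hybrid CSV may need further columns.

```python
import csv
import io
import json


def ions_cell(ions: list[str]) -> str:
    """Serialize ion names as a compact JSON list, e.g. ["CA","MG"]."""
    return json.dumps(ions, separators=(",", ":"))


# Write to an in-memory buffer; replace with open("input_overwrite.csv", "w", newline="")
# to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["binder_id", "ions_in_target"])
# "binder_1" is a placeholder ID used for illustration only.
writer.writerow(["binder_1", ions_cell(["CA", "CA", "MG"])])

print(buffer.getvalue())
```

The csv module quotes the cell and doubles the inner quotes automatically, so the second line of the output has the same form as the syntax shown above.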

2. Generate MSAs

Generate MSAs using the ColabFold server (MMseqs2). Requires a separate ColabFold installation. Only use the API if you are running a small number of sequences (e.g., target sequences only):

colabfold_batch ./example_output/unique_msa ./example_output/unique_msa/msa --msa-only

3. Prepare model inputs

Generate inputs for structure prediction models (AF3, Boltz, ColabFold):

python ./scripts/generate_model_inputs.py \
  --run-csv ./example_output/run.csv \
  --out-dir ./example_output

4. Relax input structures & compute Rosetta metrics (inspired by the BindCraft repo)

python ./scripts/compute_rosetta_metrics.py \
  --run-csv ./example_output/run.csv \
  --out-csv ./example_output/input_rosetta_metrics.csv \
  --folder input:./example_output/input_pdbs

5. AF2 initial guess

Run AF2 prediction on relaxed PDBs. Requires a separate AF2 initial guess installation:

predict.py \
  -pdbdir ./example_output/input_pdbs/relaxed_pdbs \
  -scorefilename out.sc \
  -outsilent af2.silent

6. Run ColabFold

Requires a separate ColabFold installation:

colabfold_batch ./example_output/ColabFold/input_folder ./example_output/ColabFold/ptm_output \
  --calc-extra-ptm --num-recycle 3 --num-models 3

7. Run Boltz

Requires a separate Boltz installation:

boltz predict ./example_output/Boltz/input_folder \
  --recycling_steps 10 \
  --diffusion_samples 3 \
  --write_full_pae \
  --out_dir ./example_output/Boltz

8. Run AF3

Requires a separate AF3 installation:

python run_alphafold.py \
  --input_dir=./example_output/AF3/input_folder \
  --model_dir=/path/to/alphafold3_weights \
  --db_dir=/path/to/alphafold3_database \
  --run_data_pipeline=False \
  --num_diffusion_samples=3 \
  --output_dir=./example_output/AF3/outputs

9. Extract confidence metrics

python ./scripts/extract_confidence_metrics.py \
  --run-csv ./example_output/run.csv \
  --out-dir ./example_output

10. Compute ipSAE and interface confidence metrics

python ./scripts/run_ipsae_batch.py \
  --run-csv ./example_output/run.csv \
  --out-csv ./example_output/ipsae_and_ipae.csv \
  --af3-dir ./example_output/AF3 \
  --boltz-dir ./example_output/Boltz \
  --colab-dir ./example_output/ColabFold \
  --ipsae-script-path ./scripts/ipsae_w_ipae.py

To extract ipSAE values for specific chain pairs, pass a comma-separated string of pairs:
--specific-chainpair-ipsae "A:D,A:B,A:C"

Several thresholds for the AF3 contact probability metrics can be specified in the same way:
--confidence-threshold "0.5,0.6,0.7,0.8,0.9"
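
Both arguments take comma-separated strings. The sketch below illustrates how strings in this format can be validated before launching a long batch run. It is not the parser used by run_ipsae_batch.py; the function names are hypothetical, and the check that thresholds lie between 0 and 1 is an assumption based on them being probability cutoffs.

```python
def parse_chain_pairs(spec: str) -> list[tuple[str, str]]:
    """Split a string such as "A:D,A:B,A:C" into (chain, chain) tuples."""
    pairs: list[tuple[str, str]] = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        left, sep, right = item.partition(":")
        if not sep or not left or not right:
            raise ValueError(f"expected CHAIN:CHAIN, got {item!r}")
        pairs.append((left, right))
    return pairs


def parse_thresholds(spec: str) -> list[float]:
    """Split a string such as "0.5,0.6" into floats, assuming values in [0, 1]."""
    values = [float(item) for item in spec.split(",") if item.strip()]
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"threshold out of range: {value}")
    return values


print(parse_chain_pairs("A:D,A:B,A:C"))
print(parse_thresholds("0.5,0.6,0.7,0.8,0.9"))
```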

11. Compute DockQ

python ./scripts/dockQ.py \
  --run-csv ./example_output/run.csv \
  --input-pdbs ./example_output/input_pdbs/ \
  --folder af3:./example_output/AF3/pdbs/ \
  --folder af2:./example_output/AF2/pdbs/ \
  --folder boltz:./example_output/Boltz/pdbs/ \
  --folder colab:./example_output/ColabFold/pdbs/ \
  --out-csv ./example_output/dockQ.csv

12. Compute Rosetta metrics for model PDBs (inspired by the BindCraft repo)

python ./scripts/compute_rosetta_metrics.py \
  --run-csv ./example_output/run.csv \
  --out-csv ./example_output/rosetta_metrics.csv \
  --folder af3:./example_output/AF3/pdbs/ \
  --folder af2:./example_output/AF2/pdbs/ \
  --folder boltz1:./example_output/Boltz/pdbs/ \
  --folder colab:./example_output/ColabFold/pdbs/

13. Compute RMSDs

python ./scripts/rmsd.py \
  --folder input:./example_output/input_pdbs/ \
  --folder af3:./example_output/AF3/pdbs/ \
  --folder af2:./example_output/AF2/pdbs/ \
  --folder boltz:./example_output/Boltz/pdbs/ \
  --folder colab:./example_output/ColabFold/pdbs/ \
  --out-csv ./example_output/rmsd.csv

14. Compute PyMOL metrics

Requires an installation of open-source PyMOL:

OUTPUT_DIR="$(pwd)/outputs"
PYMOL_DIR=$OUTPUT_DIR/pymol_files
mkdir -p "${PYMOL_DIR}"

# Create JSON file listing PDB directories
echo '{"input": "'$OUTPUT_DIR/input_pdbs'", "af2": "'$OUTPUT_DIR/AF2/pdbs'", "colab": "'$OUTPUT_DIR/ColabFold/pdbs'", "boltz1": "'$OUTPUT_DIR/Boltz/pdbs'", "af3": "'$OUTPUT_DIR/AF3/pdbs'"}' > "${PYMOL_DIR}/pdb_dirs.json"

# Run PyMOL analysis script
cd "${OUTPUT_DIR}"
python -m pymol -c -d "run ../scripts/pymol_metrics.py"
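
The JSON file written by the echo command above can also be produced from Python, which avoids shell quoting problems when paths contain spaces. This is a sketch under the assumption that pymol_metrics.py only needs the five keys shown above mapped to directory paths; write_pdb_dirs is a hypothetical helper, not a function in this repository.

```python
import json
import tempfile
from pathlib import Path


def write_pdb_dirs(output_dir: Path) -> Path:
    """Write pymol_files/pdb_dirs.json with the same keys as the shell command above."""
    pdb_dirs = {
        "input": output_dir / "input_pdbs",
        "af2": output_dir / "AF2" / "pdbs",
        "colab": output_dir / "ColabFold" / "pdbs",
        "boltz1": output_dir / "Boltz" / "pdbs",
        "af3": output_dir / "AF3" / "pdbs",
    }
    pymol_dir = output_dir / "pymol_files"
    pymol_dir.mkdir(parents=True, exist_ok=True)
    json_path = pymol_dir / "pdb_dirs.json"
    # Path objects are not JSON-serializable, so convert them to strings first.
    json_path.write_text(json.dumps({k: str(v) for k, v in pdb_dirs.items()}, indent=2))
    return json_path


# Demo in a temporary directory; pass your real output directory instead.
with tempfile.TemporaryDirectory() as tmp:
    written = write_pdb_dirs(Path(tmp))
    print(written.read_text())
```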

15. Merge all CSV metrics

This script merges all individual outputs into a single merged CSV. Alternatively, every script that writes a CSV output accepts the --update-runcsv flag, which merges its results directly into the main run.csv by binder_id; this can be inconvenient when a script is rerun.

# CONDA_PATH, LOG_DIR, and OUTPUT_DIR must be set before running this block
source "$CONDA_PATH"
conda activate binder_scoring_env
echo -e "\nMerging all metrics into merged_run.csv" >> "${LOG_DIR}/log.txt"

START_TIME=$(date +%s)

python ./scripts/merge_csvs.py \
  --run-csv "${OUTPUT_DIR}/run.csv" \
  --metric-csvs \
      "${OUTPUT_DIR}/input_rosetta_metrics.csv" \
      "${OUTPUT_DIR}/ipsae_and_ipae.csv" \
      "${OUTPUT_DIR}/dockQ.csv" \
      "${OUTPUT_DIR}/rosetta_metrics.csv" \
      "${OUTPUT_DIR}/rmsd.csv" \
      "${OUTPUT_DIR}/pymol_files/pymol_metrics_af3.csv" \
      "${OUTPUT_DIR}/pymol_files/pymol_metrics_input.csv" \
      "${OUTPUT_DIR}/pymol_files/pymol_metrics_boltz1.csv" \
      "${OUTPUT_DIR}/pymol_files/pymol_metrics_colab.csv" \
      "${OUTPUT_DIR}/pymol_files/pymol_metrics_af2.csv" \
      "${OUTPUT_DIR}/AF2/AF2_metrics.csv" \
      "${OUTPUT_DIR}/AF3/af3_metrics.csv" \
      "${OUTPUT_DIR}/Boltz/boltz_metrics.csv" \
      "${OUTPUT_DIR}/ColabFold/colab_metrics.csv" \
  --out-dir "${OUTPUT_DIR}"

END_TIME=$(date +%s)
echo "CSV merging completed in $((END_TIME - START_TIME)) seconds" >> "${LOG_DIR}/log.txt"
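
Conceptually, the merge attaches the columns of each metric CSV to the matching row of run.csv, keyed on binder_id. The sketch below illustrates that idea with the standard library only. It is not the implementation of merge_csvs.py; the function name, the column names other than binder_id, and the example values are all hypothetical.

```python
import csv
import io


def read_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_on_binder_id(run_rows, metric_tables):
    """Left-join every metric table onto the run rows using the binder_id column."""
    merged = {row["binder_id"]: dict(row) for row in run_rows}
    for table in metric_tables:
        for row in table:
            target = merged.get(row["binder_id"])
            if target is None:
                continue  # metric row without a matching design in run.csv
            for key, value in row.items():
                if key != "binder_id":
                    target[key] = value
    return list(merged.values())


# Hypothetical miniature inputs standing in for run.csv and two metric CSVs.
run_csv = "binder_id,target_id\nb1,target_1\nb2,target_1\n"
rmsd_csv = "binder_id,rmsd\nb1,1.2\n"
dockq_csv = "binder_id,dockq\nb1,0.8\nb2,0.3\n"

rows = merge_on_binder_id(read_rows(run_csv), [read_rows(rmsd_csv), read_rows(dockq_csv)])
for row in rows:
    print(row)
```

In this sketch, designs that are missing from a metric file simply lack that column, so a missing metric shows up as an absent value rather than an error.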

Full workflow example

See example_run.sh for a complete pipeline example, including environment loading/unloading. Running the full example with 3 structures, including all structure prediction models and relaxation of all input and output structures, takes ~40 min on an L40S GPU.

Analysis

The ./analysis folder contains all scripts and notebooks used to generate the analyses described in the paper, provided here for reproducibility.

Citation

If you use this code, please cite:

Predicting Experimental Success in De Novo Binder Design: A Meta-Analysis of 3,766 Experimentally Characterised Binders. bioRxiv (2025).
DOI: 10.1101/2025.08.14.670059v1

Additional citations

If you use any of the following tools or methods, please also cite:

- ColabFold (MSA generation and/or structure prediction): https://doi.org/10.1038/s41592-022-01488-1
- AF2 initial guess: https://doi.org/10.1038/s41467-023-38328-5
- Boltz-1: https://www.biorxiv.org/content/10.1101/2024.11.19.624167v4
- Boltz-2: https://www.biorxiv.org/content/10.1101/2025.06.14.659707v1
- AF3: https://doi.org/10.1038/s41586-024-07487-w
- ipSAE: https://doi.org/10.1101/2025.02.10.637595
- DockQ: https://doi.org/10.1093/bioinformatics/btae586
