TCoulth/dynamicMPNN

DynamicMPNN

This package is an adaptation of LigandMPNN (and ProteinMPNN) that adds a new biasing method. Biases can be applied dynamically, meaning the biases applied during sequence inference adjust to the sequence decoded so far. This allows design towards new properties, such as a target pI.

The full instructions for all LigandMPNN options can be found in the LigandMPNN repository (https://github.com/dauparas/LigandMPNN). Descriptions of the additional constraints, as well as some computational examples, follow below.

Covered in this README are:

How to use Dynamic Constraints

Targeting pI

| Option | Default | Description |
|---|---|---|
| pI_target | None | Target pI. Required to run. Values between 0 and 14. |
| pI_strength | 0.5 | Overall strength of the bias. Consider increasing it for extreme target pIs; 0.65 is viable. Sequence quality decreases as the value approaches 1.0. |
| pI_urgency | 1.0 | Modifier that increases the bias strength as the positions left to decode run out. The default value of 1 scales the bias from 1x at the start to 2x by the last decoded residue. Not extensively tested. |
| pI_dead_zone | 0.0 | Removes the bias when the current pI is within this range of the target pI. Tested values showed no major effect. |

For the first run, I'd recommend keeping all defaults and just entering a target pI.

python run.py --pdb_path my_protein.pdb --out_folder output/ --pI_target 7.0 --number_of_batches 25

If you aren't hitting your target pI, or it is a more extreme target, increase the strength option.

python run.py --pdb_path my_protein.pdb --out_folder output/ --pI_target 11.0 --pI_strength 0.65 --number_of_batches 25

Surface Patches

| Option | Default | Description |
|---|---|---|
| surface_patch_type | None | Options are positive, negative, or hydrophobic. Biases (KR), (DE), and (FWYLIVM) respectively. Required to run. |
| surface_patch_center | None | Residue number (1-indexed) to center the patch search on. Optional; if omitted, the search is global. |
| surface_patch_center_jitter | 20.0 | Angstrom radius around the patch center within which the true center for each batch is randomly sampled. Allows some variation in targeting to escape a bad seed residue. A value of 0 forces the designated patch center to be the center in every run. |
| surface_search_radius | 12.0 | Angstrom radius around the true batch center that can form the surface patch. Only residues within this distance can receive a bias. |
| surface_patch_lock | False | Setting this to True locks all residues outside the search radius, so only residues in the patch search area are decoded by ProteinMPNN. Useful when you need to make minimal modifications to your protein, for instance to try to maintain the function of an enzyme. |

Build a positive surface patch anywhere on the protein surface:

python run.py --pdb_path my_protein.pdb --out_folder output/ --surface_patch_type positive --number_of_batches 25

Build a positive surface patch near a specific surface residue:

python run.py --pdb_path my_protein.pdb --out_folder output/ --surface_patch_type positive --surface_patch_center 33 --surface_search_radius 16 --number_of_batches 25

Build a large negative surface patch very close to a specific surface residue:

python run.py --pdb_path my_protein.pdb --out_folder output/ --surface_patch_type negative --surface_patch_center 33 --surface_search_radius 20 --surface_patch_center_jitter 12 --number_of_batches 25

Note for surface patches: all sequences within a batch share the same center and decoding order, so you may want to run more batches for higher variety.

Below are options less likely to be adjusted:

| Option | Default | Description |
|---|---|---|
| surface_patch_tiers | 5.5:1.0,7.5:0.6,10.0:0.3 | The distance tiers for the applied bias, with their associated strengths. Residues (by C-beta distance) under 5.5A get 1.0, under 7.5A get 0.6, and under 10A get 0.3. Tiers can be adjusted, removed, or added. |
| surface_patch_seeding_bias | 0.5 | The bias applied to eligible residues that are decoded before any residue of the correct type has been placed, so the first patch residue to be decoded gets this strength. The value doubles until a correct type is decoded. |
| surface_patch_max_bias | 1.5 | The upper limit on the bias value a position can have. Can be increased to allow even stronger biases. |
| surface_sasa_threshold | 0.20 | Relative SASA cutoff for designating surface residues. The effect of varying it has not been extensively tested. |
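
As a rough illustration of how these options could fit together, the sketch below parses a tier string in the `cutoff:strength` format shown above and models the doubling seeding bias. The function names are hypothetical and are not taken from the dynamicMPNN code. Capping the seeding bias at `surface_patch_max_bias` is an assumption based on the option description.

```python
def parse_tiers(spec):
    """Parse a string like '5.5:1.0,7.5:0.6,10.0:0.3' into sorted (cutoff, strength) pairs."""
    tiers = []
    for item in spec.split(","):
        cutoff, strength = item.split(":")
        tiers.append((float(cutoff), float(strength)))
    return sorted(tiers)


def tier_strength(distance, tiers):
    """Return the strength of the tightest tier whose cutoff exceeds the distance."""
    for cutoff, strength in tiers:
        if distance < cutoff:
            return strength
    return 0.0  # beyond the outermost tier: no bias


def seeding_bias(n_misses, base=0.5, max_bias=1.5):
    """Seeding bias after n_misses eligible residues decoded to a non-target type.

    Doubles with each miss; the cap at max_bias is an assumption.
    """
    return min(base * 2 ** n_misses, max_bias)


tiers = parse_tiers("5.5:1.0,7.5:0.6,10.0:0.3")
print(tier_strength(6.0, tiers))
print([seeding_bias(n) for n in range(4)])
```

Keeping the tiers sorted by cutoff means the first match is always the tightest tier, which matches the "under 5.5A, under 7.5A, under 10A" reading of the default.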

Dynamic Constraint Sets

Currently implemented in dynamicMPNN

Target pI

The purpose of this constraint is to bias the sequence towards your desired pI value.
The main factors in determining the strength of the bias applied during decoding are the delta pI (target pI - current pI) and the number of residues left to decode. Strength increases with larger delta pIs and with fewer undecoded residues. For the delta pI, there is a modifier (tanh) so that large delta values saturate towards 1.
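
The description above can be sketched as a single scaling function. This is a minimal illustration, not the repository's implementation: the function name, the exact way the terms are combined, and the linear urgency ramp are assumptions chosen to match the stated behavior (tanh saturation, 1x to 2x urgency at the default of 1, and no bias inside the dead zone).

```python
import math


def pi_bias_scale(target_pi, current_pi, n_decoded, n_total,
                  strength=0.5, urgency=1.0, dead_zone=0.0):
    """Signed bias scale: positive pushes towards basic residues, negative towards acidic."""
    delta = target_pi - current_pi
    if abs(delta) <= dead_zone:
        return 0.0  # close enough to the target: apply no bias
    # Fraction of decoding completed, 0.0 at the first position and 1.0 at the last.
    progress = min(n_decoded / max(n_total - 1, 1), 1.0)
    # Assumed linear ramp: 1x at the start, (1 + urgency)x at the last position.
    urgency_factor = 1.0 + urgency * progress
    # tanh saturates large deltas towards +/-1.
    return strength * math.tanh(delta) * urgency_factor


start = pi_bias_scale(11.0, 7.0, n_decoded=0, n_total=100)
end = pi_bias_scale(11.0, 7.0, n_decoded=99, n_total=100)
print(round(start, 3), round(end, 3))
```

Under these assumptions, a delta of 4 pI units is already near the tanh ceiling, so pushing harder towards an extreme target relies on `pI_strength` rather than on the size of the delta.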

pI Benchmark Setup

Two proteins were chosen for a benchmark. The first is PDB 4eb0, leaf-branch compost cutinase (LCC). The second is PDB 1tpw, a triosephosphate isomerase (TIM). They were chosen because they are monomers, have appropriate crystal structures, and have different pIs (9.3 and 6.8, respectively).
Conditions:

  • Target pI: 4, 5, 6, 7, 8, 9, 10, 11, 12
  • Strength values: 0.20, 0.35, 0.5, 0.65, 0.8, 1
  • Vanilla ProteinMPNN was run with the same parameters, minus those related to pI, for baseline comparison. For each condition, 25 batches were run with 5 sequences per batch.
  • Dead zone: if the calculated pI is within a small range of the target pI, no bias is applied. During development, values from 0 to 0.25 were found to have no effect, so 0 was used for all benchmarks and is the default value.
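
For readers who want to reproduce a similar sweep, the grid of conditions above can be enumerated with a short script. This sketch only builds the command lines, using the flags documented in this README. The output folder layout is a hypothetical choice, and the number of sequences per batch is left at whatever the run script defaults to, since this README does not show a flag for it.

```python
from itertools import product

TARGET_PIS = [4, 5, 6, 7, 8, 9, 10, 11, 12]
STRENGTHS = [0.20, 0.35, 0.5, 0.65, 0.8, 1.0]


def build_commands(pdb_path, out_root, n_batches=25):
    """Return one argument list per (target pI, strength) condition."""
    commands = []
    for target, strength in product(TARGET_PIS, STRENGTHS):
        # Hypothetical folder naming, one folder per condition.
        out_folder = f"{out_root}/pI{target}_s{strength:.2f}/"
        commands.append([
            "python", "run.py",
            "--pdb_path", pdb_path,
            "--out_folder", out_folder,
            "--pI_target", str(float(target)),
            "--pI_strength", str(strength),
            "--number_of_batches", str(n_batches),
        ])
    return commands


commands = build_commands("my_protein.pdb", "output")
print(len(commands), "runs")
print(" ".join(commands[0]))
```

Each argument list can be passed to `subprocess.run` as-is; building lists rather than shell strings avoids quoting problems with paths.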

Metrics:

  • Delta pI
  • % of sequences within a specific delta pI range
  • Confidence (from proteinMPNN)
  • Sequence recovery

Select sequences were also refolded with AF2 for pLDDT.

pI Results

Below are a series of scatter plots for both of the benchmark proteins. They show similar behavior in terms of how strength interacts with the generated sequences. Higher strength generated sequences closer to the target pI, but with a cost to confidence. The default value is 0.5, but may be increased if targeting pIs at the extreme.

[Figure: pI_scatter_thr05, scatter plots for the two benchmark proteins]

Below is a box plot for each protein, at the designated default strength of 0.5. One observation, albeit outside the scope of this work, is the very different distributions of pI from the vanilla proteinMPNN runs. 4eb0 had a higher average pI, as well as a wider distribution, whereas 1tpw had a lower average pI, and tighter distribution. These differences explain why 4eb0 was better at reaching the high target pIs with the constraints. These plots show the generated sequences are in the appropriate pI range.

[Figure: pI_boxplot_s05, box plots for the two benchmark proteins]

For each benchmark protein, at every target pI with strength 0.5, one sequence was chosen for refolding (the best combination of confidence and closeness to the target pI). In general, the top vanilla ProteinMPNN sequences had higher confidences than the pI sequences, although the difference was not large, so the top 3 sequences by confidence in the vanilla runs, as well as 3 sequences in the confidence range of the pI-generated sequences, were also refolded. The higher confidences in the vanilla runs make sense, as ProteinMPNN is unrestrained there and can select its top choices at all positions without being biased towards the pI goal. The table below shows the statistics for these representative sequences. Overall, the pI-designed sequences do not seem to show a defect compared to sequences from vanilla ProteinMPNN. Experimental verification of the folding and stability of these pI-designed sequences is still needed.

| Sequence | Mean pLDDT | Stdev pLDDT | Mean pTM | Stdev pTM |
|---|---|---|---|---|
| 1tpw pI | 93.7 | 1.10 | 0.917 | 0.005 |
| 1tpw vanilla | 94.2 | 0.65 | 0.920 | 0.005 |
| 1tpw highconf | 94.3 | 0.43 | 0.921 | 0.005 |
| 1tpw midconf | 94.1 | 0.91 | 0.919 | 0.005 |
| 4eb0 pI | 96.7 | 0.28 | 0.939 | 0.003 |
| 4eb0 vanilla | 96.5 | 0.25 | 0.938 | 0.003 |
| 4eb0 highconf | 96.3 | 0.22 | 0.937 | 0.003 |
| 4eb0 midconf | 96.6 | 0.13 | 0.939 | 0.003 |

Surface Patches

Surface patch creation is guided similarly by dynamic constraints. For calculating and applying bias, a center surface residue is chosen at the start of each batch. This can either be supplied by the user (along with a jitter radius so the centers can be slightly skewed) or chosen randomly over the surface. All surface residues within a specific search radius of the center, as determined by beta-carbon distances, are considered bias-eligible residues.

When the first bias-eligible residue is decoded, a small seeding bias is applied to push towards a desired residue type (e.g. K or R for positive patches) and away from the non-desired type (e.g. D or E for positive patches). This seeding bias doubles until an eligible residue is successfully decoded to the correct type. After this point, the seeding bias no longer applies, and a new dynamic biasing formula takes over. The bias on the residue being decoded increases the closer it is to any bias-eligible residue that has already been decoded to the correct type. It takes a tiered approach and applies the same bias to all correct types within each of 3 threshold tiers: 1.0 for under 5.5A, 0.6 for under 7.5A, and 0.3 for under 10A. These biases are summed for the decoding residue, so the more correct types are near it, the stronger the bias. This is to encourage the creation of contiguous patches without forcing those residues to be placed.

In addition to biases, the random decoding order is altered before decoding begins. After the center residue is chosen, the bias-eligible residues are re-ordered to radiate from the center residue, and are decoded before all other residues. This means the residue with the closest beta-carbon distance to the center residue will be the first of the bias-eligible residues to be decoded after the center residue. This continues until all bias-eligible residues are decoded, after which decoding of the rest of the protein continues.
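
The two mechanisms described above, the summed tiered bias and the radial decoding order, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the repository's code: the function names are hypothetical, coordinates are plain beta-carbon (x, y, z) tuples, and clipping the summed bias at the maximum bias is assumed from the `surface_patch_max_bias` description.

```python
import math

# Default tiers from this README: (cutoff in Angstrom, strength).
TIERS = [(5.5, 1.0), (7.5, 0.6), (10.0, 0.3)]


def neighbor_bias(position, correct_positions, tiers=TIERS, max_bias=1.5):
    """Sum the tier strength of every correct-type residue near `position`."""
    total = 0.0
    for other in correct_positions:
        distance = math.dist(position, other)
        for cutoff, strength in tiers:
            if distance < cutoff:
                total += strength
                break  # only the tightest matching tier counts
    return min(total, max_bias)  # assumed cap at the maximum bias


def radial_decoding_order(cb_coords, center_idx, eligible, random_order):
    """Decode eligible residues outward from the center, then everything else."""
    center = cb_coords[center_idx]
    patch_first = sorted(eligible, key=lambda i: math.dist(cb_coords[i], center))
    eligible_set = set(eligible)
    # Remaining residues keep their original (random) relative order.
    return patch_first + [i for i in random_order if i not in eligible_set]


coords = [(0, 0, 0), (5, 0, 0), (9, 0, 0), (30, 0, 0), (3, 0, 0)]
order = radial_decoding_order(coords, center_idx=0, eligible=[0, 1, 2, 4],
                              random_order=[3, 2, 0, 4, 1])
print(order)
print(neighbor_bias(coords[0], [coords[1], coords[2]]))
```

Because the center residue has distance zero to itself, it sorts first when it is part of the eligible set, which matches the order described in the text.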

Patches Benchmark Setup

The same two benchmark proteins, PDBs 4eb0 and 1tpw, were used for the surface patch benchmarking.

Conditions:

  • Surface patch type: Positive, Negative, Hydrophobic
  • Search radius (A): 12, 16, 20
  • Seeding bias: 0.2, 0.5
  • Maximum bias: 1.5, 2.0

Metrics:

  • Largest connected clusters (LCC)
    • One of the main metrics used here is the size of the largest connected cluster of residues in the bias-eligible set. This is calculated by connecting all residues that have a beta-carbon distance within varying thresholds. The count of residues in the largest cluster is the cluster value.
  • Patch size and Patch members
    • PEP-patch was used to calculate surface patches across packed models. A patch that centers on a bias-eligible residue is considered a successful patch. A patch centered on the designated center residue would be a high-quality patch.
  • Target Fraction
    • The fraction of eligible residues that were decoded to the correct type for the surface patch. More of a diagnostic metric than a metric with a target. A value of 1 would be bad because it would indicate that every eligible residue contributes to the patch, which could be accomplished by flat biases from the beginning. A value of 0 would be bad, as it means no patch would form.
  • Confidence (from proteinMPNN)
  • Sequence recovery

All sequences were run with the pack flag to easily evaluate patches. Select sequences were also refolded with AF2 for pLDDT.
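
The largest-connected-cluster and target-fraction metrics described above can be computed along these lines. This is a sketch rather than the benchmark code: the function names are hypothetical, and it assumes the cluster is built over the beta-carbon coordinates of the eligible residues that were decoded to the target type.

```python
import math
from collections import deque


def largest_connected_cluster(cb_coords, threshold):
    """Size of the largest group of residues linked by C-beta distances <= threshold."""
    n = len(cb_coords)
    seen = [False] * n
    best = 0
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search over the distance graph from this residue.
        seen[start] = True
        queue = deque([start])
        size = 0
        while queue:
            i = queue.popleft()
            size += 1
            for j in range(n):
                if not seen[j] and math.dist(cb_coords[i], cb_coords[j]) <= threshold:
                    seen[j] = True
                    queue.append(j)
        best = max(best, size)
    return best


def target_fraction(sequence, eligible, target_types):
    """Fraction of eligible positions (0-indexed) decoded to a target residue type."""
    if not eligible:
        return 0.0
    hits = sum(1 for i in eligible if sequence[i] in target_types)
    return hits / len(eligible)


coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (30, 0, 0), (33, 0, 0)]
print(largest_connected_cluster(coords, threshold=5.0))
print(target_fraction("AKRDE", eligible=[1, 2, 3], target_types="KR"))
```

Residues are connected transitively, so a chain of residues each within the threshold of its neighbor counts as one cluster even if its ends are far apart. Running the function at several thresholds gives the "varying thresholds" view mentioned in the metric description.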

Patches Results

A series of heatmaps are shown below, for the major conditions. The two proteins, and the 3 surface patch types are separated. Some trends are consistent:

  • Larger surface searches create larger patches
  • Increased seeding bias has a slight positive effect on building better patches (runs that decode more eligible residues before first placing a correct type lead to poorer patches)
  • Hydrophobic patches seem to be of lower quality compared to positive or negative. This may be intrinsic to proteins, or may be a function of methodology. For now it is suitable, but improvement here may be possible.
  • Minimal effect on confidence values from ProteinMPNN
  • Slight improvement with regard to seeding residues being centers with larger search radii
  • Different scaffolds will have different propensities for types of patches. For example, negative patches in 1tpw were much larger and more easily made than positive patches.
  • The target fraction (# of target type residues/# of total eligible residues) decreases as search radius increases. While the absolute size of the patches increases, the fraction of eligible residues that are of the target type decreases. This suggests the biasing is helping to form tight patches, rather than just a uniform increase across the eligible area.

Negative Patches

[Figures: heatmap_4eb0A_negative_2, heatmap_1tpwA_negative_2]

Positive Patches

[Figures: heatmap_4eb0A_positive_2, heatmap_1tpwA_positive_2]

Hydrophobic Patches

[Figure: heatmap_4eb0A_1tpwA_hydrophobic]

Note that PEP-patch was not run for the hydrophobic patches. There were errors when trying to run the hydrophobic PEP-patch script with the packed structures, possibly due to a compatibility issue. The other metrics show similar trends to the polar patches, but we do not have definitive patch size data for this set.

Selected AF2 Patch Comparisons

One sequence per protein:surface-type:search-radius combination was refolded with AF2. Their patches are shown below. Each protein occupies one half of the image; the y-dimension represents the surface patch type and the x-dimension the search radius. Visualization was done in PyMOL using APBS. Hydrophobic surfaces look to be of lower quality.

[Figure: Surface_patches_overview]

Sequence Motifs

As of 4/21/2026, this feature has not been benchmarked (although some relevant code is included). I want to make sure it is robust and provides something that simple post-generation sequence filtering cannot adequately accomplish.

Citing this work

If you use the code, please cite the original papers, and this repo for new constraints:

@article{dauparas2023atomic,
  title={Atomic context-conditioned protein sequence design using LigandMPNN},
  author={Dauparas, Justas and Lee, Gyu Rie and Pecoraro, Robert and An, Linna and Anishchenko, Ivan and Glasscock, Cameron and Baker, David},
  journal={Biorxiv},
  pages={2023--12},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

@article{dauparas2022robust,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},  
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science}
}

Claude Code Disclosure

Code was written in conjunction with Claude. The idea of dynamic constraints and the current three constraints were solely my (Tim's) work. The initial ideas for implementation and the benchmarking analysis were also mine. Claude assisted primarily in the writing of the codebase.

