-
Notifications
You must be signed in to change notification settings - Fork 5
QM MM Tutorial
In this tutorial you will learn how to perform a QM/MM calculation in CAST with a protein structure you downloaded from the Protein Data Bank.
- External programs you need
- Creation of tinkerstructure
- Choosing the QM region
- The real calculations
- Conclusion
The first thing we have to do is to prepare a tinkerstructure (for explanation of this format see here) that contains the protein system we want to investigate. The Protein Data Bank contains a lot of protein structures which can be downloaded. Here we will use the structure 2p7u which contains the enzyme rhodesain bound to an inhibitor.
The preferred program to process pdb files is pymol. So now we open pymol and first get our structure. This is done by typing fetch 2p7u into the command line inside this program. Now the protein-inhibitor system appears in the viewer. In order to better investigate the system we can display the aminoacid sequence by typing set seq_view, 1 . There are a lot of aminoacids, displayed by their one-letter-code, the symbol "D1R" which is the inhibitor and a lot of water molecules, displayed by "O". We want to take a closer look at residue S88 which is a Serin. So we click on this letter in the sequence, then type hide all to hide everything and then display the selected residue as sticks by typing show sticks, sele . We see that the serin residue looks strange. It seems to be there twice. This is called alternate locations and means that in the crystal where the structure was taken from the residue had one position and in others it had the other position. As CAST can't deal with that we remove all locations except the first (for the whole system) by typing remove not (alt ''+A), followed by alter all, alt=''. We see that the serin residue looks "normal" now. The second thing we have to do before we can use the structure in CAST is to add hydrogen atoms which are not present in most X-ray structures. For some reason it is not possible to do that straight ahead but we have to save the structure without alternate locations and open it again. We do that by typing:
save 2p7u.pdb
delete all
load 2p7u.pdb
Now we can add hydrogen atoms by h_add. We immediately see how the single oxygen atoms are converted into complete water molecules.
For most structures we could now save our molecule as .xyz file and open it by CAST. However in this structure there is a small mistake in the structure. We take a closer look at the C-terminal Glycin. We see that the C-terminus doesn't end in a carboxyl group but in an aldehyde. To correct this we have to convert the hydrogen atom into an OH group. We do this by using the "Builder" which is opened by clicking on this button in the upper right corner of Pymol. In the window that appears we click on the button "O". Pymol tells us to pick atoms to replace with Oxygen so we click on the aldehyde H. According to our wishes it immediately gets converted into an OH group. So now we save our structure as a xyz-file by typing save 2p7u.xyz and as a pdb-file with save 2p7u.pdb Then we can close pymol.
Now we will use CAST to convert this xyz-file to tinker format. So we open the CAST inputfile "CAST.txt". In it we have do choose the following options:
name 2p7u.xyz # name of our inputfile
outname output # name of the outputfile (can be anything)
inputtype XYZ # input filetype: XYZ
xyz_atomtypes 1 # automatically create OPLSAA atomtypes
task WRITE_TINKER # write a tinker stucture
interface DFTB # can be anything except forcefields
Now we run CAST. In the folder where we run it we need the inputfile and the structurefile. CAST now creates a tinkerfile "output.arc" which can be opened py Pymol. We see our molecule here. But what is more interesting is the terminal output of CAST which tells us for which atoms no forcefield atomtypes could be created. It says:
No atomtypes found for (to be copied into pymol):
select sele, id 339+340+341+342+343+344+345+346+347+348+3087+3088+3089+3090+3091+3092+3093+3094+3095+3096+3097+3098+3099+3100+3101+3102+3103+3104+3105+3106+3107+3108+3109+3110+3111+3112+3113+3114+3115+3116+3117+3118+3119+3120+3121+3122+3123+3124+3125+3126+3127+3128+3129+3130+3131+3132+3133+3134+3135+3136+3137+3138+3139+3140+3141+3142+3143+3144+3145+3146+3147+3148+3149+3150+3151+3152+3153+3154+3155+3156+3157+3158+3159+3160+3161+3162+3163+3164+3165+3166
We do what we are told here so we open pymol again and open the pdb-file we have saved before by load 2p7u.pdb. Then we copy the stuff above into the pymol command line, then do hide all and show sticks, sele. In the viewer we see that the atoms that are not recognized are the inhibitor and the cystein residue to which the inhibitor is bound. So what we have to do now is to assign atomtypes to these atoms. This is a very annoying task where you have open the .arc file in a text editor, look at every atom that is not assigned yet (the first one here is 339) and determine the atomtype by comparing it to the description of the atomtypes in the forcefield parameter file. Our parameterfile is called "oplsaa_mod2.prm" and you can download it from here. It is an extension of the file "oplsaa.prm" that you can download from the tinker homepage. To see which index belongs to which atom we can do label sele, ID to show the atom indices as labels in pymol. If they are wrong maybe pymol has changed the order of the atoms which you can prevent by set retain_oder, 0. So in our example atom 339 is a backbone nitrogen atom. In the parameter file you find "Amide -CO-NHR" as description for atomtype 180 which seems to be suitable here so you replace the 0 in the file "output.arc" by a 180. The same we have to do for all the other not-recognized atoms. To make it easier for you the resulting atom types are shown here:
339 N -1.747000 1.908000 9.073000 180 330 340 348
340 C -1.798000 1.114000 7.844000 165 339 341 343 345
341 C -1.972000 2.012000 6.595000 177 340 342 349
342 O -1.332000 1.765000 5.554000 178 341
343 C -2.949000 0.108000 7.940000 156 340 344 346 347
344 S -4.562000 0.878000 8.259000 144 343 3121
345 H -0.852365 0.582917 7.735252 85 340
346 H -2.733424 -0.544419 8.786154 85 343
347 H -3.007901 -0.440157 6.999705 85 343
348 H -2.329688 1.638838 9.852824 183 339
...
3087 C -6.347000 -2.406000 12.971000 90 3088 3089 3131
3088 C -5.097000 -1.976000 12.455000 90 3087 3090 3094
3089 C -6.647000 -2.151000 14.324000 90 3087 3091 3128
3090 C -4.157000 -1.364000 13.278000 90 3088 3092 3129
3091 C -5.709000 -1.499000 15.159000 90 3089 3092 3130
3092 C -4.463000 -1.116000 14.660000 90 3090 3091 3132
3093 N -9.278000 3.944000 8.013000 180 3102 3103 3133
3094 S -4.835000 -2.299000 10.934000 434 3088 3100 3101 3122
3095 N -7.134000 1.467000 9.526000 180 3111 3112 3134
3096 O -10.829000 2.974000 9.301000 178 3102
3097 N -11.030000 5.210000 8.817000 181 3102 3124 3126
3098 O -6.847000 3.703000 9.206000 178 3111
3099 N -11.602000 7.717000 9.856000 770 3123 3125 3127
3100 O -3.500000 -2.490000 10.564000 435 3094
3101 O -5.549000 -3.532000 10.721000 435 3094
3102 C -10.396000 4.020000 8.745000 177 3093 3096 3097
3103 C -8.593000 2.648000 7.976000 171 3093 3104 3111 3135
3104 C -7.950000 2.459000 6.575000 81 3103 3105 3136 3137
3105 C -8.871000 2.619000 5.395000 90 3104 3106 3110
3106 C -10.197000 2.237000 5.529000 90 3105 3107 3138
3107 C -11.056000 2.363000 4.423000 90 3106 3108 3139
3108 C -10.525000 2.854000 3.235000 90 3107 3109 3140
3109 C -9.216000 3.229000 3.123000 90 3108 3110 3141
3110 C -8.344000 3.082000 4.219000 90 3105 3109 3142
3111 C -7.445000 2.639000 8.957000 177 3095 3098 3103
3112 C -6.002000 1.319000 10.483000 171 3095 3113 3121 3143
3113 C -6.546000 0.977000 11.887000 81 3112 3114 3144 3145
3114 C -7.249000 2.153000 12.572000 81 3113 3115 3146 3147
3115 C -7.579000 1.822000 13.996000 90 3114 3116 3120
3116 C -8.535000 0.853000 14.277000 90 3115 3117 3148
3117 C -8.802000 0.508000 15.609000 90 3116 3118 3149
3118 C -8.138000 1.187000 16.632999 90 3117 3119 3150
3119 C -7.220000 2.179000 16.348000 90 3118 3120 3151
3120 C -6.942000 2.534000 15.028000 90 3115 3119 3152
3121 C -4.976000 0.231000 9.871000 149 344 3112 3122 3153
3122 C -5.592000 -1.166000 9.865000 81 3094 3121 3154 3155
3123 C -11.962000 9.034000 10.351000 733 3099 3156 3157 3158
3124 C -10.570000 6.377000 8.057000 736 3097 3125 3159 3160
3125 C -10.508000 7.641000 8.909000 736 3099 3124 3161 3162
3126 C -12.299000 5.387000 9.530000 736 3097 3127 3163 3164
3127 C -12.254000 6.566000 10.473000 736 3099 3126 3165 3166
3128 H -7.610201 -2.458744 14.730986 91 3089
3129 H -3.187489 -1.072335 12.874167 91 3090
3130 H -5.961298 -1.294353 16.199465 91 3091
3131 H -7.062234 -2.924221 12.332260 91 3087
3132 H -3.731841 -0.635221 15.309891 91 3092
3133 H -8.922134 4.743480 7.508726 183 3093
3134 H -7.687429 0.653530 9.297783 183 3095
3135 H -9.320177 1.870370 8.209679 85 3103
3136 H -7.614879 1.422224 6.545166 85 3104
3137 H -7.148853 3.191208 6.474326 85 3104
3138 H -10.567554 1.845370 6.476320 91 3106
3139 H -12.107267 2.084047 4.494559 91 3107
3140 H -11.174785 2.942572 2.364348 91 3108
3141 H -8.846310 3.642287 2.184585 91 3109
3142 H -7.285611 3.328384 4.134093 91 3110
3143 H -5.445266 2.246964 10.613503 85 3112
3144 H -5.689773 0.710201 12.506511 85 3113
3145 H -7.257367 0.156625 11.791911 85 3113
3146 H -8.176520 2.362567 12.039188 85 3114
3147 H -6.593601 3.023665 12.549780 85 3114
3148 H -9.074257 0.363943 13.465751 91 3116
3149 H -9.518294 -0.279705 15.842545 91 3117
3150 H -8.347087 0.930565 17.671568 91 3118
3151 H -6.707439 2.689765 17.163168 91 3119
3152 H -6.247972 3.343511 14.801908 91 3120
3153 H -4.071056 0.095607 10.463323 85 3121
3154 H -6.616967 -1.050255 10.217344 85 3122
3155 H -5.528267 -1.560928 8.851062 85 3122
3156 H -12.950261 8.991778 10.808884 85 3123
3157 H -11.231688 9.355893 11.093382 85 3123
3158 H -11.975821 9.742730 9.522984 85 3123
3159 H -11.286585 6.553201 7.254779 85 3124
3160 H -9.574313 6.169541 7.664994 85 3124
3161 H -9.576827 7.617278 9.475086 85 3125
3162 H -10.550103 8.510090 8.252486 85 3125
3163 H -12.483510 4.489136 10.119827 85 3126
3164 H -13.094086 5.547143 8.801787 85 3126
3165 H -13.277838 6.844875 10.722168 85 3127
3166 H -11.702019 6.283206 11.369350 85 3127
After inserting the missing atomtypes we can test if everything works by just performing a singlepoint calculation with CAST and the OPLSAA forcefield. These are the options we have to change:
name output.arc # name of our inputfile
inputtype TINKER # input filetype: tinker
task SP # do a singlepoint calculation
interface OPLSAA # use OPLSAA forcefield
paramfile oplsaa_mod2.prm # file with the forcefield parameters
Now we run CAST. Besides the inputfile and the structurefile we now also need the forcefield parameter file which fits with the atomtypes that we have assigned in the step before. As a result we should get a total energy of -3873.511217990343 kcal/mol.
Now we have created a structure file that we can use for any kinds of CAST calculations. In the next steps we will define a QM region and then perform a QM/MM calculation.
For doing a QM/MM calculation the next thing we have to do is choosing region which should be described by our QM program. This region will be given to CAST as a list of atom indices which are separated by commas. So we should now create this list. The easiest thing to do this is again the use of pymol. In order to do that we first convert our tinkerstructure back to a pdb file which can be better handled by pymol. (Of course we could also use the original pdb file but this is an approach which you can also use if your original structure is no pdb.) Converting tinkerstructures to pdb can be very easily done by CAST. The only option we have to change is task which we set to WRITE_PDB. Then we run CAST.
We get a structure "output.pdb" which we can open with pymol. What is very important is that pymol somtimes changes the order of the atoms. To avoid this the first thing we do after launching pymol is typing set retain_order, 1 into the command line. Then we display the sequence by set seq_view, 1 . If we take a closer look at the sequence we see 2 unkown residues XXX: The first is displayed between residues 24 and 25, the second at the end of the protein chain at field 215. Interestingly the first XXX also has the residue number 215, so in reality it is only one residue. If we select those residues and show them we see that the first part is the cystein where the inhibitor is bound to and the second is the inhibitor itself. The explanation of this is that CAST recognizes the cystein together with the bound inhibitor as one modified aminoacid that it doesn't know. So it puts them into one residue with the name XXX. But as the cystein atoms are between the atoms of S24 and W25 in the structure and we told pymol not to change the order of the atoms they are displayed in two different places in the sequence.
So now we want to choose the atoms of our QM region. The QM region should be the one you see in the picture below. So all atoms are part either of the inhibitor, the bound cystein and the aminoacid histidine 161. So we select these 3 residues in the sequence and show only them by typing hide all and show sticks, sele. So now we want to select just certain atoms in this residues. For this we have to switch the selection mode by clicking on Mouse -> Selection Mode -> Atoms. Then we click on all the atoms that are part of our QM region. At the end that should be 43 atoms. Then we can display the atom indices by typing x = ','.join(str(x[1]) for x in cmd.index('sele')) , followed by print(x). The output should be:
343,344,346,347,2271,2272,2273,2274,2275,2276,2278,2279,2281,2282,2283,3087,3088,3089,3090,3091,3092,3094,3095,3098,3100,3101,3111,3112,3113,3121,3122,3128,3129,3130,3131,3132,3134,3143,3144,3145,3153,3154,3155
These are the atom indices that define the QM region of our QM/MM calculation. We copy this into a textfile and save it until we need it in the next step.
Now we finally really want to start the QM/MM calculation. So we open the inputfile "CAST.txt" again and set a few options:
task SP # do a singlepoint calculation
interface QMMM_A # use additive QM/MM interface
paramfile oplsaa_mod2.prm # file with the forcefield parameters for MM
#################### QM/MM OPTIONS ##########################
QMMMqmatoms 343,344,346,347,2271,2272,2273,2274,2275,2276,2278,2279,2281,2282,2283,3087,3088,3089,3090,3091,3092,3094,3095,3098,3100,3101,3111,3112,3113,3121,3122,3128,3129,3130,3131,3132,3134,3143,3144,3145,3153,3154,3155 # these is the atom list from above
QMMMmminterface OPLSAA # interface for MM part
QMMMqminterface DFTB # interface for QM part (DFTB because it is fast)
QMMMwriteqmintofile 1 # we want to see our QM region as seperate structure
QMMMlinkatomtype 85, 85, 221, 85
QMMMcenter 0 # central atom of QM region, 0 = default
QMMMzerocharge_bonds 3 # default option for electrostatic embedding
#################### OPTIONS FOR DFTB+ ##########################
DFTB+path /apps/dftbplus/bin/dftb+ # path to dftb+
DFTB+skfiles /apps/dftbplus/mio-1-1/ # path to slater koster files
DFTB+charge 0 # total charge of the QM system
Something where we have to pay attention is the option "QMMMlinkatomtype". When you look at the system we have chosen as QM region you see that there are 4 bonds that are cut in order to get it. This system will now be put in a QM program (or better semiempirical program, if you want to be very exact) and QM programs don't like loose bonds. So all the loose bonds are saturated by hydrogen atoms. With the option "QMMMlinkatomtype" you have to give the forcefield atomtype for those hydrogen atoms in the order of the QM atom they are bound to. So to determine them take your former view in pymol and show labels (label sele, ID) So you see you have the following QM atoms which are bound to a MM atom:
- index 343: CH2 group, hydrogen gets type 85 ("Alkane H-C")
- index 2271: CH2 group, hydrogen gets type 85 ("Alkane H-C")
- index 3111: carbonyle group, hydrogen gets type 221 ("Aldehyde/Formamide H-C=O")
- index 3113: CH2 group, hydrogen gets type 85 ("Alkane H-C")
Of course the paths to DFTB+ and to the slater-koster have to be adjusted to your system. Download links are https://www.dftbplus.org/download/ for the program itself and https://www.dftb.org/parameters/download/ for the slater-koster files.
Now we run CAST. The result is:
QM-atoms: 43
MM-atoms: 3630
Potentials
QM MM VDW BONDED COUL TOTAL
-36357.9 -3959.9 -40.3 -0.8 0.0 -40272.9
Total energy: -40272.9 kcal/mol
What we see here is first the energy of only the QM system (which is also written into the file "qm_system.arc" so we can look if we did everything right when defining the QM region), then the energy of only the MM system and then the interactions between those two systems. Those are discriminated into van der Waals interactions and bonded interactions. The coulomb interactions are zero as we used electrostatic embedding so that the coulomb interactions are included into the energy of the coulomb system. As we used an additive scheme here the total energy is just the sum of all the partial energies.
In the first calculation we used electrostatic embedding which means that electrostatic interactions between QM and MM system are taken into account by including the charge parameters of the MM atoms into the QM calculation as external charges. As in the case of bonded interactions some of the MM atoms are too close to the QM system and the corresponding interactions are already included in the bonded terms those atoms are not taken into account when creating the external charges. QMMMzerocharge_bonds 3 means that MM atoms are excluded if they are 3 bonds or less between them and the next QM atom. In the same manner you can choose 1 or 2 here then atoms are only excluded of they are directly bound to a QM atom (1) or seperated by either 1 or 2 bonds (2).
We can also use mechanical embedding, i. e. calculating the electrostatic interaction between QM and MM system as pairwise coulomb interactions of the MM charges. In order to do this we this we run the same calculation again but set the option QMMMzerocharge_bonds to 0 for activating mechanical embedding. Most of the partial energies are the same but now the coulomb interaction has a value (6.3) and the value of the QM energy has changed to -36315.9 so the total energy now is -40224.6 kcal/mol.
There is also a subtractive QM/MM scheme implemented into CAST. To choose this you have to set the interface to QMMM_S instead of QMMM_A. All the other options are the same for additive and subtractive scheme. If you run the subtractive scheme with the same options as above you get:
QM-atoms: 43
MM-atoms: 3630
Potentials
MM_big MM_small QM TOTAL
-3873.5 -17.5 -36357.9 -40213.7
Total energy: -40213.7 kcal/mol
Here the first value is the energy of the total system, calculated by MM method, the next is the energy of the QM system, calculated by the MM method and QM is the energy of the QM system, calculated with the QM method. In the case the total energy is calculated by:
TOTAL = MM_big - MM_small + QM
This approach has a few advantages over the additive scheme. As we calculated the QM system twice and include one of the calculations with a positive and one with a negative sign some errors are cancelled out. Furthermore as only "complete" systems are calculated, no individual interactions as in the additive scheme it is not necessary to use a forcefield as "MM method" but you can also combine tho QM or semiempirical interfaces (just change the option MMinterface to some QM program).
Furthermore it is also possible to combine more than two different methods within the subtractive scheme. In CAST there is a THREE_LAYER interface implemented which can combine three different energy interfaces which is what we want to do in the next step.
In order to perform a three-layer calculation we first set the interface option in the CAST.txt file to THREE_LAYER. Three-layer means we have a region which we want to describe very precisely, in our case we will use the program Psi4 for this. This region is called the small system and the method for it "QM method". Around this region we have another layer, called intermediate layer that should be described by an intermediate method (in our case DFTB, in CAST it is generally called SE for semiempirical). The rest of the molecule is still described by a forcefield.
So we go to the QM/MM options in the CAST inputfile. By QMMMqmatoms we define which atoms we want to include in the inner system, i. e. calculate with Psi4. We take the same atoms as before. The QMmminterface remains OPLSAA. For the QM, i. e. the innermost interface, we choose PSI4. Now we have the choose which atoms belong to our intermediate layer which are defined by QMMMseatoms. We want to apply this method on the benzyle group of the inhibitor which is cut off by the QM system. For getting the atom indices we can again use pymol and get: 3114,3115,3116,3117,3118,3119,3120,3146,3147,3148,3149,3150,3151,3152
As QMMMseinterface we choose DFTB. QMMMsmall_charges is an option for electrostatic embedding the three-layer interface. It is recommended to take either 1 or 2 here, we take 1 . We also have to give an option for the central atom of the small system which is called QMMMsmall_center. As this still doesn't matter we again give 0 which is the default option.
Something else we have to change now are the link atom types. They have to be given for the intermediate system, i. e. for a system consisting of all QM and SE atoms. If you compare this system to the former QM system you notice that the atom with index 3113 doesn't need a link atom any more as the benzyle group is not cut off anymore. So this option is:
QMMMlinkatomtype 85, 85, 221
Furthermore we have to set a few option for the Psi4 interface (of course the path again depends on the location of Psi4 on your machine):
PSI4-path /apps/psi4/psi4 # path to Psi4
PSI4-basis 6-31G # basisset
PSI4-method HF # calculation method
PSI4-spin 1 # spin multiplicity (of small system)
PSI4-charge 0 # charge of small system
PSI4-memory 4GB # resources for PSI4
PSI4-threads 4
Now we run CAST. The result is:
QM-atoms: 43
SE-atoms: 14
MM-atoms: 3616
Potentials
MM_big MM_middle SE_middle SE_small QM TOTAL
-3873.5 39.8 -45287.9 -36359.1 -1130415.1 -1143257.2
Total energy: -1143257.2 kcal/mol
So first the total system is calculated by the MM method (OPLSAA). This number is identical to the one before. Then the middle or intermediate system is also calculated by the MM method. This system consists of all QM and SE atoms, i. e. in our case it contains 43 QM + 14 SE + 3 Link-Atoms = 60 atoms. This system is written into the tinkerstructure "intermediate_system.arc". Then the same system is calculated by the "SE method" (DFTB). Then the small system is calculated by the same method. This system only consists of the QM atoms and can be viewed in the file "small_system.arc". Last but not least this system is calculated by the "QM method" (Psi4). The total energy is then calculated by:
TOTAL = MM_big - MM_middle + SE_middle - SE_small + QM
You have now seen how you can prepare a structure from the Protein Data Bank for calculations with CAST. Furthermore you have learned how to perform QM/MM and three-layer calculations with CAST by the examples of singlepoint (SP) calculations. Of course you can use those energy interfaces also together with the other tasks of CAST like optimizations or MD simulations. Have fun with it!
Date: 29 Aug 2019 (name of interfaces changed on 04 May 2020)
