All right, so good morning, everyone. Thank you for coming to my now pre-recorded talk; just to be like everyone else, I'm using an electronic prophylaxis. So: strategies for ab initio biomolecular force field development. This whole idea of ab initio force fields has been attractive for at least 15 years, ever since we had the compute capacity to do quantum calculations that would describe the molecules we're interested in, and in the last five to ten years we've really started to see real hope in this direction.

I want to start by reviewing the differences between small molecule force fields, which a lot of the other talks have covered in great detail (I've been watching, very impressed, a lot of the work being done in Open Force Field), and biomolecular force fields. These differences manifest themselves not just in the fitting strategies but even in the way people talk about their force fields. Then it becomes an issue: how do we pair a small molecule force field with a biomolecular force field? (A third major category I'm not covering here is water models.) People talk about pairing a small molecule force field to a biomolecular force field, and I think we first have to look at the goals of what each is trying to model. With a small molecule force field you want to capture a chemical space of trillions, or thousands of trillions, of potential molecules. With a biomolecular force field you have a much more limited palette; you can go up to perhaps 100 common monomers, because you have the 20 amino acids in nature plus some modifications of those amino acids, phosphorylation and such.
Beyond that, there are probably a few dozen non-natural amino acids that people are most interested in, even though thousands are theoretically possible. So we're really talking about a palette of up to 100, maybe 200, individual residues in a biomolecular force field. With the chemistry in a small molecule force field you can have a lot of nasty chemistry: some toxicity, things that take a lot of energy to synthesize. In biology you can have things like penicillin and the beta-lactam ring, which is pretty highly strained, but really you're not going to see quite that chemical diversity. The common theme, of course, is that these things are products of metabolism, and they're also going to be recycled and broken down by metabolism. So there's a limited palette of chemical bonds and ways to connect things, and with the ATP-driven processes that create them there's a limited energy budget for making them.

The key properties for a small molecule force field then become things like hydration free energies and binding free energies: we want to know the equilibrium constant for a drug binding, or even the properties of a neat liquid, say how benzene or some benzene derivative is going to be solubilized in CO2. With a biomolecular force field we want structure, primarily: the secondary and tertiary structures of polymers, how hydrogen bonding in particular dictates the secondary structure, and then how weaker interactions dictate the way those secondary structures fold into a protein or an RNA enzyme.
This all translates into the ways we train these force fields. For small molecule force fields you have the parameter libraries and interpolation strategies that I think OpenFF is doing very well to explore and hone. You can see an analog for biomolecular force fields: we can start to think about libraries, bond-order corrections, and such. But because biomolecular force fields are so sensitive to small perturbations (we see small perturbations create large differences in tertiary structure, whether your protein folds or not), the strategy tends to be: improve on something that already works, modify only a few of its parameters, and throw lots of graduate students at the problem.

As for validation strategies: for the small molecule force field, validation is still the bulk of the work. Generating the fitting data and doing the fitting is important, but for either of these categories the validation is quite lengthy. For small molecules it's fleets of TI or alchemical binding free energy calculations, getting back to the goals of the model; for the biomolecular force field you're looking at microsecond-timescale simulations. The latest Amber force field, ff19SB, which I'll talk about a little, was validated using 5,000 microseconds, that is, five milliseconds, of combined replica exchange simulations on various helices to try to get those propensities right. You can also see other things on the microsecond timescale, like enzyme binding events, but mostly you're looking at NMR observables and NMR relaxation constants.
Biomolecular force fields, in this sense, have been pretty successful at replicating a lot of these observables, again on the microsecond timescale, and we're now in a regime where, on the microsecond timescale, we can start to see binding events, the drug-ligand interactions relevant to biology. So in some sense the computation is ready for both force fields to work in conjunction, in concert, and now comes the question: can we align the ways we make both of these models?

So, a brief history of the Amber protein force fields; this is sort of my stomping grounds. The Amber protein force fields are a vaunted, very respected class of biomolecular models. It started in 1995 in the late Peter Kollman's lab with Wendy Cornell; you had Ken Merz, you had Chris Bayly, all in that crowd. They developed the parm94 set along with what's been called the Cornell charge set: ff94 incorporated parm94, which is the bonded parameters, plus the Cornell charge set, which was all based on 6-31G* Hartree-Fock calculations. Nowadays those take minutes if not seconds on a modern computer; it was pretty significant back in 1994, of course. Because that charge set has some propensity toward polarization in certain chemical groups, it actually performed very well, and it is still in use today, even in ff19SB; 24 years later they're still using the same charge set. This slide tries to indicate, as seen in the legend, whether each force field brought a new charge model, a torsion refit, or a Lennard-Jones refit. Junmei Wang, who is still with the Amber group and the consortium, started in the late 90s working on a derivative of ff94 that became ff99; in particular, the hydrogen Lennard-Jones parameters got diversified by Junmei. It really took off when Victor Hornak came in.
Victor Hornak came in along with people like Robert Abel in Carlos Simmerling's group, who were also involved in this effort. In 2006 came the next torsion refit, when we started to incorporate more MP2-level calculations. This became ff99SB, and ff99SB was a very long-lived and very successful protein force field, but you really have to recognize that it was a limited refit of what was already in ff94. It got as far as 2014 before James Maier, also in Carlos Simmerling's group, came along and did some additional torsion refits, getting even better reproduction of helicity and perhaps beta-sheet propensity, still using the same charge set. Then around 2014 to 2016 we had a flowering of different force fields. I got in there and made a complete refit: a different charge set, all new torsion parameters, and some of my own Lennard-Jones edits. That first one actually didn't work very well, but a student of Lillian Chong's came along and I was able to work with him (I'll talk about this in more detail), and in 2015 we came out with what's now called ff15ipq: the successor to ff14ipq, but in fact a complete refit of everything. That one basically backtracked a lot of the Lennard-Jones refitting, but it got the salt bridge propensities correct, which had really been ff14ipq's Achilles heel. So ff15ipq and ff14SB are very much neck and neck; they're both very, very good force fields. And Lee-Ping Wang in the Open Force Field consortium came out with his around the same time, so there were essentially three very good Amber force fields.
Lee-Ping made a complete torsion refit, not a limited one, of all the side chains and all the protein parameters, but again still keeping the Cornell charge set, so I put him as a branch in that original line. Lee-Ping's force field has since been modified by Robert Best and later the D. E. Shaw group. Then there's the IPQ line of force fields over on the very right, which I kind of started and Karl Debiec really did an excellent job on. Since then I've been involved in writing a manuscript for Anthony Bogetti; so, Italian cars and perhaps Italian force fields, we're going to say, are very well made. Anthony Bogetti, another student of Lillian Chong's, has made another 900 parameters: torsions and new charge parameters for unnatural amino acids. I'm really very pleased with their success. I helped them through generating about the first 30 of those parameters, and then I didn't hear back from them and thought maybe they had dropped the project, but no, in fact they finished it. They didn't need my help; they needed my program, but not my help, and I was very happy about that. So there's quite a crop of force fields within the Amber community, and it's remarkable how long-lived one of the charge sets is, but we still have many options for different ways we can go.

The thing I'm going to talk about is the branch I started back in 2013 and 2014: the IPolQ, or IPQ, line of force fields. The idea started with Karamertzanis, and it goes all the way back to something from around 2000 or so, from some earlier work [name unclear]. Karamertzanis took a look at a set of dipoles, or just imagine a theoretical set of dipoles, in an electrostatic field strong enough, or generated by something large enough,
that it's basically like a temperature-controlled bath, except here it's an electrostatic field that the charges do not reciprocally influence. If those are inducible dipoles and you allow them to polarize to equilibrium, and then you look at the energy of that system, you realize: in fact, what you have is the energy of a system of fixed dipoles, as if you had taken the configuration of the polarized dipoles and neglected the fact that it took energy to polarize them, averaged with the energy of the dipoles in the vacuum phase. Transitioning to the monopole systems we've got, the argument became: we should compute the polarized charge distribution, neglecting for the moment that the energy of polarizing is not trivial, and then average that polarized charge distribution with the vacuum-phase charge distribution for the same molecule and the same configuration. Thereby we account for the polarization energy and for the fact that the charges of our fixed-charge model do not polarize to the full extent that they would in a polarizable model; we simply have to account for that energy in some fashion. This idea has gained a lot of traction in various circles. Other people have made edits to the IPolQ force fields, and the Open Force Field team in particular has taken it a little further, making it a sliding scale: how much do we mix the polarized and the unpolarized, vacuum-phase charge sets? The mixing parameters they come up with are around 0.4 to 0.6, leaning in fact toward the polarized charge distribution.
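The energy bookkeeping behind that half-and-half average can be sketched for a single inducible dipole in a fixed external field E; this is my own summary in the linear-response limit, not a formula from the slides:

```latex
\begin{aligned}
\mu_{\mathrm{pol}} &= \mu_{\mathrm{vac}} + \alpha E
  && \text{(dipole polarized to equilibrium)} \\
U_{\mathrm{polarizable}} &= -\mu_{\mathrm{pol}} E + \tfrac{1}{2}\alpha E^{2}
  = -\mu_{\mathrm{vac}} E - \tfrac{1}{2}\alpha E^{2}
  && \text{(interaction minus the cost of polarizing)} \\
U_{\mathrm{fixed}}(\bar{\mu}) &= -\bar{\mu} E \;=\; U_{\mathrm{polarizable}}
  \;\Longleftrightarrow\;
  \bar{\mu} \;=\; \mu_{\mathrm{vac}} + \tfrac{1}{2}\alpha E
            \;=\; \tfrac{1}{2}\bigl(\mu_{\mathrm{vac}} + \mu_{\mathrm{pol}}\bigr)
\end{aligned}
```

So a fixed dipole set to the midpoint of the vacuum and polarized values reproduces the net energetics of the polarizable system, polarization cost included; the monopole version of that argument is what motivates averaging the two charge sets.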
(Which I'm not going to talk about too much more, but electrostatic potential fits of a polarized charge distribution, although it sounds like they would be harder, are in fact easier to do with point-charge, nuclear-centered models than fits of the vacuum-phase charge distribution.)

You might have seen in that last slide that there was a cyclical process, and indeed it was an iterate-to-convergence sort of thing. To get the polarized charge distribution, what we were doing was taking a bath of solvent molecules, doing MD simulations to get an average equilibrium solvent charge density, then using that in the quantum calculation, in the QM/MM sense, to induce the molecule to polarize, and then measuring what the polarized charge distribution comes out to be. Well, if you change the initial charge distribution in your molecular mechanics model, then the solvent charge distribution is going to change a little bit, and you have to iterate that to convergence. Similarly with the torsion fitting. There are a lot of different parameters to fit, and the generational learning we were doing was: let's make a molecular mechanics model that, as closely as possible, mimics the quantum mechanical energy surface over the configuration space that the molecular mechanics model is going to explore in a simulation. So we would fit a force field, then do molecular mechanics optimizations of those same configurations, then recalculate QM energies for the re-optimized configurations, and see: okay, did our model learn correctly? Did it actually track the QM surface as we allowed it to drive the molecular configuration into a new state?
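To make the shape of that charge-refinement cycle concrete, here is a toy fixed-point iteration. Everything in it is a stand-in: the real protocol alternates MD (average solvent charge density around the current model) with a QM/MM polarized-charge fit, whereas here a made-up linear solvent response plays both roles, so only the loop structure, not the physics, is meaningful.

```python
import numpy as np

# Toy sketch of the IPolQ iterate-to-convergence cycle. The real protocol
# alternates MD (solvent density around the current MM charges) with QM/MM
# (polarized charge fit in that solvent field); here both steps are replaced
# by a linear toy response. All numbers are invented for illustration.

q_vac = np.array([0.417, 0.417, -0.834])   # vacuum-phase charges (toy, water-like)
alpha, coupling = 0.5, 0.6                 # toy polarizability / solvent response

def toy_qmmm_polarize(q_mm):
    # "QM/MM" step: solvent organized around q_mm pushes charges past vacuum values
    return q_vac + alpha * coupling * q_mm

q = q_vac.copy()
for it in range(100):
    q_pol = toy_qmmm_polarize(q)            # polarized charges in the solvent field
    q_new = 0.5 * (q_vac + q_pol)           # IPolQ average of vacuum and polarized
    if np.max(np.abs(q_new - q)) < 1e-10:   # converged: charges self-consistent
        break
    q = q_new

# In this linear toy the fixed point is analytic: q* = q_vac / (1 - alpha*coupling/2)
print(it, q)
```

In this linear toy the error contracts by a factor of alpha*coupling/2 per pass, so a handful of iterations suffices; the real cycle is likewise repeated until the solvent density around the model stops changing.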
By doing that we had another sort of generational learning, a cyclical process, and we were doing this all in conjunction with the IPolQ protocol. By the first time I had done a torsion fit for this new, semi-polarized charge set, I realized: okay, we have all this gas-phase quantum data, but we have polarized charges, and the internal electrostatic energy of those charges is part of the energy surface. So what do we do? The answer I came up with, which really started the IPolQ line, was that we do have two charge sets, and we have to treat them individually even though we're going to combine them. We say: take the gas-phase charges and use those during the torsion fitting; we're going to fit torsion parameters in the context of those gas-phase charges, using gas-phase quantum data. Then we transfer those torsion parameters to work in concert with the polarized charge set and do our simulations. And so that wheels-within-wheels validation process rolled forward, and the first thing that came out was ff14ipq, as you can see in this timeline. This is all least-squares torsion fitting and least-squares electrostatic potential fitting, so it's a one-off: you fit the force field, but then you have to let the force field drive the molecule and see where it's going to take things, and then make new fitting data. That, again, is where the cyclical process comes from. So we went through a lot of the formalities of how we do the IPolQ line, and I made a sort of disastrous mistake by refitting some of the Lennard-Jones parameters in too harsh a manner. Karl Debiec is the one who caught this, after we published ff14ipq in 2014.
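The least-squares torsion step itself is simple at heart: a linear fit of cosine-series amplitudes to target energies. This sketch uses a synthetic target with known amplitudes in place of real QM-minus-MM residual energies:

```python
import numpy as np

# Minimal sketch of least-squares torsion fitting: choose cosine-series
# amplitudes so the MM torsion term tracks target energies. The "QM minus
# MM-without-torsions" residual here is synthetic with known amplitudes;
# a real fit uses large sets of single-point energies.

phi = np.linspace(-np.pi, np.pi, 72, endpoint=False)   # scanned dihedral values
true_amps = {1: 1.8, 2: -0.9, 3: 0.6}                  # hidden "QM" amplitudes (kcal/mol)
target = sum(a * (1 + np.cos(n * phi)) for n, a in true_amps.items())
target = target + 4.2   # arbitrary energy offset, absorbed by a constant column

# Design matrix: one basis function per periodicity, plus a constant offset.
periods = [1, 2, 3]
A = np.column_stack([1 + np.cos(n * phi) for n in periods] + [np.ones_like(phi)])
amps, *_ = np.linalg.lstsq(A, target, rcond=None)

print(dict(zip(periods, np.round(amps[:3], 6))))
```

Because the basis is linear in the amplitudes, an entire parameter set can go into one big design matrix, which is how the torsions and angles were later fitted simultaneously.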
He called me up and said, you know, I think there's something wrong, and I said, I think you're right: I've been seeing a lot of stickiness, side chains sticking to backbones and things settling down where they shouldn't. And he said, well, I think I know what's wrong. And I said, I think you're right. So what we did was roll back a lot of the Lennard-Jones refitting and put it all instead on the polar hydrogens. We got a preliminary ff15ipq, and we had all this infrastructure in place, so in only three months or so we were able to incorporate a lot of new torsions and lots and lots of new data; we had nearly a quarter million data points, single-point calculations on various dipeptide and tripeptide residues. Three months later we had our finalized ff15ipq, which, like I said, is very competitive in terms of its performance, though there's also a very high bar to meet in the way molecular force fields for proteins behave these days. One thing we did along the way was to incorporate angle refinements: we weren't just working with torsion refinements anymore but also doing least-squares optimization of the angles. Again, this is all done in one big matrix; the angles and the torsions are all fitted simultaneously. It really does appear that the angle refitting did a lot of good, not just for secondary structure stability but also for things like the Ala5 J-couplings. That's really looking at backbone propensity, things that will evolve into secondary structure as you go from a 5-mer to a 10-mer to a 15-mer. ff15ipq was one of the first force fields to get below a mean chi-squared value of one by any of these metrics, though depending on which set of Karplus coefficients you choose, you may or may not get different answers.
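For readers unfamiliar with the metric: a scalar coupling is predicted from a backbone dihedral through a Karplus relation, ensemble-averaged over the trajectory, and scored against experiment as a chi-squared. The sketch below uses one published-style coefficient set purely for illustration, plus entirely synthetic dihedral samples and "experimental" couplings, so the numbers mean nothing; swapping the A, B, C set is exactly the coefficient-choice sensitivity just mentioned.

```python
import numpy as np

# Sketch of scoring J-couplings against simulation via a Karplus relation:
#   3J(phi) = A cos^2(phi - 60 deg) + B cos(phi - 60 deg) + C
# A, B, C below are an illustrative coefficient set; the phi samples and
# "experimental" couplings are synthetic.

A_k, B_k, C_k = 7.09, -1.42, 1.55          # illustrative Karplus coefficients (Hz)

def karplus(phi_deg):
    th = np.deg2rad(phi_deg - 60.0)
    return A_k * np.cos(th) ** 2 + B_k * np.cos(th) + C_k

rng = np.random.default_rng(0)
phi_traj = rng.normal(-70.0, 25.0, size=(5, 20000))   # fake phi trajectories, 5 residues
j_sim = karplus(phi_traj).mean(axis=1)                # ensemble-averaged <J> per residue
j_exp = np.array([5.6, 5.9, 5.7, 6.0, 5.8])           # made-up "experimental" values (Hz)
sigma = 0.4                                           # assumed experimental uncertainty (Hz)

chi2 = np.mean(((j_sim - j_exp) / sigma) ** 2)
print(np.round(j_sim, 2), round(chi2, 2))
```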
But the ff15ipq force field did extremely well by several of these metrics in getting these J-couplings covered. And just in case anyone's wondering how we fit the angles: we were fitting both the equilibria and the stiffness constants. Fitting the stiffness constant is pretty obvious; it's just like fitting a torsion amplitude. But we also fit the equilibrium, so we're not saying the angle has got to be 109 degrees; it could be 104, or 114, and least squares gets to figure that out. The way that works is with two parabolas, two basis functions, because the sum of two parabolas is another parabola. You can see in the black line here that the position of the minimum changed, the value at the minimum changed, and the steepness of the parabola is a consequence of what we're fitting. To account for the fact that the value at the minimum changed between these models, we have one additional constant, essentially per system, and that pretty much takes care of it. So all of this is being handled: we have just two basis functions, and we are fitting, with complete mathematical rigor, the equilibrium value as well as the stiffness of these angles. And so then we can look at things like alpha-helical propensity in K19; we'll look at alpha-helical propensity in the ff19SB force field as well. What I can say here is that it's reasonably successful: it's a little bit shy of the 40% target, and we want these things in fact to be metastable. And in fact we do see the helix unfolding and then refolding at various times in the simulation.
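The two-parabola trick works because a weighted sum of two parabolas with fixed centers is itself a parabola, so two linear weights tune both the equilibrium angle and the stiffness at once. This little check uses arbitrary basis centers and weights of my own choosing:

```python
import numpy as np

# Why two parabolic basis functions suffice to fit an angle's equilibrium and
# stiffness: any weighted sum of two parabolas is another parabola. The basis
# centers (104 and 114 degrees) and weights are arbitrary for this sketch.

a, b = np.deg2rad(104.0), np.deg2rad(114.0)    # fixed basis-function equilibria
k1, k2 = 30.0, 50.0                            # weights a least-squares fit would find

theta = np.deg2rad(np.linspace(90, 130, 401))
combo = k1 * (theta - a) ** 2 + k2 * (theta - b) ** 2

# Closed form of the resulting single parabola:
K = k1 + k2                                    # net stiffness
theta0 = (k1 * a + k2 * b) / K                 # net equilibrium angle
C = k1 * a**2 + k2 * b**2 - K * theta0**2      # constant offset (handled per system)
single = K * (theta - theta0) ** 2 + C

print(np.rad2deg(theta0))                      # equilibrium lands between 104 and 114
```

Least squares only ever sees the two weights k1 and k2, so the fit stays linear even though the effective equilibrium theta0 moves; the leftover constant C is exactly the per-system offset mentioned above.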
I still think the overall propensity may be even less than we measured; we threw out the first microsecond of these runs in order to measure it, but it's very difficult to look at these things without concluding that we need enhanced sampling, probably on the tens-of-microseconds timescale for each of these temperatures, to really understand exactly what's going on here. In a later slide I'll show you that for ff19SB, that is what they did. Beta-sheet propensity: again, we're getting beta sheets folding and unfolding. This thing is supposed to be about 50% folded at 295 Kelvin; I think it's again a little bit shy, but it's in the ballpark, and we do see transient stability. So this is actually all very, very good news, especially considering we've completely refitted the charge set, completely refitted the torsion set, and some of the angle parameters as well. This is a different force field; it's completely sui generis. And finally, I'm just going to review the Trp-cage folding we got with this thing. We did get about the right melting temperature for Trp-cage; it's supposed to melt around 300 to 305 Kelvin, and that's about what we see, which is also what happens for ff14SB, the contemporary of ff15ipq: they also got about 305 to 310 Kelvin melting temperatures. So again, these are force fields derived with very different parameters, all different save for a few of the bonded parameters and some of the heavy-atom Lennard-Jones, yet these very different force fields get very similar results.
And then we can ask a question. If we rewind: I've got a polarized charge set, I've got a gas-phase charge set, and I did this funny thing of fitting the torsions in the context of the gas-phase charges, only to combine them with the polarized charges to actually do simulations. Well, what if I just ran with the gas-phase charges? Because everyone tends to take a charge set, fit torsions in the context of that charge set, and then roll. Well, we did this study: Trp-cage as well as a number of other systems, just repeating the validation we had done for ff15ipq, in a follow-up. If we take just the vacuum-phase charges, the yellow line you see here, Trp-cage is melting way too early. In fact, it's already melting at 285 Kelvin. I'm sorry I don't have the temperatures on these plots, but the one in the upper left is 275 Kelvin; at 285 it's still melting; at 295 we didn't see a melting event, though it looks like one might have started at the very end; and at all the other temperatures it melts. Then, what if we had done what is essentially the ff14SB fitting protocol: take the polarized charges, fit torsions in the context of the polarized charges with gas-phase data, and run simulations? If we had done that, things actually do look better. But if we had just followed that protocol strictly, again making an ab initio force field that way without any intervention, without many graduate students looking over the thing and making sure, we would still have some problems. So this alternative ff15ipq is a test force field made with what I think is the wrong protocol, and it turned out to be slightly worse than the canonical protocol that we put forward: we did see a melting event at 275 Kelvin, and then we didn't see any other melting until about 315 Kelvin.
There was also a little less stability in GB3, a small globular protein. But from this, what we can say is that the way we made ff15ipq, using the polarized charge set in simulations but training torsion and bonded parameters with the vacuum-phase charge set, was the best of these three options. Certainly you need a polarized charge set; that's a given from this and other data we collected. Having the polarized charge set seems to be the bulk of the battle, no matter how you did the torsions: once you've got that polarized charge set, the way you fitted the torsions does have some consequence, but just having the charges be polarized ensures a lot of tertiary structure is maintained in these simulations. And I don't want to just toot my own horn for this entire talk, so I want to mention some of the latest work from the Simmerling group, which is this very impressive ff19SB force field. This is a CMAP-based force field; I'm going to talk more about CMAPs a little later, but they did the CMAP fitting, and I have to commend them for how they did it, for the effort they put into doing it and the ambition they had, because they had essentially 9,000 parameters and they came out with a force field that doesn't appear to be too overfitted. I think this is a very good thing. Carlos is quite circumspect when he talks about it, saying: well, we think it's better than ff14SB. I think he's correct; this is again a very solid product from the Simmerling group. The slide here shows helical propensity, and it represents 5,000 microseconds, five milliseconds, of MD across about 80 different systems, a helix of each amino acid, with these four different force field combinations.
The OPC water model is a four-point water. It's a funny-looking water model, but in a lot of ways it seems to have a lot of the right properties, so we'd like to see proteins behave correctly in the context of that better water model rather than this old TIP3P, which doesn't even really work with Ewald electrostatics. When you finally run these things out to tens of microseconds per system, you can start to see that the water model does in fact have an effect on the secondary structure. We say it has only a minor effect on hydration free energies and such; well, it has a noticeable effect in the long run on the secondary structure. And in fact, for ff19SB in this combination with OPC, you might be a cynic and say, okay, it's because aspartate, or maybe glutamate, fell in the right spot, and had that not happened... I've seen something very similar with ff14SB: if only aspartate and glutamate had been in the right spots for that one, it would have had a pretty tight-to-the-trend-line slope as well. But it's encouraging to see all the effort they put into ff19SB yield a better model like this. They refit torsions, or rather replaced torsions with CMAPs, in that force field, and they also changed all the side-chain torsions. They were taking into account some of the arguments I was making on the previous slides, but they came up with their own solution, which I think is probably even better, because they incorporated solvation effects into the torsion fitting by finding a DFT method that looked very similar to GB/SA energetics.
So for the GB/SA energy profile, the solution-phase energy of these molecules, they found a DFT method, along with a solvation correction, that tracked it pretty closely, and then they used that DFT method with the solvation correction to do all of their torsion scans. Consequently, they used the GB/SA energy as part of the internal energy against which they were fitting all those torsions. So their torsions are fitted in the context of some sort of solvation effect, and I think that probably had a very positive influence on their overall results.

Now I'm going to transition to talking about options for improving these biomolecular force fields, the ones that really jump out before we go to polarization and things that will make a simulation take at least four times as long. You can add particles to the model, to whatever limited degree you want: we can transition something like TIP3P or SPC toward TIP5P. Or you can have tabulated potentials, these CMAP terms: you can turn dihedral cosine series into cubic-spline-based map functions. The other route you can go is to improve the fitting process. So: make the model more elaborate, or improve the fitting process. We've talked a lot about incorporating solvation effects. Also, when we think about the fitting: if we just do a least-squares fit, we get one answer, the optimum least-squares answer. But could we do a genetic algorithm or something like it? Could we find conditions to perturb the least-squares fit such that we would get many degenerate solutions, some of which are in fact very different from one another, but which all come very close on the same figure of merit?
If we could do that, we could take a diversity of different solutions for the entire molecular mechanics parameter set, prune them using experimental data, and perhaps pick a winner. So the first thing to look at is the extra points for elaborating the model. These are six frame styles of extra points; extra points in simulations are sometimes called virtual sites. They do not have any mass, but their positions are entirely determined as a consequence of the positions of frame atoms that do have mass. If water is the example, you've got the oxygen being the parent atom, probably, and the two hydrogens being the other frame atoms. The simplest style here, style one, is really good for putting things along a bond vector if you want charges there, which is actually a very good option, but it's also viable for something like a sigma hole on an organic halide. You put an extra, slightly positive charge out beyond the chlorine on a carbon-chlorine bond, along that bond vector: you're putting a positive charge out beyond what you'd think is a very electronegative chlorine atom. That creates what's called the sigma hole, and it does very, very well for improving electrostatic potential fits as well as hydration free energies. I'm not going to go too much into that, but it's a real success in terms of the way extra points can improve hydration free energies. Otherwise, the hydration free energies of these small molecules are largely a consequence of the overall dipole, and improving the electrostatic potential fit with extra points does not change the overall dipole of the molecule that much. So there are select cases in which these extra points do improve hydration free energies.
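The style-one placement described above amounts to a few lines of vector arithmetic. In this sketch the coordinates and the 1.6 Angstrom extension are illustrative numbers I chose, not parameters from any published force field:

```python
import numpy as np

# Sketch of a "style 1" extra point: a massless site placed along a bond
# vector, here a sigma-hole charge beyond the chlorine of a C-Cl bond.

def bond_vector_site(parent, other, distance):
    """Place a virtual site `distance` beyond `parent`, on the line from
    `other` (e.g. the carbon) through `parent` (e.g. the chlorine)."""
    u = parent - other
    u /= np.linalg.norm(u)          # unit vector along the C->Cl bond
    return parent + distance * u    # site sits past Cl, in the sigma hole

carbon = np.array([0.0, 0.0, 0.0])
chlorine = np.array([0.0, 0.0, 1.78])        # roughly a C-Cl bond length, Angstrom
ep = bond_vector_site(chlorine, carbon, 1.6) # site placed 1.6 A past the chlorine
print(ep)
```

Because the site is massless, any force acting on it has to be redistributed back onto the frame atoms by the chain rule of this placement formula, which is what the simulation code's virtual-site machinery handles.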
What I'm hoping is that overall we can improve the details, the nuances, of the electrostatic potentials to get better hydrogen bonding propensities and, as a consequence, better secondary structure propensities. You can look through all the other frame styles here; I really encourage people to look at the GROMACS manual, because that's where virtually all of this came from. All of these will now be implemented in the pmemd package: they are in the code, in a working branch, and they will probably be incorporated as a patch somewhere in the middle of the Amber 20 release cycle. When we think about these extra points, a lot of people initially default to: okay, these are Lewis structures, right? Or we can make Lewis structures out of them. This is what I think of as taking the freshman chemistry off-ramp in computational chemistry, and I think it comes about because we have a lot of people who are very well versed in mathematics. They know a lot of computational science; they may not have taken as much chemistry in college. It's not to say that we don't have lots and lots of education, but the weight of that mathematical background tends to pull us toward these models. So TIP5P is actually not really much more successful than TIP4P, even though most chemists would not think to put a charge site somewhere like in the elbow of that water molecule. But if you're looking only at these special chemistry positions, you're missing out on a lot of what you could do with extra points. If the electrostatic potential fit is your figure of merit, the lone-pair Lewis-structure positions often do not have that much effect, even on sulfur, where you've got the lone pairs as well as a quadrupole.
You're often even a little better off putting things along the bond centers, and that's kind of surprising, but in some sense that's where a lot of the electrons in these molecules are. So extra points can have a moderate effect on the accuracy of the electrostatic potential fitting. If you have, say, 50% more particles in your monopole distribution, with extra points along a number of the bond vectors, particularly in the polar groups, you've increased the cost of your simulation by a factor of perhaps two, but you have improved the fitting of the electrostatic potential by up to 40%. How much effect that has on the accuracy, not just of hydration free energies but of hydrogen bonding propensities, we'll have to see; no one has ever really made an extensive extra-point force field for general biopolymers. There is an extra-point force field for sugars that seems to perform better than a nuclear-charge-only force field and works very well with TIP5P, but it's about 10 or 15 years old; there have been a lot of simulations with it since, but there's a lot more development to be done in this area. The other way to elaborate these models, which I want to do a deeper dive into, is the CMAPs. A CMAP is basically a tabulated potential, and you do bicubic spline interpolation between the tabulated grid points. If you look in any topology that has CMAPs, it gives you the values at the grid points, and then the program is trusted to fill in the details, figure out how the bicubic spline interpolation works, and roll with it. That tells you right there that all that matters are the values at the grid points; if we know those, how do we get everything else? Well, the question is this: we have some point anywhere in space.
How do we go back to the values at the grid points and figure out the exact value at our point of interest? We have a particular position on the Ramachandran plot; how do we figure out the value of the CMAP given all these grid points? Here is bicubic interpolation, and here is the formula for it all. You have to plug in all of these definitions. This is the concise, compact form; if you look in the code itself, in Amber or in CHARMM, you're going to see a big 16 by 16 matrix doing this matrix multiply, and this is just a more compact way to say the same thing. What you can see here is that you need the values at the four surrounding grid points, you need delta x and delta y within the one bin, and you need the values of the derivative in x, the derivative in y, and the cross derivative in x and y at each of those four grid points. Once you know all those corner values, you can do your matrix multiplication, come out with this A matrix, and multiply it by your powers of delta x and delta y, where of course one is delta x to the zero and one is delta y to the zero. With this big matrix-vector multiply you come out with a scalar: the value at that (delta x, delta y) position. Essentially, what this is saying is that if I know the values at the grid points, and I know the derivatives at the grid points, then I know the value anywhere in space. That's interesting, because then if I wanted to fit a CMAP, naively you'd think: I need to know the values of the function at all the grid points. We're going to see why that's trickier than it sounds a little bit later, but first I'm going to show you my friend Arianna.
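The interpolation just described can be sketched compactly in Hermite form, which is mathematically equivalent to the 16-coefficient bicubic formula and the big matrix multiply in the code. This is my own minimal sketch of the interpolation inside one unit grid cell, assuming you already have the values, first derivatives, and cross derivatives at the four surrounding corners:

```python
import numpy as np

def hermite(t):
    # Cubic Hermite basis, ordered to pair with [f(0), f(1), f'(0), f'(1)]
    return np.array([2*t**3 - 3*t**2 + 1,   # weight on f(0)
                     -2*t**3 + 3*t**2,      # weight on f(1)
                     t**3 - 2*t**2 + t,     # weight on f'(0)
                     t**3 - t**2])          # weight on f'(1)

def bicubic(M, dx, dy):
    """Value inside one unit grid cell from the corner data.

    M is a 4x4 matrix pairing the x-data [f(0), f(1), fx(0), fx(1)] (rows)
    with the y-data [f(0), f(1), fy(0), fy(1)] (columns): function values,
    first derivatives, and cross derivatives at the four corners.
    """
    return hermite(dx) @ M @ hermite(dy)

# Check against f(x, y) = x*y on the unit cell: fx = y, fy = x, fxy = 1
M = np.array([[0.0, 0.0, 0.0, 0.0],    # f(0,0),  f(0,1),  fy(0,0),  fy(0,1)
              [0.0, 1.0, 1.0, 1.0],    # f(1,0),  f(1,1),  fy(1,0),  fy(1,1)
              [0.0, 1.0, 1.0, 1.0],    # fx(0,0), fx(0,1), fxy(0,0), fxy(0,1)
              [0.0, 1.0, 1.0, 1.0]])   # fx(1,0), fx(1,1), fxy(1,0), fxy(1,1)
value = bicubic(M, 0.5, 0.5)           # exact for this low-order surface
```

Because that corner data fixes 16 conditions per cell, the patch matches the 16 coefficients of the bicubic polynomial exactly, just organized differently.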
She knows a lot of math, and this is something she taught me about how to do cubic spline fitting and interpolation, and thereby fit a cubic-spline-interpolated map using arbitrary data. Give me data anywhere in space, as long as it covers the map with reasonable density, and I can tell you what the ideal map should be to fit that data. Let's put her in between LeBron, James, Dua, Abel, Enrique, Selena, Arnold, and Ellie. But for the small perturbation caused by James in this case, which will not be a disaster, we could trace a cubic spline function over the tops of each of their heads, and it's practically symmetric. So Arianna can just look up and say the derivative at her point is zero. It's just a symmetry argument, a no-brainer. What happens, though, if we mix up all these people? Is Arianna going to get worried? Well, she knows a lot of math. Her formula is: look at the height of the person next to you and multiply by the constant I'm showing here; then go to the other side and multiply that person's height by the negative of that constant. Arianna just keeps going like this, side to side, and she can apply this stencil to get the value of her derivative. She's multiplying by this one constant you see here, and by powers of that constant you see at the bottom. So it's an exponentially decreasing contribution from the heights of the people further and further away from her that determines what her derivative is going to be, but her derivative is entirely a function of the heights of all the other people at these equally spaced positions around her.
Now, Selena can do the same thing; in fact she can apply the same stencil, looking left and right, left and right. Ellie might be able to do the same thing, but we're getting really close to the boundary, and Arnold is going to realize he cannot terminate the series. Dua, though, has new rules, and they speak to the general case. I'm not going to go into the general case too much, because it's not needed for periodic boundary conditions, though for some of the boundary conditions I'll talk about later it is. Arianna helped me work all this out. With these formulas you can make stencils to get the derivative at any grid point as a function of the values of your tabulated function at all of the other grid points. And so, by all of this math, if you run it back through that big series of equations, plug and chug, the surface value anywhere is a linear combination of the values everywhere. What this looks like in practice, you can imagine. If you're right on a grid point, what's the value of your function? It's 100% the value at that grid point. As you step off the grid point, you start to pick up a slight positive dependence on your nearest-neighbor values. And by the time you get right in between four grid points, over here on the very right, the value of your function is an average of the four corners plus a little contribution from points further out. What you're seeing here looks like a wavelet distribution. In fact, the value at any arbitrary point depends on the values at all of the grid points, with exponentially decreasing importance. If you have a 64 by 64 grid, you can clip this series at 30 grid points in any direction, and if you clip it right there, you've already got your answer to machine precision.
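As a sketch of the math Arianna is applying, here is the periodic 1D case in Python. The C2-continuity conditions of a cubic spline tie the nodal derivatives to all the nodal values through a cyclic tridiagonal system, and the rows of that system's inverse decay geometrically, which is exactly the exponentially shrinking stencil described above. The grid size and test function here are my own choices for illustration.

```python
import numpy as np

N, L = 64, 2 * np.pi
h = L / N
x = np.arange(N) * h
f = np.sin(x)                                  # tabulated "heights"

# C2 continuity of a periodic cubic spline gives the cyclic tridiagonal
# system  d[i-1] + 4*d[i] + d[i+1] = 3*(f[i+1] - f[i-1]) / h
# for the nodal derivatives d.
A = 4 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
A[0, -1] = A[-1, 0] = 1.0                      # periodic wrap-around
rhs = 3.0 * (np.roll(f, -1) - np.roll(f, 1)) / h
d = np.linalg.solve(A, rhs)                    # spline derivative at every node

# Each derivative is a linear combination of ALL the nodal values, with
# weights that fall off geometrically, by a ratio of sqrt(3) - 2 (about
# -0.27) per node, so a clipped stencil converges to machine precision.
W = np.linalg.inv(A)
```

The `d` you get this way is the exact nodal derivative of the interpolating spline, with no finite-difference step size to tune, which is the point of the stencil trick.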
So this is a very interesting approach if you have to spline-tabulate something that you already know at regular intervals. You don't have to recalculate the values at very small discretizations and take a numerical derivative; you can just take the values you've already got, even if your function was very expensive to compute, and get the correct answer to within FP64 machine precision just by looking at those values. It's really remarkable that the math works out that way. The upshot is that, as I said, naively, if you wanted to fit a CMAP, you might think you need to put your protein at every one of these grid points in Ramachandran space, nail it there, and then look at the energies and see what you got. If you try that, you quickly realize that you can't quite restrain your molecule to any particular grid point; it doesn't want to stay there, and even with harmonic restraints it's going to drift off a little. You could get really tricky and move the origin points of the restraints around so that the molecules do in fact fall on the grid points; that would be very laborious, but you could do it. What FF19SB actually rolled forward with was something like the orange plot here: they had configurations that were very close to the grid points, close enough, I would say, and they still got very good results in their simulations, but they were working under the assumption that they had to snap those values to the grid points. What Arianna's math is telling us, though, is that you could instead have something like this green data set, which, not coincidentally, looks a lot like the Ramachandran distribution of a canonical amino acid. You could have amino acids in their native configurations.
As long as you've got a smattering of coverage elsewhere over the space, you don't want to leave big gaps in your data; two grid spacings is about as far as I would go. But as long as you have decent coverage and enough data points to inform what your map should be, any data set will actually do, and it's just as valid as any other. So here are some fitted glycine CMAPs. I have my own force field fitting code, which is what Karl Debiec and Anthony Bogetti have been using to fit their force fields, and I've taught it to do CMAPs now. The CMAPs are fitted alongside all the other torsions and angle parameters in the same least-squares fit, so you do not have to fit one set of parameters and then another on top of it and keep propagating error that way; it's all done in one fell swoop. We can compare the FF19SB glycine CMAP to the ones I fitted using data I plucked from the FF15ipq training set. There are some differences you can see in the spacing of the two blue peaks in the middle, and that's probably largely a consequence of the data being different: I was using a different quantum method, mine is in vacuum and theirs is in a solution-phase environment. But I think what's remarkable about FF19SB is that their map is very nearly symmetric, except for a few small points which I'm showing in these difference plots to the right of each of the CMAPs. They did a very good job of curating and massaging the data and making sure these configurations were well optimized and well relaxed, and they got a very symmetric map overall for glycine. Obviously you don't expect symmetry in alanine or any of the other amino acids, but for glycine you should see this two-fold symmetry. The only place I really recover it well, even with 6,000 data points spread all over the map, is in the 8x8 CMAP.
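Here is a 1D sketch of that fitting idea, a stand-in for the 2D CMAP case; the node count, sample count, and random target are arbitrary choices of mine. Because the interpolated surface is linear in its tabulated nodal values, scattered samples anywhere on the axis determine the nodal values by ordinary least squares, with no snapping to grid points required:

```python
import numpy as np

def periodic_spline_eval(v, xq, L):
    """Evaluate the periodic cubic spline through nodal values v at points xq."""
    N = len(v)
    h = L / N
    # Nodal derivatives from the cyclic tridiagonal C2-continuity conditions
    A = 4 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
    A[0, -1] = A[-1, 0] = 1.0
    d = np.linalg.solve(A, 3.0 * (np.roll(v, -1) - np.roll(v, 1)) / h)
    i = np.floor(xq / h).astype(int) % N        # bin index for each query
    t = (xq - i * h) / h                        # fractional position in bin
    j = (i + 1) % N
    return ((2*t**3 - 3*t**2 + 1) * v[i] + (-2*t**3 + 3*t**2) * v[j]
            + (t**3 - 2*t**2 + t) * h * d[i] + (t**3 - t**2) * h * d[j])

rng = np.random.default_rng(7)
N, L = 12, 2 * np.pi
v_true = rng.normal(size=N)                     # "CMAP" nodal values to recover
xs = rng.uniform(0.0, L, size=400)              # scattered, off-grid sample points
ys = periodic_spline_eval(v_true, xs, L)        # energies observed at those points

# The spline is linear in its nodal values, so build the design matrix from
# the response to each unit nodal value, then solve by least squares.
B = np.column_stack([periodic_spline_eval(np.eye(N)[j], xs, L) for j in range(N)])
v_fit = np.linalg.lstsq(B, ys, rcond=None)[0]
```

The same construction extends to the periodic 2D case: the design matrix simply gets one column per CMAP grid point, and the torsion and angle terms can share the same least-squares system.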
I've made my fitting program able to fit 8x8, 12x12, or 6x6 maps, and from any of those you can make a 24x24 map without losing any information; that's what the code is doing, publishing 24x24 CMAPs even though I've fitted 8x8 or 12x12. You can see the directly fitted 24x24 CMAP is wrinkled as a prune; it's definitely overfitted. But even at 12x12, you're seeing a lot of loss of symmetry, despite the fact that I've got 6,000 data points for only 144 fitted parameters. So the upshot here is that we have to be very careful how we design this data set, but I think overall a 12x12 map is a good compromise in terms of the detail it can capture. In terms of the figure of merit, just getting the RMS error against these quantum energies, 12x12 is about as good as you can get; 24x24 is slightly better, but only very marginally, and again, the overfitting risk gets very high. Lastly, I want to look at what CMAPs can do in terms of the RMS error: how can they improve our figure of merit? Start with the FF19SB torsion set. FF19SB deleted some torsions and replaced them with CMAPs. If I instead put those torsions back and delete the CMAPs, these are the RMS errors I get for the various systems along the x axis here. I've ranked the systems by their overall ease of fitting: the single amino acid residues are down here on the left, and as you go to the right you get the charged amino acids and then the multi-residue systems. The fitting data actually comes from the FF15ipq data set, and I'm fitting it with the FF19SB torsion space. If I then fit new CMAPs using the FF19SB CMAP space, I get this purple and orange line. So this is the effect of the CMAPs: in almost all cases they do better than torsions.
And they do so by about half a kcal/mol, up to one kcal/mol per system. However, what if I fit with an even larger torsion space? FF19SB actually truncated the torsion space quite a bit; they compacted a lot of their atom types back into one, rolling back the diversification they had in FF14SB, and I think that was the right move for FF19SB. With FF15ipq we went forward and maybe even pushed the diversification a little further. So if you have a very vast torsion space, as opposed to a CMAP space, what do you get? The FF19SB CMAP space again gives this purple and orange line, and the FF15ipq torsion space actually seems to beat the CMAPs on some of these larger systems. But this is a beware kind of result, because it may indicate that FF15ipq is overfitted, especially on these larger systems, where we didn't have as much sampling and the fit was perhaps finding ways to cheat on the larger systems at the expense of the smaller ones, on which the expanded torsion space still does not do as well as the CMAPs. Now, to reassure anyone who still likes to use FF15ipq: this is not the entirety of its fitting data. There was a generational learning process, and I was using basically just the first generation of data and not the subsequent stuff. What you see here is 150,000 data points across various systems; there were 250,000 data points in the entire training set for the finished FF15ipq. So hopefully some of this overfitting was dealt with, and we did see some of that effect during our fitting. The moral here is that we need to understand what the overfitting problem really is, because when we step into CMAPs, if you look at CMAPs for 16 amino acids, and you combine things like phenylalanine and tyrosine into one CMAP, you've got potentially 9,000 parameters there.
So we need to understand how much fitting we can afford to do, and maybe what overfitting problem we already have that we didn't understand. I want to wrap up by proposing some strategies for biomolecular force field development. At least for our purposes in Open Force Field, we want to make our own biopolymer force field. We are porting some of the Amber force fields into OpenFF format right now, but for our own force field, just to have a product demonstrating that we can apply our methods to create a viable polymer force field, it should be extensible and easily incorporate the hundreds of potentially non-natural amino acids. I think we should stick with common sets of nitrogen, hydrogen, carbon, and oxygen backbone charges, very much like the Cornell charge sets and what we had to do in FF15ipq; we found that to be the best choice. If we have non-native residues, we should likewise pick some common atoms there and have some common charge sets, and then pair those with common sets of backbone torsions. Really, the reason to have common backbone charges is so that the common backbone torsions can have an easier time fitting their part of the problem. Lee-Ping and I have been discussing this, and we think B3LYP with, in my case, a minimally augmented def2-TZVPP basis set, plus Becke-Johnson damped D3 dispersion corrections (which I didn't have on the slide), is a good balance of all the different forces we need. And I would suggest using tripeptides, as well as even some tetrapeptides, particularly for the charges, where you've got this Y-X-A form: Y is any amino acid out of serine, glycine, valine, and alanine. So I'm putting a polar amino acid in there, I'm putting a glycine in there, but really making it so that your amino acid of interest, the X amino acid in the center, doesn't always know what is to either side of it when you're doing these charge-set fits.
So I think that's important. I also think it's important to have a lot of conformations of each amino acid. In making the common sets of backbone charges and torsion parameters, I think we should extend that idea to DNA and RNA; probably a similar level of quantum theory will be sufficient. Then, for carbohydrates, assign common charges to each of the four most common elements based on what their neighboring atoms are: carbon double-bonded to another carbon with a nitrogen on the other side, or carbon bonded to a carbon bonded to another carbon, and all these different permutations. There are probably at least 200 or 300 unique charges you could fit there, but I think we should try for something like that, with a direct fit to ESP quantum data. We can talk about making an AM1-BCC-like charge set, some sort of bond charge correction scheme, for an even more expansive carbohydrate force field after that, but I think it's quite feasible to take a few hundred common carbohydrate monomers, look at the different ways carbon, oxygen, and nitrogen can come together along with different hydroxyl groups, and figure out a set of a few hundred common charges to fit. Then, for future directions in force field development, I want to get back to this CMAPs idea, because now that we know how to fit CMAPs, they're actually not as scary as they must have seemed to the people making FF19SB. I have to commend them for their ambition, because looking at 9,000 parameters, how are you going to do it, having to generate the data and be as meticulous as they had to be about it? It's not nearly as scary as what they had to deal with. And now that we know how to fit CMAPs, we know how to fit essentially any tabulated function. So let's think about this: it would be great if all our bonded parameters were good enough to just interpret raw quantum data.
Then we wouldn't have to do this re-optimization. One thing I've not mentioned at all in this talk is that when you do a torsion drive, what you're actually doing is a molecular mechanics optimization to relax what we hope are orthogonal degrees of freedom, so we can isolate just the torsion or the set of torsions we want to fit. Well, that changes the coordinates. The coordinates from the molecular mechanics optimization give you an energy, and they correspond to an energy from the quantum calculation, but this is all being done because the quantum and molecular mechanics models don't agree about how the bond and angle terms, the high-frequency degrees of freedom, should behave. What if our molecular mechanics parameters were good enough to follow the quantum model wherever it goes? That would mean all we have to do is take the quantum-optimized coordinates and feed them right into the molecular mechanics model, and you've got it. For the IPQ force fields, what we were doing, which actually works surprisingly well in some sense, was feeding molecular mechanics optimized coordinates into quantum single-point calculations. We justified that by saying: this is where the molecular mechanics model is going to go in a simulation, so wherever it goes, let's make sure it's still tracking the quantum energy of that surface. It's still a quantum energy; it's just not where the quantum model would necessarily have gone if quantum were driving the whole thing. But if we could make our molecular mechanics models better, which I think gets into this Class II force field idea, with bond and bond-angle CMAPs, we could probably have enough detail to make the molecular mechanics models track the quantum very tightly.
Then, first of all, it would accelerate ForceBalance fittings tremendously for very big data sets: there would be no ambiguity about which coordinates and which energy you meant, so everyone could download the data set and look at it in pretty much the same way. But the real advantage, I think, is that we would be able to train to the gradients at particular atoms. We would know what the force on each atom is supposed to be; we wouldn't have to combine all the terms together into an energy, where you can have all sorts of compensatory effects. We would know what every atomic gradient should be, and we'd have roughly 30 times as much data coming out of every quantum calculation. Taking a very long view, you could even look at these CMAPs as a way to handle hydrogen bonding. Dave Baker's group has been very successful with their Rosetta program and its tabulated hydrogen bonding potential; I think it's only about 12 grid points or so, and they probably do a cubic spline interpolation of some sort, but again, any data set that has reasonable coverage of what hydrogen bonds look like could be enough. Now, there are a lot of wrinkles here, because hydrogen bonding all happens in the context of fitted charges, and who knows what the fitted charges on this hydrogen or that might be; how would you fit a common set of hydrogen bonding surfaces for that? That's why this is a long view, a far-future kind of thing. But I think it's time to start looking at it, because once you have bond-angle CMAPs, a hydrogen bond CMAP is hardly any different. With that, I would like to acknowledge some of the people I work with at Rutgers. I'd also like to acknowledge NVIDIA; a lot of people there have been very helpful in getting Amber into the state it's in and allowing us to simulate at the microsecond timescales that we do.
And Scott LeGrand is an absolutely brilliant guy. Excuse me, he's not at Amazon anymore; he's now back at NVIDIA. A brilliant guy, a little bit fickle, but brilliant, and we're very glad to have him working with us again on the Amber team. Thank you all.