All right, thank you for the introduction. My name is Chapin Cavender and I'm excited to tell you about Open Force Field's work on modeling proteins and small molecules self-consistently. You've heard a lot of results yesterday and today about Open Force Field's mainline releases, Parsley and Sage. What I'm here to tell you is that a major milestone for our next mainline force field, called Rosemary, is going to be self-consistent treatment of proteins and small molecules. By self-consistent, what I mean is that we should be assigning parameters based on chemistry. That is, if we have a functional group like a primary alcohol next to an amide bond, it should receive the same parameters no matter how we label that molecule, whether it's a small molecule drug or a protein side chain.
So I'll talk about some design goals for Open Force Field's efforts to model proteins. To limit the scope, we're going to have explicit parameters covering canonical amino acids and their common protomers. We're also trying to identify a minimal perturbation from the current set of parameter types that we use for small molecules, so that we have good transferability to proteins. Because we use SMIRNOFF typing to assign parameters, that should allow a pretty seamless extension to covalently modified proteins, as long as our small molecule parameters transfer well to noncanonical amino acids and other covalent ligands.
One of the main goals here is to have a minimum viable product that we can use as a sandbox to do open science. That means we're going to release the datasets and the protocols for training the protein force field, so that anyone can grab those datasets and refit to new targets or tweak parameters and retrain, and so that this becomes a useful tool for the community to advance the science of how we model proteins.
Because of that, our goal in accuracy here is not to be best in class but to be about as accurate as Amber ff14SB. We chose that because it's a very commonly used protein force field that has about the same number of parameters as what we're targeting. For this talk I want you to keep this question in mind: if we have good transferable parameters to cover drug-like small molecule chemistry, do we need bespoke torsions to model proteins well?
Today I'm going to start by talking about the training data and how we optimize parameters. Then I'll show you some results from validation based on quantum chemistry data and NMR data for small peptides and folded proteins.
For our quantum chemistry datasets, we are using model peptides that consist of capped 1-mers and 3-mers. We're avoiding terminology like "dipeptide" because that means something different to a computational chemist and to a biologist, so we're using "capped N-mers" to describe these molecules. Here, a capped 1-mer is a single amino acid with caps on both ends, and a capped 3-mer is three amino acids with caps on both ends. These datasets use the OpenFF default QM method that the previous speaker just talked about. We generate optimized geometries for capped 1-mers and capped 3-mers, and we also generate two-dimensional torsion drives for capped 1-mers. Those torsion drives involve scanning the backbone phi and psi angles, where we keep the side chains constrained to the most populated rotamer from rotamer libraries. We also scan up to two side chain dihedrals, where we keep the backbone dihedrals constrained to the values in either an alpha helix or a beta sheet.
We also have library charges for the proteins. We've used the AM1-BCC ELF10 method in OpenEye to parameterize charges for capped 5-mers, and then we get the charges for an amino acid by averaging the charges of the residue in the middle over all the flanking residues. This slide shows the charges for an alanine main chain residue, where the Amber RESP charges are in blue and the ELF10 charges are in orange. You can see that the backbone atoms that are involved in hydrogen bonds are more polar, that is, they have higher magnitude charges. We see that at the amide on the left and the carbonyl on the right. We expect this means these atoms will interact more strongly with other charged species, so they would have stronger hydrogen bonds both within the protein and to solvent molecules.
This is how we are deriving parameters. The initial force field comes from copying most things from the small molecule Sage force field and adding in the library charges for proteins I just talked about. The initial force field is allowed to see the training data from the small molecule QC datasets, the same set used to train Sage, and also the new protein QC data. I then optimize the valence parameters and the amplitudes of proper torsions while keeping Lennard-Jones the same, and that produces the final force field.
I'm going to tell you today about two different models. Think about the starting force field as a fork of the Sage force field where I've included the protein library charges. I'm going to call that initial point with no optimization Sage-CC; those are my initials. If I then take that and optimize it, adding no additional parameter types, so it has the same number and types of parameters as the Sage force field for small molecules, I'm going to call that the null model. So that's the null model after it's been optimized.
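The library-charge averaging described above can be sketched in a few lines of Python. This is only an illustration of the averaging step, not the actual OpenFF protocol: the function name, the context labels, and every charge value below are hypothetical placeholders, and the real charges come from AM1-BCC ELF10 calculations on capped 5-mers.

```python
from statistics import fmean


def average_central_charges(charges_by_context):
    """Average per-atom charges of the central residue over flanking contexts.

    charges_by_context maps a label for the flanking residues (hypothetical
    format) to a dict of {atom_name: charge} for the central residue.
    """
    contexts = list(charges_by_context.values())
    atom_names = contexts[0].keys()
    return {
        name: fmean(context[name] for context in contexts)
        for name in atom_names
    }


# Made-up charges for two backbone atoms of a central alanine in two contexts.
example = {
    "GLY-GLY/GLY-GLY": {"N": -0.40, "H": 0.28},
    "VAL-VAL/VAL-VAL": {"N": -0.44, "H": 0.30},
}

library_charges = average_central_charges(example)
for atom, charge in library_charges.items():
    print(f"{atom}: {charge:+.3f}")
```

Averaging per atom name keeps a single charge set for each residue type, which is what lets the same library charges be reused regardless of the neighboring residues.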
A second model I'm going to call the specific model, and that one has about 100 extra bespoke torsion parameters, which all model proper torsions in proteins. These are inspired by the number and types of parameters you see in the ff14SB force field.
So next let me tell you about validation of those optimized models on quantum chemistry data. This is a standard step plot that we use to look at relative conformer energies, drawing a histogram over the errors in those relative conformer energies. The conclusion from this plot is that if we look at the new optimized models, null and specific in orange and green, compared to the starting point in purple, we don't see any degradation in relative conformer energies. This means that when we train on protein QC data, we don't degrade our ability to match conformer energies for small molecules.
This is the same kind of histogram for geometric targets, here showing RMSD of low-energy conformers. For this we see a slight degradation in geometries, especially in the second bin right here, compared to the purple starting point. We see a similar story for torsion fingerprint deviations, which look at internal coordinates. If we're worried about that geometric distortion, we can tune it by adjusting the relative weights between the protein data and the small molecule data when we do optimizations in the future.
We've also got some validation datasets for proteins in particular. Here I'm showing torsion drives scanning phi and psi in two dimensions for capped 3-mers that are not used in training. The root mean square error for the null and specific models is about 2 kcal/mol, and we don't see much difference between the null and specific models on this particular benchmark.
So briefly, I just showed you that when we train on protein QC data, we don't see very large degradations in our ability to model small molecules, and also that QC data does not discriminate between the null and specific models, which have different numbers of parameters.
Next I'll show you some validation against NMR data on peptides. We envision doing protein benchmarks in three tiers. The first tier involves only small unstructured peptides, which we can get data on very quickly, in less than about 12 hours. We're going to use these to get rapid evaluation of ideas and evaluate models quickly. The second tier involves a small handful of folded proteins and disordered proteins, which will take a little bit longer to get information back on. We'll use these as validation sets to select a release candidate. Finally, the third tier will involve a much larger set of folded proteins, which have more diverse structures, and also a set of protein-ligand binding free energy calculations. We'll use this third tier to assess the performance of the release candidate. Today I'll tell you about just tier one and one protein from tier two.
We have data on NMR scalar couplings, which are a way to assess conformational preferences for backbone dihedrals. These are parameterized by a Karplus equation, at the bottom of the slide, where the coefficients A, B and C are fit either to a static experimental structure or to DFT calculations on model compounds. This expression gives high values for the scalar coupling when the dihedral angle is at zero or 180 degrees, and there's a minimum somewhere in between that depends on the values of the Karplus parameters. I will show you data on about 121 observables from 13 uncapped peptides for assessing backbone dihedrals. These have a protonated C-terminus, which is used to model the low pH in the NMR experiments.
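The Karplus relation mentioned above has the standard form J(θ) = A cos²θ + B cos θ + C, and a simulated coupling is obtained by averaging J over the sampled dihedral angles. The sketch below assumes that standard form; the coefficient values are hypothetical placeholders chosen only to show the shape of the curve, not the fitted parameters used in the benchmark, and the optional phase offset is an assumption about how a coupling might be tied to the backbone dihedral.

```python
import math


def karplus(theta_deg, a, b, c, offset_deg=0.0):
    """Scalar coupling from a dihedral angle via J = A cos^2 + B cos + C."""
    theta = math.radians(theta_deg + offset_deg)
    return a * math.cos(theta) ** 2 + b * math.cos(theta) + c


def mean_coupling(dihedrals_deg, a, b, c, offset_deg=0.0):
    """Ensemble-averaged coupling over dihedral angles from a trajectory."""
    values = [karplus(angle, a, b, c, offset_deg) for angle in dihedrals_deg]
    return sum(values) / len(values)


# Hypothetical coefficients (in Hz), for illustration only.
A, B, C = 7.0, -1.0, 1.0

for angle in (0, 60, 90, 120, 180):
    print(f"theta = {angle:3d} deg, J = {karplus(angle, A, B, C):.2f} Hz")
```

With these placeholder coefficients the coupling is large at 0 and 180 degrees and smallest near 90 degrees, which is the qualitative shape described in the talk.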
I solvate these, include neutralizing counterions, and then simulate them at 300 kelvin for 500 nanoseconds. To assess performance I'll show a chi-squared value, which is basically just a mean squared error weighted by the experimental uncertainty in the measured scalar coupling. This means that chi-squared values much bigger than one indicate poor estimation of the observable, and values less than or equal to one indicate good performance on that benchmark.
This is the result of chi-squared values for ff14SB and then my null and specific models, with both TIP3P and OPC water. An immediate conclusion is that in general the null and specific force fields perform better in OPC water than in TIP3P water. So we're getting better agreement with the NMR experiments when we use OPC water.
I can also look at scatter plots for individual scalar couplings. We're showing this for ff14SB with TIP3P on the left and OPC on the right. Here the colors represent different residues that we have NMR data on. The x-axis shows the experimental coupling and the y-axis shows the difference between the computed and experimental values, so perfect agreement is going to be exactly at y equals zero, a flat horizontal line. One conclusion here comes from looking at the colors for alanine and glycine, in blue and purple. These tend to cluster closely around y equals zero, so we say that Amber ff14SB does very well on small residues like alanine and glycine. But it does poorly on residues in the middle of the plot that are bulkier and more hydrophobic, things like methionine, valine, and phenylalanine. I'll show you the same plot for the null model. Here we can see that we do much better on those bulkier hydrophobic residues, but in general we do a little bit worse on the smaller alanine and glycine side chains.
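The chi-squared metric defined above, a mean squared error weighted by the experimental uncertainty, can be written as a short function. This is a minimal sketch under that definition; the function name and the numbers are hypothetical, and the actual benchmark may fold other sources of uncertainty into the weights.

```python
def chi_squared(computed, experimental, uncertainties):
    """Mean squared deviation between computed and experimental observables,
    with each deviation scaled by that observable's uncertainty."""
    if not (len(computed) == len(experimental) == len(uncertainties)):
        raise ValueError("all inputs must have the same length")
    terms = [
        ((calc - expt) / sigma) ** 2
        for calc, expt, sigma in zip(computed, experimental, uncertainties)
    ]
    return sum(terms) / len(terms)


# Made-up scalar couplings in Hz, for illustration only.
computed = [6.1, 7.9, 5.0]
experimental = [6.0, 7.0, 5.5]
uncertainties = [0.5, 0.5, 0.5]

print(f"chi^2 = {chi_squared(computed, experimental, uncertainties):.2f}")
```

A value near or below one means the deviations are comparable to the uncertainties, while a value much larger than one means the model misses the observables by more than the uncertainty allows.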
We see a similar story for the specific model where, compared to ff14SB, this model does much better on larger side chains at the cost of slightly worse performance on alanine and glycine. So I'll summarize this section by telling you that we can use OPC water instead of TIP3P water to improve the agreement with NMR observables; that on this benchmark, again, the null and specific models are pretty much indistinguishable; and that when we compare across all residue types, these optimized SMIRNOFF models have about the same accuracy as ff14SB for these small unstructured peptides.
Now I'll show you a result from a larger folded protein, which is going to be more realistic for running protein simulations. The model we're going to use here is called GB3, which is an immunoglobulin binding protein. I chose this because it's only 56 residues, so it's relatively small. It also has no cysteines and no histidines, so it has unambiguous protonation states. The structure of this protein is a beta hairpin that then goes through an alpha helix to a second beta hairpin, where beta strands 1 and 4 interact in a parallel arrangement in the middle right here.
I'll show you first traces of the RMSD over time for 10 microsecond simulations in TIP3P water. The main result you can see here is that both the null and specific models have large deviations in RMSD that all happen within about two microseconds. When we look at those trajectories by eye, what we see is that those are associated with unfolding of the alpha helix. Here I've highlighted the C-terminus of the alpha helix in GB3 in orange on these slides. On the left is the starting structure from an experimental model. On the right is one of the null simulations after about two microseconds, where we see that the C-terminus of the helix unfolds in TIP3P water.
Here are the same traces in OPC water. These are now only at about four microseconds.
So we've already exceeded the point where we see unfolding happen in TIP3P. What we see here is that all force fields sample some fluctuations in RMSD that kind of max out at about [unclear] compared to the experiment. If you look at all those fluctuations, none of them are caused by alpha helix unfolding. They're all caused by fluctuations in the relative motions between the beta hairpins. If this were a sheet composed of entirely parallel or entirely anti-parallel beta strands, this would be something to worry about. But because these are two anti-parallel hairpins that then pair with each other in parallel, we expect these to fluctuate some. An open question is how much we expect them to fluctuate.
To try to answer that, we can go back to some more NMR experiments. GB3 has scalar couplings like I showed you before for the small peptides, but GB3 also has inter-residue hydrogen bond scalar couplings, which assess the backbone hydrogen bond geometry. There's a relatively complicated functional form for this, but the important part is that the scalar coupling has an exponential dependence on the hydrogen-oxygen distance in the backbone hydrogen bond. The dominant feature of this scalar coupling is that it falls off exponentially with distance, so it reports on whether the hydrogen bond is populated.
These are the chi-squared values for both the three-bond scalar couplings assessing backbone dihedrals and the hydrogen bond scalar couplings. The conclusion here is that in general, in TIP3P water at least, the null and specific models have much worse agreement with NMR data than Amber ff14SB. But when we combine them with OPC water, the SMIRNOFF models get much better, and the null model with OPC is just about at the point where it's matching the accuracy of ff14SB.
Here are some more details for these scalar couplings. The blue points are residues that are in the alpha helix.
The orange points are in the beta strands and the green points are in loops in between those secondary structures. The conclusion here is that ff14SB is kind of insensitive to water models. It performs about the same in TIP3P and OPC. Looking at the SMIRNOFF models, the null model does pretty poorly in TIP3P on the left. Many of the helical residues have very poor agreement with the NMR experiments. When we use OPC water on the right, we see improvement in those helical residues specifically. Finally, we see a similar trend in the specific model, where helical residues in the top center of the plot on the left have poor agreement in TIP3P that gets much better in OPC water.
I'll summarize this by saying that the null and specific models in TIP3P water can partially unfold an alpha helix, and that seems to be improved by using the OPC water model, which is improving specifically the alpha helical residues. Finally, if we use the null model with OPC water, that is comparable in accuracy to ff14SB at reproducing NMR data on this folded protein, GB3.
I'll end by referring to the question I posed earlier in the talk: if we have good transferable parameters for small molecule chemistries, do we need to have bespoke protein torsions? So far the answer seems to be that no, we don't. If we use what I call the null model here, which has only small molecule Sage types, we get very similar reproduction of NMR data for both peptides and folded proteins. That's really encouraging for us. It means we're doing a good job at modeling chemistry that's transferable to contexts other than just drug-like small molecules.
There are many people to thank for this work. I'm a postdoc in Mike Gilson's lab at UCSD, so I've gotten a lot of mentorship and guidance for this project from him, as well as from other members of the Gilson lab.
I'd also like to thank our collaborators in industry, including Chris Bayly and Bill Swope, and many people from the OpenFF team, which includes people like Kamana and David Dotson, who helped with the QC data that we use to train the force field. Kamana and Trevor have been important in helping to run the ForceBalance optimizations to actually do parameter optimizations. Bill Lee helped develop the protocol for deriving library charges, and many people on the infrastructure team have helped us to even just load and run these things in the first place. So with that, I'm happy to take questions.