All right, so I'm David Mobley and this is Caitlin Bannan in my group, and we're doing about half and half here to update you on where we currently are with this SMIRNOFF force field and what we're doing on automating chemical perception. For those who don't know what we mean by that, that's part of what this talk is about. Okay, so by way of background, we're used to thinking about force fields involving atom typing, where atom typing is an input, and I'm going to call this typical way of doing things indirect chemical perception. Some people are going to be really familiar with this already and some people aren't. So in a typical force field, what you do basically is you have a human expert, a wizard we can call them, who decides on some molecules they want to cover and on some atom typing they're going to use for those molecules. They develop a thing that can do atom typing, a thing that can assign parameters, and then a parameter file that will be used in parameterizing molecules. Then, separately from that, there is a parameter assignment process. Once they've built a force field, we have some atom type definitions and a typing engine, and that thing assigns atom types to a molecule, and those atom types serve as labels. Then the labeled molecule goes into a thing that's going to assign parameters based on those labels, and that gives you a parameterized system. Most of the force fields that are most widely used in the biomolecular simulation community right now use this labeling process. Notice that this molecule, as I've tried to show here, no longer has the bond types and bond orders. It is just a chemical graph with labels on it. And so those labels have to encode all the chemical information you're going to need in assigning parameters. You can't later decide, oh, I want more chemical information; no, it already has to be present in the labels. This is what we call indirect chemical perception.
You start with a molecule, you label it, you throw away the molecule, then you use the labels to assign the parameters, and that is the indirect part. Alternatively, you could have a different approach, direct chemical perception: you have a molecule and you use the molecule itself to assign parameters rather than passing through labels. And so that's basically what we're trying to do. You can imagine a different force field development process and a different parameter assignment process. So now, instead of having a wizard or human expert decide on labels in advance, the wizard or human expert is just building something that can build force fields and something that can assign parameters, and then maybe they encode some parameter definitions, and then those are used on the molecule to assign parameters. This might sound like semantics, but it turns out it's not, and we'll get to that. Okay, so why do we need new force fields? You know, we're here because we're interested in this, so I don't need to talk about it too much, but even with today's functional forms we can do better. This is something simple we've done on dielectric constants, where we basically just took AMBER GAFF and adjusted the hydroxyl Lennard-Jones and charge parameters for pure methanol, and then took those improvements and transferred them to a whole set of alcohols. So the original performance for dielectric constants on alcohols is on the left; on the right is after we refit for just methanol, and you see the improvement for this whole set of alcohols. So there are pretty systematic errors in today's force fields that are just waiting to be fixed, so we know we can do better. It's hard to do better because refitting is hard, so part of why we're here is that we want to make refitting easier. It turns out the same refitting also improves hydration free energies for the same molecules, and we weren't fitting to those at all. And this happens a lot.
Similar things show up wherever we look. This is something we just did on infinite dilution activity coefficients, and the color scheme here is less than ideal. But basically the stars are sulfoxides, particularly DMSO actually. These are activity coefficients where DMSO is, actually, this should say DMSO is a solute. DMSO is poorly described as a solute for whatever reason. There are systematic errors for it. We can pick out other systematic errors for specific functional groups from this type of data. And so there's a lot of this data, and it's not being used in fitting presently that we're aware of. So it's great data that can be used to improve these types of force fields. And, you know, this is from Ben Sellers and Alberto Gobbi at Genentech. Different force fields might not even agree on the location of a minimum in an energy landscape, and these are some examples they had picked out. So we know we need better force fields, and we know there's room for improvement even within current functional forms before we start changing those. But what we have right now is our parm@Frosst small molecule force field, which is sort of a sibling of GAFF in some sense, but it's in our new force field format. And this originates from work that Chris did in two different eras. There was the Merck Frosst era, where he developed parm@Frosst, and it originated as a spin-off of parm99. And then from that, Chris, on sabbatical in my group, created a descendant of it in our new force field format, which I'm going to be telling you about, or reminding you of for those of you who've heard about it already. And that uses, I thought there was another slide here, but it uses SMIRKS, a substructure query language basically based on SMARTS, to assign parameters. So we do a substructure query on a molecule, and when we get a substructure match, that results in assigning a particular parameter. These substructure queries are done separately for the different types of parameters we talk about.
So there's one set of substructure queries for bonds, another set for angles, and so on. So effectively each different type of parameter uses its own typing. So for example, we could use a SMARTS query to assign a generic carbon-carbon bond length. We can use one to assign a set of parameters that are specific to a single carbon-carbon bond, or to a double bond between two trivalent carbons, which is the bottom one here. And yeah, there should have been another slide there, but it's disappeared. So why would you want to do this? Well, let's think about where traditional indirect chemical perception, or atom typing, leads us. So in AMBER force fields, parm99 for example, we might have this molecule and we could atom type it in this way. We could say, okay, we just need two atom types for this type of molecule, no problem. And so this human expert is deciding this as an input. So then we assign the labels to the molecule and we throw away the molecule itself. We no longer retain the bond orders. And then we're going to use that to assign parameters, and so we just look up the parameters from the parameter file. Thinking just about the torsions for the moment, we get two different types of torsions, low-barrier and high-barrier torsions, and that's no problem. And so that's basically how current force fields work, with some small exceptions or adjustments to that. So the atom types have to encode all the chemistry we're ever going to need in our force field. And atom types, again, serve as labels. This is really important, though. If you don't encode enough of the right chemistry with atom types, you end up missing key details. So here is an example. Somebody now gives the wizard this molecule. And okay, look, now we have a CA-CA bond here. That is a problem because it's a single bond, but it looks like it should be getting the same parameters as all of the double bonds. Yes? It's a silly question there.
Isn't that actually just based on picking a bad type? Yes. That CA might not be a CA. Well, you could say you need to change the type of it. That's part of the point I'm making, actually. So you can go back in and you could say, all right, well, to fix that, I'll just introduce a new type, CP. Okay. So my new type is the same as CA, except it's for CA atoms that happen to be at bridgeheads. And so now I can fix that. I can give that one a big torsion, but then I give you this molecule. So we've said that CP-CP bonds are single bonds, but then we have this bond that is also labeled as a single bond now. And so that's going to get a low-barrier torsion, but it's in an aromatic ring. So that's a problem. So now what do we do? Well, let's just make another new atom type. This is also the same as CA and CP, except it's CQ, and a CP-CQ bond gets a high-barrier torsion and a CQ-CQ bond gets a low barrier, and so on. So we've just invented a whole bunch of additional atom types just to solve this labeling problem, where the labels need to encode all of the chemistry. So, you know, if we're clever enough, the problem is solved and this works. But this also results in files that have a lot of redundant parameters. This is an example from parm@Frosst for some angle parameters. There are more than 60 identical parameters for CP alone that are just copied. Redundancy isn't necessarily a huge problem, but it is a huge problem if you start trying to fit with something automated like ForceBalance, because it raises the question of how many parameters you are actually trying to fit. It also creates problems with human error. Even though GAFF knew about this problem I've been highlighting with CA, CP and CQ, human error creeps in. And so if you run an MD simulation of this molecule with GAFF and GAFF2, the central aromatic ring still buckles because of human errors with respect to the torsions.
And as I said, there's this issue of how many parameters you even have. In GAFF2, we have 16 different sets of van der Waals parameters for carbon. But really that's just three sets of parameters. There's carbon, there's triple bonded carbon or monovalent carbon, and then there are carbons in some very specific three and four membered rings. And that's all it is. So if we're trying to ForceBalance it, are we trying to fit 16 sets of parameters, or are we fitting three? Probably we should be fitting three, unless we decide carefully that we want to do something different. So in our new format, we're using substructure queries instead of typing. And that means we avoid a lot of these problems sort of automatically or accidentally. And so in our format, here's an example of a simple force field for methanol. And it's simple. You're just using substructure queries to say, okay, look, let's find a carbon-hydrogen bond. It's going to get these parameters. A carbon with an electron withdrawing group is going to get those, and so on. Carbon-hydrogen bond. Yeah. And so it's simple. It's not really intended to be human readable, but you can read it. It's hierarchical, so last one wins. You can put more specialized patterns at the bottom and they'll override higher ones. Here's an example of that down here. Here's a generic hydrogen for its Lennard-Jones parameters, and then here's a hydrogen connected to some electron withdrawing groups. Okay. So this is a slide I should have had earlier, and I thought I did, but I'm coming back to it now. So basically, we have a substructure query that acts on a molecule or on several molecules, and I've highlighted the different cases where it occurs. And then we can use that to assign a specific parameter. In this case, this is one for a bonded parameter. We use numbers to indicate which specific atoms we're going to refer to. And so this is atom one in yellow and atom two in red.
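The "last one wins" hierarchy described above can be sketched in a few lines. This is a toy illustration only: the patterns are simplified (element, element, bond order) tuples with "*" as a wildcard, standing in for real SMIRKS, and the names `Bond`, `BondRule`, `assign` and all bond lengths are hypothetical values chosen for the example, not parameters from any actual force field.

```python
from typing import NamedTuple, Optional, Sequence


class Bond(NamedTuple):
    elem1: str
    elem2: str
    order: str  # "single", "double", or "aromatic"


class BondRule(NamedTuple):
    elem1: str   # element symbol or "*" wildcard
    elem2: str
    order: str   # bond order or "*" wildcard
    length: float  # hypothetical equilibrium length in angstroms


def matches(rule: BondRule, bond: Bond) -> bool:
    """True if the rule matches the bond in either atom ordering."""
    def ok(pattern: str, value: str) -> bool:
        return pattern == "*" or pattern == value

    forward = ok(rule.elem1, bond.elem1) and ok(rule.elem2, bond.elem2)
    reverse = ok(rule.elem1, bond.elem2) and ok(rule.elem2, bond.elem1)
    return (forward or reverse) and ok(rule.order, bond.order)


def assign(rules: Sequence[BondRule], bond: Bond) -> Optional[BondRule]:
    """Walk the list top to bottom; the LAST matching rule wins."""
    chosen = None
    for rule in rules:
        if matches(rule, bond):
            chosen = rule  # a later, more specific match overrides earlier ones
    return chosen


# Generic rules at the top, specialized rules at the bottom.
RULES = [
    BondRule("*", "*", "*", 1.50),         # any bond (fallback)
    BondRule("C", "C", "*", 1.53),         # any carbon-carbon bond
    BondRule("C", "C", "aromatic", 1.40),  # aromatic carbon-carbon bond
    BondRule("C", "C", "double", 1.34),    # carbon-carbon double bond
]

if __name__ == "__main__":
    for bond in [Bond("C", "C", "aromatic"), Bond("O", "C", "single")]:
        print(bond, "->", assign(RULES, bond).length)
```

Because the bond order is part of the query, the aromatic and single carbon-carbon bonds are separated without inventing any new atom labels, which is the point being made about the biphenyl case.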
And so I could highlight the same structure multiple times in a particular molecule or in other molecules. And so this uses bond order. And so this automatically and accidentally fixes the biphenyl problem I showed earlier, just by the fact that an aromatic bond between aromatic carbons, which is highlighted here, gets a high-barrier torsion and a single bond gets a low-barrier torsion. And so this central aromatic ring ends up staying flat like it ought to. And that's not a case we paid attention to; it just happens to be fixed because this is a better way of representing the force field. Okay. So Chris, in the group, basically ported parm@Frosst into this format, taking an amount of editorial discretion at the same time, and took advantage of the parameter compression offered by this to produce a simple starting-point force field we call SMIRNOFF99Frosst, recognizing its parm@Frosst roots and our new SMIRNOFF format. And so basically we have about 330 lines of parameters in this full force field, as opposed to the 3600 in parm@Frosst. And if you look at coverage, it covers more of DrugBank, of a ZINC subset, and more of eMolecules than these much larger force fields. There's a lot of redundancy. It's less than a tenth the size of the original force field and almost completely covers pharmaceutical chemical space. Now, this is a starting point that we think is a pretty good starting point. We're not saying this is where we're going to end up. We expect we'll be adding more parameters back in as we go. But really it's a pretty good starting point, and it does fix some problems out of the box. I mentioned the biphenyl problem, where you have to keep introducing other atom types. Here's a related problem. This is sort of a sibling of that one: GAFF and GAFF2 make these bonds between rings here non-rotatable, basically double bonds or essentially aromatic bonds.
And so if you do MD simulations of this, this is what GAFF gives for the bond distribution of these. And with our SMIRNOFF99Frosst force field, we get these rotatable. So with GAFF and GAFF2, you end up with partial buckling of the central ring and this ring due to steric strain, because the bonds want to keep that whole thing planar and it can't sit there. So this is not a problem we were trying to fix. We just accidentally fixed it. Okay. So we have done some testing of our starting point to make sure it's at least reasonable. So if we look at hydration free energies for small molecules from SMIRNOFF versus experiment or SMIRNOFF versus GAFF, here's SMIRNOFF versus GAFF. This is basically just a test to make sure we haven't screwed anything up too badly. And they agree extremely well. And they both agree about as well with experiment, which fits with our picture of this being roughly a sibling of GAFF. We've done densities and dielectric constants on our ThermoML benchmark set. Here's GAFF, here's SMIRNOFF99Frosst. There's not really any significant difference other than this, which is flexible force field water that none of us intend to be using anyway. There might be a very slight systematic shift one way, a little bit worse in SMIRNOFF99Frosst, but we actually do slightly better on dielectric constants. So we're roughly comparable as a starting point. And David Slochower, who's here, has done some very nice work looking at using SMIRNOFF99Frosst for host-guest binding data. So here's SMIRNOFF99Frosst, here is GAFF, and basically we do just as well, maybe a tiny bit better, for host-guest binding in terms of free energies. And it turns out it works fairly well for enthalpies of binding as well. So here's our binding enthalpy data, and GAFF. So you can use this now, and at least the OpenEye-based version has been out for a while.
So we need a cheminformatics toolkit under the hood. The OpenEye one is the one we used initially, but Jeff is just bringing online the RDKit-based version. So I think you should be able to try that out tomorrow if you want. So I think this is where we're going to switch to Caitlin, but are there any questions on this part of the talk before we go on? Yeah. Somebody run the mic too. I'm sorry about that. In naphthalene, do you wind up with all the carbons equivalent, or are the bridge carbons different from the other ones? Do you know the answer to that? Yes, thank you. So for the nonbonded parameters, they're all equivalent, which is the same as what would happen in parm@Frosst or GAFF. And then because we're typing bonds, angles, and torsions separately, those have separate types to assign the right torsions. So for instance, the bridge bond length is going to be different from the others. Is that right? The naphthalene bridgehead bond length? Chris says they're the same. Oh, sorry. I was thinking of biphenyl, I think. I'm sorry. I was thinking of the biphenyl that we gave in the talk. No, those should all be treated equivalently. Because you have three Kekulé structures. So actually the bridge bond is substantially different from the other ones. Yeah. So there's a separate question of what they are treated as and whether they should be treated as that. We just took the parm@Frosst force field and moved it into this format. So you're raising an issue of maybe we should be treating them differently. And that's something. Right. If you look at the ab initio results, you basically have an alternating one and a third, one and two thirds bond order, roughly speaking. And I would imagine in your SMIRKS they all just come out to be the same. Chris? So they will. And this issue is actually really important. It has to do with the use of our chemical representation and our direct chemical representation. We're using SMIRKS and SMARTS.
These depend on an aromaticity model. So for naphthalene, in the SMIRKS and SMARTS aromaticity model, the bonds will end up equally aromatic within that chemical representation. In order to distinguish the two, we would have to use a chemical representation at the level of our substructure search which would recognize those differences. Otherwise, there's no way to carry the true, sort of quantum, if you like, differences through the chemical representation. Now, one of the interesting things about this effort: right now we're going with the SMIRKS representation, but their group has already begun to go beyond that to look at other, more QM-based representations of bond order. Yeah, so for example, we can use partial bond order to interpolate parameters, and I don't have that in this talk, but it's mentioned in our paper. So you might imagine, you know, if you have a bond that's somewhere between single and double, a substructure query is just going to treat it as single or double. But as part of partial charge assignment, we might do a semiempirical quantum calculation to assign partial charges, for example, and in doing that we could get a bond order for free out of that calculation. And if we find that the bond is in between single and double, or even in between single and aromatic, for example more towards single than aromatic, it could get parameters that are in between a single and an aromatic length, or between aromatic and double. So that kind of thing is, I think, what you and Chris are getting at. In our first pass at this, at least, we're just binning things into aromatic or not aromatic, and those are going to be the bond parameters we get. But one of the plans we want to carry out is not just binning them, but having a gradation between those bins, where it can account for those types of differences. Other questions on this part of it? Any heckler over here? Okay.
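The idea of a gradation between bins, as opposed to binning a bond as single or double, can be made concrete with a small sketch. The talk does not say what functional form the interpolation takes, so the linear form here is an assumption, and the function name and the two end-point bond lengths are hypothetical illustration values rather than fitted parameters.

```python
def interpolate_bond_length(bond_order: float,
                            length_single: float,
                            length_double: float) -> float:
    """Linearly interpolate an equilibrium bond length between the
    single-bond value (order 1.0) and the double-bond value (order 2.0).

    Linear interpolation is an assumption made for this sketch.
    """
    if not 1.0 <= bond_order <= 2.0:
        raise ValueError("bond_order must lie between 1.0 and 2.0")
    fraction = bond_order - 1.0  # 0.0 at a pure single bond, 1.0 at a pure double bond
    return length_single + fraction * (length_double - length_single)


if __name__ == "__main__":
    # Hypothetical end points (angstroms) and the fractional bond orders
    # of roughly 1 1/3 and 1 2/3 mentioned in the naphthalene discussion.
    for order in (1.0, 4.0 / 3.0, 5.0 / 3.0, 2.0):
        print(round(order, 3), interpolate_bond_length(order, 1.53, 1.34))
```

With a scheme like this, two bonds that the substructure query treats identically would still receive slightly different lengths if their computed bond orders differ.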
So I'd like to return to my earlier question. With regard to your curves, both force fields, GAFF and SMIRNOFF, showed what I would, well, good agreement seems to be in the eye of the beholder. But what I would have said about those curves is that there are some unacceptable deviations, which gets back to what I would like to encourage, which is to get to the more sophisticated force fields, sort of like what the crystal community has found, ASAP, because it seems like one may be looking around in a pool, but the answer is somewhere. So the question, you could say, is: well, why don't we try better force fields that can result in better agreement? So really, fundamentally, the initiative is in large part an infrastructure project. We're building infrastructure for fitting force fields. And because of short-term interest in pharma in the very specific functional form that we're working on right now, we're basically just refitting force fields within this functional form, but building infrastructure that you or others can use to try out your favorite functional forms, given the same input data, just swapping the functional form, which I think is going to help everybody learn a lot. Then we can learn how much changing the functional form buys us given the same input data, or how much changing the input data changes things given the same functional form. Well, it certainly makes sense to do it your way for building the infrastructure. But I think the jury is probably in, in a lot of studies that show that the Lennard-Jones, single-point-charge, harmonic force field is not adequate. Yeah. But, as I sort of motivated at the beginning, there's also still clear room for improvement within that framework. So I agree with you, and I also disagree. It's certainly incredibly useful as a model-building tool in the recent free energy. Yeah.
The question is, that's exactly the question. Yeah. Yeah. All right. So, we move on to Caitlin's part of the talk, about improving the chemical perception going on from here. Oh. Hold on. Can you hear me? Yes. All right. So, as David mentioned, the SMIRKS that are currently in our implementation of SMIRNOFF were all written by hand, primarily by Chris when he was doing his sabbatical in our group, and then have been hand-tweaked by me over the last two years. As John said during his talk, we really want to put ourselves out of that job. I don't want to be hand-tweaking SMIRKS for as long as we're using this format. Replacing the wizard. There it goes. Oh, sorry. I was clicking the wrong spot on the mouse. Okay. And then, just to pull it back into this roadmap that everyone's been using, we also want to think about chemical perception as an output of our fitting, as opposed to atom types, which are traditionally an input or the starting point to fitting a force field. So, we've added chemical perception sort of in this part, with the QM database and the experimental data going into that fitting process. Okay. And so, instead of handwriting, we want to be able to generate these SMIRKS patterns based on input molecules and some kind of reference data, using some kind of automated process. So, I'm going to focus on what's happening inside this little computer module image that'll allow us to output these SMIRKS patterns. There it goes. But before we do that, I want to break down this language just a little bit. It's not really all that human readable. It really starts looking like swear words. But just so you have an idea of what's happening: we can think about SMIRKS or SMARTS as atoms connected by bonds, building on the SMILES language. So, instead of just having element numbers connected by bond orders, you can include these decorators on the atoms and on the bonds.
So, these represent things like connectivity and charge, ring size, total hydrogen count, or whether the bond is in a ring or not in a ring. Technically there's stereochemistry around; we're not using it, because traditionally force fields haven't, but we could in the future. So, we can look at how those build up together. We can go as generic as any two atoms connected to each other with any bond. We can make this more complicated, such as carbons connected by an aromatic bond. We can also use Boolean operations. So, you could have carbon or nitrogen connected by a specific bond. They get even more complicated as we add more and more decorators. So, while this language might look simple when you're looking at it, the English language also only has 26 letters. Think about how many books are out there. SMIRKS can get very, very complex as we add all of these options for combining them. Why? So, the first thing we needed, if we're going to make changes in this SMIRKS space in order to generate SMIRKS patterns, is a computational tool that allows us to actually edit them. You'd think that someone out there may have done that before me. They haven't, or it's all inside proprietary software that they wouldn't let me have. So, I've created these chemical environment objects that essentially let us break down those atoms and bonds into their individual decorators, which allows us to, for example, make changes. So, if we're trying to convert the SMIRKS on the left to the one on the right, where we have an extra decorator on the second carbon, we could add that decorator on the second atom in the chemical environment. These can be initiated from SMIRKS patterns and written back out as SMIRKS patterns, meaning that you can integrate them with any other tool that you're working with. So, our initial kind of experiment was: can we build a tool that automatically creates the same chemical perception, in SMARTS or SMIRKS patterns, that exists in current force fields?
So, John sort of referenced this briefly, and I'm not going to spend too much time on how we did this on a first pass, because it turns out it was very inefficient, and I think there's a better way to do it that we almost have working. So, in this first pass we decided to take a fully random, Bayesian kind of approach using a Monte Carlo algorithm, where you pick a random SMIRKS from a list. You decide if you're going to throw out that SMIRKS pattern or generate a new child parameter that would go under it in the hierarchy, and then evaluate it using some scoring function that in this case compares to a reference force field, but in future would need to compare to your actual reference experimental or quantum mechanical data. And then we followed sort of a Metropolis Monte Carlo simulation for that. So, in this case our reference data was SMIRNOFF99Frosst. We used three different molecule sets that increased in complexity, and the initial SMIRKS going into it were very, very generic. So, John already showed a slide similar to this, but just to walk through what that actually looks like when you're making random choices: you could say let's start with a very generic carbon-carbon bond. So, that's two carbons connected by any bond. There's only one item in the list, and we're not going to take it out of the list. So, we're going to create a child parameter. Say we decide to change the decorator on that bond. We're going to change the generic bond into an aromatic bond. So, that adds a new parameter to our list. We evaluate it, score it, and decide to accept that move or reject that move. Okay. So, this works okay on small molecule sets. It even doesn't do too badly on something that represents the chemical space in DrugBank, but really we should be able to get a hundred percent agreement with existing force fields. We wrote the SMIRKS patterns.
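The move-score-accept loop walked through above can be sketched as a toy. This is not the tool from the talk: patterns here are just bond-order labels with "*" as the generic pattern, the score is a simple pair-agreement fraction against a reference typing, the acceptance rule is a plain Metropolis-style test that ignores proposal asymmetry, and every name (`assign_types`, `score`, `propose`, `sample_patterns`) and the temperature value are assumptions made for illustration.

```python
import math
import random
from itertools import combinations

DECORATORS = ["single", "double", "aromatic"]


def assign_types(patterns, bonds):
    """Type each bond by the index of the LAST pattern that matches it."""
    types = []
    for order in bonds:
        chosen = None
        for index, pattern in enumerate(patterns):
            if pattern == "*" or pattern == order:
                chosen = index
        types.append(chosen)
    return types


def score(patterns, bonds, reference):
    """Fraction of bond pairs grouped the same way as in the reference typing."""
    types = assign_types(patterns, bonds)
    pairs = list(combinations(range(len(bonds)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (types[i] == types[j]) == (reference[i] == reference[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def propose(patterns, rng):
    """Randomly delete a non-generic pattern or append a new child pattern."""
    new = list(patterns)
    if len(new) > 1 and rng.random() < 0.5:
        del new[rng.randrange(1, len(new))]  # never delete the generic parent
    else:
        new.append(rng.choice(DECORATORS))   # chemistry-blind random decorator
    return new


def sample_patterns(bonds, reference, steps=2000, temperature=0.05, seed=0):
    """Run the random search and return the best pattern list and its score."""
    rng = random.Random(seed)
    current = ["*"]
    current_score = score(current, bonds, reference)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate, bonds, reference)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann-like probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


if __name__ == "__main__":
    bonds = ["single", "single", "double", "aromatic", "aromatic", "aromatic"]
    print(sample_patterns(bonds, reference=bonds))
```

Even in this tiny search space the proposals know nothing about the training molecules, which is the inefficiency the talk goes on to describe.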
We should be able to generate the SMIRKS patterns automatically. So, one reason why this is so inefficient right now is just the random nature of how it's making moves. It doesn't know anything about chemistry. It's just, we're going to try this decorator and see what happens. So, say this is a SMIRKS pattern for a torsion in SMIRNOFF99Frosst. It's not all that complicated. It's just a rotation between a carbon and a nitrogen that has a double bond. So, if we're going to try to generate that specific SMIRKS pattern using this Monte Carlo algorithm, we need to make sure, one, that we start with the right kind of parent SMIRKS from the list. And then we need to add decorators. So, there's a total of four changes that need to happen between the bonds and the atoms in this case. There's a relatively low probability of picking the right decorator. And now we'll assume that it does that in the order that it needs to do it in to make these changes. That takes a billion iterations of that simulation. So, I was running SMIRKY simulations that were running for about a week on a single CPU. We really don't want to be wasting our computational time generating SMIRKS patterns. There are way harder things to do that all of you are working on. Okay. So, I want to take the lessons that we learned from this and build a tool that actually takes in the chemistry and knows what it's doing. So, the first thing is we're going to get rid of these naive moves where the software doesn't know anything about the molecules. You have training molecules. You needed molecules in order to get any of your data. The chemistry is there. We just need to know how to interpret it. The other thing is that SMIRKY was specifically built for comparing to reference force fields. So, it's not very modular in the form it was in. Any steps going forward need to be completely modular so that we can plug in different ways of SMIRKS generation.
We could plug in different input molecules or formats for molecules. We could try different reference data or scoring functions. Okay. So, I want to think about this from the perspective of, let's assume we know how to cluster a set of bonds, or we assume we know how to cluster any of these fragments. So, you could imagine taking that reversible jump Monte Carlo that John described and, instead of making random changes in chemistry space, just making random changes in how you want to group things; then, once you've found the way you want to group things with that algorithm, assigning the SMIRKS patterns. The goal of the SMIRKS patterns is to be able to recycle that clustering, not just to make random changes in chemical space. So, let's assume we have these groups of bonds that you've decided need to be assigned the same parameters. Now we want to generate that list of SMIRKS. One, two, and three. So, I have a new tool that we're calling ChemPer, short for chemical perception. And I'm going to walk through three sort of phases of this. I'm going to talk about, just in general, how it goes about creating SMIRKS patterns from these clusters of bonds, and address some of the challenges that still come along with that even though we're taking the chemistry from the molecules in the training set. And then I'm going to talk about how we can use that process with actual data instead of reference force fields. Okay. I don't know why that. Okay. So essentially what happens with ChemPer is it takes a training molecule that you've given it, and you can identify atoms, and it generates the SMIRKS from those. We can say there's a carbon atom here that has this set of decorators. Here are all the possible decorators you could use for this carbon atom. Here's the type of bond that connects the two of them. We can then extend that out along the molecule, away from just the indexed atoms that describe the bond that the parameter is actually assigned to.
So that index, the colon one or colon two, tells you where the parameter is assigned, but you can include more atoms than that as necessary. You can extend that even further to include the entire molecule if you wanted. So if you want to create completely custom parameters for all of your molecules and put them into the SMIRNOFF format, I can help you automate that process. It's not really our goal as the Open Force Field Initiative, but if you want to do it, we can do it. All right. So we can take a simple example of this, where we have two alcohols and a methane here and the three bond parameters that those molecules would get in the current SMIRNOFF implementation. So if we extract all of those decorators, we generate these three SMIRKS patterns that describe the three groupings specified by SMIRNOFF99Frosst. But we want a general force field. We don't want a force field that specifies the entire molecule in our training set and can't match anything else ever again. So we want to remove decorators in these SMIRKS patterns that are unnecessary for keeping the clustering. So the goal is, I want to keep circles with circles and squares with squares, but I only want to maintain the decoration in the SMIRKS patterns that is necessary for keeping that grouping. So we do that by just iteratively removing decorators and checking that it kept the clustering. Okay. So in the next three slides, I have a couple of examples of things that make this process slightly more difficult than I just described it, as though it just works. But most of them have been overcome and are working in the software as it is. So one thing is, if you have a bunch of data for bonds, you don't necessarily have them aligned in a way that makes chemical sense. You just know you have atom one and atom two specifying a bond length and a force constant, for example.
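The decorator-removal step described above, iteratively dropping decorators and checking that the clustering is kept, can be sketched as a greedy loop. This is a toy under stated assumptions: fragments are plain dictionaries of decorator names to values rather than SMIRKS, patterns are matched in "last one wins" order, and the helper names (`build_pattern`, `keeps_clustering`, `prune`) and the decorator keys are hypothetical, not the interface of the tool from the talk.

```python
def build_pattern(cluster):
    """Fully decorated pattern: every decorator with all values seen in the cluster."""
    keys = cluster[0].keys()
    return {key: {fragment[key] for fragment in cluster} for key in keys}


def matches(pattern, fragment):
    """A fragment matches if it satisfies every decorator kept in the pattern."""
    return all(fragment[key] in allowed for key, allowed in pattern.items())


def assign(patterns, fragment):
    """Index of the LAST matching pattern (hierarchical, last one wins)."""
    chosen = None
    for index, pattern in enumerate(patterns):
        if matches(pattern, fragment):
            chosen = index
    return chosen


def keeps_clustering(patterns, clusters):
    """True if every fragment is still assigned to its own cluster's pattern."""
    return all(
        assign(patterns, fragment) == index
        for index, cluster in enumerate(clusters)
        for fragment in cluster
    )


def prune(clusters):
    """Greedily drop decorators that are not needed to preserve the clustering."""
    patterns = [build_pattern(cluster) for cluster in clusters]
    if not keeps_clustering(patterns, clusters):
        raise ValueError("decorators cannot separate the requested clusters")
    for pattern in patterns:
        for key in list(pattern):
            removed = pattern.pop(key)
            if not keeps_clustering(patterns, clusters):
                pattern[key] = removed  # this decorator was needed; put it back
    return patterns


if __name__ == "__main__":
    # Hypothetical decorators for three kinds of bond in small alcohols.
    c_h = dict(elem1="C", elem2="H", order="single", degree1=4)
    c_o = dict(elem1="C", elem2="O", order="single", degree1=4)
    o_h = dict(elem1="O", elem2="H", order="single", degree1=2)
    print(prune([[c_h], [c_o], [o_h]]))
```

The result depends on the order in which decorators and patterns are visited, so a greedy pass like this finds one acceptable minimal set rather than a unique answer. In this example the first pattern ends up fully generic and the later ones keep a single decorator each.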
So we want to make sure that those get aligned correctly before SMIRKS generation, so that you're actually describing the chemistry that's there. With bonds that's relatively simple, since there are only two atoms. Things can get a little more complicated with things like torsions, where you have multiple atoms that need to align. So there's the one case where it's just flipped over, right? That symmetry is easy enough: you flip it back. You also might have cases where the middle atoms are flipped but the outer ones aren't, in which case you need to make decisions about prioritizing where the torsion actually needs to be. And then that same principle can be expanded to the atoms that are non-indexed: you make sure you overlay the entire fragment being described, even the atoms that aren't necessarily being used to assign parameters. So the other decision is: how big of a SMIRKS pattern do I want? How far away from these indexed atoms should we actually include? We could start with just the indexed atoms, then add one layer of neighboring atoms to extend beyond that. I'm going to stop at two atoms away, but we could keep going. The way this is handled internally right now is that I try it with the minimal number, on the assumption that we'd rather keep the pattern as small as possible, and then add layers until we get the clustering we desire. But currently it adds that extra layer of atoms to every single SMIRKS pattern in the list. Those extra atoms can then be removed when you're removing unnecessary decorators, if they don't have to be there. But it might be ideal if we can find a way that doesn't require that extra step of adding things and then removing them. Then the other important part is that when we're handwriting SMIRKS, it's relatively easy to say, okay, there are a lot of atoms that are just generic. They all need to get the same parameter. We're going to put those at the top and call it any atom.
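The simplest alignment case mentioned above, a torsion that is just flipped end to end, can be sketched as follows. This is an assumed, simplified scheme rather than what ChemPer does: atoms are reduced to element symbols, the first torsion in the list serves as the reference, and each torsion is kept forward or reversed depending on which orientation agrees with the reference in more positions. The names `orient` and `align` are hypothetical.

```python
def orient(torsion, reference):
    """Return the torsion forward or reversed, whichever agrees with the
    reference in more positions (ties keep the original order)."""
    def score(candidate):
        return sum(a == b for a, b in zip(candidate, reference))

    flipped = tuple(reversed(torsion))
    return flipped if score(flipped) > score(torsion) else tuple(torsion)


def align(torsions):
    """Orient every torsion against the first one in the list."""
    reference = tuple(torsions[0])
    return [orient(t, reference) for t in torsions]


# The same two torsion types, each recorded once in each direction.
torsions = [
    ("H", "C", "O", "H"),
    ("H", "O", "C", "H"),
    ("C", "C", "O", "H"),
    ("H", "O", "C", "C"),
]
aligned = align(torsions)
print(aligned)
```

The harder case, where the middle atoms are swapped but the outer ones are not, needs an explicit priority rule and is not handled by this sketch.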
And then there are these carbons that are really specific, a pattern that only applies to double-bonded carbons in three- and four-membered rings. They don't come up very often, but they definitely need a different parameter. We're going to put that one at the bottom and write this very complicated SMIRKS pattern for it. But if we're going to automate that, we also need to decide what's the correct way to make that decision automatically. Do we choose based on the size of the cluster, so how many bonds are in it? Should the largest cluster be the most generic parameter and just go at the top? Should we base it on the number of types of decorators on an atom? So if this atom can be a carbon or a nitrogen or an oxygen or a fluorine, should it be at the top, where it's generic? Or should we just pick a random order and test it, and then pick a different random order and test that? That's not a decision I've actually made, because we're working off of comparing to the existing force fields, where I just keep the existing order for now. So now I have a toy example of how we might do this with some actual quantum mechanical data, and then the open-ended questions that are still coming into play with how we make these decisions. In this example I have bond angles for a number of hydrocarbons. It's not just the three on the side, but the three on the side represent the type of chemistry in this dataset. And we've plotted just population versus bond angle in that graph. All right, there we go. So with a very rudimentary clustering algorithm we find that there are two main peaks here. We want to identify the angles that show up in each of those clusters and make SMIRKS patterns for those two main peaks. And then we'll go through the same steps I did when we were describing the SMIRNOFF bond parameters.
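One of the ordering heuristics raised above, putting the largest cluster first so that it acts as the generic parameter, is easy to sketch. The speaker says no ordering rule has been chosen, so this only illustrates that one option; the function name, the SMIRKS strings, and the bond identifiers are all made up for the example.

```python
def order_by_cluster_size(clusters):
    """Sort (smirks, members) pairs so the largest cluster comes first.

    With last-match-wins assignment, earlier entries are the generic fallbacks
    and later entries are the specific overrides.
    """
    return sorted(clusters, key=lambda item: len(item[1]), reverse=True)


# Hypothetical clusters: each SMIRKS paired with the bonds it should own.
clusters = [
    ("[#6X3r3:1]=[#6X3r3:2]", ["b7"]),
    ("[*:1]~[*:2]", ["b1", "b2", "b3", "b4"]),
    ("[#6:1]-[#8:2]", ["b5", "b6"]),
]
ordered = order_by_cluster_size(clusters)
for smirks, members in ordered:
    print(len(members), smirks)
```

Swapping the `key` function for a count of allowed elements per atom, or shuffling and re-testing, would give the other two options mentioned in the talk.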
We pull out all possible decorators for all of those atoms, and then we iteratively remove the decorators that are unnecessary and simplify the patterns down, so we have a generic set of parameters that keeps those clusters assigned correctly. All right, so then basically we want to be able to do this with any type of parameter. We might imagine with bonds and angles that you could use lengths or angles and their associated force constants. And working with higher to figure out whether we can cluster torsions based on the number of periodicities and force constants that they require. And as you'll hear from David and Lee-Ping, we're also working on impropers, specifically focused on nitrogen centers, and working with Jessica Maat on how we can generate SMIRKS patterns for those as well. All right, so as I said before, we're trying to keep ChemPer as modular as possible. While I just used a clustering algorithm out of the box from scikit-learn, because that's easy and it's not really my area of expertise quite yet, we could also imagine plugging in Josh's reversible jump Monte Carlo algorithms as another way to cluster the same data. There are also concerns around what data we should even be clustering. Vicki and I had proposed a project earlier this year where we were going to look at bond lengths and force constants, and bond angles and force constants, and then cluster those and generate SMIRKS patterns as a demonstration of how we might be able to automate this process. Chris nicely pointed out that maybe quantum mechanically calculated force constants aren't the best thing to be clustering, because they include features that maybe don't actually need to be worked into the bonded parameters.
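To make the "two main peaks" clustering step concrete, here is a dependency-free sketch. The talk used an off-the-shelf scikit-learn algorithm without naming it; this substitutes a plain one-dimensional two-center k-means, and the angle values are invented for illustration (roughly tetrahedral and roughly trigonal planar), not data from the talk.

```python
def two_means(values, iterations=20):
    """Split 1-D values into two clusters by repeatedly updating two centers."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("need at least two distinct values")
    for _ in range(iterations):
        # Assign each value to the nearer center, then recompute both centers.
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return near_low, near_high


# Made-up bond angles in degrees, standing in for QM-optimized hydrocarbon angles.
angles = [108.9, 109.3, 109.5, 110.1, 111.0, 119.2, 120.0, 120.4, 121.1]
sp3_like, sp2_like = two_means(angles)
print(sp3_like)
print(sp2_like)
```

Each resulting group would then be handed to the SMIRKS generation and decorator-removal steps as one cluster.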
So some of these examples show what we might think of as an angle around an ether oxygen that should probably get the same parameters as an aliphatic ether, but they get very different angles assigned to them, and different force constants, due to the environment around them. That would largely come from ring strain, which we don't necessarily need to put into a bonded parameter, or it's there because when you move away from the ring all the other things get twisted, or because of Lennard-Jones parameters with the large methyl groups nearby. So there's this open question of what we should really be clustering and what's the best way to do this. Okay, so we hope that between this and the papers we've put out, we've convinced you that direct chemical perception is superior to atom typing. It's much more flexible, and it's going to allow us to edit parameters separately: we don't have to add a new atom type every time we want to put in a custom torsion. Also, SMIRNOFF99Frosst is available for you to use now, at least if you have OpenEye; if you don't, it will be available in the next few days. And then, as I've shown, we've got this new tool, ChemPer, that's allowing us to make SMIRKS patterns once we've already isolated clusters of fragments that need to stay together in the force field. While that's mostly working and getting documented, there are still open-ended questions about the best ways to use that tool moving forward. Right, and we've got GitHub repositories for everything. This is mostly on SMIRNOFF. On Slack, SMARTY is there to talk about chemical perception if you want it. And I am a MolSSI fellow, so the funding for this project actually comes through them. Yeah, people on Zoom, if you have questions, feel free to unmute yourselves and ask.
While we're waiting, are there questions?
So I'm not sure how ring perception comes into these rules. For instance, if you have the series cyclopropane, cyclobutane, cyclopentane, cyclohexane, are those all going to be completely different SMIRKS, or are they going to get lumped together?
Ideally that decision will be data-driven and not chosen by me the wizard or Chris the wizard. The SMIRKS patterns allow you to specify the size of the smallest ring, so if the question is specifically about the SMIRKS language, that's an option. You can also just include whether a bond is part of a ring, and you can include the number of ring bonds an atom is in, which would match regardless of the size of the ring. So there are decorators for both, I guess.
Thanks. You said it's still an open question how you should be clustering. That shouldn't be a problem anymore as soon as you go to the Bayesian inference, right?
Yes, yes. The goal is to blend this with all of the Bayesian work, and then you don't have that question anymore because you can try all kinds of things.
So you still want to cluster, because you want to reduce your search space, right? And how well will the fragment-based SMIRKS assignment patterns work for novel chemistry, then?
Yeah, I'm not sure I understand. For normal chemistry?
Novel chemistry.
So the goal here is that, as we've said with other things, we're building an infrastructure that you should be able to train on your custom molecules. So if you wanted to generate SMIRKS patterns for a set of molecules that we didn't necessarily train on, it should still be able to generate the SMIRKS for those.
Okay, is that better? Okay. So the reason why this starting SMIRNOFF99Frosst that I made on my sabbatical works for 99.8 percent of chemical space is because of the generic parameters. The problem is they will be overly generic. So when you come up with novel chemistry, will it work? Yes, it will work. Will it be good? We don't know. This is exactly where this kind of data-driven approach works: using the ability to automatically generate new subspaces of the chemistry, it will say that the generic parameter which works for your novel chemistry is not the best kind, and then it will automatically generate more specific parameters which deal with your novel chemistry. So your novel chemistry, I think, should always be accounted for. It will always be accommodated; whether it will be done well, that will progressively improve as more specific parameters are generated to accommodate that specific subspace.
I just want to follow up on that. This is where we would love your advice on how we can get molecule spaces that are not going to tip your hand about what your chemists are making, but would be large enough and interesting enough to help us make sure we're ahead of this problem. That would be super useful. Your colleague at Bayer, Clara, has of course pointed us towards the Colin Groom paper about heterocyclic aromatic rings of the future, as well as SureChEMBL, so mining patent space in areas you've already looked at. But if you have clever ideas about how, again without tipping your hand, you can give us public molecules or public ways of enumerating molecules that we can feed into the quantum chemical pipelines, we can get out ahead of this and make sure it will work, at least not embarrassingly poorly, on those new types.
Yeah, to expand on that point, maybe it's just as simple as the advisory board folks getting together and identifying things that they would like us to prioritize pulling out of eMolecules or something, so just giving us tips. Because, as we'll be talking about in some of the later talks on torsions and on valence, a big part of the first major round of refitting is just which molecules we include in our first round of QM data for torsions and valence parameters. We'll build up a pretty big molecule set for that, so which ones should we be paying the most attention to?
I've heard about other methods that avoid SMIRKS and atom typing totally and use just direct QM, like through extended Hückel theory and that kind of thing. I'm wondering, because Chris said that you're also looking into using more QM-based representations, like bond order, what kind of things you're thinking about and how that fits in.
So I can comment, and then maybe David can add to it. Something that we proposed in the SMIRNOFF paper that's out already is using just partial bond orders calculated from the AM1-BCC calculations we're doing for charges anyway. Then, instead of having a SMIRKS pattern that says this is a single, double, or triple bond, you could just have an any-bond in between them and then interpolate that parameter depending on what the calculated partial bond order is. And we've explored potentially using that on other parameters as well, including improper assignment. Anything to add?
Oh yeah, and I'd definitely love to talk about it during the discussion. It's good fun.
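The partial-bond-order idea in that answer can be illustrated with a short sketch. It assumes simple linear interpolation between a value defined at bond order 1 and a value defined at bond order 2; the function name and the numerical values are hypothetical placeholders, not parameters from SMIRNOFF99Frosst.

```python
def interpolate(bond_order, single, double):
    """Linearly interpolate a parameter between its bond-order-1 and bond-order-2 values."""
    return single + (double - single) * (bond_order - 1.0)


# Illustrative placeholder values only (length in angstroms, arbitrary force-constant units).
length = interpolate(1.5, single=1.53, double=1.33)
force_constant = interpolate(1.5, single=600.0, double=1000.0)
print(round(length, 3), force_constant)
```

With this scheme a single `[*:1]~[*:2]`-style pattern could cover a range of bond orders, with the partial bond order supplying the distinction that separate single- and double-bond patterns would otherwise encode.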
I'm sure you've already looked at all the databases like GDB-8 and SureChEMBL and ChEMBL and ZINC. Do you really need more than that? Because it seems like the rings should also be covered by GDB-11, -12, or whatever, and then all the single bonds should essentially be covered by that in combination with SureChEMBL and ZINC as well. Now the question is how you drill it down to the most important ones. But if you focus on one example for a torsion scan for each of the bond types in those, you might be able to get down to a reasonable number and still cover everything, I think.
So GDB-8, or 11, do you remember the number of compounds off the top of your head?
Yeah, it has only 50,000.
So we can check which parameters get exercised in that set fairly easily. But in the very first round of fitting, at least, I don't expect we're going to be covering GDB-11 with our QM data. We're not going to have QM on all those compounds in the first try.
If you limit yourself to a central bond and the adjoining next ring, then you might actually be able to exercise essentially everything.
Right. The question is partly what's the full molecule set we'll use. We'll certainly make sure our full molecule set exercises all of the parameters we have, which is a slightly different question from whether we're exercising all the parameters we have adequately, which is: do we need more? So maybe we need to be testing them more. Yeah, so we're very open. We'd love to have input on how exactly to check that and how to make sure we have the right molecules from GDB-8 or 11 or SureChEMBL, and even just on prioritizing in the short term, because once we get the infrastructure working it won't be hard to add things.
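The distinction drawn at the end, between exercising every parameter and exercising every parameter adequately, can be sketched as a simple coverage count. This assumes parameter assignment has already been run and produced, for each molecule, a list of parameter identifiers; the molecule names, identifiers, and the threshold of two molecules are all hypothetical.

```python
from collections import Counter


def coverage(assignments):
    """Count, for each parameter id, how many molecules use it at least once."""
    counts = Counter()
    for parameter_ids in assignments.values():
        counts.update(set(parameter_ids))
    return counts


def under_exercised(all_parameters, assignments, minimum=2):
    """List parameters used by fewer than `minimum` molecules, including unused ones."""
    counts = coverage(assignments)
    return [p for p in all_parameters if counts[p] < minimum]


# Hypothetical assignment results: molecule name -> parameter ids it was given.
assignments = {
    "ethanol": ["b1", "b2", "b3", "b1"],
    "methanol": ["b1", "b2", "b3"],
    "cyclopropene": ["b1", "b4"],
}
all_parameters = ["b1", "b2", "b3", "b4", "b5"]
print(under_exercised(all_parameters, assignments))
```

Raising `minimum` is one crude way to express "adequately"; a real check would likely also look at the chemical diversity of the molecules hitting each parameter.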