Hi, everyone. I am part two of the automatic chemical perception session. It's a very similar approach in terms of automatic splitting. In particular, I designed a way to encode SMARTS in a very specific manner such that it becomes very easy to do automatic chemical perception.

For the outline, I'll talk about what the problem of parameter splitting is in force fields, why it's difficult, and why it hasn't really been done very well so far. Then I will talk about the encoding scheme that I have designed to do this, and then how, with this new binary-coded SMARTS encoding, you can actually do automatic chemical perception. Finally, to put this to real use, I fit some force fields from scratch on alkanes, and I also looked at a single molecule that has difficult physics in terms of its torsion profiles; I fit a custom force field and got very accurate torsions with that.

The traditional approach to finding new parameters is usually that you look at the distribution of error after you've optimized the force field. For example, what I'm showing in black by the mouse could be one parameter, and that's the error over that single parameter. If you're lucky, you see a bimodal distribution and you say, hey, I can split that, and maybe I'll get better accuracy because those might be different chemistries. If you do this, you can essentially define it as two different distributions, and the trick is that you have to label each of your data points as being in either one distribution or the other. But as you can see, it's up in the air where you should assign the overlapping part, because what you decide is green or blue is going to define how your SMARTS patterns look; you have to distinguish between the two groups somehow.
Then once you refit, you're hoping those errors will go down: they'll peak a little higher, have less variance, be more centered around zero, and you have a better force field. This is problematic in a lot of cases. Here I'm showing a very lucky case where you have a bimodal distribution. I call this chemical perception based on the physics: you're looking at the physics, you're looking at the error, and you're hoping that there's some chemical environment that can split them out and that everything works.

But there's a problem with that. Tobias mentioned Bell space, which is all the possible splits. Well, it turns out that making a SMARTS pattern to encode every one of those partitions is very challenging. Take methyl hydrogens, for example. You can't really distinguish them, but the Bell space would say they're distinguishable, right? So let's say you have some division here. This is possibly not encodable by SMARTS, so you have to be very careful with this kind of approach.

On the other hand, you can invert the problem and do chemical perception based on the chemistry. You say: I have a SMARTS pattern that this parameter is associated with in the SMIRNOFF force field, so what partitions can we induce easily by looking at similar, but not quite identical, SMARTS patterns relative to our original parameter? For example, if you had a carbon parameter, then by making a single edit I induce some partition in which everything is clearly labeled and defined. Then the hope is that it's actually a good parameter and we want to keep it. If you can encode this in a way such that you're only looking at close SMARTS patterns, you're no longer solving the combinatorial problem; it becomes more polynomial, because you're searching an incremental SMARTS space. And so when I first thought about this problem.
It was interesting to me, and there were two questions that I asked myself to see if there was a solution. First: given two SMARTS patterns, how do you determine if they're equal? This one is more or less solved; it's just an isomorphism check to see if the graphs match, no problem. The trickier part was: if I have a SMARTS pattern, which pattern is closest, or which pattern comes next? That is open to interpretation, so it's up in the air, but you can see it with this example. If I have an sp3 SMARTS pattern, adjacent patterns would be an sp2 carbon, or an sp3 carbon not in a ring. Notice I only changed the SMARTS pattern by one little bit, or one little decision.

If you can programmatically encode that in some data structure, then it becomes really easy to ask: if I can produce these SMARTS patterns, how many different partitionings can I generate? And are any of those partitions, with the new SMARTS patterns, actually well-performing parameters that we want to keep? It turns out that most of the time, yes, there are some really good patterns, as you can see. I think it's going to be a very good idea to split out sp3 versus sp2, and we had to do almost no work to get that SMARTS pattern and distinguish that physics.

I'm not going to talk about the actual encoding of the data structure, but it's very simple: it's just a bunch of bits, and you can manipulate them using bitwise operators. A more relevant example is this. Let's say I have two SMARTS patterns, or two graphs, and I want to combine them, that is, take the union. This is just a bitwise OR operation over the entire graph, such that you get a new pattern with all of the bits set. Here, for example, you have a carbon; the semicolon means AND, and the comma means OR, so you're checking to see if these three groups match, and inside a group you can say it's either X3 or X4.
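To make the union step concrete, here is a minimal sketch for a single atom. The talk does not describe the actual data structure, so everything here is an assumption made for illustration: the `FIELDS` layout, the tiny set of primitives, and the helper names `encode`, `union`, and `to_smarts` are all hypothetical. It only shows how a bitwise OR of two encoded atoms yields a pattern that matches both.

```python
# Hypothetical encoding for ONE atom: each primitive type is a small bit field.
# This layout is an assumption for illustration, not the encoding from the talk.
FIELDS = {
    "element": ("#1", "#6", "#7", "#8"),
    "connectivity": ("X1", "X2", "X3", "X4"),
    "ring": ("R0", "!R0"),  # not in a ring / in a ring
}


def encode(**allowed):
    """Build {field: bitmask}; a field that is not specified gets all bits set."""
    pattern = {}
    for name, labels in FIELDS.items():
        chosen = allowed.get(name)
        if chosen is None:
            pattern[name] = (1 << len(labels)) - 1
        else:
            pattern[name] = sum(1 << labels.index(label) for label in chosen)
    return pattern


def union(a, b):
    """Union of two atom patterns: bitwise OR, field by field."""
    return {name: a[name] | b[name] for name in FIELDS}


def to_smarts(pattern):
    """Render as SMARTS: ';' (AND) between fields, ',' (OR) within a field."""
    groups = []
    for name, labels in FIELDS.items():
        mask = pattern[name]
        if mask == (1 << len(labels)) - 1:
            continue  # every bit set means this field adds no constraint
        groups.append(",".join(l for i, l in enumerate(labels) if (mask >> i) & 1))
    return "[" + (";".join(groups) or "*") + "]"


sp3 = encode(element=["#6"], connectivity=["X4"])
sp2 = encode(element=["#6"], connectivity=["X3"])
both = union(sp3, sp2)
print(to_smarts(sp3), to_smarts(sp2), to_smarts(both))
# prints: [#6;X4] [#6;X3] [#6;X3,X4]
```

For whole graphs the same OR would be applied to every atom and bond, which presumes the two graphs have already been mapped onto each other; that mapping is not covered by this sketch.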
So you are able to match both of the patterns on the left-hand side with this larger, condensed pattern. Now we're getting to the point where it becomes useful. If you have these multiple bits set from all the data that matched the parameter, you can look at the close patterns just by looking at the single bits. You obviously have the four bits here, but keep in mind that if a SMARTS pattern doesn't specify some primitive, all the bits for that primitive are set, and all of that chemical information is still contained in it. For us that's not useful right now, but it lets you ask things like: what if we did need to separate on rings or not, what would those molecules look like? Right now we don't discriminate rings, and maybe you need to. You can do that by looking at these simple bits.

Finally, actually making a new parameter is just one operation. If this is your original parameter, and this is a bit that was determined from the data, the only thing you need to do to make a new parameter is intersect them, that is, do an AND, and that combines them into one. Now you have a new parameter, and you can check it: build a modified force field with the new parameter, fit it, and see if there's actually an improvement. This is a very simple example, but it works for bonds, angles, and torsions. It's very general in the way it works, because it's the same problem, just on more complicated graphs.

There is one last thing we need before we get into the results. Because SMIRNOFF is a hierarchy, you need to have some definition of subset. So you have to be able to ask, and answer, whether this sp3 pattern is a subset of a general carbon.
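The single-bit search, the AND that creates a new parameter, and the subset test can be sketched together for one atom. As before, this is only an illustration under assumptions: the `FIELDS` layout and the names `bits`, `intersect`, `is_subset`, and `single_bit_candidates` are hypothetical and not taken from the talk, and real patterns are graphs that need an atom mapping first.

```python
# Hypothetical one-atom encoding; the layout is an assumption for illustration.
FIELDS = {
    "element": ("#1", "#6", "#7", "#8"),
    "connectivity": ("X1", "X2", "X3", "X4"),
    "ring": ("R0", "!R0"),  # not in a ring / in a ring
}
FULL = {name: (1 << len(labels)) - 1 for name, labels in FIELDS.items()}


def bits(name, *labels):
    """Bitmask for the given primitive labels of one field."""
    return sum(1 << FIELDS[name].index(label) for label in labels)


def intersect(a, b):
    """New, more specific pattern: bitwise AND, field by field."""
    return {name: a[name] & b[name] for name in FIELDS}


def is_subset(a, b):
    """True if everything matched by a is also matched by b."""
    return all((a[name] & b[name]) == a[name] for name in FIELDS)


def single_bit_candidates(parent, observed):
    """Yield (label, child) for every single bit seen in the data under parent."""
    for name, labels in FIELDS.items():
        seen = parent[name] & observed[name]
        for i, label in enumerate(labels):
            if (seen >> i) & 1:
                mask = dict(FULL)
                mask[name] = 1 << i
                child = intersect(parent, mask)
                if child != parent:  # skip edits that change nothing
                    yield label, child


# A general carbon, and the union of (made-up) data that matched it.
carbon = dict(FULL, element=bits("element", "#6"))
observed = {
    "element": bits("element", "#6"),
    "connectivity": bits("connectivity", "X3", "X4"),
    "ring": FULL["ring"],
}
candidates = list(single_bit_candidates(carbon, observed))
print([label for label, _ in candidates])
# prints: ['X3', 'X4', 'R0', '!R0']
```

Each candidate is a subset of the parent by construction, so it can sit below the parent in the hierarchy; whether it is worth keeping is decided by refitting, not by this enumeration.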
For the sp3 carbon the answer is of course yes, because everything that matches the sp3 carbon also matches the general carbon. It gets a little trickier if you have something like a carbonyl pattern, because it has two atoms, but you still have to say whether that is a subset of your general pattern. So when you do these operations, you must consider the fact that not all graphs have the same number of edges and nodes, and you still need some mapping such that you can do the bitwise operations.

With that in mind, an Open Force Field force field is simply a SMARTS hierarchy that you can build. For example, b2 here, this bond parameter, sp3 to sp2, is a general carbon-carbon bond, and then you specialize it with a general oxygen here. That's why you have this direct association inside the tree. Now that you have a tree, you can iterate over it, check which parameters are performing poorly or not so poorly, target those, branch those parameters out to make new ones, and check whether any of them are viable.

To test the feasibility of this: my end goal is to make a transferable force field, so I want this to be a very general approach. I don't necessarily want to customize the SMARTS for a single molecule. The easiest SMARTS space you can look at is alkanes, because those SMARTS patterns only differ by ring membership and hydrogen membership, so the chemical space is very small. If I can't do this well, some work would have to be done. You start with a very general force field, and this would be just bonds and angles. I leave torsions aside here because this was a first look, so I'm holding a lot of difficult things constant and just trying to fit the simple valence terms. For the fitting objective, we use ForceBalance regularly, and this is no different. In this case we fit geometries and forces.
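The tree walk mentioned a moment ago (iterate over the hierarchy, find the poorly performing parameters, then branch those) might look roughly like the sketch below. It is a toy under stated assumptions: the flat bit layout, the `Node` class, the helper names, and the error values are all invented for illustration, and it uses single atoms instead of the bond patterns shown on the slide.

```python
from dataclasses import dataclass, field

# Hypothetical flat bit layout for one atom: two element bits, two connectivity bits.
C, O, X3, X4 = 1, 2, 4, 8
ANY_X = X3 | X4


@dataclass
class Node:
    name: str
    bits: int
    error: float = 0.0  # per-parameter error after a fit (made-up values below)
    children: list = field(default_factory=list)


def walk(node):
    """Depth-first iteration over every parameter in the hierarchy."""
    yield node
    for child in node.children:
        yield from walk(child)


def assign(node, env):
    """Most specific node whose pattern contains the atom environment, else None."""
    if (env & node.bits) != env:
        return None
    # Later children take priority, mirroring SMIRNOFF's last-match-wins ordering.
    for child in reversed(node.children):
        hit = assign(child, env)
        if hit is not None:
            return hit
    return node


def worst(root):
    """The parameter with the largest error: the first target for splitting."""
    return max(walk(root), key=lambda n: n.error)


root = Node("general carbon", C | ANY_X, error=0.9,
            children=[Node("sp3 carbon", C | X4, error=0.2)])

print(assign(root, C | X4).name)  # prints: sp3 carbon
print(assign(root, C | X3).name)  # prints: general carbon
print(worst(root).name)           # prints: general carbon
```

Here the general carbon still carries the sp2 data and the larger error, so it would be the node to branch next.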
So that is what we're trying to fit in terms of the physics of the problem. But because we don't want to parameterize everything and make every SMARTS pattern unique, with every little bit of the molecule its own parameter, we also have a chemical objective that essentially counts the parameters, to make sure that we don't have an overfit set of parameters.

Another trick: since I have this binary encoding scheme, it is quite easy to take a set of molecules and exploit the fact that there's a lot of redundancy in the SMARTS space. If you look at all the torsion SMARTS, you can start picking molecules out such that you have a minimal set of molecules in which all the SMARTS torsions are covered. That would be a subset of all the alkanes I'm showing here, and then you use the rest as a test set. So you have all of the SMARTS patterns covered, but you don't necessarily use all of the molecules.

Then, because you are in a very large space, you need a way to prune things down so that you're not testing thousands and thousands of new SMARTS patterns on the regular. This simple diagram shows a scoring system to filter out bad parameters. Not all SMARTS patterns are going to be useful, and you'll see that in the optimization because there will be no change in the objective. The idea with the first scoring system is that you potentially just want to examine the worst-performing parameters, for example, so you home in on those. Then you split those out, and the candidates go to the W2 scoring system. In this case I do exactly one optimization step. This weeds out a lot of the parameters that don't really do anything, because the objective either won't change or it will go up.
At the same time the chemical objective will also go up, which means such a parameter just won't be a viable candidate. But the ones that do drop will almost always drop in that first optimization step, and so they become viable parameters. Then in the final step you do a full fit and take the best one. You take that single parameter with the single best drop in the objective and feed it back into the force field. Now you have a new force field with another parameter, and you repeat the process until you've balanced out the chemical complexity against the physical objective.

Here are the results. You can see that, starting from a very basic two-parameter system, once you add a few parameters you become competitive with a hand-curated set. This would be Sage; Sage has 10 parameters, I think, for the same molecules. I'm testing different settings, or hyperparameters, of my scoring functions. As you can see, if we target the worst-performing parameter and then consider all the candidates, you get good, comparable results without checking all this stuff and adding new parameters that don't seem to be very useful in terms of improving the force field. This is the test set and, as you can see, we do pretty much just as well as the more expensive calculations that have to check a lot more candidate parameters.

To show that directly, here you can see the number of parameters. All of the models increase, coincidentally, to the same number of parameters as Sage. I also perform parameter deletions, so you can see the number drop sometimes, but in general the quickest method produces the same number of parameters as Sage. And you can see here that for the W3 scoring, if you instead check everything, you are doing 200.
For example, 200 full ForceBalance fits for every candidate, and that takes quite a while if you have more than 100 molecules. So this was quite an expensive search to do, but if you're only looking at a few candidates, and you're smart about things, you don't have to do that many tests. It is the same with W2. This is just showing that the green model, which has the best set of hyperparameters, performs really well, and it's actually very quick to generate these parameters.

Finally, I talked about generating sets based on uniqueness of SMARTS. What I didn't mention is that I randomly select a molecule that satisfies some SMARTS pattern, so here I'm showing what happens if you sample over that. I'm doing repeats, but I'm also changing how I define uniqueness in the SMARTS pattern. In the legend I'm showing R and H. R means I only include ring membership in the SMARTS pattern, so the ring SMARTS is included in uniqueness but not hydrogen, and vice versa for H: H indicates we are not considering ring membership in our uniqueness. In that case, for example, you would be pulling out some cycloalkanes but not other cycloalkanes, or you could potentially get all straight-chain alkanes in your training set. The green means I'm considering both SMARTS primitives, such that I cover the entire chemical space of my data set. And then I compare it to random.

Here's the training set, and this is Sage to give you a reference. As you can see, the green performs really well, and it also has a very low standard error of the mean, so it's a very consistent model if you pick the right chemical space to fit on. Testing is very similar, except what's very interesting is that you get really good performance on the green, the RH set, with only a few parameters, and again it's very low noise. So it's a very stable set of parameters that we generate. Finally, last slide.
I looked at taking just one molecule and fitting bonds, angles, torsions, and impropers using the same method, except this time on one single molecule, and it's able to fit it. It doesn't really overfit, but the idea was to see how accurate I can get with this method of building and finding new parameters. Here you can see that black is our standard Open Force Field QM model, B3LYP, blue is the result of this work, and I'm comparing with Sage. You can see, for alanine tripeptide, the important angles: psi, phi, and omega. There are some issues there, and this kind of approach will find the right values and give you very good torsion profiles.

With that in mind, that's my work, so thank you very much for your attention. Let's take questions. Special thanks to Caitlin and Chris Bailey for letting me bounce ideas off of them. This was a long time in the making; I've been working on this for a while now. I am a graduate student, and this is my PhD for the most part, whether I like it or not at this point. I'd also like to thank Tobias, because we have very similar interests and I see him regularly in the Open Force Field calls, and also [name unclear], who is also in the Mobley lab like I am. It's been great bouncing ideas off of him as well. So with that, I'd like to take any questions. Thank you very much.