Hello everyone. Today I'm going to talk about automated inference of force field parameter types, and also about sampling of force field parameter types. I want to begin by reminding people of what parameter types are in the first place and how we use them to assign force field parameters to molecules. There are basically two different concepts out there right now. One is called indirect chemical perception. This is what we've been doing for 30 or 40 years based on atom types: we say that every atom has its own chemical environment, and the combination of all these chemical environments allows us to define atom types, and from those, bond types, angle types, torsion types, and so on. The other concept is called direct chemical perception. This is what I will be using today, and it is also a large part of the Open Force Field effort: we skip the intermediate step of defining atom types and instead directly define the chemistry that goes into our bond types, angle types, and so on. We do that with something called SMARTS. SMARTS is a one-dimensional chemical notation language that lets us express substructure queries in a very compact and clever way, and we can use it to directly define the chemistry we are interested in for a given bond type and so on. Now let me set the stage for the experiments in this talk. We want to start on the left side, where we are completely uninformed about chemistry. We say: we have never seen a molecule in our life; the only thing we know is that there must be such things as bonds, angles, and torsions. So there is just one bond type in our universe, one angle type, and one torsion type. Then, all of a sudden, we see some data, and only after we have seen that data can we actually say that there are different bond types, different bond chemistries, different angle chemistries, and different torsion chemistries. The whole idea is that we only want to discover the chemistry and the types based on the data we have, so that nothing enters our chemical universe beyond what the data supports. Now that we have set the stage, let's talk about automated inference: we want to infer force fields from scratch, based on the data we see. We start in the top left corner with bonds. As I said, in my universe there is just one bond type, encoded by the SMARTS pattern star-tilde-star, *~*, meaning any atom connected by any bond to any other atom. Now we want to split that into two bond types. So we are looking for a partition, for two SMARTS patterns that split our initial parent SMARTS pattern in some optimal way. And this is the same as asking: please explore all possible ways to partition a set with n elements. This is described by the Bell numbers, a well-known problem in combinatorics, and it basically tells us that the number of ways we can split types is really huge.
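To make direct chemical perception a bit more concrete, here is a minimal sketch. RDKit as the SMARTS engine, the molecule, and the specific ring/chain child patterns are my choices for illustration, not the patterns from the talk:

```python
from rdkit import Chem  # RDKit as a stand-in SMARTS engine; any toolkit works

# Methylcyclopropane: three ring bonds plus one exocyclic C-C bond.
mol = Chem.MolFromSmiles("CC1CC1")

patterns = {
    # Parent type of the "uninformed" universe: any atom, any bond, any atom.
    "parent *~*": "[*:1]~[*:2]",
    # One illustrative way to split the parent into two child types:
    "ring bond ": "[*:1]@[*:2]",
    "chain bond": "[*:1]!@[*:2]",
}

for name, smarts in patterns.items():
    patt = Chem.MolFromSmarts(smarts)
    n = len(mol.GetSubstructMatches(patt))  # one match per heavy-atom bond
    print(f"{name} matches {n} bonds")      # parent 4, ring 3, chain 1
```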
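To get a feeling for how huge: the Bell numbers can be computed with the Bell triangle recurrence, as in this short sketch. B(14) is about 1.9 × 10⁸, which is the order of magnitude quoted next:

```python
def bell(n):
    """Bell number B(n): the number of ways to partition a set of n
    elements, computed with the Bell triangle recurrence."""
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]          # each row starts with the previous row's end
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]

print(bell(14))  # 190899322 -- roughly the "hundred million" from the talk
```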
If you think about something like alanine dipeptide, which has 14 bonds, we can essentially build on the order of a hundred million different force fields just by partitioning those bonds into bond types. So the combinatorial space that is accessible to us is really huge, and we have to come up with a smart way to explore it. The first thing we need, and I only want to talk about this very briefly, is a way to encode SMARTS patterns as some sort of vector that we can use inside an algorithm. This is done by something I'm calling a mapping vector. A mapping vector has a given length, and this length sets the maximum amount of information I can encode in a SMARTS string; each element of the mapping vector says which piece of chemistry goes into the SMARTS pattern at that position. So it is really just a way to encode the chemistry of a SMARTS string as a fixed-length vector. The next thing, and this is really important, is how we decide which type splits we actually want to include in a full parameter optimization. Of course we do not want to look at all hundred million possible splits; we could not optimize every one of them individually. We want a quick way to score a given split, in order to know whether or not it is worth optimizing. We use gradient scores for that. What that means is: for a given candidate split, we ask what the gradients are on all the bonds that match the left child type, and what the gradients are on all the bonds that match the right child type, and we look at the directions in which they point. If they point in opposite directions, then a Newton-type optimizer that follows the gradient will pull the parameters of these two types apart, so they end up in different regions of parameter space, which is exactly what we want from a split. So we compute the (normalized) dot product of the left and the right gradient, and we call that the gradient score; when the gradient score is close to one, meaning the gradients point in opposite directions, we say this is a split we want to keep and fully optimize (I sketch one possible form of this score below). Now that we know how to split things, we need to put this into a larger context. The first thing is to pick an objective function. The objective function is the log-likelihood, which tells me how accurate my model is, plus a penalty term that says how complex my model is. So I want to trade the complexity of the model, represented by the number of parameters, against its accuracy, and there is one knob in there, which I'm calling K: a fixed penalty factor that tunes complexity against accuracy. Now say we start out with a set of N molecules. In the first step we compute the objective function over all molecules, and then we generate a set of random mini-batches of molecules.
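Here is the promised sketch of a gradient score. The exact sign convention and normalization were not spelled out in the talk, so the minus-cosine form below, which is 1 exactly when the two gradients are anti-parallel, is my assumption:

```python
import numpy as np

def gradient_score(grad_left, grad_right):
    """Score a candidate type split from objective-function gradients.

    grad_left / grad_right: gradient of the objective with respect to the
    shared parameters, accumulated over the terms matched by the left and
    right child type. A score near 1 means the two gradients point in
    opposite directions, so an optimizer would pull the child types apart
    in parameter space. (The -cos(angle) form is an assumption.)
    """
    gl = np.asarray(grad_left, dtype=float)
    gr = np.asarray(grad_right, dtype=float)
    return -gl.dot(gr) / (np.linalg.norm(gl) * np.linalg.norm(gr))

# Perfectly opposed pulls score 1; aligned pulls score -1.
print(gradient_score([1.0, -2.0], [-0.5, 1.0]))  # 1.0
print(gradient_score([1.0, -2.0], [0.5, -1.0]))  # -1.0
```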
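Written out, the objective is something like the following, where the linear penalty on the number of types is my guess at the functional form; the talk only says there is a single fixed knob K trading accuracy against complexity:

$$\mathcal{O}(\theta, T) \;=\; \underbrace{\log P(Y \mid \theta, T)}_{\text{accuracy}} \;-\; \underbrace{K \,|T|}_{\text{complexity}}$$

with θ the parameter values, T the set of types, Y the data, and |T| the number of types.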
So we partition the whole universe of molecules that we know into smaller universes, and we try to find the k best splits in each of these individual universes using the gradient scores I talked about before. In the next step we optimize the force field parameters for each of these splits, so that we end up with a bunch of candidate force fields in each of the individual universes. Then we ask the question: how well does each of these candidates generalize to the whole universe? We take the best force field that we find and apply it to the whole data set, and in that way we intrinsically tune our force field to generalize, because we discover types in a small universe and then test them on the full one. And then we do this over and over again until we converge to something, growing the parameter space as we go (a tiny self-contained toy of this loop follows at the end of this part). The first test set I'm looking at is called the C4 set. C4 just means: take the GDB database and keep all molecules in there that have exactly four carbon atoms and any number of hydrogen atoms. Our target data are optimized geometries, vibrational frequencies, and torsion profiles. One important point is that we are not targeting a QM-level energy surface but an MM-level energy surface; that ensures that the exact model is part of the solution space that is actually accessible to our algorithm. Very briefly, as I said before, we have a penalty factor K that balances model accuracy against model complexity, and we see that as long as this penalty factor is not too high, say 0.2 or 1, the outcome is not very sensitive to how we choose it; only when K becomes much larger does the number of types go down. Okay, so how does the algorithm work in practice? On the top here, the black line indicates the value of the objective function; the red dots are the number of bond types we are discovering over the iterations, with the number of angle types in green and the number of torsion types in yellow; and these are the SMARTS patterns we are finding. On the right you can see the target data as the fit evolves: the green lines are the geometries, the black lines are the vibrational frequencies, and the red dots are the torsion energies. One of the nice things you see is that the 20 generations shown take about one day, and we are pretty much done with type inference well before the end, meaning that we can get at least a first force field in less than half a day. If you have a closer look at the SMARTS patterns, you see that they make sense if you think about it. They are already ordered in a hierarchy, and the algorithm seems to find that single bonds need to be distinguished from everything else, and that there are ring bonds; in our case we just have three- and four-membered rings, so the chemistry is limited. We also find a few other things, but if you look at this one type here, for example, it manages to distinguish three-membered rings from four-membered rings, which is encoded here. However, the C4 set is intrinsically limited with respect to chemistry.
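Here is the promised toy of the split-optimize-accept loop. Everything in it is my own minimal construction under stated assumptions: a "bond" is just a scalar reference length, a "type" is a group of bonds sharing one parameter, "optimizing" means least squares, and the mini-batch universes are omitted for brevity. It mirrors the structure of the loop, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(7)
# Toy "bonds": reference lengths drawn from two underlying chemistries.
ref = np.concatenate([rng.normal(1.54, 0.005, 10),   # e.g. chain-like bonds
                      rng.normal(1.51, 0.005, 10)])  # e.g. ring-like bonds
K = 5e-4  # fixed complexity penalty per type

def fit(assign):
    """Optimal parameter per type under least squares: the member mean."""
    return {t: ref[assign == t].mean() for t in np.unique(assign)}

def objective(assign):
    """Penalized accuracy stand-in: -SSE - K * (number of types)."""
    params = fit(assign)
    sse = sum(((ref[assign == t] - params[t]) ** 2).sum() for t in params)
    return -sse - K * len(params)

assign = np.zeros(len(ref), dtype=int)  # start: one universal bond type
while True:
    improved = False
    for t in np.unique(assign):
        members = np.flatnonzero(assign == t)
        # Per-bond gradient of the SSE w.r.t. the shared parameter theta_t
        # is 2 * (theta_t - ref_i); splitting by its sign gives the two
        # halves that pull theta_t in opposite directions, i.e. the kind
        # of split a gradient score would rank highest.
        g = fit(assign)[t] - ref[members]
        left = members[g > 0]
        if 0 < len(left) < len(members):
            trial = assign.copy()
            trial[left] = assign.max() + 1  # promote one half to a new type
            if objective(trial) > objective(assign):  # beats the K penalty?
                assign, improved = trial, True
    if not improved:
        break  # no remaining split pays for its extra type: converged

print("types discovered:", len(np.unique(assign)))  # 2 for this toy data
```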
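And just to make the C4 selection criterion concrete, a small sketch with hypothetical stand-in SMILES (the real source is the GDB database):

```python
from rdkit import Chem

# Hypothetical stand-ins; the talk draws from the GDB database.
smiles = ["CCCC", "C1CCC1", "CC1CC1", "C1CC1", "CCCCC", "CC(C)C"]

def is_c4(smi):
    """Exactly four carbons, and carbon/hydrogen only."""
    mol = Chem.MolFromSmiles(smi)
    symbols = [a.GetSymbol() for a in mol.GetAtoms()]  # heavy atoms only
    return symbols.count("C") == 4 and set(symbols) == {"C"}

print([s for s in smiles if is_c4(s)])
# ['CCCC', 'C1CCC1', 'CC1CC1', 'CC(C)C']
```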
You can see that here: if we now apply this to a test set, the C5 set, same idea, all molecules with exactly five carbon atoms, we see that a lot of the torsion profiles are actually spot on, but there are also problematic profiles; the one over here, for example, is completely off. The RMSE for the torsion energies is roughly 5 kJ/mol. Okay, so let me briefly talk about sampling of parameter-type space. Very quickly: I'm now representing chemistry a bit differently than before. We're not using SMARTS mapping vectors anymore; we're looking at the full partition space directly, the space of everything that is accessible to type assignments. There is something called a Z vector, and the Z vector simply encodes, in the case of bonds, all the chemically unique bonds in my molecule or set of molecules. To each of these elements of the Z vector, four unique bonds in the case of butane, I can assign a bond type, and this essentially lets me encode all possible combinations of type assignments in a given molecule. We can take that and put it into a posterior probability: the posterior over Z, my type-assignment vector, together with the parameter values, conditioned on some data that I'm calling Y (written out below). Then you must come up with a smart way to sample this whole thing, and the idea here is to separate the sampling of type space from the sampling of parameter-value space: we do Gibbs sampling in the space of types, and we do reversible-jump Markov chain Monte Carlo moves in the space of parameter values. I don't want to go into a lot of detail about that, but I have slides on it, and if anyone is interested I'm delighted to talk about it; it took quite some effort to come up with these sampling moves. So this is a movie of the progression of the sampling algorithm, and you see, for example here, the number of angle types: when we start from scratch, we slowly explore the space of the number of angle types, and we converge, in this case, to two angle types and also two bond types for the small test system we are looking at, which included just butane and cyclobutane; that pretty much makes sense.
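Writing out what is being sampled, with a standard Gaussian error model as my assumed form (the talk does not give the explicit expression):

$$P(Z, \theta \mid Y) \;\propto\; P(Y \mid Z, \theta)\, P(\theta \mid Z)\, P(Z), \qquad P(Y \mid Z, \theta) \;=\; \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\left[\,y_i - f_i(Z, \theta)\,\right]^2}{2\sigma^2}\right),$$

where Z is the type-assignment vector, θ the parameter values, Y the reference data, f_i the model prediction, and σ the assumed error. A large σ flattens the likelihood, so a complex typing buys very little accuracy, which is exactly the error-model effect described next.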
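And a minimal, self-contained sketch of a single Gibbs sweep over Z for fixed parameter values under that Gaussian error model; the numbers are invented, and the real sampler also updates the parameter values and the number of types, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.array([1.54, 1.53, 1.51, 1.52])  # toy reference lengths of four bonds
theta = np.array([1.535, 1.515])        # current parameter of each bond type
sigma = 0.01                            # assumed error of the reference data

def gibbs_sweep(z):
    """Resample each element of the typing vector Z from its conditional."""
    for i in range(len(z)):
        # How well does each type's parameter explain observation y_i?
        logp = -((y[i] - theta) ** 2) / (2 * sigma**2)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(len(theta), p=p / p.sum())
    return z

z = np.zeros(len(y), dtype=int)  # start: every bond has type 0
for _ in range(100):
    z = gibbs_sweep(z)
print(z)  # most sweeps end near [0 0 1 1]: long vs. short bonds separate
```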
There are some other nice things you can do in the context of sampling the posterior, namely playing around with the error model. As you know, in a likelihood we assume that there is some error in our model and in the reference data, and we can play with that a bit. If we assume large deviations, a large error model, then we actually find force fields that have just one angle type. And that makes sense: if your reference data has a large error, there is no reason to adopt a more complex model. However, if you assume that your reference data has a small error, meaning you have a pretty good guess that your reference data is quite precise, then you find models that have two angle types. This shows that you can use the error model to tune the complexity of the model against the assumed accuracy of your reference data. To summarize: we can infer force fields from scratch using a deterministic approach based on gradient scores, and we can also sample in parameter space and parameter-type space. For the near future, I'm looking into including more data and getting a better handle on test sets. In the more distant future, I want to apply these ideas to nonbonded interactions, different functional forms, and especially Class II force fields. I want to thank a lot of people: the Gilson lab, where I did the first part of my postdoc; many people from the Open Force Field Consortium, who are also involved in the crystal structure projects I'm working on; people from my current lab in Frankfurt; and the DFG for funding. Thank you.