Hi, my name is Owen Madin. I'm a graduate student in the Shirts Research Group and an Open Force Field fellow, and today I'm going to talk a little bit about some of our future plans for looking at non-bonded interactions, particularly using Bayesian approaches.

Currently we've made a lot of good progress in our force field fitting effort as a whole. If you look at the Parsley force field, there have been a lot of improvements, but these have come mainly through targeted fitting of bonded interaction parameters: bonds, angles, and torsions. One thing we haven't looked at much so far is the non-bonded interactions, meaning electrostatics and van der Waals interactions. Today I'll focus on the van der Waals side, most commonly the Lennard-Jones model, but we hope to use this kind of analysis for electrostatics in the future.

The problem with Lennard-Jones fitting is, first, that there's limited data: instead of training against quantum calculations, we have to train against macroscopic observables, so we're really limited to the experiments people have actually done. Second, there's a black-box nature to the problem. If you have a van der Waals model parameterized with some set of parameters (Lennard-Jones being the most common), you get microscopic behavior from that; other factors like electrostatics and bonded interactions also feed in; from there you run equilibrium simulations; and then you get the macroscopic properties, which are the targets you're fitting to. There are a lot of intermediate steps here, and those steps are not very visible to us. We don't have an easy understanding of "if I change this parameter, I'll get that change in a macroscopic property."

On top of this, there are a lot of decisions to make about parameters and functional forms: discrete choices of models or functional forms, and continuous choices of parameters within those models. A discrete choice is something like the number of atom types in a model (Mike Gilson talked at length about this yesterday), or the Lennard-Jones combining rules: once you have atom types, how do you actually parameterize the interactions between unlike types? Then there are the continuous choices of parameters, for example the epsilon and sigma in your Lennard-Jones functional form. So there are a lot of factors in play, and what we're interested in is making decisions about individual models and sets of parameters based on statistical evidence.

How do we plan on doing that? The technique we're interested in is Bayesian inference, which provides a natural framework for making decisions about models. If you're unfamiliar with Bayesian inference, we like to look at posterior distributions. Posterior distributions are a combination of a likelihood distribution and a prior distribution. What do these really mean? A prior distribution captures your reasonable knowledge about the model before you perform your specific experiments, and you might not have all that much prior knowledge.
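Written out, this is just standard Bayes' rule; here $\theta$ is the parameter set, $M$ the model, and $D$ the data:

```latex
\underbrace{p(\theta \mid D, M)}_{\text{posterior}}
  \;=\;
\frac{\overbrace{p(D \mid \theta, M)}^{\text{likelihood}}\;
      \overbrace{p(\theta \mid M)}^{\text{prior}}}
     {\underbrace{p(D \mid M)}_{\text{evidence}}}
```

The denominator $p(D \mid M)$ is the normalizing constant, the model evidence, which will come up again in a moment. The prior $p(\theta \mid M)$ is the piece you set from what you already know.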
If this isn't a model you've studied before, you might put a pretty flat prior on what you're looking at; if it's something you've studied before, you might already have a reasonable idea of what's going on and use a more peaked prior. Then there's the likelihood distribution, which describes how well your model and set of parameters reproduce a specific data set. You have some new data set of physical properties, you evaluate your model's performance relative to that, and that gives you your likelihood. Put together, these become the posterior distribution, which combines that specific likelihood with the more general prior knowledge you already had.

How are we going to use this to decide between models? If we look at a posterior distribution and integrate under the curve, we get a normalizing constant, and the normalizing constant of a posterior distribution can be thought of as model evidence. If you take ratios of those normalizing constants, ratios of model evidence, you can interpret them as odds in favor of a model. For example, if one model has three times the normalizing constant of another, you can interpret that as three-to-one odds in favor of that model. This is obviously a simplistic picture in the one-dimensional case, but you can imagine that models which are more uncertain, which have lower prior values, are going to be penalized.

How do we apply this to the problems we're looking at? It allows us to make data-driven choices between models. For example, if we go from a model with three atom types to a model with four atom types, similar to what Mike Gilson was looking at earlier, and we see that the Bayes factor favors the model with three atom types, then we can say it's not really worth adding that extra atom type; we're probably better off sticking with what we had, since we're not getting a lot of extra mileage out of the extra type.

To do this, we need a way of computing these posterior distributions, and because these are black-box problems, we can't do it analytically; we have to use some sort of Monte Carlo sampling to get these values out. That process looks something like this: you start somewhere, propose a new set of parameter values, and evaluate the posterior for the proposed set and the previous set. From there, you accept or reject the move. If you accept the move, you treat the new set of parameters as a draw from the posterior distribution. Do this enough times and you can construct posterior probability distributions.

I want to note that I've shown this example for parameters, but it's also possible to do this over model space using a technique called reversible jump Monte Carlo: we can make moves not only within continuous parameter spaces but also between discrete model spaces. That's the technique we've been using to look at differences between normalizing constants, and the way we interpret it is that if you're sampling both models well and you sample one model ten times more than the other, then that model is about ten times better supported.
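As a rough sketch of that accept/reject loop, here is a minimal Metropolis-Hastings sampler, not our actual implementation; the `log_posterior` function is a hypothetical stand-in for whatever scores a parameter set against the data:

```python
import numpy as np

def metropolis_hastings(log_posterior, theta0, n_steps=50_000, step_size=0.1):
    rng = np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    log_p = log_posterior(theta)
    samples = []
    for _ in range(n_steps):
        # Propose a new parameter set near the current one.
        proposal = theta + step_size * rng.normal(size=theta.shape)
        log_p_new = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio); accepted
        # proposals are treated as draws from the posterior.
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        samples.append(theta.copy())
    return np.array(samples)

# Usage example with a toy 2D Gaussian posterior as a stand-in:
samples = metropolis_hastings(lambda t: -0.5 * np.sum(t**2), theta0=[1.0, 1.0])
```

Reversible jump Monte Carlo adds cross-model moves on top of this, but the parameter-space loop above is the core of the idea.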
Counting model visits in reversible jump isn't the only way to compute Bayes factors, but it's what we've been using so far. Now, you're probably thinking: if you have to run an equilibrium simulation to evaluate your model every time you take a Monte Carlo step, this isn't going to be feasible. And I agree. That's why our plan is to use surrogate models to accelerate sampling of the posterior. As I mentioned earlier, there are intermediate steps where you get the microscopic behavior, run a simulation, and then get your macroscopic properties. With a surrogate model, you can go directly from your model and parameters to the macroscopic properties; there's obviously going to be a little uncertainty introduced here. Essentially, the surrogate is an analytical function: you put in your parameters and get out the response of your observable.

To test this strategy, we looked at a simple case study: a two-center Lennard-Jones plus quadrupole (2CLJQ) model, which you can use to parameterize diatomic and diatomic-like molecules. The model has two Lennard-Jones sites, each with an epsilon and a sigma parameter; a variable bond length that separates the two Lennard-Jones sites; and a quadrupole parameter that controls the strength of the quadrupole interaction. Because we have these different parameters, we can split this into different levels of complexity. For example, you can include a quadrupole parameter or set the quadrupole interaction to zero, and you can fix the bond length or leave it variable. The nice thing about this model is that surrogate models for liquid density, saturation pressure, and surface tension have already been developed in the literature, so we don't have to do the legwork of building a surrogate model ourselves.

Previously, other researchers took advantage of these surrogate models and performed multi-criteria optimization over the parameter space, producing parameter sets for a two-criteria case (liquid density and saturation pressure) and a three-criteria case that also included surface tension. What we noticed in their data was that when you go from the two-criteria case to the three-criteria case, the quadrupole is driven to zero. That begs the question: is it worth including the quadrupole parameter in this model at all? We can answer that with Bayesian inference.

So we ran our Bayesian inference process, doing reversible jump Monte Carlo over the model space and the parameter space, and we got model and parameter distributions out of it. The model distribution is a discrete distribution between the AUA model, which has no quadrupole, and the AUA+Q model, which does. What you can see is that there are about four-to-one odds in favor of the model without a quadrupole, which tells us that for this specific molecule, a quadrupole is most likely not justified. The level of "substantial evidence" I've written here comes from a common heuristic for Bayes factors: substantial evidence is a Bayes factor of roughly 3 to 20, and strong evidence is 20 or higher. It's a little arbitrary, but it's a commonly used heuristic in the field.
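For reference, the Bayes factor between two models is the ratio of their evidences, which is what the visit counts in reversible jump estimate when the model priors are equal:

```latex
B_{12} \;=\; \frac{p(D \mid M_1)}{p(D \mid M_2)}
       \;=\; \frac{\int p(D \mid \theta_1, M_1)\, p(\theta_1 \mid M_1)\, d\theta_1}
                  {\int p(D \mid \theta_2, M_2)\, p(\theta_2 \mid M_2)\, d\theta_2}
```

So the four-to-one result above corresponds to a Bayes factor of about 4, just inside the "substantial" range of that heuristic.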
Another thing we get out of this is our parameter distributions, which is really nice information as well: we get to see their shapes, we learn a lot about the correlations between parameters, and we can get maximum a posteriori estimates of what the parameters should be. If there's some multimodality, as there is a little bit in this case, we notice that too.

Looking at this across a whole set of molecules, we find that in general the quadrupole interaction is not justified: for almost all of them, the model without a quadrupole is preferred. The one case where the model with a quadrupole is preferred is acetylene in the two-criteria case, and in the three-criteria case it goes back to preferring the model without a quadrupole. So the question we've answered here is that for these molecules and these properties, it's probably not worth including the quadrupole.

This is a relatively simple example, and the next step is extending past these simple model systems. In Open Force Field we're building biomolecular force fields, and those models are going to be a lot more complex than the one I was looking at earlier. So we need to build surrogate models, and when we do that we need to reduce the computational cost as much as possible. We're looking at a multi-fidelity approach: we start with direct simulation, which is our gold standard for observables, since that's generally how we compute them; then we use thermodynamic reweighting to learn about the response surface in the local region around each direct simulation; and then, combining multiple direct simulation points with thermodynamic reweighting on top, we build an analytical model that's quick to evaluate and does a reasonable job of approximating the response surface.

We need simple analytical models that work well when we can only sample the true surface minimally, because these equilibrium simulations are expensive and we want to run as few as possible. The models we're interested in for this are Gaussian processes: they do well with minimal observations and they scale reasonably well with dimensionality.

Our next challenge in learning how to build these models is how to place our observations: what can we do to make sure the observations we're using give us the most bang for our buck? We need a fairly general strategy; one possibility is using gradient-based methods to get us into the right regions before we start. But in general we're also interested in adapting surrogate models on the fly. The scheme would look something like this: you start in some region, maybe far from equilibrium, or maybe you used ForceBalance to get close to what you think is an equilibrium value. You start sampling, and eventually you move out of your model's trust region, which is based on the uncertainty of the model. At that point you trigger a new simulation, which expands the trust region, and you keep triggering simulations until you have a pretty good picture of what your high-probability regions look like. A rough sketch of that loop follows.
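Here is a minimal sketch of that adaptive idea using scikit-learn's Gaussian process regressor; the parameter values, the observable, the trust threshold, and the `run_simulation` call are all hypothetical placeholders, not our actual pipeline:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical training data: a few expensive simulation evaluations of an
# observable (say, liquid density) at different (epsilon, sigma) points.
X_train = np.array([[0.8, 3.2], [1.0, 3.4], [1.2, 3.6], [0.9, 3.5]])
y_train = np.array([0.71, 0.68, 0.64, 0.66])

# Fit a GP surrogate; the WhiteKernel term absorbs simulation noise.
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=0.5) + WhiteKernel(1e-4),
    normalize_y=True,
).fit(X_train, y_train)

# During posterior sampling, query the surrogate instead of simulating.
theta = np.array([[0.95, 3.45]])
mean, std = gp.predict(theta, return_std=True)

# If the predictive uncertainty exceeds a trust threshold, trigger a new
# direct simulation at theta and refit the surrogate with the new point.
TRUST_THRESHOLD = 0.02  # hypothetical tolerance on the observable
if std[0] > TRUST_THRESHOLD:
    # new_y = run_simulation(theta)  # expensive, hypothetical call
    # gp.fit(np.vstack([X_train, theta]), np.append(y_train, new_y))
    pass
```

The point of the design is that the GP's predictive standard deviation gives you the trust region essentially for free: wherever the surrogate is uncertain is where you spend your next simulation.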
That's the general strategy we're interested in pursuing. It's still fairly early in the pipeline, so there's a lot of work to be done here, but these are some of the areas we're going to apply it to. We're really interested in using this a lot for our Lennard-Jones fitting, but also for electrostatics in the future. We can examine and improve models for combining rules: the commonly used Lorentz-Berthelot combining rules (written out below) are probably a little simplistic, and we might be able to get some extra mileage out of our Lennard-Jones parameters there. We can look at the number of SMIRKS types; this is again what Mike Gilson was looking at: when is it worth splitting one atom type into more than one? This approach would let us decide that quantitatively. A bigger change would be the functional form of the van der Waals interaction itself: Lennard-Jones 12-6 is the most commonly used, but we could also look at the Mie potential or the Buckingham potential to examine that repulsive exponent. And moving into electrostatics further down the line, we want to look at bond charge correction schemes and make decisions between those.
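For reference, the Lorentz-Berthelot combining rules I mentioned are the arithmetic mean of the sigmas and the geometric mean of the epsilons:

```latex
\sigma_{ij} = \frac{\sigma_{ii} + \sigma_{jj}}{2}, \qquad
\epsilon_{ij} = \sqrt{\epsilon_{ii}\,\epsilon_{jj}}
```

So with that, I'll thank you all for your attention and open it up to questions.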