My name is Jessica Maat and I am a graduate student in David Mobley's lab. Today I am going to be talking about training data set selection for the Open Force Field version 1.2 small molecule force field, also known as Parsley. Prior to 1.2 we were using the generation 1 training data set, and today I will be focusing on our newly designed generation 2 training data set.

How does the training data fit into the general workflow for Open Force Field? Our training data consists of quantum mechanical torsion drives, optimizations, and Hessians. The data sets are created and stored in QCArchive, and we use this training data in our fitting procedure with ForceBalance. In the general ForceBalance workflow, the training data comes into play as the reference ab initio calculations. Using our training data, we can generate newly fit force fields such as SMIRNOFF version 1.2.

Today I'm going to focus on how we redesigned our training data. The aim of the new data set is to improve the generalizability of our force field: we wanted to see how we could curate our training data to create a force field that is able to model a larger range of chemistries. We had three major aims when creating the generation 2 training data set. First, every parameter in the force field should be used at all, that is, have at least some coverage. Second, every parameter should be used a reasonable number of times; we considered five occurrences reasonable coverage. Third, the parameters should be used in diverse chemical environments. Our approach was a general workflow that I will go through in more detail during this presentation.

Stage 1 is the starting data sets. The main improvement of Gen 2 over Gen 1 is that we expanded the pool of data we select molecules from. In Gen 1 we used only the Roche set and the coverage set, but in Gen 2 we expanded to three additional data sets; how each was generated is explained on GitHub, and I will mention later where you can find everything. The eMolecules discrepancy data set was created by Jordan Ehrman in our lab, based on the RMSD differences between geometries of the same molecules minimized with different force fields; the molecules where SMIRNOFF performed most differently were placed into this data set. We also added the Pfizer discrepancy set, which selected molecules that behaved differently in QM versus OPLS3. And we added the Bayer data set, which consists of a patented collection of pharmaceutically relevant molecules.

Next we parse the inputs and determine parameter usage. We took each molecule in the data sets and labeled it with the bond, angle, and torsion parameters it uses, and then we enumerated tautomeric and isomeric states. That gave us, for each parameter and each data set, the set of molecules that exercise it. From there we were able to do fingerprinting and clustering, which is how we generated more diverse data sets.
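Concretely, the parameter-labeling step can be done along the following lines. This is a minimal sketch assuming the current openff-toolkit API and the openff-forcefields release files, not the project's actual curation scripts, which live in the Open Force Field GitHub repositories; the example molecule is hypothetical.

```python
# Minimal sketch: label a molecule with the SMIRNOFF parameters it uses.
from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField

ff = ForceField("openff-1.0.0.offxml")   # Parsley 1.0, for illustration
mol = Molecule.from_smiles("CCO")        # hypothetical example molecule

# label_molecules returns, per molecule, a dict mapping parameter tags
# ("Bonds", "Angles", "ProperTorsions", ...) to {atom-index tuple: parameter}.
labels = ff.label_molecules(mol.to_topology())[0]

used_parameter_ids = set()
for tag in ("Bonds", "Angles", "ProperTorsions"):
    for atom_indices, parameter in labels[tag].items():
        used_parameter_ids.add(parameter.id)  # ids look like "b41", "a44", "t17"

print(sorted(used_parameter_ids))
```

Aggregating these id sets across a whole data set gives the per-parameter coverage counts discussed in this talk.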
For the fingerprints, Hyesu, who also worked on this project, ran careful experiments to determine which fingerprint method to use for our analysis. She took a limited set of molecules and explored tree, path, LINGO, and MACCS key fingerprints, computing the distance matrix for that set under each method. We found that the tree and path fingerprints were very sensitive to stereochemistry, and that overall the MACCS key fingerprint was the best choice for our case because it focuses most strongly on functional group matches, which supports our goal of increasing the chemical diversity of the training data set.

With the MACCS key fingerprints, we used DBSCAN to cluster all of the fingerprints (sketched in code below). After that we performed molecule selection: from each cluster we randomly selected one molecule, under the assumption that it is representative of a unique chemistry within that cluster. We then generated QM data for the selections in QCArchive.

Next I'll talk a little about the resulting coverage and the diversity of the different data sets. These are results from Hyesu, who generated the torsion data sets. In her first round of selections she randomly selected one torsion per parameter and got quite good coverage. In the second round she selected one torsion per cluster, so each parameter could contribute more than one torsion. Overall, we submitted many new calculations to QCArchive for the Gen 2 training data set.

The optimization data set is what I primarily worked on, and I found that its coverage improved greatly in generation 2. In these two bar plots, the x axis shows the angle and bond parameters and the y axis the number of molecules; blue is generation 1, which is overlapped by the generation 2 training data set in orange. As you can see, coverage improves pretty much across the board in Gen 2 compared to Gen 1: many more molecules exercise each parameter, and coverage of the bonded parameters also increases. Parameters a44 and b41 are now covered by the Gen 2 data set and were not covered in Gen 1.

Looking at the distribution of heavy atoms in the generation 1 and generation 2 data sets, the Gen 2 optimization data set covers a much larger range of molecule sizes, which is really great for improving chemical diversity. In the Gen 1 data set the heavy atom counts were mostly between 10 and 15, but in Gen 2 we get coverage of molecules with up to almost 40 heavy atoms. This allows a lot more intermixing of different chemistries during fitting.

I also performed another test that looks at functional groups, and I found that the Gen 2 training data set covers a broader range of functional groups, which is also very encouraging. I ran this test with a set of 200 pharmaceutically relevant functional groups and analyzed which chemistries were used in the Gen 1 and Gen 2 data sets. These plots are a little difficult to see just due to their size, but the nodes represent the different functional groups, and the edges represent uses of those functional groups together in the data set's molecules.
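Returning to the selection pipeline described above, here is a minimal sketch of the fingerprint, cluster, and pick-one-per-cluster step. It assumes RDKit's MACCS implementation and scikit-learn's DBSCAN with illustrative eps/min_samples values; the project's actual scripts (and the OpenEye fingerprint comparisons Hyesu ran) are in the Open Force Field repositories.

```python
# Minimal sketch: MACCS fingerprints -> Tanimoto distance matrix -> DBSCAN
# clusters -> one random representative molecule per cluster.
import random
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys
from sklearn.cluster import DBSCAN

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "CC(=O)O"]  # hypothetical pool
fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in smiles]

# Pairwise Tanimoto distance matrix (1 - similarity).
n = len(fps)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])
        dist[i, j] = dist[j, i] = d

# Cluster on the precomputed distance matrix; eps/min_samples are illustrative.
labels = DBSCAN(eps=0.4, min_samples=1, metric="precomputed").fit_predict(dist)

# Randomly pick one molecule from each cluster as its representative chemistry.
selected = [random.choice([s for s, c in zip(smiles, labels) if c == cluster])
            for cluster in sorted(set(labels))]
print(selected)
```

With min_samples=1 every molecule lands in some cluster, so singleton chemistries are never discarded, which is consistent with the coverage goal described above.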
It was really great to see that the Gen 2 data set covered an additional 22 functional groups in this test compared to Gen 1: Gen 2 had 120 nodes while Gen 1 had 98. And, really interestingly, the Gen 2 data set had many, many more edges, which shows there is far more intermixing of the different functional groups: Gen 2 had 19,000 edges while Gen 1 had 1,000.

These training data sets are available to everyone. We have all the torsion drive data sets and five optimization data sets in QCArchive; the table lists the names of all the torsion and optimization data sets. The generation scripts and the details of the QC data set submissions are under the Open Force Field repository on GitHub.

Some future directions for training data set selection: we could consider increasing the number of data sets we select from, to further increase diversity. We should look into benchmarking the data sets to determine whether we are achieving our goal of generalizability. It would also be interesting to look at automated approaches to force field fitting and benchmarking, so we could run more efficient experiments on training data set selection and learn how to improve the method. And lastly, it would be interesting to curate additional data sets with diverse chemistries to improve the coverage even more.

In conclusion, I've gone over our training data set selection procedure for Gen 2, along with an analysis of the coverage and chemical diversity of the new training data set. With that, I want to thank Hyesu for working with me on this project; she did a lot of really great research. I also want to thank Lee-Ping, my PI David Mobley, and Christopher Bayly for all of their help and scientific input on this project, and Jeff Wagner, Daniel Smith, and Ben Pritchard for really helping with getting these data sets completed in QCArchive. Thank you.
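A rough sketch of the functional group graph analysis just described: nodes are functional groups, and each edge records one co-occurrence of two groups within a molecule, matching the talk's description of edges as "uses" of the functional groups (hence a multigraph). The SMARTS patterns and molecules below are hypothetical stand-ins for the real list of 200 groups and the actual data sets.

```python
# Minimal sketch: build a functional-group co-occurrence multigraph.
import itertools
import networkx as nx
from rdkit import Chem

functional_groups = {                      # hypothetical subset of the real list
    "alcohol": "[OX2H][CX4]",
    "amine": "[NX3;H2,H1;!$(NC=O)]",
    "carboxylic_acid": "C(=O)[OX2H1]",
}

graph = nx.MultiGraph()                    # one edge per observed co-occurrence
for smi in ["CCO", "NCCO", "NCC(=O)O"]:    # hypothetical data set
    mol = Chem.MolFromSmiles(smi)
    hits = [name for name, smarts in functional_groups.items()
            if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))]
    graph.add_nodes_from(hits)
    for a, b in itertools.combinations(hits, 2):
        graph.add_edge(a, b)               # groups co-occurring in one molecule

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```

Comparing node and edge counts between the Gen 1 and Gen 2 graphs built this way gives the 98-versus-120-node and 1,000-versus-19,000-edge comparison above.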
Are there any questions for Jessica?

I have a question and a comment. What is the significance of naming a data set a "discrepancy" data set?

Well, that's my fault; I can answer that one. We had a separate project done by an undergraduate in the group named Jordan Ehrman, who took a large set of molecules, energy-minimized each with a variety of different force fields, and then looked for molecules whose geometries were dramatically different across force fields, the idea being that those are informative molecules. It also happens that some of that chemistry is really weird, or sort of unusual, or has a lot of diverse functional groups in it. So we thought molecules that show discrepancies across force fields are probably a good place to look when training our force field, and because that set uses a bunch of rare parameters, it is also really good for improving coverage. So we pulled in some of those compounds, partly because they use rare parameters and partly because they are informative.

Thank you. So clearly, as you try to bring in these obscure molecules and generalize more, that will degrade the force field overall. I mean, that's expected.

Well, that's the question. Maybe Jessica should go first on that one, and then I can comment.

So the question is: if we continue to increase the diversity of the training data set... Increasing the diversity is obviously the right thing to do to develop a generalized force field; I have no problem with that. But of course, as we move out and try to bring more obscure molecules into the domain, I would expect the force field to become more general but also to generally degrade. I am just wondering what your thoughts are on this; I believe that's expected.

Yeah, that's a good question. I'm not quite sure it would necessarily degrade. The way I see it, if we add more diverse chemistries, do we know whether this will affect all of the chemical environments of the molecules? Maybe the results will not change very much. David, do you have thoughts on this too?

Yeah, I think it partly depends on what one wants out of a general force field, and it touches on issues like the domain of applicability. Suppose we were trying to build a really good force field for alkanes; then the more molecules we include that aren't alkanes, the more we have to worry about degrading our treatment of alkanes. On the other hand, if we're trying to build a force field that generally covers everything, you could argue that the better the representation of everything in our fitting data, the more general and accurate we might expect it to be overall, though for any particular class of molecules you could find cases where making it more general degrades it. So I think what you asked is a really profound question, and what one wants may vary depending on the target application and field, and on how general you want your force field to be. Your answer for what you want might be different from his answer, and that might mean we eventually want to design different training sets for the applications we have in mind. I'm not sure we really know the answer to that in general yet.

I guess if you were to start from scratch, you would also consider changes in the potential function, the various terms and so on, and then it's probably possible to get a really good force field. But at the moment you are limited to a potential function which, I assume, worked really well on a small class of molecules, and now you are sticking to the same potential function and extending its applicability, so you are constrained on that front. Maybe we'll have to wait and see what the results are, but that's why I think that if you bring in more obscure molecules, it's likely to degrade.

The other factor that comes into play, though, is this: say we look at a parameter like a generic carbon with three connections. We're not extending where that parameter is used. That is to say, our force field applies it only in places that happen to match the generic trivalent-carbon parameter, so it's never going to apply to a carbon with four connections.
So as we bring in new molecules, we're not making the force field more general in terms of what it covers: if we bring in something really weird that our current typing does not cover at all, our typing is still not going to cover it. What we're doing is this: we have these groups of molecules that our typing can cover, and we would like to build out so that we cover a broad representation of all the environments in which each parameter could occur. If the parameter describes them all well, we're fine having them lumped together. But if we start finding that some of these molecules are treated well and some are treated badly, then maybe we want to split that into, say, two types of carbons with three connections, or otherwise increase the sophistication with which we treat them. That's a little aside from the functional form issue you raised, but even within a specific functional form we can ask how well we're treating these environments. And that gets back to some of what Vicki was talking about, where she looks at whether specific parameters are overrepresented in compounds with higher errors.

Thank you.
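To make the typing point above concrete: a pattern for a trivalent carbon simply never matches a tetravalent carbon, so adding new training molecules cannot extend where that parameter applies. A minimal illustration with RDKit, where the SMARTS is a hypothetical stand-in for a SMIRNOFF SMIRKS pattern:

```python
# Minimal sketch: substructure typing only applies where the pattern matches.
from rdkit import Chem

trivalent_carbon = Chem.MolFromSmarts("[#6X3]")  # hypothetical stand-in pattern

sp2 = Chem.MolFromSmiles("C=C")  # each carbon has three connections (incl. H)
sp3 = Chem.MolFromSmiles("CC")   # each carbon has four connections

print(sp2.HasSubstructMatch(trivalent_carbon))  # True
print(sp3.HasSubstructMatch(trivalent_carbon))  # False: never typed by this pattern
```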