Okay, hi everyone. My name is Jeff Wagner. I am the lead software scientist with Open Force Field, and in this talk I'm going to give a few updates on what's been happening with our core infrastructure since the last annual meeting.
I think the biggest update is that we have two new software scientists, David Dotson and Matt Thompson. With two additional folks, that makes three of us total. David is a former maintainer of MDAnalysis and a major developer for that project. He will be working on parallelization and performance improvements for OpenFF Evaluator, formerly known as PropertyEstimator, developed by Simon Boothroyd. Matt Thompson comes to us from the MoSDeF project, where they have also been doing chemical-perception-based typing, and he worked quite a bit on their interoperable system model. Unsurprisingly, he will also be working on our interoperable system model. Both of them have been marvellously productive. They both started work with us almost exactly one month ago, right around March 20, I believe, and so far they've been taking the initiative on a lot of efforts. The initial onboarding tasks are largely centered around the Open Force Field toolkit, but they've already expanded into helping us work on QC submission and QC-molecule-to-MM-molecule quality checks, as well as a bit of work on our benchmarking dashboard.
So I'm going to start this talk by chatting mostly about the Open Force Field toolkit, and after that I'll discuss other parts of our infrastructure that have been under development. Since we last talked in 2019, we've made a lot of changes to the Open Force Field toolkit. On the left here you can see what our main workflow looked like at the time of the 0.2.0 release: you load a molecule, grab its positions, make a force field, turn the molecule into a topology, and then parameterize a system with it.
Now things are completely different. Oh no, wait, they're not. The only thing that's really changed in terms of the user experience for our major use case is that we've released a new force field. We expect this workflow to be highly stable. And I want to reiterate our development values here, which is that supporting this "after" workflow, and equivalently the "before" workflow, is paramount. We want this to be buried deep inside of people's workflows, where it's so reliable that they forget that it's even there. So when we look at this code and how we want it to evolve over the next few years, we really don't want it to. This shouldn't involve people going in and making manual changes every time we have a release. And if we do change the behavior of this code, it should be unambiguously better: if we change how molecules are processed, everyone should agree that this is a better way to process molecules. The only thing that you should ever need to jump into this code to update is the force field name. We anticipate probably 75% of our user traffic coming just through this workflow. Sure, we have a whole host of other functions and this wonderful API for power users who want to get more into depth about what's going on, but really, most of the functionality of our toolkit is hidden underneath these couple of lines of code here.
In addition to maintaining this functionality, we're looking to continue improving consistency between the RDKit and OpenEye back ends. Our science relies on both of these acting identically, whereas in reality they're different code bases. For the most part we've made positive progress in standardizing the behavior of these two. We also recognize that we're putting out bug fixes kind of slowly. We're often offering people workarounds instead of just fixes to the code that they're trying to run.
So the software scientists on the team are right now discussing different workflows for how we manage feature requests, bug fixes, and releases, to make these come out faster.
So, talking about the major technical milestones since our last annual meeting: we've had a couple of major Open Force Field toolkit releases. 0.5.0 added GBSA. 0.5.1 added a utility notebook for checking whether Parsley can cover molecules of interest. We've also added support for library charges, and I'll talk a little bit about those in a minute. And 0.7.0, which is in the works now and expected fairly soon, maybe even by the time this video is shown, will include a whole host of new features that I'll talk about.
We've also made a new content package called openforcefields. This is the place to look for all of our major force field releases. This is where Parsley lives, this is where the Parsley updates live, and this is where Sage will live. In here we've had a couple of major releases. We've had 1.0.0, which is the original Parsley. We've had 1.1.0, which is the Parsley update; it has a few angles, propers, and impropers that were added and refit to the training data. And then just recently we made two bugfix releases, indicated by changing the last digit of the version number. Those are 1.1.1 and 1.0.1, and these are simply the same parameters as before, but with the addition of monatomic ion library charges. We always had the van der Waals parameters for ionic lithium, but we didn't actually have a way to specify that the charge should be plus one. So this is a small cosmetic improvement. We've also added two new smirnoff99Frosst versions. These are not expected to be particularly performant, but it's important that we version these in order to record the provenance of our force field.
So let me talk a little bit more about GBSA support. This is the first major release since the last in-person meeting.
Currently we support three popular GBSA models: the HCT, OBC1, and OBC2 models, and those are acronyms for a bunch of author names. These correspond to values of the Amber igb keyword. One note about these is that the parameters work in OpenMM, but ParmEd does not support transferring them to other packages, so this is an OpenMM-only implementation right now. We've been keeping our ear to the ground, and it sounds like perhaps the OpenMM and ParmEd implementations are off by a small factor. When that gets resolved, we will update our model to follow what their models do. The way that GBSA is specified is that there are a number of keywords in the header of the GBSA section that specify which model you're using, and this sets a number of global constants. You can choose to have the surface area energy be either ACE or none, so you can have no contribution from surface area calculations. And we follow the normal SMIRNOFF hierarchy rules for assigning single-atom GBSA parameters.
The next major release, 0.6.0, added library charge support. These are basically a way to do charge lookup from a table for entire matching SMARTS patterns. It got a little bit interesting, because technically the SMIRNOFF hierarchy rules allow you to do some strange things. The bottom line on how this works is that for a molecule to have its charges assigned using library charges, all of its atoms must be covered by at least one library charge. That is, you can cover an entire molecule kind of piecemeal. We had intended for this to be used on something like a biopolymer, where for each amino acid you specify what the charges are. But allowing this capability in theory means that you could have a small molecule which you parameterize using library charges. It's a very strange use case and we recommend that you don't do it, but in theory you could.
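As a rough illustration of the coverage rule just described, here is a minimal sketch in plain Python. It is not the toolkit's implementation: the `Match` representation and the function name `assign_library_charges` are hypothetical, and the "later match overrides earlier match" behavior is an assumption based on the usual SMIRNOFF hierarchy rule. The example values are the familiar TIP3P water charges.

```python
from typing import Dict, List, Optional

# Hypothetical representation: each library-charge match maps atom indices
# in the molecule to the charge (in elementary charge units) it assigns.
Match = Dict[int, float]


def assign_library_charges(n_atoms: int, matches: List[Match]) -> Optional[List[float]]:
    """Return per-atom charges if every atom is covered by a match, else None."""
    charges: List[Optional[float]] = [None] * n_atoms
    # Assumption: matches are listed in hierarchy order, so a later match
    # overrides the charge written by an earlier one.
    for match in matches:
        for atom_index, charge in match.items():
            charges[atom_index] = charge
    # The molecule is only charged this way if no atom was left uncovered.
    if any(charge is None for charge in charges):
        return None
    return [float(charge) for charge in charges]


# Water with atoms ordered H, O, H, covered by a single three-atom match.
water = assign_library_charges(3, [{0: 0.417, 1: -0.834, 2: 0.417}])
# A partial match leaves atoms uncovered, so library charges are not used.
partial = assign_library_charges(3, [{0: 0.417}])
print(water, partial)
```

In a real force field the matches would come from SMIRKS pattern matching; here they are written out by hand to keep the sketch self-contained.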
Here are two examples of what library charges look like. One is TIP3P water. Here we don't need any header arguments to define how library charges work. Then, for each pattern that we want to perceive, we specify a single library charge entry. We give it a name and we give it a SMIRKS, so in this case this is hydrogen, oxygen, hydrogen. And then we give a charge for each tagged atom, so we have tagged atoms one, two, and three. Here we see the hydrogens should both have charges of 0.417 and the oxygen should have a charge of -0.834. This can be applied to larger molecules as well. Here, for example, are library charges for an alanine. I think that this corresponds to ff14SB parameters, but it's a lot of numbers, so there may be a mistake in there. This is SMARTS for an entire amino acid. You can see that here we have a tremendously large number of tagged atoms, and some atoms are not tagged; those are just a requirement for the match to happen in the first place. But for each of these tagged atoms, and we count up to 10 of them, we have a corresponding library charge. And it's important to note that these need to add up to zero. We don't allow ending up with a net charge from library charges. So this is exciting because it unlocks us taking our first steps into biopolymer or general polymer support, and we'll hear later on in the meeting about efforts to port an Amber force field into SMIRNOFF format.
We are also soon going to be supporting Wiberg bond order calculation and torsion interpolation based on that. Now, this is kind of a subtle topic, and we'll hear about this in Chaya's talk. The gist of it is that we can do electronic population analysis of semiempirical QM to get a floating-point value for the order of a bond. And Chaya Stern found that the order of some of these bonds, especially in conjugated or partially conjugated systems, correlates very well with the barrier height of a torsion.
And so our thought is that by doing the semiempirical quantum calculation during parameterization of a molecule, just like we do for partial charges, we can more finely describe the torsions in a molecule. Take the torsion of interest in this molecule here. In our current force field, the central torsion would match one of these three SMARTS. These SMARTS differ only in the order of the bond in the middle, and then there's a little bit to go along with that on the sides. But more or less we say: if this potentially conjugated bond is a single bond, here's a parameter with a k value around one. If it's aromatic, here's a k value around 2.5. And if it's double, here's a k value around four and a half. What we can do is provide support for describing bond orders in a different fashion, where we condense these three parameters down into one. We say: here are the same atom types, but with an "any" bond in between them. This tilde will match any order of bond in the molecular graph. There's still a periodicity and a phase, but for k1 we say that if the bond order is one it's this value, and if the bond order is two it's this value; we're just taking these from up above. And we can simply linearly interpolate and get a pretty good guess for a lot of these different situations, for example for different tautomers of the same molecule. So this is a really exciting piece of infrastructure, and we're hoping to have this out in the next few weeks so that we can begin fitting force fields with it.
We're also releasing more customizable charge methods and custom ACCs, and I'll say exactly what those are in a second, and the ability to read and write SDFs with charges. So we will be adding support for more than just the ToolkitAM1BCC tag that you see in many of our current force fields. We want to allow a larger degree of customization for how charges are generated.
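To make the bond-order interpolation described a moment ago concrete, here is a minimal sketch of the arithmetic. The function and argument names are illustrative only and are not taken from the toolkit or the force field format; the k values are the approximate ones quoted above (about 1 for a single bond and about 4.5 for a double bond).

```python
def interpolate_k(bond_order: float, k_bondorder1: float, k_bondorder2: float) -> float:
    """Linearly interpolate a torsion force constant between bond orders 1 and 2.

    k_bondorder1 is the force constant at a bond order of exactly 1 and
    k_bondorder2 is the force constant at a bond order of exactly 2.
    Values outside the 1..2 range are extrapolated along the same line.
    """
    slope = k_bondorder2 - k_bondorder1  # change in k per unit of bond order
    return k_bondorder1 + slope * (bond_order - 1.0)


# Approximate values from the talk: k ~ 1 (single bond), k ~ 4.5 (double bond).
k_single, k_double = 1.0, 4.5
for bond_order in (1.0, 1.5, 2.0):
    print(bond_order, interpolate_k(bond_order, k_single, k_double))
```

With these inputs a fractional bond order of 1.5 gives 2.75, which is in the neighborhood of the aromatic value of about 2.5 mentioned above, so a single interpolated parameter can stand in for the three separate ones reasonably well.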
Here is what the new spec will support for doing AM1-BCC but with your own BCCs. In the header, we're specifying that we want to use a charge increment model with one conformer of our molecule of interest and AM1 charges. Then, after those AM1 charges are generated, so this is just AM1-BCC before the BCCs, we can go in and define a collection of SMARTS-based BCCs, each tagging some number of atoms and providing the same number of charge increments. And we can go back and revisit some of the choices made, and see if a larger or smaller set of BCCs, or different values for the BCCs, would provide improved performance. The important note here is that it's easy to think of this by comparison to BCCs, but really what we're talking about is ACCs. This is for an arbitrary number of atoms. A bond implies that you only have two atoms that you're transferring charge between, but in a lot of our SMARTS-based parameterization we can tag an arbitrary number of atoms and simply provide a corresponding number of charge increments for how to adjust them.
Now, as I said before, along with this more flexible way to calculate charges, we want a better way to serialize molecules with charges, that is, save them to disk. One problem with SDF is that it doesn't contain a single supported place for storing charges, and for that reason you haven't been able to store charges in SDF previously. So we've talked with OpenEye and RDKit, and we've converged on a spec for the partial charges that we use in MM. It's a field like this. Here's a molecule, hydrogen cyanide. It's got three atoms: C, N, and H. And then over here, this is the new way that we will be supporting storing partial charges inside of an SDF. So from now on, instead of having to onerously pass around MOL2 files if we want complete charge information in a file, we'll be able to do that with SDF, starting at the 0.7.0 release.
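As a sketch of what such a field could look like on disk, the snippet below writes and reads one SD data item holding a space-separated list of per-atom charges. Several things here are assumptions rather than anything stated in the talk: the tag name `atom.dprop.PartialCharge`, the six-decimal formatting, the helper names, and the charge values, which are placeholders and not computed charges for hydrogen cyanide.

```python
from typing import List, Sequence

# Assumed tag name; the talk shows the field on a slide but does not name it.
CHARGE_TAG = "atom.dprop.PartialCharge"


def format_charge_data_item(charges: Sequence[float], tag: str = CHARGE_TAG) -> str:
    """Format per-atom charges as one SD data item (header, values, blank line)."""
    values = " ".join(f"{charge:.6f}" for charge in charges)
    return f"> <{tag}>\n{values}\n\n"


def parse_charge_data_item(block: str) -> List[float]:
    """Read the charges back from a data item written by the function above."""
    lines = block.splitlines()
    return [float(token) for token in lines[1].split()]


# Placeholder charges for a three-atom molecule, one value per atom in file order.
placeholder_charges = [0.1, -0.3, 0.2]
item = format_charge_data_item(placeholder_charges)
print(item, end="")
print(parse_charge_data_item(item))
```

Keeping one value per atom, in the same order as the atom block, is what lets a reader reattach the charges without any extra bookkeeping.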
One final note on the Open Force Field toolkit is that we're starting to roll out developer documentation. This is a living document, so it's not perfect right now and it's incomplete; there are a few tags marking where we want to add new sections. But this is our start at trying to get genuine outside contributors working on the Open Force Field toolkit. This is a long-term investment that I think is going to pay off really well if we manage it correctly, because seeding a contributor community will bring us new users simply by giving people more exposure to the toolkit. It will spread the best practices that we use in our ecosystem, and I think that will probably raise the software quality, or at least the level of maintainability, for a lot of computational chemistry. It will help us enable new workflows and studies, both by informing us of what they are and by letting us see how people implement them. It'll increase our engagement with other efforts: when people understand how we do our development and are better able to navigate our API and classes, they'll be able to more tightly integrate our code into theirs, and they'll be more willing to suggest or even implement valuable features. Also, having more eyes and more hands is going to test our code in new ways, and ultimately I think this is going to make it more robust. If you want to go look at that, I'll share these slides later on. The documentation is versioned a little bit separately from our releases; you can of course get the documentation for a specific release. If you look at the latest version of our documentation, this developer's guide is already online. You do not need to wait for the 0.7.0 release.
So, moving a little bit away from the Open Force Field toolkit for the rest of this talk, I want to talk about our general strategy for engagement with the community. We've actually had a couple of really nice successes over the past year.
As I said, we drove agreement between RDKit and OpenEye on how we want to handle partial charges in SDF files. We successfully engaged, and are continuing to engage, with the AmberTools community. We found that sqm was actually able to calculate Wiberg bond orders; it was just not very well documented. Working with them, we made this part of the default build and improved its documentation in the AmberTools manual. That was a bit of work that the AmberTools developers put in for our benefit. And right now we have a team member, Jaime, who's packaging AmberTools as a conda package. He's currently testing the 2020 release of AmberTools, and this is looking very promising. The fact that we're able to package AmberTools as a conda package makes it a reliable dependency for our toolkit, and it means that we can rely heavily on its functionality.
We've been working with OpenKIM recently to exchange best practices. OpenKIM is a center that tests interatomic potentials to see how well they can reproduce properties of solids and things, so it's a little bit more materials-science oriented. But they have been wanting to expand into bonded force fields, and we have an interest in reproducibility and benchmarking, and that's really what they do well. So we've had a very successful set of meetings with them, and we're building towards a prototype of Open Force Field in their benchmarking engine. We've also been engaging with MolSSI's and BioExcel's interoperability efforts, and just a few weeks ago our scientific coordinator, Karmen, spun this off into a small organization called m moss. That's where we're continuing some of the discussions about software interoperability. And finally, because I know the entry point for a lot of practitioners of molecular dynamics is CHARMM-GUI, we will be working with CHARMM-GUI to implement our parameterization engine inside of their workflow.
So when people go to get lipids and set up systems, we will hopefully be able to provide an option to simply parameterize Open Force Field ligands in that as well.
We've also been contributing to the growth of best practices in our field. We've been engaging with MolSSI about the way that they lead their best practices workshops and the way that they organize their cookiecutter. This is a standard Python framework for building computational chemistry packages, and we've been engaging in discussions about how these practices should evolve. There have been instances where we found bugs, or found parts of the environment that have changed, and we've been contributing to the upkeep and the design of this cookiecutter. And actually, we've been starting an annual tradition of an Open Force Field best practices workshop led by MolSSI staff. Here's the 2019 version; this was a couple of days after we met last year, up at UC Irvine. We had another best practices workshop just last week, led again by Daniel Smith. This one was of course remote, but we made a lot of progress in it. What's nice about standardizing on this cookiecutter is that it makes the maintenance burden for various packages much, much lower. This is a primarily academic project, and we have developers coming and going as they graduate and as they move to different stages in their careers. By using the shared ecosystem, it becomes very easy to pass off orphaned software to the central software team, the software scientists, and we can keep it running. In fact, just two weeks ago there was a change in how Miniconda distributes its packages, and we were able to simply copy and paste a fix more than a dozen times to bring the entire ecosystem back up to spec. This was wonderful; it would have taken us hours and hours, or days and days, if we had all used different build infrastructures.
Another activity that the core infrastructure team has been working on is marshaling compute for our quantum chemistry data sets. On a normal day, when we don't have any special needs, we'll be running about 1000 cores: about 600 from Memorial Sloan Kettering with the Chodera lab, and about 400 on a distributed computing platform called the Pacific Research Platform. In order to get the Parsley 1.2.0 data sets ready for fitting, for the last few weeks we've been running considerably more CPU cores. Without contributions from MolSSI, we've been landing somewhere in the 4000 to 5000 core range. This is a graph of compute used by Open Force Field over the past two weeks on the Pacific Research Platform. In these periods where we're using about 2000 continuous cores, this is a value of about $1000 per day of compute, which we're receiving for free by virtue of being an academic project. So this has been absolutely fantastic, and the MolSSI software has scaled very well to meet this need. The 1.2.0 release is going to include, I think, a ludicrous amount of QM data, and in the future we can continue to increase the rate at which we add QC data to our fitting.
So, talking about a couple of planned developments before we finish. Matt Thompson, the new software scientist, is going to be working on our dedicated OpenFF System object. Many people who have worked with the Open Force Field toolkit might find it a little bit unwieldy that our native output format is the OpenMM System, and we are looking to move away from that and build our own. We want this to meet a number of needs, and not just for ourselves: we know that we'll get better buy-in, and that this will become a more functional object, if we take input from and meet the needs of different members of our community. So we're currently in talks with various package maintainers and people who are doing force field research about what exactly they would need in order to adopt this system format.
Importantly, in addition to the information contained in an OpenMM System, this system format will contain a mapping of system parameters to force field parameters. This is a very important concept, because in the OpenMM System format there's no record of which system parameters, for example those of a particular bond in a molecule, map to which force field parameters. When you're optimizing, you want to optimize the force field parameters, not the individual system parameters, because those are very redundant: there are many instances of that bond in the system, and you only want to change the number once. So that's one important feature we'll have. We're also looking for it to be convertible to a number of popular simulation formats. Initially this might simply be via InterMol or ParmEd. In the future we may roll our own converters if we start doing anything very special.
One great thing here is that when we talk about doing force field fitting, what we're really talking about is numerical optimization, and this is a well-studied field. If we're able to format our optimization questions in a more traditional way, if we're able to abstract away from specifically molecular structures and put this into a machine-learnable format, we will get access to a number of very widely supported community tools for numerical optimization. So one topic that we're looking into is how exactly we want to provide an interface to existing machine learning toolkits.
Another exciting development that we see coming out in the near future is support for geometry-independent charges. This is an interesting topic, where we acknowledge that we want force fields and research to be reproducible, but one crucial step in force field parameterization is charge calculation, and those charges are dependent on the conformer of the molecule that you use for your semiempirical calculation.
Now, OpenEye and RDKit, and even just different versions of those, are going to generate different conformers, and thus the charges that you get out are likely to be different. Right now we don't see them being tremendously different, but we know that this adds some amount of noise to our force field fitting. So if we were able to assign charges accurately, but based solely on the graph representation of the molecule, then we would have a major advantage in terms of consistency and reproducibility. And that's what Yuanqing Wang of the Chodera lab is working on right now. He's working on a package called gimlet, which uses neural networks to predict partial charges of atoms in molecules. Here's just a quick plot from one of his first papers, showing the AM1-BCC calculated charge versus the neural-network-predicted charge for different atoms in a data set of molecules. This performance looks quite nice. So we're hoping that by adopting this as a supported charge method in the toolkit, we can speed up molecule parameterization and make it more robust and less geometry dependent.
Another major development that we see coming out in the future is being led by Josh Horton. He's an Open Force Field postdoc working in the UK, and this is what's called our bespoke fitting workflow. The purpose of the bespoke fitting workflow is to automatically generate parameters for a molecule of interest by running specific QC calculations on it. This will rely on running a personal instance of QCArchive. It can be seeded with the full data in QCArchive, but it can be run more or less offline. So if you are interested in not having people know which molecules you are working on, this can be run entirely in-house. It looks roughly like this: you come in with an input molecule, and this input molecule will be prepared by the Open Force Field toolkit.
So some set of partial charges and an initial guess force field, containing parameters specifically tailored for this molecule, will be made and held on to for a minute. What will happen is that the molecule of interest, and this could be a SMILES, this could be an SDF, what have you, will be passed in some essential form to this QCSubmit package. And QCSubmit will decide: oh, we should run a torsion drive of these torsions, we should run some other calculations, maybe optimizations, to best pin down the force field for this molecule. This will be connected to a local QCFractal server. That could be something that you're running on the same machine as this workflow, or you could have your own cluster that's just hosting compute for this. It will submit different conformations of your molecule of interest. The results will come back, and it will do a good amount of quality checking here. Then MM-compatible conformations and energies will be passed to ForceBalance, and ForceBalance will optimize a force field that contains parameters specifically tailored for this input molecule. And you'll come out with the best we can do for force field parameters for your specific molecule.
The great thing about this is that the QCSubmit package that Josh is working on is extraordinarily powerful outside of this specific use case. We have always had a problem interpreting returned results from QCArchive, simply because in quantum chemistry, events like proton migration and bond order changes change the identity of the molecule, and we need to flag that data as not suitable for fitting, because we're no longer fitting to energies of the molecule that we started with. This has been a really hard point in our infrastructure, and we've got different workarounds for it stored in different places, but generally we want a single place where all of this quality checking can happen.
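One simple form such a quality check could take is comparing the bonded connectivity before and after a QM calculation. The sketch below is a hypothetical illustration and does not reflect how QCSubmit actually works: the function names are made up, bonds are given as plain pairs of atom indices, and how connectivity would be perceived from a QM geometry is left out entirely. It also only detects changes in which atoms are bonded, not changes in bond order.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Bond = FrozenSet[int]


def bond_set(bonds: Iterable[Tuple[int, int]]) -> Set[Bond]:
    """Normalize bonds so that (i, j) and (j, i) compare as the same bond."""
    return {frozenset(pair) for pair in bonds}


def connectivity_changed(
    initial_bonds: Iterable[Tuple[int, int]],
    final_bonds: Iterable[Tuple[int, int]],
) -> bool:
    """Return True if the set of bonded atom pairs differs between two records."""
    return bond_set(initial_bonds) != bond_set(final_bonds)


# Toy example: atom 2 is a hydrogen that starts on atom 0 and ends up on atom 1,
# as it would after a proton migration during a QM optimization.
before = [(0, 1), (0, 2)]
after = [(0, 1), (1, 2)]
flag_as_unsuitable = connectivity_changed(before, after)
print(flag_as_unsuitable)
```

A record flagged this way would be excluded from fitting, since its energies no longer describe the molecule that was submitted.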
And this is fantastic because it's going to give us the tools to jump around between QM and MM land, and we have a lot of different people in our ecosystem contributing to different parts of this.
We're also looking at putting in an automated benchmarking suite. What we're seeing down here is a prototype of a front end by Jaime Rodríguez-Guerra. He's a postdoc in Berlin, and he's extraordinarily talented. This will be a platform where we can run energy comparisons against QM data for smirnoff99Frosst, Parsley, GAFF, and maybe some other force fields that we find a way to implement. And we can provide an intuitive and flexible view for people to compare our force fields to previous generations of themselves and to others. Again, this is an area of huge synergy: Josh and Jaime and other folks have been working together on this for the last few weeks. We're hoping to have a prototype out in the next few weeks, maybe in time for May 4, but we'll keep you updated about that, because this is fairly challenging. And I think we're just running this on some sort of mock data, or a preliminary data set from Victoria Lim, so I wouldn't read too far into the numbers in this GIF.
So we have some more exciting areas of upcoming infrastructure. We're looking at adding polymer support. Right now, SMARTS-based algorithms perform very poorly on large molecules, because these are sort of graph isomorphism operations that become extraordinarily expensive as you try to feed things like proteins into the Open Force Field toolkit. So one major area of biopolymer and polymer support that we need to add is improvements in the performance of those algorithms.
Also, we want to be able to perceive when something looks like an amino acid but isn't covered by any of the libraries that we currently have, and have a workflow to cap it, charge it, and provide reasonable parameters that will be compatible with the other standard amino acids in the chain. So this is nice: we have an ongoing effort right now with Dave Cerutti, down here, to start porting over some Amber force fields. I think we're starting with Amber ff14SB. And support for polymers will be done hand in hand with the porting of these force fields.
Trevor Gokey, down here, a grad student with David Mobley, is working on adding virtual site support to the Open Force Field toolkit. Virtual sites are these off-center charges that can better describe the electron distribution around a molecule. This is a fairly hairy area of infrastructure, so I'm not sure if it will be out in the next few months, but we think that this may be a large area where we can increase the accuracy of force fields. We're going to be implementing support for four different types of virtual sites, and these will be able to be added either using SMIRNOFF-based typing or using the Open Force Field toolkit API. In the SMIRNOFF-based typing, at least, here is what adding a virtual site looks like. Here we have two virtual sites. One is a bond charge, so you define two tagged atoms, and then you define a distance from the center of one of them. You take the vector along the axis of that bond, and then it will go that much further along it, this distance d, and it will put a site here. That's a Lennard-Jones and Coulomb interaction particle. This is our preliminary spec: for a virtual site, you specify that its type is bond charge, as opposed to any of the three other supported types. Bond charge requires two tagged atoms, so you have atom one and atom two, a chlorine and a carbon. So this might be a sigma-hole model.
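The geometry of the bond-charge site described above can be sketched with a few lines of vector arithmetic. This is only an illustration under stated assumptions: the function name is hypothetical, and the direction convention (the site sits beyond atom one, on the side facing away from atom two) is my reading of "go that much further along" the bond axis, not something taken from the spec. Coordinates and the distance can be in any one consistent length unit.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def bond_charge_site(atom1: Vec3, atom2: Vec3, distance: float) -> Vec3:
    """Place a site `distance` beyond atom1 along the atom2 -> atom1 bond axis."""
    # Vector pointing from atom2 to atom1, i.e. outward along the bond.
    direction = tuple(a - b for a, b in zip(atom1, atom2))
    length = math.sqrt(sum(component * component for component in direction))
    if length == 0.0:
        raise ValueError("atoms coincide, so the bond axis is undefined")
    # Step `distance` further along the normalized bond vector, starting at atom1.
    return tuple(a + distance * component / length for a, component in zip(atom1, direction))


# Toy geometry: atom1 (say, the chlorine) on the x axis, atom2 (the carbon) at the origin.
site = bond_charge_site((2.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
print(site)
```

For a sigma-hole model, this puts the extra interaction site on the far side of the halogen, in line with the carbon-halogen bond.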
You specify some distance d, and that's the distance that we're looking at here. If this site is going to have a charge, you specify how much of that charge comes from atom one and how much of it comes from atom two, and then optionally also Lennard-Jones parameters, so a sigma and an epsilon. Again, this is exciting because we hope that adding support for this will allow us to include it in future rounds of fitting, especially nonbonded fitting. And this can offer dramatic improvements over traditional force fields, where the fixed charge sits right on top of the atom.
So with that, I'd like to thank everyone who's been a big contributor to core infrastructure over the last year. It's a long list of folks, and for a lot of them I've already mentioned their work. But yeah, it's been a great honor to work with this team. I don't think I've ever worked with people this talented, and really, I feel blessed to be working with these folks, at the pace that they work and with the insight that they bring. So thank you all very much.