Okay, so we'll go ahead and get started. Next up is Jeff, who is our software scientist coordinating a lot of the infrastructure.

Okay, let me make sure my microphone is on. Yeah, that's good. Hi, my name is Jeff Wagner. I am the Open Force Field software scientist, along with part-time Daniel Smith. This is going to be a little bit different from the talks we've heard earlier today, because this is about software infrastructure; it's less of a science topic. For people in the room and people listening on Zoom, feel free to jump in at any time with questions or clarifications. This talk is in three parts, because I want to address a couple of the common questions I anticipate people having. My goal is to get across an understanding of what it is we're building and which parts of it you need.

We're going to start with: what is this Open Force Field toolkit that we're building and getting ready to release? In determining which parts of the Open Force Field stack you need, we thought it would be helpful to separate things out by use case. If your use case is that you want to set up simulations using the force fields we're going to be making, the only component you need is the Open Force Field toolkit, and this is what we're going to be working with in this afternoon's hands-on session. If you're interested in performing bespoke parameterization for a single molecule that you have on your computer, you're going to need a component which does not yet exist, but will be called the bespoke workflow; we're currently in the process, I believe, of finding someone who will work on that. If you want to take this to the maximum scope and, on your own computing resources, do in parallel all of the force field optimization that we are doing, you're going to need every component in the project. This is possible, but this is something where you're going to
want to contact us and be a little more involved, so you understand how the components work together.

I thought it would be good to define some terminology about the Open Force Field toolkit, so that we have a common language for talking about what we're delivering. The SMIRNOFF specification is a language for defining parameterization strategies. In the big picture, what we want this to include, and what we're building towards, is a single place where we record every decision that affects the system energy. So this isn't just bond parameters; it also includes non-bonded force cutoffs and things like that. If you have this file, you can transfer it between machines and get the same system energy for the same molecules. It is a general object model. We're going to talk about things that follow the SMIRNOFF spec as OFFXML files, because that's what they are right now — Open Force Field XML — but we're not restricted to the XML format. A single instance of an OFFXML file uses the SMIRNOFF spec language to describe a particular parameterization strategy: specific SMIRKS patterns mapping to bond lengths, force constants, and so on. This can also include parameter libraries, for things like TIP3P water, where we just want to recognize the molecule and stamp parameters onto it. And it's modular.
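To make this concrete, an OFFXML file pairs SMIRKS patterns with parameter values. A rough sketch follows; the tag and attribute names only approximate the SMIRNOFF spec, and the numbers are illustrative, not real fitted parameters:

```xml
<!-- Illustrative sketch only: tags/attributes approximate the SMIRNOFF
     spec, and the numeric values are not real fitted parameters. -->
<SMIRNOFF version="1.0">
  <Bonds length_unit="angstrom" k_unit="kilocalories_per_mole/angstrom**2">
    <!-- A SMIRKS pattern maps two tagged sp3 carbons to one bond parameter -->
    <Bond smirks="[#6X4:1]-[#6X4:2]" length="1.526" k="620.0"/>
  </Bonds>
  <!-- Non-bonded settings such as cutoffs live in the same file, so the
       whole energy-determining recipe travels together between machines -->
  <vdW cutoff="9.0" cutoff_unit="angstrom"/>
</SMIRNOFF>
```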
We're mostly using it for SMIRNOFF99Frosst, our initial SMIRNOFF force field, but as we decide to introduce new things like off-site charges, we have keywords that can be used to specify where off-site charges should be put and what kind of parameters they should have.

The thing we're going to be working with this afternoon is the Open Force Field toolkit. That's a program we're going to install on your computer; it takes as input an OFFXML file and a description of a molecular system, and it outputs systems ready for simulation. To jump right into it, this is what it looks like inside your scripts when you interface with the Open Force Field toolkit. For people who want a command line interface, it will be easy to add the specific functionality that you want, but this is how it's built on the back end. At the beginning, a ForceField object is instantiated by loading an OFFXML file or some equivalent parameter-containing object. You'll load a description of your system, which contains the coordinates of all the atoms. Right now a lot of people are used to using PDB files, but these do not contain all of the information we need to apply parameters to your system: we can guess at the bond orders from a PDB file, but the fields in there, even if it has CONECT records, have been abused and don't necessarily mean what they should in all cases. So in addition to knowing all of the molecules and atoms and where they are, we need detailed information about each unique molecule in the system, and we have different ways you can specify these unique molecules. You can give us a MOL2 file, which contains the bond orders we need, or you can even specify them from SMILES. Once we have these unique molecules defined, we can go ahead and make a Topology object, and this is a toolkit-independent topology object.
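In script form, the workflow just described looks roughly like the sketch below. The module, class, and method names here are assumptions based on the talk and the pre-release API — check the toolkit documentation for the exact spelling — and the file names are placeholders, so treat this as a sketch rather than runnable code:

```python
# Sketch of the intended workflow; names are assumptions, not a guaranteed API.
from simtk.openmm.app import PDBFile
from openforcefield.topology import Molecule, Topology
from openforcefield.typing.engines.smirnoff import ForceField

# 1. Load a parameterization strategy from an OFFXML file.
force_field = ForceField('smirnoff99Frosst.offxml')

# 2. Load coordinates/connectivity, plus a full chemical description of each
#    unique molecule (bond orders, stereochemistry) from MOL2 or SMILES.
pdb = PDBFile('mixture.pdb')
ligand = Molecule.from_smiles('CCO')   # or Molecule.from_file('ligand.mol2')

# 3. Build the toolkit-independent Topology, then create a simulation-ready
#    system (OpenMM format; convertible to AMBER/GROMACS via ParmEd).
topology = Topology.from_openmm(pdb.topology, unique_molecules=[ligand])
system = force_field.create_openmm_system(topology)
```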
This is something we've written; it comes with the Open Force Field toolkit, so you don't need OpenEye. It contains a description of the molecular system as well as detailed information about each unique molecule in that system. And finally, at the end of the day, you can output the resulting system in OpenMM format. A lot of people have given us feedback saying, "I don't want OpenMM format," and this is okay, because there's the ParmEd tool that we can use, at least as a first swing, to generate corresponding AMBER and GROMACS systems.

Can we get a microphone? "How about charges? Are you generating them in here, or do they have to be in the PDB?"

Good question. There are two ways we're handling charges. The initial way is by specifying a semi-empirical method that will be used to generate charges. We can show you more detail in the afternoon session about exactly which keywords we'll accept, and we're also open to feedback: if you think there's something we don't have, we can start incorporating it. The second way is that, in the future, we'll be able to support charges as a sort of parameter library — like a protein force field, where it just pre-specifies which charges should go on which atoms of your molecule.

We're working with two different versions of the Open Force Field toolkit. There's the previous version, which is what many of you have used if you've used anything: Open Force Field toolkit version 0.1. This is what was used in the previous publication about SMIRNOFF. It relies on OpenEye, so you need to have your system made into OpenEye molecules, and it uses OpenEye SMARTS matching to assign SMIRKS-based parameters. This is nice because we have these objects where, in Python, you can open up a parameter and modify the force constant, the bond length, and so on, and you can also just
write them in your program and add new ones. These load an old version of the parameter files, which you'll have to keep in mind when we transition to SMIRNOFF version 1.0, and by default this outputs OpenMM systems, though again these can be converted to AMBER or GROMACS using ParmEd.

What we're going to be working on this afternoon in the hands-on workshop is a preview of the 1.0.0 release. One of the big new things here is that we've removed the need for OpenEye: we can run everything using RDKit in the back end. For everything we used to do using OpenEye — assigning partial charges, doing SMARTS matching — we've implemented identical functionality, or as identical as we can manage, and we'll keep updating it as we find problems, on a back end using RDKit and AmberTools. Even better, this should be invisible to you as a user. You're not going to need different imports depending on which machine you're working on; rather, we automatically check whether OpenEye is importable and the licenses are there. If they are, the script proceeds using OpenEye; otherwise it falls back to RDKit and AmberTools. In addition, previously you had to have OEMols to create your system, and now we have a toolkit-independent molecule representation. The great thing is that if we find some useful functionality in yet a third cheminformatics toolkit, we can plug that in as a modular back end.
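The fallback logic can be sketched in a few lines. This is a minimal, generic version: the real toolkit also checks OpenEye license files, which is omitted here, and the top-level module names are assumptions about the two packages.

```python
import importlib.util

def choose_backend(preferred=("openeye", "rdkit")):
    """Return the first importable cheminformatics backend.

    Preference-ordered, as described in the talk: try OpenEye first,
    then fall back to RDKit. Module names here are assumptions.
    """
    for name in preferred:
        # find_spec returns None (without importing) if the module is absent
        if importlib.util.find_spec(name) is not None:
            return name
    raise ImportError("no supported cheminformatics toolkit is installed")
```

The point of the design is that user code never imports OpenEye or RDKit directly; it asks for "a backend" and gets whichever is available on that machine.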
We're no longer relying specifically on the properties of any one cheminformatics toolkit. This quietly handles partial charge calculation: if you have OpenEye, you're going to use QUACPAC; if you don't, it's going to go to AmberTools, using Antechamber. But you won't see this; it all happens behind the scenes, and wherever possible we take the same keywords and apply them evenly in the corresponding places in the two toolkits.

Also, to keep the road open for people who want to apply different sorts of parameters to systems in the future, the way we handle parameter sections — bonds, then angles, then van der Waals non-bonded parameter assignment — is completely modular. We could plug in whole new types of parameters, and as long as we teach these new parameter handlers how to find where they should apply their parameters and how to add them to our system representation, we're good to go. This should make it a lot easier for people to independently start making new sorts of parameters and inserting them into systems for simulations.

We also have modular parameter I/O handlers. Like I said, right now we're using OFFXML, but XML isn't necessarily the only choice, and we might find it more useful to use a different file format in the future, so that can change. Excitingly, this also means we could have a parameter I/O handler where, instead of pointing at a file that you happen to have on your computer right now ("oh boy, where did I put it? It's in some local directory"), we are planning on maybe having a central place where we put our force fields as we make releases. You could plug in a URL at that spot, and it would go grab the most recent version of the force field.

Again, just to clarify: it's a 1.0.0 prototype that we're going to be using this afternoon, so the ink is still very wet.
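The modular parameter-handler design described above can be illustrated with a toy registry. The class and section names here are invented for illustration and are not the toolkit's real API:

```python
# Toy sketch of a pluggable parameter-handler registry (names illustrative).
HANDLERS = {}

def register_handler(section_name):
    """Class decorator: associate a handler with one parameter-file section."""
    def decorator(cls):
        HANDLERS[section_name] = cls
        return cls
    return decorator

@register_handler("Bonds")
class BondHandler:
    def assign(self, topology, system):
        """Find matching atoms via SMIRKS and add bond forces to the system."""
        raise NotImplementedError  # stub for illustration

# A new potential form only needs a new handler class; the core machinery
# just iterates over whatever sections appear in the parameter file.
@register_handler("AnharmonicBonds")
class AnharmonicBondHandler(BondHandler):
    pass
```

With this shape, supporting a new kind of parameter means writing one new class and registering it, without touching the dispatch code.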
Not all the functions will be there, but we have a notebook that shows some of the functionality you can use.

"I have a question about the parameter handlers. Does this mean that if we are interested in experimenting with, say, some anharmonic term or some coupling term that SMIRNOFF doesn't currently have, we'll be able to define it using this parameter handler, and the Open Force Field toolkit will be able to assign it using SMIRKS pattern matching and create the corresponding OpenMM custom force that implements it?"

More or less, the answer is yes — I think John can say more definitively, and he's nodding. "Yeah, the hope is that, without having to modify the toolkit, you can create your own ParameterHandler subclass that knows how to interpret the keywords and create an OpenMM force. There's difficulty in having it translated into other formats right now if ParmEd doesn't support it, but we can work on ways to turn it into splines or something like that for export. That's exactly what we want to do: make it very easy for you to try new potential forms or new charge models and just plug them in, without ever modifying the toolkit, until the point where you're ready to share what you contributed to the effort."

The developments that are in progress — things we'll be looking at in the API but won't necessarily be able to use this afternoon — include, as I was saying, additional flexibility in partial charge calculation. For example, maybe we want to use multiple conformers and get some sort of average of their partial charges when we're doing semi-empirical charge assignment. Those are keywords we're going to be adding in the future, and if you have feedback on which keywords would be particularly useful, we can compare that to the plans we have for the API. Both the SMIRNOFF spec and the aromaticity model may change in the future. We've found certain limitations in our aromaticity model, but they are
consistent across all the molecules we look at, so parameters will be applied in the way they were fit. But if we end up fixing this, we're going to need to check for compatibility, so we'll be able to load an OFFXML file and see what aromaticity model it expects when applying its parameters to a system. Like I said, application of library charges is slated for the future. So is Wiberg bond order calculation, which is going to be important for different parameterization strategies as we apply off-center charges, custom bond charge corrections, and also valence parameter interpolation: if we have a bond order of 1.5, is it reasonable to take the bond-order-one and bond-order-two parameters and just average them, or use some sort of function to get an interpolated parameter?

"What if somebody's really fussy about their charges and wants to import charges for their molecule? I see you can calculate them internally, or you can have them in a library — is there going to be a means to import charges, or a molecule that has charges?"

This is a great question, and something I wrote a bit about in the notebook we'll be using for the afternoon session. The answer is yes, you can bring in your own charges, but you'll want to tread carefully, because the parameters fit to these molecules were fit using the semi-empirical method listed in the OFFXML file. So it's possible that if you bring in your own charges, they will be out of spec with how the bonds or the torsions were parameterized. In the long run, we're also looking at including support for biopolymers,
so you'll be able to put in your protein as well as your small molecules.

Another recent development is a very large overhaul of the documentation, led by John and David. This is the version 0.1 documentation: we have a few subcategories for looking up information about what we do and how we do it. And here is the version 1.0.0 draft documentation. This is getting near a pretty final form, and you'll be able to look at a preview of it in this afternoon's hands-on session. It has much more detailed information, and you'll be able to dig down: we have human-written commentary and discussion about how we do things, but also API documentation for every callable function in the entire thing, so you can really do a lot of your own hacking.

A quick note about versioning: these numbers aren't made up; they have a specific meaning. The initial SMIRNOFF publication was done using version 0.1.0 of the Open Force Field toolkit and version 0.1 of the spec. Today we're making toolkit version 1.0.0 and version 1.0 of the SMIRNOFF spec. Again, the spec is the language we use to describe a parameterization strategy, and if we find that we need completely new keywords that we hadn't anticipated, we'll increase the version number of the spec. The meaning of these numbers is that the first digit indicates major, API-changing updates.
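Mechanically, that first digit is what a compatibility check would key on. A minimal sketch of the x.y.z convention just described:

```python
def parse_version(version):
    """Split an 'x.y.z' string into (major, minor, patch) integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def scripts_still_work(old, new):
    """Under semantic versioning, scripts written against `old` keep working
    as long as the major (API-changing) digit is unchanged and the new
    release is not older than the one they were written for."""
    return (parse_version(new)[0] == parse_version(old)[0]
            and parse_version(new) >= parse_version(old))
```

So moving from 1.0.0 to 1.1.0 adds features without breaking existing calls, while 2.0.0 signals that old scripts may need updating.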
So if we change from SMIRNOFF version 1.x to SMIRNOFF version 2.x, that means your previous scripts may not necessarily work when calling the API: we could have changed names, or the entire science could have changed, to the point that your old stuff needs to be updated. If the second digit changes, that means we've added new features: your old scripts will still work, but there are some new calls that can be made from the toolkit. And z will increment with bug fixes, so as you submit issues you've run into, that z number will probably increment fastest of all.

I'd like to stop there and ask if anyone has questions about the toolkit as a whole. I might just direct you to the hands-on session this afternoon, but I can take questions about it right now.

"Just a question on reading in MOL2 files, because I've always found that atom typing in MOL2 files is not well defined. Are you paying any attention to that, or are you just using it to get the element type that you then use when you do the matching?"

At this point, I believe we're only using the element type. Oh, is that not correct? John, why don't you go?

Maybe I can elaborate a bit more, and this is something we'd love your feedback on: how do you want to get your molecules into the toolkit in the first place? Do you want to use MOL2s or SDFs? What happens right now,
I believe, is that we load the molecules into — or create them from SMILES in — one of the toolkits you have installed, either RDKit or OpenEye or anything else we add in the future, and then we know how to translate that back into our molecule spec, where we have a universal representation that can convert between toolkits. We do need to make sure that, because we use the aromaticity perception and the SMARTS typing of the individual toolkit under the hood, that interpretation is done correctly. And there are huge problems with RDKit reading MOL2 files, because it only supports (is it Corina?) and not Sybyl typing, and there's no capability for writing MOL2 files.

I had a great, brief conversation about this at dinner last night, and we kind of concluded that one really good path forward — because, for reasons he just alluded to, he doesn't like MOL2 either — is that we get support into RDKit for SDFs with a well-specified charge tag. In other words, work with RDKit to define a standard partial-charge-carrying tag; then we can use SDF for basically everything we want, and that would effectively become a new SDF standard that has everything we want.

"All right, so these would be SD tags then?" Yeah, but then that would become a standard partial-charge SD tag for everybody, and then we can use that as a universal container.

We're running out of microphones. "You guys have made an internal representation of a molecule, with your own specifications, and what you need is a format specification just for reading and writing." "Well, Pat, they're going to take an SDF and create a standard set of SD tags for charges." "Well, I don't think so." Cage match, cage match. Okay, but if we just work with RDKit to get this in, it would be a simple thing that almost everybody can use right away, and then it'll probably spread from there. So it's an easy thing to do without creating a new
format. I think the key concept for us is that we have an object model which says: here's what a molecule is; here's the minimal information we need about charges, atom identities, stereochemistry, and bond orders; and that's enough information to go back and forth between all the other toolkits. We do have a way of serializing that to disk, if you want to use it directly, in any of ten formats you might enjoy using. But that object model is the only information we need, and however you can populate it, we can make it convenient to do so through multiple routes.

Right. I think the biggest problem I have is that I want to avoid file formats that make assumptions; the MOL2 format makes assumptions, and none of those assumptions are documented anywhere that I know of.

So, the question was which file formats we currently support. The answer is: from OpenEye, we support MOL2 and SMILES; from RDKit, we support SDF and SMILES. We're being very conservative right now; we don't want to leave the gates wide open for anything these toolkits can read, because they have different behaviors when they read. Basically, right now we're only taking well-trodden routes, and as interest comes up from you all, we can start working to make sure that we know the limitations of these toolkits reading different file formats, so we can bring molecules into the neutral format that we use for parameterization.

"What sort of idiot-proofing is there going to be? Because a while back, admittedly, I went through a couple of weeks where I had some osmium parameters applied to oxygens, just because I'm an idiot, right? And I'm sure there'll be some people who also do things like this."

Right. The philosophy I've been guiding myself by here is that errors should be fatal. RDKit, for example, is very helpful: it can load big databases,
and maybe in these big databases some molecules won't interpret correctly — and it does that silently, which is a very scary thing. So wherever possible, and you'll find this when you try to import your own molecules, we are very strict about stereochemistry. We really need very complete definitions, and wherever possible we throw an error: it will be fatal to the program if a molecule that is ambiguous in any way is imported.

All right, with that, I'll move on to part two, which covers the different software components beyond just the Open Force Field toolkit, and how we're going to go about developing them in this consortium.

The number one biggest thing we think about is deployment. It's great if we make software, but frequently the case with scientific software is that it runs great on my computer but doesn't run on yours. So I wanted to open up, for industry partners, by explaining how you'll get the software. There are popular distribution tools that leverage internet access to big managed libraries, and it's great if you can access that, but the results of our survey indicate that perhaps not all of you will have such access. So our number one plan for reliability is that we will send out large binary installers for Mac and Linux. We can send these to you on a USB key or by various file transfers, and these are things that are used across the industry with very high reliability.

The question was: is this in addition to conda-forge, or instead of conda-forge? This is in addition. We'll still have conventional cloud-based distribution, where things magically appear on your computer, but we'll also have a way to get into more secure environments. These binary installers will not require administrator access to install; they are a local build of Python.
They can live in your home directory, or whatever data directories you can access, and they will include all of the dependencies you need to run. They'll be built using this platform called conda, with Python 3.6 and 3.7. I want to drive home that, even though it has really amazing cloud capabilities, conda is a reliable tool used as standard in a lot of industries. So this is something we really do expect will run, but we'll still walk you through the installation one-on-one to ensure it gets onto your computing infrastructure. Unfortunately, the ink is still wet on the scientific capabilities of the code we're showing this afternoon, so we can't give you this large binary installer today; we'll be using the cloud-based solution. But it's one of the highest items on our to-do list, and we have an expert, Daniel Smith, who has implemented this for one of his other projects, working with us on making this installer.

Daniel Smith runs a project called Psi4, and on its website, this is the way users will interface with getting the newest version of the toolkit. With the Psi4 installer, on this first line you select your operating system — it supports Linux, Mac OS, or the Windows Subsystem for Linux. You have your choice of getting the binary installer, the conda installer, or installation from source (I'm sure we're all very familiar with that). And then there's the selection of Python versions; for us,
this will only be 3.6 and 3.7. And then, if the small thing on the bottom will disappear, you'll have the choice of a stable release, where we guarantee that all of the toolkits are intercompatible, or, if there's some new functionality that you desperately need and you don't have time to wait for a stable bug-fix release, you can get our nightly builds.

Yes — and to Chris Bayly's point, conventional conda installation from conda-forge will be supported. If you do not have access restrictions, you'll just be able to run the conda install command, and an environment will come down onto your computer.

We will be moving channels. Currently the project lives on a conda channel called omnia. These channels can mean a lot of things, but in this case, within a single channel you can assume that packages have similar dependencies — similar system-level libraries. Every once in a while you install things with incompatible dependencies, and it's a big problem; avoiding that is one purpose of conda channels. As computing standards change, or as operating systems put in new features and you have to specify things differently when you install, these packages have to be updated. This can be a problem when the developer goes away or the grad student graduates: a package falls out of maintenance and is no longer installable. Right now, omnia has a community of developers who maintain the packages that are there, but the maintenance burden is becoming substantial. In fact, a lot of the fixes that need to be made for new architectures and the like can be made by bots, or by a small team of professional developers, and that's what conda-forge offers.
It has an active community of professional maintainers who will help watch your build system, and as things change, or as standards change, they will suggest changes to the Open Force Field Consortium: "hey, you'll need to update this to maintain compatibility." One way to think about this, as Daniel Smith explained it, is that every developer in Open Force Field could get hit by a bus, and you'd still be able to continue installing the software for an estimated five years or so before it fell out of compatibility. ("What about the bus?")

The way you'll see how to use the project components in Open Force Field is through a service called Read the Docs, and here I have an example from the Open Force Field toolkit. On the left is the source code — the actual Python code — and when I define a function, I put these keywords in the function's docstring. Read the Docs parses my code, finds them, and turns them into formatted web documentation. We're going to be implementing Read the Docs for all the projects in this consortium, so that if you just want to use the toolkit, you can come look at the toolkit documentation, but if in the future you want to get more hands-on, you'll be able to see the ways you can interface with all of these packages.

Another problem is that we're going to have so much separate development going on that we won't know if one package breaks, or in what way it broke. If packages rely on each other and one changes its behavior, how do we manage that? Our answer is that we're going to be running two types of tests.
What I'm showing on the right here is, as of a couple of days ago, the Travis CI status for all of the projects that are going on. As developers in Open Force Field commit changes to GitHub, Travis sees those changes and tries to compile all the code and run all the tests that are in there. We get a green check mark if all the tests pass, and these different degrees of redness if the tests fail. Because this Travis system builds everything from the ground up, including all the dependencies, this is one way we can check that the software remains intercompatible.

Each package will run two types of tests. First, there are regression tests: I'm a developer making changes to my software, and I'm making sure that my changes work with the previous stable releases of all the dependencies — all the other project components in Open Force Field. But to make sure that all the development versions work together, so that at some point we can release a whole new set of stable versions for the entire consortium, we also have integration tests, and these run against the bleeding edge of each one of the components. Only when these integration tests pass will we say, "okay, now we can cut a stable release of the new bespoke fitting workflow," without sending industry something that just breaks the first time you try to run it.

Open Force Field is open, and this means we get a lot of benefits from the community. We'll have people outside of this room and this web call submitting bug reports, and people submitting fixes to those bug reports, and we really gain a lot. But on the other hand, we do have to watch out for the emergence of negative behavior. For example, this is a quote from the creator of Linux, and the Linux community has recently gone through a lot of trouble because it had a bad culture: people didn't want to contribute to
Linux because the lead developer, Linus Torvalds, would say very mean things to people, and developers wouldn't want to volunteer their time. There are different ways this sort of culture can emerge, but it's something we have to guard against, so one thing we're going to put in place is a formal code of conduct: what we expect when people contribute. I think for the people formally involved in the consortium in this room, this won't be a problem at all, but as the public begins interfacing — and I hope we're going to be getting a lot of attention — this will lay down standards for how people should interact with each other, hopefully getting us the good parts of being open source.

To make sure that we have these best practices — that we have the integration tests running for every component, that we have documentation building for every component — we don't want to go to every component maker separately, say "hey, we're going to sit down for a week and give you all of this," and implement it differently on each one of your projects. One great thing is that we're working with an organization called MolSSI, who want to improve the quality of scientific software in all sorts of different areas. One thing they provide is called the MolSSI project cookiecutter, and this is a way to download a very generic repository that can contain all sorts of different code, but that already has the files needed to spin up Read the Docs and Travis. It just needs to be registered.
There are a couple of steps to register your project with these different external services, but for the most part the work is done, and these are systems that the software scientists at MolSSI know how to interface with. So if a developer leaves, or has a problem they don't know how to solve, these are systems the software scientists will be experts on, and they can help keep these packages up to spec.

This is kind of a closed invitation, but on Wednesday — so, tomorrow — the Open Force Field developers are having a long workshop, talking about the state of each one of the components in the project and how we can get it up to the standards we want, making sure that all of these components fit together and make a useful working product. Does anyone have any questions about the software standards we'll be applying across the different components in the consortium?

If not, I'll go ahead and continue on to the last portion, which is: how will these components interact? I interpret this to have two meanings, and I'm trying to get my understanding of it across as best I can. On one hand, these components will talk to each other: they'll have calls in Python, or calls on the command line, that can be made, and we need to make sure that each component knows about the other components' call interfaces well enough to use them in a reliable and scientifically correct way. But beyond that, a lot of the components of this project will be operating over distributed computing — clusters here in San Diego, and in Colorado, and in New York. What exactly does the interface to that look like? How are we going to develop software that can be distributed
And called using these sorts of job management or distributed computing management systems? This was one of the most overwhelming problems that I ran into while trying to figure out who needs to be talking to whom to put these components together. We had shown the conceptual scientific picture: in what order will we run things, and which big-picture topics will communicate with which other big-picture topics. But as a software scientist, I want to look at this as: which programs are going to talk to which other programs? We need the developers to all sit down for some time and agree on what sort of return values their components will have. If your component fails, will I get a matrix of all zeros, will I get False, or will it throw an error? This matters for having a workflow that knows how to run for more than a few cycles.

Our solution is to put a lot of reliance on stable, defined APIs. Initially, we're going to have the developers sit down and talk scientifically about what the important concepts are to get across between these components. Once we have that down and documented, it will allow future developers, or developers outside of the immediate room here, to go ahead and use our tools for whatever purpose they want, because we'll have the behavior very well documented. That said, even if the functions out front do the same things, performance improvements and changes in how the science is done, in terms of becoming more accurate, can happen on the back end. We just have to define what public interface we're going to commit to, maybe only expand it, and try not to change the existing behavior too much, and development work can go on involving the community.

Read the Docs is nice in that it auto-generates documentation from the docstrings in your code, saying that this function returns an integer or a list of integers.
But it also gives developers room for a little bit of scientific discussion. So we can talk about, say: if you pass me a weird ionization state of lithium, the function might do this, or topics like that, so that other developers who are plugging in can get the full meaning of what the functions do. GitHub is a great way to find bugs, track them, and invite discussion about them. Especially with this kind of project, bugs can get tricky, because sometimes they're technical, but sometimes they're scientific and require making a decision. GitHub is a really good way to record those design decisions and the reasons that we came to them. And as I said before, these APIs, in conjunction with regression and integration testing, will give us a fairly streamlined way to make sure that all of these components come together into one working product at the end of the day.

So, on to distributed computing for this project. On one hand, we have the quantum chemistry distributed computing project, and this is a somewhat mature product that's being developed by Daniel Smith. The way the QCFractal distributed computing program takes input is through formatted files like the one shown here. In this case it's an energy calculation for water, and it has these very rigorously defined fields. It says: okay, molecule, I need the coordinates of these three atoms; these atoms are oxygen, hydrogen, hydrogen; I'm interested in getting an energy out; and here are a couple of different keywords. This is eaten by the QCFractal machine, which figures out which computer somewhere can run this job with which backend, and it farms the job out. To the user, what happens is that a few minutes later
they get this return output. Basically, the important things to see here are that the run did not fail, that the return result was the number of interest (an energy, in some units), and that the versions of the programs that were used are recorded. So this, again, is something that really exists; you could go do this right now with QCFractal.

This is the starting point for how we want to do distributed computing for things like our property calculator. For that, we envision input jobs looking something like this. The function that we want to run is called property density. We want to run it on a substance, so maybe we have some record from ThermoML describing a mixture of substances, or a pure substance and something about it. In this case our substance is ethanol; the mole fraction for ethanol in our system is one, and if we had multiple components, their fractions would add up to one. The impurity flag is false. Then we can say what keywords we want that will affect the computation of this property. So we might say: all right, here's a force field, and we could just put the whole force field file here; we could put the whole object representation of the force field right in this input file. And since we didn't define coordinates or a number of particles, we could say: go use Packmol, with some pre-known settings for how to put together these sorts of systems, and we can have a workflow defined on the worker that receives this job so it knows how to do that in a reproducible way. We can have information about the thermodynamic state.
These are going to be inputs to the molecular dynamics program. And then also, and this is sort of getting into the weeds, but for the property calculator, the longer we simulate, the lower the uncertainty will be, as we gather more data about the properties we're observing. So we can say: well, here's the experimental data that we have, with some amount of uncertainty; go ahead and use some predefined rules to run this molecular dynamics simulation until the uncertainty of the simulation is on par with the uncertainty of the measurement, so we know the two are comparable.

At the end of the day, we want to get back a return result that has the thing we're looking at, the density in some units, and the uncertainty on that density. And because molecular dynamics can be a very complicated, multi-component thing, we also want information about the version numbers of the toolkits and everything else that was used to construct that system and perform that calculation. I don't think it's unreasonable to have these very large provenance sections. That way, if in the future we say, "Oh gosh, it seems like RDKit was getting aromaticity wrong in some cases," or it was differing from OpenEye in the way that we did it, we can go and track down that data and remove it, without having to throw out all of our previous work.

Is there a question?

Are you going to include property and unit data in the output?

Yes. I'll say, number one, this is an approximate representation. We have a draft representation, but it's very immature, so we don't want to share it yet. But the example that I drew from is many pages long, including values, the quantities of those values, and completely described unit systems. So, good question. Was there another question, or did that address it?
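To make the shapes just described concrete: the real schema was only a draft at the time of this talk, so every field name below is a hypothetical illustration of the input job and return result, not the actual format.

```python
# Hypothetical property-estimation task; all field names are invented
# for illustration and do not match any real Open Force Field schema.
density_task = {
    "function": "property.density",
    "substance": {
        "components": [{"name": "ethanol", "mole_fraction": 1.0}],
        "impurity": False,
    },
    "keywords": {
        # Could also embed the entire force field object here.
        "force_field": "smirnoff99Frosst.offxml",
        # No coordinates given: the worker builds the box reproducibly,
        # e.g. with Packmol under pre-agreed settings.
        "packing": "packmol-defaults",
        "thermodynamic_state": {"temperature_K": 298.15, "pressure_atm": 1.0},
        # Simulate until the statistical uncertainty is on par with experiment.
        "target_uncertainty": {"value": 0.5, "units": "kg/m^3"},
    },
}

# Mole fractions of all components must add up to one.
assert sum(c["mole_fraction"] for c in density_task["substance"]["components"]) == 1.0

# Hypothetical return result: the value of interest, its uncertainty,
# and a deliberately oversized provenance section.
density_result = {
    "success": True,
    "value": {"density": 789.3, "uncertainty": 0.4, "units": "kg/m^3"},
    "provenance": {"toolkit": "0.1.0", "rdkit": "2018.09.1", "packmol": "18.169"},
}
```

Recording version numbers for every component in the provenance block is what later allows bad data, say from a buggy aromaticity model, to be found and removed without discarding the rest of the database.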
Okay. And again, this is an immature topic; I believe this is Simon's arena, and in the coming months they'll have more to say about it. But this is just an example to show how distributed computing can work.

On one hand, the property calculator is something that will obviously be distributed; it's one of the main goals of the consortium. But I wanted to walk through the steps that enable any component in this project to operate in a distributed manner through QCFractal. If we can define in Python a function that creates some output, with a name like my_function and inputs input_a and input_b; if input_a, input_b, and the output can go through this process called serialization; and if my_function can be installed using conda, then we can run this through QCFractal, and it can be a very arbitrary function.

The things we have to do: we have to standardize the description. That means the function has to do something reproducible, and it has to be unambiguously named and formatted together with its input; we have to sit here and think of all the keywords that people might want to use when we describe a task. But once we have that, we can put it into these schemas, where it becomes an unambiguous way to send task descriptions between machines asking for computation to happen. In the case of this property calculator run, there needs to be infrastructure on the worker nodes that, instead of returning the whole trajectory (we don't have the bandwidth to do that for all the computation we're going to run), parses through the output of whatever did the simulation and returns only the needed quantities, the single value output that we're interested in. And, like I said, we want to think ahead of time about the customization options, so that we can have keywords prepared in these task descriptions for the different ways we might want to do the calculations. We also need to make sure
that all of the objects can be serialized. input_a and input_b could be things as simple as strings, or as complicated as entire defined topologies, depending on what we're sending between machines and what work needs to be done. Serialization is something that a lot of Python packages support, but you have to design your objects with serialization in mind ahead of time. It's a way to take a living object, say in a Python session, and turn it into an unambiguous representation (it could be a string, it could be a file) that can be sent somewhere else. For people who are used to pickle: pickle is something like serialization, but it's not always safe. Not all objects know how to pickle safely, because some of them hold links to other objects, and it gets complicated. Serialization, by contrast, takes just one object and unambiguously puts it into a form that can be reinflated somewhere else. If we use serialization, the interfaces between our components will be a lot more reliable: we won't necessarily depend on toolkit-specific descriptions of molecules, or anything like that, to go between components; we'll have a very well-defined way to pass information between them.

Further, for these worker nodes to work, we need all of the components to install, no questions asked, on any cluster involved in the consortium, or any cluster where we have compute time. And this is one really great thing that comes with conda deployment.
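The round trip just described can be sketched in plain Python. Here my_function, the registry, and the task layout are illustrative stand-ins, not the real QCFractal machinery; the point is only that the task travels as an unambiguous string.

```python
import json

def my_function(input_a, input_b):
    """Stand-in for any component function we might farm out to a worker."""
    return {"sum": input_a + input_b}

# Serialize the task into an unambiguous, portable description.  JSON is
# one safe choice: unlike pickle, it can't smuggle in hidden references
# to other live objects, so what you see is exactly what gets sent.
task = json.dumps({"function": "my_function", "args": [2, 3]})

# ...the string travels to a worker node, which reinflates it and runs it.
received = json.loads(task)
registry = {"my_function": my_function}   # code installed on the worker via conda
output = registry[received["function"]](*received["args"])

# The result is serialized the same way for the trip back.
print(json.dumps(output))                 # prints {"sum": 5}
```

A real deployment would send entire topologies or substance records this way, but the contract is identical: only objects designed to serialize cleanly cross the wire.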
Lots of machines can access the conda cloud, and it's fairly easy to update. So for each maintainer at the different centers in this project, when the time comes to do an update, we all just update, and suddenly all of the workers in this Open Force Field cloud know how to do the new tasks that we have, or have the bug fixes that we need. And finally, I don't think there's any amount of provenance tracking that goes too far: keeping track of the version numbers of everything that was used will let us backtrack and remove bad data from our database, without having to throw everything away if we find a problem.

I used this term "Open Force Field cloud". One way to look at it is that if Open Force Field had a billion dollars, or the entire Amazon cloud, we could spin up all the machines we want and they would look completely identical. But we're not in that situation. We're in a situation where we have compute time from different centers, on very different machine architectures, and we need a way to get the same results from each of those machines while distributing the compute. This is something that's handled by QCFractal, and we can distribute new versions of software and new task descriptions to all of these computers. Wherever QCFractal has Open Force Field managers and workers running, those places can receive tasks in the Open Force Field group, and these tasks could be property calculations or quantum mechanical calculations. Each place that has these managers and workers running will need one person designated as a maintainer; that way we can say: it's Saturday, or it's Monday, we've pushed a bug fix, could you please go run conda update on your workers? And if you're interested in contributing your excess computing time on your clusters as the project grows in scope:
We're going to be using basically everything we can get, so go ahead and contact me or Daniel Smith if you're interested in contributing.

The property calculator specifically is a special case, because of the needs of reweighting. Reweighting is where we take a trajectory that was previously computed, we've only made a small change in the parameters, and we ask: can we get the new density out without redoing the entire simulation? In some cases, if the trajectory is available, we can do that, but we need to make sure that the job is sent to the supercomputer that already ran the initial trajectory, so that it can be reweighted there. That's going to be an advanced functionality, and for that reason the property calculator will initially be deployed on one cluster, where all the previous trajectories will be stored locally; in the future we'll figure out a way to distribute jobs knowing where the previous trajectories live.

So, with that sort of scattered talk, I wanted to end with a description of who I am and what I will be doing here. My name is Jeff Wagner, and I am an Open Force Field software scientist. My job description is still being fleshed out, but here are the things you can expect me to do, and the things you should contact me about when you have a problem. My primary responsibilities are bringing project components up to our software standards, like I just described, and maintaining the core software toolkits. I'm the SMIRNOFF guy, so if you have problems with SMIRNOFF,
They are primarily my problems. I'll be guiding the development of APIs to ensure functionality; these are going to be the outward-facing interfaces of each of the components. We need the components to operate in an interdependent way, but they are also independent scientific components, so we need to design APIs that accomplish both of those goals. I'll be managing the nightly builds, making sure that component interoperability is continuously tested, so that we know when it's safe to release a new build of the whole toolkit. I'll be helping include new functionality: if groups join later, or if entirely new projects need to be spun up, I can help decide how they will be added to the big picture. And, I think for a lot of people in this room this is the important one: when you have problems with this running on your computer, I'm the person to talk to. Right now we're still drying the ink on the code that's been written, but if problems come up during the mature phase of this project, you should come talk to me; it might even be something that's affecting other people, so don't be shy.

For technical problems,
I think it's most helpful if you contact me on GitHub, because that way we can keep a running log of what we've tried in order to fix a problem, and of which fix ended up working, so that other people can find it in the future. This is sort of a tricky role for me, because I am a PhD scientist, but I don't know that much about quantum chemistry, and I don't know all the details of every different area here. So if your problem is more of a scientific nature, please feel free to contact me on the Open Force Field Slack, in whichever channel you think is appropriate; that way we can bring in the scientists who are developing the tools, to make sure we address your question efficiently. This is something I said in October, and I actually mean it even more now: with each pharma partner that's working on this project, I want to make sure that we have a one-on-one to talk about whether we're able to deploy on your systems, or what special arrangements we need so that we can push you updates and get you the latest and greatest from our consortium.

With that said, I think this was maybe a relatively short talk, so I'd love your feedback, and we can continue discussions on the infrastructure channel on Slack; that's where you'll find me. Are there any questions?
Yes, Chris?

Backing up a couple of slides: you were talking about how, when you come up with updates, you'll push them out to the QCFractal servers, and then they will modify their environments. I'm just wondering whether there could be some danger of some of the environments becoming heterogeneous at that point. Something that we've done internally at OpenEye is to use conda environments: instead of pushing an individual update, we make sure our servers are running with the same conda environment, so we update a whole environment, and that way we know that everything is staying in sync. Are you thinking of pushing individual updates?

No. We're planning on updating entire conda environments; that's plan A. If Daniel Smith wants to jump in with any details on that, he can, but basically what we're going to do is use conda to push synchronized updates of all components, or all stable releases of components, as needed.

Yes, that's correct. We can talk a lot more about how we're pinning different environment versions, and about making sure that the entire stack is up to date whenever we're doing this kind of cloud-based computing. But it would actually be really great to talk to your engineers, to make sure that we're not reinventing any ideas there as well.

Yes. And especially, this is one of those things that we anticipate we will have well covered
by managing these conda environments and synchronizing updates of all components. But this is also something that will be caught by our recording a ridiculous amount of provenance information: if we find that one of our toolkits was running with an old version that didn't have the capabilities we needed at a certain time, we can remove that data from the fitting set.

If that's all, I'll put in one plug for the hands-on session this afternoon. When these slides get distributed, the instructions for the hands-on session will be on the penultimate slide of this presentation. What we'll be doing, if you have a Linux or Mac laptop, is getting you this prototype version of the 1.0.0 toolkit and running you through a couple of examples of what it can do. And because we're all on the same Wi-Fi here, for the folks in this room: if you don't have conda, you may want to start running these download steps, typing the commands into the command line, maybe during lunch, so that we don't overwhelm the Wi-Fi and we can all get the components for the hands-on session. Let's get you a mic.

I have OpenMM already installed.
Do I still need to run the whole omnia install?

You will, yes. If you have OpenMM installed, I would still encourage you to get the most recent version, because that's what we've tested against. So yes, I would still run this command and create a whole new environment using the current versions of all the packages.

Yeah, go ahead.

If I do not have OpenMM, and if I want to share some of the parameters, just to ask other colleagues to test them, is there a way for them to do that other than using OpenMM? Because we don't commonly use it, and we'd want an external interface to other software packages.

So, our external interface is largely the fact that we can output parameterized systems. Using our toolkit you can create the system, and at the end you can have an OpenMM system file, or AMBER system files, or GROMACS system files; during the hands-on session I'll show you the functionality that we'll have to create those files. Does that answer your question?

Yeah, in some ways, because most practitioners do not use AMBER or CHARMM directly. So that may be another thing; we're talking to a few colleagues, and maybe you could generate a format for MacroModel or other types of software, so that it's easy to test.

This is something we could talk about at the one-on-ones. I think John wants to say something. And this is super important: getting information from you on how you're going to integrate the toolkit into your workflows, and whether you need parameters in a certain format, or whether it would be convenient to have something that would stand in for some other format; then we can look into figuring out exactly how to do that. We already...
Um, we already Go ahead David I was just also going to point out there's there's an important distinction between a parametrized system and the parameters themselves Because we're using this so it's going to be much easier to move parametrized systems around Then to move the parameters themselves around Because we are using this Direct chemical perception which is different from most of what the other modeling packages do So you can't move our force field As a whole force field into many of these other packages. You have to move just the parametrized systems Sometimes the parameters are not the input file formats are not super well documented So it's useful for us to put us in touch with your tool makers as well Yeah, so so I I do realize that because the chemical perception But if you look at uh clock students micro model format fire format There's a subsection that is very similar to the smirk format that you could potentially Go through just like passing through all the general parameter and define the specific one for the chemical structure that can be used So there's a section that have a special definition that can Okay, if there are no other questions, I think maybe I'll just wrap up with a little bit of time remaining Thank you very much everyone