 Hi everyone. I'm Josh. I'm our science communicator and I'm going to take us through this workshop. We're going to talk about bespoke fit and how it works. I'm going to start with all the theory behind it. Then we're going to look at how we actually set up an optimization workflow and then finally we'll run an optimization workflow and see it in a protein ligand simulation. Bespoke fit is our workflow for optimizing torsions. There's a lot of compute that we're going to run in this because we're actually going to run an optimization. I want to set that off now so that when we get to actually analyzing it, hopefully it's done. Bespoke fit's broken up into an executor and then jobs that you send to that executor. The executor is in charge of managing all the computational resources and passing them out to different jobs. The idea is that you can give the executor all the jobs that you want to run and it will optimize them all and run them in a way that uses all of the resources you assign it. The first thing to do is that we need to run the executor. We can't run that in the notebook because it will block everything else that we want to run in the notebook. This default setup that I've given here is generic like it'll run on anything. Up here I go to the file menu, new, terminal. I've got quite a lot more compute available in this so I'm going to add a bunch of cores to everything. The fragment of workers and the optimizer workers is pretty safe to give them a lot. Of course QC compute workers is a bit more complicated because it's split up into how many times do I want to run my QC compute engine and how many cores do I want to give each execution? The number of workers is how many times you run the engine and compute end cores is how many cores each worker gets. Last time we found that we've got the best efficiency with having lots of compute workers with few cores. We're running that executor in the background. We can see it comes up with green and it's all happy so we can move on with the notebook. This first cell is then submitting a job to that executor. There's a file distributed with a notebook here. We're just going to tell it to use the default workflow and we're going to override the QC engine that we're using to use XDB, which is a semi-empirical method, which is a lot faster than the full fat Cy4 method that is used by default. We can just run that in the notebook. This exclamation mark means this is a shell command rather than a Python command. This sets up the bespoke workflow, loads the molecule into it and then submits it to the executor. This is now running. My computer is heating up and we're going to talk about how the bespoke fit works. In open force field, we have a philosophy where we want to minimize the number of parameters we use and we've built our whole force field infrastructure around being able to specify a minimal number of parameters and use them in the maximum number of places rather than having a lot of degenerate parameters. This is great. We get away with only about 350 parameters, but the downside obviously is if you're competing with force fields that have thousands of torsions, you lose some specificity. Open force field sage is this orange line for this torsion here in this fragment of a drug molecule. You can see it doesn't really match the blue QM line very well, but after running bespoke fit on it, we get new torsions that very closely match QM. What bespoke fit does is it automatically runs all the QM calculations we need and modifies the orange line to fit the green line to the blue line. Bespoke fit is really clever about how it goes about this. We aggressively cache prune and reuse parameters so that we're not wasting any time calculating stuff we've already calculated. The way this works is all around fragmenting. So I want to demonstrate without using bespoke fit how that fragmentation works, what torsions are recalculated. What we can do here is we can load a YAML file full of ligands, go through those ligands and pull the molecules out and then display them. We're downloading this ligand file from the internet so you know I'm not cheating. I don't know what cheating would look like in this context but I'm not doing it. So here's a bunch of ligands we can see they're all very similar so these are ligands we want to optimize and we want to minimize reusing parameters without compromising those parameters performance. The first thing we need to do is we need to identify which torsions we need to optimize. By default bespoke fit operates on rotatable non-terminal torsions. What does that mean? Well it means exactly this smirk string. This smirk string if you speak smirks just says any atom that isn't a hydrogen bonded to any atom that isn't involved in a triple bond and doesn't have degree one which means it has like more than one bond and then it must be singly or doubly bonded but not in a ring to an atom that has the same characteristics as its first atom and then that's bonded to a non-hydrogen atom. So basically we're looking at two heavy atoms singly or doubly bonded to each other each bonded to non-hydrogen atoms that either end whether they're not part of a ring. Once it's identified those torsions, the smirk fit will exclude torsions that are identical under a symmetry operation which just means that we only calculate that once. This data set doesn't have any examples of that. We can use this smirks which we're getting out of a spoke worth flow factory which we'll talk about a bit later to get out all the times that smirks matches a molecule and we can just display those bonds in a grid and so we can see basically this is what you'd expect the torsions that we need to optimize are. We're not bothering with rings because they're so constrained there's not much to fix. We're not bothering with terminal atoms because hydrogen is spinning we're not bothering with but other than that we're getting all the torsions and so you can see that that gives us the same torsions in each frame. The next thing we do is we fragment these molecules so the idea here is we want to be able to reuse these torsions that are similar in each drug but we don't want to waste time calculating parts of the molecule that don't affect that torsion. Our way through this is fragmentation so we can do this ourselves. We use a Weiberg bond order fragmenter so Weiberg bond order is a non-integer way of measuring bond order so instead of saying this is a single bond or a double bond we say it's like it's a 1.2 bull bond and the idea is that the Weiberg bond order of a bond is a proxy for its electronic environment so if its electronic environment changes then its Weiberg bond order will change and so we can fragment a molecule by picking a bond measuring its Weiberg bond order in the whole molecule and then growing that bond out from the center until we're back within some threshold of the original Weiberg bond order and once we're within that threshold we sort of say oh okay so now we're in the same electronic environment so the torsions should be transferable. I've got an open eye license and a fast computer so I'm pretty comfortable running this fragmentation on all of the molecules so I've just deleted this indexing thing here if you're on binder I would consider skipping this cell because the cell is quite slow if you don't have open eye so this cell performs the fragmentations themselves and then we can view them just for the first molecule and so you can see this first molecule which is this one has been split up into these two fragments because it's got these two rotatable double bonds. The torsion drive that will run on this fragment will only involve these atoms and the torsion drive will run on this torsion will only involve these atoms this chloride for instance doesn't affect the electronic environment of this bond to within the default threshold so we don't bother including one. So we have two selected bonds up here so we have two fragments if we look at all of the fragments for all of these molecules we get a lot of fragments and you can see a lot of these are doubled up so this fragment is exactly the same as this fragment is exactly the same as this fragment and so what the spoke fit does is it goes through one at a time and it will say all right do I already have this fragment in the case if it already has the fragment it will just take it out of the case and if it doesn't already have the fragment it will compute it and the end result of that is that this fragment will only be calculated once and those parameters will be reused for every other molecule so computer means running a torsion drive to get reference data to optimize against and optimizing the parameters the way we specify our parameter means that the parameters we calculate for this fragment in this molecule will work for all of the other molecules without modification this cell just creates molecules from all of our fragments removes any copies by putting them in a set and then puts them in a grid and we can see we've gone from so many that I had to zoom out to just a couple and so we can run all of these fragments much much more efficient than running all of the molecules because you can see there are only about as many of these fragments as the original set of molecules and on top of that these fragments are much smaller and since quantum chemical calculations generally run in polynomial times with degree greater than one this is a big win so that's the strategy the spoke fit will identify rotatable non-terminal torsions prune out any that are related by symmetry then fragment the molecule around those torsions so that we get the minimal set of atoms that reproduces the chemical environment run a torsion drive on those torsions fit parameters to that torsion drive and case everything so the next time that we see that fragment we don't have to run it again so now we sort of understand what spoke fits doing I want to show how the spoke fit can be configured spoke fit is designed to let you run your workflow however you want and to be extensible so at the moment it does torsions but the bones are there for it to do non-bonded parameters and bonded parameters and whatever else you want as long as you can give it some way of getting reference data and some optimizer that can convert those reference data to parameters spoke fit can handle the rest because we have that flexibility we want to be able to configure it as well the default setup matches our parameterization strategy for the mainline open force field force fields but if you're not optimizing a open effort force field that may not be very useful for you so for that reason we we try to keep everything configurable the model for this is that we have a workflow or a bespoke optimization schema this basically outlines all the steps that we're going to do to convert a particular molecule into a parameter these bespoke optimization schemas because we call them workflows are created from workflow factories so the idea is we create an instance of the workflow factory and use it to create optimization schemas from particularly molecules so you could configure one of these directly but because you're probably going to do more than one molecule it's much easier to configure this so that's just this class from the bespoke fit workflows module and if you pass it without any parameters you'll just get the default settings and you can print that and see what they are so this is a pedantic schema if you're familiar with pedantic which means the different values are type checked and there's lots of different ways if you can modify them and you can set up validations and stuff like that and like i said these default settings are appropriate for optimizing the mainline open force field force fields what i'm going to do is i'm going to go through the important values for bespoke fit workflow factories and how you can set them but i'm just going to set them to the defaults except in one case so i don't want to like say this is a better setup or you should use this setup instead i'm just trying to demonstrate how flexible this workflow factory is you should usually stick with the defaults unless you have a good reason to change them the first thing that bespoke fit needs because bespoke fit is about optimizing force fields you need a starting force field the default starting force field is just unconstrained sage so we use unconstrained instead of constrained just because it saves us some trouble with the optimization convergence sometimes and because constraints are very easy to add in at the end if you need a constrained force field this force field provides the generalized force field for parameters that you're not optimizing but it also provides initial guesses for all of the parameters that we will optimize along with the initial force field we need to specify the target torsion smirks which is the same smirks that we talked about earlier because we need to know which bonds we actually want to optimize you can see this is a list rather than a single value so you can give it multiple different smirk strings and it will optimize all of those bonds so if you don't want to write a really complex smirks to cache like fundamentally different torsions you can write individualized smirks as well and just have a big list of them okay so once we've configured the initial force field we need to configure the fragmenter bespoke fit just uses the fragmenter class from open ff fragmenter it doesn't wrap this at all so if you know how to configure open ff fragmenter you know how to configure a fragmentation in bespoke fit so in particular you have a fragmenter class you could pass in the fizer fragmenter if you preferred which is also provided by open ff fragmenter or any other fragmenter that is the same api we set up a threshold which says how close does the wyberg bond order for the fragment have to be to the full molecule for us to say that the electronic environments are the same and then you can set up whatever options you want for example we specify that we want to calculate wybergs from am1 bcc with l10 so that goes in the fragmentation engine attribute and all of these attributes can also be arguments to the bespoke workflow factory constructor if you prefer once we've set up the initial force field and the fragmenter we need to set up how we want to actually generate data for our torsions this is done with this target templates list a target is a particular quantum chemical calculation or data set that you optimize against that language comes from force balance at the moment torsion profile target schema is the only one that's implemented and that represents a torsion drive but you could imagine implementing vibrations and geometry optimizations and a few other things the idea is if you needed to optimize anything other than a torsion you would specify the source of the reference data here the other thing that we have to specify to configure a target is the actual quantum chemical program that we're using to generate our reference data here i'm just printing the default qc specs so the default quantum chemical program that we use because we're going to change it so the default setting is to use psi four with b3 lip d3 bj and the dzvp basis set the reason we use this by default is that it's exactly the same method that we use to optimize open ff force fields so if you use the same method hopefully you get compatible answers so obviously if you're optimizing a different force field this would be one of the key settings to change we're changing it here to xtb and this gfn2xtb method which is a semi empirical method it's much faster than v3 lip because we're running an optimization in real time that specifies the targets it specifies how to generate our quantum chemical data once we've done that we need to configure the optimizer this language is a bit confusing because like isn't all of this both fit an optimizer the optimizer is the step that actually takes the quantum chemical reference data and optimizes the force field parameter to reproduce that so we use force balance for this and force balance works by it has a nasian prior distribution so it says you're right you've got this initial value and the final value should be influenced by that initial value in the sense that we think that initial value is a good guess and we don't want to stray from it unless it actually really improves our optimization and the reason we do this is because we don't want to overfit we don't want to just like have 20 different sine waves that exactly reproduce the torsion drive we want each sine wave that we add to justify its existence so we do that with an l1 regularizer and you can configure force balance itself with the optimizer attribute and then the hyper parameters that you use to do the optimization are configured with the parameter hyper parameters attribute so here we specify a prior width of of six units that's a relatively large prior because we don't want to overfit but we also don't want to be limited if the initial force field is his way off and so now we have our fully configured workflow factory all of these different settings you can look at the api docs we have documentation for everything at docs dot open force field dot org slash bespoke fit and we have a complete api reference so if you want more detail on how to set any of these up check it out it's organized exactly the same way as the python per deals and it's searchable and if anything's tricky or wrong we'd love to hear about that so once we've created this workflow factory we can save it's a file and that means that we have a machine readable and vaguely human readable record of everything we're doing when we run the optimizing so this hasn't actually run any calculations yet it's just set up this file so everything we've done so far is summed up in this except for the cypher change if you just run factory equals bespoke workflow factory you'll get the same settings so you don't need to run this every time you run bespoke fit we're just demonstrating how things work if you do change all the settings or change some settings and you save this workflow factory then all you need to do is call bespoke workflow factory from file on the yaml file you've got and then you've got it back for the next time you run bespoke fit so basically that command we ran in the very first code block in the notebook that was basically dispatching a whole job using the default settings except for substituting xtb for cypher yeah and that's been running in the background the whole time and what we've done since then is just kind of go into all the knobs and dials that we could have turned if we didn't use the default settings yeah exactly so we can see here we've specified the workflows the default workflow we've created an optimization schema from that workflow with this ligand we've overwritten the qc spec and when we submitted that to the workflow it says here look it's loaded the molecule it's built the fitting schema um and then submitted the workflow so this has done all that behind the scenes and we can just replace this workflow with our new file so let's see how our optimization is going this executor that we're running in the background is basically a http server and the executor command line can send it requests to say okay i want you to run this calculation now or can you tell me what calculations you've done or where is this calculation up to this list command uh does exactly that asks where are all the calculations up to sends that http request within my computer so it's not going over a network it's just using a network protocol sends that request to the executor that we ran at the start of the session and then it found this optimization the executor receives the request says i can tell you all the optimizations i've got here they are and we can see this is the molecule that we submitted the start and it successfully finished its optimization so with that we can retrieve the results that it's got so we specify the id which if we'd submitted multiple molecules this would say i want the first molecule or second molecule or the tenth molecule and it says i want you to write the output to this file and then write the actual force field that we get out of it to this olexa hexagonal file so if we run that it'll say yep we finished i've saved the results and i've saved the force field and we can see here's the results um and so you can see this is just a jason file we can see it specifies everything that we we did so there are three stages which are like fragmentation uh target generation and optimization so the fragmentation was successful the qc generation was successful and the optimization was successful we keep a record of the input schema which is the optimization workflow that we worked on so this this file keeps track of what we actually did and it provides the refit force field so this olexa xml string is the same as this force field so this file is the fully optimized force field and if we scroll down to the end of the proppers we can see we've got some values that have really deep smirks and those smirks are specific to new fragments but i'll go more into that in a second now that like we've seen how executor works we can submit all of the rest of our molecules to it so this is just a python loop where i'm converting each molecule in our original list of molecules to a smile string and then passing that smile string with this exclamation operator to the executor which is actually a command line command and it's just giving that smiles and using the new workflow factory file that we just created which is identical to the original workflow it's just got the xtb qc engine built into the workflow factory rather than having to specify so we can run that and we see it creates each molecule submits the workflow and now id2 is this slightly modified molecule actually id2 will be exactly the same molecule and then it does the rest of the molecules in that list i'm now running 12 optimizations on my computer the executor is managing all of the resources but that's a lot to manage and once we've done that we can also watch an optimization as it runs with this watch command so if i switch to id3 id2 will have already finished because the parameter will have been cached so here you go you can see it's it's fragmented the molecule successfully and it's now generating qc data now that you've submitted all those molecules could you do the open ff bespoke executor list again okay so we can see this was the first one we initially submitted this number two is the one that we resubmitted so i didn't leave out the first one out of the list of molecules that i wanted to submit so it's resubmitted the first one and we can see that this one has already successfully finished and that's because it basically just said have i worked on this molecule before yes i have here the parameters for it so this is a result of the caching and the rest of these will be only running their fragments that weren't in a previous calculation so that's bespoke fit in a nutshell we've created a workflow file that specifies how we want to calculate our optimizations and we've submitted it to the executor we've seen that you can retrieve results from the executor and list different things we've executed but fundamentally the only thing we actually had to run for this whole notebook is these first two commands and that's kind of the magic is that i've spent i don't know an hour talking about what bespoke fit is actually doing but it doesn't all in these two commands so that's what bespoke fits about and ultimately the result of these two commands is just this force field file which is just a standard of fxml smirnoff force field unfortunately bespoke fit has not been updated yet to use the latest version of the open ff toolkit the newer version is really really convenient for working with proteins and since these drugs are protein leggings we want to be able to work with proteins so that's why there are two environments and a second notebook so we've got this toolkit dot ipan b which is a separate notebook with a separate environment so i'm just going to open that up and then i'm going to make sure that we're on the right environment i've got a kernel change kernel and i've changed it to conduct and toolkit and click select sorry that that's so complicated we will be updating bespoke fit to work with the new toolkit version as soon as we can okay what i'm going to show here is just like lightning speed how do we set up a simulation with the toolkit to use a bespoke trained force field if you haven't done this before the toolkit showcases the premier example for how to do this and it's described in some more detail in other workshops as well so take a look at those resources i'm going to go quite quickly first thing we're going to do is load our molecule from pdb we found a bug in from polymer pdb if you're using rd kit so that's what this command is for it will work around it for the moment i'm going to open i so i don't need to worry about this so all we're doing is we're loading the protein from a pdb file we're loading our ligand from an stf file so this is coordinates and connectivity and bond orders and all that stuff but not force field parameters obviously and then we're combining them into its apology so this takes a bit because loading a protein from polymer and let's look at this a bit slow the next thing we're doing is preparing the force field which is the bit that bespoke fit comes into this if you've ever simulated a protein in open force field before you're familiar with this caveat open force field sage is parameterized to be compatible with amber force fields but we haven't checked it in detail we've created a port of amber ff14sb to our tools and this is the only way you can run protein calculations at the moment unless you have your own force field but there's just a lot of ways that this could go wrong so i'm demonstrating how to do this but i would recommend waiting for open ff 3.0.0 rosemary which will have protein parameters of its own before you use this in production work and that should come out next year so what we're doing is creating a force field object out of our augmented sage force field and out of amber and the way these force field objects work is you have a list of parameters from top to bottom like from to start to finish so we're starting with sage and then we're adding amber to the end of them and we always go through the force field and we find the last parameter that bits a particular atom to give that atom parameters so we can always start with general force fields and then add more specific parameters afterwards so this will use sage parameters for most thing it will use the bespoke fit torsions over them and then it will use the amber parameters for proteins now we've wrapped this force field i just want to demonstrate that we're actually applying those parameters so here's the molecule we're working on i've hidden the hydrogens and i've just labeled each atom with its index in the molecule and that's because that's how parameters are assigned and then this blocker code labels the molecules in the topology using a force field so it's going to show us how would we apply this force field to these molecules and then what i've done is iterate over all the proper torsions and we're going to check if they're in the vanilla sage or vanilla amber force fields and if they're in sage we'll say it's from sage if they're in amber we'll say it's from amber and if they're in neither of these and they must have come from bespoke so we'll come put them in bespoke fit and then we'll just pretty print the first six of these so we can see there are these sage parameters with boring short smirks sage parameters boring short smirks and then down here we have bespoke fit parameters with exciting long smirks so the idea is that this smirks will match both the parent molecule and the fragment and because it matches the fragment it should also match any other parent molecule which would have the same fragment so it won't match other chemicals that are vaguely similar but it will match other chemicals that literally contain exactly the same fragment this makes sense because the whole guiding principle of this bespoke fit fragmentation scheme is that when you create a fragment that fragment is all the information you need to calculate the electronic structure of that central bond that the fragments built around so the idea is that if you had any molecule that included that fragment it would be appropriate to use these parameters for that molecule so this first bespoke fit is connects atoms 0 3 10 and 11 and we can scroll back up to see where that is and it's 0 3 10 11 so it's this bond that we're torsioning around and we remember from the previous notebook that this is one of the torsions that we've optimized so that's spot on so okay good we're assigning parameters now we can just put the system in a box i was going to use a cubic box but i guess i'm using a rhombic dodecahedron box so rhombic dodecahedron is nice because it's still a triclinic box so it's still tiles but it's like much more spherical than a cube so you're not wasting the waters in the corners so we're just going to calculate the maximum distance between points that's just a non pi function and then use that to set the size of the rhombic dodecahedron all of this stuff and also the draw molecules function that we used previously is distributed in this utils toolkit module which is just here there are a bunch of functions that we used in the previous notebook that are in utils and so this is some convenience methods that let us do complicated things relatively easily so you can see max distance between points it's just getting all the points and then computing a convex hull and then computing all the distances in that convex hull and rhombic dodecahedron is literally just the box vectors for a rhombic dodecahedron with image distance one so if you had an atom in the middle of that it would be one unit away from its periodic image so if we multiply that by a scalar which is the buffer region that we want plus the maximum distance between points we have a properly sized box then solvate works by converting the topology to a pdb file in memory and then loading that pdb file into pdb fixer using pdb fixer to add a solvent and irons and then calling it back out of pdb fixer with the new toolkit this is a really convenient way to solve it it's apology we're solvated so we can visualize this now so this visualized method is also in this usual still good thing so this is all the water solvating our protein and you can see our ligands down here this looks like the box is way too small and that's because rhombic dodecahedron the boxes are way smaller than other boxes it also looks like it's not a rhombic dodecahedron but that's just because you can always represent a triclinic box as a rectangular box that's translated differently than just having the other box on top of it so the box will have a corner in the middle here roughly and so you're still sort of getting all of the image distance because it's going through the corners to the center of the next box which is why it looks so thin rhombic dodecahedron boxes is totally magical if we're choosing off that sort of books now we're just converting our topology to an open mm topology and we're creating a open mm system out of our force field this code goes through interchange behind the scenes so if you want to put a bit more code in we can export to amber or gromax or i think that's always support them but we're going to add more and more molecular dynamics engines but for the moment we'll use open mm because it's really easy to use in python this step takes a minute because it's running python code on i don't know 100 000 atoms or something the next thing we'll do is set up an open mm simulation this just happens like any other open mm simulation once you have your topology and system so we construct an integrator we set its temperature we set its friction constant and we set its time step we collect our topology system an integrator into a simulation object we tell the simulation object to set its positions to the topology's positions and then we add a reporter to write everything out to a dcd file then we're going to energy minimise and simulate and we're not going to have a long simulation because we're just going to run for one minute of clock time and then we can visualise the dcd file by converting our topology to an empty trench topology and loading the dcd into that all right so we're energy minimising now there we go we've minimised to 836 megajoules per mole negative and we simulate for a minute so this is our wiggling protein um you can watch it wiggle it's got water it's wiggling yeah so that's how we take a bespoke fit force field and run a md simulation so what we've just done is taken the force field and then used it to parameterise a system of the ligand that we that we optimised in a protein with water and then we've run a simulation on it it's not a particularly detailed simulation it was 13 picoseconds long we didn't do any equilibration or anything so obviously if you actually want to do this you're gonna have to like come up with a simulation and then previous to that we just went through how bespoke fit works in general terms and how to configure and run it we have further readings so there's documentation we now have like a centralised documentation site for open force field software so this is just docs.openforcefield.org and this will link out to all of the different packages so docs.openforcefield.org slash bespoke fit for instance we'll take you to the bespoke fit docs we have an automated api reference which we'll have every part of the public api in it we have handwritten docs as well installation guides getting started guides for all of our software now there's lots of stuff to read there if you find something wrong with anything we'd love to hear an issue about it if you find there's missing information or there's information that you'd like we'd love to hear about that too hope you've enjoyed this great thank you very much josh