 Hi, everyone. Today, we're going to be looking at simulating some biopolymers with non-phenomenal amino acids with the toolkit. If anyone's ever tried to do this with existing tools, it's real pain. You have to combine a force field that will handle your protein really well with a general force field that can handle more general chemistry. So you need to be able to parameterize that. And existing tools just don't do this automatically. So you have to do it by hand. And you have to guess how to stick them together and hope that your hand-edited topology files are correct. Because of the way we design force fields, in principle, using non-canonical amino acids is no harder than using canonical amino acids. But of course, you need software support for that. And that's not entirely in place. So in the new update in version 0.11, we just got to a point where it's really possible to do non-canonical amino acids and biopolymers. And so that's what we want to show you, the promise that our technique has for this workflow. And this is just going to get better in the next year or so as the Rosemary force field comes out, which will have proper protein support. And as we figure out a robust API for teaching the toolkit about non-canonical amino acids, today we're going to introduce the new biopolymer support in the toolkit and just demonstrate that biopolymers work exactly the same as any other molecule. We're going to demonstrate a way to load a protein that has a non-canonical amino acid into the toolkit. Then we're going to use the toolkit to prepare a force field for proteins with that non-canonical amino acid, which is going to be a particular synthetic fluorophore. Finally, we're going to use the toolkit to simulate our labelled protein with OpenMM. So the first thing we do is obviously import everything. The one thing I want to mention here is this Utils module, which is distributed with the notebook. It just has a bunch of code that isn't particularly interesting, but is varied long. And so it's there if you want to look at it, but it's just code that we're going to use, but not talk about how it works. It's somewhat commented and documented in the Utils module. So if you want to have a look at how it works, it's there. So the first thing we're going to do is actually load an unlabeled peptide into the toolkit and just show that it's just like every other molecule. So the way the toolkit works, if you haven't used it before, is we load molecules into the molecule class. And the molecule class stores a chemical graph, basically. You can add some annotations to it, but basically it's a chemical graph that identifies a particular chemical species. And then we can combine molecules into a topology. And then the topology can be combined with a force field to get an open M system or an interchange for exporting to other empty engines. So I've distributed a PDB that has an ACE and N-methyl group CAT A5-16A5 peptide in it. So you can see it's a helical little peptide. I haven't displayed the side chains here. These little licorice bits at the end are just the caps because NGL view doesn't know how to display them. They are bonded. It's just NGL view being a bit odd. What are the known residues that it knows about in the From Polymer PDB method? Right. So it currently knows the 20 canonical amino acid residues and a bunch of their like, protonated, deprotonated forms. And it knows acetyl caps and N-methyl caps. And it doesn't know anything else, which is why we have to go to this world. To teach it about a non-canonical amino acid. The plans for future support of new residues is that we want to do it and we haven't figured out how we're going to do it yet. If I could cut in, we have a variety of options and we're looking at basically ways that will be good for the future. So the past few months we've been scrambling to have a production ready variant from Polymer PDB that needs to operate perfectly for the 20 amino acids. And now that we have that out in the wild, and probably in conjunction with open free energy in the Shirt Slab, are going to be exploring sort of now a more extensible way to like define new residues and be able to load them from PDB as well. Asperationally, we want to handle, of course, all proteins and amino acid variants defined by the PDB. We want to handle other polymers like DNA and RNA and other common, probably sugars as well, but I know sugars, their branching structure can be extra complex. So maybe they'll come in a third generation. Yeah, but we do plan to expand this functionality a lot. And also, so if you have an SDF file of your protein or you have a smiles of your protein, then you can load that in directly. You don't need to work with this from Polymer PDB method. But this molecule that we've just created, this A5CA5 is just a regular molecule. So anything that you do with any other molecule you can do with this. So one of the new features we've added to all molecules is hierarchy schemes. And the point of hierarchy schemes is that a lot of chemoinformatics toolkits have and the PDB format itself actually has this scheme for iterating over atoms within a molecule, which is all within a topology, which is chains and residues. The idea is that a chain is a single polymer chain that's translated or transcribed all at once. And a residue is a particular amino acid or nucleotide residue within that. And we didn't wanna have this very protein specific bespoke iteration method in our molecule class. So we came up with this abstraction called the hierarchy scheme. A molecule can have any number of hierarchy schemes. By default, polymers loaded from PDB have these two chains and residues. So exactly the same as other chemoinformatics toolkits. And the way a hierarchy scheme is defined is that you give it these uniqueness criteria. So our chains hierarchy scheme has the uniqueness criteria chain ID and the residues hierarchy scheme has the unique criteria of all this stuff. The idea is that every atom with the same values for these metadata keys gets the same residue or every atom with the same set of these metadata values gets the same chain. So that lets us iterate over anything you can think of to iterate over atoms as long as you can express it as metadata on an atom and some combination of uniqueness criteria. So this is a really flexible scheme that we anticipate can be used in ways that we haven't thought of yet. But it also works really nicely for residues and chains. And we go to a lot of work to make sure that when you load a PDB or you load a protein there's a good chance that you'll be able to iterate over it like this. And if you can't, we provide pretty simple methods to allow you to. I'm just going to demonstrate how this works. This loop iterates over all the residues in this residues iterator. This residues iterator is created because of the hierarchy scheme. So... What's that and 13 elements at the end of the residue definition? Right, so that's saying that this molecule has 13 residues. So it's got 13 elements of the hierarchy scheme. So yeah, they're iterator elements, not chemical elements. Okay, so we're iterating over our residues. We're going to take the first atom in each residue and just check that it's got the same metadata as for the uniqueness criteria of residues as every other atom. And then I'm just going to print each atom, each residue with its atoms. And you can see this gives exactly what you'd expect in any cheminformatics toolkit. And so this lets us iterate over chains and residues really simply without having to shoehorn the idea of a residue into every molecule. And the metadata is just a dictionary that maps from strings to either integers or strings. So you can put whatever data you want in that dictionary. But apart from its predefined hierarchy schemes, a peptide is just a regular molecule. Any biopolymer, just the same as any other molecule. For instance, we can convert it to smiles, no worries. You get a very long smiles and this is for 11 amino acid residues. So you can imagine for a real protein, these get very long, but there's no problem with it. You can just convert things to smiles. And like I said, for topology is a group of molecules, possibly with box vectors and positions. And we can convert our molecule to a topology just like it would any other molecule. Okay, so now that we've got our unlabeled biopolymer, we're going to label it with a fluorescent dye. The problem is that it's really unusual to have an SDF file or a smile string of an entire protein or especially of a protein label with an non-canonical amino acid. And PDBs don't include enough chemical information to unambiguously get a chemical graph out of them. Unless you make assumptions about connect records and no one knows what those assumptions are and different tools disagree about them. So it's just not worth doing if we wanna avoid errors. So we're going to use our D-kit to manually add a modification to the peptide we've already got. So this is fluorescein 5-malayamide. This is a really bright synthetic fluorescein dye and it's squandered to this malayamide group. Malayamide groups are great because they do this click reaction with the cysteine. It's a conjugate addition against this across this double bond. And then you've got a labeled peptide. So that's what we're gonna do. We're going to use our D-kit to perform this reaction and then we'll have a smile string that we can use moving forward. So we thought about a lot of ways to modify a molecule originally disinvolved like manually editing smile strings. And we eventually settled on this method which is our D-kit provides a way to apply a reaction to molecules from smarts. So I've written this reaction smarts which says take a thiol group and take a malayamide and give me a thiolated malayamide. If I run this reaction, we can see this is the molecule that we get out and that's exactly what we're looking for. I did have a question. Is there any way to control the chirality of that product or you just kind of take what you're given? So this reaction smarts specifies the chirality of that product. So if I put an extra app in there, you get the other and that's your money. Oh, great. And actually the tool kit will not accept smiles that don't specify all of the stereochemistry. You can tell it to just get over itself and accept them. But by default, we quite specifically avoid that particular issue. Now we've got a RD kit molecule for this. We've got a smiles and smiles are really good because we can use them with this from PDB and smiles method. And the way it works is you take a PDB which doesn't have chemical information. You take a smiles which doesn't have anything but chemical information. You make sure the smiles works with the PDB and then you get out a molecule. I've provided a PDB of the modified peptide with the notebook and we're just gonna use this method to load it in. And if we do that, we magically get our peptide. I've represented it in licorice rather than the default representation so that we can see that it's actually right. So it's got the positions from the PDB rather than generating new positions. And we can see it's also taken the residue information from the PDB. So I've given it this new dye residue so that we can identify it later. And so this is sort of the superpower of the toolkit is that everything's just a molecule. So anything can do with the molecule can do with all of it. And that means exporting things to RD kit and then re-importing them in various different ways. Obviously there's a million different ways to get to this point where you have a molecule of your non-phononical amino acid. This is the easiest way for this example we think but obviously if you wanna do some other PTM you need to write a new reaction smarts which ended up being a real pain and I tried to do it the other day. You may want to create your molecule in some other software and export it to an SDF or just manually edit smile strings or something else. The point is that it doesn't matter how you get to this point. Once you have a molecule, we can put out a transplant. If you import from an SD file you don't get the hierarchical naming is that important for the rest of the tutorial? It is, but it's pretty easy to work around. So there's a method on molecule called perceived residues that it applies a bunch of smarts rules to the molecule to identify amino acid residues and you can get residues out of that. There will be one or two changes you might have to make to one or two cells and you'll be able to apply to the rest of the network, no worries. Is there something like from PDB and SDF for PDB and mole? Yep. The way to do it would be to load the SDF or the mole into a molecule and then produce a smiles from that molecule and use from PDB and smiles. So when we do this from PDB and smiles we have this PDB with the dye on it named as dye in the rest name, right? But we don't know the chemistry of that. So we add these smiles for the molecule to know what's the chemistry. Basically, is that what's happening? Yeah, that's exactly what's happening. If we have more than one molecule or one residue in the PDB that we want to specify, is this possible? No, this method will fail if you do that. Okay. You can work around that with OpenMM. If you have a look in the utils module there's a method called solvate and it demonstrates a way of loading a multiple molecule PDB through OpenMM. But you just need to check your work. Like, we can be relatively confident in that this will either do what you expected or error out, but when there's more molecules then there's more opportunities to go wrong. So we need to support that. I mean, when we have more than one residue that we need to, that we need the smiles for? So the smiles is all or nothing. It has to describe the entire polymer. It has to describe the entire PDB. All right, we have the smile for the whole thing. Oh, I see it. You can just iteratively apply this reaction smiles if you had like multiple systems that you wanted to label, for instance. And then once you've got the whole molecule out, yeah, this needs smiles for everything that's in the pigment. So now that we've loaded a post-translationally modified peptide into the toolkit, let's parameterize. This is obviously the pain point in a traditional workflow for non-canonical amino acids. I've done this more times during my PhD than I wanna really think about. But the nice thing about SmonoFORCE fields is that we just define, this is a general force field and this is a specific force field and it will apply this to everything it can and then anything that's left behind gets this. And that's just lovely. Like being able to combine force fields is really useful. The caveat is that you're combining force fields and so you need to be really sure that the parameters are compatible. So OpenFF 2.0 Sage uses the same charge generation method or very similar charge. I think it's actually the same charge generation method as AMBA. The bandwals are all derived from GAF which is compatible with AMBA. We've done a lot of refitting over them but while this has never really been rigorously tested and you probably shouldn't use it in production unless you're gonna check it yourself, we have like good reason to hope that these will work together. And this situation will get much better in the next release of OpenFORCE field which will have protein parameters. And so you could parameterize a protein just with this but it wouldn't do very well because we'd be using general parameters for proteins. So all your backgrounds would probably be very similar and stuff like that. This gets us better quality proteins at the risk of some compatibility issues that we hope are pretty small but the new OpenFF Rosemary will be the best of both worlds. We'll have good protein parameters and we'll handle general chemistry in a way that has been optimized together. So if you wanna use this in a paper I would suggest waiting for Rosemary which should be out next year, hopefully early next year. You made a comment that they're using compatible charge methods but I think FF14SB probably uses REST derived charges whereas the OpenFF line to date I'll use AM1BCC. Normally those have been treated as compatible even though they're not the same. Yeah, this isn't probably the worst thing in the world to do but be careful when you're doing something. The problem with the way the toolkit applies charges is that to generate charges for a molecule you need to do a quantum calculation on the entire molecule. And if you're doing a quantum calculation on a protein that takes a very long time. So what we're gonna do is we're going to compute charges from just the residue that we're trying to charge and then we're gonna translate those charges into the general force field so that we can apply them to that residue whenever we see it without having to recalculate charges. Just to demonstrate how this works the new force field that we're creating here works out as a bunch of parameter handlers that each have parameters. And so if we pull out the alanine parameters from this combined force field then we can see it's a smirks which is like a smiles but with numbers attached and then a bunch of charged parameters. This smirks matches an alanine and it just labels every atom in the alanine with a number and then the parameters assign charges to the atoms with that number. So this is charge one which corresponds to this index. So this is the backbone nitrogen charge and so on for all of the charges in the residue. So this is the information we need to get. We need to get charges for the atoms in the residue and we need a smirks for the residue. Since we want some electrostatic context we don't just wanna be simulating like the charges for free floating labeled cysteine but we're gonna just apply a reaction smarts to locate the labeled cysteine residue cap it with the same acetyl and N-methyl caps that we've been using and then we'll have something to charge. This function just inverses things to our D-kit applies the reaction which just takes a protein residue and caps either end of it then iterates over all of the possible products when you run this reaction on the input molecule and identifies the one that has the residue name that we give it and pulls it out and then make sure that our metadata is up to date. Don't worry if you don't understand that any method of getting a capped residue will work. So I've just given it the labeled molecule that we came up with the residue name that we gave it and told it that undefined stereochemistry is okay. The reason I have to do that is because this center here which is not chiral because these two are symmetric sometimes thinks it's chiral because of the calculation of this ring system. So we have to tell it that the undefined stereo is okay but this is fine because we know that we know that it's accepted it previously. So we know we're in the clear. Okay, now we're gonna assign partial charges toolkit provides this really easy method to assign partial charges. This step is the reason we use open-eye. So this step is much faster if you have a open-eye install than if you're using our Dkit but this whole thing will get easier when we eventually support graph charges in the toolkit which is a machine learning method for generating charges directly from the molecular graph rather than going through the rigmarole of computing a whole bunch of different confirmations and doing a quantum calculation on each of them. We're just going straight from the graph to the charges. It's not ready yet. But once it is, this step will be much faster and we'll probably be able to just assign partial charges to the entire protein rather than doing it specifically for this residue. We've got our partial charges out. We only want the charges for the residue itself not the caps that we've added. So we can do this just by using our residue hierarchy scheme. So since we've got residues in our cap to amino acid we can just say we wanna take all the atoms that aren't the caps and we'll get a list of atom indices. The toolkit generates charges that sum to an integer over the entire molecule but we need it to sum to an integer over just the residue that we wanna include in the library charges. So we need to compute an offset and then we need this smarts code. So up here, we have the smurfs code, smurfs and smarts are the same thing. We need one for our new residue. So there's a great little tool called Kempler which can do this for us. So we just say, here's a molecule. Here are the indices that we wanna smarts for. Give us a smart and we get this. Again, it's a smurfs but they're like subsets of each other. This is a very specific smurfs. It specifies like all of the possible things that you can specify in a smurfs. So this is very unambiguous. And you can see that it labels each atom. So everything in a square bracket is an atom. The first little number after the hash is the atomic number. Then there's a bunch of like, how many bonds does this have? How many items are attached? What's its hybridization status or what's its charge? And then a column and then an index. And this index is gonna do the same thing as these indices that lets us apply charges. So we've got our smarts. Then we just need to combine our smarts with the charges that we calculated earlier and put them into the library charges parameter handler of the force field. So we just iterate over the indices that we computed. So the indices that exclude the caps and we can enumerate them to get the corresponding smurfs atom index, from Kemper. And then we can just say, all right, the charge of that index is the partial charge plus the charge offset. And then that can go straight into our parameter handler. Now that we've modified our force field, we can save it to disk so that we can distribute it with our paper or apply it to other molecules later. Or put it on some preprint server or whatever. We save it to file and then we have our augmented force field forever. So now we've got our force field. We can check that it parameterizes our molecule correctly. So these are just methods, functions that are in the utils module that we talked about. They're not optimized or particularly clever. They just give us some coloring so that we can see whether we've done the right thing. So here we are coloring atoms by where they got their charges from. So the amber colored atoms get their charges from amber and these lilac colored atoms are the ones that we just added. We can check that the charges are reasonable. So all right, the background nitrogen is negative. Our nitrogen is negative. Our oxygen is negative. Our carbons and hydrogens are relatively low. That looks fairly reasonable to me. They're actually typically correct, but it looks reasonable as well. We can color bonds by where they got them from. So we can see all of the bonds that would be present in a regular peptide from amber, including the background bonds for the sulfur. And then the new bonds are taken from sage. That's why they're colored green. And we can do the same thing with the torsions. So in this one, darker colors imply that there are more torsions over that bond. Green is torsions from sage. Yellow is torsions from amber. If there was the bond that had torsions from both, it would be like a green and yellow color, but we don't have anything else. Okay, so our parameterization looks good. There's no obvious errors. If you're gonna do this for real, you would do more than looking at colors, but I think we're about ready to simulate. So we had our labeled molecule and we needed to create parameters for it. The first thing we did was combine our general force field with our protein force field so that we would have highly specific protein parameters for the bits that they exist for. And then we could fall back to our general sage force field for our non-canonical amino acid. Then we had to generate charges. We could have just let the toolkit do this automatically, but it would have taken a very long time because it would have had to run quantum calculations on a spying molecule. So instead we ran our own calculations using the toolkit on a trimmed down version of the protein. And then we took those charges and injected them into the force field so that we could reuse it later. And so that you could translate the charges that we got for our trimmed down molecule onto the whole force field. All of these steps will be easier in the future with a later force field because that will have a new charge generation method that will hopefully scale to an entire protein and will have a protein parameters in the open effort force field. But for the time being, you need to do a lot of work to get this done in a reasonable amount of time. Okay, so the last thing to do is simulate. I don't think this has changed in a long time. We first need to create a topology which includes all the molecules in the system. We're not gonna solve it because we want this to run quickly. There's a salvation method in the utils module if you want to experiment with it. But basically we just take the molecule, add it to a topology on its own, give it box vectors and create an open error system out of it. This step parameterizes this topology with this force field. If we hadn't put the library charges in, this step would calculate the charges on the whole protein. And if you want to spin your CPU, you can try that on the whole protein without adding the library charges in the symbol. And then we set up the OpenMM simulation by creating an integrator, giving the OpenMM version of the topology to the simulation along with the system that we just generated on the integrator and then adding a reporter so that we can view the output energy minimized the system. We can visualize the energy minimized system. Okay, great, energy minimized. Then we can save this energy minimized structure to a PDB relatively easily in case you want to view our trajectory in VMD or something. And then we can simulate it. We're going to set the velocities to the temperature we set in the integrator. Then we're going to run it for one minute so that some people aren't waiting along with another people and then just check how many steps it did and then we can visualize the result. I remember I think my heart skipped a beat when I saw that you were simulating something for a minute and then I realized that OpenMM is talking about a wall clock in this one situation. Yes. Sorry, we've got half a nanosecond done and I've written out every 50-pick a second so it actually looks like a trajectory that makes sense for a long jumping around all the time. So there's a lot of steps. But yeah, that's a we're going known for monoclonal amino acid. Grammar is all from published force fields kind of using the toolkit. So what we've done today is we loaded a protein into the OpenFF toolkit from a PDB file and we saw that the molecule that was produced was just a regular molecule just like those from any other way of loading chemistry into the toolkit. Then we used RDKit to add this synthetic dye to the molecule, created a smiles from that that we could use to load coordinates from a already modified PDB. Then we combined the amber protein force field and the sage force field into a force field that could handle both general chemistry and protein chemistry, general chemistry with sage and protein chemistry with amber. And we added in some charges that we calculated ourselves for our new amino acids so that those charges don't need to be calculated every time the system is parametrized. And we used that new force field that you can bind an organic force field to parameterize our protein and simulate it with open immunity. If you want something to try for yourself, if you really want to experiment with protein ligand systems, there's a great example called the toolkit showcase which goes through even the steps like solvating a protein ligand system goes all the way from a docked structure through to the simulation all with the toolkit. If you wanted a post-translational modified protein to try for yourself, I've just prepared some reaction smarts for black oscillations that protein needs. So this is like the smallest protein I could find that has structures for sugars. That might be something that you could experiment with if you wanted, but I'll hand it back to Jeff to see if there's anything else to say. Thank you very much, Josh. Yeah.