I'll execute the cells in the notebook as we go through. This is basically going to load things up. Can we read this from the back as well? Good. It's going to load up a couple of AMBER files; this one is actually an AMBER topology file. It could have been GROMACS files, it could have been PSF files, it could have been anything. I don't care what the file format is. Now we have a molecular system represented in memory in a data format. How many molecules? 631. This is a small little demo. Here I now define a protocol for running the equilibration MD simulation. We worked out what the key things are that a user wants in the various MD packages. Users don't really care about the integrators, and they don't care about all the low-level details. They want to know: how long are we running for, what temperature are we going to, am I restraining the protein backbone, and so on. So we define a protocol, and that's now a data format in memory. And then what we do is say: we've got a system, we've got a protocol, run it. It's now found that this particular cloud happens to have sander available, so it's now running a sander simulation. It could have been GROMACS, could have been NAMD, could have been pmemd, could have been SOMD. It doesn't matter; the same thing is actually running the simulation. Is it running? Yes. It's currently running in the cloud. It's got the resources, pushed all the data across, written all the input files; it's running everything. How long has it been going for? 0.4 minutes. And because this is running in the cloud, and we can actually understand the data files and convert them back into our data model, I can query it.
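The workflow just described, loading files of any supported format, defining an engine-agnostic protocol, then running it with whichever package happens to be installed, can be sketched roughly like this. This is a minimal illustration, not the real BioSimSpace API; every name here is a hypothetical stand-in.

```python
from dataclasses import dataclass

@dataclass
class Protocol:
    """Engine-agnostic description of an MD run: only the things
    a user actually cares about, not integrator internals."""
    runtime_ns: float
    temperature_k: float
    restrain_backbone: bool = False

# Hypothetical preference order of interchangeable MD engines.
ENGINE_PREFERENCE = ["sander", "gromacs", "namd", "pmemd", "somd"]

def run(system, protocol, available_engines):
    """Dispatch the same protocol to whichever engine this
    machine (or cloud node) happens to have installed."""
    for engine in ENGINE_PREFERENCE:
        if engine in available_engines:
            # A real implementation would write engine-specific
            # input files from `system` and `protocol` here.
            return {"engine": engine, "protocol": protocol, "running": True}
    raise RuntimeError("no supported MD engine available")

# On a node that only has sander, the same script uses sander.
process = run(system={"n_molecules": 631},
              protocol=Protocol(runtime_ns=1.0, temperature_k=300.0),
              available_engines={"sander"})
```

The point of the sketch is the dispatch step: the protocol object never mentions an engine, so swapping sander for GROMACS changes nothing in the user's script.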
So data is being written by the simulation. What's the energy? It's now 6,300 kcal per mole. What's the energy now? 6,354. We can do this live because it's running interactively. But just the energy is boring; that looks quite rubbish. So let's get more data. It's now been going for 0.0076 nanoseconds. That's the energy, that's the temperature. And if I run this again, voilà, it's a new value. What do we have now? We have data changing with time. And when you have data changing with time, what should we do next with it? Make a graph. Voilà: this is a live graph of how the temperature is changing through the simulation, and a live graph of the energy. And if I rerun this cell (this is just using standard matplotlib and so on) it live-updates. But graphs are kind of old school, two-dimensional. We've already heard that we have NGLView, so with NGLView let's get the latest snapshot out of the simulation. There we are: 3D, interactive. What it means is that we're not moving data backwards and forwards between the laptop and the cloud; it's all just happening live in the cloud. I've got something in my eye, obviously because I'm so amazed this is working. It's a live demo, and it's working. And again, I will emphasise, NGLView is amazing, and because of it I didn't have to write any of this. We did not write most of this, I should emphasise. This is all using standard tools and libraries, and it did not take us long to write. This is not the end of the project; we're only about halfway through, and we've got tons of time to go. What else? Well, we're producing a trajectory. Data is being written to the cloud, so let's get a handle on that data. We now have a handle on it. How many trajectory frames have been written? 8. Let's now fetch them. This has now got the data, and it's actually an MDTraj object.
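Polling a running process and turning the answers into a live time series, as in the demo above, can be sketched with a stand-in process object. Nothing here is a real API; the class and its methods are invented for illustration.

```python
import random

class FakeProcess:
    """Stand-in for a remote MD process that can be queried live."""
    def __init__(self):
        self._time_ns = 0.0
    def get_time(self):
        self._time_ns += 0.001  # each query sees a later snapshot
        return self._time_ns
    def get_total_energy(self):
        return 6300.0 + random.uniform(-50, 50)   # kcal/mol
    def get_temperature(self):
        return 300.0 + random.uniform(-5, 5)      # kelvin

process = FakeProcess()

# Build up a time series by repeated queries; with matplotlib you
# would redraw the same axes each time the cell is re-run.
series = [(process.get_time(),
           process.get_total_energy(),
           process.get_temperature()) for _ in range(8)]

times = [t for t, _, _ in series]
```

Each re-run of the querying cell appends later snapshots, which is all a "live graph" needs.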
I could have asked for the data as an MDAnalysis object instead. So again, we're not building new things; it's just tying together the existing tools. But I have an MDAnalysis or MDTraj piece of data. Let's plot the RMSD from it. Voilà: there comes the RMSD of my really rubbish simulation. It's running in the cloud, but it also runs on local machines, and it also runs on big clusters. As you notice, it's Python. It works because we have a data model. Now, MD is where we started, but we weren't writing this just to run MD. We're writing this because we want to run lots of different types of simulations. In particular, what we're running at the moment... this is where live-demo magic may not work. Come on. Come on, spinning wheel of death. Still the spinning wheel of death... there we are. It's a little dinky MacBook. Ultimately, what we want to do is run protein-ligand binding free energy calculations. That's really where our bread and butter is. As John mentioned, there's a D3R challenge on at the moment. So we have ligands, we have proteins, and we've got to predict the binding free energies. And hopefully they won't be totally rubbish when we get the real experimental data back. So what you've really got here is input data, which is your ligands (because we're doing relative binding free energies, we have two different ligands) and a protein going in, and there's this magical black box. When you put this input data into the magical black box, out comes the relative binding free energy, and it will be right and everyone will be happy. Now, when we come to sharing molecular data, the real question is: how do I share the protocols and the data flows of the magical black box? We have put the scripts on GitHub, but what should we be sharing? The input files? All of the input and output data? All of those things? And it becomes difficult, because if we look inside the magical black box, it actually looks like that.
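The "magical black box" is, from the outside, just a function from inputs to one number. A toy sketch of that interface, with entirely hypothetical stage names standing in for the real tool chain described next:

```python
# Each stage of the black box is a function over a molecule record.
def parameterise(mol):  return {**mol, "params": "gaff2"}
def solvate(mol):       return {**mol, "water": "tip3p"}
def equilibrate(mol):   return {**mol, "equilibrated": True}

def relative_binding_free_energy(ligand_a, ligand_b, protein):
    """Hypothetical interface: two ligands and a protein in,
    one relative delta-delta-G out. The stages are placeholders
    for the real chain of tools inside the box."""
    for stage in (parameterise, solvate, equilibrate):
        ligand_a, ligand_b, protein = map(stage, (ligand_a, ligand_b, protein))
    # A real implementation would run alchemical simulations here
    # and analyse the results; we return a placeholder value.
    return 0.0  # ddG placeholder, kcal/mol

ddG = relative_binding_free_energy({"name": "lig1"},
                                   {"name": "lig2"},
                                   {"name": "prot"})
```

Sharing the workflow means sharing this function and everything it hides, which is exactly the difficulty the talk turns to next.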
You have to set up the ligands, which means basically throwing them through antechamber with parmchk, then tLEaP; aligning them together with FMCS; doing a gmx solvate for solvation; tLEaP for the protein; sander for minimisation; pmemd.cuda for some equilibration; SOMD for doing the free energy calculations using OpenMM; then into analyse_freenrg to pull the data out and get one relative delta G. And that's one free energy that you're getting; for what we're actually doing, the D3R, it's about 50 relative binding free energies, so it's 50 times this. Now, we do publish the scripts for this, but the scripts, if you do it traditionally, will work with these tools and only these tools, and these versions of these tools. Nobody else can reproduce this workflow in two years' time, because they might not have that software installed. What you really want to do is publish the script so that if somebody comes along and doesn't have these pieces of software installed, it will still work with RDKit from SMILES, it will still work with tLEaP, it will still work with gmx mdrun, it will still do the free energy with pmemd.cuda. That's what you want: the ability to share workflows where you can flip easily between different equivalents, have different file formats for the inputs, and so on. And we did it. This is a BioSimSpace script that parameterises a ligand using a wide variety of different tools. This, slightly smaller, is a BioSimSpace script which does a relative binding free energy with a wide variety of different tools. By sharing these Python scripts, which is what is on the website, if you have BioSimSpace installed and you have the underlying tools installed, then you'll be able to reproduce our D3R run. If you can beat us by next Wednesday, the deadline, you can submit a better answer than we will. We took the view when we did BioSimSpace that our job is not to come up with the XKCD-perfect file format.
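Making each stage swappable between equivalent tools, as described above, amounts to keeping a registry of interchangeable backends per stage. A minimal stdlib-only sketch, with all tool lists and names chosen purely for illustration:

```python
# Map each workflow stage to the interchangeable tools that can do it.
BACKENDS = {
    "parameterise": ["antechamber", "rdkit"],
    "solvate":      ["gmx_solvate", "tleap"],
    "minimise":     ["sander", "gromacs"],
}

def choose_backend(stage, installed):
    """Pick the first equivalent tool for `stage` that is installed,
    so a published script keeps working even when the original
    tool chain is missing on the reader's machine."""
    for tool in BACKENDS[stage]:
        if tool in installed:
            return tool
    raise RuntimeError(f"no installed backend can perform {stage!r}")

# On a machine without antechamber, the same script falls back to RDKit.
tool = choose_backend("parameterise", installed={"rdkit", "sander"})
```

The script that gets shared names only the stage; which concrete tool runs it is decided at execution time.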
Our job was to look at what exists already, recognise that all our tools have different shapes and so they don't fit together perfectly, and so our job is to build what we call the shims. Anyone with an engineering background knows that when you engineer stuff it really doesn't fit together well, so you end up putting little bits of metal between the parts; those are the shims. They're the things that fill the gaps and make sure the rockets don't explode. Ultimately, that's what BioSimSpace is: file format interconverters and things that join together lots of tools. We designed these to work with all the standard packages, and we designed them following the seven sacred design principles of BioSimSpace. One: if you are writing a file format reader, it had better be able to read and write the same things. I see so many tools come out that can either write a file format or read it, but not both; and even those that can read and write a format find it difficult. So, principle one: anything that goes in (these are the supported file formats; we have many, many more) we read and write, and it's symmetrical. Everything we can read, we can write. Next design principle: information is preserved. That means that if you read something in, when it goes into the data model (we have data models such as system and molecule), every piece of data from the file is precious and must be contained in the data model. You do not throw away anything from these files when it goes into the data model. We even record the file names and where the data came from; that bit of data goes into the data model too. And that means that when we have the data, we can write it back to file: if it came in as a GROMACS top, you write it as a GROMACS top, and then you read it again, and you should have the same data going round. And you check this.
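The symmetric read/write and information-preservation principles can be sketched in miniature: a parser for a toy key=value format that keeps every record, even the ones it does not understand, so the round trip is lossless. The format and field names are invented for this example.

```python
def read(lines):
    """Parse a toy key=value format into a data model, keeping
    every record, even ones we don't understand."""
    model = {"known": {}, "unknown": [], "source_lines": list(lines)}
    for line in lines:
        key, _, value = line.partition("=")
        if key in ("charge", "mass"):
            model["known"][key] = float(value)
        else:
            model["unknown"].append(line)   # precious: never drop it
    return model

def write(model):
    """Write back everything, known and unknown, symmetrically."""
    out = [f"{k}={v}" for k, v in model["known"].items()]
    return out + model["unknown"]

original = ["charge=0.5", "mass=12.0", "fluffy_cat=yes"]
# Lossless round trip: reading what we wrote gives the same model.
assert read(write(read(original))) == read(original)
```

A reader that silently discarded the `fluffy_cat` record would pass a naive test but fail this round-trip check.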
You check this by doing things like comparing the energies of the molecular systems that you've loaded. So we calculate the energy in GROMACS; we load the files up; we have an internal engine that can calculate energies, because it's really important that a molecular file parser can actually calculate energies. We calculate the energy with our internal parser, compare it to the energy in GROMACS, write it out, read it back, make sure the energy is sane, make sure it still agrees with GROMACS; write it out in AMBER, make sure the energy agrees. It's really hard to get files which give the same energy in GROMACS and AMBER. Really difficult. But once we've got this, it's all unit-tested and of course running automatically all the time. It allows you to do triangular transformations: you can literally read GROMACS, write it out as AMBER, write it out as PSF, read it back in as PSF. Through these cycles, as you change between molecular file formats, you should always have the same energies. Design principle number three: don't be too clever. Being clever is the enemy of pretty much every piece of software ever written by humankind. I really hate software that tries to be clever. Say, for example, I load up an AMBER rst file. That AMBER rst file definitely cannot be written as a GROMACS top file, because it's coordinate data, not topology information. So raise an exception; tell people straight away. If there is information that I need over here which is not available over there, don't guess it. Don't think "the user may be missing something, I'll helpfully suggest, Clippy-style, something that might help". Just say: no, I can't do it. It's about ambiguity: only do the transformation when it is completely unambiguous. But hang on a minute. What exactly is a molecule? A molecule is a collection of Property-derived objects. Everything in the system is derived from Property.
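The round-trip energy check described here can be shown in miniature: compute an energy, serialise the parameters as a file writer would, read them back, and demand the energy is unchanged. The harmonic bond form and the numbers are illustrative only.

```python
import math

# Toy "internal engine": a harmonic bond energy, E = k * (r - r0)**2.
def bond_energy(k, r0, r):
    return k * (r - r0) ** 2

def write_params(k, r0):
    """Serialise parameters to text, as a file writer would."""
    return f"{k} {r0}"

def read_params(text):
    k, r0 = (float(x) for x in text.split())
    return k, r0

# Round-trip validation: the energy before writing and after
# re-reading must agree to within a tight tolerance.
k, r0, r = 310.0, 1.526, 1.60
e_before = bond_energy(k, r0, r)
e_after = bond_energy(*read_params(write_params(k, r0)), r)
assert math.isclose(e_before, e_after, rel_tol=1e-12)
```

The real test triangulates the same idea across whole systems and several formats (GROMACS to AMBER to PSF and back), but the invariant being checked is this one.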
A molecule is a key-value dictionary of properties. So, for example, "charge" is a key whose value is AtomCharges, a data object which contains the charge of every atom. "LJ" is a key for AtomLJs, a data object that contains all the Lennard-Jones parameters. "element" maps to AtomElements, a data object that contains all the elemental data; "mass" to the masses; "connectivity" to the connectivity; "bond" to the two-atom functions; "angle" to the three-atom functions; "dihedral" to the four-atom functions. It is a key-value dictionary. But the key thing with this dictionary is that it's completely arbitrary: I can add whatever keys I want. I can add a key called "fluffy cat" and associate it with an AtomLJs object, and that's a perfectly valid molecule. We don't do ontology. It's a container that can contain arbitrary data; we're not telling people what to name things. But this means that when we do relative binding free energy calculations, we can have "charge0" and "charge1" to represent two states of the charges. We could call the elements "elements", a silly name, or "atom masses", and so on. What we also do is make sure that a molecule is itself a property, and because a molecule is a property, molecules can contain molecules. So you can actually ask: what was the reference molecule? Where did this come from? When I parameterised this, what was the original molecule from the PDB? Store that in the molecule. Then take this one stage further: the molecule is not actually the thing you're holding. What you're holding is MoleculeData. MoleculeData is the thing that holds all of the data of the molecule, and Molecule is just a view of the data where you view the entire molecule at once. Atom is another view of the data, where you view only one atom. Residue is a view of the data where you view one residue.
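The property-dictionary idea, and the separation between the data and its views, can be sketched like this. The class names echo the talk but the code is an illustrative assumption, not the real implementation.

```python
class MoleculeData:
    """Holds all of the molecule's data: arbitrary, un-ontologised
    key -> value properties. No fixed schema, no ontology."""
    def __init__(self, **properties):
        self.properties = dict(properties)

class Molecule:
    """View of the whole molecule at once."""
    def __init__(self, data):
        self.data = data
    def property(self, key):
        return self.data.properties[key]

class Atom:
    """View of a single atom over the same shared data."""
    def __init__(self, data, index):
        self.data, self.index = data, index
    def property(self, key):
        return self.data.properties[key][self.index]

# Arbitrary keys are fine: "fluffy_cat" is a perfectly valid property.
data = MoleculeData(charge=[0.5, -0.5], fluffy_cat=[1.0, 2.0])
mol, atom1 = Molecule(data), Atom(data, 1)

# Both views share the same underlying data, so a change to the
# model is visible through every view.
data.properties["charge"][1] = -0.4
```

Residue, chain, and segment views would follow the same pattern: each is just a different window onto the one MoleculeData.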
What it means is you're actually working in the model-view-controller way: the model is the data model, MoleculeData; the views are your Molecule, Atom, Residue, chain, segment, cut group, whatever; and for the controller we actually have editor versions of all of these, because once you've loaded something you want to edit it: add atoms, change things, merge things together. We let you do all of that. So that's the molecule. What is a system? A system, shock horror, is yet another property. A molecule could contain an entire system if it wanted to, but it's much better to have the system containing molecules. The system contains properties: its space is a periodic box, and the periodic box, again, is a property; the time is a time; a molecule group, shock horror, is a property that contains lots of molecules, and so "all" is all of the molecules in the system. But because the data format does implicit sharing, "protein" can contain just the molecules in the system that we designated as being protein, while the molecule data is not duplicated; the molecule is still only there once. The system understands, underneath, the concept of what is the same data, so you end up tagging data all over the place, aggregating it, and so on. And as I said, a system is a property, which means that one system can contain other systems, which can contain molecules, which can contain systems, going on in infinite Inception-style recursion; but as a data format, we don't mind. This brings us to design principle 4: don't change anything. If the user gives you something, don't change it, because the user gave it to you for a reason; they like it. If you start messing around with it, they're going to get shirty. So if they give you a GROMACS topology and coordinate file and they want you to run MD on it, give them back a GROMACS topology and coordinate file. Don't give them back an AMBER prmtop file, because that will surprise them. This means that if I only have pmemd.cuda on my system and they give me GROMACS files, I just translate them into AMBER top and crd behind the scenes, run the simulation, and when it comes back out... ah, we saved the original file format as a key-value pair in our system, so we know it needs to come out as a GROMACS top, so pop it out as a GROMACS top. That way you can take an existing node in the workflow, where the same file format goes in and the same file format comes out, replace it with our own code, and no one is any the wiser. But "don't change anything" goes further. This person is a bit silly: they've got alanine 5, HIP 20 and ARG 15 as their residue names and numbers. We all know that's rubbish, and indeed if you throw that through pmemd.cuda, pmemd.cuda says "that's not really nice, because I want contiguously numbered residues and I want to clean everything up", blah blah blah. But the user had a reason for being so silly with the way they named their residues, and so when you go through this process you record the mapping of the original residue names and numbers, so that after it has run through the tool you restore exactly what the user gave you. Again: don't change the residue numbers. Who here has run an MD package and had all the residue numbers change behind the scenes, so that suddenly you've been analysing the wrong things? Again, don't change anything. "Don't change anything" also implies "don't add things". If a user gives you a molecule that is missing hydrogens, it is not because they forgot the hydrogens; it's because they really want to simulate something without hydrogens. If you add hydrogens behind their back, that's going to confuse them. If you really need something and it doesn't exist in the input, you're missing that information: raise an exception, tell the user they need to supply more information. Don't automatically add the information just so the simulation will run. Again, this is really design principle 3, don't be too clever. You place a lot of trust in the user, but this is
actually where your pre-filters come in: the tools underneath should not be dealing with silly users. That ambiguity handling is done by your user interface; once you're below the user interface, you're in a world where you do exactly what you've been told to do. Design principle 5: store general, write specific. We have a complete computer algebra system built into this package, because way back when we needed it for Monte Carlo simulations. This means we store the intramolecular potential terms as algebra: a bond is literally represented as, say, 1.3 × (r − 0.1034)². I can differentiate it, I can integrate it, I can do whatever I want to it, and we have an API that lets you search the algebraic expression and work out how to convert it into different forms. All of these bonds and angles are stored as algebra, and only when I need to write them out do we try to create, say, a GROMACS bond object, which inspects the equation and asks: is this one of the bond types that GROMACS supports? If it is, this is the data structure of a GROMACS bond; it has only three parameters, or rather four parameters and a function type. It's only at the point of writing that we create something specific. And again, this is an AMBER bond: it has only two values, because AMBER has very limited bond potentials in comparison. But by doing this you aren't burdened with the artefacts of any one program, because you store general and write specific. You'll be happier if we group these two together. Design principle 6: units are important. Numbers do not exist in isolation; as we were taught in high school, your numbers have units. So all numbers, all quantities, have units attached, and we have a complete units system in there, a complete units system in which you can do algebra on the units. That means you can literally write 10 × kcal / (mol × angstrom × angstrom) and it works. The reason this is so fundamentally important is that if you're trying to write a program that can read GROMACS topology files and AMBER topology files, you have to deal with SI and AKMA units, and you end up with a horrible mess if you're not working with a proper unit system. Design principle 7: don't assume, ask. It's the same thing again: if you need something, ask the user; don't try to calculate it. Whatever is input is complete. Don't add hydrogens that appear to be missing, because the user may have wanted it that way. Don't do things behind the user's back because you assume they were stupid; your program is more stupid than your users. Raise an exception if you can't deal with what you've got; tell the person they've got an error. Again, this is really related to design principles 3 and 4: don't be too clever, and don't change anything. Now, this is all beautifully implemented in C++ with Python wrapping, so we have a full Python API and a full C++ API. It's very easy to install, ish; it will be easier after we submit the D3R results, because we're publishing the whole thing so people can reproduce our D3R calculation, which I'm assuming will not do well, because it's binding free energies. But we have a lot of challenges. This all assumes the tools underneath are interchangeable, and they're not. The underlying tools choose different places to store or represent key information. For example, SHAKE: AMBER represents SHAKE by putting an additional bond between the hydrogens; in GROMACS you put it in your original water model, which can be found in three different locations depending on what mood you're in: a flexible define, a settles block, or a constraint algorithm, whatever you want to do. So actually working out where the information needs to live, what is molecular data, what is configuration data, what is simulation data, is very difficult. Another challenge: the underlying tools are not modular. I would really like to solvate my protein using solvate from tLEaP, because, sorry gmx solvate, you do a pretty poor job compared to tLEaP's solvate. tLEaP's solvate is
brilliant. Unfortunately, with tLEaP's solvate you've got to parameterise the protein and solvate in one step; you can't separate those two stages. And that's annoying, because you think, okay, just parameterise the protein, it doesn't matter. But what if my protein doesn't have any parameters in AMBER? I've suddenly got a problem, particularly if I'm working with a protein that's already parameterised in GROMACS; it's a complete waste. These tools should be modular, and they're not modular enough. The tools are not robust: if I take the same input protocol and the same input files and run them with pmemd, the simulation will crash with SHAKE errors; but if I run it with sander it's fine, and with GROMACS it's fine. So we don't have sufficient robustness in the tools to just drag and drop different tools in behind a standard protocol. And then there isn't a perfect match of algorithms. Once we've finished the D3R run with SOMD, we're going to run D3R again using GROMACS and using pmemd.cuda. They will give different answers, and they'll give different answers because they have different implementations: different integrators, different SHAKE algorithms, different force field handling, different PME implementations, different cutoff schemes. I could go on and on. These codes are not drag-and-drop replaceable, so there is no point trying to claim reproducibility if we're going to change codes; and indeed, because they keep updating themselves, they're not even drag-and-drop replaceable from version to version. So we've built something that addresses this. We hope... well, it's running at the moment, and we should hopefully finish and submit in time for next Thursday. These are the scripts that are running; you can download them from the D3R website. And just to thank the BioSimSpace research team: it's actually a big collaboration between CCP-BioSim and HECBioSim, together with collaborations going outwards. But with all of these things there are beautiful pyramids, and at the bottom of the pyramid the two people doing the work are Lester Hedges, my research software engineer, and Antonia Mey in Julien's group. There are many, many links if you want to look at it and play with it. It's not the solution to the problem; it's just a picture of what a solution, and the inspiration for a solution, could be. Thank you.

[Question] I'm wondering: there are some other packages, like HTMD I guess, that are perhaps not doing the same thing. Can you comment a bit on the different packages and their rather different views?

[Answer] I think there are lots of different packages, and the view we deliberately took with BioSimSpace was that we were not going to reinvent anything: if a tool exists, we will try to use that tool. So we didn't do anything with trajectory analysis, because there's already MDAnalysis and MDTraj. We also took the view that we don't want to disrupt somebody's workflow, which is why we're focusing on workflow nodes: whatever your input file and input description was, that should work with our node; it's the same format you wanted to give as input. So in theory, if you have an existing KNIME workflow in industry, for example, which is where our partners are, you could pluck out something that was currently doing a hard-coded batch script to run NAMD, plug in a BioSimSpace node, and in theory it will keep working in the same way, because it takes the same input format and produces the same output format. Generally, we very much don't see this as replacing anything; it's about fitting in with a community of other tools. And in terms of longevity, the actual underlying code has been going for about 12 years now; we just haven't been very noisy about what it can do, and we have long-term plans for how it works. The software itself is built on top of something which... we now work with Microsoft, Google and Oracle on this huge cloud project, where you'll actually be able to have
simulations like the notebooks I showed. If you look at my talk slides, you'll see what's actually happening: data is being put on object stores, compute is being allocated in the cloud next to the object stores, and it's all running there. You have an identity service, an accounting service, and an access service, which let you pay for simulations up front, pay for the storage of data, and pay for the long-term archiving of data. That moves us to a world where, instead of logging on to a cluster over SSH and asking it to do things, it's all interactive notebooks. You can then give a DOI to the notebook, publish it, and you have an executable paper that contains the description, all the input, the grabbing of all the resources, all the analysis, all the graphics, and all the conclusions. That's the kind of 2025 vision we're moving towards. All right, let's thank Chris again.