Good afternoon everyone. I hope you're all digesting your lunch well and are ready to hear about how we use Python in the biomolecular sciences. Personally, I'm a postdoc at the University of Edinburgh in the chemistry department, and I work together with Lester Hedges and Christopher Woods, based at the University of Bristol, who are both research software engineers. [unintelligible] There are more details in a second. [unintelligible]

This is a protein, or at least a cartoon representation of one. Biologists or biochemists will go and run experiments where they crystallize this, and then they do an X-ray experiment and get x, y and z coordinates for all the atoms of this protein; that's why we have this representation. These things aren't very big, so that's about 42 angstrom in diameter. What are the questions you then ask? How does this particular smaller molecule actually bind to my protein, for example. What I'm actually showing here is a protein called cyclophilin, and this is a molecule called cyclosporin; it's an inhibitor, and a drug you get if you have a liver transplant or any kind of organ transplant.

So this leads to what the common uses of biomolecular simulations are. We run them because experiments fail to tell the whole story; ideally we want to predict experiments, and we want to have a sort of cross-validation of what is actually going on at the molecular level. A typical question could be: how well do, I don't know, 10,000 of these small molecules inhibit this particular protein? Then I don't have to make them all in the lab, I just have to make the five good ones in the lab. MD, it's molecular dynamics, I'll get
to that. I'll explain what MD actually means. Yeah, we'll get there, I'll actually explain this.

So in fact what we do is we run simulations, these molecular dynamics simulations, where we take our cartoon structure and see it wiggle about, essentially. Then we can ask things like how do proteins fold, or how do proteins actually interact with each other, because in our body we have so many proteins and all they do is interact with each other all the time, but we're interested in what actually happens on the molecular level.

Here's an overview of time scales in biology. This is a very busy slide. You have various experimental techniques to probe biology, and you have various computational techniques to probe biology. In order to know something about structures, as I already said, you do something like X-ray crystallography, where you actually freeze your protein and then, as you would in a GP practice, take an X-ray of your protein and get some structural information. You might also be interested in dynamics information, meaning which bits of the protein move in some way. This is where nuclear magnetic resonance comes in, and different types of experiments can probe different time scales. In a protein you have certain vibrations, so how fast do the bonds between atoms vibrate? Side-chain rotation: proteins are made up of amino acids, these are the building blocks, and how do they move about? Then ligand binding happens at time scales between sort of 10 to the minus 7 seconds and seconds. How does catalysis happen? Proteins are often enzymes, so they speed up a reaction in your body. These are all things you can study with experiments, but you can also study them with these molecular dynamics simulations, or in some cases you might have to resort to quantum mechanical descriptions of the protein, but this is not what I'm talking about right now. So molecular dynamics can probe time scales from 10 to the minus 12 seconds to about 10 to the
minus 3 seconds, pushing it sort of to a second, but that's very uncommon. Then you have a vast zoo of experimental techniques that help you with structure determination and some dynamics, and then you can compare the two.

So how do we actually do these MD simulations, or why do we want to do them? A typical 200 nanosecond protein dynamics trajectory generated in a computer looks like this. In particular we're just focused on these four amino acid side chains, which I've highlighted here, and they wiggle about. You can now go and look at a particular time trace of, say, the dihedral angle here of this phenylalanine ring, and then you can do some statistics and data analysis, and it's all great.

So what are the ingredients to actually run this kind of simulation? A molecular dynamics simulation is essentially this: you take a box, and into this box you put a protein and some water molecules. Then you need to find a way of describing your protein, so its atom coordinates. The protein has, I don't know, 4,000 atoms or something, and they interact with each other, so you have bonds, you have angles, you have dihedral angles, and you have electrostatic and Coulombic interactions. This is kind of what they look like, and you call them force fields, and different types of atoms have different parameters. It gets very complicated, but basically you've defined your potential, your forces between your atoms, by this force field. Then you use the forces to integrate your system over time, using Newton's equations of motion and a leapfrog integrator or something like this. Then you run your dynamics, and you hope that the ensemble average is equal to the time average, so that you have stationary observables. So you can look at the statistics of a dihedral moving, or side-chain movements, or whatever you might be interested in: how a ligand binds to a protein, what the interaction energy is, and so forth. Yes, molecular dynamics, yes.
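The integration step just described can be sketched in a few lines of Python. This is only a minimal illustration, assuming a single particle on a one-dimensional harmonic "bond" in place of a real force field; the function names, the spring constant, and the time step are hypothetical choices for the example and are not taken from any MD package mentioned in the talk.

```python
def harmonic_force(x, k=1.0, x0=0.0):
    """Force from a harmonic 'bond' potential U = 0.5 * k * (x - x0)**2."""
    return -k * (x - x0)


def leapfrog(x, v, force, mass=1.0, dt=0.01, n_steps=1000):
    """Integrate Newton's equations with the kick-drift-kick leapfrog scheme.

    Returns the positions and velocities at every full time step,
    including the initial state.
    """
    xs, vs = [x], [v]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half kick with the current force
        x = x + dt * v_half                # drift to the new position
        f = force(x)                       # recompute the force there
        v = v_half + 0.5 * dt * f / mass   # second half kick
        xs.append(x)
        vs.append(v)
    return xs, vs


def total_energy(x, v, k=1.0, mass=1.0):
    """Kinetic plus potential energy of the harmonic toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


# Start stretched by one unit, at rest, and follow the "bond" vibration.
xs, vs = leapfrog(x=1.0, v=0.0, force=harmonic_force)
energies = [total_energy(x, v) for x, v in zip(xs, vs)]

# A time average over the trajectory, the kind of observable one would
# hope matches the ensemble average in a real simulation.
time_average_x2 = sum(x * x for x in xs) / len(xs)
```

In a real simulation the force function would sum bond, angle, dihedral and non-bonded terms over thousands of atoms, but the update loop has the same shape. The total energy staying nearly constant along the trajectory is the usual sanity check for this kind of integrator.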
Yes. So what is a typical workflow? It's very complicated, and the whole field of molecular dynamics has been developed in the academic world, mostly over the last 30 years. Usually you go through a bunch of prep steps, then simulation or trajectory data generation steps, and then an analysis step. So: download a PDB. PDB stands for the Protein Data Bank, which is where all the crystallographic data is stored. Usually that file is not very well behaved, so you need to do some work on it. Then you need to actually generate this box with solvent and so forth. Then you actually run your integration, your dynamics, and then you have all these x, y, z coordinates of your trajectory that you need to analyse in some kind of meaningful way.

For this purpose an entire zoo of different software has been generated over the past years, mostly in academic groups. Some of the tools are Python based; most of them are a mix of, I don't know, Tcl scripting and Perl scripting. These are the simulation toolboxes that actually run the molecular dynamics, i.e. the Newtonian integration steps, which are mostly written in C++; some of them have Python APIs, some of them don't. Then you have a vast zoo of analysis scripts or libraries; the more modern ones are written in Python, most of them are things like Tcl, Perl and whatnot. Then you have the problem of various coordinate files, trajectory files and force field files that are obviously not standardised across all the different tools. Some tools can read more of these file formats than other tools, and it's not very easy to interconvert between these file formats, so oftentimes people will just write their own parsers.

So here's a very simple scenario. I, as a researcher, want to run simulations using a particular simulation toolbox called AMBER, this is their amazing logo, and my collaborator has given me a coordinate file in GROMACS coordinates, a different
simulation tool. So typically what we would do is: I visualize the coordinates to make sure that they are not stupid, then I would have to convert the file to a format that can be understood by AMBER, then run the whole setup where I generate the box and the water, and then actually run my dynamics. So I would use one particular tool, called VMD in this case, to visualize my structure. Then I would save it to a different file format, in this case the Protein Data Bank format, either using VMD, the visualization tool, or I could use a Python tool to do so, or I could use five other tools that also do this. Then I can run the simple setup, which is taken from this AMBER tutorial, where you basically go from your file through a bunch of, essentially, bash scripts to get to the point where you can actually run your dynamics and get your x, y, z trajectories out. You can do that with AMBER on its own; you could use a tool that substitutes the many bash things with a one-line command-line type thing; or you could use an online web app that generates a Python script for you, which you could run to do the setup, and so forth. There's even more than that. Then eventually you can run your actual simulation, which would usually be from a bash terminal, running a command like this.

So it's all quite complicated, and particularly if you start out you're very confused, and it's like, why do I have to know all these things? The problem is that most of these tools have grown organically in different labs, and there's not a lot of communication. There's a lot of hacky bash scripting, where you inherit some bash scripts from a previous PhD student who never really tested them properly. Then you have to be essentially an expert in many different types of software in order to be able to do all the things you want to do. And if you find a problem which the software you know can't solve, you get into this Google trap where you search for
something you're trying to do, and then you find on ResearchGate or on Stack Overflow, here's how you do it. You try it and it doesn't work, and then you go to the next solution and you try it and it doesn't work, and eventually maybe you find something that works. But basically you lose the focus on the science you're trying to do and end up just trying to use all the tools.

So this is where BioSimSpace comes into play, and this is the Python code I want to talk about. Basically, all this complicated workflow I've just talked about can be condensed into these seven lines of Python. Yay, exactly. It was a big build-up, but hey.

The idea is that we're not rewriting all the underlying tools; we're just wrapping around them and making it very easy for an academic user who has some Python experience and some scripting experience to interact with an API that lets them focus on the science. So in this case we import BioSimSpace, and then we can read this GROMACS file, we can visualize it, we can get the molecule, we can then parameterize the molecule with this force field, we can solvate it, and then we can run it. And it doesn't matter that it was a GROMACS file, we can still run it in AMBER; it's completely agnostic to this.

At this point I will go to a live demo of how this works, hopefully. Basically, everything we've prepared at the moment runs as a Docker image of the software on either an Oracle cloud service or Azure, and you can try it out yourself if you want to. So we import the library, and then in this case I'm reading in a coordinate file and this force field file in order to define my molecular system. Before, I had to go to my bash console in order to open VMD; now I can just look at it in the browser, in the Jupyter notebook. So you see this box, and there's a little molecule in there. I can also look at a particular molecule if I
wanted, so I can do something like so. I know the zeroth molecule in this case is actually the peptide I'm trying to simulate, which is alanine dipeptide. It's not a very interesting system from a biological point of view, but for demonstrations it's great.

Then we can run, in this tool, a typical simulation workflow, which would be minimizing this water box, running an equilibration where we actually get the temperature of the system to 300 kelvin, and then doing a production run, from which we take the trajectory to do our analysis. We've implemented a sort of standard default protocol, which can easily be overridden by any expert user. Basically what you define is this protocol, and in this case we're running an equilibration. You have all these default parameters that are automatically set, or you can set them yourself if you want. In this case we're running for 0.05 nanoseconds, so 50 picoseconds, and we're raising the temperature from zero kelvin to 300 kelvin, and we don't restrain the backbone, which is a very technical term in this case.

So we define the protocol, then we define a process. We want to run an MD process, we don't run it yet, and we give it the molecular system and the protocol. Then we can simply execute this using sander. Basically, BioSimSpace will look in your path, see what tools are available, and choose the one that is best suited for the job you're trying to run. You can give it a working directory; if you don't, it will create some temp directory and write things there. It then actually writes these files, so md.amdr and md.rst7. Sorry, the md.rst7 and the param7 are the two input files I gave it, but md.amdr is the configuration file that was auto-generated by the protocol, because someone decided that this is a best-practice protocol for your equilibration. If I didn't like this particular
protocol, I could just give it a config file instead, and I don't have to deal with it. I can look at this config file, and yes, you need to be kind of an expert in order to understand what's going on in there. If I don't care about these things, because I know someone decided that this is a good protocol, great: I can just run the dynamics without having to understand all the different bits.

I can then look at the argument string for the command line that actually runs my simulation. I can also get these arguments, which returns a dictionary, and I can unset them or change them as I want. So I can set this -o flag here to false. It's now false; if I look at the arguments again, the -o bit from up here is gone. And I can just reset to the original in case I messed it all up.

Then I can start running this process. This is now running on an Oracle cloud instance, and I can query various things about it: is it running, how long has it been running for, and then obviously the more interesting simulation parameters, such as what is the total energy of my system. At the moment it's minus 6494 kcal per mole, and I can monitor an update of this. Okay, so the time moves up. I can also interactively plot time series of what's happening. Here I'm plotting the time versus the temperature. We said we're slowly going from zero to 300 kelvin; so, I don't know, there's a spike because that's part of the algorithm, and then we're slowly ramping up. Then we can look at how the total energy changes over time.

The main thing you're usually interested in is the data analysis, where you use the trajectory's x, y, z coordinates. So you can get your trajectory, and, I'm going to stop here with the demo, you can access the trajectory data in different tools. MDAnalysis is a Python tool for trajectory analysis, as is MDTraj, so we
wrapped around both of them, and you have complete access to either of them, essentially. And then you can write out stuff. Yeah, okay.

So what is BioSimSpace? Basically it sits in the simulation setup, simulation run and analysis layer, but not so much in the cleaning up of the X-ray data; there's quite a lot of commercial software available that does that really well, so we didn't want to dive in there.

A very quick overview of the API. We try to be as clean and obvious as possible. Protocols are the things you want to do, like an equilibration or a production run, whatever you could think of. MD is the dynamics; this could also be Monte Carlo, and it has a lot of potential to be extended. IO is anything that has to do with writing and reading files. Gateway contains a lot of information, such as units and how to handle the processes. Trajectory contains the trajectory data, and Process is obviously the guy that manages all the processes, essentially.

So what is BioSimSpace? It's an interoperable tool for biomolecular simulations. It encompasses system setup, trajectory generation and simulation analysis, and the aim is to support underlying existing software. In summary, it is a Python API that allows you to write workflow components for biomolecular simulations. The idea is that you can really focus on the science you're trying to achieve, and not so much on knowing how to use the software. We're very much pushing for cloud use, so that you can basically spin up an instance of a GPU cluster, send your jobs there, it runs them, and they come back. And we're planning support for workflow managers such as KNIME and the Common Workflow Language; I don't really know if any of you use these. But it's not trying to be a workflow engine. We don't want to manage all the workflows, we just want to be able to write
a tool that allows you to do so. It's not a top-down approach, so we're not trying to reinvent the wheel; we're just trying to make it easy for different software to communicate with each other. And it's definitely not finished. We started this project in January this year, and it's got funding for two years, and we hope that people might be interested in getting involved. It's all open source. Oh yeah, it's on GitHub, I think, there we go, it's github.com/michellab/BioSimSpace. It also has a website, and the cloud server and the Docker image are all available for download and playing around with, essentially. All the software is online, all the code is online. And that's it. Thank you very much for your attention, I'm happy to receive questions.

Thanks very much. Are there any questions? Five minutes.

Hello, I have one rather weird question. I also work in the life sciences and can witness a lot of hacky code; biologists especially don't really take care over nice coding. My question is: why does the Python example code you showed not follow the PEP 8 naming conventions?

Yes, excellent question. This has legacy reasons, which aren't great. BioSimSpace builds on a code called Sire. Sire is a biomolecular software library written in C++, with its entire API exposed to Python through Python wrappers, but obviously it was written in C++ style. So we decided to keep everything in the C++ style and not go with PEP 8, because it would have ended up being some weird mixture of everything. So rather just go with one style, then.

Yeah, thanks. Any other questions, anyone?

Yeah, very nice talk. I would like to ask about the submission of the simulation tasks, because I'm also a PhD student in biomolecular chemistry, and my colleagues are working with very weird experiments, trying to change the parameters and do a lot of programming, changing
the dynamics, and we also have clusters across the whole republic for [academic?] purposes, and you need to submit a task or a job to the queue and then retrieve the result. I think that different environments can be tricky to approach. Do you have any idea how this can be solved?

So, I mean, submitting: the idea is that you would write, I don't know, 100 lines of Python code with this API, and then you have a Python script, and you can just submit that with, I don't know, qsub or whatever scheduler you use, to a cluster. I think this is [unclear] or something like this; I mean, it doesn't really matter. Basically you just call a Python script; you just have to have the software installed on the cluster, I suppose. What we're particularly interested in is generating artificial cloud clusters. At the moment we use a lot of GPU computation, because GPUs are fast for the kind of purposes I want to use them for. So we just spin up a GPU cluster in the cloud, it's alive for the time of the simulation, and then it's shut down again. We also have a local cluster, but I barely use it anymore, because it takes forever to get anything through it.

Yeah. And do you know about ELIXIR?

Sorry?

Do you know about ELIXIR?

No, I've not heard about it.

It's a European Union initiative for biomolecular and biological tools. They have a database of these tools, you can ask for help, and you can promote your software.

Okay.

It's supported by the European Union, and you can find out how your software can be moved forward.

Okay, cool. ELIXIR?

ELIXIR, yeah. I can help you with that.

Thanks very much.

Sorry. I attend these meetings because I'm a Python programmer and my wife is a biologist, and I keep seeing her doing some simulations, which I actually don't understand. But one time I spotted some keyword just jumping out at me, which
is Python. So, if I'm not confused, I think she's using... she was at the LMB in Cambridge, the Laboratory of Molecular Biology, with lots of Nobel Prize winners. I think that in their institute they use some simulation tools that seem to be commercial; I'm not sure.

So there are very few commercial simulation tools, at least for biomolecular simulations, because only in the last two years or so have pharma companies, who are probably the main target for this, shown interest in molecular simulations. The problem is that these simulations do take quite some time, so you have to wait for a week or so to get your trajectories back, and that just wouldn't fit the timeframes of pharmaceutical companies doing lead-to-target optimization. But this is changing now.

Sorry to bother you guys. If she wants to use a simulation to simulate the interaction of a small molecule with some target, can we use this?

Yes, the underlying software we support can do this.

Yeah, okay.

We're out of time now, but can we thank [the speaker] again.