Welcome to this tutorial on performing high-throughput molecular dynamics with Galaxy. The aim of this video is to provide you with an introduction to molecular dynamics simulation and analysis, in particular using the Galaxy data analysis platform. My name is Simon Bray from the University of Freiburg in Germany, and this presentation has been prepared together with Chris Barnett from the University of Cape Town in South Africa, and Tharindu Senapathi from the University of Sri Jayewardenepura in Sri Lanka. So, a brief word about the structure of the video. It will be split into two parts: the first on molecular dynamics simulation by me, and the second on MD analysis by Chris. I will start by giving a very basic 10-minute introduction to MD simulation, followed by a demonstration in Galaxy. I'll also show you how we can implement high-throughput MD simulations using Galaxy's concepts of workflows and collections. Then Chris will do the same for analysis: he'll give a short introduction to the concepts behind MD analysis, followed by a demonstration in Galaxy. There are no strict prerequisites for this tutorial, but it would be good if you have some basic knowledge of the Galaxy platform. We won't explain basic features in this video. If you want to get yourself up to speed, we recommend following the Galaxy 101 for Everyone tutorial, which you can find linked on the website of the Galaxy Training Network.
All right, let's get started with the first part, an introduction to MD simulation. Molecular dynamics, as you may well know, is a computational technique for molecular simulation. The reason why we are generally interested in using it is that it provides a very high level of temporal and spatial resolution in comparison with the large number of experimental methods which are also used to provide information about the positions or the motions of atoms and molecules.
For example, we can use techniques such as X-ray crystallography, which provide a very high spatial resolution but essentially a static picture of the molecule; they don't show the motion. We can use techniques like FRET, which give us some information about how the molecule changes in time, but not at a very high temporal resolution. MD simulation allows us to leave the world of experiments and to simulate, as long as we have sufficient compute resources, at unlimited temporal and spatial resolution. So that is more or less the rationale, the reason why we are interested in MD. The principle is that we are simulating the atomic motion with Newtonian physics, so in general, more precise methods which involve quantum mechanics are not used.
So now a very brief introduction to the physics behind molecular dynamics. As I just mentioned on the previous slide, MD works on the basis of classical mechanics. We treat the atoms and bonds in the system as point masses connected by springs, and therefore we can treat them as simple harmonic oscillators. The atoms also have a charge associated with them. For each time step, we can calculate the potential energy U, which is made up of two components, the intermolecular and the intramolecular potential energy; these are the interactions between molecules and also within individual molecules. Then over the course of a single time step, a typical length being on the order of a femtosecond, so 10 to the minus 15 seconds, we can calculate all of the forces which are acting on each of the atoms from the potential energy. If we know these forces, we can then calculate how the positions of the atoms change over the course of our time step. At the end of this time step, we have a new structure with new positions, and we can repeat this process thousands or millions of times.
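The update loop just described can be sketched in a few lines of Python. This is only a minimal illustration and not how GROMACS is implemented: it assumes a single particle on a one-dimensional harmonic spring, arbitrary reduced units, and the velocity Verlet scheme as the integrator. The function names `harmonic_force` and `run_md` are invented for this sketch.

```python
def harmonic_force(x, k=1.0, x0=0.0):
    """Force derived from the spring potential U = 0.5 * k * (x - x0)**2."""
    return -k * (x - x0)


def run_md(x, v, mass=1.0, k=1.0, dt=0.01, n_steps=1000):
    """Propagate one particle with velocity Verlet and collect the positions.

    All quantities are in arbitrary reduced units (an assumption of this
    sketch); a real MD engine does the same for every atom in three dimensions.
    """
    trajectory = [x]                      # the "frames" of the trajectory
    f = harmonic_force(x, k)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass  # half-step velocity from current force
        x = x + dt * v_half               # new position after one time step
        f = harmonic_force(x, k)          # recompute the force at the new position
        v = v_half + 0.5 * dt * f / mass  # complete the velocity update
        trajectory.append(x)
    return trajectory, v


# Start displaced from equilibrium and at rest, then run 1000 steps.
positions, final_velocity = run_md(x=1.0, v=0.0)
```

Each pass through the loop is one time step: forces from the potential, new positions from the forces, and one more frame appended to the trajectory.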
In this way, we accumulate a so-called trajectory, which is just a sequence of these frames. So you can view what we do here as collecting a video which shows the motion of all the atoms in the system over time. One consequence of this, which I think is quite well known, is that MD has a high computational cost. If you're dealing with a system with thousands of atoms or even millions, then calculating all of this just for a single time step is already computationally very costly. And if we're dealing with millions of time steps, which we need to get to a substantial trajectory length, then the computational cost gets very high.
There are plenty of applications of MD. We will focus mostly on MD of protein-ligand systems, that is, how proteins and ligands interact with each other. But it can also be used to study, for example, protein folding and conformational changes in proteins, and it also has applications in materials science. One thing that we want to focus on particularly in this tutorial is how we can scale up our MD simulation and analysis to a high-throughput level. So we're not simply calculating the simulation of a single ligand, but we can calculate, say, for 10 or 100 ligands against the same protein in a single run. This is something which the workflow management system that we're using, Galaxy, is particularly well suited for.
There are a lot of questions that you have to think about when setting up an MD study, and these will be discussed in the tutorial. You have to think about how to parameterise your protein and your ligand; these have to be done in two separate steps, as we'll explain. In general, you want to think about solvent, because biochemistry takes place in solvent, that is, water. Should we add particles with charges, sodium ions or chloride ions, for example? Is the water equilibrated around the system? We generally include special preliminary simulations to ensure that the water is correctly equilibrated.
Is the system starting from an energetically minimised state? That is something that we also need to consider. And finally, you might finish your simulation, but what do you want to do with it next? You want to perform some kind of analysis, and then what you might want to think about is things like RMSD, RMSF and PCA, and we will guide you through some of these techniques later on.
So to turn now to Galaxy, there is a wide variety of open source tools available in Galaxy for molecular simulation. You'll be using GROMACS in the tutorial, and for analysis, tools such as MDAnalysis and Bio3D. You can access these tools via the European Galaxy server, which you can access at cheminformatics.usegalaxy.eu. That's what we recommend for this tutorial. There is also a South African server; I think it's provided on the screen. You're welcome to try that out as well. And to launch your own Galaxy server is also pretty straightforward, so if you're interested in doing this and making use of your own compute resources, then you can also try this out.
So what you see on the screen is the webpage of the Galaxy server that we'll be using for this training session, cheminformatics.usegalaxy.eu. We assume that you have an account there and you can just log in, and either follow the steps in the video or follow the training material by yourself. To get to the training material, you click on this little hat symbol at the top of the screen, and you'll see this page. Then navigate to Computational chemistry and then High Throughput Molecular Dynamics and Analysis. This is the page which describes the whole of the training that we'll be going through in this video. We have the introductory section, which provides some background information about the protein that we'll be simulating, heat shock protein 90. It's a chaperone protein which is involved in helping to fold proteins after synthesis.
There's some background information about this which is not crucial for this tutorial, but may be nice for you to read through if you're interested. We have a diagram of Hsp90 with a ligand bound, and you can click to view it in the NGL viewer which is embedded into Galaxy.
So let's get on with starting the analysis. The first step is to collect the structure of the Hsp90 protein. We can do this using the Get PDB tool which is available in Galaxy. Just take this accession code, 6HHR, which refers to the protein structure, and then let's find the tool. It's a very simple tool: you just paste the code into the PDB accession code field and press Execute. Already we have the dataset appearing in our history panel, so we can rename the dataset to something sensible. Okay, so now our PDB file has successfully been loaded into Galaxy.
Let's continue with the next step, which is preparation of the topology that we need for simulating. As described in the training material, we have to do this in two separate steps, for both the protein coordinates and the ligand coordinates. So first of all, what is topology generation? The simulation software that we're using, GROMACS, makes a distinction between the constant and the dynamic attributes of the atoms in the system. Constant attributes would be things like atomic charge and the bonds which connect the atoms, and the dynamic attributes are things like the positions of the atoms, or the velocities, or the forces associated with atoms, which can change during the course of the simulation. The GROMACS software expects that the constant attributes are stored in the topology file, so this TOP file, and the dynamic attributes are stored in structure files, so PDB files or GRO files. The PDB file has some of this information, but not everything that we need. So we have to carry out an explicit step in which we calculate these parameters for both the protein and the ligand.
And here's a small question which you might want to have a think about: why do the protein and the ligand need to be parameterized separately? You can click on the solution to find the answer. The first thing that we need to do, because we want to treat them separately, is to separate the coordinates of the ligand and the protein into two separate files. If we look at the contents of the PDB file, the atoms of the protein are all labeled as ATOM and the non-protein atoms are labeled as HETATM, or heteroatom. This includes both the solvent molecules, so the waters here, HOH, but also the ligand, AG5E. To perform this separation of the coordinates into two separate files, we can use one of the text manipulation tools, Search in textfiles. All this does is search all the lines in the input file, and if it finds lines that match a particular pattern, it will save them in the output. So first of all, to collect the atoms associated with the protein into a single file, we choose our input file as the input. We select Don't Match, and then under regular expression we type HETATM. This will ensure that the output contains only lines in our file which are not heteroatoms, in other words, only the protein.
Okay, so in the next step we also want to separate the ligand atoms out of our initial PDB file. Once again let's use the Search in textfiles tool to do that. The first thing to be careful about here is to use our original PDB file rather than the protein-only PDB file which we've just created. We now want lines that match the code of the ligand, so that's AG5E as stated in the tutorial, and we just click on Execute. Then again, let's give them sensible names. So we've now completed this step, and we now have to calculate the topologies for both protein and ligand. For the protein, we use this GROMACS initial setup tool, which I'll show you now.
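For anyone who prefers to see the two searches above written out, here is a rough Python equivalent. It is only a sketch of the same line filtering, not the Galaxy tool itself: the function name `split_pdb` is invented here, and the four PDB-style lines in `SAMPLE` are made-up placeholder records rather than real 6HHR data.

```python
import re

# Made-up PDB-style records for illustration only (not taken from 6HHR).
SAMPLE = """\
ATOM      1  N   GLU A  16       5.100  28.200  39.100  1.00 50.00           N
ATOM      2  CA  GLU A  16       6.200  28.900  38.400  1.00 48.00           C
HETATM 1700  C1 AG5E A 301       1.500  10.200  23.700  0.50 20.00           C
HETATM 1750  O   HOH A 401       9.800  15.400  30.100  1.00 30.00           O
"""


def split_pdb(pdb_text, ligand_pattern="AG5E"):
    """Return (protein_lines, ligand_lines) using the two searches from the video.

    Protein: every line that does NOT match HETATM ("Don't Match").
    Ligand:  every line that DOES match the ligand code.
    """
    lines = pdb_text.splitlines()
    protein_lines = [ln for ln in lines if not re.search("HETATM", ln)]
    ligand_lines = [ln for ln in lines if re.search(ligand_pattern, ln)]
    return protein_lines, ligand_lines


protein, ligand = split_pdb(SAMPLE)
```

Note that both searches run on the original text, which is the same point made above about not using the protein-only file as input for the ligand search. The water line matches neither filter's output, so it is dropped from both files.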
If you search for GROMACS, this will show all of the tools which are based on the GROMACS software, and we choose this one, GROMACS initial setup. We want to choose the protein-only PDB file as our input. For the water model, we have various options available. This is a so-called three-point model, which means that the water is modeled by three different charges, one on the oxygen atom and one on each of the hydrogen atoms. There are also four- and five-point versions, which also place charge, for example, on the lone pairs of the oxygen. For the force field, again we have multiple options. We will choose the AMBER99SB force field, quite a recent one. Again there are multiple options that you can use, but we recommend that you certainly use an AMBER force field, because the tool that we use for the ligand topology also generates AMBER topologies. Okay, so then we can just click on Execute, and we create three different files, which I'll explain to you when the job is complete.
Right, so now we have the task of generating a topology for the ligand. For this we use the acpype tool. This provides an interface to AmberTools, and it also has the nice benefit that it gives us an output in the format required by GROMACS. As an initial step, we have to add hydrogen atoms to the ligand, because if we look at the PDB file we will see that these are currently missing. We can do this using the Compound conversion tool. Under output format we can just select PDB, choose 7.4 as the pH at which to add the hydrogen atoms, and very simply press Execute. Okay, so now we have our ligand PDB with the hydrogen atoms, and we can check quickly to see that the hydrogen atoms have indeed been included in the output. The next step is to run the acpype tool itself, so it's this one here, Generate MD topologies for small molecules using acpype. As the input file, we take the ligand PDB with hydrogens. The charge of the molecule is zero.
Multiplicity is one, which should be correct for almost all organic molecules. Under force field we select GAFF. GAFF stands for General AMBER Force Field, which is a force field applicable to almost all small organic molecules. The charge method we can select simply as bcc, which is the default option. We want to save the GRO file, because we need this for our simulation, and then press Execute.
Okay, so now that that job has completed, I can say something about the output files of the GROMACS setup tool and the acpype tool. For the GROMACS setup tool we have three outputs. We have this TOP output, the topology, and this GRO output, which contains the coordinates of all the atoms, similar to the PDB file. And we have this ITP file, which is a position restraint file that we can use later on in the equilibration steps. Then again for the ligand, we have a topology file and we have this GRO file, which contains the coordinates of the atoms. As for the topology, I'll show you quickly the contents of the ligand topology. It contains various information, not about the positions or coordinates of atoms, but about, for example, their charge, their mass, the different bonds which exist between atoms, angles and dihedrals. This is information which is also required for our simulation.
So having now calculated the ligand and protein GRO and TOP files, we now need to combine them back together again, and there is a tool for this in Galaxy, which merges the GRO and TOP files. The tool has four different input files: first of all the protein topology file, which is this one; the ligand topology file, which is this one here; the protein structure or the protein GRO file, which is this dataset; and the ligand GRO file, which is this one here. So I can just press Execute to start the job.
So after we've combined the protein and ligand GRO and TOP files to create a single unified GRO file and a single TOP file, the next step is to create a simulation box around the system in which the simulation can take place. You can do this with the GROMACS structure configuration tool. Again it's quite a straightforward tool. We take this GRO file as an input, we want to configure the box, and we want to select one nanometer here. This is the distance between the edge of the protein and the outside of the box, so we can be sure that our protein is at all times this distance from the edge of the simulation box. There are different box types that you can use. The triclinic is the most efficient one, so we will use that.
The next step is solvation, to add solvent to the box that we've just created. Once again we have quite a simple tool for the solvation. We have to select our GRO and TOP files, so we select the two that we produced from the merge tool. We have to select a water model. We have three options here; we select the three-point model because that's what we used in the initial setup for the protein topology. We want to add ions to neutralize the system in case we have a net charge, and we also have the choice to specify a salt concentration to add. In this case I believe it is zero, but you could also consider setting this to 0.1 moles per liter or something similar. Then I will press Execute to start the tool.
So having now completed solvation and created the simulation box, the next step is energy minimization. The aim is to ensure that the system is in the lowest possible energy state, so we conduct a short simulation and wait until it reaches that state.
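To make the idea of energy minimization a little more concrete, here is a toy steepest-descent loop in Python. It is a sketch under strong assumptions: a single coordinate on a made-up quadratic energy surface, a fixed step size, and invented names (`potential`, `force`, `minimise`, `force_tol`, `max_steps`). It is not the algorithm as GROMACS implements it; it only shows the general pattern of stepping downhill along the force until either the force falls below a tolerance or a maximum number of steps is reached.

```python
def potential(x):
    """Toy energy surface with its minimum at x = 2 (arbitrary units)."""
    return (x - 2.0) ** 2


def force(x):
    """Force is the negative gradient of the potential."""
    return -2.0 * (x - 2.0)


def minimise(x, step_size=0.1, force_tol=1e-3, max_steps=50000):
    """Move downhill along the force until it is small or the steps run out."""
    energies = [potential(x)]
    for _ in range(max_steps):
        f = force(x)
        if abs(f) < force_tol:    # converged: force below the tolerance
            break
        x += step_size * f        # step in the direction of the force
        energies.append(potential(x))
    return x, energies


x_min, energy_curve = minimise(x=10.0)
```

Plotting `energy_curve` against the step index gives a curve that drops quickly at first and then levels out as the minimum is approached.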
We have a specialized GROMACS energy minimization tool, and once again we need to provide GRO and TOP files. So we provide the GRO file which we just created with the box, and the solvated TOP file, this one. Then we have a choice whether we upload our own MDP file. In GROMACS you have these MDP files which contain parameters for the simulation, or you can also choose this customizable option. There are various parameters, which we can mostly leave as their defaults. You can set the number of steps to 50,000 and the EM tolerance to 1,000. The idea is that as soon as the maximum force in the simulation is smaller than this value, the simulation will end prematurely, or else if it reaches 50,000 steps then it will end. So let's press Execute.
Having completed the energy minimization step, we now need to check that the minimization has actually converged, in other words, that it has actually reached the minimized state. There's a Galaxy tool also for doing this, Extract energy components with GROMACS, so I'll show that. First we need to take the EDR file, which is an output of the energy minimization. There are various terms that you can calculate; in this case the correct one is the default, the potential. You want to select Galaxy tabular as the output format, and then we just press Execute. Okay, I'll just show you the contents quickly. It shows all the time steps of the simulation, and the potential energy of the system in the second column. We can plot this: if we go to the history now and click on this icon here, Visualize this data, we can choose to plot a line chart. You can do this with this one here, line chart (jqPlot). Select the axes: for the x-axis we plot column one, and for the y-axis we plot column two. What we see here is this kind of curve, which at first decreases rapidly and then starts to decrease more slowly. We can see here that because it's leveling out, the system
is reaching a minimized state. You could maybe even leave the simulation to continue for a little bit longer, so it flattens out completely, but the system is more or less minimized. If we had just seen a steeply descending curve, like at the start, that would indicate that the simulation had not reached the minimized state. So having checked that, we can now continue with the next step, which is the equilibration.
Okay, so now that we've completed the energy minimization, we can continue with MD simulations. In this tutorial we carry out three simulations: two so-called equilibration simulations and then a production simulation. The production simulation is the real simulation that you can then go on and use in the analysis stage. So what does the equilibration involve, first of all? What we want to ensure is that the solvent around the protein is correctly equilibrated, in other words, that it's brought to the correct temperature and to the correct pressure. For that reason we conduct the equilibration in two different stages, under the NVT and NPT ensembles. NVT stands for constant number of particles, constant volume and constant temperature, whereas NPT stands for constant number of particles, constant pressure and constant temperature. As stated in the tutorial, you might also see the terms isothermal-isochoric for the NVT ensemble, or isothermal-isobaric for the NPT ensemble. To carry out the equilibration we have to hold the protein in place, and we can do this using the position restraint file that was created in the system setup. The idea here is that the position restraint file states which atoms have to be held in place, and then during the MD simulation these atoms are allowed to move, but the motion is energetically penalized.
So let's go ahead and start the first equilibration simulation. We can search for the GROMACS simulation tool; it's this one here, GROMACS simulation for system equilibration or data collection, and once
again we select the GRO structure file and the TOP file. The GRO file is this one from the energy minimization, and the TOP file is just the solvated TOP file that we created earlier. Now we have various inputs to select. For the checkpoint file, this is our first simulation, so we don't have a checkpoint file; I'll explain what this is in a second. For the position restraint file, I have to scroll down to the bottom of our history, and for the GROMACS initial setup tool that we ran right at the start there should be an ITP file, so this one here, which we use. The index file we can also leave blank; that would be if you wanted to specify some custom atom groups, but we can ignore that. Then for the outputs, we do not really need the trajectory in this case, but let's choose to return it. We want to return the GRO file, and I want to produce the checkpoint file, which allows us to continue the simulation with our subsequent equilibration and then the production simulation. Again we want to produce our EDR output, like for the energy minimization, and we want to produce the TPR output. That should be enough.
Okay, then as for the settings, here we have a choice as to the ensemble, and this is the NVT equilibration, so we select NVT. Again we have the choice whether to upload our own MDP file or use customizable settings. If you are already familiar with GROMACS and you use it on the command line, uploading an MDP file might be an option that you're more interested in, so you can customize things more extensively, but if you're new to GROMACS then we recommend that you use the customizable settings here. The leapfrog algorithm is a good choice of integrator here. Then for the bond constraints, we want to constrain all bonds for the position restraints. We can leave neighbor searching and electrostatics as the defaults. For temperature, let's use a sensible temperature of 300. The step length here is a femtosecond. You could also change this, but if you make it much more than a
femtosecond then it's likely that the simulation will fail, so a femtosecond is a good value. As for the number of steps that elapse between saving data points, a thousand is probably a good value. These values we can leave as the default. Now, what is important is the number of steps for the simulation, so we select 50,000 here. If you remember, our step length was a thousandth of a picosecond, or a femtosecond, so 50,000 femtoseconds is equal to an equilibration of 50 picoseconds. And let's generate the log. In case the job fails, having access to this log is very useful to help debug and find out where things went wrong, so I recommend that you always click Yes for this option. Now that we are starting with real simulations, we can expect that they take some time, so be prepared for that, and no need to be impatient.
The simulation has just finished, and just like for the minimization step, we want to check that the equilibration completed successfully, so that our simulation has actually equilibrated to a constant temperature. Once again we use the GROMACS Extract energy components tool, this one, and we need to take the EDR file as input, this one here. Before, we selected potential as the term to calculate, and this time, because we're interested in the temperature, we select, pretty logically, the temperature. Again we select Galaxy tabular as the output. So the job's finished; we click again on Visualize this data, and we can again use the jqPlot line chart just as before. For the x-axis we select column one and for the y-axis column two. You should hopefully see this kind of plot, where the temperature of the system rises from a low value up to around 300 and then fluctuates within two or three degrees around the 300 kelvin value. If you're seeing something different, where a stable temperature value is not reached, then you should consider extending the length of the equilibration, but for this system 50 picoseconds should be sufficient. For your own system perhaps a longer equilibration is necessary, so just bear in mind that it's worth checking with this tool.
Okay, so the next step is to carry out the next equilibration step, which is now under the NPT ensemble. We use very similar parameters to before. We can use the same topology as before, but we want to update the GRO file to the one which was produced by our NVT equilibration, so that would be that one. For the inputs, we also want to choose the checkpoint file which was produced by the NVT ensemble simulation. The idea of the checkpoint file is that it contains the information about the last state of the simulation, so for example all the forces which are acting on the atoms, and then we can use these to restart our new simulation from the point where the previous simulation left off. For the position restraints we once again want to select the same file as before. The index we again leave empty. We want to once again return the trajectory and structure outputs, as well as the checkpoint and EDR files and the TPR output. Okay, and then of course we change the ensemble from NVT to NPT. Here we use the same parameters as for the previous equilibration: we constrain all bonds, set the temperature to 300 kelvin, set the step length to one femtosecond and the number of steps elapsed between saving data points to one thousand, and once again let's equilibrate for 50,000 steps.
After having carried out the NPT simulation, we once again want to check that the pressure of the system has converged. We should expect that it converges on one bar, so atmospheric pressure, and once again we expect some fluctuation. This time I will skip this step. Of course I recommend that you try it yourself, but I just want to avoid repeating myself too much. But I will show you the kind of graph that you should expect after plotting the pressure output from the Extract energy components
tool. It should look something like this: once again a sharp rise at the beginning and then a lot of fluctuation. Actually this looks like kind of a crazy amount of fluctuation at first glance. We expect the pressure to be at one bar, and it's fluctuating throughout the whole of the simulation, something between minus 200 and 200 bar. But actually this is more or less what we expect. Pressure fluctuates a lot in an MD simulation, and statistically this is probably not distinguishable from the pressure that we are targeting of one bar, so this is an acceptable outcome, I would say.
So now I'll continue with the production simulation. We are finally ready to perform a longer simulation of the protein without restraints, which we can then go on and use for the analysis we want to perform. For the topology we use the same topology as for the equilibration steps, and we use the GRO file from the NPT equilibration. For the checkpoint file we again use the checkpoint file from the NPT equilibration. It's important to note that the position restraint file that we used in the equilibration steps is not used this time, because this is a production simulation. Then as for the outputs, we once again want to return the XTC file and the GRO file. Just to note here that in the last two equilibration steps I selected the XTC file but we did not really need to use it at all, whereas for the production simulation this is the main output, the main outcome that we are interested in, the trajectory. We can produce a checkpoint file, but we will not be proceeding with any further simulations, so it's not strictly necessary. Again, there's nothing to check as far as the EDR file is concerned, but it does not hurt to produce it, and I'll produce the TPR file as well. Okay, so once again we have the choice whether to conduct the simulation under the NVT or NPT ensembles; here I suggest NPT. So this time we leave the
bond constraints as the default, so no bond constraints. Temperature is of course 300 kelvin, step length a femtosecond, and the number of steps you can set to a million, so six zeroes; that is one nanosecond. So let's start the tool. This is of course a longer simulation, so you should expect it will take some time to complete.
All right, so as you can see, the final simulation job has now completed. You can see these four datasets being produced; I'm sorry, that's actually six datasets in total. In particular, the important ones are the GRO file, which shows the static structure at the start of the trajectory, and the XTC file, which contains all the frames which make up the trajectory. What we can do next, and what Chris will show you in a few minutes, is apply various different analysis techniques to these two files to find out some more about how your molecular system is behaving during the course of the simulation.
But what I'd like to show you first of all is how we can scale up this kind of simulation to a really high-throughput level. The way that we do that is using two different Galaxy concepts, called workflows and collections. This part of the tutorial is optional, and it's covered by a section at the end of the training material. So if I go back to the training material page, here is this section, Optional: Automating high throughput calculations, so you can also follow through the instructions here. What I'll start by doing is navigating to the Workflow tab here at the top. Then you see that we have a list of workflows. I will show you the content of this one, the protein-ligand HTMD simulation. What a Galaxy workflow is, effectively, is a way of connecting up multiple tools to form a single pipeline, which you can then run as if it were a single tool. These boxes here represent two different input files, and the rest of the boxes represent different tools, so all the tools that
we used so far in the course of the analysis. These pipes represent the inputs and outputs of the tools which are being passed in between. First of all, we start with an SD file containing some ligands that we want to perform MD simulations for. In case you're not aware, the SD file is a commonly used file format for storing chemical information for molecules which have three-dimensional coordinates. This is passed to the Compound conversion tool, which splits it up into a collection. A collection in Galaxy is a group of individual datasets which we can apply the same tool to. Then for each of these we run the Generate MD topologies tool, which you already used. We merge the topologies together with the protein topologies which we have generated from the PDB input file, and then we continue with all the tools you saw already: structure configuration, solvation, GROMACS energy minimization, our equilibration simulations and finally our production simulation.
So just to show you how this works in practice, I can click on Run workflow up here. I have a history prepared with two input files: an SD file with three molecules for which I want to conduct molecular dynamics simulations, and the same Hsp90 PDB file we've been using so far. So then you can see the workflow. We have these two inputs, our PDB input file, which we select here, and our SD file, and then we have the various other tools. These have different parameters which we can also adjust. For example, let's say we want to change the sodium chloride concentration to zero from one, the step length maybe to one femtosecond, 0.001 picoseconds, and then let's conduct some very short simulations: 1000 steps for the NVT simulation, 1000 for the NPT and 1000 for the production simulation. So very short simulations, but just to demonstrate what is possible here. Then we simply click on Run workflow here, and what you'll see
is that the workflow will be scheduled and it will produce a lot of different jobs in the history, which start out gray because they're waiting to run. Then gradually they'll change color and eventually become green. We'll see in the end that the final output files are in Galaxy collections, so in groups of three files, one for each of the three molecules which are contained in this structure data file. You can see already that the first few jobs have been scheduled: these GROMACS initial setup tools and this initial step which splits up the structure data file into a collection of Galaxy datasets. What you can see here is that the entire workflow has been scheduled. The intermediate datasets are hidden, so we can just see these two collections at the end, which haven't completed yet. We need to wait until the entire workflow completes, but for example you can see that if you click on this collection, it will show you the three XTC files which have yet to be produced, but which are grouped together in a single collection. So this provides a really neat way in which you can just upload a couple of input files and then, at the click of a button and after selecting a few parameters, run multiple simulations in parallel for a range of different small organic molecules. The final thing that we can do with workflows in Galaxy is to automate them via the command line. If you look now again at the training material, you can see that we describe two different ways that you can do this. One uses the Python library BioBlend; we provide a small Python script which you can use for this. Or even more simply, you can use the Planemo command line tool to run a Galaxy workflow from the command line in a single command. This section is strictly optional; if you don't feel comfortable using the command line then feel
free to skip it, but I think it might be interesting for some people, and it's nice also to show you what kind of things are really possible with Galaxy. So here is my terminal. In this directory I have two files. I have my structure data file, which contains my ligands, as you can see, and then I have this HTMD job file. What this contains is a list of workflow inputs to run a particular Galaxy workflow. Then I can simply run a Planemo command like this: planemo run, and I have the workflow ID and my HTMD job file, which I just showed you, and finally a Planemo profile, which contains all the information which is needed to log into my Galaxy account programmatically. Then I just press Enter, and if I return to my web browser, I should be able to see that a new Galaxy history has appeared. Already the ligands file is being uploaded, and shortly the workflow itself will be invoked and create all the datasets that we saw before. My aim there was just to show you very briefly what kind of thing is possible with Galaxy, how you can interact with the Galaxy server via the command line. To cover all the functionality of Planemo for running workflows would be the subject of a whole different tutorial, but hopefully this provided you with a bit of information. If you're interested, you can go and research this yourself and look at the Planemo documentation. All right, so having told you about high-throughput molecular dynamics simulation with Galaxy, that brings my part of the tutorial to a close. I'm going to hand you over to Chris now. He will show you how to perform the analysis of the MD files that we've produced using Galaxy.
Hi everyone, my name is Chris and I'll be taking you through the analysis part of this Galaxy tutorial. To start, we'll go through some short background, and then after that the interactive session. If you have any questions
please do ask us; we will be online to help you out. So you've already run a molecular dynamics simulation and you found that it produces incredibly large and complex datasets in different file formats. In this case you have all the Cartesian positions of each atom of the system under investigation, saved out at a particular interval during the course of the simulation. Now often you'll have tens of thousands to millions of atoms. For example, just considering this enzymatic system on the left, you have an enzyme and a ligand and some ions in a water box. In this case there will definitely be quite a few thousand atoms, and your simulation will have many, many time steps, millions of time steps, in order to sample enough of the phase space. So you're often running at least, let's say, 15 nanoseconds of simulation for decent sampling, and perhaps more than that. Once you've done this and you have these outputs, these molecular dynamics trajectories, you want to then conduct rigorous analysis of this information-rich data in order to obtain scientific insights and conclusions from the simulations. So we're going to look at those analysis tools in this talk. A general outline of the basic process is that you have some inputs, for example the structure file, which might be in GROMACS format or PDB format. You'll have your trajectory file or files and also some parameters. Those parameters might indicate which atoms you want to investigate, or if you want to change the definition of, let's say, what hydrogen bonding looks like in terms of angle or distance, you might change these parameters. Then we'll go to the part where we take these inputs and analyze them using some kind of molecular analysis framework. So you're going to extract information and transform it in some way; we'll be loading information as well. As is often the case, there are many frameworks already available, and although you can write your own custom scripts, we recommend
especially for workflows and for reusability, to use existing frameworks that are fairly well tested, such as MDAnalysis, Bio3D, MDTraj and VMD. What we've done is we've used these frameworks and integrated them into Galaxy so you can process the data. Your processed data, the result of doing this analysis, is then a table, a graph or a figure, or a combination of these. These are what you would then read into a little bit further and try to figure out what particular property has changed, or if the simulation has converged, or perhaps if there's some interesting behavior that's worth investigating. At worst, maybe you'd find out that you need to run a longer simulation, or there's a problem with the simulation that you might not have picked up. So it's very important to look at the processed data, and of course this will also help for publication if you've got some interesting figures. So, some commonly used analyses. This is a really rich space; there are many types of analyses. In Galaxy we've got some of the most common ones that you'd want to use. For example, a time series, if you want to look at a property over time. Then for root mean square analyses, we've got the root mean square deviation and fluctuation. We've also got principal component analysis, and often people are interested in hydrogen bonding analysis. So if you're looking at an active site of a protein and there's a ligand, are there some interesting hydrogen bonds between the ligand and this enzyme over time? There are also Ramachandran plots; if you're looking at the torsion angles of proteins, these are quite useful. And there are a few other examples. Now, we don't have all of these tools in Galaxy just yet, so I've starred some of those that are not quite available right now. Perhaps you have a favorite package that you'd like to use, so how do you get that into Galaxy? If it's on conda-forge it could be added quite easily, and you can
contact us online to add new tools. But we'll be covering some of these more commonly used analyses during this tutorial. Okay, so to start off with, let's focus on a basic time series analysis. The idea here is to measure a property over time. For example, you might want to measure a distance, an angle or a dihedral angle. You also might want to consider non-bonded interactions like hydrogen bonding. So you'd get a time series of the property you're measuring, for example the end-to-end distance versus the frames or the time in the simulation. While it can be very interesting to look at this, it can also be misleading, and we should really do a histogram analysis where possible, because a time series doesn't count how many times a particular value, in this case of the end-to-end distance, is sampled. These time series are very useful because we can look at, let's say, a torsion angle and whether it has sampled phase space relatively well, and if we want to improve that, we might run a longer simulation, or we might want to run a free energy simulation or a potential of mean force calculation. There are other ways to consider properties, but this is just one of the easy to understand and useful first ways to consider particular properties of interest. And of course hydrogen bonding interactions are very popular to consider, so this is a great time series to use for your analysis. Next up, let's look at the root mean square deviation, or the RMSD. This is a standard measure of structural distance between coordinates, and it measures the average distance between a group of atoms, for example the backbone atoms of a protein. Why you might want to do this is if you want to check the stability, first of all, and also the conformation of the selected atoms over a simulation. So if you do this for a protein, you could measure the RMSD between an initial conformation and all the frames of the simulation, and you'll
see that you'll get a time series and a histogram using the tool in Galaxy. If you look over time, you can see the RMSD did fluctuate a little bit, and now we want to figure out: are there multiple states, or is it pretty much one stable state? Looking at the histogram, you can see in the density there's a little peak on the left here, but there's pretty much one stable state, and there are not many conformations of this protein that are worth investigating. There's just one stable state, which is good for the type of simulation we happen to be doing here. So this tool is really useful for the purpose indicated over here. If you want to also look at the fluctuation around a particular residue, you would then use the root mean square fluctuation, or the RMSF, where we look at the average deviation of a particle over time with respect to a reference position. So it's very useful for looking at amino acids in a protein. If we look at the plot that's produced in Galaxy, it's the RMSF versus the residue position, and you can see that the fluctuation per residue changes and some residues fluctuate more than others. We'll use this kind of analysis to identify the most dynamic areas of the system and see if you want to investigate those further. Often the C- and the N-terminus, so the beginning and end of the protein, fluctuate quite a lot, and we might ignore that. But we're often looking for RMSFs above one, significantly above one, and that would indicate that the area is very dynamic and it's moving a lot during the simulation. If you don't expect that, it might indicate there's some conformational change that needs to be investigated, or something wrong with the simulation, or maybe something interesting is really happening. So this is another way to consider if your simulation is going as expected and if there's an interesting area to consider. Next up is PCA, or principal component analysis. This is a well known statistical
method. It's often used for dimensionality reduction, and the reason we want to use it here is we have a molecular system; there are many atoms with Cartesian positions, and these change during the course of the simulation. So for each Cartesian position we have three values, x, y and z, times the many atoms that we have, and there's a large, multi-dimensional complexity there. We have this huge space and we want to reduce the dimension, from whatever dimensionality it is, several thousand dimensions, to something that we can understand, and we will use PCA for this. What we do is the covariance matrix is calculated and diagonalized to get eigenvalues and eigenvectors, and these are then the principal components. Essentially these are the components that have the highest variance, so these are the things that are moving around in the system a lot, and they are also orthogonal to each other, so they're independent. So we have these eigenvectors and we can consider this movement and ask, is this interesting, and what's going on? It's really useful to study the essential dynamics of a system using PCA, and we will then be looking at statistically meaningful motions. One of the results that you'll get in Galaxy is this set of plots where principal components one and two are plotted versus each other, two and three, three and one, and then also this scree plot or eigenvalue rank plot. What's happening here is, let's look at PC2 versus PC1. The amount of variance that each component is responsible for is indicated in brackets, and this is plotted over time, from blue, which is the beginning of the simulation, through to red. You can see over time that the system moved from a positive region of principal component one space to the negative region, and then from the negative region of principal, sorry, the positive region of principal component two to the negative region to the positive
region. So you can figure out if your system is going through repetitive motions or not, and if it's sampling this principal component space in an interesting way. Now of course these principal components are from a very large molecule, so they will not be as elegant as a simple bond vibration; there's often some complexity here. So it's very useful to extract and visualize these components, which you can do in Galaxy, and I'll discuss that in the interactive session. Now what's often the case is that the scree plot shows us that only a few principal components are responsible for a lot of the variation in the system. So usually up to five, so one, two, three, four, five, maybe in this case six or seven, are responsible for about 50 percent of the variance, and those are the ones that we want to consider. We tend not to consider the principal components out here because they're not responsible for very much of the variance. Okay, so I've discussed all the basic analyses that we'll do in this tutorial, and just a reminder of the frameworks we're using: MDAnalysis, MDTraj, Bio3D and VMD. You can access the training materials on the following websites, and in fact it might be a little bit more convenient in your Galaxy session: if you look at the top bar, you'll see the graduate cap or academic hat icon. If you click on that, you'll be able to access the training material. So thanks very much for watching this intro session on analysis. Let's move on to the interactive session.
Hi everyone, welcome to the interactive session. We're going to go through the analysis part of the HTMD tutorial together, and you should go to your favorite instance of Galaxy that has the computational chemistry tools. I'm using usegalaxy.eu and I've gone to the cheminformatics subdomain, so it's going to show mostly the chemistry tools on the left, the ChemicalToolbox as well as a few other useful tools, but nothing outside of this domain. So it'll be easier to
find the tools we're working with. We can access the training by clicking on the academic hat over here, and it will automatically take you to HTMD. If it doesn't for some reason, you will just then move to Galaxy Training, Computational chemistry, you'll scroll down to HTMD, you'll click on the hands-on and then scroll down. We're currently doing analysis, so you'll click over there and we'll start this section. Before we get started, I just want to clean up my history. I have a history with some of the setup and some of the analysis going on as well, but it's very complicated to look at, so I just want to start with all that we need for this analysis. What we need are the GROMACS files which were created during the final simulation that you did. That final simulation would have been your production simulation, and it would have produced a number of outputs, but particularly we want to get the XTC file for the trajectory and the GRO file, which is the structure file. So I'm going to take this history that I've got and I'm going to make a new history. I'm going to History options, I'm going to go to Copy datasets, and I'm just going to scroll down and figure out which one it is. I think it should be called something similar to GROMACS simulation, and I particularly want the XTC data and the GRO data. If you click on your history item, it'll expand and show you some more information, and in my case the history items that are interesting are number 36, which is the GRO output from the simulation, and number 37, which is the XTC output. So 36 and 37, which are those two there, and I'm going to copy just those two to a new history called htmd_analysis_2021. I'm going to press Copy and that'll take a moment. You can navigate to that history the usual way, but you'll see that there's this pop-up over here that explains, and you can click on this link, so you can just click on that
and it'll take you to the new history. So we're at this new history and now we can get going. To get started, I'm going to again click on the training materials and scroll down to analysis. You've completed the simulation and you might be asking a whole lot of questions, like: is it converged, and what interesting molecular properties can I observe that are relevant to this particular system, so to your research on this particular project? In this case we're looking at the heat shock protein, and we're just going to do some basic analyses to get used to what's available on Galaxy. The first thing I want to do here, and you'll do this as well, is create a PDB file. The reason for that is most of the analyses in Galaxy right now use a PDB file as an input structure, and we have a GRO format file. There are other ways to do this, but I find this the best way, and we're just going to follow along and use the following tool. We're going to use the GROMACS structure configuration tool, and what we're going to do is just convert the GROMACS file to PDB format. What I've done here is I've copied the name of the tool and pasted it into this tool search section, and hopefully you'll find the right tool. You can see that it's the second hit, GROMACS structure configuration. Then just to make things a little bit easier, I'm going to full screen this window that I'm in. For me, I have to press Fn and F11; for you that might just be F11 to go to full screen. So, full screened, and what I'm going to do here is use this tool to convert the GROMACS structure file, which is history item number one, and you can see there it is, it's in GROMACS format, and I'm going to output a PDB file. So you have to change the output format, and that's all that we have to do. If you just want to double check, you'll click on the training materials again and you can see we
changed the output and the box configuration is set to max. We don't need any of this detailed information, so I'm just going to press Execute, and that will then start on the right hand side in your history. It's currently waiting in the queue, it's gray, and that should start shortly, and when it's complete we will then have a PDB file. Now in the meantime, while that's waiting, let's move on to the next step. Let's convert the trajectory file from XTC format to DCD, and we'll use this MDTraj file converter. The other job has just started running, it's gone yellow, so it's running now, and I'm going to search for my converter, and there it is. The input that's selected is history item two. By default this converter knows that you might be wanting to convert a trajectory type, so XTC or DCD; it won't select the other history items. We're going to convert this to DCD. Now I'm always going to go back to the training materials just to double check we've done the right thing, and I'm going to press Execute. You can see that that is now waiting in the queue to run, and that the other job that we ran a moment ago, the conversion from GRO format to PDB format, has completed. So I'm just going to click on that and see what that looks like, and you can see the contents of this data here is in PDB format. It's not showing all of it because it's quite a large dataset, so we're just seeing the first few lines, and you can see there's the first amino acid, alanine, and all the atoms of alanine and the Cartesian positions x, y, z, etc. Okay, so the next job is running, that's excellent. What happens after we've got these structure and trajectory files is we can actually start the analysis. The first analysis that we're going to run is an RMSD analysis of the protein, and in this case we're going to use it to check that the protein in the simulation looks pretty stable, and as you
saw previously, this is a standard measure of structural distance. In this case we'll use the C-alpha carbons of the protein backbone to figure out if the protein is stable throughout this trajectory. So we're going to use the RMSD analysis tool, and again I'm just going to use Ctrl+C, or right click and copy as well, just to copy the name of the tool. We just want to select the domain to be C-alpha, and then we should end up with an output like this. So let's just do that, pressing Escape, or rather just clicking outside of the training material area to get back, and I'm going to paste the tool name in. Okay, there it is: RMSD analysis using Bio3D. The DCD file that we want to use is this fourth history item. Note that we could choose number two, and you can see Galaxy would auto-convert that XTC to DCD. I personally haven't done it this way; I always use the conversion tools that we've set up for computational chemistry. If you'd like to try it the other way, I'd like to hear if it works, but I recommend doing it this way. The same goes for the PDB input: choose history item three if you've got a new history, or as appropriate for your history. Then under Select domains you've got a few options, and the domain we want to choose is C-alpha, as per the training material. We don't need any further modification here. I'm sure it's clear already, but a lot of these tools have further information about what the tool does, what the inputs are and what outputs you can expect, so you can always read more about these tools not only through the training materials but also on the tool page itself. So let's execute that job, and you can see it's waiting in the queue, it's gray. At this particular time there's a whole lot of Galaxy jobs going on on this particular server, so while I'm running this, maybe these jobs will be a bit slower than usual, so
it's still waiting in the queue. What I want to do is start the next job in the meantime. So that's the protein RMSD, and we expect results like that, and we can also set up the ligand RMSD as well. It uses the same tool, the RMSD analysis tool, and the same trajectory and PDB, but we're just going to change the domain. I hope you can see that the RMSD job is running in the background now; there's a yellow color over there, which means the job is running. What I'm actually going to do is copy the name of this residue ID. What we want to do here is also check if the ligand is stable, and in that case it's good to choose the ligand atoms. The name of the ligand that we have, or the residue ID of the ligand, is G5E, so that's what we're going to choose. I'm going to select that, and we're going to use the same tool. A trick I often use is to do the following. Here's our previous job for the protein, it's still running. I'm going to say Run this job again, and it uses exactly the same tool, exactly the same inputs as you'd selected before, and of course the same parameters. I'm just going to change those a little bit, so everything else is the same except the residue. So I'm just going to go down to that residue ID, and I've already copied the name of that residue, which is G5E, and if you need to go back you just do so, and then I'm going to press Execute. So of those jobs, one of them has already finished. The ligand one is now waiting in the queue, it's gray. While that RMSD analysis job for the ligand is running, let's just have a quick look at the RMSD analysis for the protein. Okay, so let's just go back a little bit, I did skip ahead, and see that the output that you get is very similar to what is shown here in the tutorial. We should see at least two outputs, if not three, and for the RMSD tool there are actually three outputs. There's your tabular data, that's a time series, and there's a
plot which is a time series of RMSD versus the frame number, so effectively over the course of the simulation. Then there's also a histogram plot, which takes all the time series data and bins it over the RMSD. These outputs look fairly similar to what's in the training material, and the reason they might be a little bit different is, remember that we're running very short simulations because it's a tutorial. When you run an MD calculation you set a random seed to set the velocities for the atoms at the beginning of the simulation, so we expect that there will be some difference. You might not have exactly the same result as this, but it should be very similar. What's happening in terms of the time series is that the RMSD is generally staying around one. There seems to be some change a little bit further on in the simulation that might be worth investigating. You'll see then in the histogram plot that it's quite a broad region, I guess from 0.5 to about 1.5. You've got a unimodal-like distribution, but there is a little bit of an edge over here. It seems that the protein conformation generally stays fairly stable, averaging around one, although there's this little edge, so maybe in a longer simulation we would see something interesting going on. Then for the ligand, I'll just go a little bit further down; you should have a similar result to this. For the ligand, we expect that it will have a stable binding mode for the length of the simulation we're running, and because this is a well-known system, you expect nothing dramatic to happen. So if we look at these plots for the ligand, again there's the raw data from that RMSD, then the RMSD plot, and that looks pretty stable. The RMSD is about 0.3 on average, but you know there's
some movement. To figure out whether it's a lot of movement, we need to look at this distribution as a histogram. So here we go: it centers around 0.3 and it's a little bit skewed to the right, but generally this ligand looks like it's fairly stable. It's a unimodal distribution and it's stable in the active site. Of course, in conjunction with these analyses you could also go and visualize the output of these simulations to confirm this, but now we have a quantitative measure of the RMSD of the protein and the ligand over time. Okay, so that's the first type of root mean square analysis. Next up, we would like to look at the root mean square fluctuation. It's valuable to consider especially for the protein, because we can look at the deviation of these amino acids over time and figure out if there are large structural changes or large dynamic changes at a particular position. So let's use the RMSF analysis tool. Let's search in the left panel, there it is, and again the trajectory and PDB should be automatically selected correctly here, and the domain that we want to look at again is C-alpha. This RMSF works really well on proteins, or molecules with distinct residues in them, like a protein. Again, I'm just going to execute that, and we should have two outputs, the raw data and the RMSF plot. So that is queuing in the meantime, and the expected result of this would be a plot of the root mean square fluctuation versus residue position. We really don't expect any large fluctuations in this calculation, but anything above one would be considered a large fluctuation. As mentioned earlier, the first and last residues tend to fluctuate a lot and be quite flexible, but otherwise we don't expect this protein to be particularly flexible, especially with the short simulation we're running, and as it's in a stable conformation. If it were to be the case that there was a large flexibility, let's say you can
see around residue 100 there is some flexibility, we could figure out why that is, and perhaps if those amino acids are the ones that are in contact with the ligand, or if there's other dynamic motion in this protein. It might be useful then to visualize the protein and see what's happening there. So I'm just going to click back to the output. In your history you should have the RMSF raw data, which is the table behind the plot, with the entries just here labeled from one to the end, and then the RMSF that is calculated for each. Now the plot looks as follows, so a little bit different to the one in the training material example, and mainly, I guess, around residue 100 or so there's a little bit of a change over here. I don't think that this is necessarily relevant; anything around one is considered fine. But if you were to be worried about why this is moving so much at this position, it could be investigated further, again via a visual analysis or by looking at another time series, perhaps of these residue positions, so amino acids at 50 and whatever's near to them, to figure out why they're moving around. We might see this in the PCA as well. So from a general look at this, it's probably okay; this is expected motion for this particular simulation. Okay, so moving on, next up is the PCA, or principal component analysis. What we're trying to do here is take all the motions of the system that are statistically meaningful and try to understand what they are. We're going to look at using this PCA tool, for which we're going to use the C-alpha domain of the protein, and we should get something like this. So we'll be able to look at the principal components of the protein over time. I'm going to choose the PCA tool, and there it is, PCA using Bio3D, and again your trajectory and PDB input should be the same as before, so in my case that's history items four and three. The domain to use is C
alpha, and there are quite a lot of outputs for PCA. There's a PCA plot, a cluster plot, and then the first principal component plotted versus the RMSF. That's quite interesting; we can compare the RMSF, which we have already calculated, to the first principal component, and also get some raw data. So I'm just going to press Execute. Okay, so that has been queued, so we're waiting for that to run, and okay, it's just turned yellow, so it's starting to run. You should expect to have a similar result to this one, where the first three principal components are of course of very high variance and doing something similar to what we see here. A question you might have, though, is what is the principal component? You don't know what the principal component is; it's just the group of atoms with the highest variance. We're using the C-alpha atoms here, so among the C-alpha atoms, which have the highest variance and are orthogonal in motion to the other principal components? You might want to visualize that, which we can do. Another thing that we might want to do is look at the cosine content, and this calculation gives you some idea of whether your simulation is converged or not. We don't expect the simulation to be converged, so I'll run that in a moment. So you have the four PCA outputs. There's the raw data, so this is for the first three principal components, if I recall correctly. Then you've got a PCA plot, which is what's shown in the tutorial; we'll come back to that in a moment. There's also a cluster plot. By default the number of clusters to look for is two, and in this case perhaps three would have been a better choice for PC2 versus PC1, but it's the same as the previous plot, just color coded differently because it's by cluster rather than over time. Here it's color coded over time, so blue is at the start and red at the end of the simulation. The fourth output then is PC1 versus the RMSF, so you can see there are some similarities;
they're not identical, but there are some similarities between the first principal component and the root mean square fluctuation. Okay, so before having a look at the cluster plot, let's start up the next calculation, which is the cosine content. I'm going to start that up. It just needs the trajectory and the structure file, and it will give an indication of whether or not the simulation is converged. So let's do it on the first three components... it has the first component, and in this case the tool uses a zero-based index, so zero is the first principal component, so that's a zero over there. So we can just press execute, and this will result in a cosine content value. While that's queued and running, we can maybe discuss what's happening with this PCA analysis. If you have a look at your PCA plot, you'll see PC2 versus PC1, so over time there's movement for PC1 from its negative region to a positive region, and the same for PC2: it moves from negative to positive and actually back again. Based on what's happening in these plots, you could decide if it is some kind of repetitive motion or not, and you can look at the sampling of that principal component space. Then you'll also see in your scree plot the proportion of the variance, so you can see that the first three components are responsible in total for about 35 percent of the variance: the first component for 18 percent, you can see it over here on the scree plot, the second component for 11 percent, so together about 30 percent, et cetera. So those first three components are really showing you the largest share of the movement. Now the question is, what is that movement? So the next tool that we're going to run after this cosine content is a tool that will save out the principal components, as I said the first one, and you could then visualize that and see what the motion is. So let's start that up as well. Let me go to the training material, and this is the PCA visualization tool.
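As an aside, the proportion of variance read off a scree plot and the cosine content value can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: it uses NumPy, synthetic random-walk data standing in for real C-alpha coordinates, and hypothetical function names (`pca_projections`, `cosine_content`). It is not the implementation behind the Galaxy tools. The cosine content here follows the commonly cited definition, (2/T) (∫ cos((i+1)πt/T) p(t) dt)² / ∫ p(t)² dt, with a zero-based component index i, so that zero means the first principal component.

```python
import numpy as np


def pca_projections(coords):
    """PCA of a (n_frames, n_features) array via SVD.

    Returns the projections of each frame onto the principal components
    and the proportion of variance carried by each component.
    """
    centered = coords - coords.mean(axis=0)
    # Rows of vt are the principal axes; singular values come sorted descending.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s**2 / (coords.shape[0] - 1)
    ratio = variance / variance.sum()
    projections = centered @ vt.T
    return projections, ratio


def cosine_content(projection, index=0):
    """Cosine content of one PC projection (zero-based index, assumed convention)."""
    p = np.asarray(projection, dtype=float)
    n = p.size
    # Frame midpoints rescaled to the interval (0, 1), so T = 1 and dt = 1/n.
    t = (np.arange(n) + 0.5) / n
    cos = np.cos((index + 1) * np.pi * t)
    numerator = 2.0 * (np.sum(cos * p) / n) ** 2
    denominator = np.sum(p**2) / n
    return numerator / denominator


# Synthetic stand-in for a trajectory: 500 frames of 30 random-walk coordinates.
rng = np.random.default_rng(0)
coords = np.cumsum(rng.normal(size=(500, 30)), axis=0)

projections, ratio = pca_projections(coords)
print("variance proportion of PC1-PC3:", ratio[:3], "sum:", ratio[:3].sum())
print("cosine content of PC1:", cosine_content(projections[:, 0], index=0))
```

A projection that is exactly a half-period cosine gives a cosine content of one, which is why values close to one are read as a sign of unconverged, diffusion-like sampling.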
So if you use this tool, it's going to be here, the third one down. Again, use the same trajectory and structure and the same selection, so if you used a specific selection in your PCA analysis, please choose the same one here. We chose C-alpha, so keep that same selection, and let's say we want to look at the first principal component, PC1. So we choose execute, and this will result in a PDB file which actually has a number of structures in it over time, so it's a little bit like a trajectory but just as a PDB file. This file can then be downloaded and viewed using your favorite visualization tool, and then you can figure out what that first principal component looks like. For example, you can see in this case the motion for PC1 in the simulation that we ran for the training looked as follows. So that was the principal component we found, and again, because this was run on a short simulation, you'd probably need to run a longer simulation to confirm this. But you can see that the principal component here is quite complex, right? It's not just a vibration of a single bond; there's a whole lot of things going on there. You can see there's a bit of a wagging motion at the back, and in your case you might see a slightly different motion. So once again I've jumped ahead. We did run this PCA cosine content calculation, so let's have a look at what that did. If we look at the outcome of the PCA cosine content calculation, you should get a number around 0.93, and that should indicate the simulation hasn't converged. The tricky thing with the PCA cosine content is that when it's close to one, it indicates the simulation is not converged and a longer simulation is needed, but for values below 0.7 we can't make a statement about the convergence. So we can never know definitely that we've converged, unless we use a different type of
simulation, so with a free energy simulation we could converge over a particular degree of freedom. So if you have a look at your results... my result happens to be 0.093, which is a little bit unexpected. It's a value below 0.7, and what that means is I cannot make a statement about the convergence or lack thereof. It's not that it's not converged; it's just that I can't make a statement about it. And in this case, you know, it's a short simulation, so I expect to need to run a longer simulation. You should have a similar number, or perhaps your number will be closer to one, and again you would then be able to say whether it's definitely not converged or whether you're unsure, effectively. Okay, so that's the cosine content, and we've already considered this visualization. So finally, let's look at hydrogen bonding analysis. These are really worth investigating, in particular persistent hydrogen bonds, and so in this case we'll look at the protein and the ligand and at the hydrogen bonds that potentially form between them. We're going to use the hydrogen bond analysis tool using VMD, so I'm just going to click out and go to that tool. There you go, so it happens to be the first hit for me. There might be some other tools as well; there's another tool that doesn't use VMD, but we're going to use the one that uses VMD. You might notice at this point that the history items selected by the tool are not the correct ones. The DCD input is correct, but the PDB/GRO input is incorrect, and the reason for that is that we ran the PCA visualization and that created a PDB file, so the tool has chosen the most recent PDB dataset. That makes sense, but it's not the one we want. So either choose the correct one, which is in my case number three, the one we've always been using, or, you might have seen that cool trick where you can pull these items in like so, so you could also do it that way, and
then what we're going to do here is select the protein as the first selection and the ligand as the second selection, using the VMD selection notation. So the protein is going to be put into selection one, and for selection two I've copied and pasted resname G5E; this is simply the ligand. If you have a more complex system, these selections will probably not be appropriate. For example, if you're looking at an antibody system with an antigen, where both are proteins, then you'd have to start being more specific about your selections. So you can see for the protein we were pretty lazy: VMD understands what a protein is, and we have one protein in the system, so this makes sense. In terms of the ligand, we want a specific ligand; we don't want to select the waters, we don't want to select anything else, so we're using the resname to be more specific. There's nothing else that needs to be done there, so that should just work, and we can press execute. If you find that your result for this tool is empty, so the tool runs but there's no result, that will usually mean that you've chosen a dataset that doesn't really work, for example if I'd chosen this PCA PDB dataset by mistake. The other reason it might not work is the selections that you've used; perhaps those are not the correct selections to use. So you could figure out, using a visualization aid or a number of these tools, which selection is more appropriate to use. Okay, so that ran pretty quickly, and there are three outputs. We've got the percentage occupancy. The number of hydrogen bonds is a time series, starting at zero, the zeroth frame, all the way to the end: how many hydrogen bonds are identified per frame, so in this case one in the zeroth frame, two in the first, none in the second, et cetera. And then the other output is actually just a log file from VMD. So the most interesting one for now is the
occupancy, and this should agree with what's in the tutorial. You should see that there are six hydrogen bonds identified; two of them are quite interesting and four of them are not very interesting. The ones that are interesting have a high occupancy. So there's the side of the ligand, G5E: the side of that ligand is the donor, and the acceptor for the hydrogen bond is aspartate, and the occupancy is 79% or so. That means that that hydrogen bond is around for, or occurs for, most of the simulation. And the same for this other hydrogen bond, so again the side of the ligand, but this time with glycine. You'll recall that glycine has a side chain that doesn't have an acceptor, so you can see here it says main, so this is probably the carbonyl group of the glycine, so it's the backbone of that glycine, and that occupancy is about 29%. The occupancy just tells us that this hydrogen bond occurs for a certain percentage of the simulation that we've run, and in fact for a certain percentage of the frames that we've saved out. It doesn't actually say whether or not this hydrogen bond is consistent throughout a certain number of frames. So the 29% hydrogen bond over here occurs in 29% of all the frames that we looked at, but that doesn't tell us whether it occurred for the first 30% of the frames and then wasn't there. So we might want to look at a correlation function and figure out whether this hydrogen bond is stable and occurs for long periods of time, or if it's flicking on and off. We're not going to do that today, but that's something that could be considered. If we look at the other hydrogen bonds, the ones with a low occupancy, the reason they're not interesting is that they don't occur for a long time and they're unlikely to be stable, but nevertheless they could be investigated further. So there's this asparagine of this protein with
the ligand, the threonine with the ligand, the lysine with the ligand, and then the ligand with the serine. Your result should then agree with what is in the tutorial, and you can see that the glycine and the aspartate have high occupancy, which is what we found, and the others have a lower occupancy. Okay, so that's that from the analysis side. You've already gone through the high-throughput calculations with Simon. In conclusion, throughout this tutorial you've looked at a protein-ligand system, you've performed molecular dynamics using Galaxy, and you've also looked at the output: what is going on with this particular calculation, and what is meaningful. We've used various forms of analysis, so RMSDs and PCA, and we've looked at hydrogen bonding, and you should feel familiar with MD analysis techniques and MD simulation tools at a basic level. Congratulations, you've finished the Galaxy HTMD tutorial. If you do have any further questions or any issues, please do find us online; we are happy to help you out. Thank you, cheers.