Welcome to this tutorial on performing high-throughput molecular dynamics with Galaxy. The aim of this video is to provide you with an introduction to molecular dynamics simulation and analysis, in particular using the Galaxy data analysis platform. My name is Simon Bray from the University of Freiburg in Germany, and this presentation has been prepared together with Chris Barnett from the University of Cape Town in South Africa, and Tharindu Senapathi from the University of Sri Jayewardenepura in Sri Lanka. So a brief word about the structure of the video. It will be split into two parts, the first on molecular dynamics simulation by me, and the second on MD analysis by Chris. I will start by giving a very basic 10-minute introduction to MD simulation, followed by a demonstration in Galaxy. I'll also show you how we can implement high-throughput MD simulations using Galaxy's concepts of workflows and collections. Then Chris will do the same for analysis. He'll give a short introduction to the concepts behind MD analysis, followed by a demonstration in Galaxy. There are no strict prerequisites for this tutorial, but it would be good if you have some basic knowledge of the Galaxy platform. We won't explain basic features in this video. If you want to get yourself up to speed, we recommend following the Galaxy 101 for everyone tutorial, which you can find linked on the website of the Galaxy Training Network. All right, let's get started with the first part, an introduction to MD simulation. Molecular dynamics, as you may well know, is a computational technique for molecular simulation. The reason why we are generally interested in using it is that it provides a very high level of temporal and spatial resolution in comparison with a large number of experimental methods, which are also used to provide information about the positions or the motions of atoms and molecules.
For example, we can use techniques such as X-ray crystallography, which provide a very high spatial resolution, but essentially a static picture of the molecule. They don't show the motion. We can use techniques like FRET, for example, which give us some information about how the molecule changes in time, but not at a very high temporal resolution. MD simulation allows us to leave the world of experiments and to simulate, as long as we have sufficient compute resources, at in principle unlimited temporal and spatial resolution. So that is more or less the rationale, the reason why we are interested in MD. The principle is that we are simulating the atomic motion with Newtonian physics, so in general, more precise methods which involve quantum mechanics are not involved. So now a very brief introduction to the physics behind molecular dynamics. Like I just mentioned on the previous slide, MD works on the basis of classical mechanics, so we treat the atoms in the system as point masses and the bonds as springs connecting them, and therefore we can treat them as simple harmonic oscillators. The atoms also have a charge which is associated with them, and for each time step we can calculate the potential energy U, which is made up of two components, the intramolecular and the intermolecular components. So these are the interactions within individual molecules and also between molecules. And then over the course of a single time step, so a typical length would be on the order of a femtosecond, so 10 to the minus 15 seconds, we can calculate the forces which are acting on each of the atoms from the potential energy, and if we know these forces we can then calculate how the positions of the atoms change over the course of our time step.
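To make the idea of "potential energy, then forces, then new positions" a bit more concrete, here is a minimal sketch in Python. It is a toy, not what GROMACS does internally: a single particle on a one-dimensional harmonic spring, in arbitrary units, integrated with the velocity Verlet scheme, which is chosen here purely for illustration. All function names and parameter values are assumptions made up for this example.

```python
"""Toy 1D molecular dynamics: one particle on a harmonic spring (arbitrary units)."""


def potential(x, k=1.0, x0=0.0):
    """Harmonic potential energy U(x) = 0.5 * k * (x - x0)^2."""
    return 0.5 * k * (x - x0) ** 2


def force(x, k=1.0, x0=0.0):
    """Force is the negative gradient of the potential: F = -dU/dx."""
    return -k * (x - x0)


def simulate(x, v, mass=1.0, dt=0.01, n_steps=1000, k=1.0):
    """Integrate Newton's equations with velocity Verlet.

    Returns the list of positions (the 'trajectory') and the final velocity.
    """
    trajectory = [x]
    f = force(x, k)
    for _ in range(n_steps):
        v += 0.5 * dt * f / mass   # half-step velocity update with the old force
        x += dt * v                # full-step position update
        f = force(x, k)            # recompute the force at the new position
        v += 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append(x)       # store one frame per time step
    return trajectory, v


# Start with the spring stretched and the particle at rest.
trajectory, v_final = simulate(x=1.0, v=0.0)
energy_start = potential(1.0)
energy_end = potential(trajectory[-1]) + 0.5 * v_final ** 2
print(f"frames: {len(trajectory)}, energy drift: {energy_end - energy_start:.2e}")
```

The list of positions plays the role of the trajectory: one frame per time step. The total energy should stay close to its starting value, which is a quick sanity check that the time step is small enough.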
And then at the end of this time step we have a new structure with new positions, and we can repeat this process thousands or millions of times, and in this way we accumulate a so-called trajectory, which is just a sequence of these frames, so you can view what we do here as collecting a video, which shows the motion of all the atoms in the system over time. One consequence of this, which I think is quite well known, is that MD has a high computational cost, so if you're dealing with a system with thousands of atoms or even millions, then to calculate all of this just for a single time step is already computationally very costly, and if we're dealing with millions of time steps, which we need to get to a substantial trajectory length, then the computational cost gets very high. So there are plenty of applications of MD. We will focus mostly on MD of protein-ligand systems, how proteins and ligands interact with each other, but it can also be used to study, for example, protein folding and conformational changes in proteins, and it also has applications in materials science. One thing that we want to focus on particularly in this tutorial is how we can scale up our MD simulation and analysis to a high-throughput level, so we're not simply calculating a simulation of a single ligand, but we can calculate, say, for 10 or 100 ligands against the same protein in a single run. This is something which the workflow management system that we're using in Galaxy is particularly well suited for. So there are a lot of questions that you have to think about when setting up an MD study, and these will be discussed in the tutorial. You have to think about how to parameterize your protein and your ligand. These have to be done in two separate steps, as we'll explain. In general, you want to think about solvent, because biochemistry takes place in a solvent, that is, water.
Should we add particles with charges, sodium ions or chloride ions, for example? Is the water equilibrated around the system? We generally include special preliminary simulations to ensure that the water is correctly equilibrated. Is the system starting from an energetically minimized state? That is something that we also need to consider. And finally, you might finish your simulation, but what do you want to do with it next? You want to perform some kind of analysis, and then you might want to think about things like RMSD, RMSF, PCA, and we will guide you through some of these techniques later on. So to turn now to Galaxy, there are a wide variety of open source tools which are available in Galaxy for molecular simulation. So you'll be using GROMACS in the tutorial, and for analysis, tools such as MDAnalysis and Bio3D. You can access these tools via the European Galaxy server at cheminformatics.usegalaxy.eu, and that's what we recommend for this tutorial. There is also a South African server, the link is provided on the screen, and you're welcome to try that out as well. And to launch your own Galaxy server is also pretty straightforward. So if you're interested in doing this and making use of your own compute resources, then you can also try this out. So what you see on the screen is the web page of the Galaxy server that we'll be using for this training session, cheminformatics.usegalaxy.eu. So we assume that you have an account there, and you can just log in and either follow the steps in the video, or you can follow the training material by yourself. Okay, so to get to the training material, you click on this little hat symbol at the top of the screen. And you'll see this page, so then navigate to Computational chemistry and then High Throughput Molecular Dynamics and Analysis. So this is the page which describes the whole of the training that we'll be going through in this video.
We have the introductory section, which provides some background information about the protein that we'll be simulating, this heat shock protein 90, or Hsp90. So it's a chaperone protein which is involved in helping to fold proteins after synthesis. And so there's some background information about this which is not crucial for this tutorial, but maybe nice for you to have a read through if you're interested. We have a diagram of Hsp90 with a ligand bound, and you can click to view it in the NGL viewer, which is embedded into Galaxy. So let's get on with starting the analysis. So the first step is to collect the protein structure of the Hsp90 protein. So we can do this using the Get PDB tool, which is available in Galaxy. So just take this accession code 6HHR, which refers to the protein structure, and then let's find the tool. So it's a very simple tool. You just paste it into the PDB accession code field, and press on execute. So already we have the dataset appearing in our history panel. So we can rename the dataset already, to something sensible. Okay, so now our PDB file has successfully been loaded into Galaxy. So let's continue with the next step, which is preparation of the topology that we need for simulating. So as described in the training material, we have to do this in two separate steps for both the protein coordinates and the ligand coordinates. So first of all, what is topology generation? So the simulation software that we're using, GROMACS, makes a distinction between the constant and the dynamic attributes of the atoms in the system. So constant attributes would be things like atomic charge and the bonds which connect the atoms, and the dynamic attributes are things like the positions of the atoms, or the velocities or the forces which are associated with atoms, which can change during the course of the simulation. And the GROMACS software expects that the constant attributes are stored in the topology file.
So this is the TOP file, and the dynamic attributes are stored in structure files, so PDB files or GRO files. And the PDB file has some of this information, but not everything that we need. So we have to carry out an explicit step in which we calculate these parameters for both the protein and the ligand. And here's a small question which you might want to have a think about. Why do the protein and the ligand need to be parameterized separately? And you can click on the solution to find the answer. So the first thing that we need to do, because we want to treat them separately, is to separate the coordinates of the ligand and the protein into two separate files. So if you look at the contents of the PDB file, then the atoms of the protein are all labeled as ATOM, and the non-protein atoms are labeled as HETATM, or heteroatom. So this includes both the solvent molecules, so the waters here, HOH, but also the ligand, AG5E. So to perform this separation of the coordinates into two separate files, we can use one of the text manipulation tools, Search in textfiles. So all this does is it searches all the lines in the input file, and if it finds lines matching a particular pattern, then it will save them in the output. So first of all, to collect the atoms associated with the protein into a single file, we choose our PDB file as the input. We select Don't Match and then under regular expression we type HETATM. So this will ensure that the output contains only lines in our file which are not heteroatoms. So in other words, only the protein. Okay, so in the next step, we want to also separate the ligand atoms out of our initial PDB file. So once again, let's use the Search in textfiles tool to do that. The first thing to be careful about here is to use this PDB file, our original one, rather than the protein-only PDB file, which we've just created. Now we want lines that match the code of the ligand, so that's AG5E, as stated in the tutorial.
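For anyone who prefers to see the logic spelled out, the two text searches just described amount to something like the following Python sketch. It is only an illustration of the filtering idea, not the Galaxy tool itself; the function name `split_pdb` is made up, and the example lines are simplified, invented PDB-style records rather than the real contents of 6HHR.

```python
def split_pdb(pdb_lines, ligand_code):
    """Mimic the two text searches: 'does not match HETATM' and 'matches the ligand code'."""
    # Protein file: every line that does NOT contain HETATM.
    protein = [line for line in pdb_lines if "HETATM" not in line]
    # Ligand file: every line of the ORIGINAL file that contains the ligand code.
    ligand = [line for line in pdb_lines if ligand_code in line]
    return protein, ligand


# Invented, simplified example records (coordinates are placeholders).
example_lines = [
    "ATOM      1  N   GLU A  16       0.000   0.000   0.000",
    "ATOM      2  CA  GLU A  16       1.000   0.000   0.000",
    "HETATM 1700  C1 AG5E A 301       5.000   5.000   5.000",
    "HETATM 1800  O   HOH A 401       9.000   9.000   9.000",
]

protein_lines, ligand_lines = split_pdb(example_lines, "AG5E")
print(len(protein_lines), "protein lines,", len(ligand_lines), "ligand lines")
```

Note that the water line ends up in neither output, which is what we want here, since the solvent is added back in a later step.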
Then we just click on execute and then again, let's give them sensible names. So we've now completed this step and we now have to calculate the topologies for both protein and ligand. So for the protein, we use this GROMACS initial setup tool, which I'll show you now. So if you search for GROMACS, then this will provide all of the tools which are based on the GROMACS software, and we choose this one, GROMACS initial setup. And we want to choose the protein-only PDB file as our input. For the water model, we have various options which are available. We choose TIP3P, so this is a so-called three-point model, which means that the water is modeled by three different charges, so one on the oxygen atom and one on each of the hydrogen atoms. There are also four and five-point versions, which also place the charge, for example, on the lone pairs of the oxygen. For the force field, again, we have multiple options. We will choose the AMBER99SB force field, quite a recent one, but again, there are multiple options that you can use. But we recommend that you certainly use an AMBER force field, because the tool that we use for the ligand topology also generates AMBER topologies. Okay, so then we can just click on execute and then we create three different files, which I'll explain to you when the job is complete. Right, so now we have the task of generating a topology for the ligand. So for this, we use the acpype tool. This provides an interface to AmberTools and it also has the nice benefit that it gives us an output in the format which is required by GROMACS. And as an initial step, we have to add hydrogen atoms to the ligand, because if we look at the PDB file, you will see that these are currently missing. So we can do this using the Compound conversion tool. Under output format, we can just select PDB, and choose 7.4 as the pH at which to add the hydrogen atoms. We simply press on execute.
Okay, so now we have our ligand PDB with the hydrogen atoms and we can check quickly to see that the hydrogen atoms have indeed been included in the output. The next step is to run the acpype tool itself. So it's this one here, Generate MD topologies for small molecules using acpype, and as an input file, we take this dataset, the ligand PDB with hydrogens. The charge of the molecule is zero and the multiplicity is one, which should be correct for almost all organic molecules. And under force field, we select GAFF, so GAFF stands for general AMBER force field, which is a force field which is applicable to almost all small organic molecules. And then the charge method we can select simply as bcc, which is the default option. And we want to save the GRO file, we need this for our simulation, and then we press on execute. Okay, so now that that job has completed, I can say something about the output files of the GROMACS setup tool and the acpype tool. So for the GROMACS setup tool, we have three outputs. We have this TOP output, the topology, we have this GRO output, which contains the coordinates of all the atoms, similar to the PDB file, and we have this ITP file, which is a position restraint file that we can use later on in the equilibration steps. And again, for the ligand, we have a topology file, and we have this GRO file, which contains the coordinates of the atoms. As for the topology, if I show you quickly the contents of the ligand topology, it contains various information, not about the positions of the atoms, but about, for example, their charge, their mass, the different bonds which exist between atoms, angles, dihedrals. This is information which is also required for our simulation. Right, so having now calculated the ligand and protein GRO and TOP files, we now need to combine them back together again. And there is a tool for this in Galaxy, Merge GROMACS topologies and GRO files.
So the tool has four different input files: first of all, the protein topology file, which is this one, the ligand topology file, which is this one here, the protein structure, the protein GRO file, which is this dataset, and the ligand GRO file, which is this one here. So I can just press execute to start the job. So after we've combined the protein and ligand GRO and TOP files to create a single unified GRO file and a single TOP file, the next step is to create a simulation box around the system in which the simulation can take place. So you can do this with the GROMACS structure configuration tool. So again, quite a straightforward tool. We take this GRO file as an input, we want to configure the box and we want to select one nanometer here. So this is the distance between the edge of the protein and the outside of the box. So we can be sure that our protein is at all times at least this distance from the edge of the simulation box. There are different box types that you can use. The triclinic is the most efficient one, so we will use that. Okay, so now we have the solvated TOP and solvated GRO file. And the next step is to create a simulation box in which we place our system. So we can do this using the structure configuration tool. And this is also a simple tool. So it just takes the GRO file this time as an input, this solvated GRO file which was just created. We want to create another GRO file as an output and we want to configure the box. So we set the box dimensions to one nanometer. So this is to ensure that the edge of the protein will always be at least one nanometer from the edge of the box. And we choose the box type. So there are various options, and it's a question of what proportion of the box is occupied by the system compared to the solvent. And it turns out that triclinic is the most efficient option. So that's what we choose here. And we can just press execute.
So having now completed solvation and created the simulation box, the next step is energy minimization. So the aim is to ensure that the system is in the lowest possible energy state. So we conduct a short simulation and wait until it reaches that state. Okay, so we have a specialized GROMACS energy minimization tool. Once again, we need to provide GRO and TOP files. So we provide the GRO file which we just created with the box, and the solvated TOP file from before, sorry, this one, the solvated TOP. Then we have a choice whether we upload our own MDP file. So in GROMACS, you have these MDP files, which contain parameters for the simulation. And you can also choose this customizable option. So there are various parameters, which we can mostly leave as their defaults. We can set the number of steps to 50,000, and the EM tolerance to 1,000. So the idea is that as soon as the maximum force in the system is smaller than this value, then the simulation will end early. Or else, if it reaches 50,000 steps, then it will end. So let's press execute. So having completed the energy minimization step, we now need to check that the minimization has actually converged. In other words, that it has actually reached the minimized state. So there's a Galaxy tool also for doing this, Extract energy components with GROMACS. So I'll show that to you now. So first we need to take the EDR file, which is an output of the energy minimization. And there are various terms that you can calculate. So in this case, the correct one is the default, the potential. You want to select Galaxy tabular as the output format. And then we just press on execute. Okay, I'll just show you the contents quickly. So it shows all the steps of the minimization in the first column and the potential energy of the system in the second column. So we can plot this. So if we go to the history now, you click on this icon here, Visualize this data.
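The two-column table just described can also be inspected programmatically rather than by eye. The following is a rough sketch in Python, assuming a whitespace- or tab-separated file with the step in the first column and the potential energy in the second. The function names, the thresholds, and the synthetic numbers are all invented for illustration; they are not GROMACS output and not a standard convergence criterion.

```python
import math


def read_energy_table(text):
    """Parse two numeric columns (step, energy); skip header or malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            rows.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue  # e.g. a header line
    return rows


def has_levelled_off(rows, tail_fraction=0.2, rel_tol=0.01):
    """Heuristic: the energy change over the last part of the run is small
    compared with the change over the whole run."""
    energies = [energy for _, energy in rows]
    n_tail = max(2, int(len(energies) * tail_fraction))
    total_change = abs(energies[0] - energies[-1])
    tail_change = abs(energies[-n_tail] - energies[-1])
    if total_change == 0:
        return True
    return tail_change <= rel_tol * total_change


# Synthetic example data: an exponential decay towards a plateau (made-up values).
lines = ["step\tpotential"] + [
    f"{step}\t{-500000 + 100000 * math.exp(-step / 50):.3f}"
    for step in range(0, 1001, 10)
]
rows = read_energy_table("\n".join(lines))
print("levelled off:", has_levelled_off(rows))

# A curve that is still descending steeply should fail the check.
steep_rows = [(float(step), -float(step)) for step in range(0, 1001, 10)]
print("steep curve levelled off:", has_levelled_off(steep_rows))
```

This kind of check is no substitute for looking at the plot, but it can be handy when many minimizations are run at once.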
Then we can choose to plot a line chart. So you can do this here, Line chart (jqPlot), and you have to select the axes. So for the x-axis, we plot column one, and for the y-axis we plot column two. And what we see here is this kind of curve, which at first decreases rapidly and then starts to decrease more slowly. We can see here that because it's leveling out, the system is reaching a minimized state. So we could maybe even let the simulation continue for a little bit longer, so that it flattens out completely, but the system is more or less minimized. If we had just seen a steep descending curve like at the start, then it would indicate that the simulation had not reached a minimized state. So having checked that, we can now continue with the next step, which is the equilibration. Okay, so now we've completed the energy minimization, we can continue with the MD simulations. So in this tutorial, we carry out three simulations: two so-called equilibration simulations and then a production simulation. So the production simulation is the real simulation that you can then go on and use in the analysis stage. So what does the equilibration, first of all, involve? So what we want to ensure is that the solvent around the protein is correctly equilibrated. In other words, we want to ensure that it's brought to the correct temperature and to the correct pressure. So for that reason, we conduct the equilibration in two different stages, under the NVT and the NPT ensembles. So NVT stands for constant number of particles, constant volume, and constant temperature, whereas NPT stands for constant number of particles, constant pressure, and constant temperature. And as stated in the tutorial, you might also see the terms isothermal-isochoric for the NVT ensemble or isothermal-isobaric for the NPT ensemble. So to carry out the equilibration, we have to hold the protein in place and we can do this using the position restraint file that was created in the system setup.
So the idea here is that the position restraint file states which atoms have to be held in place. And then during the MD simulation, these atoms are allowed to move, but the motion is energetically penalized. So let's go ahead and start the first equilibration simulation. So we can search for the GROMACS simulation tool. It's this one here, GROMACS simulation for system equilibration or data collection. And once again, we select the GRO structure file and the TOP file. So the GRO file is this one from the energy minimization. And the TOP file is just this solvated TOP file that we created earlier. So now we have various inputs to select. So for the checkpoint file, this is our first simulation, so we don't have a checkpoint file. I'll explain what this is in a second. For the position restraint file, I have to scroll down to the bottom of our history. And for the GROMACS initial setup tool that we ran right at the start, there should be an ITP file. So this one here, which we use. The index file we can also leave blank. That would be if you wanted to specify some custom atom groups, but we can ignore that. Then for the outputs, we do not really need the trajectory in this case, but let's choose to return it. We want to return the GRO file. I want to produce the checkpoint file, which allows us to continue the simulation with our subsequent equilibration and then the production simulation. So again, we want to produce our EDR output, like for the energy minimization. And we want to produce the TPR output. That should be enough. Okay, then as for the settings, here we have a choice as to the ensemble. And this is the NVT equilibration, so we select NVT. Again, we have the choice whether to upload our own MDP file or use customizable settings. So if you are already familiar with GROMACS and you've used it on the command line, uploading your own MDP file might be the option that you're more interested in, so you can customize things more extensively.
But if you're new to GROMACS, then we recommend that you use the customizable settings here. So the leap-frog algorithm is a good choice of integrator here. For the bond constraints, we want to constrain all bonds. We can leave neighbor searching and electrostatics as the default. The temperature, let's choose a sensible temperature of 300 Kelvin. The step length here is a femtosecond. You could also change this, but if you make it much more than a femtosecond, then it's likely that the simulation will fail. So a femtosecond is a good value. And as for the number of steps that elapse between saving data points, a thousand is probably a good value. These values we can leave as the default, and what is important now is the number of steps for the simulation. So we select here 50,000. So if you remember, our step length was a thousandth of a picosecond, or a femtosecond. So 50,000 femtoseconds is equal to an equilibration of 50 picoseconds. And let's generate the log. So in case the job fails, then having access to this log is very useful to help to debug, to find out where things went wrong. So I recommend that you always click yes for this option. Now we are starting with real simulations. We can expect that they take some time. So be prepared for that. And yep, no need to be impatient. The simulation has just finished. And just like for the minimization step, we want to check that the equilibration completed successfully, so that our simulation has actually equilibrated to a constant temperature. So once again, we use the GROMACS Extract energy components tool, this one. And we need to take the EDR file as input, this one here. And so before, we selected potential as the term to calculate, and this time, because we're interested in the temperature, we select, pretty logically, the temperature. And again, we select Galaxy tabular as the output format. So the job's finished. We click again on Visualize this data.
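The arithmetic relating the number of steps to the simulated time is worth having at your fingertips when you choose these settings. Here is a small sketch in Python; the function name and its return format are made up for this example, and the count of saved data points is approximate, since whether the starting frame is counted depends on the software.

```python
def simulation_summary(n_steps, step_length_fs=1, save_every=1000):
    """Relate the settings chosen above to the simulated time.

    n_steps        -- total number of MD steps
    step_length_fs -- length of one step in femtoseconds
    save_every     -- number of steps between saved data points
    """
    total_fs = n_steps * step_length_fs
    return {
        "length_ps": total_fs / 1000,           # 1 ps = 1000 fs
        "saved_points": n_steps // save_every,  # approximate number of saved frames
    }


# The equilibration settings used above: 50,000 steps of 1 fs, saving every 1000 steps.
equilibration = simulation_summary(50_000)
print(equilibration)
```

With the values used here this gives 50 picoseconds and roughly 50 saved data points, which is also about the number of points you should expect to see in the temperature plot.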
And we can again use the jqPlot line chart just as before. So for the x-axis we select column one, and for the y-axis column two. And then you should hopefully see this kind of plot, where the temperature of the system rises from a low value up to around 300, and then it fluctuates within two or three degrees around the 300 Kelvin value. If you're seeing something different, so that a stable temperature value is not reached, then you should consider extending the length of the equilibration. For this system, 50 picoseconds should be sufficient, but for your own system, perhaps a longer equilibration is necessary. So just bear in mind that it's worth checking with this tool. Okay, so the next step is to carry out the next equilibration step, which is now under the NPT ensemble. So we use very similar parameters to before. We can use the same topology as before, but we want to update the GRO file to the one which was produced by our NVT equilibration. So that would be that one. Under inputs, we also want to choose the checkpoint file, which was produced by the NVT ensemble simulation. So the idea of the checkpoint file is that it contains the information about the last state of the simulation, so for example, all the forces which are acting on the atoms, and then we can use these to restart our new simulation from the point where the previous simulation left off. For the position restraints, we again want to select the same file as before. The index file we again leave empty. We want to once again return the trajectory and structure outputs, as well as the checkpoint and EDR files, and the TPR output. Okay, and then of course, we change the ensemble from NVT to NPT. Here we use the same parameters as for the previous equilibration. So we constrain all bonds. We set the temperature to 300 Kelvin. We set the step length to one femtosecond, and the number of steps that elapse between saving data points to 1,000.
And once again, let's equilibrate for 50,000 steps. So after having carried out the NPT simulation, then once again, we want to check that the pressure of the system has converged. So we should expect that it converges on one bar, so atmospheric pressure. And once again, we expect some fluctuation. So this time I will skip this step. Of course, I recommend that you try it yourself, I just want to avoid repeating myself too much. But I will show you the kind of graph that you should expect after plotting the pressure output from the Extract energy components tool. So it should look something like this. Once again, a sharp rise at the beginning, and then a lot of fluctuation. So actually, this looks like a kind of crazy amount of fluctuation at first glance. We expect the pressure to be at one bar, and it's fluctuating throughout the whole of the simulation at something between minus 200 and 200 bar. But actually, this is more or less what we expect. Pressure fluctuates a lot in an MD simulation, and statistically, this is probably not distinguishable from the pressure that we are targeting of one bar. So this is an acceptable outcome, I would say. So now I'll continue with the production simulation. So we are finally ready to perform a longer simulation of the protein without restraints, which we can then go on and use for whatever analysis we want to perform. So for the topology, we use the same topology as for the equilibration steps. We use the GRO file from the NPT equilibration. For the checkpoint file, we again use the checkpoint file from the NPT equilibration. And it's important to note that the position restraint file used in the equilibration steps is not used this time, because this is a production simulation. Then as for the outputs, we once again want to return the XTC file and the GRO file. And just to note here that in the last two equilibration steps, I selected the XTC file, but we did not really need to use this at all.
Whereas for the production simulation, this is the main output, the main outcome that we are interested in, the trajectory. We can produce the checkpoint file, but we will not be proceeding with any further simulations, so it is not strictly necessary. Again, there is nothing to check as far as the EDR file is concerned, but it does not hurt to produce it. I'll produce the TPR file as well. Okay, so once again we have the choice whether to conduct the simulation in the NVT or NPT ensemble. Here I suggest NPT. And this time we leave the bond constraints as the default, so no bond constraints. Temperature of course 300 Kelvin, step length a femtosecond, and the number of steps you can set to a million, so six zeros. That is one nanosecond. So let's start the tool. This is of course a longer simulation, so you should expect that it will take some time to complete. All right, so as you can see, the final simulation job has now completed. So you can see these four datasets being produced. I'm sorry, that's actually six datasets in total. And in particular the important ones are the GRO file, which contains a static structure of the system, and the XTC file, which contains all the frames which make up the trajectory. So what we can do next, and what Chris will show you in a few minutes, is how you can then apply various different analysis techniques to these two files to find out some more about how your molecular system is behaving during the course of the simulation. But what I would like to show you first of all is how we can scale up this kind of simulation to a really high-throughput level. The way that we do that is using two different Galaxy concepts called workflows and collections. So this part of the tutorial is optional and it's covered by a section at the end of the training material. So if I go back to the training material page here, it's this section, Optional: Automating high throughput calculations.
So you can also follow through the instructions here. So what I'll start by doing is navigating to the Workflow tab here at the top. Then you see that we have a list of workflows. I will show you the content of this one, the protein-ligand HTMD simulation. So what a Galaxy workflow is, effectively, is a way of connecting up multiple tools together to form a single pipeline, which you can then run as if it were a single tool. So these boxes here represent two different input files. The rest of the boxes represent different tools, so all the tools that we used so far in the course of the analysis. These pipes represent inputs and outputs of the tools which are being passed in between. So first of all, we start with an SD file containing some ligands that we want to perform MD simulations for. In case you're not aware, an SD file is a commonly used file format for storing chemical information for molecules which have three-dimensional coordinates. We pass it to the Compound conversion tool, which splits it up into a collection. So a collection in Galaxy is a group of individual datasets which we can apply the same tool to. Then for each of these, we run the Generate MD topologies tool, which we already used. We merge the topologies together with the protein topologies, which we have generated from the PDB input file. And then we continue with all the tools you saw already: structure configuration, solvation, GROMACS energy minimization, our equilibration simulations, and finally our production simulation. So just to show you how this works in practice, I can click on Run workflow up here, and I have a history prepared here with two input files: an SD file with three molecules for which I want to conduct molecular dynamics simulations, and the same Hsp90 PDB file we've been using so far. So then you can see the workflow. We have these two inputs. We have our PDB input file, which we select here, and our SD file.
And then we have the various other tools. These have different parameters which we can also adjust. So for example, let's say we want to change the sodium chloride concentration to 0.1 and the step length maybe to 1 femtosecond, so 0.001 picoseconds, and then let's conduct some very short simulations: 1000 steps for the NVT simulation, 1000 for the NPT, and 1000 for the production simulation. These are very short simulations, just to demonstrate what is possible here. And then we simply click on "Run workflow" here. What you'll see is that the workflow will be scheduled and it will produce a lot of different jobs in the history, which will start out gray because they're waiting to run, and then gradually they'll change colour and eventually become green. And we'll see that the final output files that we get at the end are in Galaxy collections, so in groups of three files, one for each of the three molecules contained in this structure data file.
You can see already that the first few jobs have been scheduled: these GROMACS initial setup tools, and this initial step which splits up the structure data file into a collection of Galaxy datasets. And now what you can see here is that the entire workflow has been scheduled and the intermediate datasets are hidden, so we can just see these two collections at the end, which haven't completed yet. We need to wait until the entire workflow completes. But, for example, you can see that if you click on this collection, it will show you the three XTC files, which have yet to be produced but are already grouped together in a single collection. So this provides a really neat way in which you can just upload a couple of input files and then, at the click of a button and after selecting a few parameters, run multiple simulations in parallel for a range of different small organic molecules. The final thing that we can do with workflows in Galaxy is to automate them and run them from the command line.
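The step counts and step lengths quoted for these simulations translate into simulated time by simple multiplication. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction

FS_PER_PS = 1000        # femtoseconds per picosecond
FS_PER_NS = 1_000_000   # femtoseconds per nanosecond


def simulated_time_fs(n_steps, step_length_ps):
    """Total simulated time in femtoseconds.

    step_length_ps is given as a string (e.g. "0.001") so that it can be
    turned into an exact fraction rather than a binary float.
    """
    return n_steps * Fraction(step_length_ps) * FS_PER_PS


# Production run from the walkthrough: one million steps of 0.001 ps (1 fs).
production = simulated_time_fs(1_000_000, "0.001")
# Short demonstration run in the workflow: 1000 steps of 0.001 ps.
demo = simulated_time_fs(1000, "0.001")

print(production / FS_PER_NS, "ns")  # 1 ns
print(demo / FS_PER_PS, "ps")        # 1 ps
```

So the workflow demonstration simulates a thousand times less time per stage than the one-nanosecond production run, which is why it finishes so quickly.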
So if you look now again at the training material, you can see that we describe two different ways that you can do this. One uses the Python library BioBlend; we provide a small Python script which you can use for this. Or, even more simply, you can use the Planemo command line tool to run the Galaxy workflow from the command line in a single command. This section is strictly optional. If you don't feel comfortable using the command line, then feel free to skip it. But I think it might be interesting for some people, and it's also nice to show you what kind of things are really possible with Galaxy.
So here is my terminal. In this directory, I have two files. I have my structure data file, which contains my ligands, as you can see. And then I have this HTMD job file. What this contains is a list of workflow inputs for running a particular Galaxy workflow. And then I can simply run a Planemo command like this: planemo run, then the workflow ID, then my HTMD job file, which I just showed you, and finally a Planemo profile, which contains all the information needed to log into my Galaxy account programmatically. So then I just press Enter. And if I return to my web browser, I should be able to see that a new Galaxy history has appeared. Already the ligands file is being uploaded, and shortly the workflow itself will be invoked and create all the datasets that we saw before.
My aim there was just to show you very briefly what kind of thing is possible with Galaxy and how you can interact with the Galaxy server via the command line. Covering all the functionality of Planemo for running workflows would be the subject of a whole different tutorial. But hopefully this provided you with a bit of information. If you're interested, you can go and research this yourself and look at the Planemo documentation. All right.
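As a rough sketch of the command described here, the snippet below only assembles the argument list (workflow ID, job file, profile) and prints it; it does not contact a Galaxy server. The file name `htmd-job.yml`, the profile name `my-profile` and the `<WORKFLOW_ID>` placeholder are hypothetical, and the `--profile` option is assumed to behave as described in the Planemo documentation, so check that documentation before relying on it.

```python
import shlex


def planemo_run_command(workflow_id, job_file, profile):
    """Build the argument list for a 'planemo run' invocation.

    All three values are placeholders supplied by the caller; nothing is
    executed here.
    """
    return ["planemo", "run", workflow_id, job_file, "--profile", profile]


# Hypothetical values for illustration only.
cmd = planemo_run_command("<WORKFLOW_ID>", "htmd-job.yml", "my-profile")

# Show the command as it would be typed in a terminal.
print(shlex.join(cmd))
```

To actually launch the workflow, the same list could be handed to `subprocess.run(cmd, check=True)` on a machine where Planemo is installed and the profile has been set up.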
So having told you about high-throughput molecular dynamics simulation with Galaxy, that brings my part of the tutorial to a close. I'm going to hand you over to Chris now. He will show you how to use Galaxy to analyse the MD files that we've produced.
Hi, everyone. My name is Chris, and I'll be taking you through the analysis part of this Galaxy tutorial. To start, we'll go through some short background, and after that there is an interactive session. If you have any questions, please do ask us. We will be online to help you out.
So you've already run a molecular dynamics simulation and you've found that it produces incredibly large and complex datasets in different file formats. In this case, you have all the Cartesian positions of each atom of the system under investigation, saved out at a particular interval during the course of the simulation. Now, often we'll have tens of thousands to millions of atoms. For example, just considering this enzymatic system on the left, you have an enzyme, a ligand and some ions in a water box. In this case, there will definitely be quite a few thousand atoms, and your simulation will have many, many time steps, millions of time steps, in order to sample enough of the phase space. So you're often running at least, let's say, 15 nanoseconds of simulation for decent sampling, and perhaps more than that. Once you've done this and you have these outputs, these molecular dynamics trajectories, you then want to conduct a rigorous analysis of this information-rich data in order to obtain scientific insights and conclusions from the simulations. So we're going to look at those analysis tools in this talk.
A general outline of the basic process is that you have some inputs: for example, the structure file, which might be in GROMACS format or PDB format, your trajectory file or files, and also some parameters.
Those parameters might indicate which atoms you want to investigate, or, if you want to change the definition of, let's say, what hydrogen bonding looks like in terms of angle or distance, you might change these parameters. Then we go to the part where we take these inputs and analyse them using some kind of molecular analysis framework. So you're going to extract information and transform it in some way. We'll be loading information as well. There are often many frameworks already available. Although you can write your own custom scripts, we recommend, especially for workflows and for reusability, using existing frameworks that are fairly well tested, such as MDAnalysis, Bio3D, MDTraj and VMD. What we've done is take these frameworks and integrate them into Galaxy so you can process the data. And then you have processed data: the result of this analysis is a table, a graph or a figure, or a combination of these. These are what you would then read into a little bit further to try to figure out what particular property has changed, whether the simulation has converged, or perhaps whether there's some interesting behaviour that's worth investigating. And at worst, maybe you would find out that you need to run a longer simulation or that there's a problem with the simulation that you might not have picked up. So it's very important to look at the processed data. And of course, this will also help for publication if you've got some interesting figures.
So, some commonly used analyses. This is a really rich space; there are many types of analyses. In Galaxy, we've got some of the most common ones that you would want to use. So for example, a time series, if you want to look at a property over time. Then we've got the root mean square deviation and the root mean square fluctuation. We've also got principal component analysis. And often people are interested in hydrogen bonding analysis.
So if you're looking at the active site of a protein and there's a ligand, are there some interesting hydrogen bonds between the ligand and the enzyme over time? There are Ramachandran plots, which are quite useful if you're looking at the torsion angles of proteins. And there are a few other examples. Now, we don't have all of these tools in Galaxy just yet, so I've starred those that are not quite available right now. And perhaps you have a favourite package that you'd like to use. So how do you get that into Galaxy? If it's on conda-forge, it could be added quite easily, and you can contact us online to add new tools. But we'll be covering some of the more commonly used analyses during this tutorial.
Okay, so just to start with, let's focus on a basic time series analysis. The idea here is to measure a property over time. For example, you might want to measure a distance, an angle or a dihedral angle. You also might want to consider non-bonded interactions like hydrogen bonding. And so you'd get a time series of the property you're measuring, for example the end-to-end distance versus the frame number or the time in the simulation. While it can be very interesting to look at this, it can also be misleading, and we should really do a histogram analysis where possible, because a time series on its own does not count how many times a particular value, in this case of the end-to-end distance, is sampled. These time series are still very useful, because we can look at, let's say, a torsion angle and see whether it has sampled phase space relatively well. If we want to improve that, we might run a longer simulation, or we might want to run a free energy simulation or a potential of mean force calculation. There are other ways to consider properties, but this is one of the easy to understand and useful first ways to look at particular properties of interest. And of course, hydrogen bonding interactions are very popular to consider.
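To make the point about histograms concrete, here is a minimal sketch that bins a time series and counts how often each range of values is sampled. The distances are invented numbers, not output from the tutorial, and the function name is made up for illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin.

    Returns a dict mapping bin index -> count, where bin index b covers
    the range [b * bin_width, (b + 1) * bin_width).
    """
    counts = Counter(int(v // bin_width) for v in values)
    return {b: counts[b] for b in sorted(counts)}


# Invented end-to-end distances, one value per saved frame.
distances = [1.02, 1.05, 1.13, 1.48, 1.07, 1.03, 1.52, 1.09, 1.04, 1.06]
bin_width = 0.1
hist = histogram(distances, bin_width)

# Simple text rendering: lower bin edge followed by one '#' per frame.
for b, n in hist.items():
    print(f"{b * bin_width:.1f} {'#' * n}")
```

In the time series the two excursions to about 1.5 are easy to notice, but the counts make it clear that they are rare and that most frames sit in the lowest bin.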
So this is a great time series to use for your analysis.
Next up, let's look at the root mean square deviation, or RMSD. This is a standard measure of structural distance between coordinates. It measures the average distance between a group of atoms, for example the backbone atoms of a protein. The reason you might want to do this is to check, first of all, the stability, and also the conformation of the selected atoms over a simulation. So if you do this for a protein, you could measure the RMSD between an initial conformation and all the frames of the simulation. You'll see that you get a time series and a histogram using the tool in Galaxy. If you look over time, you can see the RMSD did fluctuate a little bit. And now we want to figure out: are there multiple states, or is it pretty much one stable state? Looking at the histogram, you can see in the density that there's a little peak on the left here, but there's pretty much one stable state. There are not many conformations of this protein that are worth investigating. There's just one stable state, which is good for the type of simulation we happen to be doing here. So this tool is really useful for the purpose indicated over here.
If you also want to look at the fluctuation around a particular residue, you would use the root mean square fluctuation, or RMSF, where we look at the average deviation of a particle over time with respect to a reference position. This is very useful for looking at the amino acids in a protein. If we look at the plot that's produced in Galaxy, it's the RMSF versus the residue position. You can see that the fluctuation per residue changes, and some residues fluctuate more than others. So we'll use this kind of analysis to identify the most dynamic areas of the system and see if we want to investigate those further.
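The two quantities just described can be sketched in a few lines of plain Python. This is only an illustration on a made-up two-atom trajectory: it assumes the frames are already superimposed (real analysis tools usually fit each frame to the reference first) and it takes the time-averaged position as the RMSF reference, which is a common choice but an assumption here.

```python
import math


def rmsd(frame, reference):
    """Root mean square deviation between two lists of (x, y, z) positions."""
    squared = [
        sum((a - b) ** 2 for a, b in zip(atom, ref_atom))
        for atom, ref_atom in zip(frame, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


def rmsf(trajectory):
    """Per-atom fluctuation around each atom's time-averaged position."""
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(f[i][k] for f in trajectory) / n_frames for k in range(3)]
        msd = sum(
            sum((f[i][k] - mean[k]) ** 2 for k in range(3)) for f in trajectory
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


# Toy trajectory: atom 0 never moves, atom 1 drifts along x.
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]

# RMSD of every frame against the first frame gives the time series.
rmsd_series = [rmsd(frame, trajectory[0]) for frame in trajectory]
# RMSF gives one value per atom (per residue when using C-alpha atoms).
rmsf_per_atom = rmsf(trajectory)

print(rmsd_series)
print(rmsf_per_atom)
```

The RMSD series has one value per frame, which is what gets plotted against time and binned into a histogram, whereas the RMSF has one value per atom, which is what gets plotted against residue position.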
Often the C and the N terminus, so the beginning and end of the protein, fluctuate quite a lot, and we might ignore that. But we're often looking for RMSF values significantly above one, and that would indicate that the area is very dynamic and is moving a lot during the simulation. If you don't expect that, it might indicate that there's some conformational change that needs to be investigated, that something is wrong with the simulation, or maybe that something interesting is really happening. So this is another way to consider whether your simulation is going as expected and whether there's an interesting area to look at.
Next up is PCA, or principal component analysis. This is a well-known statistical method. It's often used for dimensionality reduction. The reason we want to use it here is that we have a molecular system with many atoms, each with Cartesian positions, and these change during the course of the simulation. So for each Cartesian position we have three coordinates, x, y and z, times the many atoms that we have. That is a large, multidimensional complexity. We have this huge space and we want to reduce its dimension from whatever it is, several thousand dimensions, to something that we can understand. For that we use PCA. What we do is calculate the covariance matrix and diagonalize it to get eigenvalues and eigenvectors, and these are then the principal components. Essentially, these are the components that have the highest variance, so these are the things that are moving around in the system a lot. They are also orthogonal to each other, so they're independent. So we have these eigenvectors, and we can consider this movement and ask: is this interesting, and what's going on? It's really useful to study the essential dynamics of a system using PCA, and we will then be looking at statistically meaningful motions.
One of the results that you'll get in Galaxy is this set of plots where principal components one and two are plotted against each other, then two and three, then three and one, and then also this scree plot, or eigenvalue rank plot. Let's look at PC2 versus PC1. The amount of variance that each component is responsible for is indicated in brackets. The data is coloured over time from blue, which is the beginning of the simulation, through to red. You can see that over time the system moved from a positive region of principal component one space to the negative region, and in principal component two it moved from the positive region to the negative region and back to the positive region. So you can figure out whether or not your system is going through repetitive motions, and whether it's sampling this principal component space in an interesting way. Now, of course, these principal components are from a very large molecule, so they will not be as elegant as a simple bond vibration. There's often some complexity here. So it's very useful to extract and visualize these components, which you can do in Galaxy, and I'll discuss that in the interactive session.
What's often the case is that the scree plot shows us that only a few principal components are responsible for a lot of the variation in the system, usually up to five. So one, two, three, four, five, maybe in this case six or seven, are responsible for about 50% of the variance. Those are the ones that we want to consider. We tend not to consider the principal components out here, because they're not responsible for very much of the variance.
Okay, so I've discussed all the basic analyses that we'll do in this tutorial. Just a reminder of the frameworks we're using: MDAnalysis, MDTraj, Bio3D and VMD. You can access the training materials on the following websites.
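The reading of the scree plot described above amounts to turning eigenvalues into fractions of the total variance and accumulating them. A minimal sketch follows; the eigenvalues are invented for illustration and are not taken from the tutorial's simulation.

```python
def cumulative_variance(eigenvalues):
    """Cumulative fraction of the total variance, largest eigenvalue first."""
    total = sum(eigenvalues)
    running = 0.0
    cumulative = []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        cumulative.append(running / total)
    return cumulative


def components_needed(eigenvalues, threshold):
    """Smallest number of leading components reaching the given fraction."""
    for rank, fraction in enumerate(cumulative_variance(eigenvalues), start=1):
        if fraction >= threshold:
            return rank
    return len(eigenvalues)


# Invented eigenvalues: a few large ones followed by a long flat tail.
eigenvalues = [18.0, 11.0, 6.0, 5.0, 4.0, 3.0, 3.0] + [1.0] * 50

cumulative = cumulative_variance(eigenvalues)
print(cumulative[:3])                       # variance captured by PC1..PC3
print(components_needed(eigenvalues, 0.5))  # components needed for 50%
```

With these made-up numbers the first seven components reach half of the variance, and the fifty remaining ones each add very little, which is the pattern that justifies ignoring the tail.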
In fact, it might be a little more convenient to do this from your Galaxy session. If you look at the top bar, you'll see the graduation cap, or academic hat, icon. If you click on that, you'll be able to access the training material. So thanks very much for watching this intro session on analysis. Let's move on to the interactive session.
Hi, everyone. Welcome to the interactive session. We're going to go through the analysis part of the HTMD tutorial together. You should go to your favourite instance of Galaxy that has the computational chemistry tools. I'm using usegalaxy.eu, and I've gone to the cheminformatics subdomain. It's going to show mostly the chemistry tools on the left, the chemical toolbox, as well as a few other useful tools, but nothing outside of this domain. So it'll be easier to find the tools we're working with. We can access the training by clicking on the academic hat over here, and it'll automatically take you to HTMD. If it doesn't, for some reason, you can navigate to Galaxy Training, then Computational Chemistry. You'll scroll down to HTMD, click on the hands-on, and then scroll down. We're currently doing analysis, so you'll click over there, and we'll start the section.
Before we get started, I just want to clean up my history. I have a history with some of the setup and some of the analysis going on as well, but it's very complicated to look at. So I just want to start with only what we need for this analysis. What we need is the GROMACS output which was created during the final simulation that you did. That final simulation would have been your production simulation, and it would have produced a number of outputs. In particular, we want to get the XTC file for the trajectory and the GRO file, which is the structure file. So I'm going to take this history that I've got and make a new history. I'm going to history options, and I'm going to go to "Copy datasets".
I'm just going to scroll down and figure out which one it is. I think it should be called something similar to "GROMACS simulation". And I particularly want the XTC data and the GRO data. If you click on a history item, it'll expand and show you some more information. In my case, the history items that are interesting are number 36, which is the GRO output from the simulation, and number 37, which is the XTC output. So 36 and 37, which are those two there. I'm going to copy just those two to a new history called htmd_analysis_2021, and I'm going to press copy. That'll take a moment. You can navigate to that history the usual way, but you'll see that there's this pop-up over here that explains that you can click on this link. So you can just click on that and it takes you to the new history.
So we're at this new history and now we can get going. To get started, I'm going to click on the training materials again and scroll down to analysis. You've completed the simulation and you might be asking a whole lot of questions like: has it converged? And what interesting molecular properties can I observe that are relevant to this particular system, so to your research on this particular project? In this case, we're looking at heat shock protein, and we're just going to do some basic analyses to get used to what's available in Galaxy.
The first thing I want to do here, and you'll do this as well, is create a PDB file. The reason for that is that most of the analyses in Galaxy right now use a PDB file as the input structure, and we have a GRO format file. There are other ways to do this, but I find this the best way, and we're just going to follow along and use the following tool. So we're going to use the GROMACS structure configuration tool, and what we're going to do is just convert the structure file.
So you convert the GROMACS file to PDB format. What I've done here is copy the name of the tool and paste it into the tool search box, and hopefully we'll find the right tool. You can see that it's the second hit: GROMACS structure configuration. And then, just to make things a little bit easier, I'm going to make this window full screen. For me, I have to press Function and F11; for you that might just be F11. So now it's full screen. What I'm going to do here is use this tool to convert the GROMACS structure file, which is history item number one. You can see, there it is; it's in GRO format. And I'm going to output a PDB file, so you have to change the output format. That's all that we have to do. If you want to double check, you can click on the training materials again, and you can see we changed the output format and set the box configuration to this. We don't need any of this detailed information, so I'm just going to press execute. That will then start on the right-hand side in your history. It's currently waiting in the queue, so it's gray. That should start shortly, and when it's complete, we will have a PDB file.
In the meantime, while that's waiting, let's move on to the next step. Let's convert the trajectory file from XTC format to DCD. We'll use the MDTraj file converter. Now the first job has just started running; it's gone yellow. I'm going to search for the converter, and there it is. The input that's selected is history item two. By default, this converter knows that you want to convert a trajectory type, so XTC or DCD, and it won't select the other history items. And we're going to convert this to DCD. I'm always going to go back to the training materials to double check I'm doing the right thing. Yep, doing the right thing. And I'm going to press execute. You can see that it is now waiting in the queue to run.
And the other job that we ran a moment ago, the conversion from GRO format to PDB format, has completed. So I'm just going to click on that and see what it looks like. You can see that the content of this dataset is in PDB format. It's not showing all of it because it's quite a large dataset, so we just see the first few lines. You can see there's the first amino acid, an alanine, with all the atoms of the alanine and the Cartesian positions x, y and z, et cetera. Okay. So the next job is running. That's excellent.
What happens after we've got these structure and trajectory files is that we can actually start the analysis. The first analysis that we're going to run is an RMSD analysis of the protein. In this case, we're going to use it to check that the protein in the simulation looks pretty stable. As you saw previously, this is a standard measure of structural distance. Here we'll use the C-alpha carbons of the protein backbone to figure out if the protein is stable throughout this trajectory. So we're going to use the RMSD analysis tool. Again, I'm just going to use Ctrl+C, or right click and copy works as well, to copy the name of the tool. We just want to select the domain to be C-alpha, and then we should end up with an output like this. So let's do that. I press Escape, or rather just click outside the training material area, to get back, and I paste the tool name in. Okay, there it is: RMSD analysis using Bio3D.
The DCD file that we want to use is this fourth history item. Note that we could choose number two, and you can see Galaxy would auto-convert that XTC to DCD. I personally haven't done it this way; I always use the conversion tools that we've set up for computational chemistry. If you'd like to try it the other way, feel free to see if it works, but I recommend doing it this way. And the same goes for the PDB input.
Choose history item three if you've got a new history, or as appropriate for your history. Then under "Select domains" you've got a few options, and the domain we want to choose is C-alpha, as per the training material. We don't need to specify anything further. I'm sure it's clear already, but a lot of these tools have further information about what the tool does, what the inputs are and what outputs you can expect. So you can always read more about these tools, not only in the training materials but also on the tool page itself. So let's execute that job. You can see it's waiting in the queue; it's gray. At this particular time, there's a whole lot of Galaxy jobs going on on this particular server, so while I'm running this, maybe these jobs will be a bit slower than usual. It's still waiting in the queue, so what I want to do is start the next job in the meantime.
So that was the protein RMSD, and we expect results like that. We can also set up the ligand RMSD. It uses the same tool, so the RMSD analysis tool, with the same trajectory and PDB, but we're just going to change the domain. I hope you can see that the RMSD job is running in the background now; there's a yellow colour over there, which means the job is running. What I'm going to do is copy the name of this residue ID. What we're trying to do here is also check whether the ligand is stable. In that case, it's good to choose the ligand atoms, and the name of the ligand that we have, or the residue ID of the ligand, is G5E. So that's what we're going to choose. I'm going to select that. And we're going to use the same tool. A trick I often use is to do the following. Here's our previous job for the protein that's still running. I'm going to say "Run this job again". And it uses exactly the same tool.
It has exactly the same inputs as you selected before, and of course the same parameters. I'm just going to change those a little bit. Everything else is the same except the residue. So I'm going to go down to the residue ID, and I'd already copied the name of that residue, which is G5E. If you need to go back to look it up, just do so. And then I'm going to press execute. So of those jobs, one has already finished. The ligand job is now waiting in the queue, so it's gray. And now that RMSD analysis job for the ligand is running.
Let's have a quick look at the RMSD analysis for the protein. Let's go ahead and check that the output you get is very similar to what I've shown here in the tutorial. We should see at least two outputs, if not three, and for the RMSD tool there are actually three outputs. There's your tabular data, which is just a time series. Then there's a plot, which is a time series of the RMSD versus the frame number, so effectively over the course of the simulation. And then there's also a histogram plot, which takes all the time series data and bins it over the RMSD. All these outputs look fairly similar to what's in the training material. The reason they might be a little bit different is, remember, that we're running very short simulations because it's a tutorial. Also, when you run an MD calculation, a random seed is used to set the velocities of the atoms at the beginning of the simulation, so we expect that there will be some difference. You might not have exactly the same result as this, but it should be very similar. What's happening in terms of the time series is that the RMSD is generally staying around one. There seems to be some change a little bit further on in the simulation that might be worth investigating.
You'll then see in the histogram plot that it's quite a broad region, I guess from 0.5 to about 1.5. You've got a unimodal-like distribution, but there is a little bit of an edge over here. It seems that the protein conformation generally stays fairly stable, averaging around one, although there's this little edge. So maybe in a longer simulation we would see something interesting going on.
And then for the ligand, let's go a little bit further down. You should have a similar result to this. We expect that the ligand will have a stable binding mode for the length of the simulation we're running, because this is a well-known system. So you expect nothing dramatic to happen. If we look at the outputs for the ligand, again there's the raw data and then the RMSD plots. That looks pretty stable. The RMSD is about 0.3 on average, but there's some movement. To figure out whether there is a lot of movement, we need to look at this distribution as a histogram. So here we go. It's centred around 0.3 and a little bit skewed to the right, but generally this ligand looks like it's fairly stable. It's a unimodal distribution and the ligand is stable in the active site. And of course, in conjunction with these analyses, you could also go and visualize the output of these simulations to confirm this. But now we have a quantitative measure of the RMSD of the protein and the ligand over time.
Okay, so that's the first type of root mean square analysis. Next up, we would like to look at the root mean square fluctuation. It's valuable to consider, especially for the protein, because we can look at the deviation of these amino acids over time and figure out if there are large structural changes or large dynamic changes at a particular position. So let's use the RMSF analysis tool. Let's search in the left panel. There it is. And again, the trajectory and PDB should be automatically selected correctly here.
And the domain that we want to look at again is C-alpha. The RMSF works really well on proteins, or on molecules with distinct residues in them, like a protein. I'm just going to execute that. We should have two outputs, the raw data and the RMSF plot. So that is queuing in the meantime. The expected result of this would be a plot of the root mean square fluctuation versus the residue position. We really don't expect any large fluctuations in this calculation, but anything above one would be considered a large fluctuation. As mentioned earlier, the first and last residues tend to fluctuate a lot and be quite flexible. But otherwise, we don't expect this protein to be particularly flexible, especially with the short simulation we're running, and as it's in a stable conformation. If it were the case that there was a large flexibility, let's say around residue 100, where you can see there is some flexibility, we could figure out why that is. Perhaps those amino acids are the ones that are in contact with the ligand, or there's other dynamic motion in this protein, and it might then be useful to visualize the protein and see what's happening there.
So I'm going to click back to the output. In your history, you should have the RMSF raw data, which is the tabular data: for each residue, just labelled here from one to the end, the RMSF is calculated. And the plot looks as follows. It's a little bit different to the one in the training material example. Mainly, I guess around residue 100 or so, there's a little bit of a change over here. I don't think that this is necessarily relevant; anyway, anything around one is considered fine. But if you were worried about why this is moving so much at this position, it could be investigated further, again via a visual analysis or by looking at another time series, perhaps of these residue positions.
So you would look at the amino acids at 50 and whatever's near to them to figure out why they're moving around. We might see this in the PCA as well. So from a general look at this, it's probably okay. This is expected motion for this particular simulation.
Okay. So moving on, next up is the PCA, or principal component analysis. What we're trying to do here is take all the motions of the system that are statistically meaningful and try to understand what they are. We're going to use this PCA tool, for which we're going to choose the C-alpha domain of the protein, and we should get something like this. So we'll be able to look at the principal components of the protein over time. I'm going to choose the PCA tool. And there it is: PCA using Bio3D. Again, your trajectory and PDB inputs should be the same as before, so in my case that's history items four and three, and the domain to use is C-alpha. There are quite a lot of outputs for PCA. There's a PCA plot and a cluster plot, and then the first principal component plotted versus the RMSF. That's quite interesting: we can compare the RMSF to the first principal component. And you also get some raw data. So I'm just going to press execute. Okay. So that has been queued and we're waiting for it to run. And okay, it's just turned yellow, so it's starting to run.
You should expect to have a similar result to this one, where the first three principal components are, of course, the ones of highest variance and behave similarly to what we see here. A question you might have, though, is: what is the principal component? You don't know in advance what the principal component is. It's just the collective motion of the group of atoms with the highest variance. We're using the C-alpha atoms here, so it tells you which of the C-alpha atoms have the highest variance, in a motion that is orthogonal to the other principal components. So you might want to visualize that, which we can do.
Another thing that we might want to do is look at the cosine content. This calculation gives you some idea of whether your simulation has converged or not. We don't expect the simulation to be converged. I'll run that in a moment.
So here are the four PCA outputs. First, the raw data. This is for the first three principal components, if I'm correct. Then you've got a PCA plot, which is what's shown in the tutorial. We'll come back to that in a moment. There's also a cluster plot. By default, the number of clusters to look for is two, and in this case perhaps three would have been a better choice for PC2 versus PC1. It's the same as the previous plot, but color coded differently, because it's colored by cluster rather than by time. Here, in the PCA plot, it's color coded over time, so blue is at the start and red is at the end of the simulation. The fourth output is PC1 versus the RMSF. You can see that they're not identical, but there are some similarities between the first principal component and the root mean square fluctuation.
Okay, so before having a look at the cluster plots, let's start up the next calculation, which is the cosine content. It just needs the trajectory and the structure file, and it will give an indication of whether or not the simulation is converged. Let's do it on the first three components, and we're going to analyze the first component. In this case, the tool uses a zero-based index, so zero is the first principal component. That's a zero over there. Then we can just press execute, and this will result in a cosine content value.
While that's queued and running, we can discuss what's happening with this PCA. If you have a look at your PCA plot, you'll see PC2 versus PC1. Over time, there's movement along PC1 from its negative region to a positive region, and the same for PC2: it moves from negative to positive and actually back again.
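The cosine content mentioned above can be sketched in a few lines. This is a minimal version of the commonly used definition, which compares the projection on one principal component with a half-period cosine; it is an illustrative assumption about what the tool computes, not its actual code, and the two test signals are invented.

```python
import numpy as np

def cosine_content(projection, i):
    """Cosine content of the projection on principal component i (zero-based index).

    Values close to 1 suggest the motion looks like random diffusion,
    i.e. the simulation has not converged along this component.
    """
    p = np.asarray(projection, dtype=float)
    n = len(p)
    t = np.arange(n)
    cos = np.cos(np.pi * t * (i + 1) / n)          # half a cosine period for i = 0
    # Integrals approximated by simple sums over frames (unit time step).
    return (2.0 / n) * np.sum(cos * p) ** 2 / np.sum(p ** 2)

frames = np.arange(1000)
slow_drift = np.cos(np.pi * frames / 1000)             # looks exactly like a half cosine
fast_motion = np.sin(2.0 * np.pi * 20 * frames / 1000)  # many oscillations, well sampled

print(cosine_content(slow_drift, 0))   # close to 1
print(cosine_content(fast_motion, 0))  # close to 0
```

The zero-based index here mirrors the tool's convention, where zero selects the first principal component.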
Based on what's happening in these plots, you could decide whether it is some kind of repetitive motion or not, and you can look at the sampling of that principal component space. You'll also see, in your scree plot, the proportion of the variance. You can see that the first three components are responsible in total for about 35% of the variance. The first component accounts for 18%, which you can see over here on the scree plot, the second for 11%, together about 30%, and so on. So those first three components are really showing you the dominant part of the movement.
Now the question is, what is that movement? The next tool that we're going to run after this cosine content is a tool that saves out the principal components. I'll select the first one, and you can then visualize that and see what the motion is. So let's start that up as well. Let me go through the training. This is the PCA visualization tool. If you search for this tool, it's going to be here; it's the third one down. Again, use the same trajectory and structure, and the same selection. If you've used a specific selection in your PCA, please choose the same one here. We chose C-alpha, so keep that same selection. And let's say we want to look at the first principal component, PC1. So we choose execute. This will result in a PDB file, which actually has a number of structures in it over time. It's a little bit like a trajectory, but just as a PDB file. This file can then be downloaded and viewed using your favorite visualization tool, and then you can figure out what that first principal component looks like. For example, in the simulation that we ran for the training, the motion for PC1 looked as follows. That was the principal component we found. And again, because this was run on a short simulation, you would probably need to run a longer simulation to confirm this.
But you can see that the principal component here is quite complex, right? It's not just a vibration of a single bond. There's a whole lot of things going on there. You can see there's a bit of a wagging motion at the back. In your case, you might see a slightly different motion.
So once again, I've jumped ahead. We did run this PCA cosine content calculation, so let's have a look at what that did. If we look at the outcome of the PCA cosine content calculation, you should get a number around 0.93, and that should indicate the simulation hasn't converged. The tricky thing with the PCA cosine content is this: when it's close to one, it indicates the simulation is not converged and a longer simulation is needed. For values below 0.7, we can't make a statement about the convergence. So we can never know definitely that we've converged, unless we use a different type of simulation: with a free energy simulation, we could converge over a particular degree of freedom. If you have a look at your results, my result happens to be 0.093, which is a little bit unexpected. It's a value below 0.7, and what that means is that I cannot make a statement about the convergence or lack thereof. It's not that it's not converged; it's just that I can't make a statement about it. And in this case, you know, it's a short simulation. I expect to need to run a longer simulation. You should have a similar number. Perhaps your number will be closer to one. You would then be able to say whether it's definitely not converged or whether you're effectively unsure.
Okay, so that's the cosine content, and we've already considered this visualization. So finally, let's look at hydrogen bonding analysis. These are really worth investigating, in particular persistent hydrogen bonds.
In this case, we'll look at the protein and the ligand, and at the hydrogen bonds that potentially form between them. We're going to use the hydrogen bond analysis tool using VMD. So I'm just going to click out and go to that tool. There you go. It happens to be the first hit for me. There might be some other tools as well; there's another tool that doesn't use VMD, but we're going to use the one that uses VMD.
You might notice at this point that the history items selected by the tool are not the correct ones. The DCD input is correct, but the PDB/GRO input is incorrect. The reason for that is that we ran the PCA visualization, and that created a PDB file, so the tool has chosen the most recent PDB dataset. That makes sense, but it's not the one we want. So either choose the correct one, which in my case is number three, the one we've always been using, or use that cool trick you might have seen, where you can pull these items in like so.
What we're going to do here is select the protein for the first selection and the ligand for the second selection, using the VMD selection notation. So "protein" is going to be put into selection one. For selection two, it can be copied and pasted: "resname G5E". This is simply the ligand. If you have a more complex system, these selections will probably not be appropriate. For example, if you're looking at an antibody system with an antigen, where both are proteins, then you'd have to start being more specific about your selections. You can see that for the protein, we were pretty lazy. VMD understands what a protein is, and we have one protein in the system, so this makes sense. In terms of the ligand, we want a specific ligand. We don't want to select the waters, and we don't want to select anything else, so we're using the resname to be more specific. There's nothing else that needs to be done there.
So that should just work, and we can press execute. If you find that your result for this tool is really odd, in that the tool runs but there's no result, that will usually mean that you've chosen a dataset that doesn't really work, for example if I had chosen this PCA PDB dataset by mistake. The other reason it might not work has to do with the selections that you've used; perhaps those are not the correct selections. You could figure out which selection is more appropriate by using a visualization aid or a number of these tools.
Okay, so that ran pretty quickly, and there are three outputs. We've got the percentage occupancy, and the number of hydrogen bonds as a time series: starting at zero, the zeroth frame, all the way to the end, how many hydrogen bonds are identified per frame. In this case, one in the zeroth frame, two in the first, none in the second, et cetera. The other output is actually just a log file from VMD.
The most interesting one for now is the occupancy. This should agree with what's in the tutorial: you should see that there are six hydrogen bonds identified. Two of them are quite interesting, and four of them are not very interesting. The ones that are interesting have a high occupancy. In the first, the side of the ligand, so G5E, is the donor, and the acceptor for the hydrogen bond is aspartate. The occupancy is 79% or so, which means that this hydrogen bond is around for most of the simulation. The same goes for this other hydrogen bond: again the side of the ligand, but this time with glycine. You'll recall that glycine has a side chain that doesn't have an acceptor, and you can see that here it says "main". So this is probably the carbonyl group of the glycine, which is in the backbone of that glycine. That occupancy is about 29%.
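To illustrate what the occupancy number does and does not capture, here is a small sketch, not the VMD tool's own code. It assumes we already have, for one hydrogen bond, a per-frame flag saying whether the bond was detected; the two example series are hypothetical.

```python
import numpy as np

def occupancy(present):
    """Percentage of saved frames in which the hydrogen bond is present."""
    present = np.asarray(present, dtype=bool)
    return 100.0 * present.sum() / present.size

def longest_run(present):
    """Length (in frames) of the longest uninterrupted stretch with the bond present."""
    longest = current = 0
    for flag in present:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest

# Two hypothetical bonds with the same occupancy but different behaviour.
block = np.array([True] * 30 + [False] * 70)   # present for the first 30 frames only
flicker = (np.arange(100) % 10) < 3            # on for 3 frames, off for 7, repeatedly

for name, series in [("block", block), ("flicker", flicker)]:
    print(name, occupancy(series), longest_run(series))
```

Both series give 30% occupancy, yet one bond persists for 30 consecutive frames and the other never lasts more than 3, which is why a lifetime or correlation analysis can be a useful follow-up.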
So the occupancy just tells us that this hydrogen bond occurs for a certain percentage of the simulation that we've run, or in fact for a certain percentage of the frames that we've saved out. It doesn't actually say whether or not this hydrogen bond is consistent throughout a certain number of frames. Let's say that the 29% hydrogen bond over here occurs in 29% of all the frames that we looked at. That doesn't tell us whether it occurred, say, for the first 30% of the frames and then wasn't there anymore. So we might want to look at a correlation function and figure out whether this hydrogen bond is stable and occurs for long periods of time, or whether it's flickering on and off. We're not going to do that today, but it's something that could be considered. If we look at the other hydrogen bonds, the ones with a low occupancy, the reason they're not interesting is that they don't occur for a long time and they're unlikely to be stable. Nevertheless, they could be investigated further. So there's another residue of this protein with the ligand, the threonine with the ligand, the lysine with the ligand, and then the ligand with the serine. Your result should then agree with what is in the tutorial, and you can see that the glycine and the aspartate have high occupancy, which is what we found, and the others have a lower occupancy.
Okay, so that's that from the analysis side. You've already gone through the high-throughput calculations with Simon. In conclusion, throughout this tutorial you've looked at a protein-ligand system. You've performed molecular dynamics using Galaxy, and you've also looked at the output: what is going on with this particular calculation, and what is meaningful? We've used various forms of analysis: RMSD, RMSF, PCA, and hydrogen bonding. You should now feel familiar with the basic level of MD analysis techniques and MD simulation tools. Congratulations, you've finished the Galaxy HTMD tutorial.
If you do have any further questions or any issues, please do find us online. We're happy to help you out. Thank you. Cheers.