Welcome to this BioExcel Winter School short talk about using GROMACS and pmx with the BioExcel Building Blocks library to tackle a particular COVID-19 research project. My name is Adam Hospital; you already know me from the BioExcel Building Blocks lectures in this Winter School.
Let me start with a really brief, one-slide introduction to GROMACS, pmx and the workflows that we built with the BioExcel Building Blocks library. You already know all of this because you have been attending the lectures in this Winter School, but GROMACS and pmx are two of our key applications in BioExcel. We use GROMACS to run molecular dynamics simulations and pmx to run free energy calculations. In BioExcel we use the Building Blocks to build workflows that combine these two tools with many other biomolecular simulation tools, thanks to the interoperability that the library offers. We execute and control the workflows with different workflow managers. In this particular example I will present results obtained with the PyCOMPSs workflow manager, which is developed at the Barcelona Supercomputing Center and is HPC-focused, as you will see.
This project started when the pandemic appeared. At that moment many groups in the biomolecular simulation field became interested in the COVID-19 infection mechanism, that is, in how SARS-CoV-2 is able to infect human cells. I'm sure you all recognize the pictures on this slide, but just as a reminder, this is the SARS-CoV-2 virus. The red protuberances that you see here are the spike proteins. The spike proteins are the ones that attach to the human, or host, cell, and they do that using this protein represented here in green, the receptor-binding domain (RBD) of the virus, which attaches to the human ACE2, the angiotensin-converting enzyme 2 of the human cell.
Here is a schematic representation of the ACE2 receptor and, in green, the RBD. What we are interested in is just the interface between both proteins, the interface that allows the virus to recognize the human protein and start the infection. So we are interested in this interface highlighted in yellow on the RBD, but also in the corresponding interface on the human ACE2 protein. In particular, we are interested in the mutations at this interface: how the RBD interface is mutating, how these mutations affect the infection of human cells by the virus, and also how the polymorphisms in the human ACE2 protein affect the infection by SARS-CoV-2. This is the first part of a more ambitious set of objectives.
The second phase of the project was to go from just SARS-CoV-2 to different variants of the virus. For example, SARS, the previous virus that appeared in 2003 and affected the human population, but also variants of the current virus, such as the American variant. Or why not different species? This one is the virus that affects bats. We are interested in understanding which mutations lead from the bat virus to SARS-CoV-2, the one that is affecting us. How did the virus evolve? What mutations appeared going from one species to another? And how do these mutations affect the infection?
Finally, in the most ambitious phase of the project, we want to look at many different species (pangolin, civet, bat, human), at different variants of the virus, and also at different variants of the ACE2 protein, which is the one that the virus recognizes in each species.
So, as you can see, this is a very ambitious project, in which we want to compute the differences in binding between the RBD and the ACE2, human or from other species, using this biomolecular simulation workflow. We want to look at the variants of the different viruses and at the differences in their infectivity: whether we can identify mutations that increase infectivity, how these mutations affect the infection, where they are located, and why they affect the binding between both proteins so much. The final objective is to ask whether we can predict the future evolution. That is, with all the information gathered here, can we start predicting whether a particular mutation, one that is already there or one that could appear in the near future, will increase infectivity?
As you can see, this is a combinatorial explosion. It is a really complicated project with many different combinations to compute, and you will see in this presentation how we plan to tackle this combinatorial explosion, basically using HPC resources with many cores, that is, supercomputers.
So, how do we predict the effect of protein mutations on the binding of the RBD-ACE2 complex? We do it using this thermodynamic cycle. I'm not going to go into detail here. I'm sure you already recognize this kind of cycle: you have attended the pmx lecture, and you may have seen this slide there. Basically, what we want to extract are relative protein binding free energy differences, and we do that using alchemical free energy methods. To obtain this delta-delta G we need to compute these two delta Gs here, number one and number four. Number one comes from the monomer, the RBD if we are interested in mutations on the RBD or the ACE2 if we are interested in mutations on the ACE2, going from the wild type to the mutated structure.
We do the same for the RBD-ACE2 complex with the same mutations, going from the wild-type complex to the complex with the mutations. We extract these two delta Gs, subtract one from the other, and obtain the final delta-delta G, which tells us whether the mutations affect the binding between the two proteins, RBD and ACE2.
On a more technical, detailed slide, and this is exactly what we built into our workflow, we compute the two delta Gs, delta G1 and delta G4. First we need to run four different equilibrium molecular dynamics simulations: one for the wild-type RBD, one for the RBD with the mutations, one for the wild-type RBD-human ACE2 complex, and another for the RBD-human ACE2 complex with the mutations. From these equilibrium simulations, which range from nanoseconds to microseconds, we extract a number of snapshots, an ensemble of structures, in our case 500 structures, and we run thermodynamic integrations going from state A to state B, from the wild type to the mutant. Then we do the same in reverse, from state B to state A. In total that is 1,000 short molecular dynamics simulations, thermodynamic integrations of 10 to 200 picoseconds each, and we do that for the monomeric structure and also for the complex. We obtain the two delta Gs and finally, subtracting one from the other, we obtain the delta-delta G. All of this was very nicely explained by Vytas in the chapter of the book that you have here. You have also attended the pmx lecture, so maybe you already know all of this.
This is the final diagram of the project. Remember the combinatorial explosion that I mentioned before: it is a really complex and ambitious project, but we have divided it into two parts, which are basically our two workflows.
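Written out, the thermodynamic cycle described above closes as shown in this small sketch. The labels for the two binding legs (wild-type and mutant binding free energies) are names chosen here for illustration; the talk itself only names delta G1 (monomer, wild type to mutant) and delta G4 (complex, wild type to mutant). The order of the subtraction is also an assumption, chosen so that a negative value means stronger binding.

```latex
\begin{align}
\Delta G_1 + \Delta G_{\mathrm{bind}}^{\mathrm{mut}}
  &= \Delta G_{\mathrm{bind}}^{\mathrm{wt}} + \Delta G_4 \\
\Delta\Delta G
  = \Delta G_{\mathrm{bind}}^{\mathrm{mut}} - \Delta G_{\mathrm{bind}}^{\mathrm{wt}}
  &= \Delta G_4 - \Delta G_1
\end{align}
```

The first line says that both routes from the unbound wild type to the bound mutant must give the same free energy change. The second line is why the two alchemical legs are enough: the binding free energies themselves never have to be simulated. With 500 forward and 500 reverse transitions per leg, one delta-delta G takes 2 x 1,000 = 2,000 short simulations.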
The first one generates the equilibrium molecular dynamics, and the second one, starting from these molecular dynamics, computes the fast-growth thermodynamic integrations and obtains the delta G values. Of course, all of this needs PRACE resources, PRACE computational power, and we are using supercomputers such as MareNostrum, the one that we have at the Barcelona Supercomputing Center.
But how do we take advantage of these resources? We do that going exascale, using the PyCOMPSs workflow manager. PyCOMPSs, as I said on the first slide, is developed at the Barcelona Supercomputing Center. Basically, it is a tool that is able to analyze our Python-coded workflows, automatically identify loops with completely independent branches, and execute those branches in parallel. Thanks to its fault tolerance, if one of the branches crashes, the rest of the branches still reach the end of the workflow, so we only need to look at the branches that crashed while the others give us their results. It also uses the CPUs of the different nodes of the supercomputer efficiently, as you can see here and as you will see on another slide.
Remember the diagram? We have two different workflows: the first one generates the molecular dynamics, and the second one computes the free energy calculations. In the first one we have automated modeling of all the mutations, the setup process, and the production run of the MD simulations, with parameters that we can of course change, in an MPI regime. As an example, in one single job using more than 8,000 cores of the supercomputer we can run 22 systems, 21 mutations plus the wild type, and we can run that in hours.
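The idea of independent branches that run in parallel, with one crash not stopping the others, can be sketched in plain Python. This is only an illustration using the standard library's `concurrent.futures` as a stand-in; it is not the PyCOMPSs API, and it runs on one machine rather than across supercomputer nodes. The function `run_branch` and the mutation labels are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_branch(mutation: str) -> str:
    """Hypothetical stand-in for one independent branch (model, set up, simulate)."""
    if mutation == "BAD1X":  # made-up label used to simulate a crashed branch
        raise RuntimeError(f"simulation failed for {mutation}")
    return f"{mutation}: trajectory ready"


def run_all(mutations: list[str], workers: int = 4) -> tuple[dict[str, str], dict[str, str]]:
    """Run every branch in parallel; a failure in one branch does not stop the others."""
    done: dict[str, str] = {}
    failed: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_branch, m): m for m in mutations}
        for future in as_completed(futures):
            name = futures[future]
            try:
                done[name] = future.result()
            except Exception as exc:  # record the crash and keep collecting results
                failed[name] = str(exc)
    return done, failed


if __name__ == "__main__":
    done, failed = run_all(["WT", "MUT_A", "BAD1X", "MUT_B"])
    print("finished:", sorted(done))
    print("to inspect:", sorted(failed))
```

Only the failed branches need a second look afterwards, which is the behaviour described for the real workflow manager.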
From these simulations we can then run the free energy calculations using the second workflow, which extracts the ensemble of structures from the equilibrium dynamics generated by the first one. For each of these snapshots it generates the hybrid topology using pmx and runs the thermodynamic integration using GROMACS. The work values computed for all the snapshots are then combined to extract the final delta G. Remember, this is one delta G, just one. We need to run this workflow twice to extract two delta Gs, one for the monomer and one for the complex, and then obtain the delta-delta G for just one mutation. As an example, we can use 32 nodes, more than 1,500 cores, in one single job, and that way we are able to compute one of these delta Gs in five hours.
For this particular example from the previous slide, using 32 nodes, you can see here how PyCOMPSs is able to use the 32 nodes at 100% CPU usage. Only the green line, which is the node reserved to orchestrate the whole workflow, is not using all of its CPUs. And this is, remember, 1,000 different thermodynamic integrations on a supercomputer.
Now some preliminary results obtained with these workflows, starting with the human polymorphisms. On the right-hand side you can see the values computed by the workflows, where the lower the number, the more infectivity the virus has. If you take a mutation that gives more infectivity to the virus and look at its frequency, you can see that it really does appear in the human population, with a higher frequency, whereas the mutations that give lower infectivity to the virus appear in fewer people.
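Going back to the step where the forward and reverse work values are combined into one delta G: a minimal sketch of one common estimator for this kind of data, the Bennett acceptance ratio for nonequilibrium work, is shown here. The talk does not say which estimator was used, and pmx ships its own analysis tools, so this is an illustration rather than the workflow's code. It assumes equal numbers of forward and reverse transitions (500 and 500 in the talk), work in kJ/mol, and a temperature of 298 K; the numbers in the demo are made up.

```python
import math

KT_298K = 0.0083144626 * 298.0  # kJ/mol, thermal energy at an assumed 298 K


def fermi(x: float) -> float:
    """Numerically stable 1 / (1 + exp(x))."""
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def bar_delta_g(w_forward, w_reverse, kt=KT_298K, tol=1e-8):
    """Bennett acceptance ratio estimate of dG(A->B) from nonequilibrium work.

    w_forward: work values of the A->B transitions (kJ/mol)
    w_reverse: work values of the B->A transitions (kJ/mol)
    Assumes the same number of transitions in both directions.
    """
    if len(w_forward) != len(w_reverse) or not w_forward:
        raise ValueError("need equal, non-zero numbers of forward and reverse work values")

    def imbalance(dg):
        fwd = sum(fermi((w - dg) / kt) for w in w_forward)
        rev = sum(fermi((w + dg) / kt) for w in w_reverse)
        return fwd - rev  # grows monotonically with dg; zero at the estimate

    # The solution lies between the smallest and largest of W_f and -W_r.
    candidates = list(w_forward) + [-w for w in w_reverse]
    lo, hi = min(candidates) - 1.0, max(candidates) + 1.0
    while hi - lo > tol:  # plain bisection on the monotonic function
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Made-up work values, only to show the two runs and the final subtraction.
    dg_monomer = bar_delta_g([11.0, 13.0], [-9.0, -7.0])
    dg_complex = bar_delta_g([6.0, 8.0], [-4.0, -2.0])
    print(f"ddG = {dg_complex - dg_monomer:.2f} kJ/mol")
```

The two calls mirror the two runs of the second workflow, one for the monomer and one for the complex, followed by the subtraction that gives the delta-delta G.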
So the virus, in a way, has already adapted to the human polymorphisms that are present in many people today. If we look at the virus variants, and again at the column of pmx results, we can pick the mutations that give more infectivity and then look at their frequency. You can see again that the mutations that give more infectivity have a higher frequency. That means that, because the mutation gives more infectivity, the virus has kept it and is evolving towards higher infectivity. If instead we look at the mutations that give less infectivity to the virus, we can see that their frequency is much lower.
Now let's look at something more complex, the mutations needed to go from the virus that infects bats to the virus that infects humans, that is, from RaTG13 to SARS-CoV-2. There are 21 mutations between RaTG13 and SARS-CoV-2. The blue and red boxes that you see mark charge-changing mutations, those that introduce or remove a charge: lysines, arginines, glutamic acids or aspartic acids. If you look at the mutations needed to go from the bat ACE2 to the human ACE2, you can see more or less the same trend: there are a lot of red and blue boxes, that is, a lot of electrostatic changes between the species. And if we color the interface of the bat virus and that of the human virus, you can also see a strong redistribution of the electrostatic potential at the interface between the RBD and the ACE2. This gives SARS-CoV-2 higher infectivity and also more flexibility, so that it is able to infect not just the human ACE2 but also the ACE2 of other species.
This is the final slide, with examples of charge-changing mutations, where pmx runs in the workflow and we extract the delta-delta G values. You can see how the charge-changing mutations, summed together, give more infectivity to the virus: SARS-CoV-2 is gaining 5k of infectivity from the charge-changing mutations alone. We are now looking at the larger numbers, the ones in red, which add a lot of infectivity, and also at the ones that remove part of the infectivity. We have confirmed all of these numbers with slow-growth thermodynamic integration, which is more computationally expensive than fast growth, and with PMFs, which are even more expensive, and we are also confirming them with experimental results through a collaboration with an Italian group.
Finally, please remember that this is a really ambitious project. In this short talk we have concentrated on just one particular case, from the bat virus to the human virus, but we are now exploring all the different variants. If you are interested, please keep an eye on the BioExcel website and on our Twitter. Thanks for your attention, and if you have questions, I will be happy to answer them now.