Welcome to the BioExcel Building Blocks lecture of the 2021 edition of the BioExcel Summer School. My name is Adam Hospital. I come from the Institute for Research in Biomedicine (IRB) in Barcelona, and I lead the BioExcel workflows team. Tomorrow we will run the live session with Genís, also from IRB, and Pau, from the Barcelona Supercomputing Center, who are the main developers of the BioExcel Building Blocks library.

We have two lectures today, of 35 to 40 minutes each: an introduction to the BioExcel Building Blocks library, and an introduction to biomolecular simulation workflows built with this library. Tomorrow we will have a live tutorial session, in which we will work with a protein complexed with a small molecule and set up a molecular dynamics simulation in GROMACS using the BioExcel Building Blocks library. We will also have a couple of question-and-answer sessions, one for the theory that I am presenting today and one for tomorrow's tutorial. So we will have time to discuss many things tomorrow; today we are going to focus on the introduction to the building blocks.

Let me start with the BioExcel Center of Excellence. I know that Rossen talked about it on Friday and gave you a short introduction to what the Center of Excellence is. What I want to show you is just where the biomolecular workflows sit within it. As you know, BioExcel is a central hub for biomolecular modeling and simulation. We are many different partners from across Europe, coordinated by KTH in Stockholm. Our area of expertise goes from atomistic structures to macromolecules and macromolecular complexes, just touching on cellular structures. Our main goal is to enable better science, and we divide that main goal into three.
First, improving the performance and functionality of our key applications; I will tell you on the next slide what these key applications are. Second, providing support to non-experts, entry-level users and also advanced users. And finally, developing user-friendly computational workflows. This is where the biomolecular workflows sit in BioExcel.

But first of all, the key applications: GROMACS for molecular dynamics; PMX for free energy calculations; HADDOCK for docking, meaning protein-protein docking, protein-ligand docking and also protein-DNA docking; and finally CPMD and CP2K for QM/MM. I'm sure you are familiar with these codes, and you will be hearing about all of them during this BioExcel Summer School.

We also give a lot of support, through webinars, training events, schools like the one you are attending this week, conferences and industry visits, and we have a forum where you can ask anything you want about our field of biomolecular simulation.

But the main and most interesting point for us is the biomolecular workflows. Our goal is to generate biomolecular workflows that are easy to build, easy to use, available, reproducible, portable to many different infrastructures and platforms, and shareable with the scientific community. We are doing all of that in collaboration with the ELIXIR European project, following a set of software development best practices that I will introduce on a later slide in this presentation.

Okay. Why are biomolecular workflows so important for us? We should take a step back and think about what a workflow is. I'm sure that you all have run a workflow, or have even developed one, in your professional life.
A workflow is just a group of sequential processes, like this one, that you run one after the other, where any one of these processes can itself consist of many different tasks running at the same time. In the end, what the workflow generates is a set of results.

If you stop for a minute and think about what a biomolecular workflow is, I'm sure you will think: okay, in the last workflow that I ran, I needed to download a structure, for example a PDB structure. I needed to convert between different versions, maybe model a missing piece of the structure, maybe run a docking, parameterize some ligand or run some cheminformatics on the structure, then run molecular dynamics, maybe quantum mechanics, or maybe hybrid QM/MM. After the simulation, I'm sure you need to analyze the results, the trajectory. Maybe you are interested in running free energy calculations, and I'm sure you will do some kind of data analytics, which is the state-of-the-art analysis of simulations nowadays.

All of these processes are tied to the many different programs that we have available in our field: GROMACS, AMBER and NAMD for MD; HADDOCK, AutoDock and AutoDock Vina for docking; and tools for QM, ligand parameterization, cheminformatics, machine learning, analysis and many, many more. I'm sure you are familiar with most of the programs that you see here.

The reality is that when we start working on a workflow, what we typically end up with is something like this: a shell script with different steps, one after the other. The first one produces a result that is used by the second one, which produces another result that is used by the third one, and so on, until you have the final results. These shell scripts, and I'm sure you will agree with me, have many, many usability problems.
If you wrote this kind of shell script two years ago and you want to execute it again today, I'm sure you will run into many problems. You will not remember what all the different lines were doing. You will have problems with the software on your machine, which will have been updated since the version you were using two years ago. You will have interoperability problems from one step to the next. If you are using a single software package, that's okay, but if you are using many different software tools, as you saw on the previous slide, then I'm sure you will end up with interoperability problems between the different steps. And, very important, if you want to share, port or reproduce this workflow on another machine, this is really complicated, because you need the same versions of the same software and the same dependencies installed on different systems. Finally, if you want to scale it up, if you don't want to run this workflow just once but in a high-throughput way, say 200 times, then I'm sure you will have a problem: you will need to write something different from this shell script.

With all these things in mind, at the beginning of BioExcel we started a project called BioExcel Building Blocks, and this is where the idea of the building blocks comes from. So what are the BioExcel Building Blocks? The idea is really simple. The BioExcel Building Blocks library is a set of Python wrappers on top of different tools. At the core of each wrapper there is the execution of a tool, a script or a library, but in the end it is an execution that needs input parameters and produces output files. What the wrapper does is simply adapt a set of parameters, input files and configuration files, which always have the same syntax, to the native input parameters that the tool understands.
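To make the wrapper idea concrete, here is a minimal sketch of the input side of such an adaptation layer. It is not the library's actual implementation: the function `build_command`, the table `EDITCONF_FLAGS` and the uniform argument names are hypothetical and chosen only for illustration. The flags themselves (`-f`, `-o`, `-bt`, `-d`) are the ones used by `gmx editconf`.

```python
import shlex
from typing import Dict, List

# Hypothetical table mapping uniform argument names to the native flags of
# one wrapped tool. The flags are those of `gmx editconf`; the uniform names
# on the left are illustrative assumptions.
EDITCONF_FLAGS: Dict[str, str] = {
    "input_gro_path": "-f",
    "output_gro_path": "-o",
    "box_type": "-bt",
    "distance_to_molecule": "-d",
}


def build_command(executable: List[str],
                  inputs: Dict[str, str],
                  outputs: Dict[str, str],
                  properties: Dict[str, object],
                  flag_map: Dict[str, str]) -> List[str]:
    """Adapt uniform inputs, outputs and properties to tool-specific flags."""
    command = list(executable)
    # Inputs first, then outputs, then properties, in insertion order.
    for name, value in {**inputs, **outputs, **properties}.items():
        if name not in flag_map:
            raise KeyError(f"no flag known for argument '{name}'")
        command += [flag_map[name], str(value)]
    return command


# The caller always uses the same three groups of arguments, whatever the tool.
command = build_command(
    executable=["gmx", "editconf"],
    inputs={"input_gro_path": "protein.gro"},
    outputs={"output_gro_path": "protein_box.gro"},
    properties={"box_type": "cubic", "distance_to_molecule": 1.0},
    flag_map=EDITCONF_FLAGS,
)
print(shlex.join(command))  # the native command line the wrapper would execute
```

The sketch only builds the command line and never runs it, so it works without GROMACS installed. A real wrapper would also execute the tool and collect its output and log files.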
And it does exactly the same for the output files. The tool generates output files, and we adapt them to a set of output files and log files that always have the same syntax. So this Python wrapper layer adds compatibility from one building block to another, making all the building blocks interoperable with each other. On top of this Python compatibility layer we also have an external layer, called the workflow manager adaptation layer, which makes the building blocks compatible with different workflow managers. I will tell you more about that later in the presentation.

So the concept is really simple. We have building blocks that are fully interoperable with each other, which means we can easily put them together to build our own biomolecular simulation workflows. Once we have our workflow, we can run and control it with different workflow managers: graphical user interfaces and workflow managers such as Jupyter Notebook, KNIME and Galaxy, and also HPC-oriented workflow managers such as PyCOMPSs.

Okay. As I was telling you at the beginning, we follow a set of best practices to develop the building blocks, in collaboration with the ELIXIR European project. We use the FAIR principles, which were typically used for data but can also be applied as software development best practices. That means our software is findable: it is registered in different portals such as the bio.tools portal; it has its own website, where you can find everything you need to know about the library; it is benchmarked by the OpenEBench infrastructure in ELIXIR; it is publicly available in a GitHub repository; and it is findable through search engines such as Google using Bioschemas.
It is documented, and this is the accessible part: it is very easy to use because all the documentation is in standard formats. Read the Docs; OpenAPI for the REST API; Common Workflow Language descriptions for all the building blocks, for all the different tools that we are wrapping. We package all the software with Conda, and I will talk about Conda packaging in a minute, and also in Docker containers and Singularity containers, so it is very easy to use on different machines. The building blocks are interoperable, of course, by definition, because this is the main idea behind the Python wrappers. And the library is really reusable: you will see examples of that in the Jupyter notebooks that we have as demonstration tutorials, in Galaxy, in KNIME, and even in our own web servers implementing the workflows. So it is really FAIR software.

Let me take just a minute to introduce the Conda packaging system, because I think it is really important. If you already know it, that's perfect, because this is the software that we are going to use tomorrow in the live session. If not, I strongly recommend you take a look, because it is really convenient for reproducibility, and we use it a lot in our development of the library and also of the workflows.

With Conda packaging, what you can do is generate a closed environment on your computer. Take this example. Person number one has a computer environment with these programs, in these versions, installed. He or she implements a workflow and runs it, and the workflow gives results, which are fine, using these programs with these versions. In the closed Conda environment, he or she installs all of these software tools with these versions, runs the same workflow, and the results are exactly the same.
But then he or she tries to share this workflow with another person, person number two, who is working in another computer environment, with different software and different versions of the software installed. If person number two tries to run the workflow, the workflow gives an error: some software is missing, or the software has different versions that are not compatible with the workflow. If instead of running the workflow directly in the computer environment, person number two runs it in a closed Conda environment, which can easily be reproduced from the Conda environment on person number one's computer, then the workflow runs and gives the same results, because the environment has exactly the same programs and versions that were used in the first execution. We rely a lot on this concept of packaging all the dependencies of a particular workflow in order to reproduce these workflows. You will see that in the second lecture and also tomorrow in the live session.

Okay, coming back to the BioExcel Building Blocks library. We group the building blocks that share a similar functionality into modules. For example, all the building blocks related to molecular dynamics with GROMACS are contained in the BioExcel Building Blocks MD module. Those that wrap AmberTools are contained in the BioBB Amber module, and the same goes for the modeling tools, machine learning tools, virtual screening tools, input/output tools (meaning retrieval of information from biological databases), analysis of molecular dynamics trajectories, modification or extraction of information from PDB files, cheminformatics and free energy calculations. And we are still working on many others every day: interaction potentials, coarse-grained algorithms, DNA-specific analysis and simulations, and so on. All this information is gathered together in one central website, which is called BioExcel Building Blocks.
You have the URL here. We are going to use this website a lot tomorrow, and I will present it very quickly at the end of this presentation. For now I just want to show you that one section of the website has a table with all the different modules, with links to the GitHub repositories where you can find all the source code, links to all the documentation on Read the Docs, and links to all the Bioconda packages, Docker containers and Singularity containers, which you can download and run very easily. If you click on one particular module, for example analysis, you get a list of all the building blocks it contains. If you click on one of the building blocks, it opens the Read the Docs documentation for that particular building block.

But what I want to show you now is that if you click on one of these Bioconda links, you can download and install, in one single line, a Bioconda package that contains all the dependencies needed to run all the wrappers contained in that module. For example, for the analysis of molecular dynamics trajectories: if you type conda install on your machine (of course, you need to have the Conda packaging system installed), it automatically sets up an environment for you with these dependencies, Amber 2019 and GROMACS, so that you can then run all of these building blocks without having to install anything else, and you can reproduce that on different machines.

Now the syntax. This is important because, as I was telling you at the beginning, interoperability is a central piece of the idea of the BioExcel Building Blocks, and that means we need a standard syntax for all the different building blocks. This is the syntax.
The syntax is: you import a module; you define the inputs and outputs, which are paths in your file system, and the properties, which are the parameters of the particular tool being wrapped; and then you launch the building block. Very easy.

This is an example: the editconf tool from the GROMACS package, which creates a solvent box. You import the module, then define inputs, outputs and properties. The input and output are two files in your file system, and the properties are the parameters. In this case, I want the box to be cubic, and I want one nanometer (not one angstrom, because this is GROMACS) of distance from the protein to the edge of the box. Finally, you launch the building block with the inputs, the outputs and the properties. This is all you need to know to run our building blocks, and I hope to demonstrate that during this presentation and in the live session tomorrow.

You may be wondering how you know which inputs, outputs and properties you have to define to launch a building block. These are specified in the documentation of each building block. In the documentation of the editconf building block you have the mandatory parameters, the input GRO path and the output GRO path, and also the properties that you can define.

Let me show you a couple of examples so that you see the syntax is exactly the same for all the different building blocks. For example, if you are interested in mutating a residue, in this case an arginine here to an alanine here, same position, different residue, you just import the building block, define inputs, outputs and properties, in this case an output and the property saying that I want to mutate arginine 5 to alanine, and then launch the building block. In this case, wrapping the modeling tool, you obtain the mutated structure.
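The three steps described above (import, define, launch) can be sketched for the editconf example as follows. This is a hedged sketch and not a copy of the slide. The import paths (`biobb_gromacs` in newer releases, `biobb_md` in older ones) and the property names `box_type` and `distance_to_molecule` are assumptions based on the library's documentation, so check them against the version you have installed. The helper functions `editconf_arguments` and `run_editconf` and the file names are hypothetical.

```python
def editconf_arguments(input_gro: str, output_gro: str) -> dict:
    """Step 2: define inputs, outputs and properties in the uniform form."""
    properties = {
        "box_type": "cubic",          # shape of the solvent box
        "distance_to_molecule": 1.0,  # nanometers, since this is GROMACS
    }
    return {
        "input_gro_path": input_gro,    # existing file in your file system
        "output_gro_path": output_gro,  # file the building block will write
        "properties": properties,
    }


def run_editconf(input_gro: str, output_gro: str):
    """Launch the building block, or return None if the library is missing."""
    # Step 1: import the module (assumed paths; they differ between releases).
    try:
        from biobb_gromacs.gromacs.editconf import editconf
    except ImportError:
        try:
            from biobb_md.gromacs.editconf import editconf
        except ImportError:
            return None  # library not installed in this environment
    # Step 3: launch the building block with inputs, outputs and properties.
    return editconf(**editconf_arguments(input_gro, output_gro))


if __name__ == "__main__":
    # Printing the arguments works even without the library or GROMACS.
    print(editconf_arguments("protein.gro", "protein_box.gro"))
```

Calling `run_editconf` only does real work in an environment where the library and GROMACS are available, for example one created from the corresponding Bioconda package.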
Another example: clustering a molecular dynamics trajectory. You import the building block, you define inputs, outputs and properties here, you launch the building block, and you obtain a cluster, an ensemble of structures from a molecular dynamics trajectory. You see that it is exactly the same syntax.

Going a little bit further, you can join together, as I was telling you at the beginning, two, three, four different building blocks. In this case: downloading a small molecule from the PDB, adding hydrogen atoms to it using Open Babel, minimizing the energy of the hydrogen atoms using Open Babel again, and finally parameterizing this structure with the hydrogens so it can be used in a molecular dynamics simulation. This is a workflow that you can easily run, and you can find it among the demonstration workflows that we have on the website.

These are the demonstration workflows I was telling you about. This is one particular section of the website where you can find many different workflows in Jupyter notebook format: in this case, for example, a protein MD setup, but you will also find automatic ligand parameterization, protein-ligand complex MD setup, which is the one we are going to use tomorrow in the live session, and free energy calculations. We also have a brand new one with a protein-ligand docking example.

Okay, I wanted to give you a couple of success stories, to show what we can do with the BioExcel Building Blocks library. I'm going to be quick because I don't have much time. The first one is an example that we ran a couple of years ago with one of the most widely used and studied enzymes in biochemistry, pyruvate kinase, both because of its role in glycolysis and because mutations affecting its structure are known to cause anemia.
What we wanted to explore is how these different mutations affect the dynamics and flexibility of this protein. For that, we downloaded 200 different annotated mutations from UniProt, some of them pathological, some of them not. We modeled and simulated all 200 mutations in one single job using the PyCOMPSs workflow manager, through the adaptation layer, the external layer of the BioExcel Building Blocks. That allowed us to run 200 simulations: model the mutation, prepare the simulation and run the simulation, each simulation on four different nodes of the MareNostrum supercomputer at the Barcelona Supercomputing Center, using almost 40,000 cores in one single job. This is the result: 200 different folders with all the simulations, which we are now exploring to understand how the mutations affect dynamics and flexibility.

After that, we moved to something a little more complex: EGFR, the epidermal growth factor receptor. Again, this is a widely studied enzyme that is known to be related to many types of cancer. In this case, overexpression of EGFR leads to cancer, but we have FDA-approved drugs that basically block the function of EGFR, the ATP-competitive inhibitors. Unfortunately, we know that some mutations that have appeared in recent years confer resistance to these FDA-approved drugs. So we wanted to explore how these mutations affect the binding of these drugs. For that, again, we ran a massive number of molecular dynamics simulations using the library.
These simulations were run for the wild-type protein, for the mutated protein, for the wild-type protein with the ligands bound, and for the mutated protein with the ligands bound: all the different combinations. Then we ran free energy calculations on all of that to explore the impact of these mutations on ligand binding affinity. We used a workflow that you will also see in the last days of this BioExcel Summer School, in the PMX lessons: the fast-growth thermodynamic integration workflow.

Finally, we used exactly the same approach, fast-growth TI, to explore mutations in, of course, the COVID-19 virus. This is the virus; I'm sure you recognize it. These red protuberances are the spike protein, which is this protein here. Sometimes one of its receptor binding domains goes up, and when it is up it tries to attach to the human cell, to the angiotensin-converting enzyme of the human cell. That is the start of the invasion process in humans. So we can explore the mutations in this region of the receptor binding domain of the virus, where, for example, the well-known British variant mutation sits, exactly at the interface between the virus and the human cell.

But we can go one step further and look not just at these mutations but at the mutations that make the virus that infects bats different from the virus that infects humans. We know there are almost 30 different mutations to go from one to the other, and we can explore, one by one, the difference in binding free energy for each mutation and then sum up all the results. I'm not going to spoil it for you, but the manuscript is submitted and the results are also being checked experimentally.
So we are very proud of the use of the BioExcel Building Blocks in a study with such high impact nowadays, the zoonotic transfer of COVID-19.

Okay, you have all this information in the paper that we published a couple of years ago. We are very proud of being recognized by the European Commission as one of the high-potential innovations of last year. But what makes us really proud is the usage by renowned groups in the field, especially of the online tutorials that we are developing; these two quotes here describe them as exceptional. We are very proud of that. These tutorials are the ones we are going to use tomorrow in the live session, and I will introduce them in the second lecture today.

There is much more about the building blocks; this is just the tip of the iceberg. I'm going to stop here, but let me tell you that we have many different tutorials on the website, and I recommend you take a look, because all the information is there. We have Common Workflow Language examples; REST APIs to launch the building blocks from your own computer while they run on a remote computer, with no need to install anything locally; command-line workflows, which is the approach we used in these success stories; a web server; and examples on Galaxy and also on KNIME.

We are going to see the website tomorrow, I'm sure, so I'm going to skip these slides about the website. Just one last thing before jumping to the next lecture: you also have the possibility to build your own building block, and all the steps to do so are on the website. So if you are a developer and you want to integrate your tool into the building blocks, you can do that; just follow the steps that you will find here.
Last slide, as a summary. The BioExcel Building Blocks library mainly offers tool interoperability, and with it you will be able to build workflows very easily; we are going to see that tomorrow. It offers reproducibility and portability of whole workflows; we will see that in the next lecture, where I will tell you about Conda packaging of the whole workflow. But you need to know that the BioExcel Building Blocks are not magic. You still need to know about the tool that is being wrapped, because most of the properties are parameters of the tool, and you need to know what they are and what they mean. And you still need to build your own workflows. But don't worry, because we will look at these two difficult points in the next session, the workflows tutorial, so please stay with me for the next lecture. Thank you.