Welcome to this afternoon session of the BioExcel Summer School. We are going to talk about the BioExcel Building Blocks library. My name is Adam Hospital, and I come from IRB Barcelona, the Institute for Research in Biomedicine. The work that I'm going to present today was done mainly by the two main developers, Pau Andrio and Genís Bayarri. They will both be with us in the live session, the last session today, where we will do a hands-on with one of our tutorials, the Protein-Ligand Complex MD Setup tutorial, using the BioExcel Building Blocks.

So we are now in the first session of this afternoon, the theory behind the BioExcel Building Blocks, which is supposed to be 40 minutes long. After that, we will have 30 minutes on how to use this BioExcel Building Blocks library to build molecular simulation workflows. Then we will have the question and answer session, where the three of us will be here answering your questions. Then we'll have a short break, and finally the hands-on session: the Protein-Ligand Complex MD Setup tutorial, using Jupyter Notebooks and the BioExcel Building Blocks library.

Starting from the beginning: I know you had a nice introduction to the BioExcel Centre of Excellence yesterday morning, in the first session, but let me summarize it briefly. We are a hub for biomolecular modelling and simulation, with a particular area of expertise that spans from the atomistic structure, at the quantum level, up to macromolecular structures and macromolecular complexes. The centre started four years ago as a Horizon 2020 project of the European Commission and includes many different European partners. We are one of them.
This is IRB, but as you can see there are many different partners, and you will meet some of them during these days of the BioExcel Summer School. The Centre of Excellence basically wants to enable better science, and to do that we have three main key points. The first one is to improve the performance and functionality of our key applications, and I will introduce these key applications in a minute. The second is to provide support to both non-experts and advanced users. And finally, we want to develop user-friendly computational workflows.

Our key applications are these ones. I'm sure you all know them; that's exactly why you are here today in this BioExcel Summer School. You want to see how to use these key applications: GROMACS for molecular dynamics simulations, PMX for free energy calculations, HADDOCK for protein-protein docking, and CP2K and CPMD for the hybrid QM/MM approach.

We want to provide support, and we do that through different training events. We have a lot of webinars; you can take a look at the bioexcel.eu web page and you will find more than 30 webinars on biomolecular simulation tools. We have documentation and training material also linked on the BioExcel website, all our events are of course published there, and we are providing support not just to academia but also to industry, as you can see in the news here.

But what is really interesting for us in this particular session of the BioExcel Summer School is the third key point, the workflow part: to design, deploy, and make available solution-oriented biomolecular workflows. We want to do that with a particular focus, which is excellent usability. We want these workflows to be easy to use, easy to share, easy to reproduce, and easy not just to build but also to execute on different infrastructures and platforms.
That's why these partners from the BioExcel Centre of Excellence joined together and started the development of the BioExcel Building Blocks. You will see the story. So, biomolecular workflows. Four years ago, when we started the BioExcel Centre of Excellence, all of these partners sat down and thought about the challenges of biomolecular workflows. Basically, biomolecular simulation workflows are built from a number of tools which perform different tasks; some of these tasks are summarized here. We usually have file-format conversions inside the workflow: as we have so many different formats, we will surely need to convert between one and another. Maybe we have some structure modelling to do. Maybe we're interested in molecular dynamics simulations, maybe quantum mechanics simulations, maybe a hybrid QM/MM approach. Of course, if we produce trajectories, we need to analyse those trajectories. We also have docking software, if we need to dock a protein with a ligand or a protein with a protein. We can compute free energy values. Maybe we need a ligand parameterization to use a ligand in a molecular dynamics simulation. Maybe we are interested in cheminformatics and something involving small molecules, and data analytics is something really popular nowadays.

What all of these different tasks mean is that we are dealing with many different tools. So, again, for molecular dynamics you probably know GROMACS and NAMD, and there are more. For analysis and visualization you can use tools such as VMD. For cheminformatics you have Open Babel and RDKit, and ACPYPE to parameterize ligands. You have QM/MM approaches, free energy tools like PMX, more analysis tools, docking with AutoDock and HADDOCK, modelling tools like MODELLER, data analytics with TensorFlow and scikit-learn, and many, many more. You name it: there are a lot of different tools that we need to use together in different workflows.
And how are we typically doing that? Using shell scripts. We call one of the tools in the first step, then another tool in the second step, and so on and so forth until we end up with the last step of the workflow. That has many different problems. Usability: you need to understand the script if you want to modify something, if you want to change one of the parameters; imagine that you have many different command lines with many different tools being executed. Interoperability: all the different tools produce different file formats, you need to convert between them, you need to understand what each tool needs to receive as input and what it produces as output, and they are mostly not compatible with one another. Then we have issues with portability and reproducibility: if you want to export this script to another infrastructure or another machine, the same versions of the same tools need to be properly installed on that machine for the workflow to run correctly. And then we have problems of scalability: if you want to export the workflow to an HPC centre, for example, and use thousands of cores in a supercomputer, we will need to adapt it.

So that's when we started to think about developing a library, and this was the beginning of the BioExcel Building Blocks. What is the BioExcel Building Blocks software library? It's basically a collection of Python wrappers. It's nothing more than that; well, it is something more than that, but the core of the library is Python wrappers on top of popular biomolecular simulation tools. This library offers a layer of interoperability between the wrapped tools.
That's the main point of the library: making all the biomolecular tools compatible and prepared to be directly interconnected, to build the biomolecular workflows that we want to build. How are we doing that? With a unique syntax. That means that all our building blocks, regardless of the tool being wrapped, have required input files, output files, and input parameters for the tool, which we call properties. All of our building blocks need inputs, outputs, and properties, no matter what tool is being wrapped, and you will see examples of that during the sessions today.

All these building blocks are divided into different categories, and you will see why during this presentation. We call them BioBB modules, and the categories are divided depending on the functionalities of the tools wrapped by the building blocks.

The second step after the building blocks is to build workflows. We want to build and share biomolecular simulation workflows, and using this library we just need to join and connect the building blocks together, thanks to the interoperability. So with this library we have an easy way to build the workflows, and an easy way to develop and test them using, for example, an interactive graphical user interface such as the Jupyter notebooks that you will see today. You have actually already worked with a Jupyter notebook in the previous GROMACS session, and we will also use Jupyter notebooks for our session this afternoon. You also have the opportunity to use drag-and-drop graphical user interfaces such as Galaxy and KNIME; we are compatible with them. Our workflows are also reusable and reproducible.
And that is because we can package the workflow in a single Python script with a Conda environment, or we can describe the workflow with CWL specifications: since we have specifications for each of the building blocks in the library, it's just a matter of joining the specifications together to obtain a description of the whole workflow. And with that, we can share and reproduce the CWL (Common Workflow Language) workflow anywhere else.

But we are also prepared for the exascale. That means that workflows built with this library can be used in HPC centres, on HPC supercomputers, thanks to the adapter layer, a new layer of interoperability on top of the tool interoperability, which makes them compatible with HPC-focused workflow managers such as PyCOMPSs and others. And finally, we want an easy way to install and execute the workflows, and for that we use, again, packaging of all the library modules, which allows easy installation and running on many different infrastructures. You will see examples of the Conda packaging, for instance, during the session today.

So on the one hand we have the packaging of the library, and on the other hand we have this adapter interoperability. If the packaging makes our library infrastructure agnostic, so we can use it on different infrastructures, the adapter interoperability layer makes our library workflow-manager agnostic: the workflows can be controlled and orchestrated by many different workflow managers, as many as we build adapters for. As examples, we have adapters for Jupyter notebooks and for PyCOMPSs, an HPC-based workflow manager; we have KNIME and Galaxy as graphical user interfaces; and we are working on adapters for more workflow managers. Okay, the philosophy behind the BioExcel Building Blocks: FAIR.
When we started the development of this library, there was something called the FAIR guiding principles for scientific data, which gave a set of best practices on how to produce scientific data: basically, data should be findable, accessible, interoperable, and reusable. At that moment, about four years ago, people started to think about doing the same, using best practices such as the ones applied to scientific data, but in this case for research software, and the FAIR principles for research software were published last year. We started using these FAIR best practices from day one in the development of our library, because ELIXIR, the European bioinformatics infrastructure, and BioExcel aligned to build and develop this library together. Let me tell you what FAIR means for this software library.

Our building blocks are findable because the library is easy to find from many different sources. We have the library registered in the ELIXIR registry, bio.tools. We have a website advertising everything, with links to all the documentation; you can find everything really easily. The platform is technically monitored by OpenEBench, which is another ELIXIR infrastructure. All the source code has been on GitHub since day one. We use Bioschemas so that search engines can easily find all the information that we have made available online for the software library. So it is findable.

It is also accessible, which means that you can not only find the building blocks in many different sources but also easily install and use them; that's the meaning of accessible for research software. In our case, we have the library packaged with different packaging systems: Docker and Singularity containers, and Bioconda packages.
Our workflows are packaged, again, as Jupyter notebook tutorials, and can be easily exported, installed, and run on different infrastructures; you will see examples of that. We are also using graphical user interfaces. Those are just examples of the accessibility. We have, of course, interoperability; this is the core of the library. Our different building blocks are completely interoperable with each other. That means that we are adding a new layer of compatibility, of interoperability, between the different biomolecular tools, which we didn't have before. And finally, we have reusability. It's not just a matter of installing and using, but also of reusing. For that we have, for example, documented, defined, and described all of our building blocks and workflows using Read the Docs documentation, Common Workflow Language, and OpenAPI for the REST API part. And we are also packaging the modules, as you already know, with Bioconda, Docker, and Singularity containers, which adds an extra point of reproducibility. All four of these FAIR best practices we have been following since the beginning of the development of the library. This is an important point.

Okay. Our building blocks are divided into categories; these are the ones that we have now. The categories depend on the functionalities of the tools wrapped. And this is very important: every module is available from an independent GitHub repository. So if you go to the source code in the BioExcel GitHub organization, you will find all the different modules and all the source code there. Each of the modules has an associated Bioconda package, associated Docker and Singularity containers, and also Read the Docs documentation pages, all in a separate way. So all of this collection of building blocks, each of them has its own GitHub repository, Bioconda package, Docker and Singularity container, and Read the Docs documentation. This is very important; you will see why in a minute.
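Since every module ships as its own Bioconda package, a whole workflow environment can be captured declaratively in a single file. As a hedged illustration (the module names are taken from the library's documentation; the environment name and the choice of modules are made up for this sketch):

```yaml
# Hypothetical environment.yml for a small BioBB workflow.
# Recreate the identical software stack anywhere with:
#   conda env create -f environment.yml
name: biobb-demo
channels:
  - conda-forge
  - bioconda
dependencies:
  - biobb_analysis   # pulls in its wrapped analysis tools as dependencies
  - biobb_model      # structure checking and residue mutation
```

Because each package declares its own wrapped tools as dependencies, installing the environment also installs the underlying simulation software, which is what makes the workflow portable between machines.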
What we have so far is this biobb_common, which is the base package required to use the whole BioExcel Building Blocks library, and which is automatically installed any time you install one of the other packages. biobb_io, input/output, is the collection to fetch data from biological databases; it basically wraps REST services from the PDB database, from UniProt, and from PDBe and its knowledge base. The molecular dynamics building block collection, biobb_md, is of course a collection to perform molecular dynamics simulations. For now it basically wraps GROMACS tools; as you know from this morning's session, GROMACS is a really powerful package with hundreds of different tools for setup and analysis, and the tools wrapped in this molecular dynamics collection are the ones to set up and run molecular dynamics simulations. The analysis tools are wrapped in another collection, biobb_analysis, to perform analyses of MD simulations and trajectories; in this case we wrap GROMACS, but we also wrap tools from the AmberTools package, from the AMBER molecular dynamics simulation package. We have the biobb_pmx module, a collection to perform free energy calculations using the PMX software, which you will also have the opportunity to work with during this BioExcel Summer School. We have the biobb_structure_utils collection, to modify or extract information from a PDB, which is really, really useful; it's something that appears in every one of the workflows we are building. The biobb_model collection checks and models 3D structures, maybe missing residues, maybe hydrogen atoms; it is important to always check the structure before starting a molecular dynamics simulation, and the collection can also mutate residues. And finally, the biobb_chemistry collection performs cheminformatics analyses and format conversions; this is particularly useful in virtual screening or when you need to work with small molecules.
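Whatever the collection, every building block is invoked with the same shape of call: named input file paths, named output file paths, and a properties dictionary. The sketch below imitates that pattern with a self-contained stand-in function; it is illustrative only, not the real library API:

```python
# Stand-in for a BioExcel building block, imitating the uniform call pattern:
#   block(input_*_path=..., output_*_path=..., properties={...})
# The mock just copies its input file and records the properties, where a
# real block would launch the wrapped tool (a GROMACS step, cpptraj, etc.).

def mock_block(input_pdb_path, output_pdb_path, properties=None):
    """Read the input file, 'apply' the properties, write the output file."""
    properties = properties or {}
    with open(input_pdb_path) as fin:
        data = fin.read()
    with open(output_pdb_path, "w") as fout:
        fout.write(f"REMARK properties={properties}\n{data}")
    return 0  # a building block reports the wrapped tool's exit code

# Every tool-specific parameter travels inside 'properties', so swapping one
# wrapped tool for another never changes the calling convention:
with open("input.pdb", "w") as f:
    f.write("ATOM      1  N   ALA A   1\n")

exit_code = mock_block(
    input_pdb_path="input.pdb",
    output_pdb_path="output.pdb",
    properties={"selection": "protein"},
)
```

The point of the uniform signature is that learning one block means learning them all: only the property names change from tool to tool.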
All of these different categories you can find here; this is the website. I will introduce everything about the website, and where you can find all the information about the building blocks, in the final part of this presentation. But for now, just look at the different modules we have here, each of them with separate source code on GitHub, separate Read the Docs documentation, a Bioconda package, a Docker container, and a Singularity container. Here you have the version. This is important because, for the Bioconda packages, you have one package for each of the different modules. For example, for biobb_analysis, if you click here, it will expand the module, the collection, and you will see all of the different tools wrapped in this particular collection. As I was telling you before, this particular collection wraps two different packages: GROMACS, at least the part of its tools used for trajectory analysis, but also AmberTools, in this case the cpptraj tool from AmberTools.

What Bioconda does is that when you install the Bioconda package for biobb_analysis, it automatically installs all the dependencies too. That means it installs AmberTools 19 and it installs the GROMACS package automatically. So you just install biobb_analysis and the dependencies will be installed automatically. And that also happens for the rest of the modules: if there is a dependency on a program, if we are wrapping ACPYPE, ACPYPE will be automatically installed; if we are wrapping Open Babel, Open Babel will be automatically installed. That's what gives us this level of reproducibility and portability, using this packaging with Bioconda. Just an example; don't try to understand this slide yet.
I can assure you that you will understand it during the different sessions we will have today. But this is basically to illustrate that we can run a REST API call to download a particular ligand structure; then run something using Open Babel, like adding hydrogens to the ligand structure we have already downloaded; minimize the hydrogens we have previously added; and then use ACPYPE, for example, to parameterize this ligand, all using the same syntax and all with this compatibility, this interoperability, between the different steps. This is exactly what the BioExcel Building Blocks are doing. And you will see this slide in the next session, so don't worry; you will understand all of this.

Okay, a couple of success stories before going to the website. These success stories are to illustrate the power of the BioExcel Building Blocks and how we are using them in real scientific projects. The first one is a project whose workflow we called pymdsetup. You can still find it in the GitHub repository, but I didn't include the link here because it was done more than three years ago, at the beginning of the BioExcel Building Blocks, and it is no longer compatible with the current version of the library available from the website. But it was really important for us because it was already showing the power of our software library. The objective of the project was to train a predictor able to tell, from just dynamics or flexibility information from the protein, whether a particular mutation was pathological or not. For that, we used information from a particular protein, pyruvate kinase, which is a big homotetrameric protein, with a final system in the molecular dynamics simulation of almost 400,000 atoms.
This is an important protein because, as you may know, it is involved in an anemia disease. For this particular project, we selected a set of 200 different annotated mutations from UniProt for which we know whether they are pathological or not. This is what we used to train our predictor, and to obtain the flexibility patterns for all of these different mutations we built this pymdsetup workflow, which was successfully launched on the MareNostrum supercomputer at the Barcelona Supercomputing Center using a workflow manager called PyCOMPSs. The workflow was able to model the 200 different mutations and run the molecular dynamics simulations using four nodes per mutation. That means we used, in one single job, almost 40,000 cores of the MareNostrum supercomputer. This is a screenshot of the results, an RMSD plot of the first nanoseconds of the 200 different simulations: we were able to obtain trajectories and flexibility information that we are now using to train the pathological mutation predictor.

So that was really nice for us, and actually we are reusing part of this workflow, the part that models the mutation and runs the molecular dynamics simulation, updated of course to the latest version of the BioExcel Building Blocks, in another, more recent success story: moving mutational analysis into the structural field for drug design. This one is a little more complicated. What we want here is to analyse the effect of a set of mutations, in this case just 18 mutations, on the resistance or sensitivity to a set of FDA-approved drugs for a particular protein, EGFR, the epidermal growth factor receptor, which as you may know is involved in many different tumour and cancer diseases.
So in this case we are using, as I was telling you before, the workflow from the previous project to mutate the protein and run the simulations. We are producing 18 multiplied by 5 multiplied by 10 different replicas, so 900 different simulations, with 90 microseconds of accumulated time. On top of that, to understand the effect of these mutations on the resistance or sensitivity to the drugs, we ran 900 free energy calculations on top of these simulations. Each of these free energy calculations is itself another workflow, nested inside the main one, which is the one you see here, running a fast-growth thermodynamic integration approach that I'm sure you will have the opportunity to work with in the last session of this BioExcel Summer School, with the PMX tool. So in this case our workflow contains, of course, BioExcel building blocks wrapping the GROMACS and PMX tools, and each of these free energy calculations runs 200 short MD simulations. That means a lot of different MD simulations, which are extremely parallelizable, and again, using the PyCOMPSs workflow manager and the MareNostrum supercomputer at the Barcelona Supercomputing Center, we were able to obtain many different ΔΔG values, many different results, that allowed us to obtain a high correlation between these final results and the mutations causing drug resistance.

Actually, you will have the opportunity to see this particular project from a more scientific point of view in one of the last sessions of the BioExcel Summer School: Francesco Colizzi, a very talented postdoc from our group, will explain this work from a scientific perspective in 10 or 20 minutes. And as with all our workflows and building blocks, you can of course take a look at the workflow in our GitHub repository, the BioExcel GitHub repository.
If you want more information about the BioExcel Building Blocks, you can always refer to the journal publication that we managed to publish in the Scientific Data journal last year. But I have to say that in just one year of life we have had many different updates and releases of the BioExcel Building Blocks library. So the paper is fine if you want to understand the philosophy behind this software library for interoperable biomolecular simulation workflows, but if you want to know the latest releases and the latest functionalities, you should go to our website.

Here you have the URL, the link to the website. If you go to the About section, you will see basically what I have explained in this presentation: an introduction to what the BioExcel Building Blocks are; the FAIR philosophy, best practices in research software development; the different modules and categories into which we divided the building blocks; the infrastructures where you can install and run the building blocks, and the workflow managers compatible with our library; success stories; training material, which is really important and really useful (here you can find presentations and videos like this one, for example); and the people involved in the project. But what is most important for you is the Availability tab here. There you will find information about how to launch the building blocks, how to install and download them, all the documentation, all the links, and all the tutorials, one by one. You will find information on how to launch the building blocks: you have three different ways of launching, where launching means running without any need to install anything on your local machine.
We are working on a web page offering pre-configured workflows, so that you can, for example, set up a protein or a protein-ligand system, something like the tutorials you will see today, but all pre-configured and working in a web interface. There's also a REST API server: if you don't want to install all these different tools and building blocks on your own local computer, you can use this REST API server and remotely execute all the building blocks. We have a cloud infrastructure behind that installation. Of course, we are not offering one microsecond of molecular dynamics simulation, but you can use it to, for example, set up a simulation. Actually, there is a tutorial on the molecular dynamics setup of a protein using this REST API, so you can take a look at the tutorials and see that. You can also run a part of the building blocks, a selection, directly in Galaxy; we don't have all the building blocks implemented in the Galaxy project yet. This is a local implementation of Galaxy, but you can try it, and they will be integrated in the Galaxy ToolShed, so that you can soon use all the building blocks in the Galaxy project as well.

There is also how to download and install the library, in the many different ways you have seen in the previous slides. You can download and install from the GitHub source code directly; you can install and run from Docker containers, from Bioconda packages, from Singularity containers, or from BioContainers, which is a registry of different containers. There is also the possibility to download the web server that I was explaining on the previous slide, the web server with pre-configured workflows inside. It's not implemented yet, but it will be soon: we are preparing a virtual machine with everything inside, so that you can download the virtual machine and use it on your own premises, with the graphical interface of the website inside the virtual machine.
So that's another way to download and use the building blocks and the workflows from this library. An important one: source and documentation. This tab here gives you the information about the different modules and categories, with all the links to the source code, documentation, Bioconda packages, Docker and Singularity containers, and so on. If you remember, if you click on one of these modules, you will see a description of all the different tools wrapped inside each of the collections.

And finally, we have the last section of the Availability tab, which is the tutorials. Here you have three different collections of tutorials. First, installation tutorials, which basically help you step by step to install Anaconda, a Python distribution built around the Conda package manager that includes the Python language in the base packages plus about 150 high-quality packages that we are also using in the library. These tutorials explain, step by step, how to really easily install Anaconda on macOS, Ubuntu, or Windows operating systems. This is everything you need if you want to start working with the BioExcel Building Blocks on your own machine. Then we have the library versatility tutorials, which show the power of our library in terms of compatibility with workflow managers, and how to orchestrate and execute the workflows. One tutorial helps you use our workflows described using Common Workflow Language; another helps you run a workflow using the REST API of our library; and another helps you build a command-line workflow, which is basically used in high-throughput execution. So if you need to run your workflow 1,000 times, or you have 1,000 different inputs and of course need to run it in a loop 1,000 times, you need the command-line workflow.
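A high-throughput run like that is, at its core, a loop that launches the same workflow once per input. Here is a minimal self-contained sketch of that pattern; the real workflow command is replaced by a trivial echo so the loop itself is runnable (a real campaign would invoke the workflow script instead, as noted in the comment):

```python
import subprocess
from pathlib import Path

# Launch one workflow instance per input structure and keep one log per run.
def run_workflow(input_file, output_dir):
    output_dir.mkdir(parents=True, exist_ok=True)
    # A real run would be something like (hypothetical script name):
    #   subprocess.run(["python", "workflow.py", "--input", str(input_file)])
    # 'echo' stands in for the workflow command in this sketch, so the input
    # files themselves are never read here.
    result = subprocess.run(
        ["echo", f"processed {input_file.name}"],
        capture_output=True, text=True, check=True,
    )
    (output_dir / f"{input_file.stem}.log").write_text(result.stdout)
    return result.returncode

inputs = [Path(f"structure_{i}.pdb") for i in range(3)]  # 1,000 in a real campaign
codes = [run_workflow(p, Path("results")) for p in inputs]
```

In practice the loop body stays identical whether you have three inputs or a thousand, which is exactly why the command-line form of a workflow suits high-throughput execution.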
So this one helps you build and run those particular workflows. And finally, the most important ones: we have the demonstration workflow tutorials, which are basically built to show the syntax of our BioExcel Building Blocks and how you can easily build workflows using the library. You have four so far: a protein MD setup; an automatic ligand parameterization; a protein-ligand complex MD setup, which basically couples the protein MD setup with the automatic ligand parameterization, and is the one we are going to use in this afternoon's session; and finally, mutation free energy calculations, which contains a small part of the second success story I presented before, how to run a fast-growth thermodynamic integration using the BioExcel Building Blocks, of course in a reduced way so that it can be used in a training session.

You also have the final tab of the website, which is Build Your Own Building Block. This is basically a guideline you can follow if you think that a particular biomolecular simulation tool that you really use or want is missing from our BioExcel Building Blocks library. You can either tell us, "I'm really interested in having this biomolecular tool integrated in the library", or you can try to integrate it by yourself, and these are all the different steps you need to follow. Of course, we have built a template for you; you just need to modify the template for your particular tool of interest. The BioExcel Building Blocks library is completely open-source software, and you can of course collaborate on and contribute to our software through the GitHub repository.

To summarize everything in just four lines, what the BioExcel Building Blocks library offers you is, basically, tool interoperability, so that all the biomolecular tools that we are wrapping can be easily interconnected to build workflows in an easy way.
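That interconnection is file-based: each block writes files that the next block reads, so building a workflow is just passing one step's output paths as the next step's input paths. A self-contained sketch, with mock steps standing in for the real wrapped tools (the four-step chain mirrors the ligand example from earlier; the ligand code is made up):

```python
# Mock pipeline showing how building blocks chain through files: the output
# path of one step becomes the input path of the next. Each mock step appends
# its name, standing in for the real wrapped tool at that stage.

def step(name, input_path, output_path, properties=None):
    props = properties or {}
    data = open(input_path).read() if input_path else ""
    with open(output_path, "w") as f:
        f.write(data + f"{name} {props}\n")

# download ligand -> add hydrogens -> minimize -> parameterize
step("download_ligand", None,             "ligand.pdb",     {"ligand_code": "XYZ"})
step("add_hydrogens",   "ligand.pdb",     "ligand_h.pdb",   {"ph": 7.4})
step("minimize",        "ligand_h.pdb",   "ligand_min.pdb")
step("parameterize",    "ligand_min.pdb", "ligand.top")
```

Because every step shares the same call shape, reordering, inserting, or swapping steps only means changing which file paths are wired to which block.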
And these workflows can be reproduced; they have reproducibility and portability thanks to the packaging initiative you have seen. But the BioExcel Building Blocks are not magic, I have to say, and you should be aware of that. It means that you still need to know about the tools being wrapped. For example, you have attended the previous session about GROMACS, so you know all the possibilities, all the different parameters, that the GROMACS tools have. For example, you have an MDP file, which is a configuration file: you still need to know all of these parameters and what they mean before running, say, the grompp or mdrun tools that are wrapped by our library. If you don't know their meaning, you should learn about it before starting to use the BioExcel Building Blocks. And of course, you still need to build your own workflows, although you have a lot of different tutorials and materials that will help you in this process, just by modifying them or adding new building blocks. And that's exactly what you are going to see in this afternoon's session.

So this is the end of the first session. But please stay with me, because in just five minutes we will start with the second session, which is the workflow tutorials. Thank you.