Hi, I'm Maxime. Today I'd like to welcome Athanasios from the Comparative Bioinformatics group at the Centre for Genomic Regulation in Barcelona. He is going to talk about proteinfold, a pipeline I'm really looking forward to hearing more about, before the first release, which, from what I hear, is coming soon. Before we actually start, I'd like to thank the Chan Zuckerberg Initiative for helping us out. And listeners, you will be able to unmute yourselves at the end of the talk for questions.

Thank you, thank you, Maxime, for the nice introduction. I'm very happy to present today the nf-core/proteinfold pipeline, a best-practice bioinformatics pipeline for protein structure prediction. Let me first introduce myself. I'm Athanasios Baltzis, a PhD fellow in Cedric Notredame's lab at the Centre for Genomic Regulation in Barcelona. My thesis is about applications of protein structure modelling to multiple sequence alignment and phylogenetic reconstruction, and this is where it connects to protein structure prediction methods: I am very interested in all these tools and I use them in my daily routine.

Let me give a brief introduction to protein structure prediction. As you may know, experimental protein structures are scarce, mainly due to technical difficulties with the existing experimental techniques. It has been a long-standing question for the community, the so-called protein folding problem: how can we start from an amino acid sequence, arrive at a 3D structure, and gain more insight into the function of proteins? For this reason, many techniques have been developed over the last half century, and they fall into two main categories. On one side are the template-based methods, for example homology modelling or fold recognition, which rely on already existing experimental structures that are used as templates to fold the query sequence. On the other hand, we have template-free or ab initio methods, with several subcategories: for example, molecular dynamics simulations, where we try to use the laws of physics to find the conformation with the lowest Gibbs free energy; fragment-based approaches such as Rosetta; and, more recently, spatial-restraint-based approaches, where you use techniques to predict the inter-residue contacts or distances within the query protein and then use them as restraints in simulations in order to get the final predicted model.

However, in the last years AlphaFold2 achieved a major breakthrough in this field and is now able to predict protein structures from sequence with unprecedented accuracy. This is mainly based on the incorporation of deep learning frameworks into the field. In the figure below you can see a brief representation of the AlphaFold2 workflow. We start with an input sequence and search genetic databases in order to build a multiple sequence alignment (MSA) from homologous sequences, which is converted into a tensor. In parallel, we search for structural templates in order to populate the pair representation matrix, which represents the inter-residue interactions of the input sequence. AlphaFold2 consists of two main neural network blocks: the first is the Evoformer, which takes as input the MSA representation and the pair representation and uses the MSA representation to populate and optimize the pair representation matrix.
Afterwards, we have the structure module, where these two types of tensors are converted into a tensor that contains the translations and rotations of the model; this is optimized during the learning process and finally gives us the 3D structure. And of course, for better performance and accuracy, this happens three times, since there are three recycling steps.

After the release of AlphaFold2, several other tools appeared with similar or even better accuracy and performance. However, the problem with this software is that it has a lot of dependencies, and by that I mainly mean the genetic databases you need at this step in order to build the input multiple sequence alignment. So, like many labs and researchers in the community, we tried to use AlphaFold2 at large scale. This is what we did in our AlphaFold2 Nextflow pipeline: we were interested in developing a pipeline that takes care of all these dependencies, the databases and the AlphaFold2 parameters, in order to obtain predicted models quickly and as reproducibly as possible. After the release of our pipeline, many researchers from academia and the private sector got in contact with us because they were interested in a scalable AlphaFold2 pipeline that deals with this dependency problem. For this reason, we got in touch with nf-core and Seqera Labs, and we collaborated to develop such a scalable protein structure prediction pipeline.

Here we have an overview of what we already have at the moment. As you can see, there are four modes, based on two subworkflows, the AlphaFold2 one and the ColabFold one. Let's go through this overview step by step, starting with the input sample sheet. It is quite similar to the input already used in the majority of nf-core pipelines, but a bit different in the sense that here we have a comma-separated file with two columns: the first column is the sequence name and the second column is the path to the FASTA file. For monomer predictions, it is recommended to use a separate entry for each monomer sequence you want to predict; here you have an example of such a FASTA file. For multimer predictions, it is recommended to use one entry corresponding to a multi-FASTA file: for example, for this multimer, the multi-FASTA file contains as many entries as the subunits you want to predict.

Once the pipeline has checked the validity of the input, you can choose between two subworkflows. The first one is the AlphaFold2 subworkflow. It first passes through the prepare-AlphaFold2 subworkflow, which checks whether you have provided the AlphaFold2 DB parameter; this specifies the path where the pipeline can find all the databases and model parameters that AlphaFold2 will use, in case you have already downloaded them. Otherwise, it downloads the required databases and model parameters itself. I would like to point out that this is quite computationally expensive, since it has to download around 2.2 terabytes. You can run AlphaFold2 in two modes. The first one is the default mode, where you just feed the input CSV to AlphaFold2 and it computes the input multiple sequence alignment and does the model inference in the same process.
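To make this concrete, here is a minimal sketch of the two-column sample sheet and of a default AlphaFold2 run. The column headers, parameter names and paths are illustrative assumptions based on the description in the talk and common nf-core conventions, so please check the released pipeline documentation for the exact names.

# Two-column sample sheet: sequence name and path to the FASTA file
# (one entry per monomer; for a multimer, one entry pointing to a multi-FASTA file)
cat > samplesheet.csv <<'EOF'
sequence,fasta
my_monomer,my_monomer.fasta
my_multimer,my_multimer_subunits.fasta
EOF

# Default AlphaFold2 mode, pointing to pre-downloaded databases and model parameters
nextflow run nf-core/proteinfold \
    --input samplesheet.csv \
    --outdir results \
    --mode alphafold2 \
    --alphafold2_db /path/to/alphafold2/databases \
    -profile docker

If the database path is omitted, the pipeline would instead download the roughly 2.2 terabytes of databases and parameters itself, as described above.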
To reduce the computational cost, we also implemented another AlphaFold2 mode, which we call AlphaFold2 split. Here the pipeline takes the input CSV and FASTA files and produces the input MSA in a separate process from the model inference. If you think about it, this is quite convenient, for example on cloud infrastructures, because only the prediction step requires a GPU: if you run both steps on GPUs, as in the default mode, it costs much more than with the AlphaFold2 split mode. You can switch between these two modes using the standard AlphaFold2 parameter: true for the default AlphaFold2 mode and false for the AlphaFold2 split mode.

The second subworkflow runs ColabFold, where we follow more or less a similar strategy. There is a prepare-ColabFold subworkflow: if you have already downloaded the databases and the required model parameters, you can specify their path using the ColabFold DB parameter; otherwise, the pipeline automatically downloads the required databases and model parameters. Here again, a lot of storage is required, around 1.8 terabytes. We have two modes in ColabFold as well. The default mode is the ColabFold webserver mode, where you rely on a web server that runs the database search and the MSA creation. By default, this is the web server provided by the MMseqs2 team, but using the host URL parameter you can specify the URL of your own custom web server if you have set one up. To select this mode, the ColabFold webserver, you just have to use the mode parameter. The second mode is the ColabFold local mode, and it is quite similar to the AlphaFold2 split mode we saw on the last slide, in the sense that you first have a process that computes the input multiple sequence alignment using MMseqs2 and then a separate process for the model inference and the protein structure prediction. Again, you can select it using the mode parameter of the pipeline.

Let's have a look at some more advanced parameters, for example the use GPU parameter, when GPUs are available. As I explained before, it is much more computationally expensive to use only CPUs, especially in the prediction steps. You should also pay attention to the configuration profile you are using in combination with the GPU option, so that the corresponding prediction processes are assigned to the GPUs or GPU machines you have in your infrastructure. For example, in the GitHub repository of the pipeline there is a CRG institutional profile that we are using at the moment, so you can have a look at that. With the outdir parameter you can specify the output directory; this applies to all nf-core pipelines. Then there are the AlphaFold2-specific options: the full DBs parameter, where you can choose between using the full databases for the first part, the sequence search and the creation of the MSA, or a reduced version of the databases, which means the pipeline will run faster but with a bit of a trade-off in the accuracy of the produced model; and the model preset parameter, where you specify what type of prediction you want to do and which model to use. For example, there are three monomer models: the default is this one, while the other two are provided by the AlphaFold2 team for reproducibility purposes (this one, for example, was used in the CASP14 competition), and there is the multimer model for multimer predictions. Regarding the ColabFold-specific options, again you can specify the model type: the AlphaFold2-ptm model, which is the default for monomers, and two multimer models, with the most improved version being the default. You can also specify whether or not to use PDB structure templates in the first step, where the pair representation matrix is populated. A more specific and detailed description of all the parameters currently available can be found on the corresponding page of the nf-core/proteinfold website.
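As a further illustration, here are two hedged example invocations covering the options just described: an AlphaFold2 run with reduced databases and an explicit model preset, and a ColabFold local run on a GPU-enabled profile. The flag names and values are assumptions based on the talk, and the crg profile is just the institutional example mentioned above.

# AlphaFold2 with reduced databases and an explicit monomer model preset, on GPUs
nextflow run nf-core/proteinfold \
    --input samplesheet.csv \
    --outdir results \
    --mode alphafold2 \
    --alphafold2_db /path/to/alphafold2/databases \
    --full_dbs false \
    --model_preset monomer \
    --use_gpu \
    -profile crg

# ColabFold local mode: MMseqs2 search and model inference run as separate processes
nextflow run nf-core/proteinfold \
    --input samplesheet.csv \
    --outdir results \
    --mode colabfold_local \
    --colabfold_db /path/to/colabfold/databases \
    --use_gpu \
    -profile crg

For the ColabFold webserver mode, the mode parameter would select the webserver option instead, and the host URL parameter can point to a custom MMseqs2 server, as described above.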
Regarding the output, here you can see the tree structure of the produced results. For AlphaFold2, you have an alphafold2 directory with one subdirectory per sequence name provided in your CSV file, which contains the computed MSAs, the unrelaxed and relaxed structures, the ranked structures, the raw model outputs, some metadata and, of course, the timings. The top-ranked model, which is probably the one you want to use in your research, also contains the pLDDT scores, the per-residue confidence metric used by AlphaFold2. There is also a directory containing symbolic links to the downloaded databases and parameter files. The same applies for ColabFold, where you have an output directory, named according to the mode you selected (ColabFold webserver or ColabFold local), that again contains all the information we just described for AlphaFold2, plus the symbolic links to the downloaded databases. And of course, as in all nf-core pipelines, there is a directory with the pipeline info, execution trace files, and so on.
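As a small illustration of navigating these results, the sketch below inspects the AlphaFold2 output for one sequence. The directory layout and file names follow standard AlphaFold2 output conventions and are assumptions here; the exact names produced by the pipeline may differ slightly.

# Results for one input sequence (paths are illustrative)
RESULTS=results/alphafold2/my_monomer
ls "$RESULTS"            # msas/, unrelaxed and relaxed models, ranked models, timings, metadata
# Top-ranked model; per-residue pLDDT scores are typically stored in the B-factor column
head -n 5 "$RESULTS"/ranked_0.pdb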
Now for the next steps. We are at the point where we have to set up and run the AWS full-size tests in order to create the first release of the pipeline. After the first release, we are planning to add more open-source protein structure prediction tools, such as OpenFold, or even newer-generation prediction tools such as ESMFold or OmegaFold, which use protein language models and for this reason are roughly an order of magnitude faster than AlphaFold2 and ColabFold, but without losing accuracy; in fact, they reach the same or even better levels of accuracy. We are also interested in incorporating more advanced software for protein-protein interaction prediction, such as FoldDock, because plenty of researchers are interested in predicting multimers. Moreover, we want to fix bugs and add more optimizations upon request. We are very open to contributions and ideas to improve the pipeline even further and adapt it to the needs of the community, so please do not hesitate to contact us and propose changes or contribute to the existing repository.

At this point, before finishing, I would like to thank my colleagues, first from the Notredame lab, Jose Espinosa and Luisa Santus, who are contributing to this pipeline, as well as Seqera Labs, and especially Harshil Patel, for all the guidance and help during the implementation process, and our collaborators from Interline Therapeutics, especially Norma Gouda and Walid Osman, for testing the pipeline in the cloud. So thank you for your attention, and I would be very happy to answer any questions you may have.

And that's it, thank you. Thank you very much for the amazing talk. I will now allow everyone to unmute themselves; if anyone has any questions, please go ahead. Otherwise, I think I have one question. At the moment you just have AlphaFold2, right, and you're planning to add more tools, but not in this first release, in a coming one, right? Exactly, yes. And I assume the main issue with adding more tools is that it's a lot of databases that you need as input? Exactly, exactly, that's true, because each tool uses its own databases, let's say, so you need a lot of storage to be able to test everything or even to compare between tools. May I ask along these lines: do you basically retrain the model every time you run the pipeline, or does every institution retrain the model from scratch, or do you use pre-trained models? We use pre-trained models; we just download the models already provided by AlphaFold2 or ColabFold, let's say. And it still needs these huge databases? Yes, because this is separate from the training process; these databases are needed in order to create the input multiple sequence alignment, to have this whole set of homologous sequences so that the model is able to find the inter-residue correlations and build the final model. I see, thank you. I think we're good with the number of questions. So, thanks again, that was an amazing talk. I'm super happy to have learned more about it, and I'm hoping to see this release coming soon. So thank you all again, everyone.