Hello, welcome to the GTN Smörgåsbord global Galaxy course. Today I'm going to cover a tutorial about the functional analysis of proteomics data using ProteoRE, a proteomics research environment built upon the Galaxy framework. This is the link for access to ProteoRE. My name is Yves Vandenbrouck. I carry out research activities in bioinformatics and proteomics for biomedical research. I'm a member of the French Institute of Bioinformatics, and I also work in the proteomics team located in Grenoble. So what is ProteoRE? As I already said, it is a Galaxy-based instance for proteomics data interpretation, with a strong focus on biomedical research. Here are represented the main steps of a classical proteomics workflow. Starting from your biological sample, you have a typical experimental design; you process the sample and obtain raw MS/MS data, the MS/MS spectra, on which you perform quality control. Then you perform what we call the primary analysis, which aims at identifying proteins and quantifying them. This is the typical output delivered by a proteomics platform: a list of proteins that have been identified and quantified. Then the question is how to go further; that's what we call the downstream analysis. The idea is to perform the functional analysis and exploration of this raw data by trying to translate this primary information, this long list of proteins, into meaningful biological knowledge for interpretation purposes. So the idea of ProteoRE is to provide a set of tools, and workflows as well, to perform this functional analysis. As ProteoRE is built upon the Galaxy framework, it is also a biologist-oriented resource, which means that ProteoRE is easy to use even for biologists with little or no programming experience.
ProteoRE, of course, comes along with online training materials available from the Galaxy Training Network, where you can find them in the proteomics folder, and we also provide direct support. This slide is just to give you an idea of the tools that we have designed for ProteoRE. ProteoRE today comprises 21 tools, organized in the four sections that are listed here. The first section is more generic and relates to data manipulation and visualization, such as ID conversion or a volcano plot, for instance, for the visualization of your proteomics data. The second section is about the annotation process: retrieving information from databases, such as protein features, expression data, or MS/MS observations coming from the UniProt database, neXtProt, the Human Protein Atlas, or PeptideAtlas. As we are mainly focused on biomedical research, three species are supported: human, mouse, and rat. The third section is organized around tools for functional analysis, to perform functional profiling, classification, and enrichment analysis. This is mainly based on the Gene Ontology, and for those who are not familiar with this classification system, we recommend reading the documentation that you can find here. Then in the fourth section you can find tools to go further with the interpretation of your proteomics data at the pathway level; that's what we call pathway analysis. You can perform pathway enrichment analysis using Reactome, use KEGG for mapping and visualization of your information, and also build protein-protein interaction networks using experimental data coming from BioGRID. So now let's come to the tutorial, the practical part. Today we are going to perform the functional analysis of a protein list coming from an MS/MS-based proteomics experiment.
We are going to work on a real dataset that was published more than two years ago, coming from the following study: the proteomic characterization of human exhaled breath condensate, which we will refer to from now on by the acronym EBC. This was the experimental design. The EBC samples were collected and prepared, proteins were then concentrated and separated using a 1D SDS-PAGE gel, trypsin digestion was performed, and the resulting complex peptide mixtures were analyzed by liquid chromatography coupled with mass spectrometry. The sets of MS/MS spectra were then used to perform a database search using MaxQuant as the search engine, against the human section of the Swiss-Prot database, as we are working on a human sample, of course. We performed peptide identification and protein grouping, and those peptides and proteins were validated by applying a false discovery rate procedure with a threshold below one percent. We then ended up with a list of 161 identified and validated proteins. This is the dataset with which we are going to work. In column number one are listed the accession numbers coming from UniProt, in column number two the protein names, and in column number three the number of peptides observed for each protein. Now we are ready for the functional analysis of this protein list using the ProteoRE tools. This tutorial is mainly based on the hands-on that is available in the Galaxy Training Network, which you will find in the proteomics directory. We are not going to do exactly the same thing, but it is very similar, and we are going to answer the following questions. We are going to learn how to filter out technical contaminants. Then we will check for tissue specificity, because we are working on exhaled breath and there could possibly be biological contaminants coming, for instance, from the salivary glands during the collection of the sample.
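To make the shape of the input concrete, here is a minimal sketch of reading such a three-column tab-separated file with Python's standard library. The file contents below are a hypothetical excerpt, not rows from the actual Lacombe et al. dataset:

```python
import csv
import io

# Hypothetical excerpt of the dataset: UniProt accession, protein name,
# and number of observed peptides, tab-separated, with a header row.
raw = (
    "Protein accession number\tProtein name\tNumber of peptides\n"
    "P04745\tAlpha-amylase 1\t10\n"
    "P01833\tPolymeric immunoglobulin receptor\t4\n"
)

reader = csv.reader(io.StringIO(raw), delimiter="\t")
header = next(reader)            # keep the header row aside
rows = list(reader)              # one list of strings per protein

print(len(rows))                 # number of proteins in this toy excerpt
print(rows[0][0])                # the accession number sits in column 1
```

In the real tutorial you never write this code yourself; the Galaxy tools read the tabular dataset for you, but they make the same assumptions: tab-separated columns and a first column holding the UniProt accession.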
Then we will see how to perform an enrichment analysis using the Gene Ontology. Then we will map our protein list at the pathway level using the Reactome database. And at last we will compare our protein list with protein lists coming from other, previously published studies. This is how to access ProteoRE. First you have to go to the following address, proteore.org, using your favorite browser. And the first thing you have to do is to create your account. If that's not done yet, I propose that you take a break here while you create your account, then continue the video once it's done. Okay, so now let's switch to the practical part. First, we access the ProteoRE Galaxy server at the web address proteore.org. This is how your ProteoRE interface looks, with here the ProteoRE tools I already described, organized in the different sections. This is the main home page, and this is your history panel. First, as I mentioned previously, you have to create your own account if it's not already done. To do this, you can go to the login or register menu. If you do not have an account, you can register here, fill in the form, and you will receive your account details. If you already have an account, then you can log in here. I log in with my email address and password. Now, as you can see in the user menu, I'm logged in; we are connected to the ProteoRE Galaxy server. Before starting this tutorial, what I would recommend is that you scroll down to the bottom of the main home page. There you have access to the training material. Today we're going to annotate a protein list that has been identified by an LC-MS/MS experiment. You click, a new window opens here, and you have access to the hands-on training material from the Galaxy Training Network. We're not going to do exactly the same thing.
I mean, we are not going to use exactly the same parameters, for practical reasons, but we are going to perform the same steps. The first thing we have to do, if you scroll down, is to get the input datasets on which we are going to work. To do so, coming back to the Galaxy server, in the main menu here, from Shared Data you can access the datasets by clicking on Data Libraries. You select the folder for this tutorial, and then you select the three datasets that we are going to export to our history. You can select a history or create a new one, for instance "EBC protein list analysis". Okay, you import your datasets into your own history and come back to the main home page by clicking on the analysis home page. As you can see, there are now three datasets that we are going to use for this tutorial. The most important one is this one, coming from the study of the exhaled breath condensate. Let's just have a look at the content of this dataset. You can click on the eye icon to view the data. As I told you, there are three columns: the first one contains the protein accession numbers from UniProt, then the protein names, and the number of peptides here, and you can scroll. Okay. If we come back to the tutorial now, the first step is to filter out technical contaminants. Indeed, a group of 10 proteins was identified in both technical control samples, with an enrichment in the EBC samples below a fixed threshold. These contaminants should be removed from our initial dataset. This is the list of their UniProt accession numbers. To do this, we are going to use, in the ProteoRE tool panel, the "Filter by keywords" tool from the data manipulation and visualization section. So I click on this tool, and I select my dataset, the Lacombe dataset; this is the protein list we are going to work with.
We would like either to keep or to discard; here we are going to discard these known technical contaminants. We are going to filter by keywords, an identifier being treated as a keyword. So we copy the list of these contaminant identifiers and paste it here in this window. Note that these accession numbers are listed in the first column of the Lacombe dataset, so I leave the default here, column number one (c1), which is the column on which the filter is applied. We can also sort by a column; sorting by protein name, for instance, is more convenient. Then we execute the tool. The job is now running, as you can see. Two outputs are created. One contains the technical contaminants that have been discarded, so you can have a look: 10 contaminants, here, keratin, for instance. The other contains the remaining set of proteins with which we are going to work, and this represents a list of 151 proteins. Now, coming back to the tutorial, the second step is to check for the presence of biological contaminants. As I mentioned, the exhaled breath was collected using an RTube collection device, which may carry saliva, and that potentially introduces biological contaminants into our original sample. So we would like to check for the presence of these biological contaminants, and for that we have an interesting tool to give us this information: the "Add expression data" tool, to check for the presence of biological contaminants that are highly enriched in the salivary gland, for instance. This tool relies on the Human Protein Atlas database. As you can see, you have to provide Ensembl gene IDs, simply because the Human Protein Atlas relies on Ensembl gene IDs, and here, unfortunately, we have UniProt accession numbers.
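The filtering step we just performed can be sketched in a few lines of Python: partition the protein list into discarded contaminants and kept proteins by matching on the accession number in column one, producing two outputs just like the Galaxy tool does. The accessions below are illustrative (two common keratin contaminants), not the tutorial's actual contaminant list:

```python
# Accessions flagged as technical contaminants (illustrative: two keratins).
contaminants = {"P04264", "P35908"}

# A toy protein list: [accession, protein name, peptide count].
protein_list = [
    ["P04264", "Keratin, type II cytoskeletal 1", "12"],
    ["P01833", "Polymeric immunoglobulin receptor", "4"],
    ["P35908", "Keratin, type II cytoskeletal 2 epidermal", "7"],
]

# Split into the two outputs: discarded contaminants and the remaining set.
discarded = [row for row in protein_list if row[0] in contaminants]
kept = [row for row in protein_list if row[0] not in contaminants]

print(len(discarded), len(kept))
```

In the tutorial the same partition is done on 161 proteins, discarding 10 and keeping 151.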
So the first thing we have to do is to convert these UniProt accession numbers into Ensembl gene IDs. To do this, we come back to the "ID Converter" tool that you can find here. You click on this tool; I will apply the conversion to my dataset, the filtered Lacombe dataset. The UniProt accession numbers are listed in column number one, so we set column c1. This is of course the human species. We choose the release of UniProt we are going to use for the mapping from UniProt accession numbers to the other types of identifiers, which are the following: the Ensembl gene ID, and we are also going to consider the Entrez gene identifier. We will of course not re-add the UniProt accession number, since we already have it. So now I run the tool; the job is running. We are waiting for the mapping from UniProt accession numbers to Ensembl gene IDs and Entrez gene identifiers. Okay, this is done now. Again, you can check the content of this new output by clicking on the eye icon, and if you scroll, here, as you can see, two new columns have been created: column number four contains the Entrez gene identifiers and column number five contains the Ensembl gene IDs. Now we are going to check the tissue-specific expression of these proteins, thanks to these new identifiers, by using the Human Protein Atlas. Coming back to the "Add expression data" tool: our input IDs now contain gene IDs. This is applied to this output, and the Ensembl gene IDs are listed in column number five, so I have to explicitly set column number five. There is a header, of course, in my file, so I say yes. And I can add several pieces of information retrieved from the Human Protein Atlas database: for instance, the gene name and description; we are going to select the RNA tissue category, and also the RNA tissue specificity abundance. Now we run the tool; the job is currently running.
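Conceptually, the ID conversion is a lookup-table join: each row gains two new columns taken from a UniProt-to-gene mapping keyed on the accession number. Here is a minimal sketch of that join; the mapping values are illustrative placeholders, not taken from a real UniProt release:

```python
# Illustrative mapping from UniProt accession to (Entrez gene ID, Ensembl gene ID).
uniprot_to_ids = {
    "P01833": ("5284", "ENSG00000162896"),  # hypothetical pair for this example
}

rows = [["P01833", "Polymeric immunoglobulin receptor", "4"]]

annotated = []
for row in rows:
    # Unknown accessions get "NA", mirroring how mapping tools mark misses.
    entrez, ensembl = uniprot_to_ids.get(row[0], ("NA", "NA"))
    annotated.append(row + [entrez, ensembl])  # becomes columns 4 and 5

print(annotated[0][3], annotated[0][4])
```

The real ID Converter tool does the same thing at scale, using mapping files derived from the UniProt release you select.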
And again, we check the content of this new output, which now contains the information retrieved from the Human Protein Atlas. New columns have been created, such as the gene name, the gene description, and also the RNA tissue specificity, telling you how enriched your protein is in a given tissue. You have the information here, listed in this new column, as an RNA tissue-specific normalized value, which tells you in which tissue, for instance here the salivary gland, a given protein is highly enriched. This is typically the case of alpha-amylase 1, which is specifically expressed and secreted by the salivary gland; this is likely a biological contaminant. Okay. Now, just as a tip, I would suggest renaming this output to make it more self-explanatory: still the Lacombe dataset, filtered, with expression data from HPA. Save it; this output has now been renamed. Coming back to the hands-on, we switch to the next step, which is the functional annotation of this exhaled breath condensate (EBC) proteome by an analysis of Gene Ontology terms. To do this, we are going to select this tool, named "Classification and enrichment analysis", which is part of the functional analysis section. We click on it, and we are going to apply this analysis to our dataset. In this case, we can consider as input either the UniProt accession numbers or the gene IDs; we are going to select the gene IDs, just to see what happens. The gene IDs are listed in column number four, so we set column c4, and the type of ID is now Entrez. We can select different categories of the Gene Ontology; there are three: cellular component, molecular function, and biological process. Here we are going to select only the biological process. We perform the analysis; we can change the level within the Gene Ontology structure.
And we keep the same background; we are going to compute the enrichment against a reference background, which in this case is the whole human proteome. So now we run this analysis, and again, it may take a while to run. This is a good place to take a break while waiting for the results. Okay, the job is now over, and in your history panel what you can see is that you now have three outputs coming from this analysis. The results can be in the form of textual information. Okay, EGO stands for enrichment of Gene Ontology terms, here applied to the biological process category, and we can view the textual information. These are the GO terms that are significantly enriched according to our analysis, with the p-value, the adjusted p-value, and the number of gene IDs from your protein list that belong to each category. You also have the information about the gene ratio, telling you that 48 genes of your protein list belong to this GO term, against this background. And you also have access to graphical results by clicking here. Again, these are the enriched Gene Ontology results applied to the biological process category, which you can view in the form of a dot plot. What you can see here is the gene ratio, telling you the number of genes that are significantly enriched according to the p-value, represented by the color from blue to red, red being the most significant. And what you can see here is that neutrophil mediated immunity and neutrophil activation are significantly enriched in our protein list. Okay, so if you want to know more about the biological meaning of these results, you can refer to our hands-on; at the end there is a short comment about the output. You can also read the reference paper, of course, if you want to know more about this analysis of this protein list.
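To give an idea of what "enriched against a background" means statistically, here is a sketch of a one-sided hypergeometric over-representation test, the kind of test commonly used for GO term enrichment. This is an illustration of the principle, not the exact computation or correction procedure used by the ProteoRE tool, and the counts below are toy numbers:

```python
from math import comb

def hypergeom_pval(k, n, K, N):
    """P(X >= k): probability of seeing at least k annotated proteins when
    drawing n proteins from a background of N that contains K annotated ones."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# Toy setting: 48 of 151 list proteins carry a GO term that annotates
# 500 proteins out of a 20000-protein background (illustrative figures).
p = hypergeom_pval(48, 151, 500, 20000)
print(p < 0.05)  # far more hits than the ~4 expected by chance
```

In practice the adjusted p-values shown in the EGO output also account for testing many GO terms at once (multiple-testing correction), which this sketch omits.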
So now what we are going to do is visualize our EBC proteome mapped onto biological pathways by using the Reactome pathway database. To do this, coming back to the ProteoRE Galaxy server, we go to the pathway analysis section to see what happens at the pathway level, and we click on this tool, the pathway enrichment analysis using Reactome. As you can see, you can use as input IDs UniProt accession numbers, gene IDs, or gene names. This is applied to our initial dataset, from which the contaminants have been filtered out. The UniProt accession numbers are listed in column number one, and of course we have nothing to change regarding the species, because we are still working on the human species. We execute; okay, the job is now running, and it's quite fast. Then, again, we can visualize the results, which we can analyze by connecting directly, through a web service, to the Reactome database. What we can see here, interestingly, is that the pathway named "Neutrophil degranulation" has been found to be significantly enriched in our protein list. If you click on it in Reactome, you have access to this pathway, which is part of the immune system. So now, coming back to our tutorial, the next step is to compare our proteomics dataset with other datasets already published in previous studies. To do this, coming back to the ProteoRE Galaxy server, we do this comparison by building a classical Venn diagram. First, we have to enter the list to compare: our list, which is the filtered Lacombe dataset with expression data. As you know, the protein accession numbers are listed in column number one, and you can enter the name of this list.
And we are going to compare it with the two other datasets that we initially imported into our history. The first is the Bredberg dataset, which is simply composed of a list of proteins that were identified in the study from Bredberg and collaborators. So we select this input, the Bredberg text file. The UniProt accession numbers that will be used as the key for the cross-comparison are listed in column number one, so everything is okay, and the name of the list is "Bredberg". Okay, I forgot something important: as you may have noticed, there is no header in this dataset, so for the option "Does file contain header?" you click on "No". We are going to do the same for the Mucilli dataset, this one, which also has no header, as you can see; this is also a list of UniProt accession numbers. So we click on "Insert list to compare" to add a new one. This is now the Mucilli dataset, the third one. Okay, and again, the UniProt accession numbers are listed in column number one, and the name of the list is "Mucilli". Now we can execute the tool. If you scroll up in your history, two new outputs have been created: a textual output and the Venn diagram. You can view the resulting Venn diagram, showing how many proteins are specific to a given study or are shared among the three studies. The diagram is interactive: you can click on a segment and you have access to the list of UniProt accession numbers of the proteins that are shared between these three studies. You can also download this image to prepare a figure for your publication, for instance. In the second output, you also have access to the textual results: you can see listed what is specific to each dataset, and which proteins are shared between two datasets or between the three studies. Okay, so let's come back to our tutorial. This is now the conclusion; we are about to finish.
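Underneath, the Venn comparison is plain set algebra on the accession numbers: one set per study, then intersections and differences. Here is a minimal sketch with hypothetical accessions standing in for the three studies' lists:

```python
# One set of UniProt accessions per study (illustrative accessions only).
lacombe = {"P01833", "P04745", "P04264"}
bredberg = {"P01833", "P04745", "P69905"}
mucilli = {"P01833", "P02768"}

# Proteins common to all three studies.
shared_all = lacombe & bredberg & mucilli

# Proteins specific to our EBC list (in Lacombe but in neither other study).
lacombe_only = lacombe - bredberg - mucilli

print(sorted(shared_all), sorted(lacombe_only))
```

The Galaxy tool computes every such region of the three-way diagram and reports the counts graphically and the accession lists in the textual output.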
I would like to remind you that you can find useful literature here, if you want to know more about the biological meaning of these results, and we have the key points listed here. We would also like to thank you for listening to our tutorial, and we truly appreciate your feedback to improve our content. You will see this feedback section present in each and every GTN hands-on tutorial: if you go to the bottom of the page here, you can find the feedback form. We really appreciate your feedback, and I thank you for attending. If you have any questions or concerns, please contact us through this email address, or you can also contact us through our ProteoRE GitHub, and of course through the Slack channel with this name, for your protein list annotation questions. Please note that there is also another hands-on available through the Galaxy Training Network, in the proteomics directory, about the selection of biomarker candidates. And at last, I would like to thank all my colleagues from the ProteoRE team, and also the people from the GTN for their help and support, especially Bérénice Batut and Saskia Hiltemann, and the Galaxy for proteomics workgroup, with the Galaxy-P and Freiburg teams. If you want to know more about the practical use of ProteoRE, I have also listed here some publications, some real use cases using ProteoRE, that have been published or are about to be published. So thank you.