Welcome to the MOOC course on Introduction to Proteogenomics, and to the hands-on session on WebGestalt. In today's session, Dr. Bing Zhang will show you how the results and the job summary can give you useful information. He will also show you how the enrichment analysis can be visualized in different forms, and he will discuss different types of network-based methods, such as the direct neighbor-based approach, the module-based approach, and the diffusion-based approach. So, let us welcome Dr. Bing Zhang for today's hands-on session on WebGestalt.

GO Slim summary. First, this is not enrichment analysis. Don't report this as your enrichment analysis. This is simply a classification of the genes you submitted based on some pre-selected biological process, cellular component, and molecular function categories. It gives you an idea of, for example, how many genes are related to biological regulation, metabolic process, or response to stimulus. These are the high-level categories.

Yeah. Well, again, I think you are a very good representative of the users. We got this request from a lot of users. So, yeah, that is on our to-do list, and we will do that in the 2019 release. And it is basically similar to the pie chart you typically see for classifying genes.

And then the important part is the enrichment results. Let's not do the redundancy reduction yet. The default view is a bar chart. So basically, this shows you the enriched categories. Because we chose the top 10 option, we get 10 categories here, right? And on the axis, you have the enrichment ratio from the Fisher's test analysis. This chart can be downloaded if you right click, and you can save it as PNG or SVG for your presentation, or even for publication; I think the quality is good enough. Or maybe you want to visualize the results in another way.
You can visualize all your results in a volcano plot like this. And then you see these are the GO terms, with the top 10 categories highlighted. You also have the option to change it. You can change the label from the gene set ID, like the Gene Ontology ID, which does not tell you what the term is, to something more descriptive, like the description of the Gene Ontology term. If you don't have this volcano plot, then you are probably still using the old version. The old website looks like this, so you have to use the WebGestalt 2019 beta version, or just put 2019 in your URL.

Yeah, next. This is a completely new feature, and the programmer did a very nice job. As you can see, these labels are crowded, and sometimes you won't be able to read them. But the good thing is that you can use your mouse to move a label to the place where you want it to be. Of course, after moving a label it is difficult to see which point it belongs to, so you can click on draw link, and then basically you can rearrange the labels in a way that can be used for publication.

Yeah, the x-axis is the log2 of the enrichment ratio. So basically, you have the expected number of genes if you do random sampling, and the enrichment ratio tells you how much enrichment you have relative to that. And then the y-axis is the FDR in log scale. After you move the labels around, you can see that you basically have a very nice view of everything. You can keep the links, or you can remove them to get a clean view, and then you can download this plot. Yes, it is minus log FDR, so a smaller FDR ends up higher on the plot.

So this is the volcano plot view. You can also do the DAG view: because these terms are all organized as a directed acyclic graph in Gene Ontology, you can see that structure as well. But this was available in the old version too.
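The two volcano plot axes described above can be sketched in a few lines. This is a minimal illustration, not WebGestalt's own code: the function name `volcano_point`, the base-10 logarithm for the FDR axis, and the small floor used to avoid taking the log of zero are all assumptions made for the example.

```python
import math

def volcano_point(overlap, expected, fdr, min_fdr=1e-16):
    """Return (x, y) for one gene set on a volcano plot.

    x = log2(enrichment ratio), where ratio = overlap / expected
    y = -log10(FDR); base 10 and the min_fdr floor are assumptions
    """
    ratio = overlap / expected
    x = math.log2(ratio)
    y = -math.log10(max(fdr, min_fdr))  # floor avoids log(0)
    return x, y

if __name__ == "__main__":
    # Illustrative numbers only: 8 observed genes where 2 were expected.
    x, y = volcano_point(overlap=8, expected=2, fdr=0.01)
    print(f"log2 ratio = {x:.2f}, -log10 FDR = {y:.2f}")
```

A smaller FDR gives a larger `y`, which is why the most significant categories sit at the top of the plot.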
And sometimes, from any of these views, you end up with too many categories, and some of them are quite redundant. As you can see, GO has this hierarchical structure, I mean the parent term and the child term. So you want to simplify this, and for example you can do the simplification based on affinity propagation. Let us look at this in a bar chart. You originally have 10 significant terms, but if you do the affinity propagation you end up with 3. That means it groups the 10 terms into basically 3 clusters and picks only one from each as a representative, which will simplify your interpretation. Another algorithm we implemented is the weighted set cover, and this one ends up with 4. Either of these is useful for simplifying your results. For 10 terms, I mean, it is not a problem at all, but you can imagine that when you have 200, this would be a very useful feature.

And now we have the overview of the results, and you can download and save it. But of course you want to understand why a category is enriched, right? You want to look at the detailed result. You can click on any of these bars, and that will show you the detailed results for that bar in the bottom part of the page. For example, this one shows the apoptotic process. You can click on it, and it is linked to the AmiGO database, which will give you a description of what the apoptotic process is. And then here you have the FDR result and the p-value before the adjustment. You have the gene set size, the expected value, the overlap (the number of genes), and the enrichment ratio. So basically, the enrichment ratio is the overlap divided by the expected value. You also have a very easily understandable Venn diagram to help you understand this result. And then all the overlapping genes are here in this table; you can browse all the genes, and you can also sort them in different ways.
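The numbers in the detailed result table can be reproduced with a short calculation. The sketch below assumes the expected value is what random sampling from a reference gene list would give, and it uses a one-sided hypergeometric test for the unadjusted p-value; the function name `ora_statistics` and the toy counts are hypothetical, and the exact reference set WebGestalt uses depends on the parameters chosen at submission.

```python
from math import comb

def ora_statistics(overlap, set_size, list_size, reference_size):
    """Over-representation statistics for one gene set.

    overlap        : genes shared by the input list and the gene set
    set_size       : genes in the gene set (within the reference)
    list_size      : genes in the input list (within the reference)
    reference_size : genes in the reference (background) list
    """
    # Expected overlap if list_size genes were drawn at random.
    expected = set_size * list_size / reference_size
    ratio = overlap / expected
    # One-sided hypergeometric p-value: P(X >= overlap).
    upper = min(set_size, list_size)
    favorable = sum(
        comb(set_size, k) * comb(reference_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    p_value = favorable / comb(reference_size, list_size)
    return expected, ratio, p_value

if __name__ == "__main__":
    # Toy counts, for illustration only.
    expected, ratio, p = ora_statistics(
        overlap=3, set_size=5, list_size=4, reference_size=20
    )
    print(f"expected={expected:.2f} ratio={ratio:.2f} p={p:.4f}")
```

The p-value here is the value before multiple-testing adjustment; the FDR column is computed afterwards across all tested gene sets.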
So, yeah, let us say you did this analysis and you want to share it with your colleagues, or you want to share it with your supervisor, who just does not have time to run the analysis. Then you can click on the result download. This will save the result as a zip file. If you open this zip file, it includes the HTML file and other files, and if you click on that HTML file, it will basically reproduce this exact result. So it also provides a very easy way to share your results with others.

So, yeah, it is as simple as this, but you can explore the data in many different ways. You can go back and change the settings. Let us say we did the analysis against the biological process, right? Now we want to do pathways. Let us change the functional database to the pathway type, use WikiPathways for the analysis, and maybe change the cutoff from the top 10 to FDR 0.05. And then you can submit. So basically, this will compare your gene list with all the WikiPathways pathways. Let us see what we get. I am using WikiPathways as a demo; you can choose Reactome if you want, or KEGG, but I am using WikiPathways here.

For the FDR, yeah, I think you can do 0.05, for example. Some people go up to 25 percent, 0.25, but I think usually you go with 0.05 or 0.01.

This is the result I got. So basically it is very similar to the Gene Ontology analysis, but of course we do not have the DAG, because this is not Gene Ontology, so there is no directed acyclic graph structure anymore. But you still have the volcano plot, and you can still put the description here. And now you can see that we have a lot of enriched categories.
So it is more useful to apply the affinity propagation here. For example, you have a lot of categories, right, and it is difficult to go through them all. Using affinity propagation or weighted set cover will give you a reduced set that basically highlights the representative ones. And, yeah, that will make your volcano plot easier to look at as well.

Another difference between this analysis and the Gene Ontology analysis is that Gene Ontology just puts genes together, without defining the relationship between the genes within that gene set, right? But here we can have the pathway map. Yeah, this is the result. If you click on one of these pathways and then click on the pathway ID, it will also show you the overlapping genes. There are 6 genes here, right? And then you have the pathway map, and the genes will be highlighted in the map. I do not know why only one is highlighted here. Let us try another one. Sometimes one node might represent a complex or something like that, so there might be multiple genes in one node. But this provides a very easy way to not only have the pathway identified, but also to know where your genes are located on the pathway. And this highlighting function is available in all of the pathway analyses, such as KEGG, Reactome, or WikiPathways.

So that is the first example, and I think you can change the parameters in different ways. For example, we only looked at the Gene Ontology biological process and WikiPathways, using a specific cutoff, but you are free to change those and do your own analysis. In the next example, I want to show you how you can do gene set enrichment analysis in WebGestalt, and specifically what we have here is a pre-ranked analysis.
We do not do the differential analysis in the system. Rather, you choose your statistical method and do your differential analysis, and after that you get a statistical p-value that will allow you to rank all the genes. For the ORA, as I said, let us look at the ORA sample: the input is very simple, it is just a list of genes. So basically you do a differential analysis or a correlation analysis, you identify 100 or 200 genes, and then you copy and paste them here, or you put them in a text file and upload it here.

But if you want to do GSEA, let us click on this GSEA sample run. What you need to have is something like this. Every gene has a value associated with it, and in this case, for example, this could be a t statistic or a minus log p-value or something like that. So that will allow you to rank the genes. But notice that in this case you do not set any cutoff. If you do a proteomic study and you identify 5000 proteins, and you have a t statistic for all 5000, you need to put every single protein in this file, in this format: one gene or one protein, and its value. Then the system will sort the genes and, for a specific gene set, identify where the genes of that set are located in the ranked list.

So, yeah, because of the time, let us just do the GSEA sample run; I think that would be easier than getting the file and uploading it. But when you prepare your own file, it is basically the same as this, just a text file with two columns. Remember, for ORA it is just one column, just a gene list. For GSEA you always have to have a gene and a value from your statistical test. For example, you can calculate the correlation of the genes with drug sensitivity, so that for each gene you have a Pearson's correlation, and then you can use that to rank all the genes.
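Preparing the two-column GSEA input described above might look like the following sketch. The function name `write_rank_file`, the tab delimiter, and the scores are assumptions for illustration; the gene symbols are placeholders, and sorting is done here only for readability, since the system ranks the genes itself.

```python
import csv
import io

def write_rank_file(scores, handle):
    """Write one 'gene<TAB>value' line per gene, with no cutoff applied.

    scores : dict mapping gene (or protein) identifier to its statistic,
             e.g. a t statistic or a correlation coefficient
    handle : any writable text file object
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Sorting is optional; every gene is written, not just the top ones.
    for gene, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        writer.writerow([gene, score])

if __name__ == "__main__":
    # Made-up statistics for three genes, for illustration only.
    example = {"TP53": -2.5, "FN1": 4.1, "COL1A1": 3.2}
    buffer = io.StringIO()
    write_rank_file(example, buffer)
    print(buffer.getvalue(), end="")
```

For ORA the equivalent file would contain only the first column, and only the genes that passed your own significance cutoff.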
So, if you succeed in getting the result, the top part will be very similar to the ORA analysis. Basically you still get the bar chart or the volcano plot, or, if you do GO, you get the DAG of your enriched GO terms. The major difference is here: you no longer get the Venn diagram that shows you the overlap between your uploaded gene list and the genes in the enriched gene set.

For example, let us say we are interested in this focal adhesion gene set. We click on it, and this is the ranked list; this is your input data. Because every gene came with a value, the system was able to rank all the genes from the largest value to the lowest value, right? Here you can see the ranked list metric: this is the number one gene, this is the last gene, basically going from the positive values to the negative values. And for the focal adhesion genes, these bars basically represent where the focal adhesion genes are located on this ranked list. As we can see, there is a tendency for these genes to be located on this side rather than that side. Basically, that means there are a lot more up-regulated focal adhesion genes than down-regulated ones, of which there are very few, and that is why the set is enriched on this side.

It also has this enrichment score plot, and this part is the leading edge, meaning these are the genes that give you the enrichment signal, and these are the genes that are listed in the table at the bottom. A total of 53 genes out of the 128 genes are in the leading edge that is giving you the enrichment signal.

So basically, if you think about the Venn diagram and this plot, it can probably help you better understand the difference between these two approaches. In one approach, you have a cutoff, so the data becomes two sets, you compute the overlap, and then you use a Fisher's exact test or a hypergeometric test to check the enrichment of the overlap.
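For contrast with the cutoff-based test just described, the enrichment score plot and the leading edge can be sketched as a running sum over the ranked list. This is a simplified version of the standard weighted running-sum statistic, not WebGestalt's implementation: the function name, the weight of 1, and the toy data are assumptions, and the permutation step that produces p-values and FDR is omitted.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score and leading edge for one gene set.

    ranked   : list of (gene, value) sorted from largest to smallest value
    gene_set : set of gene identifiers
    Returns (score, leading_edge_genes).
    """
    in_set = [gene in gene_set for gene, _ in ranked]
    n_hit = sum(in_set)
    n_miss = len(ranked) - n_hit
    hit_total = sum(abs(v) ** weight for (_, v), hit in zip(ranked, in_set) if hit)
    if n_hit == 0 or n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must partly overlap the list with nonzero values")

    running, best, best_idx = 0.0, 0.0, -1
    for idx, ((_, value), hit) in enumerate(zip(ranked, in_set)):
        if hit:
            running += abs(value) ** weight / hit_total  # step up at a set gene
        else:
            running -= 1.0 / n_miss                      # step down otherwise
        if abs(running) > abs(best):
            best, best_idx = running, idx

    if best >= 0:
        # Positive peak: set genes at or before the peak drive the signal.
        leading = [g for (g, _), hit in zip(ranked[: best_idx + 1], in_set) if hit]
    else:
        # Negative peak: set genes after the peak drive the signal.
        leading = [g for (g, _), hit in zip(ranked[best_idx:], in_set[best_idx:]) if hit]
    return best, leading

if __name__ == "__main__":
    # Toy ranked list with made-up gene names and values.
    ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
    score, edge = enrichment_score(ranked, {"A", "B", "E"})
    print(round(score, 3), edge)
```

Every gene contributes to the running sum, so no significance cutoff is applied to the input list.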
But here you are not setting a cutoff. You get all the values for all the genes, and then you are testing the enrichment of the gene set in the ranked list. Again, at the top there is a download button, so you can also download, save, and share your results.

So, finally, let us take a look at the NTA sample run. For the NTA sample run, the input is also very simple; it is again just a list of genes. Let us say you have a list of differentially expressed genes, or some other genes; you can just put them here. The idea is this: let us say we do an association study and we identify 300 genes that are potentially interesting. Which one should I test first, which one should I do an experiment on first? This will help you to answer that question.

There are two options: one is network expansion, and the other is network retrieval and prioritization. Prioritization is for when you have a lot of genes and you want to pick the top ones. But this can also be used in another way, and I can show you another example later. Let us say you have one gene, but you do not know the function of this gene. Then you can just type the name of the gene here and do the same analysis. That will help you to retrieve the neighborhood of that gene and then do a GO enrichment analysis to help you predict the function of that gene.

But let us look at this first. Yes, I think the analysis will probably take very long, because there are a lot of people submitting jobs at the same time. But this is the result, and if you do the analysis later and succeed in getting the result, it looks like this. At the top you have the job summary, and here you have the sub-network. If we go back here, we see that basically this is our input gene list, right? And for the network we chose PPI BioGRID, so this is a protein-protein interaction database.
So basically we are mapping these genes to a protein-protein interaction network. First we retrieve all the genes that are included in the network, and then you get a sub-network like this. Then you notice the top 10 genes, because your parameter setting was to highlight the top 10 genes, right? These are the 10 genes that are highlighted. You can zoom in, and it will tell you, for example, that FN1, collagen A1, and TGF beta 1 are among the top genes based on the network diffusion analysis. That means these are probably the hub genes, or they have more connections to the other genes in this network than the genes that were not selected for highlighting.

On the right side, basically, there is a Gene Ontology enrichment analysis for the network. It will tell you what the enriched functions are for these genes. You can click on one of them, for example this one, response to wounding, and then the genes with that function will be highlighted. So basically it provides you a way to retrieve the network, to understand the major functions associated with the network, and also to visualize which genes in the network have a given function.

And at the bottom you have all the genes you submitted and their ranks. Here you can see, for example, that FN1 is number 1 and THBS1 is number 2, and basically you can have your complete list here. And this is the Gene Ontology enrichment analysis result for the network. So basically it is the typical Gene Ontology over-representation analysis report, similar to the ORA analysis, but focusing on the genes in the network and using the Gene Ontology biological process for evaluation.

So these are the three major functions of WebGestalt.
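The idea of ranking genes by network diffusion can be illustrated with a random walk with restart on a toy network. The lecture only says that the ranking comes from a network diffusion analysis, so the specific algorithm, the restart probability of 0.5, the function name, and the small star-shaped network below are all assumptions chosen for the example.

```python
def random_walk_with_restart(edges, seeds, restart=0.5, iterations=100):
    """Rank nodes of an undirected network by diffusion from seed nodes.

    edges : iterable of (node_a, node_b) pairs
    seeds : iterable of seed nodes (e.g. the submitted gene list)
    Returns a list of (node, score) sorted from highest to lowest score.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seed_nodes = {s for s in seeds if s in neighbors}  # seeds found in the network
    if not seed_nodes:
        raise ValueError("none of the seeds are present in the network")

    start = {n: (1.0 / len(seed_nodes) if n in seed_nodes else 0.0) for n in neighbors}
    current = dict(start)
    for _ in range(iterations):
        # Restart: a fraction of the probability returns to the seeds.
        updated = {n: restart * start[n] for n in neighbors}
        # Walk: the rest is split evenly among each node's neighbors.
        for node, adjacent in neighbors.items():
            share = (1.0 - restart) * current[node] / len(adjacent)
            for other in adjacent:
                updated[other] += share
        current = updated
    return sorted(current.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical network: one hub connected to three seed genes.
    toy_edges = [("HUB", "G1"), ("HUB", "G2"), ("HUB", "G3")]
    for node, score in random_walk_with_restart(toy_edges, ["G1", "G2", "G3"]):
        print(node, round(score, 3))
```

In this toy case the hub receives the highest score even though it was not a seed, which is the behavior that makes highly connected genes rise to the top of a diffusion-based ranking.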
So, again, for the ORA and the NTA, the input is just genes. It is very easy: you can just copy and paste your list here. But make sure you understand the parameter settings and choose the right parameters. The result is then very simple to understand, I think, and you can download the figures very easily from the interface.

I hope today's session was useful to you and that you got an idea about exploring WebGestalt results. We learned that WebGestalt gives you data on gene enrichment, Gene Ontology, protein-protein interaction modules, and pathway analysis. In pathway analysis, we learned that we can choose different functional databases such as KEGG, Reactome, and WikiPathways. The pathway analysis results can be filtered with stringent criteria such as a false discovery rate of 0.05. I hope you appreciate that resources like WebGestalt are precious, open-access resources available to the community, with which you can take your complex mass spectrometry or even NGS dataset and get an idea of what biological sense the data makes. How do you really get the best biology out of those complex mass spectra? How best can you address the biological question that you originally wanted to start with? So these resources are highly valuable. Of course, these short lectures and hands-on sessions may not be able to provide you with all the information, but the more familiar you make yourself with these tools by using them, the more you will appreciate that the same dataset you obtained from these high-throughput technologies can now give you some very novel insight, and probably the right answer to the biological question you wanted to address. In the next lecture, Dr. Bing Zhang will teach you about network analysis. Thank you.