Welcome to the MOOC course on Introduction to Proteogenomics. This is Dr. Bing Zhang's hands-on session on WebGestalt. WebGestalt, the WEB-based GEne SeT AnaLysis Toolkit, is a web tool for functional enrichment analysis. Dr. Bing Zhang will briefly describe the software and its three well-established and complementary methods for enrichment analysis: over-representation analysis (ORA), gene set enrichment analysis (GSEA), and network topology-based analysis (NTA). He will also discuss the parameters of each method, input a gene list, and run the analysis. So, let us welcome Dr. Bing Zhang for today's session.
I am going to introduce two tools that have been developed in my lab, mostly for pathway and network-based analysis, and also for analyzing data generated by the TCGA and CPTAC cancer programs. One of the tools is called WebGestalt; the other is called LinkedOmics. Let me give a little structure to this hands-on session. First I will give a very brief introduction to WebGestalt, then we can have some practice, and then we can take a short break if you want, or go on to the LinkedOmics introduction and hands-on part. The underlying methods have already been described in some of the earlier lectures and in my talk yesterday, so the introduction can go very quickly.
WebGestalt was actually my first bioinformatics project; it is a tool I created when I was a postdoc. The motivation is this: when you do RNA-seq, proteomics, or other high-throughput technologies (when I started, I was doing microarrays), you usually need to do differential expression analysis, clustering analysis, that type of thing. Eventually you end up with a list of genes of possible interest, such as a list of differentially expressed genes, or genes from the clustering analysis introduced by Dr.
Manley during the previous lectures. With clustering, you get clusters of genes that have similar expression patterns, and you want to explore those genes. One way to do this is to leverage the information we already have about pathways. To do this, you first need some pathway definitions, such as pathway databases, or more loosely defined functional categories, or really any gene sets. Then there are two types of analysis, over-representation analysis and gene set enrichment analysis; I think we already talked about these yesterday in Carson's lectures.
In WebGestalt we have collected a lot of pathways, functional categories, and other types of gene sets, either from different databases or through computational analysis. The slide is too small for you to read on the screen, but if you have the lecture notes you should be able to read it on your computer. The gene sets can be separated into different categories, and in total there is a very large number of gene sets that can be used for your analysis.
For over-representation analysis, just a quick reminder: you have a list of genes of interest. Let us say you run a proteomics experiment or a similar analysis and get a list of differentially expressed genes, and you want to compare it with a gene set or pathway to see whether there is any association between them. For example, we may want to compare the list with the Gene Ontology category for development. You can overlap the two and count the number of overlapping genes. If you see any overlap, that indicates that some of your genes are involved in this biological process, but you do not yet know whether the category is over-represented (enriched) in your list or not.
What you can do then is randomly sample the same number of genes as in your list, for example 581 genes randomly sampled from the proteome you studied, and compute the overlap again. Here, for example, you observe 98 overlapping genes, but only 65 in the random experiment. So you can say, OK, I have more overlapping genes than expected by chance, but is that significant or not? To answer that, you use Fisher's exact test, also called the hypergeometric test, which I think Carson already talked about yesterday, and you get a p-value that tells you whether the overlap is significant. And because, when you use Gene Ontology or other databases for this type of analysis, you are testing many different GO biological processes or pathways at the same time, you also need to do a multiple-test adjustment to justify your observation.
I want to mention the limitations of this approach. First, you have to define what is differentially expressed and what is not, which means you have to set a cutoff; for example, people often use a false discovery rate of 0.01 or 0.05 as the cutoff. But what about a gene that just misses the cutoff, at 0.011? Is that gene not useful at all? Probably it still is; setting this cutoff is very arbitrary. Second, after you set the cutoff, all the remaining genes, say those with a p-value or FDR below 0.01, are treated as the same. But they are not the same: some may have a 10-fold change while others have only a 2-fold change. With this approach you basically treat all the significant genes alike. That is a major limitation. Gene set enrichment analysis addresses this limitation by using a ranked list.
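The over-representation test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list size (581) and the observed overlap (98) come from the talk, but the reference size and the category size are hypothetical values chosen so that a random list would overlap by about 65 genes. The upper tail of the hypergeometric distribution is the one-sided Fisher's exact test for enrichment.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(overlap, list_size, category_size, reference_size):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from reference_size genes, category_size of which belong to the category."""
    max_overlap = min(list_size, category_size)
    favourable = sum(
        comb(category_size, k) * comb(reference_size - category_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    # Exact integer arithmetic, converted to a float only at the end.
    return float(Fraction(favourable, comb(reference_size, list_size)))


# From the talk: 581 genes of interest, 98 of them in the category.
list_size = 581
observed_overlap = 98

# Hypothetical values, chosen so that a random list overlaps by about 65 genes.
reference_size = 10_000
category_size = 1_119

expected_overlap = list_size * category_size / reference_size
p_value = hypergeom_upper_tail(
    observed_overlap, list_size, category_size, reference_size
)

print(f"expected overlap: {expected_overlap:.1f}")
print(f"enrichment ratio: {observed_overlap / expected_overlap:.2f}")
print(f"p-value:          {p_value:.2e}")
```

In a real analysis this p-value would be computed for every gene set in the chosen database and then adjusted for multiple testing.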
Instead of looking at the overlap, GSEA uses data from all the genes: you rank all the genes, for example from the most downregulated to the most upregulated. Then, for a gene set of interest, you look at where its genes fall in the ranked list. If there is no association between the ranked list and the gene set, you would expect the genes to be randomly, or evenly, distributed across the ranked list. But in this case, for example, you see an over-representation of the gene set at the top of the ranked list, which is why you may think there might be an association between this gene set and the ranking. To test this we use the GSEA statistic, a test statistic derived from the Kolmogorov-Smirnov test. I think Carson already talked about this today, so I will not go into the details. From this you can run random simulations to get a random (null) distribution, and from that you get a p-value. Again, if you do this for multiple gene sets, like all the GO categories or the pathway databases, you need to do a multiple-test adjustment.
This is in a way better than over-representation analysis, but there are still limitations. It relies on existing knowledge in the pathway databases, and pathway and Gene Ontology annotations are still limited; for a lot of genes we do not really know much about their functions. It also treats each pathway as a separate entity and ignores the crosstalk between pathways, although we know that, even if you can consider pathways as relatively independent units, they still talk to each other. As I said yesterday, every protein is actually linked in the cell system. That is why the second type of approach is to map your data to a network and then do network-based analysis; we talked about a few different methods yesterday.
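To make the ranked-list idea concrete, here is a minimal sketch of an unweighted, Kolmogorov-Smirnov-like running-sum score with a permutation p-value. It is a simplification, not the full GSEA procedure: published GSEA weights hits by the ranking metric and normalizes the score, and this sketch does neither. The gene names, the list length, and the function names are all hypothetical, and the gene set is assumed to contain at least one gene and to leave at least one gene outside it.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-like running sum; returns the largest deviation from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up at a gene-set member, step down at any other gene.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """One-sided p-value: how often a random set of the same size scores as high."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(gene in gene_set for gene in ranked_genes)
    as_extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if enrichment_score(ranked_genes, random_set) >= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_perm + 1)


# Hypothetical ranked list of 20 genes, ordered by some ranking metric.
ranked = [f"g{i}" for i in range(20)]
top_set = {"g0", "g1", "g2", "g3"}       # concentrated at the top of the list
spread_set = {"g2", "g7", "g12", "g17"}  # spread evenly across the list

es_top = enrichment_score(ranked, top_set)
es_spread = enrichment_score(ranked, spread_set)
p_top = permutation_p_value(ranked, top_set)

print(f"ES (top-heavy set): {es_top:.3f}, permutation p = {p_top:.4f}")
print(f"ES (spread set):    {es_spread:.3f}")
```

The set concentrated at the top of the list reaches a large running sum, while the evenly spread set stays close to zero, which is the contrast the speaker describes.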
We also mentioned pathway interaction databases. The basis for this type of analysis is that proteins close to each other in the network tend to have similar functions. We talked about a few different methods, and in WebGestalt we have basically implemented two approaches. One is the module-based approach: for each network we precompute the modules, treat each module as a gene set, and then you can do enrichment analysis against those predefined modules. The other is a diffusion-based approach, which allows you to do gene prioritization.
This is an overview of WebGestalt. The goal is to translate a list of genes of interest into biological and pathway-level understanding. The system supports 12 organisms, so not only human or mouse but other model organisms as well. I talked to some of the students in the audience, and I know some of you do not work on human. All the examples I give today will be in human, but if you work on, say, C. elegans, you will be able to use this resource as well. We also support a lot of different types of gene IDs. For example, whether you use UniProt or Ensembl for the proteomics database search, you will be able to use the tool without worrying that the system does not recognize your IDs. In total there are more than 100,000 gene sets of different types that can be used for your analysis: from Gene Ontology, from different pathway databases, from different types of protein-protein interaction networks, from cancer-derived gene expression networks, and also phenotype-associated gene sets, drug-related gene sets, etc.
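The diffusion-based idea mentioned above (score spreads from the genes of interest through the network, so nearby genes rank higher) can be illustrated with a generic random walk with restart. This is a sketch of the general technique, not WebGestalt's own implementation, which the talk does not show. The toy network, the restart probability, and the function name are assumptions made for illustration, and the seed genes are assumed to be present in the network.

```python
def random_walk_with_restart(edges, seeds, restart=0.5, n_iter=100):
    """Diffuse score from seed genes over an undirected network."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # The walk starts at, and keeps restarting from, the seed genes.
    start = {
        node: (1.0 / len(seeds) if node in seeds else 0.0) for node in neighbours
    }
    score = dict(start)
    for _ in range(n_iter):
        updated = {}
        for node in neighbours:
            # Each neighbour passes on an equal share of its current score.
            inflow = sum(score[nb] / len(neighbours[nb]) for nb in neighbours[node])
            updated[node] = (1 - restart) * inflow + restart * start[node]
        score = updated
    return score


# Hypothetical toy network: a chain A - B - C - D - E with A as the only seed.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
scores = random_walk_with_restart(edges, seeds={"A"})

# Genes closer to the seed end up with higher scores.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Ranking all genes by their final score is one simple way to prioritize genes that are not in the input list but sit close to it in the network.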
We support all three types of analysis: over-representation analysis, GSEA, and network topology analysis, which is basically the diffusion-based analysis. The output is very interactive. This slide is based on a figure from the 2017 paper, but what I am going to demo today is the 2019 version. It is not 2019 yet, but we are preparing that paper. This is actually my first demo of the new system, and you will be the first group of people to see it. It is still a beta version, but I think it is quite stable and you should be able to use it now.
I have put together a step-by-step hands-on guide for the examples we are going to go through today. You can go to the workshop area. Has everyone got that file, a PDF or Word file with the steps for the analysis? So, let us get started.
It is a web application, so you do not need to install or download anything; you just use your web browser to go to the right URL and you should be able to get started. I recommend Chrome as the browser, because most of our developers use Chrome as their primary web browser for development, so the tool is best tested there. There is guidance at the bottom of the page: if you use Google Chrome, Safari, or the latest version of IE, I think you should be fine. But Chrome is free, so anyone can download it and use it if possible.
So, let us go to the WebGestalt website. You can either type in the URL for the 2019 version, or just Google WebGestalt and go to the 2019 beta version, and you should see this interface. In the first example I will show how we can use WebGestalt to understand a list of genes associated with colon cancer. It is already preconfigured as the ORA sample run.
If you click on ORA sample run, it will automatically fill in all of the parameters and also upload a gene list. This is a list of genes related to colon cancer. We got this list from another tool developed in the lab called GLAD4U, Gene List Automatically Derived For You. With that tool you can type in anything you are interested in, and it will give you the list of genes related to that concept; it could be a disease, a biological process, or anything else you are interested in. This time we got 487 genes related to colon cancer from that tool.
Let us go through the parameter settings to understand what they mean. The first one is easy: select the organism of interest. These genes are from human, so you select Homo sapiens; if you have data from other organisms in your future research, you can select the appropriate organism. Next, select the method of interest. We are going to do over-representation analysis because we do not have a ranked list. This is the situation where you have a list of genes without any statistical metric, or anything else, that you can use to rank them. For example, if you do clustering and get the list of genes in one cluster, there is not really a rank, so you cannot use GSEA, and so we choose ORA.
We will just use Gene Ontology biological process for this basic analysis. But here you can see that, in addition to Gene Ontology, the functional databases include pathways, network modules, disease-related gene sets, drug-related gene sets, phenotype-related gene sets, genes in different chromosomal locations, and some community-contributed gene sets.
For example, some of you are interested in certain types of studies, like infectious disease or other diseases, and you may have lists of genes that you want to contribute to the community so that other people can test their data against your gene sets. This is why we created the community-contributed category; these are from labs that want to share their gene sets so that others can compare their new studies against them. So, let us select Gene Ontology and biological process for today.
Then the gene ID type: here what we have is gene symbol. As you can see, we support a lot of different gene IDs, from all the different types of microarray IDs to Illumina IDs, and for proteins we have Ensembl, RefSeq, UniProt, all these different types of IDs. But the one we are using here is gene symbol.
For the reference gene list, in this case we use all the protein-coding genes in the genome, because this list came from a literature search and we do not really know what space was searched; we just assume that every gene has had some opportunity to be studied. But in your own research it is very important to consider what to use as the reference set for enrichment analysis. For example, if you do RNA-seq, it might be safe to use all the genes, because RNA-seq is fairly unbiased: you get to investigate all of the genes in the genome. But if you use proteomics, you can usually only get part of the proteome; a lot of proteins are not identified. So when you say you identified 100 differentially expressed proteins, that is out of maybe 10,000 proteins you quantified, or sometimes 8,000 or even 3,000. For a lot of proteins you never had the opportunity to get the statistics, so you should limit your search space, the background reference set, to those proteins you could actually analyze.
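The effect of the reference set can be seen with a small calculation. The sketch below runs the same hypergeometric test twice, once against a whole-genome background and once against only the quantified proteins. All numbers are hypothetical (100 differential proteins, an 80-gene category whose members were all quantified, 5 of them in the list, 20,000 genes in the genome, 8,000 proteins quantified), and `ora_p_value` is an illustrative helper, not part of any tool.

```python
from fractions import Fraction
from math import comb


def ora_p_value(overlap, list_size, category_size, reference_size):
    """Upper-tail hypergeometric probability of seeing at least `overlap` genes."""
    favourable = sum(
        comb(category_size, k) * comb(reference_size - category_size, list_size - k)
        for k in range(overlap, min(list_size, category_size) + 1)
    )
    return float(Fraction(favourable, comb(reference_size, list_size)))


# Hypothetical experiment: 100 differential proteins, 5 of them in an
# 80-gene category whose members were all quantified.
list_size, category_size, overlap = 100, 80, 5

# Background 1: every gene in the genome (hypothetical size).
p_genome = ora_p_value(overlap, list_size, category_size, reference_size=20_000)

# Background 2: only the proteins that were actually quantified.
p_quantified = ora_p_value(overlap, list_size, category_size, reference_size=8_000)

print(f"whole-genome background: p = {p_genome:.2e}")
print(f"quantified background:   p = {p_quantified:.2e}")
```

With the whole-genome background the same five overlapping genes look far more significant, because the category appears rarer than it really is among the proteins that could be detected.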
This is very, very important. I review a lot of papers, and in a lot of proteomic studies people use the complete proteome, that is, all the genes, as the reference. That is not correct, because then you can easily identify, for example, ribosomal or other highly abundant proteins as enriched, just because those proteins have a better chance of being detected, not because they are differentially expressed. So be careful in selecting your reference gene list.
Next, upload the gene list here. Because we clicked on the sample run, this was filled in automatically with the colon cancer gene list from the other tool. But in the future, let us say you have a list of 100 gene symbols; you can just copy and paste them here, or you can save them in a text file, click choose file, and upload the file. So there are two options: copy and paste, or save as a text file and upload. But make sure when you upload a file that the file extension is .txt and that there are no special characters in the file name.
For the advanced parameter settings, there is the minimum number of genes in a category: we do not want to look at categories that have only one or two genes, because that is not very interesting. We also do not want to look at Gene Ontology terms with more than maybe 2,000 genes, because those are too broad to be meaningful. For the multiple-test adjustment we choose the Benjamini-Hochberg correction. This is one of the most popular methods, and I recommend just using it for multiple-test adjustment; it is less conservative than Bonferroni, for example.
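The difference between the two corrections named above can be shown with a short sketch. The raw p-values are hypothetical, and the functions are plain textbook implementations of Benjamini-Hochberg and Bonferroni written for illustration.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (FDR) in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for four gene sets.
raw = [0.01, 0.04, 0.03, 0.005]
bh = benjamini_hochberg(raw)
bonf = bonferroni(raw)

for p, a, b in zip(raw, bh, bonf):
    print(f"raw {p:.3f}  BH {a:.3f}  Bonferroni {b:.3f}")
```

For every gene set the Benjamini-Hochberg value is no larger than the Bonferroni value, which is what "less conservative" means in practice.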
Then for the significance level there are two options. You can use an FDR cutoff, for example 0.05 or 0.01. But when I do the first round of analysis I usually choose top 10; you can change the number. This way you always get some results back and can see what the most enriched terms look like. Sometimes, if you pick 0.01, for example, none of the terms come back as significant and you end up with an empty result set, which is not interesting. And sometimes, if the difference is really huge, you end up with thousands of enriched terms, and you cannot comprehend all of those either. So using top 10 as the first step gives you a sense of how strong the signal is in your data set.
Then there is the number of categories visualized in the report. We generate results for all the significant gene sets, but in this case we are going to visualize the results in a DAG (directed acyclic graph). If the number of gene sets is too big, for example 1,000 significant sets, the visualization will be too crowded and you will not be able to make sense of it. That is why we need to set a cutoff here. You can also choose to color the terms in the DAG using continuous data, meaning the significance level. You can choose binary instead, but usually continuous is the better option.
You can download the result. I actually downloaded the result earlier, so I can use that as well. But let me try submitting; yesterday I found that sometimes the first submission did not work and I had to submit a second time. I can show you the result I got before, but, yes, I also got the result just now, so let us use this one. Most of you should get your results back like this. Some of you might have used the old version of WebGestalt before, but I think this new version has a much improved user interface.
The results are much simpler, much easier to understand, and easy to browse. At the very top you see the job summary. If you expand it, it reminds you of all the parameters you used and what you did for this analysis. You can expand and read it, and you can also close it.
I hope today's session was useful and that you got a brief idea of WebGestalt. Dr. Bing Zhang discussed the different types of input IDs and how the three methods differ from each other. In conclusion, when we do not have a statistical value for the genes in the list we can do ORA, but if we have statistical values such as p-values we can go for GSEA. Lastly, he also briefly described the job summary: what it looks like and what information we can obtain from it. In the next hands-on session we will learn more about result visualization, protein-protein interaction modules, and pathway-based and network-based methods. Thank you.