Welcome to the MOOC course on Introduction to Proteogenomics. In the last lecture by Dr. Bing Zhang, you were introduced to the concept of network analysis, with emphasis on protein-protein interactions. In today's lecture, he will explain the need for network visualization and the various resources available to make sense of complex data. I am sure you appreciate that not only the analysis but also the presentation and visualization of data are crucial. In this light, today's lecture by Dr. Bing Zhang will give you good insight into, and an opportunity to look at, the various tools available for data visualization. So, let us welcome Professor Bing Zhang for today's lecture.
Next, I am going to talk about how we can use these networks to help with our data analysis. One way is to use them for visualization, and the other is to use them for data analysis. I will talk about network visualization first. Network visualization is usually done through the node-link diagram, as we said: each gene or protein is a node, and nodes are connected by edges, which are the links. That is why it is called a node-link diagram. The beauty of this is that you can not only see the network, you can also overlay data onto it in order to understand your data. For example, suppose I am interested in a certain type of protein in the network, say transcription factors. I can change the shape of those nodes to a triangle, and then I can immediately see which ones are transcription factors. I can also overlay my differential expression results onto the network, and then we can see that these genes are upregulated and these are downregulated.
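To make the overlay idea concrete, here is a minimal sketch in plain Python. It only computes the drawing attributes (shape and color) for each node; the gene names, the transcription factor set, the fold-change values and the threshold are all made-up assumptions for illustration, not data from the lecture.

```python
# Hypothetical toy data: edges, annotations and fold changes are invented.
edges = [("TP53", "MDM2"), ("TP53", "EP300"), ("MYC", "MAX"), ("MYC", "EP300")]
transcription_factors = {"TP53", "MYC", "MAX"}
log2_fold_change = {"TP53": -1.4, "MDM2": 0.2, "MYC": 2.1, "MAX": 0.1, "EP300": 1.2}


def node_style(gene, threshold=1.0):
    """Return the shape and color used to draw one node."""
    # Node type is shown by shape: transcription factors become triangles.
    shape = "triangle" if gene in transcription_factors else "circle"
    # Differential expression is shown by color.
    change = log2_fold_change.get(gene, 0.0)
    if change >= threshold:
        color = "red"    # upregulated
    elif change <= -threshold:
        color = "blue"   # downregulated
    else:
        color = "grey"   # no clear change
    return {"shape": shape, "color": color}


nodes = sorted({gene for edge in edges for gene in edge})
styles = {gene: node_style(gene) for gene in nodes}
for gene in nodes:
    print(gene, styles[gene])
```

A visualization tool would take a table like `styles` and apply it when drawing the node-link diagram; the mapping itself is the whole idea of the overlay.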
There are tools that have been developed to facilitate both network visualization and network-based data visualization. This review article in Nature Methods summarizes many tools that you can use to visualize biological networks. The most popular one is Cytoscape; it is very easy to use and very powerful. This is one network I made for an earlier publication. You can also add some artistic changes to the network; this one was also done in Cytoscape, and it became the cover of that issue of the journal. So, within Cytoscape you can do a lot of things to visualize your network and your data.
But there are also challenges. Although Cytoscape is very good for small networks, when you want to look at the whole protein-protein interaction network, with 10,000 nodes and maybe 100,000 edges, it is very difficult to see anything; it becomes a hairball again. Also, if you want to overlay many different types of data, as in a multi-omics study where we want to overlay proteomics, gene expression, copy number and so on, it becomes challenging. You can change the node shape, color or size, but there is not a lot more you can do.
So, one thing we did was to use the hierarchical modular organization that we learned is a property of biological networks. We can use this bar to represent the whole network, and then smaller bars to represent each of the sub-networks. Each of these sub-networks can be further organized into smaller modules, and eventually each of the nodes is placed under these modules. We can do this because the network is organized in a hierarchical modular way; that is what we learned from the network properties.
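The bar-within-bar picture can be sketched as follows. This is only an illustration under stated assumptions: the module tree is given by hand as nested lists (the lecture does not describe how the modules are detected), and the function name `layout` is hypothetical. Genes are placed along one line so that every module occupies a contiguous stretch, which is what lets a module be drawn as a bar.

```python
# Hypothetical module tree: nested lists are modules, strings are genes.
hierarchy = [
    [["A", "B"], ["C"]],        # sub-network 1, made of two smaller modules
    [["D", "E"], ["F", "G"]],   # sub-network 2, made of two smaller modules
]


def layout(module, depth=0, order=None, bars=None):
    """Place genes on a line and record each module as (depth, start, end)."""
    if order is None:
        order, bars = [], []
    if isinstance(module, str):
        order.append(module)      # a gene takes the next free position
        return order, bars
    slot = len(bars)
    bars.append(None)             # reserve a slot so parents precede children
    first = len(order)
    for child in module:
        layout(child, depth + 1, order, bars)
    bars[slot] = (depth, first, len(order) - 1)
    return order, bars


order, bars = layout(hierarchy)
print(order)
for depth, start, end in bars:
    print("  " * depth + f"module at depth {depth}: positions {start}..{end}")
```

Because every gene now has a single position on the line, any per-gene measurement can be drawn as a track above that position, one track per data type.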
The beauty of that is that, instead of using two dimensions to visualize the network, we use only one, and so we can visualize the data in the second dimension, either on paper or in a browser. I won't go into the details, but we have a method to convert a network into this hierarchical modular structure, and also a tool called NetGestalt that you can use to explore data, such as the TCGA data, under this framework.
Finally, having talked about network visualization and network-based data visualization, I want to talk about how we can use the network to help analyze our own data. This is based on the observation that the nodes in the network are, again, not randomly connected: genes or proteins that are functionally similar are more likely to be connected to each other in the network. As you can see in this plot, if we have a way to quantify the functional similarity between two proteins, then protein pairs that are directly connected, meaning one step from each other, have a much higher average functional similarity than protein pairs with a shortest path length of 8 or above. This relationship is very obvious.
By leveraging this observation, we can come up with a few different ways to either predict gene function or prioritize genes in our studies. First, because we know that direct neighbors are more likely to share the same function, we can explore the direct interaction partners of a protein in the network. The second approach is to use graph algorithms to separate the network into modules; we know there are modules in the network.
We then expect that proteins in the same module are likely to share similar functions. The last method is called the diffusion-based approach. The first is a very local method, where you focus only on direct relationships. The second is relatively more global, since you explore the module. The diffusion-based approach tries to explore the whole network as a system, and I think it is probably the more powerful approach. Let me show you a few examples to help you understand these approaches.
Let us say this is a small protein-protein interaction network. The red proteins are known to have function A, and the blue proteins are associated with function B. There are other proteins in the network whose function we do not know, and we try to use the network to predict their functions: are they more likely to have the red function or the blue function? In a direct neighborhood analysis, for each node we count how many red neighbors and how many blue neighbors it has, and we assign the function through a majority voting algorithm. If there are more red neighbors, it is red; if there are more blue neighbors, it is blue. This is very easy to implement, and you can quickly get the functions of the proteins.
One limitation of this approach is that for these three proteins we basically have no idea about their functions at the beginning. For some proteins, like this one, we are pretty sure it is more likely to be a red protein, but for others it is less clear. So, you can also think about doing this in an iterative way.
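The single-pass neighbor counting described above can be sketched like this. The toy network, the protein names and the decision to leave ties unassigned are assumptions made for illustration; the lecture does not say how ties are handled.

```python
from collections import Counter

# Hypothetical toy network: p1..p5 have known functions, u1 and u2 do not.
edges = [("p1", "u1"), ("p2", "u1"), ("p3", "u1"),
         ("p3", "u2"), ("p4", "u2"), ("p5", "u2"), ("u1", "u2")]
known = {"p1": "red", "p2": "red", "p3": "blue", "p4": "blue", "p5": "blue"}


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set of neighbors map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def majority_vote(edges, known):
    """Assign each unlabeled node the most common label among its neighbors."""
    adjacency = build_adjacency(edges)
    predicted = {}
    for node, neighbors in adjacency.items():
        if node in known:
            continue
        # Only neighbors with a known function get a vote.
        votes = Counter(known[n] for n in neighbors if n in known)
        ranked = votes.most_common(2)
        if not ranked:
            predicted[node] = None    # no labeled neighbors at all
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            predicted[node] = None    # tie: left unassigned (an assumption)
        else:
            predicted[node] = ranked[0][0]
    return predicted


predictions = majority_vote(edges, known)
print(predictions)
```

In this toy case `u1` has two red neighbors and one blue neighbor, and `u2` has three blue neighbors, so the votes are unambiguous. The iterative variant would feed such provisional assignments back in as extra votes and count again.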
Let us say you have an intermediate step in which you make a temporary assignment. For example, this one is pink, meaning it is more likely to be red, and this one is light blue, meaning it is more likely to be blue than red. After this, you redo the counting for the green proteins, and in this case we can see that this protein gets a red function rather than a blue function. So, this iterative process can leverage more information than the original neighbor-counting analysis alone.
We can also use a module-based approach. For example, in this case we have only one protein with unknown function, and we want to assign a function to it. If you do the neighbor counting, you get 1, 2, 3, 4 here and 1, 2, 3 here, and you may think this protein is a blue protein. But if you use a module-based approach, especially with a method that allows modules to have overlapping members, you can get modules like these: 1, 2, 3 modules. This one is a module dominated by red proteins, and this one is a module dominated by blue proteins. Then you can conclude that this protein might have both functions, and that is more likely to be true: a protein often has multiple functions, depending on what it interacts with in a specific condition.
The diffusion-based approach is particularly useful in gene prioritization. Let us say you do a high-throughput study. In a GWAS, you may come up with multiple SNPs, or SNP-associated genes, that are associated with the phenotype; in a differential proteomics analysis, you may come up with 10 or 20 proteins that are very likely to be associated with the phenotype. And then you want to do experiments as the next step.
But which protein do you choose for the knockout experiment? If you have 100 candidates, it is very difficult to know. The network-based approach can help us prioritize. Let us say these are the candidates we have. We can map them to the network, and after mapping them you can use a process called a random walk. Imagine that a person stands on a node. I start from here, from the red node, and at each step I move at random to a neighboring node. At the first step I can go here, or here, or here, or here; after I go here, at the next step I can go here, or here, or here, or here, or here. Using an iterative updating process, you reach a steady state in which the walker has a certain probability of ending up at each node. If I start from here, then at the steady state I may have a higher probability of ending here, or here, or here.
Then, starting from each of these 10 nodes, we can calculate the probability of ending on a particular protein, and by summing these up we get the steady-state probability of ending at that protein. I use the color shade to indicate the probability. As we can see here, if we start from these 10 possible positions, we are most likely to end in this area, but it is also possible to end in these areas. So, among the 10 proteins, you can probably see that these 4 proteins are more important than the others. This might also help you identify new genes that were not in your list; in proteomics especially, we have a lot of missing identifications. For example, this might be a low-abundance protein, and now you see that it is possibly an important protein. So, let us see some real-world applications of this method.
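The random walk idea can be sketched as below. The lecture does not name the exact variant, so this sketch assumes random walk with restart, a common choice in which the walker jumps back to a starting node with a fixed probability at each step; without some such restart, the steady state of a plain random walk on a connected network would not depend on where the walk started. The network, the seed names, the restart probability of 0.3 and the function name are all illustrative assumptions.

```python
def random_walk_with_restart(edges, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for walks started at seeds."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    nodes = sorted(adjacency)
    seeds = [s for s in seeds if s in adjacency]   # ignore unmapped seeds
    if not seeds:
        raise ValueError("none of the seeds is in the network")
    # The walker starts (and restarts) at a seed chosen uniformly at random.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # With probability `restart`, jump back to a seed ...
        updated = {n: restart * start[n] for n in nodes}
        # ... otherwise move to a uniformly chosen neighbor.
        for n in nodes:
            share = (1.0 - restart) * prob[n] / len(adjacency[n])
            for m in adjacency[n]:
                updated[m] += share
        change = sum(abs(updated[n] - prob[n]) for n in nodes)
        prob = updated
        if change < tol:          # iterative updates have reached steady state
            break
    return prob


# Hypothetical toy network: s1 and s2 are the starting proteins.
edges = [("s1", "c1"), ("s2", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c4")]
scores = random_walk_with_restart(edges, seeds=["s1", "s2"])
candidates = ["c1", "c2", "c3", "c4"]
ranking = sorted(candidates, key=lambda n: scores[n], reverse=True)
print(ranking)
```

Starting all seeds together, as done here, plays the role of summing the per-seed results described in the lecture: the update is linear in the starting distribution, so averaging the starts gives the average of the individual steady states.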
In this study, let us say that through a GWAS you identified a lot of genes, or SNPs associated with genes, that are potential candidate genes for a disease. Let us say you also have prior knowledge of which genes are already known to be associated with the disease. Now you have a protein-protein interaction network, and you can map all the known genes and the new genes onto the network. Through the diffusion process, we can estimate, starting from the known disease-related genes, the probability of ending on each of these proteins. That gives a new score for these proteins, and based on it we can rank them. You would expect gene 3 to be more likely a disease-related protein than gene 1. This is very easy to understand.
We also used this in a study on gene signature prioritization, working with another group to develop gene expression signatures for colon cancer. We were not the only group doing this; because colon cancer gene expression signatures are an important question, there were 7 published studies on this topic. But if we look at the gene signatures reported by these 7 studies, you see very little overlap between them. So, we were wondering what is causing this discrepancy: is it just that all of these are incorrect identifications, or are there other possible explanations?
One way to think about it is this. Suppose this is the network that is actually driving the colon cancer process. Each study may identify some of the important components of the network without getting all of the important nodes, and each may also observe some other proteins that are not critical to this network but simply co-vary with those nodes.
For example, gene signature 1 may identify only this part, and gene signature 2 only this part. If you only look at the overlap, they have none; but if you map all the signatures onto the network, we may be able to identify that it is this region that is important.
A similar idea applies beyond mRNA-based gene signature studies, because protein activity can be altered at multiple levels. The activity of this protein, for example, can be altered at the DNA level by mutations or copy number alterations, or by a change in mRNA expression, a change in protein expression, or a post-translational modification (PTM). All of these can potentially alter protein activity, and if protein activity in this important network is altered in a sample, then you are potentially going to see a phenotype. That means that if we have multi-omics studies, we can also map all the observations onto the network, and that can help us prioritize the findings from those studies.
As one example, we again collected all the gene expression signatures from the 7 studies, and we also collected all the mutations in colon cancer, and we mapped those onto a protein-protein interaction network. Through the network diffusion algorithm we talked about, we got a new list of proteins that are important not only because of their differential expression or mutation, but also because of their central location in an important part of the network. We were then able to build prediction models based on those signatures and to show that a signature obtained this way has better reproducibility when applied to a new cohort than the gene expression signatures from the individual studies that we started with.
So, for most of the network analysis, the information for the networks is obtained from experiments, computation and all these tools.
So, most of the protein-protein interactions are transient interactions; they are not constitutive interactions. Yes, that is right. So, how well do these databases represent the interactions under the conditions of a given experiment? That is a very good question. If you go to BioGRID, for example, and download all the protein-protein interactions in the database, that is basically the collection of all the interactions that have ever been reported in any publication. An interaction could have been seen in a disease state, or after EGFR treatment; the database is not condition dependent. What you get is a map, just like a Google map with everything on it, but without context-dependent information.
There are a few different ways to address this. You can try to build your own condition-specific networks through experiments, for example by doing pull-downs in the specific condition. Or you can try to leverage gene co-expression information. For example, if you are interested in colon cancer, you can take the colon cancer co-expression data and overlay or integrate the co-expression information from that condition with the big protein-protein interaction network to get a condition-specific network. That is actually a very active research area, and people are trying to develop algorithms that produce context-specific interaction networks rather than just the global one. But I think the global network still provides a reference map from which you can start to derive condition-specific networks. That is a very important question.
Well, yes, of course you can do the prediction based on the sequence.
For example, one approach I talked about is the domain-based approach. And I think people are starting to use deep learning; actually, some people in my lab are doing this type of analysis. When you have enough training data, you can use that as training examples and see how deep learning or typical machine learning approaches can capture the sequence features that help you predict interactions. So yes, the sequences are also important in prediction. Exactly, and there are some recent deep learning based methods that try to incorporate the linear sequence, the protein structure information and some domain information as features for prediction.
In today's lecture, we learned about the importance of network visualization and interpretation of complex data. We also learned that proteins directly connected to each other usually have a higher functional similarity. This forms the basis for network analysis tools. If the aim is to predict gene function, network-based methods such as the module-based approach and neighborhood majority voting can be used to get important leads. The diffusion-based approach is used to prioritize genes in a network, and network visualization tools like Cytoscape and NetGestalt are widely used in clinical studies. I hope these two lectures and the various tools that Dr. Bing Zhang showed have been helpful, so that you can now take your own data set or a publicly available data set, create the various networks, and visualize them using these tools. Thank you.