Welcome to the MOOC course on Introduction to Proteogenomics. It is a well-known fact by now that most diseases are caused by the dysregulation of pathways or networks, not by the effect of a single gene. A single gene may be responsible in a few rare cases, but otherwise it is a group of genes or proteins, or a pathway, that affects a given disease. So it is important to understand how proteins interact with each other. In today's lecture, Dr. Bing Zhang will introduce you to the concept of network analysis and the various tools available for it. I hope this information will be very helpful for your own projects when you are looking at how proteins interact with other proteins and form a given network. So let us welcome Dr. Bing Zhang for today's lecture.

So in this lecture, we are going to talk about biological network analysis. I think we have had plenty of statistics and formulas and things like that, so I will start this lecture with a poem. It is a very beautiful poem from the poet John Donne, written about 400 years ago. It's titled "No Man Is an Island." I don't want to read it out, but it's a very beautiful piece of work. It basically expresses the idea, in a holistic view of society, that no man can survive or thrive without the support of the community. I think it's still true, although 400 years have passed. It's still true in today's society, and you can think about the people around you. For example, through connections on WhatsApp, I think that's the one you guys use in India, you have a connection through that social network, right?

A few years ago I was reading a review article in Nature Reviews Genetics from Albert-László Barabási, and one sentence in it reminded me of this poem. The idea is that if there is a change in a gene in the network, it will have an impact not only on the gene itself; the impact will pass through the whole network.
So I was thinking, okay, it's not only that no man is an island; actually, no gene is an island. That's what we're going to talk about today. We need to understand biology, and understand the genes, in the context of networks. Indeed, during the past decade a lot of studies have shown that complex phenotypes, including most disease phenotypes, are the result of the dysregulation of networks rather than of individual genes: network dysregulation causes disease, not an individual gene.

In order to understand networks, I will first introduce two terms. One is the node, also called a vertex; nodes are the individual elements in the network. The other is the edge, which connects them. With these two terms we can look at some typical biological networks. They can be divided into two categories.

The first category is physical interaction networks. In these networks the nodes are genes or proteins that physically interact with each other. For example, in the protein-protein interaction network each node is a protein and an edge represents the interaction between two proteins. Signaling networks are a specific type of protein-protein interaction network: it's not only that the proteins interact, but protein A can regulate protein B, right? For example, a kinase, as we just talked about, could regulate its downstream target. So you can imagine a network like this, but the edges are directed; for example, this is a kinase and this is a target. This is called a directed network. In a plain protein-protein interaction network the edges have no direction, but in signaling networks one protein can modify the other protein, so there is a direction associated with the edge. Then there are the gene regulatory networks. In this network the nodes are either transcription factors or microRNAs, and their target genes; it's also a directed network.
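The difference between an undirected and a directed network described above can be sketched in a few lines of Python. This is only a minimal illustration: the protein names and edges are hypothetical, invented for the example, and the adjacency-set representation is one common choice, not something prescribed in the lecture.

```python
# Hypothetical toy edges, invented for illustration only.
ppi_edges = [("P1", "P2"), ("P2", "P3")]            # physical binding, no direction
signaling_edges = [("KinaseA", "TargetB"),           # kinase -> target, directed
                   ("KinaseA", "TargetC")]


def undirected_adjacency(edges):
    """Store each edge in both directions: if u binds v, then v binds u."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def directed_adjacency(edges):
    """Store each edge only from source to target (regulator -> regulated)."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set())  # make sure pure targets are still nodes
    return adj


ppi = undirected_adjacency(ppi_edges)
signaling = directed_adjacency(signaling_edges)

print(ppi["P2"])             # neighbours of P2 in the undirected network
print(signaling["KinaseA"])  # targets of the kinase
print(signaling["TargetB"])  # empty: the target does not regulate the kinase
```

In the undirected case the relationship is symmetric, while in the directed case the kinase points to its targets and not the other way around, which is the property that signaling, gene regulatory, and metabolic networks share.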
So a gene regulatory network represents the physical interaction between the TFs or microRNAs and their targets. And in the metabolic networks the nodes are the metabolites and the edges are the reactions going from the substrate to the product; it's also a directed network.

The other category is functional association networks. In this type of network we don't really know whether two genes in the network physically interact with each other; they may or may not. For example, in the co-expression network, if we have a lot of experiments and we always see two genes going up and down together, we can infer a co-expression relationship between them, and that can actually be quantified, for example with the Pearson correlation. We can also call this a weighted network, meaning the edge can be weighted by the co-expression level, right? And it's undirected. In the genetic network the nodes are genes and the edges indicate genetic interactions, meaning you do some perturbation experiment: if you knock out gene A or gene B and you get the same phenotype, then you can guess maybe there's some relationship between the genes.

So the first question I want to talk about is how we get all those networks, because we need to get the network first in order to do something with it, right? First, the protein-protein interaction network. To get protein-protein interaction networks, basically we want to establish the interaction relationship between two proteins, as you can see here. The experimental approaches that can help us get this type of relationship include the yeast two-hybrid experiment and the pull-down experiment followed by mass spec analysis, and there are also computational approaches that can help us infer protein-protein interaction relationships.
For example, we can start from the known protein-protein interactions and try to infer which domains actually interact with each other. Then we can generalize to new protein pairs: if they have those interacting domains, we can guess maybe they interact with each other. There are also a lot of studies in the model organisms, and through the orthologous relationships we can map those to human and guess the interaction relationships in human. We can also do phylogenetic profiling. You have a lot of proteins in each organism, right? You look at whether each protein exists in organism A, B, and so on. After you do this for hundreds of organisms, you will be able to see that some proteins tend to co-occur. For example, in this table, can you tell me which two proteins are more likely to interact with each other than the other protein pairs? B and C? A and C, yeah, exactly, because they always appear together, right? If two proteins are to interact, they have to coexist in that organism. Similarly, through gene expression or protein expression experiments, if we see two proteins always go up together, then we can infer an interaction relationship. But of course all those computational approaches just help us make inferences, which need to be validated.

Another way: say you don't want to do experiments and you don't know how to do the computational inference. There are plenty of protein-protein interaction databases from which you can download this information. Here I have listed quite a few databases. I don't want to go through them individually, but they're in the handout, and you can get to know those resources by yourself after the class.

For protein-DNA interactions, basically we want to establish the relationship between the transcription factors and the target
genes. The experimental approaches include ChIP-chip, which is the earlier version of this type of study; now people are usually doing ChIP-seq. For the computational approach, we can do promoter sequence analysis through motif analysis, or we can do reverse engineering from mRNA profiling data. There are also databases where we can get this type of information. TRANSFAC, I think, is now commercialized, but JASPAR is an open-source resource that you can use.

The metabolic networks have been very well studied for a long time, and there are very well-established databases for this type of network. The two commonly used ones are KEGG and MetaCyc; these are two widely used metabolic network or pathway databases.

The co-expression network, which is also commonly used, comes mostly from computational analysis. You start with a gene expression or protein expression matrix, where each row is a gene and each column is a sample, and then you can use one of these methods to do co-expression network inference. For example, in WGCNA, for each pair of genes you calculate the co-expression relationship, for example through the Pearson correlation, so you get a score and you get a weighted network. What they then did was raise the correlation to a certain power to further discriminate the highly correlated pairs from the lowly correlated ones. The NetSAM package we developed basically tries to convert this weighted network into an unweighted network, because there are certain graph algorithms that we can apply to an unweighted network but cannot easily apply to a weighted network. In this case we can think about a k-nearest-neighbor approach. Basically, for each node in the network, and let's say we are talking about a two-nearest-neighbor network, we ask: in this network, what are my two best
friends? For each gene we vote for its two best friends. But when you think someone is your best friend, the other guy may not think the same, right? So we also remove those relationships and keep only the ones that are mutual: I think you are one of my two best friends, and you think the same about me. So this takes you from the weighted network to a very robust set of relationships, but it's an unweighted network now.

ARACNE is another very popular tool for building a co-expression network. Rather than using the Pearson correlation or Spearman correlation, it derives the relationship between two genes based on mutual information. That is also a good idea, because mutual information can capture different types of relationships, not only linear or monotonic relationships; there are more types of relationships that can be captured by mutual information.

So let's say you go through all this and you are able to build a network, right? What happens next? In the early 2000s a lot of experiments were done, and people started to build protein-protein interaction networks, first for yeast and then also for the human protein-protein interaction network, where three experiments have been published. At the very beginning people got very excited: oh, this looks great, and we get a lot of information. But then people started to realize these are just tables, right? It's beautiful to look at in a way, but what can we learn from this?

So the next question is: if you have the network, what can you learn from it? In order to do that, I will introduce a few more terms to better understand these networks. The first is the degree. The degree means the number of links, or edges, each node has. For example, we can look at this purple node in this network: it has three links, so a degree of three. This is for
undirected networks. But in a directed network each node has an in-degree and also an out-degree. For example, this gene, M8R, has an out-degree of three, one, two, three, but its in-degree is one. So degree is a very simple but important measurement for the nodes or genes in the network, because a node with a higher degree sits in a more central position of the network, right? It's a very simple indication of the centrality of each node in the network.

The second thing we want to talk about is the path. That's how we start to explore the relationship between two nodes in the network. For any pair of nodes you pick, you can find paths that link these two nodes. For example, here we can find this path, which includes two edges, and we can also go one, two, three here, or one, two, three, four, five here, right? So there are a lot of different paths that you can find in your network. The one with the smallest total length, for example from this node to this node the path with two links, is called the shortest path between them.

With the understanding of degree and path, we can start to explore some of the properties of these networks. After we get a network, the first thing we want to do is understand how the network is organized: what are the characteristics we can learn from those networks? For the first property, I will show you this example. These are two networks, each with 130 nodes and 215 edges. So the numbers of nodes and edges are the same, but if you look at these two networks they are actually quite different, right? This network is more homogeneous, meaning the nodes are very similar to each other; basically they have roughly the same number of connections. The five red nodes, the ones with the highest number of
links, only reach 27 percent of the other nodes. But if you look here, the nodes are very heterogeneous, meaning some of them have a lot of links but most of them only have one or two. In this network the five red nodes with the most links can reach 60 percent of the other nodes in the network. So the nodes are very similar in the first network and not similar in the second. In reality most real-life networks, like social networks or even biological networks, have the second organization rather than the first. Barabási, who is a very important person in network analysis, named this kind of network the scale-free network. The other one is basically a random network: if you randomly connect nodes, this is what you get. But real-life networks are not like that; they are more like the scale-free network.

This can probably be best understood in social networks, where we can immediately understand what the hubs in the network are, right? Those are the celebrities, like the stars. If you look at the social network, they of course have a lot of connections. As for us guys, maybe we don't know too many people, so we only have a few connections. So the hubs are the celebrities in the social network. But biological networks also have this scale-free organization, so of course people are interested to know: what are the hubs in the biological networks? Do they play a different role in the biological processes than the other nodes in the network?

In the early 2000s, around 2001, there was an interesting study. At that time there was a genetic study in yeast that basically tried to delete each individual protein in yeast and see the impact of the deletion of those proteins. They found that some of the proteins, the red ones, had a lethal impact, meaning if you delete that protein the cell will die, and some of
them did not cause much impact, or only caused slow growth. Around that time the protein interaction network of yeast had also been published, so this group tried to combine these two types of data. They grouped the nodes in the network based on the number of links, or degree, they have in the network: here are the nodes with only one link, and here are the nodes with 20 links. Then they looked at the percentage of nodes in each category that have a lethal impact when the protein is deleted. You can see a very nice positive correlation between the number of links of a node and the percentage of lethal proteins in that category. That means that when a hub protein, meaning a node with more connections in the network, is deleted, it has a stronger impact on the cell.

After that publication, people kept exploring what other properties the hubs have in biological networks. In this review article by Marc Vidal in 2011, he summarized the major findings. The first study shows that the hubs correspond to essential genes. They also tend to be older proteins and usually evolve more slowly than other proteins in the network. They have a tendency to be more abundant, and they have a larger diversity of phenotypic outcomes when deleted, which means they may be involved in more different types of functions.

Then you may wonder why cells, or evolution, arrived at this scale-free network rather than a random network, right? Does this give the cell any benefit? Indeed, think about a scale-free network like this and think about mutations. We know that mutations are random attacks on genes, right? If the mutations occur randomly across the genome, a mutation could hit
any of the proteins, but most of the proteins only have one or two links, so a mutation in these proteins will not affect the cell as a system, right? This gives the cell the robustness to survive: mutations typically don't have an important consequence, because they do not make a significant impact on the network as a system. But we can also think about it the other way. If we want to kill bacteria, what should we do? We can think about a targeted attack on the central nodes, meaning the hubs, in the bacterial network, because if we attack these important nodes it will destroy the bacteria. That's one way we can think about how to prioritize genes when we want to treat a disease, for example when we try to kill fungi or bacteria in the human body. So that's the first property of the network: the existence of hubs, or the scale-free property.

The next thing I want to talk about is the small-world network. This was originally studied in the social network context by a scientist called Stanley Milgram. I think he was a social scientist at Harvard, and he did this experiment in 1967. The idea is that he wanted to understand how people are connected to each other. Of course now it's easy, because we have the internet and we know how to do this, but at that time it was actually very difficult. How can you even think of a way to estimate or understand how people are connected to each other? I think he came up with a very interesting experimental design. He prepared 160 packages and gave them to random people in a small city, Omaha, Nebraska, in the US, and asked them to try to send the package to a stockbroker in Boston, which is far away from Omaha, right? But you cannot send this to him directly, because you don't know
him; every time, you have to pass it to somebody you know. This was his instruction. At the end he collected all the letters from this stockbroker and counted how many passes were required for each letter to reach the stockbroker from the original place. Surprisingly, the average number of passes to reach him was six. That is the famous six degrees of separation; probably some of you have heard this phrase, and it came from his study. Through this study he understood that although it looks like everyone is so far away from each other, especially when you think that it was 1967 and maybe nobody knew each other, for any person, even one you don't have a good connection to, you can still reach him in six steps. So this is what he called the small-world network. If we look at biological networks, the average path length between any two nodes is between three and four; actually it's even smaller than in the social network he estimated at that time. I think one reason biological networks use a small-world structure is probably that it can pass information more efficiently.

The next property I want to talk about is motifs. When we talk about motifs, we mean patterns that keep occurring in a system more often than by random chance, right? If we think about three nodes, what are the possible relationships between them? There are actually 13 different types of relationship among three nodes if they are all connected somehow. For example, this pattern is very well studied: it's the feed-forward loop. Node A can have a positive relationship with node B, and it also acts on B through node C, so it's called the feed-forward loop. And this one is called a feedback loop, so basically it goes around. Now let's say this is a real network; then you can count how many times
you see a feed-forward loop, the three-node motif, in this small network. Then you want to know whether this is a motif, because the definition is that it is statistically more frequent than random chance, right? What you can do is randomly shuffle the nodes in this network to build some random networks, and after the shuffling you count the number of feed-forward loops again. For example, here we see a lot more feed-forward loops in this network than in the random networks.

There is a study in the E. coli transcriptional regulatory network. In engineering we know that the feed-forward loop and the feedback loop are both very commonly used to regulate a system, right? But in the transcriptional network they observed 42 feed-forward loops in the E. coli network, and they didn't observe any feedback loop. The number of feed-forward loops is significantly more than what you would expect by chance. So that means the feed-forward loop is a motif, a way of organizing the nodes that the biological system actually uses, but the feedback loop is not.

The next one is modularity. This basically says that the genes or proteins in the network tend to form groups rather than being randomly connected to each other. In the transcriptional networks these are of course the transcriptional modules: target genes that are targeted by the same set of transcription factors form a group, right? In the protein-protein interaction network, like this one, we can see groups of proteins, and these are usually the protein complexes. In the signaling networks, the signaling pathways represent the modules. And the modules don't just occur separately; they are also organized in a hierarchical way, meaning there are small modules that get
connected to each other to form relatively larger modules, and eventually reach the whole network. We can think about protein complexes: two complexes may interact with each other to form a relatively larger network, and eventually, as we know from the six degrees of separation, everything is actually connected. So those are the five different properties associated with networks that have been reviewed through a lot of studies.

In today's lecture you were introduced to various types of protein-protein, protein-DNA, and other biomolecular networks and their analysis. These interactions can be either experimentally determined or predicted using computational software. Databases like DIP, MINT, and BioGRID contain information on various protein-protein interactions. Metabolic networks can be studied using databases like KEGG or MetaCyc, while co-expression networks can be computationally determined, for example using the Pearson correlation coefficient. You also saw the direct relationship between connectivity and lethality, and it is important to note that biological networks generally follow the characteristics of small-world networks. I hope this information gives you more clues and ideas on how you can utilize these available resources and do more protein-protein and protein-biomolecule interaction and network analysis for your own data sets. In the next lecture we will learn more about the visualization of networks; Dr. Bing Zhang will continue his lecture and show you how to use various tools to do data visualization for network analysis. Thank you.
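As a recap of three ideas from the lecture (a co-expression network from Pearson correlation, node degree, and shortest path), here is a minimal Python sketch. The gene names and expression values are hypothetical toy data, and the hard cut-off of 0.7 is an arbitrary assumption chosen for the example; it is a simplification and not the soft-thresholding that WGCNA applies.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_network(expr, threshold=0.7):
    """Connect two genes when their correlation reaches the threshold.

    The threshold is an assumption for this sketch; the result is an
    unweighted, undirected network stored as adjacency sets.
    """
    genes = sorted(expr)
    adj = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(expr[g], expr[h]) >= threshold:
                adj[g].add(h)
                adj[h].add(g)
    return adj


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical expression matrix: one profile per gene, one value per sample.
expr = {
    "geneA": [1, 2, 3, 4],
    "geneB": [4, 3, 7, 6],
    "geneC": [3, 1, 4, 2],
    "geneD": [3, 1, 1, 3],
}

net = coexpression_network(expr)
degree = {gene: len(neighbours) for gene, neighbours in net.items()}

print(degree)                                       # geneB has the highest degree
print(shortest_path_length(net, "geneA", "geneC"))  # path goes through geneB
print(shortest_path_length(net, "geneA", "geneD"))  # geneD is not connected
```

In this toy example geneB ends up as the most connected node, a very small version of a hub, and geneA reaches geneC only through it. On real data the choice of threshold, or of a mutual-information or nearest-neighbor method as mentioned in the lecture, would change which edges are kept.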