My name is Sarah, and I did my PhD here at U of T working on gene networks and gene function prediction, so this is the stuff that I'm really familiar with. Feel free to ask any question, whether you think it's too simple or too complicated. I'm happy to discuss. I actually don't have too many slides, so I'm hoping that you guys will ask a lot of questions.

So let's look at the learning objectives. The main idea is that we want to get across the concepts of functional interaction networks, guilt by association, and the concept of a gene recommender system. Then you should understand how you can use context-specific functional interaction networks, and also understand how direct interactions are different from label propagation and most of the gene function prediction algorithms. And if you understand those conceptually, then you'll be able to use a gene recommender system like GeneMANIA to answer questions like "What do my genes do?" and "Which data is most relevant to my gene set?", but we'll go through that in a more elaborate way. So here's the outline, which I'll skip.

So just to motivate some of the stuff that I'm going to tell you about: as you know, as biologists today you have access to tons of different data sets and different assays, and this is really exciting because you can measure, at a very detailed genome-wide level, various aspects of cells and their behavior. So for instance, if you are interested in a list of genes that came up in an assay that you did, or you somehow know that they're correlated with some sort of phenotype, you can learn so much about what these genes do and how they actually result in the phenotype of your interest in a very detailed and mechanistic way. And what I mean by that is you can come up with a list of genes and look at public databases.
In most cases, you don't even have to generate new data as long as you have some genes whose function you want to understand. You can look at public databases. For example, you can look at the various gene expression data sets that are widely available to see what the temporal or tissue-specific expression of your genes of interest is. You can look at interaction databases, like genetic interaction data or protein interaction data, to learn about their interacting partners. You can look at other proteins with similar domain composition to figure out what the molecular functions of your genes of interest are. And you can look at pathway databases like MSigDB or KEGG or Reactome to understand what pathways your genes are in. All of that data is supposed to give you some clues to be able to say how your gene of interest results in some sort of phenotype or cellular behavior that you might be interested in.

So of course, you can do this manually, but this would be very pre-machine-learning era. And I'm sure many people still do this manually. You do some assay, you come up with a list of genes, and then you go to all sorts of databases and start looking up what your genes do and what they interact with. And actually, just for my reference, how many people do the kind of thing that I'm talking about, like come up with a list of genes and look at those genes in different databases? So I still do that routinely. And of course, you do that initially when you start your research project, a little bit manually, but eventually you want to do it in a more comprehensive and systematic way. And that's what we're going to talk about today.

So I would say if you were doing this manually, the upside would be that you have a lot of human intelligence that you can use to say this data is not relevant, that one is relevant, this is actually not what I want and I'm going to ignore this data.
But it's a process that requires a lot of your prior knowledge and biases, and so your results are going to be biased toward what you already had in mind about what your genes do. And also it's not going to be reproducible or applicable to another similar analysis that you might do in the future. So we're going to talk about how you can do this process in a much more systematic way. And to do this, there are approaches out there under the guise of gene recommender systems. So they're called gene recommender systems, and you will get to know why this is actually an appropriate name for them. They try to do what I just said in a very systematic way.

So to tell you about those, first I'll tell you about the concept of functional interaction networks. And that's because a really good way to understand what these gene recommender systems are doing is by really understanding what functional interaction networks are. So a functional interaction network is a network where the nodes are the genes and the connection between them is some sort of similarity, and I'll elaborate on that in a second. But this similarity typically signals something to us about co-functionality between the genes. So if two genes are connected together with stronger links, or stronger edges, that means they're somehow more similar, and that implies that they're more likely to be co-functional. So another way to think about these networks is in terms of co-functionality, and some people actually call them co-functional networks. So it's a very abstract representation of data in terms of networks of genes, and the connections and the disconnections are supposed to tell you something about the strength of co-functionality between the connected genes.

So the motivation for this actually comes from way back, 20 years ago. So how many of you have seen this paper or know about this paper? Eisen et al., PNAS, 1998.
So this is a very famous classic paper because it was the first of many things. It was one of the first papers to use gene expression data, or generate gene expression data. It was one of the first papers to actually propose clustering gene expression data, and also one of the first papers to show that by clustering gene expression data you can gain insight about the function of unknown genes. So let me elaborate on that a little bit.

What they did in this paper is they generated gene expression data in yeast in multiple different conditions. So this is what their data looked like. You have all your expressed genes on the rows. You have the several different conditions that they were investigating on the columns. And then they clustered this gene expression data set. And one prominent thing that they saw was these two patterns that distinguish two sets of genes. So you have a set of genes on top that have this pattern of expression across all those conditions. And then there's a set of genes at the bottom that have a different pattern of expression. When they looked into this, they realized that the first set of genes are mostly genes that are involved in cell cycle, although there were some genes in that set that were unknown. And the genes in the bottom cluster were mostly genes that were involved in protein degradation. And then again there were some unknown genes in this mix.

So another way to think of that whole thing is that if we use this data to draw some sort of network that connects genes based on their similarity in expression pattern across all these conditions, then the genes that have stronger connections are going to be the ones that are more likely to have the same function. So for example, some of the genes that were part of the protein degradation complex are here. And then I mentioned there are some unknown genes, so there are some here, and then there's another set of genes on top involved in cell cycle processes.
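The idea that strongly co-expressed genes tend to share a function can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis from the paper: the gene names, the expression values, and the `predict` helper are all hypothetical, and negatively correlated genes are simply ignored.

```python
import numpy as np

# Hypothetical toy expression profiles (one list per gene, one value per condition).
# All numbers are invented for illustration only.
expression = {
    "cc1": [1.0, 2.0, 3.0, 2.0, 1.0],
    "cc2": [1.1, 2.2, 2.9, 2.1, 0.9],
    "pd1": [3.0, 1.0, 0.5, 1.0, 3.0],
    "pd2": [2.8, 1.2, 0.4, 1.1, 3.1],
    "unknownA": [0.9, 2.1, 3.2, 1.9, 1.0],
    "unknownB": [3.1, 0.9, 0.6, 1.2, 2.9],
}

# Hypothetical annotations for the genes whose function is "known".
known = {
    "cc1": "cell cycle",
    "cc2": "cell cycle",
    "pd1": "protein degradation",
    "pd2": "protein degradation",
}

genes = list(expression)
X = np.array([expression[g] for g in genes])  # rows: genes, columns: conditions
corr = np.corrcoef(X)                         # gene-by-gene Pearson correlation


def predict(gene):
    """Return the label whose annotated genes are most strongly co-expressed with `gene`."""
    i = genes.index(gene)
    scores = {}
    for j, other in enumerate(genes):
        # Assumption: only positively correlated, annotated neighbours vote.
        if other in known and corr[i, j] > 0:
            scores[known[other]] = scores.get(known[other], 0.0) + corr[i, j]
    return max(scores, key=scores.get) if scores else None


print(predict("unknownA"))
print(predict("unknownB"))
```

In this toy data, `unknownA` follows the same rise-and-fall profile as the two cell cycle genes and `unknownB` follows the protein degradation genes, so each unknown gene picks up the label of the cluster it is most strongly connected to.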
So if you represent this data in this way, then you can start making inferences. This is going to be very simple, and of course the real thing is more sophisticated than this, but just to give you intuition: if you have some unknown genes here, then you can say probably this one is involved in protein degradation, and the one on top might be involved in some sort of cell cycle process. Does this make sense so far?

So now I want to talk about a slight complication that I kind of overlooked, but it's going to become important later. We have this gene expression matrix, and I mentioned that we construct this network based on correlation in expression pattern. Now, can some of you anticipate what the problem is if you want to construct these edges and you're measuring correlation across the conditions for pairs of genes to come up with these edges?

[Audience] What if some genes have negative correlation across the conditions?

So that's a good point. It could be that some genes have positive correlation and some negative, and it's not really obvious what to say about negatively correlated genes. This concept seems to apply more naturally to genes that have positive correlation in expression.

[Audience] The size of the correlation?

The size of the correlation. So one thing that's important to keep in mind when you are using this kind of data to build a network based on correlation, say correlation in expression across certain conditions: correlation is never zero, right?
Correlation is kind of like this: if you look at the histogram of correlation values that you could possibly get, the histogram is centered at zero with some standard deviation. So when you see a nice network like this, where a gene is only connected to three other genes, that's not really realistic, because if you actually computed the network from this data by the method I mentioned, this gene would be connected to all the other genes, but the weights of these edges would differ. So some of the genes are going to have much stronger connections and some of them are going to have weaker connections, but you're not going to see a nice network like this. And I'll just write that on the board, because maybe we'll come back to that later.

So just to summarize what I said: you have this data, your gene expression data X, which has G rows, if G is the number of genes, and let's say C columns, if C is the number of conditions. So some notation that we might come back to: Xij. Are you guys familiar with matrix notation? Can someone tell me what Xij is?
[Audience] The number of elements is i...

Yeah, exactly, so that's the expression of gene i in condition j. So when I'm talking about taking the correlation between two genes, what you're doing is you're taking two rows of this matrix, row i and row j, and you're computing the correlation between those, and you're doing that for all pairs of genes. And that's how you get a network that actually looks very dense in the beginning. But then you can use tricks, well, not really complicated tricks: what you do is you set a threshold to say below what edge strength you're not going to consider an edge an edge. So there are ways to do that, and we can go over that if you guys are interested, but that's one thing to keep in mind.

So there are different types of functional interaction networks depending on what data type we're talking about, and I should mention that this is a very generic concept. I mentioned how you construct a functional interaction network from gene expression data, but this could be applied to any other data that you measure. For example, it could be applied to protein domain data, where you list the protein domains that a particular protein has. Then you can compute the correlation between two proteins, and that would tell you how similar they are in their domain composition. And you can apply a similar concept based on phylogeny information and other things.

So now you can think of three different types of functional interaction networks that could potentially be constructed. One is functional interaction networks that are based on data that already measures interactions between pairs of genes or pairs of proteins. And I'm going to use genes and proteins interchangeably, I hope you don't get confused by that. So can some of you give me an example of data that already measures networks, as opposed to measuring profiles like gene expression? A data type that would measure edges, like in the representation that I mentioned?
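As a rough sketch of the correlation-then-threshold idea described above, the following assumes NumPy, uses random numbers standing in for real expression data, and picks an arbitrary "keep the top 10% of edge weights" cutoff, which is only one of several possible choices. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
G, C = 50, 12                     # hypothetical sizes: 50 genes, 12 conditions
X = rng.normal(size=(G, C))       # X[i, j]: expression of gene i in condition j

W = np.corrcoef(X)                # W[i, k]: correlation between row i and row k
np.fill_diagonal(W, 0.0)          # drop self-correlations

iu = np.triu_indices(G, k=1)      # index each gene pair exactly once
pair_weights = W[iu]

# Before thresholding, essentially every pair of genes has a non-zero weight,
# and the weights are spread around zero.
dense_edges = int(np.count_nonzero(pair_weights))
mean_weight = float(pair_weights.mean())

# Assumption: keep only edges whose weight is in the top 10% of all pairs.
cutoff = np.percentile(pair_weights, 90)
W_sparse = np.where(W >= cutoff, W, 0.0)
kept_edges = int(np.count_nonzero(W_sparse[iu]))

print(dense_edges, kept_edges, round(mean_weight, 3))
```

Even with pure noise as input, the unthresholded network is fully connected, which is why some cutoff is needed before the network looks like the sparse pictures on the slides. A side effect of this particular cutoff is that negative correlations are dropped as well.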
So you're kind of constructing that network based on your profiling data, but is there a data type that already measures edges, as opposed to measuring something about the nodes? Any idea?

So, a protein interaction data set. For example, when you use two-hybrid or other techniques where you're pulling down a complex of proteins, then by design you are capturing multiple proteins together, so that directly tells you if there is an edge between them or not in that kind of representation. Whereas with profiling data you're inferring those edges based on similarity in some sort of profile that you have. So there can be direct interaction data that tells you directly about functional interactions between genes. Examples of those are protein-protein interaction data sets, and there's tons of data like that publicly available. There's also genetic interaction data, where you are considering double knockouts across the whole genome, and you say there's a genetic interaction between two genes if the effect of knocking out both of them is more than the effects of the individual knockouts put together.

So those are direct interaction data sets. You can also have inferred functional interaction networks, and an expression data set is an example of data that you would use to infer functional interactions. Or you can have networks that combine multiple types of data. And it's probably obvious that networks that combine multiple types of data are going to be more powerful, because you're getting rid of some of the noise that you might observe in one data type and not the other because of a low signal-to-noise ratio, and hopefully enhancing interactions that are much more likely to be reproducible and re-observable across different systems.

So in terms of inferred interaction networks from multiple data sets, there are several resources that have put these together already. So examples are SL networks,
STRING, which is really widely used. So how many have heard of STRING before? Only a few people. So this is something that you can check out. It's a very popular tool that puts together multiple types of data sets to construct these functional interaction networks. There are also things called HumanNet, WormNet, etc. There's bioPIXIE.

The first category that I mentioned are examples of functional interaction networks that are what we call context-independent. So there's just one giant network that somebody has constructed by thoughtfully putting data from multiple sources together, but it's not optimized in any way for the types of questions that you as an investigator might have, and I'll tell you in a little bit why that might be important. Then there are networks that we call context-dependent. These are networks that you put together from multiple data sets, but the way you put them together is very specific to the query or the gene set that you want to analyze, and you'll see how that is important and what that really means later too.

[Audience] When you say genetic interactions, I mean, for proteins I understand that they interact with each other, but genetic interactions?

So genetic interactions. Have you heard of SGA? There's a large set of experiments that people do where you genetically knock out two genes at a time. For example, people do this in yeast a lot: you look in growth media to see if the knockout of the two genes results in more death compared to knocking out each individual gene. So in that way you infer that it's not actually a physical interaction, but somehow those genes are working together in a way that makes them essential together.

Any more questions? Please feel free to ask questions, otherwise I'll try to go slowly, since there are a lot of concepts in here.

So this whole module is about predicting gene function, and there are two ways you can approach that problem computationally. The first way is you ask
a question like "What does my gene do?", and here what you can do is use what's called the guilt-by-association principle to predict what your gene does based on the connections that gene has with other genes and the function of those other genes. You can also ask a question like "I give you some genes, give me more genes like these", and this is where the concept of a recommender system comes in. And if you solve this second question, you can already answer the first question. So the second way of answering the question "What does my gene list do?" is more powerful than the first approach, and you'll see that soon.

So the first question that you actually might have is: why do we have to go through all of this to try to use all these data to predict gene function? Didn't someone already do this for me? For example, didn't someone already create a Gene Ontology database and already record what all the genes do? And I guess the answer is: not completely yet. So there are databases like Gene Ontology and PANTHER, and I think there are a few others, and there are a few species-specific ontologies that try to describe the function of all the genes in the organism's genome. But what you'll realize if you start dealing with these databases is that they're very incomplete. For example, for human, Gene Ontology, according to statistics from a few years ago, only had what I would call informative annotations for 40% of the genes. What that means is that often, if you look up your genes of interest in Gene Ontology, you do see something for them, but that annotation is very uninformative. It will be something like "your gene is involved in immune function" or "it's involved in a biological process", something that doesn't really give you any detail about what exactly your gene of interest does.

So the whole idea of these gene recommender systems is to be able to use data that's becoming more and more abundant to answer questions about what your gene does based on
connection of your genes to other genes, as can be inferred from all these large-scale data sets.

[Audience question about how well other species are covered, partly inaudible.]

Yeah, so definitely some species are better than others. Some of these resources, for example GeneMANIA, which we're going to talk about, are trying to infer or transfer these kinds of annotations to other species or organisms that are not as well studied, but I would say that's very much ongoing still. So there is not a huge amount of resources, although we're hoping that will get better, because once you have these kinds of engines you can apply the same type of principle. You don't need to have someone manually curating your databases. As long as you have some annotations, you can infer the rest of the annotations based on connectivity and the network relationship between your genes of interest and other genes that people know something about. So that's the principle, and that principle can be applied to any species. As you said, when people collect all of these data and put it online so other people can use it, some species are more represented than others.

So let's talk about the first type of question that we could answer. The first type of question is "What does my gene do?" In a very simplified way, a gene recommender system would answer this question as follows. The gene recommender system will collect all sorts of public data, for example public gene expression data, protein-protein interaction data, and protein domain composition, and convert all of these data into a static network that links genes together based on the similarity of those genes across all these different data sets, and we'll talk about that in a second. And then once you have such a giant network that's built from multiple data sets, you can look up your gene of interest. Let's say your gene of interest is CDC42. You can look at what other genes are linked to your gene of
interest in this giant network that's been compiled. So you have your gene here, and you know what other genes it connects to. Then you can do functional enrichment analysis to see what functions are enriched in this subnetwork that you extracted based on your gene of interest. In this case, the direct neighbors of this gene are some of the genes that are here, and this subnetwork is enriched in polarized growth, cellular budding, and small GTPase regulatory activity. So based on knowing what the network neighbors of this gene are, we can infer what its core functionality might be from what we know about the other genes that it connects to. So is that obvious to everybody, or do you guys have questions about that? Because I see some faces that look surprised. Is there something that I can clarify?

[Audience] So it's the same as gene prioritization?

It is the same. It's the same concept.

[Audience] The systems that you use, are they the same as when Amazon recommends you books?

Exactly, yeah. So we'll get to exactly that, but it's the same idea, the same concept.

So let me talk a little bit about how this is done. I mentioned all you need to do is go to one of these recommender systems with your gene of interest, and they've already done everything else. They've collected those data, they've constructed the network, and now when you go in with your gene, all they have to do is figure out what other genes are linked to your gene and then do an enrichment analysis on the set of linked genes. So let me tell you how they actually combine these data in a little bit more detail. And you guys probably can't see the board if I write over there, is that right? You can't see it? Okay, so I'll just describe what's happening.

So in this example we have three data sets: a gene expression data set, a protein-protein interaction data set, and a protein domain data set. For each of these data sets you build a network between genes and you
have an edge between each pair of genes. You threshold the network somehow, so for example you say, "I'm going to only consider edge weights that are in the top 10th percentile based on how large they are." You do that for each of those data sets separately, so you have three networks, and then you have to add them up. The process of adding them up is what distinguishes the different approaches that you have. The simplest thing that you can do is just add the edges that you saw for each pair of genes. So if gene A and gene B had an edge weight of 0.5 in the first data set, 0.7 in the second, and 0.8 in the third, then you just add all those numbers to get an edge weight that represents all those different functional interaction networks that you've constructed. Once you have that, you look at CDC42 and what genes it's linked to, and then again, as long as the edges to the genes that it's linked to are above some threshold, you're going to draw that as a network for that query gene. So we call this a query gene. And then you can of course do enrichment analysis on that. Is that clear? No questions?

So one important question that was brought up is what you do with negative edges. Most people just ignore negative edges. If you actually test these kinds of systems to see how well they can predict the function of genes that you know, in a cross-validation setting where you pretend you don't know the function of some of the genes and try to predict it, you'll see that if you include negative edges you're not gaining that much. So that's the rationale people use to just ignore any negative edges. Initially what people were doing was to take the absolute value of negative edges, because maybe the genes are still related to each other. You just treat negative edges as some sort of relationship, so you take the absolute value. But it turns out that's not really helpful in terms of predicting gene function. So I
would say most approaches now don't consider them.

So once you have this network, you can do functional enrichment analysis. Have you guys seen functional enrichment analysis? I think it was one of your modules, so you've probably seen it. What actually happens is that, for example, in this recommender system they've already downloaded all Gene Ontology annotations, and they go through each one and do an enrichment test, and then they report the functions, or GO annotations, that pass a p-value threshold. That's where you get multiple functions that could be significant.

[Audience] And is the protein interaction network standard, like if you're talking about human data?

Yes, that's a very good question. None of these things are static. So protein interaction depends on which tissue, which system, which technology, it depends on many things. Of course there are going to be some edges that are always stable, like the complexes, for example cell cycle complexes. They're always going to be connected together in most data that you would measure. But there's a lot of variation from data set to data set, and what informatics or computational people try to do for this kind of system is to include as much data as possible. In that way you're hoping that by adding the data together you're averaging out the noise and you're enhancing interactions seen across data types. But yeah, that's a good question. None of these data are stable in the sense of being the truth, but together hopefully they're telling you something that could be helpful in terms of understanding what other genes your gene of interest interacts with.

So what I mentioned just now, with this gene recommender system that you saw in the previous slide, is a context-independent gene recommender system, and that's why I went through the example of adding these networks together. The simplest way to add them together is what we call uniform weighting, which is where each data set contributes equally to the
final edge, which is just the sum of all the edges that you observe for a pair of genes. You can do that in a slightly different way, but I would say uniformly weighting the different data sets is probably one of the most widely used approaches for this.

The other approach you can take is to construct a context-dependent network, and there are only a handful of methods that take this approach. That includes GeneMANIA, and I'll tell you about GeneMANIA and how this is done. So here's the intuition behind it. Let's say that you're interested in the p53 protein. If you ask me "What does p53 do?", you could mean different things by that. You could be asking about the biological process that it's involved in. You could be asking about the biochemical or molecular function that it's involved in. You could be asking about where it localizes to, so its subcellular localization. You could be asking about its regulatory targets, or you might even be interested in its temporal expression pattern. So depending on what you have in mind, that question might mean many different things if you just ask me "What does my gene do?" Ideally, the algorithm could figure out which of these you're interested in, so you don't have to really specify "that's exactly what I want". Hopefully there could be an algorithm that can guess which of these is of most interest to you, and this is the goal of a gene recommender system with context-dependent networks. So we're going to talk about that.

And here, going back to the question someone asked about general recommender systems like Amazon: let's say that you are interested in learning about different cities, and you just told me that you're interested in Memphis. Well, if you had given me Memphis and Knoxville, what I would recommend to you would be very different than if you told me Memphis, Alexandria, and Cairo. So the context that you put your query in really determines what you
really mean by that query. In the first case you're just interested in cities in the US, whereas in the second case you're interested in ancient cities. And if you just told me Memphis, there's no way I could have known which of those intentions you had in mind. But if you give me a little bit more, you don't even have to tell me what you want, just give me a few more examples of what you have in mind, then I would try to figure out what to do. And computers are really good at trying to figure out the pattern, so if you give them these two lists, they can figure out what you had in mind and recommend other cities.

So actually, about ten years ago there was something called Google Sets. I don't know if any of you have ever heard of it. It doesn't exist anymore, but it was exactly a recommender system like this, and it was a motivation behind some of the GeneMANIA work. How it worked was that it had five to ten boxes, and you would put anything that you want in those boxes. So you could put five cities or five types of cars or five mountain names or whatever, and it would try to guess what other things you could be interested in. So if you put cities on top, it would guess other cities for you, and that's exactly the idea of the gene recommender system that we'll talk about.

So the idea is that instead of just trying to say "What does my specific gene do?", we can say "Give me more genes like these", and in the process of doing that you can actually understand what the function of your gene of interest is. So this is what GeneMANIA does. This part is the same: you have your database of functional interaction networks that are already pre-computed. Then instead of providing one gene, you provide a list of genes, and it gives you a network back and a functional enrichment of the genes in that network. We'll talk about each component of this, but the basic idea is that now, by giving GeneMANIA a list of genes, it can figure out which of
these network databases best fits what you had intended, what you wanted to know about this gene list.

So just to recap, GeneMANIA is one of these gene recommender systems that uses context-dependent networks. It has three components. The first component is deriving a common representation for all sorts of data that already exist out there, and by common representation we're talking about these functional interaction networks that I mentioned. So in the GeneMANIA database you have the D data sets that have been downloaded, and you represent each of those as a network, so you have D networks that are already computed. Then in the second step you construct a composite or combined network that's a weighted combination of those original D networks. And then in the third step you use this network to come up with predicted functions for the list of genes that you provided.

So let's go through the steps of this, and to do this we're going to do a live demo with a gene set involved in muscle contraction, the heart and so on. So we're going to take this gene set, and this is the GeneMANIA website, genemania.org. You're going to go through examples yourselves, so you don't have to worry about following every step. So I just pasted my list of genes here, and then if I run the GeneMANIA query I get this network. But first let me mention that you see this little icon on the top where my pointer is. That's a human, but you can select different organisms here. Depending on the organism that you select you might see different things, but hopefully not too different for processes that are conserved. So there's Arabidopsis, C. elegans, Drosophila, E.
coli, human, mouse, rat and yeast that you can see on this list.

So what I did was I just pressed search, and GeneMANIA gave me this network. Let me try to make it a little bit bigger. Okay, so a few things. What you see is the edges between genes, and some are much thicker than the others. That's because some of the functional interactions are stronger than others. The other thing that you might notice is that they have different colors. What that means is that the edges are colored differently depending on which data set actually told you about that edge. For example, co-expression data is in pink or purple, so this is a link that's based on gene expression data. This green-blue one is based on pathway databases. You can see you have different types of edges that connect your gene set together, and you can get all that information up at the top.

The other thing that I wanted to mention is that you inputted a list of genes. I don't know exactly how many genes were in that set, but I think, let's see, about 17 genes were on your list. But there are more than 17 genes displayed here, and the reason is that, remember, this is a gene recommender system. It gives you the connections between your genes plus some additional genes that are connected to your set of query genes. The ones that you inputted originally are the ones that have these lines going through them, and the genes that were predicted to be part of your investigation are the ones that are solid black.

You can specify how many genes you want to predict along with your query genes. So we start with 17, and I think by default GeneMANIA provides 20 additional genes, so there should be something like 37 genes here. Now that you have this network of genes, what you can do is functional enrichment analysis, and that's at the bottom here. GeneMANIA has already gone through enrichment testing for all Gene Ontology annotations and
different hierarchies and ranked the different annotations based on their p-value. So for example, at the top you see circulatory system process. Like I said, some of these are muscle genes that are important for muscle contraction, and that's why terms like heart contraction show up here. What you can do is click the ones that are of interest to you, and it will color the nodes based on those annotations. So for example this gene here in the middle now has multiple colors because it's annotated with multiple functions on this list. And I should mention that in this list only Gene Ontology annotations that pass a certain significance threshold after multiple testing correction are displayed. Not all the functions are displayed, only the ones that are significant. Someone might have asked about the size of the node. We'll come to that in a second, but it's how central that node is, how many neighbors it has. Not exactly direct neighbors, and we'll see that concept in a second, but overall how connected it is through the network. So the ones that are more central are larger than the ones that are more peripheral. Also, one more thing: this gene list that you put in, what would be an example of a gene list? Great question, I'll answer that in a second. What was your question? I just wondered, you included some genes, but where did you get that list? OK, so what's the use case. That's a great question. What this is really good for is when you do a screen and you come up with a list of genes. For example, you're interested in what's different between stress condition A and stress condition B, so you generate whatever data you're generating and you come up with a list of genes that are different. That's the ideal use case: you already did some analysis, you came up with a list of genes that are different between the two or more conditions that you were investigating, and now you
want to understand why you observed them, how these genes together make sense. So this is the tool to see what other genes this gene set connects to, and based on that you can try to understand the cellular context of observing this set of genes. Obviously you don't have to have multiple genes, you can use it with one gene as well, but this is the case where you've done a screen and you have a list that came up on top of your analysis. And then, going back to something like GSEA, does it use a ranking of the genes? No, it does not. That's very different compared to GSEA. When you're doing pathway analysis with GSEA, you're essentially ranking the genes in a way that allows you to do enrichment. With this, you just provide your list of genes, and internally, and we'll come back to that, it will try to figure out which one is more important than the other, but you don't have to provide any kind of ranking or any kind of scores. And you start with, let's say, 10 genes, and what you see in the graph actually adds more genes to them?
Yeah, exactly. It's the same concept that I mentioned in the previous slide with the cities: you provide some cities and it gives you more cities. Here it's a functionality that you can enable or disable, and I think you might be going through some of that in your module. You can say I want to see additional genes, or you might say I don't want to see any additional genes. You can look only at your own gene set and try to understand how the genes connect to each other and what their function is, or you can get additional genes. I would say getting additional genes usually works well because your gene set is subject to some noise. To define your gene set you have to set some sort of threshold, and then you come up with the list, but that's probably not complete, and the way you set your threshold may or may not be very robust. So if you try to predict additional genes, it tries to pull in other genes that are highly linked to your genes, so that it gives more context when you try to figure out what the function of all those genes is. In this example, which genes are the ones you listed? So I gave it 17 genes and it gave me, I think the default is 20, additional genes. You will go through that in your exercise. Oh, how can you tell the difference?
So some of them are striped and some of them are solid. Sorry. Yes? Is tissue specificity factored into that, because sometimes expression is tissue specific? So we're going to come back to that, that's a great question. The whole concept is that you can have context-independent networks, which don't care about tissue specificity. Genes are connected if they were correlated somehow, somewhere, and that doesn't tell you anything about the tissue that you might be interested in. And then there are context-dependent networks that try to understand the context that you're asking about, and those networks could have some specificity that might correspond to tissue specificity. Maybe once I explain that concept it becomes a little bit more clear, but tissue specificity is not something that you directly build into it. By design, hopefully, the algorithm finds the right tissue context. And I have another question: cancer is one of the most researched topics in the literature, so how does that weigh in? So that's another really good point. There are some genes that are much more studied than others, and that could create some biases, so you might end up always finding the genes that are well studied because they link to everything. I would say conceptually that's not such a big issue for a recommender system, because it's doing an unbiased screen. Remember that most of the data that goes into it is from gene expression studies, because those are the most abundant type of data, and in gene expression studies you're measuring everything all at once, so there's no literature bias that has crept into the original expression data. Literature bias comes in once you take in literature networks and pathway databases and things like that. But if you're just talking about networks that are built from a genome-wide data set, there's no bias in that data
before aggregating it with other literature knowledge. Does that make sense? I would also say that there's a lot of research in this area of literature bias and multifunctionality, and what you see depends on what you care about. If you care about your gene of interest and what the top genes connected to it are, typically you're not going to be hurt by what they call multifunctionality or literature bias. But once you go into evaluating your method with an evaluation metric like the area under the receiver operating characteristic (ROC) curve, where basically you try to recall as many genes as possible, then literature bias is important in terms of evaluating or comparing different methods to each other. But I would say for your specific gene list, if you're using genome-wide data, that's typically not a big issue. You can combine. So let's say that you have a list of genes that might be representing multiple processes, right? Some of them are positively correlated with each other and those are upregulated, and some of them are negatively correlated. In the ideal case, what should happen is that, like the figure that I showed in the beginning, you would have two sub-networks: all your positively correlated genes are in one network and your negatively correlated genes are in another, and then you have different functions that are enriched in these two sub-networks. In theory that would be ideal, but data is never perfect, so probably what you see is two networks that are a little bit separated, with some weaker connections between them. Another thing that could happen is that the genes on your list have nothing to do with each other, and what should happen in the ideal case is that you get a blank network where nothing is connected to anything. Those are always really good nulls to test, to make sure that you actually get the expected answer: you shouldn't see anything, you don't,
and it's not like you're always seeing interactions between genes. So are there more questions? Yes. So you do a differential gene expression test and you get a list of genes that differ from one condition to another, and you put it in this and it adds genes. But the genes that it's adding to your network are not differentially expressed, because obviously they didn't pass the threshold. So what do you infer, what is this adding to my knowledge? It's using other genes to understand what your genes are, but it's not telling me what networks or what pathways are different between the two conditions. It is, because in your assay, in your analysis, maybe those genes didn't come up, but they're very tightly correlated with your genes. Experiments are not perfect. If you repeated the experiment, hopefully you'd get a similar set of genes, but you're not going to get exactly the same results, so in a way those genes could have come up if you repeated your experiment. That's one interpretation. Another interpretation could be that they didn't come up, but they're linked to something important: pathway A is coming up in your analysis, but it is tightly linked to pathway B, and if you want to understand the context of that pathway or figure out what experiments to do next, there are pathways and other genes that are highly correlated with your genes of interest. So it's a way to give you more context to think about what your genes are and how they interact with other genes in the cellular system. All right, so I showed you this picture in the demo, but this is on the slide now. You have your input genes, and maybe you can see it a little bit: the additional genes that came up are these solid ones, and the genes that you actually input have stripes through them. And again I'll emphasize, and I'll probably say this later too, that you can specify how many other genes you want to see, if any at all, so
you don't have to see any additional genes if you don't want to. If you do see them, they give you more context to think about the genes that you input. So in GeneMANIA there are multiple functionalities that you can use. You see the three dots on top: if you click there, you'll see the list of all the available networks that can be used in the GeneMANIA analysis. So you have co-expression networks, and if you open this up you'll see there are hundreds of gene expression data sets that have gone into this database. These are the names of the different publications. What GeneMANIA does is scrape GEO and all sorts of public repositories for the data sets that people have generated over the past 20 years and put them together, and that's where the power is. You don't have to worry about where the data is coming from. Hopefully, by aggregating all of these data, we're cancelling out the noise and capturing interactions that are consistent across all these different data. For example, you can click on the name of a data set and there's a link to GEO where the data is available. And these are just under the co-expression category. You have other categories here: co-localization, genetic interaction, pathway databases. So as a user you have a lot of freedom, which sometimes is nice and sometimes is scary because there are too many options. What you can do is click which data you actually want to use. If you don't want to use any pathway databases, you don't have to, you can unclick that. You can use or not use co-expression data. There are default parameters that I'll tell you about, and even within co-expression data you can select very specific data sets and use only those. It's always good to try multiple approaches to make sure your results are not dependent on one very particular setting.
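Ticking or unticking networks, as described above, amounts to choosing which of the D networks feed the composite network. Here is a minimal sketch of that idea under stated assumptions; it is not GeneMANIA's implementation. The function name `combine_selected`, the dictionary layout and the edge weights are all hypothetical, and the gene symbols are used only as labels.

```python
from collections import defaultdict

def combine_selected(networks, selected, weights=None):
    """Sum the edges of the selected networks into one composite network.

    networks: {network_name: {(gene_a, gene_b): edge_weight}}
    selected: names of the networks the user left ticked
    weights:  optional {network_name: weight}; every network counts 1.0 if omitted
    """
    composite = defaultdict(float)
    for name in selected:
        w = 1.0 if weights is None else weights.get(name, 0.0)
        for (a, b), value in networks[name].items():
            edge = tuple(sorted((a, b)))  # undirected: (a, b) is the same edge as (b, a)
            composite[edge] += w * value
    return dict(composite)

# Toy data with made-up edge weights, for illustration only.
networks = {
    "co-expression": {("MYH6", "MYH7"): 0.8, ("MYH7", "ACTC1"): 0.4},
    "pathway": {("MYH7", "MYH6"): 0.5, ("ACTC1", "TNNT2"): 0.9},
}

all_data = combine_selected(networks, {"co-expression", "pathway"})
no_pathway = combine_selected(networks, {"co-expression"})
print(all_data)
print(no_pathway)
```

Unticking the pathway network removes the edge that only the pathway data supported and weakens the edge that both sources supported, which is one way to see how much a result depends on a particular setting.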
The other thing that you can do is upload your own network, and you can see the link here. If you don't want to use any of the data that's out there and you just want to visualize your own data as a network, you can upload that and look at it here. I lost what I wanted to tell you. All right, and then the last thing that I'll mention is that at the bottom there are customizable parameters that you can set. This one on top says max resultant genes, and that's the parameter that I've been talking about. Right now it's set to 20, which means that 20 additional genes will be reported, but you have the freedom to change it. You can set this to zero, and if you do that no additional genes will be reported. So now we get to context-dependent networks, which partially answers the tissue specificity question as well. Let me go back to this slide. To recap, a context-independent network is a network where you've collected all of these data and you take a simple sum, so the edges in the final combined network are the sum of all the edges that you saw between pairs of genes. I mentioned STRING as an example of a context-independent network. These networks have no information about tissue specificity or the context of your query. Now you can imagine using a context-dependent network, and what that would look like is that instead of simply taking the sum of all these networks you take a weighted sum. Some of the networks could be given a really high weight and some of them could be given a zero weight, so you're not using information from those networks at all. The way you would do this mathematically is pretty simple: you have a scalar w_i for network i, you multiply all the edges in network i by the weight w_i, and then you add up across all the different networks. So if a network had a weight of zero, none of the edges from that
network would contribute to your final combined network. And the edges are only summed up by the weight of the edge? Exactly, yeah. So maybe let me write this on the board. For the final network, call it W, the edge between nodes i and j is going to be $W_{ij}$. This is your combined network edge between i and j, and it's just going to be a sum. I'm going to change notation a little bit: network 1 is assigned a weight $\alpha_1$ and has its own edge $W^{(1)}_{ij}$, and if you have d networks, you have d of these weights assigned to the different networks, each multiplying the edge between i and j in that network: $W_{ij} = \sum_{k=1}^{d} \alpha_k W^{(k)}_{ij}$. Can you guys see over there too? So the final network edge is just a weighted sum of the corresponding edges from all these different networks, and now you're free to figure out how to weight the different networks, and that's where the context-dependent network comes in. So algorithmically, how would you do this? There are two overarching goals: one is to take into account the relevance of the network and one is to reduce redundancy. Let me first talk about redundancy. When I pulled up the co-expression data in GeneMANIA, you saw there were hundreds of co-expression networks in there. Some network types are a lot more abundant than others, and if you don't consider that, your results are going to be swamped by the data types that are very abundant. For example, we have hundreds of gene expression data sets but probably two protein-protein interaction networks or something like that, and if you simply add up all the edges, then obviously you're going to over-represent information that comes from gene expression data. So we want to take account of redundancy, and that's one component. The other is that we want to find networks that are
relevant to a particular query. This is where it becomes very useful to have a list of genes as opposed to one gene, because if you have a list of genes you can assess how relevant each of the networks is to your list. Let me tell you how this happens. Let's say that you have four types of networks in your database: a network based on co-complex data, a network based on shared phenotype data, a network based on genetic interaction data, and a network based on gene expression data, which we call co-expression networks. What happens here is that you have a network importance weight for each of those networks, and then you add up the edges to come up with this composite network, whose edges are a weighted sum of the corresponding edges throughout your database. So how do you actually assign these network weights? I'll give you the high-level intuition, and I'm hoping that you ask me questions to probe a little bit more. The intuition is this: let's say that you have a gene list. I'm going to construct what I call an ideal network from your gene list. In the ideal network, the genes on your list are all connected to each other and they're not connected to any other gene. Then you want to figure out how to construct a linear combination of the underlying networks so that you reproduce that ideal network. If you try to do that, you will find networks where the genes in your set are much more connected to each other than to genes that are not in your set. Mathematically there are ways you can formulate that, and it's actually pretty simple to figure out how to construct a weighted combination of the underlying networks so that the connections among your input genes, your query genes, are maximized. So you maximize the connections within your gene list. Do people have
questions about that? So why that's nice is because now, based on your query, you might completely ignore some data and use others. Let's say for example that your query genes are genes involved in brain processes, and most of your gene expression data come from, let's say, blood, where these genes are not even expressed. Hopefully this automatic weighting approach is going to figure out that your genes are never present in co-expression networks built from blood, and it's going to ignore those and give more weight to co-expression data from brain. This is not enforced: you don't say "find brain tissue for me". But by virtue of providing your gene list, which carries information about genes that are hopefully co-expressed in the brain, the algorithm ends up selecting data sets that are relevant. So in GeneMANIA, if you go to the three dots, it'll give you an option for network weighting. There are different approaches you can take to combine all the networks that exist in the GeneMANIA database. You can use what's called the automatically selected weighting method, and that's probably the default that you would use. If you select it, then if your query is a list with more than six genes on it, it'll try to automatically figure out which networks should be used, so it builds the context-dependent network that I mentioned. But if you have fewer than six genes, there's just not enough information to do that, and in that scenario it'll use one of the default weighting methods, which is biological process based. I'll mention what that is in a second, but I also want to mention that there's this equal weighting option, so if you don't want to use
automatic weighting, you can just say weight all the networks equally, or weight equally by data type. The difference is that some data types have many more networks than others. As I mentioned, gene expression data have a lot more co-expression networks, so if you want to treat all the data equally you probably should do equal by data type. That's if you don't want to use the automatic weighting option. If you do use automatic weighting and you have fewer than six query genes, it uses the biological process based weighting. GeneMANIA has been run on hundreds of Gene Ontology annotations to try to use the information in these networks to predict gene function, and when you do that, for each function that you try to predict you assign a weight to each network. So you can say that, on average, networks that are very relevant for predicting biological processes should be given high weights for generic queries, and networks that are not relevant for predicting biological processes should be given low weights. So based on experiments that have been run before, networks are given a weight that reflects how relevant they've been in previous gene function prediction tests. But that's all if you have fewer than six genes. If you have more than six genes, you can use the automatic weighting scheme, and a nice side product of that is that you also find out which networks were actually used to make your context-dependent network. That may be of interest, because first of all you can look to see if the approach used a network that you thought would be relevant. Take the example that I brought up about brain genes and brain gene expression data: hopefully, if you had a list of brain genes, GeneMANIA automatically identified the relevant gene expression data sets and used some of those, and that's a way to confirm that it's working properly
hopefully. The other thing is that you might be surprised: it might select some data that you want to look into, because that data seems to provide a lot of information about your gene set. So I already explained this. OK, now the last concept that I want to talk about is direct interactions versus label propagation. This is actually a pretty interesting topic, but I think it's not very well understood, so I'll try to explain it clearly. What we talked about up to now is that you have your networks and a list of genes that you're interested in. These are your query genes, and here I have them colored red. So you have your network, you have your query genes, and there are two things you can do. You can look at the direct neighbors of these query genes. Here you take your query genes and only look at other genes that are directly connected to them, and you say something like: if these were all involved in some sort of function, I think these two, with a little bit lower confidence, should be involved in that function too. But what more sophisticated methods do is use what's called label propagation. Label propagation is a little more involved than just looking at direct interactions. It considers the whole graph, the global structure, to use not just direct interactions but also indirect interactions to infer genes that should be relevant. So an example here: if you were to use label propagation, instead of cutting off the genes you see at those that are directly connected, you see genes a little farther away too. A lot of times you get some very dense area of the network, and let's say these were your query genes and there's a dense area of the network here. Those genes have indirect
connections to your query genes, but they might be of interest to you because they're tightly connected with many indirect neighbors of your query genes. Label propagation scores all the genes in the network based on their direct and indirect connections to your query genes, and it actually has a closed-form solution, meaning that you can figure out the score of every gene in the network with just one equation. So there are not a lot of parameters to play around with. It's based on graph theory, which tells you, depending on how far away a given node is, what its score should be. Questions about that? So what GeneMANIA is doing is using label propagation, not the direct neighborhood, and that's why you can get an expanded set of genes, some of which are directly connected and some of which are indirectly connected. Is this the same as what [inaudible] does? Yes, similar concepts. I think it uses label propagation. I don't remember exactly, but I think it's a very similar concept. Do you have to stop the propagation at some point? You don't have to, because it's a global algorithm. There are a lot of ways of thinking about it, but there's actually a graph-theoretic approach that determines how much propagation you should do. The algorithm converges; it's a convex problem, meaning that there's just one solution to it. If you had to translate that into an iterative process, what would happen is that for a given node you take the scores of its neighbors, you average them, and that's the score of the current node. You repeat that many, many times until no score in the network changes anymore, and if you do that you always end up at the same result, as long as you started with the same gene set. So starting from a gene set, you always have a global solution, the optimum for this propagation. So you said it's convex, do you have an objective function? There is an objective function that says that the scores of two nodes that are close by should be similar according to their edge
weights, and also says that the score of a labeled node should not be too different from the label that you had for it. I can write the equation for you. So this slide summarizes what I mentioned before. With direct interactions, as in STRING, you're just looking at links that interact directly with your query genes, and an example of an algorithm that does this is [inaudible]. With label propagation you're looking further than just your direct links, and in a lot of situations this can be very beneficial, because it could be that some direct connections are very sparsely connected to anything else, and you don't get a good idea of what's going on in the overall network if you don't look at indirect connections as well. So this is a slide that I made for label propagation. What I'm showing here is that the size of a node depends on the score that it received after you ran the label propagation algorithm. All the genes that were in this sub-component received a really high score, plus some of the genes here because they were connected to this gene in the middle. If you were using direct neighbors you would see something more limited, and some of those here would not show up, because not all the genes are connected to each other. So in summary, and I don't know how much time we have, something like GeneMANIA has three parts. There's an automatically updated database that contains, hopefully, all sorts of data that people have measured and made available in public databases. There's a query algorithm to find networks that are relevant to your genes. And there's a client network browser that displays the networks. I would say the heavy lifting is the collection of all the data sets that exist out there, and then using these algorithms in a fast way to figure out which networks are relevant to your query, because all of this happens in a matter of seconds even
though these are very complicated computations, and that's the power of it. If you had to sit there for 10 minutes to look at the network, you probably wouldn't end up using it, so one of the emphases is to do all of this in 10 seconds or less. I mentioned some of this before: there are a lot of curated databases that are already part of GeneMANIA, for example InterPro, Interolog, BioGRID and GEO. Data sets coming from all those sources are part of GeneMANIA, and you can look at the data that's there; I showed you a little bit how to figure out exactly what's there. In terms of gene identifiers, you can use all kinds of standard gene identifiers and you can mix them, and it will automatically try to map them all to one identifier. The go-to is the gene symbol, but if you provide different types of identifiers they're all internally mapped to Ensembl. The gene identifier mapping based on Ensembl is periodically updated. It's not a mirror, so it's not fully up to date with Ensembl, but I think it's updated in the database every few months, so it shouldn't be too outdated. What I showed you was the GeneMANIA web browser, but there's also a Cytoscape plugin that you can use. The Cytoscape plugin is useful if you want to create similar functionality for other organisms, if you want to create your own version of GeneMANIA to use locally, and also if you want to use larger gene lists. In the browser, if your gene lists are larger than about a hundred genes, you will start to get into scaling issues and it won't be as fast anymore, and you might see it crash if your gene lists are very large. So if you have large gene lists that you want to investigate, better use the Cytoscape plugin. And then there's this R package, sorry, it was a command-line tool that was also published, called QueryRunner, and this is if you want to use GeneMANIA in a more
industrial way. You can do a few interesting things with this. One is, let's say that you have a data set of your own and you want to compare it against all the other data available in GeneMANIA. How would you compare it? You can compare it in terms of how much information about co-functionality between genes your data has and how that compares to the other data out there. Do you find that maybe your data set is uniquely contributing information about gene function? You can do these kinds of things with QueryRunner, which is a command-line tool. So I talked about GeneMANIA a lot. I'll also mention STRING, and I think some of you have already seen or used STRING. STRING is a very useful network visualization database and gene function tool as well. The main difference from GeneMANIA is that it's context independent: it doesn't consider your query genes to construct a combined network, it already has a combined network that's constructed in a static way. Otherwise it has a lot of similar functionality. You can input your gene list, it will give you a network, and it will tell you which kinds of data support those connections. It has a legend where you can see which data are supporting the edges that you see, and it will give you a lot of information about your genes. It also does functional enrichment analysis, so you see what functions are enriched in your gene set as a side product of visualizing your networks. So here are a few comparisons. STRING has been in existence for a long time now, since 2003. It has large organism coverage. The nodes are proteins, so whereas in GeneMANIA you could have non-coding genes as well, here the nodes are proteins. And it uses direct interactions and not label propagation, so you only see things that are directly linked to your genes of interest, in contrast to GeneMANIA. For GeneMANIA, I mentioned some of
this. GeneMANIA right now supports 9 organisms, and you can add more yourself if you use the Cytoscape plugin. It's gene-focused and it has thousands of networks. I would say the database of networks is probably a lot more comprehensive in terms of the profiling data that exist out there, especially in GEO. It does enrichment analysis, and it uses label propagation, which tells you not only about direct neighbors but also about indirect neighbors that might be relevant. And that's all that I had. I'm hoping that you were able to take away some useful information about networks, their visualization and the integration of the data sources that exist out there, and I'm happy to take any questions.
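The two ideas contrasted in this summary, query-dependent network weights and label propagation versus direct neighbors, can be sketched on a toy example. This is a simplified sketch under stated assumptions, not GeneMANIA's actual code: the function names, the toy adjacency matrices, the use of ordinary least squares with negative weights clipped to zero, and the smoothing parameter `lam` are all illustrative choices.

```python
import numpy as np

def query_weights(networks, query):
    """Weight each network by how well it reproduces an 'ideal' network
    in which the query genes are linked only to each other."""
    n = networks[0].shape[0]
    ideal = np.zeros((n, n))
    ideal[np.ix_(query, query)] = 1.0
    iu = np.triu_indices(n, k=1)              # count each gene pair once
    X = np.stack([net[iu] for net in networks], axis=1)
    alpha, *_ = np.linalg.lstsq(X, ideal[iu], rcond=None)
    alpha = np.clip(alpha, 0.0, None)         # assumption: no negative weights
    total = alpha.sum()
    return alpha / total if total > 0 else alpha

def propagate(W, query, lam=1.0):
    """Closed-form label propagation: the minimizer of
    ||f - y||^2 + lam * f^T L f, where L is the graph Laplacian of W."""
    n = W.shape[0]
    y = np.zeros(n)
    y[query] = 1.0                            # query genes carry the label
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(np.eye(n) + lam * L, y)

def direct_neighbors(W, query):
    """Genes outside the query that share an edge with any query gene."""
    linked = np.flatnonzero(W[query].sum(axis=0) > 0)
    return set(linked.tolist()) - set(query)

def sym(n, edges):
    """Build a symmetric adjacency matrix from (i, j, weight) triples."""
    W = np.zeros((n, n))
    for i, j, w in edges:
        W[i, j] = W[j, i] = w
    return W

# Toy data: network A links the query genes (0, 1) and a short path 0-2-3;
# network B only links genes 4 and 5, which are unrelated to the query.
net_a = sym(6, [(0, 1, 1.0), (0, 2, 0.5), (2, 3, 0.5)])
net_b = sym(6, [(4, 5, 1.0)])
query = [0, 1]

alpha = query_weights([net_a, net_b], query)
composite = alpha[0] * net_a + alpha[1] * net_b
scores = propagate(composite, query)

print("network weights:", alpha)
print("direct neighbors:", direct_neighbors(composite, query))
print("propagation scores:", np.round(scores, 3))
```

In this toy case the second network, which only links genes outside the query, receives zero weight. Gene 2 is the only direct neighbor of the query, while label propagation also gives gene 3 a smaller positive score through its indirect connection, and genes 4 and 5 stay at zero.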