Boom, boom, boom. There we go. So you showed that some people didn't get the lab. Yeah, I needed it. It's a good thing. OK, let's start. Everybody knows me, right? Yeah. OK. I'm back. Back in the house. Here I am. I dressed in a very noisy shirt so you would look at me. So today I'm talking about gene function prediction. Oh, I don't have that little tool I had before, so I'll walk back and forth. OK, good. Now I have a pointer, and we're good. OK, so let's see what we're going to learn about today. As I said before, what I'm trying to relate to you in these modules is the concepts. In the lab, we're going to use a specific tool for doing gene function prediction. It's one that I helped develop, and I think it's a great tool, obviously. There are other tools for doing this type of thing, and those are linked to in the wiki. There are some whose links I might not have updated, but all these tools have the same concepts incorporated in them. So what's important in this lecture is for me to communicate these concepts to you so that you can generalize from the tool that we're going to work on during the lab to other tools that you might want to use as well. And when you read other people's papers, you'll understand what the concepts are. OK, so the first concept that we're going to talk about is a functional interaction network. Rob has already introduced this term earlier today, and we talked a little bit about it yesterday as well, so I'm just trying to make that concept clear to you. Then there's this other concept of guilt by association, which we've also talked about. And the concept that I'm going to introduce, which is a gene recommender system. Now, this next one is a little bit complicated: this concept of context-specific network weighting schemes.
And so what Rob introduced to you this morning was something called the FI network, the Functional Interaction Network. I think that's what it stands for. And how that network was produced was by combining together networks from a whole bunch of different sources. Now, how you combine networks together sometimes should depend upon the question that you want to ask of the final network. So if the way you combine networks depends upon the question that you're asking the network to answer, then I'm going to call that context-specific. The other thing is I want you to understand the difference between two different ways of predicting gene function, or implementing guilt by association: that's direct interaction and label propagation. These are two distinct ways of doing this that are related to one another. And then I want you to be able to use a gene recommender system to answer two types of questions about gene function. The first one is, what does my gene do? You have a single gene, and you want to find out more about that gene. And the other question is, give me more genes like this. I guess that's a demand rather than a question. But that's a different way of interacting with the gene recommender system, right? And then finally, I want you to be able to select the appropriate network to answer your questions about gene function. All right. And the outline is basically going to follow this. The first part is going to be communicating these concepts, and then we're going to look at the tool, GeneMANIA, and then very briefly I'm going to introduce this tool, STRING, which is the major other tool for doing the types of things that I'm going to be explaining today. But there are a whole bunch of other minor tools that cover different sets of organisms. Okay. So you want to use genome-wide data in the lab. Great.
And millions, or actually billions, of dollars have been spent by various governments to generate these network data about how genes are linked together, and also expression data, for example, how genes are expressed under a variety of different conditions and in a variety of different organisms. And there's been tons of annotation work that's gone into reading the literature to build these pathway databases that give these pathway networks. So when you're asking questions about your gene list or individual genes, in some ways you want to take all that information into account when you're trying to find out more. Right? But, you know, until recently, you kind of had to be a computer scientist to bring all this stuff together, right? Because all these networks are stored in different places. The meaning of these networks isn't entirely clear. And so a lot of this work that we're going to be talking to you about, developing the gene recommender systems, is a way of providing a simple interface to these types of network data in order to make it useful to you in your work, without you necessarily having to understand the specific details of how all these networks were generated, what all the interactions mean, and how to combine them together. And then Robin this morning talked about one way of doing that, and that's in terms of doing this clustering within the network and doing sub-network analysis. Okay. So what is a functional interaction network? So, you should never use, this is like a whole figure, but you can see this kind of red and green here, right? Right?
And this is a figure taken from a really early paper, right after microarrays were first described, and this is a paper that used microarrays, I think it was the yeast microarray, yeah, to measure gene expression under a variety of different conditions where the cells had been perturbed in various ways, either through a knockout mutant or through changes in, I think, the media that they were growing in. And one of the first things that was discovered is that if you just take the genes and arrange them from top to bottom in such a way that genes that have similar patterns of expression are right beside each other, and then you put down a label of the function of the genes, as you look down the list you see that there are these long areas where the genes have very similar expression patterns and they all have the same function, right? This led to this idea of what's called a functional interaction network, where you have a way of building a network where the nodes in this case correspond to genes and the links between genes correspond to, in this case, the degree of correlation of the microarray expression profiles. But those links, and the strength of those links, can be interpreted as evidence that the genes that are linked by that edge share a function. Yeah. What? Well, that's what we're going to be talking about. But in this case we're saying there's some aspect of their function that's shared, right? And different types of networks, and different types of functional interaction networks, might better capture different types of function. Yeah. Well, in this case, with microarray expression, they would share the same regulation, but it might also mean that they share the same sort of biological function, right?
So if, instead of this being yeast that you perturb using different deletions or different chemicals that you add to their media, you looked at gene expression across a set of human tissues, all the genes that are uniquely expressed in the eye are probably playing some role in vision. But we'll get back to this. We can have questions and we can have this argument later on in the talk, because I'm going to ask some more questions about what gene function actually means. But these are great questions and I love this discussion. Okay. So there are a lot of varieties of so-called functional interaction networks. Admittedly this is a vague idea, right? And that's why we have to come up with these gene recommender systems and provide a different way of interacting with these networks. So first I distinguish directly measured interactions. These are ways of measuring whether or not two genes have specifically interacted with one another. What do I mean by that? So protein-protein interaction networks, or just protein interaction networks, are trying to measure whether or not the protein products of two genes are physically linked together, right? And we can argue about the exact interpretation of protein-protein interaction networks, and it's going to depend upon how those protein links were generated, and maybe Francis is going to start that argument. They're close enough to be... So there are a lot of interpretation problems with these data, right? Okay, and then another kind of directly measured interaction is the so-called genetic interaction network. There, two genes are connected by an edge if there's some sort of epistatic interaction between deletion mutants of these two genes. They don't have to be touching. But what I'm trying to distinguish here is that these are cases where you're directly doing measurements on pairs of genes.
Now, what I talked about in the previous slide was inferring interactions from a single data source. There, we're not directly measuring pairs of genes. What we're measuring is gene expression across a whole bunch of different conditions, and we're inferring interactions by, in this case, computing the correlation between the pairs of genes, right? And so those are inferred interactions. And what those interactions mean, I don't really know, right? Co-regulation interactions, probably. But they can also be interpreted as shared function under some conditions. What Robin talked to you about this morning was inferring interactions from multiple data sources, right? And there's a whole bunch of what I'm going to call context-independent ways of doing this. Basically the idea is that you take these networks: these are directly measured interactions, these are inferred interactions from a single data source, and what I haven't included here is networks that you get from pathway databases. For example, I don't know, you're looking at the Krebs cycle and you're connecting everything that shares a metabolite, right? And then you're merging all those networks into one network, which was called the functional interaction network, or the FI network, this morning, right? And there, if you have more of these networks that say these two genes are linked in some somewhat obscure way, it provides greater confidence that these genes share some aspect of their function. And that's a context-independent functional interaction network. I'm also going to talk today about ways of deriving context-dependent interaction networks, where you're asking the question: say I have a set of genes, and I want to find more genes like these. Well, how would you find more genes like these using networks? Well, first of all, you'd find out which networks best reproduce this set of genes.
And then you use those types of networks to find more genes like those, okay? And I'll make that idea a little bit clearer as we move through my presentation. So as I said previously, there are two types of function prediction. The first one is, what does my gene do? I have a single gene and I want to know more about it. And so if I look at the other genes that it's functionally linked to, I might be able to find out more about what that gene does based on that set of genes. And the other question is, give me more genes like this. So find me more genes in the Wnt signaling pathway that I didn't know about before. Find me more kinases. That should be pretty easy, actually. Find me more members of a protein complex. That might be easy, that might be hard. But these are the types of questions that you might have. Okay? And there's one further question that's very similar to this one: I have a list of genes, tell me how these genes are linked together. And Robin talked a little bit about that this morning as well. Okay, so what does my gene do? Here's the protocol. You have all your network data, and maybe you have profile data like gene expression patterns. And your query list consists of a single gene. You put these two types of data together in what I'm going to call the gene recommender system. So this is like when you buy a book on Amazon. Amazon tells you what other books you might like, right? This is the same thing, except it tells you what genes you might like. And then the output here is your gene, plus all the other genes that it's linked to. It looks like a hairball mess here, but what this is telling you is how those genes are linked together. Each one of these different colors here corresponds to a different type of evidence, or a different type of information. So there are two types of information, say, that link these two genes down here together. I guess the pink type and the green type.
And I'll be more clear about what that is in a second. And using guilt by association, you can try to infer the function of this gene based on the functions of the things that it's linked to. Yeah. Do you use the CDC-48 as an example? Yeah. Oh, is that 42? 42. Whoa, that's really weird. Okay. You're the first person to notice that. Thank you very much for that. I don't know what happened there. We probably just copied that down wrong. That's supposed to be CDC-48. Yeah. Thanks for catching that. And so now what I've done here is I've just colored all the genes that are involved in this function. Right? You can see that CDC-42 is linked to a whole bunch of genes that are involved in small GTPase mediated signal transduction. Right? Now, it's already been annotated with this particular function, but you might annotate it with that function based on the fact that more than half of its neighbors have that function. And an easy way to do that is just to do a gene function enrichment analysis of the neighboring genes to find out what functions are enriched in that set. So that's one way of interacting with the gene recommender system. Okay. And to do this type of analysis, you use all these types of data. Now, you can't use what I'm calling context-dependent interactions, because you don't know what question is being asked when someone gives you a single gene. Right? They could be asking a bunch of different questions. So if you say, what does P53 do, and you should say, what does TP53 do, because that's the gene symbol for P53, that question could be about a lot of different aspects of its function. Right? And the biological function that a gene participates in doesn't necessarily match its molecular function. Right? Or you could be asking what the role of P53 is in disease, and everybody knows what that is. Right?
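To make the single-gene case above concrete, here is a minimal sketch of guilt by association using a simple majority vote over a gene's neighbors. It is a stand-in for the enrichment analysis mentioned in the lecture, not a substitute for it. The neighbor names (`N1` to `N4`), the edge list, the second function label, and the function names `neighbor_function_votes` and `predict_functions` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def neighbor_function_votes(gene, edges, annotations):
    """Return the gene's neighbors and how often each function occurs among them."""
    neighbors = set()
    for a, b in edges:
        if a == gene and b != gene:
            neighbors.add(b)
        elif b == gene and a != gene:
            neighbors.add(a)
    counts = Counter(f for n in neighbors for f in annotations.get(n, ()))
    return neighbors, counts

def predict_functions(gene, edges, annotations, min_fraction=0.5):
    """Suggest functions carried by more than min_fraction of the neighbors."""
    neighbors, counts = neighbor_function_votes(gene, edges, annotations)
    if not neighbors:
        return []
    return sorted(f for f, c in counts.items() if c / len(neighbors) > min_fraction)

# Toy data (hypothetical): four neighbors of the query gene, three of which
# share one annotation.
GTPASE = "small GTPase mediated signal transduction"
edges = [("CDC42", "N1"), ("CDC42", "N2"), ("CDC42", "N3"), ("CDC42", "N4"), ("N1", "N2")]
annotations = {
    "N1": {GTPASE},
    "N2": {GTPASE},
    "N3": {GTPASE, "cell polarity"},
    "N4": {"cell polarity"},
}

predicted = predict_functions("CDC42", edges, annotations)
print(predicted)
```

In this toy network three of the four neighbors carry the GTPase annotation, so only that label passes the "more than half" threshold. A real analysis would use a statistical enrichment test rather than a fixed fraction.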
So in terms of trying to figure out what aspect of function is meant, and then answering those types of questions, some types of networks might be better for some types of gene function than others. And so that's what these context-dependent networks are about. Now, how would you figure out what question someone's asking? I mean, they could put it in. They could type in what aspect of function they mean, but there's another way of doing that, and that's defining the question by providing some context. So the first time I gave this slide, I was invited to give a talk in Memphis, Tennessee. Right? And Memphis has two meanings, right? Depending upon whether you put it in the context of Knoxville and Nashville, or Alexandria and Cairo. Right? So if I say, tell me more things like Memphis and these two other things, these are two other towns that are in Tennessee. Or, tell me more things like Memphis and these two other things, and these are other towns or cities that are in Egypt. Okay? So if you give a few examples of something, that context can help you to say what your question is. And that's "give me more genes like these". So here you have some query list. You have the same type of data. You put this information into a gene recommender system. And then the gene recommender system will be doing two different things. One is, it's going to find genes that are highly linked to this set of genes. Right? Where presumably you hope that the new genes that it adds are linked to more than one of these genes. And the other thing that it can do, if you give it a query list like this, is find what networks are best able to reproduce this query list. Right? In what networks are these genes all well linked together? And if you find the networks where these are all well linked together, those are probably the networks that you should use to find more genes like those. All right.
So this is the system that I'm going to be showing you today. It's called GeneMANIA. It does exactly these things. There are other systems that try to do something similar. And this is just a screenshot of the website, which we're going to be going over later in the talk. And so what I'm showing here is, I put in this query list. The genes that are in the query list are represented here by big nodes. And then the small round nodes represent other genes that are inferred to be like the set of genes in the query list. The colors of nodes indicate the annotated function of genes in the query list, and in some cases genes not in the query list. And then there are these little triangles here. These triangles correspond to features that are shared by groups of genes. In these cases, the triangles correspond to different molecular functions that these genes participate in. So NO synthesis. And the links are colored by the types of networks that link the genes together. And here are the functions down here. And this information here is telling you how much each one of these types of networks contributes to the overall network diagram that's being shown here. So these InterPro domains, which are protein domains, are the ones that contribute the most to, sorry, linking the genes in this list originally, and they're the ones that contribute most to finding more genes like the ones in this list. Oh, that was a complicated slide. Okay, don't worry. We're going to get back to this. Okay, so how does this work? All right, so maybe you guys want to call up the website now. Do you want to just type GeneMANIA into it? Our website is designed to take lots of simultaneous users, but things might be slow. So we'll see. Might as well just switch it to S. cerevisiae, and we'll see how it works.
Okay, so you don't have to type that in. There's this list of genes. You'll get it by just clicking on that example thing. But if you click on that example when you're in human, you're going to get a list of human genes, which is fine, but then don't change to yeast. You can change to yeast, or you can just go with human, which is the default. You just have to press example, and then if you click on what's called "show advanced options", this network list is going to show up. Now please don't click on any of these networks. You can do it if you want, but it's going to be a bit slower. Not that much slower, but a little bit slower. I just want you to open up this advanced options panel and see that there are different types of networks being represented, and that by default we've selected a subset of them for you. Those of you who are with me can press go. Even people who aren't with me can press go. Okay, but if you have no genes in there, it's going to complain. This is more about the network stuff. Let's skip this slide. There's a lot about the networks. Okay, do people have an answer? Okay, so what you're seeing looks a lot different than this, but there's going to be a little phrase near the top, right under the gene box, that says "show advanced options". If you click on that phrase, you get this thing again. Maybe. You're free to do things now. You're supposed to have fun with this. With the website, nothing's going to break. You can fool around as much as you want with this website, and it's actually designed to be relatively intuitive. I mean, nothing with a lot of features is ever intuitive, but it's designed to kind of be fun to play around with. That's what I want. I want it to be fun when I use it. But if you click on "show advanced options", you'll get this thing down here. And then you can click on one of the network types, and what you see might differ from this slightly.
So here, this tells you, in this case, that we have 95 networks of this type, of which 20 are selected. If you click on any of these network types, it'll pull up a long list of networks, where you can click individual networks on or off. Here, there's only one. And then when you click on this arrow, it's going to open everything up. It's kind of intuitive. And then you can figure out where that network comes from, and there's some information on it. And then it's just really fun to play around with this thing. Okay. And when you're happy, you can just press go again, and it's going to give you another search using the networks that you selected. All right. So these are the context-independent networks, and I've actually already explained what those things are. So if you have a co-expression network, a genetic interaction network, and a protein interaction network, and you don't want to worry about context, one thing you can do is this. Each one of these edges is associated with some weight. Right? When we made this co-expression network, the weight here was the Pearson correlation coefficient, or it might be some normalized version of it. Right? For the genetic interaction network, that weight might just be one or zero. And for the protein interaction network, it might also be one or zero, or it might be the count of the number of times that interaction has appeared. But regardless, you get a weight for every edge in every network, and the way in which you generate the functional interaction network that Robin talked about, or any of these other context-independent networks, is you just multiply that weight by some number.
That number is how much you trust the network. For every pair of genes, you multiply the weight between that pair of genes in each network by some scale factor, which tells you how much you trust that network, and then you sum those scaled weights across all the networks. So let me explain that. Okay. We have a pair of genes here. Right? A pair of genes, CDC23 and APC11, and they have a weight in every single one of these network types. Right? If they're not connected, that weight is zero. If they are connected, that weight could be one, it could be some real value, it could be anything, really. So the way in which I'm going to figure out what the weight between CDC23 and APC11 is in my context-independent network, or the FI network, is I'm going to take the weight in each network, and then I'm going to multiply it by some number that's assigned to each network type. And so what you're seeing in GeneMANIA, if you have the network panel open, is a percentage, and that could be the number that you're going to multiply by. The functional interaction network that Robin talked about this morning is generated using a procedure called naive Bayes. And what naive Bayes does is basically score networks by how well they're able to reproduce known functional links between genes. So different types of networks provide better or worse data. So each network has a weight. Each pair of genes has its own edge weight within the network. To get the edge weight for a pair of genes in the combined network, you take their edge weights in each of the individual networks, and then multiply those edge weights by the network weight. You sum those things up, and that gives you the weight here. All right. What's different with the context-dependent networks is you don't use a single fixed weight per network. You reassign that weight based on your query list.
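The weighted sum just described can be written down in a few lines. This is only a sketch under assumptions: the network names, edge weights, network weights, the extra gene `G3`, and the function name `combine_networks` are hypothetical, and the network weights here are made-up numbers rather than values estimated by naive Bayes or by any real tool.

```python
def combine_networks(networks, network_weights):
    """Sum each gene pair's edge weights across networks, scaling every
    network by how much it is trusted. A missing edge counts as weight 0."""
    combined = {}
    for name, edges in networks.items():
        scale = network_weights.get(name, 0.0)
        for (a, b), w in edges.items():
            pair = tuple(sorted((a, b)))  # treat edges as undirected
            combined[pair] = combined.get(pair, 0.0) + scale * w
    return combined

# Toy data (hypothetical): one gene pair seen in two of three networks.
networks = {
    "coexpression": {("CDC23", "APC11"): 0.8, ("CDC23", "G3"): 0.4},
    "physical": {("APC11", "CDC23"): 1.0},
    "genetic": {},
}
network_weights = {"coexpression": 0.5, "physical": 0.3, "genetic": 0.2}

combined = combine_networks(networks, network_weights)
# CDC23-APC11: 0.5 * 0.8 (co-expression) + 0.3 * 1.0 (physical), about 0.7
print(combined[("APC11", "CDC23")])
```

Swapping in a different `network_weights` dictionary is all it takes to move from a fixed, context-independent combination to one whose weights depend on the query list.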
So your query list tells you how good each one of these networks is at reproducing the query list, meaning how well connected that query list is in that network. And so here are the weights that are specific to the query list that contains these six genes. Okay? And that's the only difference between a context-independent and a context-dependent network: how the network weights are assigned. In a context-independent network, you use the same network weights regardless of the question that you're asking. In this case, you use different network weights depending upon the question that you ask, and those weights are inferred based on the list that you put into the tool. Okay? It's generated by GeneMANIA based on your query list. Okay. So it's just an option. Yeah. So you'll always get weights. The only difference is how those weights are determined. And there are a variety of different ways within GeneMANIA to determine those weights. Some of them depend on the query list and some of them don't. So for the ones that depend on the query list, how do you determine those weights? Well, if you give a network a non-zero weight, that network should be relevant to predicting the function of interest, right? It should link together the genes in your list. You also want to downweight networks that are redundant. So a lot of times microarray expression profiling experiments measure basically the same thing. Co-regulation is what people accept, right? And what a lot of microarray expression data sets mostly measure, in terms of co-regulation, is differences in growth, right? So one of the major responses that at least yeast cells have when you perturb them in some way is that they grow slower or they grow more quickly. Right? Something grows slower, or it grows more quickly.
That means it's cycling through the cell cycle faster or slower, and a lot of the gene expression changes that you see come from the fact that different proportions of the population are in different cell cycle stages. This is true of yeast cells. It was also true when we did a similar analysis for human cells. You could certainly detect a very strong signature of growth in the gene expression patterns in different human tissues. So as I said, there is a bunch of different network weighting schemes that you can use. And these are all accessible from the advanced options panel. By default, GeneMANIA just decides between the query-specific weighting and the query-independent weighting based on the size of the list. If you have a long list, it's going to try to weight the networks based on your list. If you have a short list, it's going to weight the networks based on how well they reproduce known functional interactions between genes. A network that just, you know, randomly connects genes together, where the links don't tend to connect genes that actually have the same function, gets a low weight. And we figure that out by comparing that network to networks that are constructed based on Gene Ontology annotations. So if two genes have a similar set of Gene Ontology annotations, they're linked together in the constructed network, and if the experimental network looks similar to the Gene Ontology network, then we give it a high weight. Right? And our network weighting schemes are based on the different hierarchies in Gene Ontology. So there's the biological process hierarchy, the molecular function hierarchy, and the cellular component hierarchy. If you don't do anything, it chooses biological process, but you can tell it to choose molecular function or cellular component if you want, if you're looking for something different.
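As a rough illustration of query-specific weighting, the sketch below scores each network by how well connected the query genes are within it and then normalizes the scores so they sum to one. This is a deliberately simple heuristic written for this transcript, not the procedure GeneMANIA actually uses: it does not downweight redundant networks, and it does not penalize networks that link query genes to everything else just as strongly. The gene names, network contents, and function names are hypothetical.

```python
from itertools import combinations

def query_connectivity(edges, query):
    """Mean edge weight over all pairs of query genes (missing edges count as 0)."""
    pairs = list(combinations(sorted(query), 2))
    if not pairs:
        return 0.0
    total = sum(edges.get(p, edges.get(p[::-1], 0.0)) for p in pairs)
    return total / len(pairs)

def query_specific_weights(networks, query):
    """Weight each network by how well it links the query genes together,
    normalized so the weights sum to 1."""
    scores = {name: query_connectivity(edges, query) for name, edges in networks.items()}
    total = sum(scores.values())
    if total == 0:
        # No network links the query genes: fall back to equal weighting.
        return {name: 1.0 / len(networks) for name in networks}
    return {name: s / total for name, s in scores.items()}

# Toy data (hypothetical): three query genes and three networks.
query = {"A", "B", "C"}
networks = {
    "coexpression": {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 1.0, ("A", "X"): 1.0},
    "physical": {("A", "B"): 1.0, ("X", "Y"): 1.0},
    "genetic": {("X", "Y"): 1.0},
}

weights = query_specific_weights(networks, query)
print(weights)
```

Here the co-expression network links every pair of query genes and gets most of the weight, the physical network links one pair and gets some, and the genetic network links none and gets zero. A different query list would produce different weights from the same networks, which is the point of a context-dependent scheme.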
Also, if you don't trust anything that we're doing, you can just assign all the networks equal weighting. Either equally by data type, since all the networks here are organized in different categories of network, like co-expression or protein interaction, or you can weight equally by network. Okay. And I've just been through these points. Okay, so we've talked about network weighting. We've talked about what a functional interaction network is. We've talked about how you can decide how to weight networks based on what you want. Now once you have a network, how do you find the genes that are associated with your query? How do you find the genes that are highly linked to it? Even with a single gene, that is a more complicated question than you might think. But when you have a set of genes, that becomes an even more complicated question. So if I say that these are the genes in my query list, I'm trying to find more genes like these four, and this is the network that I have, which one of these genes is most like these four genes? It's probably this one, right? Because it's linked to two. It's got two links, right? And then maybe the next one you might say would be this one. But if we're extending that list beyond two, there's a bunch of other genes that we need to consider here, right? There are these two genes here, neither of which is directly linked to these four. But somehow these seem like they would be different from these genes over here, which have no linkage at all. There's no path from these genes to these genes, right? So if you think about this as some sort of pathway diagram, you might want genes that are later in the pathway to be functionally associated with the genes that are earlier in the pathway, even if there's not a direct linkage. Right? So there are two types of algorithms that people use for finding guilty associates, or predicting gene function by guilt by association. Okay. These are... They look like yeast genes. You can just type them in.
Let's find out. Do you want me to type it in? Or you can just do it in the lab. Okay. So the principle behind guilt by association is you say that genes that are linked together in some way, functionally linked, that have a functional interaction with other genes, tend to share function. So if you want to find more genes that have whatever function is represented by these four, and I provide you with a network that looks like this, which of these seven genes is most likely to share the function of these four? Which one is second most likely to share the function of these four? Which is third most likely? Which is fourth most likely? And so what we're trying to do here is just find a way of scoring these seven genes according to how well linked they are to this group of four. You can call this predicting gene function, or you can call this trying to do guilt by association. But I'm just trying to say that there are two main types of algorithms, and all the algorithms fall into one of these two classes. One type of algorithm is called direct interaction. I call it direct interaction. People call it naive Bayes. Sometimes they just call it guilt by association. Sometimes they just call it nearest neighbor. And what direct interaction does is, for every one of the nodes in the network, which correspond to genes, it looks at its neighbors, and the score depends upon what proportion of its neighbors are in the query set and the strength of the links between the gene and its neighbors. So you can see in this case these two genes score higher than the other five genes because they're the only two that directly interact with these. But what you don't get from this type of algorithm is that this gene is probably closer to these four than, say, these three genes that aren't connected in any way. And the algorithms that try to capture that are called label propagation algorithms. And so if you apply a label propagation algorithm in this case, what you would get would be something like this.
So this gene is the most highly linked. This is the second most highly linked. But then this is the third most highly linked, because you can see there are a lot of paths from this gene to these genes in the query set. And then these ones aren't linked at all. It's a very simple idea. And it's related to this idea of HotNet, the heat diffusion that Rob talked about this morning. And basically the idea is that you score genes by direct interaction. Then you re-score the genes that haven't already been scored, and then you re-score them again. So you propagate the information about which genes are the positive examples. And the way the equations work is it corresponds to something like propagating heat through a lattice, where you have some loss at each one of these nodes. Or you can think of it like you have some water sources and you're pushing water through the network, where you have some loss at each one of these nodes. So the further the node is away from here, the less water and the less heat gets to it. And how much heat or water gets there depends upon how much heat or water got to the other nodes that it's connected to. And the whole thing balances out. Yeah. So the Reactome network, what we're doing, what STRING is doing, they're all derived from multiple sources. The sources vary a little bit, and the way in which the sources are combined varies a little bit. What I was trying to distinguish earlier is between the case where you use a fixed combination, so you use the same way of combining the sources regardless of the question you're asking or the gene list that you put in, and the case where how you weight the sources depends upon the gene list that you put in. Right. And so those are the only differences. But under the hood a lot of these algorithms are essentially identical to one another. Or conceptually they're identical, and the details vary a little bit.
And the conceptual thing that I want you to get is that you're combining together data from multiple sources. The way in which you combine the data together is you just sum up the link weights between pairs of genes, and you can adjust how much you believe one source by reweighting the links from that source, scaling them up, or scaling them all down if you don't trust the source. And finally, sometimes the way in which you combine the data sources together doesn't depend on the question you're asking, and sometimes it does. And so those are the context-dependent networks, the ones where the evidence weights you assign to each one of the sources depend on the question you're asking. So within a network, like in a co-expression data set, you would think things that are highly correlated should be more highly linked to one another. So within a network all the links have weights that tell you the relative confidence that you have that these two genes are linked together. Okay. So the networks themselves each have different weights between pairs of genes, but how much you believe a type of network, or a network derived from a single study, might vary from network to network. So the way in which you incorporate information about how much you believe a study is basically to take the weights that are in the network and scale them up or scale them down. Right? So let's say one is the weight of the most strongly linked pair of genes in the network. Right? And in the co-expression network those are genes that are perfectly co-expressed with one another. And then we want to take the evidence for functional linkage from a co-expression network and we want to combine it with evidence of functional linkage from a protein interaction network. In the protein interaction network let's just say all the links are 1 or 0: 1 means that the two proteins were co-purified and 0 means that they weren't, for example.
So now how do we combine the co-expression network with the protein interaction network? Well, we could just add the link weights together. So now we have link weights between 0 and 2. Right? So say you have a perfectly co-expressed pair of genes in the co-expression network, and they're also physically linked to one another somehow. Sum those link weights together and you get 2. Right? But maybe we trust protein-protein interaction studies more than we trust co-expression studies. So for the perfectly co-expressed pair of genes, maybe in the units of link weight in the protein interaction network it should be like 10% of those units. So we just scale all the co-expression links by 0.1. So now the strongest evidence we have in co-expression is a link of 0.1. The strongest evidence we have in protein interaction is a link of 1. We add them together and the highest value you can get is 1.1. Does that make sense? Yeah. Okay. So there are differences within the network in how strongly pairs of genes are linked. But then you can assign the network an overall weight that says how confident you are in this data about gene linkage. So really all the methods that you've ever seen that combine data from different sources use this same technique. They weight the data from different sources: they scale the link weights within a network according to the weight that that data gets. And those weights are decided in a variety of ways. One of the ways the weights are decided is by saying, okay, how well does this network reproduce what we already know about gene function? Like, if a pair of genes are linked in this network, do they have the same functional annotations in Gene Ontology? And then you can derive the network weight based on that comparison. Alright. So now once we have the network, the question is how we propagate information through the network. And that's what these two types of algorithms are for.
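The weighted combination just described (scale the co-expression links by 0.1, leave the protein interaction links at 1, and add them) can be sketched in a few lines. This is only an illustration under stated assumptions: the three-gene matrices, the function name `combine_networks`, and the dictionary layout are made up for the example and are not taken from any particular tool.

```python
import numpy as np

def combine_networks(networks, weights):
    """Weighted sum of link-weight matrices, one matrix per data source.

    `networks` maps a source name to a symmetric gene-by-gene matrix of
    link weights; `weights` maps the same names to how much we trust them.
    """
    first = next(iter(networks.values()))
    combined = np.zeros_like(first, dtype=float)
    for name, adjacency in networks.items():
        combined += weights[name] * adjacency
    return combined

# Hypothetical toy data for three genes (link weights between 0 and 1).
coexpression = np.array([[0.0, 1.0, 0.4],
                         [1.0, 0.0, 0.0],
                         [0.4, 0.0, 0.0]])
# In the protein interaction network links are 1 (co-purified) or 0.
protein_interaction = np.array([[0.0, 1.0, 0.0],
                                [1.0, 0.0, 1.0],
                                [0.0, 1.0, 0.0]])

networks = {"coexpression": coexpression,
            "protein_interaction": protein_interaction}
# Trust co-expression at 10% of the protein interaction units.
weights = {"coexpression": 0.1, "protein_interaction": 1.0}

combined = combine_networks(networks, weights)
print(combined)
```

Genes 0 and 1 are perfectly co-expressed and also physically linked, so their combined link is 0.1 + 1 = 1.1, the highest value possible under these weights. A context-specific scheme would choose the entries of `weights` per query instead of fixing them in advance.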
One is that we propagate information through the network by just looking at the other genes that a gene is linked to, and not considering indirect neighbors, right? So this node right here has no neighbors in the query list, it's not linked to any of these red genes, but it does seem closer than this node, right? Because there's a path to these genes, and this one actually has a bunch of paths and this one only has the one path. It goes like this. So you can look at things that are just directly linked to other nodes in the network that you've come up with, or you can try to propagate information through the network, and in this case what you're doing is essentially something like counting the number of weighted paths to the red nodes, and the score reflects that. Okay, but, you know, actually the way to think about it is that these are sources of heat, these are sinks of heat that also allow some heat to go out from the nodes, and you're just letting the heat propagate through the network and then finding out how much heat you get at each one of the nodes at the end, right? So here no heat gets over there because there's no link to any of the sources. Alright, I think you guys don't care about this stuff very much, that's what I'm getting, so I'm going to skip over this stuff and I'm going to start talking about the tool, but I'll go back and explain this stuff if you do. Okay, so what are the three parts of GeneMANIA? And these are true of gene recommender systems in general. So the first thing is a large, automatically updated collection of interaction networks. There were some questions yesterday about where we get this data on the interaction networks. Well, we collect them for you. Right, so we are kind of a network database where we compile information from a bunch of different sources. We also have a query algorithm to find genes and networks that are functionally associated with the query gene list.
So we analyze the gene list, if it's long enough, to tell you what networks have a lot of links among the query list, and we also analyze the query list to find out what genes are highly linked to it. And then we also provide this information to you in a network browser, which some of you have already seen, and if you fool around, there are extensive link-outs that take you to databases where you can find out more information about the source of the interactions or the genes that are being linked to. Okay, so what are our data sources? So we get co-expression information like this: we go to the Gene Expression Omnibus, and we download all the studies for the organisms that we're interested in that have a sufficient number of samples that we trust the co-expression measurements. I think it's like 12 for human, or maybe 20. We compute the co-expression networks from all those experiments and then we put them in our database. In terms of genetic and physical interactions, we use a source called iRefIndex. And what is iRefIndex?
iRefIndex is from this guy Ian Donaldson. You've heard about MINT today, and IntAct, and there are all these organizations. What they do is they annotate physical interactions that have either appeared in high-throughput studies or that are in what I call small-scale studies, so you have a paper which describes physical interactions between a couple of pairs of genes. Well, there are multiple independent organizations that are curating those papers all the time, and every once in a while they come up with a new physical interaction network, and those networks become available through a service called PSICQUIC. And then what Ian does is he takes all that information together and makes what's called the iRefIndex, where all the physical interactions are identified by the data source that they came from. So there's a lot of information available. So we don't download all of that ourselves: he compiles the physical interaction data together, we download what he generates, and then we split it up by data source. We also have predicted interactions, so these are things that are called interologs. So what's an interolog? Anyone know? Okay, so if you find out that two proteins physically interact in mouse, do you think they interact in human?
So what an interolog is, is an interaction among orthologs, right? So if you have a pair of orthologous genes in another organism, and you find out that in the one organism, let's say mouse, they interact, and you have two identical orthologs in human, you have strong evidence that they interact in human. Well, those are predicted interactions, right? Those weren't directly measured in human, so you have to incorporate information about whether or not the gene is duplicated, how well conserved it is, and possibly some other side information, like whether they are expressed under the same conditions, since that can change as well. And so we use the I2D database to identify interologs, and we put those interactions in the predicted category. A great way of figuring out what the function of a gene is, is to see what protein domains it has. So if you find out that two genes have the same set of protein domains, that's often very strong evidence that at least some part of their function is shared. So we generate networks based on how many protein domains a pair of genes share and how common those protein domains are, and we call them the shared protein domain networks, and we use InterPro to get information about protein domains. We also get information about synthetic genetic interactions among genes from the BioGRID database. We have pathway information from a variety of databases, including Reactome. And we have this new type of data called attributes. What is an attribute? So far I've been talking to you about networks where you have links between pairs of genes, but genes also have attributes. Attributes are like annotations that have been assigned to a gene, something like this sort of shared protein domain, so you can say the presence of a protein domain is an annotation. But there are other annotations that a gene can have: for example, it can be predicted to be a target of a microRNA, it can be predicted to interact with a drug, or it can be assigned to a specific pathway. And so we say that two genes are
likely to share function if they have similar types of attributes. Okay, so that's all our network data; those are all the types of things that we use for gene function prediction. By default we don't include the attribute information, because sometimes that can be a bit circular. And by default we only include 20 co-expression networks. Because there are so many co-expression studies, we get a lot of network data from co-expression, and when we include a lot of networks in our analysis, things slow down a little bit. We feel like a lot of the co-expression data is very redundant, and because it slows things down so much, we only include the top 20 most informative co-expression networks by default, but you can turn them all on if you want. Okay, and we have some organism-specific databases, because some organisms are well represented by their model organism databases and some are less so. But if you click around you can see what's available, just by clicking these things and seeing what networks come up. So the other thing that we try to do... This shows me the slide that I'm on and not the slide that I'm going to. It's not useful at all, because I can see the slide that I'm on. What?
Okay, right, and so that's what I thought. So we talked earlier about this problem that genes have multiple identifiers, and so the way in which we designed this website is that we want to make it as easy as possible to deal with, so we're trying to solve all those problems for you. So you put in a gene list, and we try to figure out what gene you really mean, and if we can figure out what gene you really mean, we just go forward and we analyze that gene. Now, sometimes the gene identifier that you use to describe your gene is not unique, and in a lot of cases there are gene identifiers that aren't unique. And let me tell you, when you do eight different organisms you find a lot of weird special cases. So in fly there are gene identifiers that vary only by capitalization, so small-s "smg" is a gene, big-S "Smg" is a separate gene, and they're not linked in function at all, right? Those are two totally different genes. So we get all the gene ID mappings that we can from Ensembl and Ensembl Plants, and then we go through and we remove identifiers that aren't unique or that don't correspond to protein-coding genes at this point in time. So we'll identify most of the gene identifiers you put in. Some of them we won't get, because they're not unique or because they've been annotated in some weird way in the database. So to address that problem you sometimes have to go through and update things a little bit, or you could just say, well, you know, we got 95% of the identifiers, that's pretty good. Oh yeah, and we have gene annotations, and I'll explain about that in a second, but basically we do functional enrichment analysis on your gene list and all the other genes that we pulled up as being functionally linked to them, and we report that information. Those gene annotations come from Gene Ontology, and there's a variety of model organism databases that assign those annotations, including GOA. Okay, and these are the gene identifiers that we identify. Like I said, we're doing all unique identifiers, and
certainly these are the identifiers that we're pretty good at recognizing. If you give us some synonyms, sometimes we get them, especially if they're unique, and organism-specific names we often get as well. We might be a little bit out of date, because to get these gene identifier mappings we have to download the Ensembl database. The Ensembl database is really big, so we only do it about once every three months or so. So if you have a new gene, you might have to look around for the gene identifier that it corresponds to. Okay, so right now we cover 8 organisms, we have on the order of 2000 networks, and we have a web network browser. We also have a plugin that you guys have all downloaded. The exercise that we're going to do today is just based on the network browser, because it's the easiest thing to interact with. You saw that sometimes it can be a bit tricky to interact with Cytoscape, and the idea is that what I want you to learn, you can get through the network browser. Okay, so what are the differences with our Cytoscape plugin? It has all the same functionality, and you can use it to get access to older GeneMANIA data releases. Once or twice a year we update our network database, and when we update our network database, the analysis that you did might change, right? You get a different answer because you're looking at different networks. We think this is a problem: if you publish a paper using an analysis on an earlier database, you want people to be able to reproduce that, but we can't make that type of information available to you through the web all the time. So the solution that we've come up with is that we have each one of our old network releases available through the Cytoscape plugin. So even though you can't reproduce on the website an analysis that you did using an older network database, you can reproduce it using Cytoscape. And the way in which we add more networks is that we add more networks when it improves functionality, right? So we're not trying
to break things by giving you more networks; we're trying to make things better. So if you get one answer from an analysis using an earlier network database, you should get a better answer if you use a later network database, but, you know, your reviewers might want to reproduce what you've done. The other thing that you can do with the Cytoscape plugin, and it's a bit complicated but you can do it, is you can add new organisms by yourself, right? And I can explain how you add new organisms, but basically you have to give us a database that tells us how to map between all the different gene identifiers, and then you have to provide networks to us, and then you can do the same GeneMANIA analysis that you would do on the website but with the new organism. And you can integrate our networks with other Cytoscape analyses, and I'll show you some of that today. And then, the website uses our servers, so when you put in a long query list, it's a lot of computation for us to do, and we're giving it to you for free, so we restrict the length of the query list that you can put in. It's not that things get so much slower necessarily on our side, they do a little bit, but they get really slow on your side, because when you have the network browser, you're downloading something from us that has a lot of interactions between genes, and our network browser is implemented using JavaScript, which is a little bit slow. Sometimes when you get a network that has like 200 genes in it, it moves really slowly if you try to move things around using the web network browser. Now, I just saw the alpha release, and within hopefully the next six months we're going to have a brand new network browser available through our website, which should allow you to use networks that have a lot more genes in them. But I've just seen the first version of this now. And the other good thing is, because it's HTML5, you'll be able to use it on an iPad or on your phone as well. You can't do that right now with GeneMANIA; you have to use your
computer. With the Cytoscape plugin that you can download, there's no restriction: you can put in as big a gene list as you want, and if your computer can handle it, your computer handles it. So the one thing to tell you about the Cytoscape plugin, and I think some of you encountered this when you downloaded the plugin, is that there are two parts to the plugin. The first part is the code itself, but there's also the network database. We're trying to compile all the network data that's readily available, and that network data is actually two gigs; it's a lot of information that's available for every one of the organisms. So it will take a little while for you to download the network database once you get the Cytoscape plugin downloaded, but about half an hour in general is how long it takes on a good link. Okay, and so I don't know what kind of advanced analysis you want to do, but we also have a tool called Query Runner. And so what does Query Runner do? Query Runner does gene function prediction, but kind of in an offline manner. So what do I mean by that? So say you say, okay, I want to predict genes that should be assigned to all the GO annotation categories, and there are 20,000 of those, right? So that would be 20,000 queries in Cytoscape. So we have a command-line tool that allows you to do that all at once. And there's a type of analysis that we like to do where we assess the added predictive value of new data, right? So you have a new genetic interaction network and you want to say, wow, how much does this genetic interaction network add to the total knowledge that we have about gene function? Like, how much have we captured gene function as we know it using this interaction network, right? And so the way that we would do that analysis is that we would try to predict all the functions in Gene Ontology using the networks that are currently available, see how well we do, and then
we'll add this new network to that background and see how much better our predictions get, right? And so we love this type of analysis, and we've designed a tool so we can do it, and that's what this tool called Query Runner is. Alright, so STRING. STRING is our main competitor. We are STRING's main competitor, I think, would be the better way of saying it, because they're more popular than we are. And this is the front page of their website, and the website looks very similar to ours. You can put in a protein name. There's one major difference that you can see between our website and STRING: it says protein name, because STRING is very focused on proteins rather than genes, right? And they will do some model organisms, but you can also put in hundreds of different organisms into STRING, right? You can put in a single name or you can put in multiple names, so you can answer that "what does my gene do" or "find me more genes like this" question. You can also search by protein sequence or multiple sequences. And STRING also combines together different types of data sources, and there's a lot of information available here about STRING. Okay, and so here are the results of doing a query in STRING. Their results look very similar to ours; they actually now use our network display tool. But it's the same type of idea, so the nodes correspond to proteins in this case, and the color of the edges corresponds to the different types of data that link these proteins together. And something really cool about STRING is that if you have a structure for the protein, you can click on the node and then see the structure, right? And so then here are the predicted functional partners, in this case of RAD51, and these are the types of data that link these things together, and this score is some estimate of how likely this gene is to share a function, or a part of its function, with this RAD51 gene. Okay, so then here's a comparison. We can go through this, but you can
read it by yourself. I mean, the main thing is that we're focused on genes. We do have thousands of networks, but we don't pre-compute the weights; we do it in a kind of online manner. And you can actually upload your own network, so if you have network data that links genes together in your organism that's not represented in our databases, you can upload it and then use it to analyze your network. Also, we rely on different types of functional genomics data, so we have genetic interactions, which are a bit hard to measure among proteins. We also have thematic information and chemical interactions. Okay, so functional interaction network, I think we covered that. Guilt by association, we covered. Gene recommender system, you guys know what it is now. Direct interaction and label propagation, I spent a long time on that, I think you guys got it. Be able to use a gene recommender system to answer two types of questions, and be able to select the appropriate network weighting scheme to answer your questions about gene function. I'm going to be around for a while, so if we didn't quite get through that point you can ask me about it. All right, questions? So, direct interaction versus label propagation. Okay. Okay, so this here is like a false-color scale for the score, okay, so it goes from high score to low score. Okay, so these are the positive examples, these are the query list, and these are the genes you want to find more genes like, right? These gray links indicate non-zero links in the network, so these are edges between the genes. So here in this cartoon I'm just giving all these edges the same weight, but the scoring does depend upon the weight of the edges. And so there are two ways of scoring the unscored genes here, right? The first way of doing it is, for every gene you look at its neighbors, its immediate neighbors, and you see whether or not they were in the query list, and the score depends upon what proportion of them were in the query list. Okay, so here this gene has no neighbors in
the query list, so its score is zero. This gene has no neighbors in the query list, so its score is zero. This gene has one of its two neighbors in the query list, so it has a non-zero score. This gene has two of its three neighbors in the query list, so it has a non-zero score. As a simple example, let's say that this one gets a score of a half, because that's the proportion of its neighbors in the query list. And this gets a score of two-thirds, because that's the proportion of its neighbors in the query list. But in general, the score is going to depend on how strong the link is. So if this link has strength one, this link has strength 0.5, and this link has strength 0.5, then this gene should get a score of, let's say, one-half, sorry, three-quarters, right? Because the total linkage to the things in the query list is 1.5, out of a total linkage of 2, which is 1 plus 0.5 plus 0.5. That description of the algorithm is conceptually identical to every algorithm that uses direct interaction. It just looks at the direct neighbors of a gene. STRING does this as well, yeah. It's four genes, yeah. Yeah. So this is an organism that has 11 genes in its genome, four of which we know the function for. Yeah. So that's direct interaction. Now, there are different ways to combine together the information about the number of your direct neighbors that are in the query list. But essentially, it's the same idea in all these algorithms. The thing that's different about a label propagation algorithm is that genes that are not directly connected to the query list can still get non-zero scores. And the way that works, the easiest way to describe it, is that you go to every gene in the network in turn that hasn't been scored, and you essentially implement direct interaction for that gene to give it a score. And then you use that score as the score of the gene when you look at its neighbors. So let me be a bit more clear about what I mean.
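A minimal sketch of the weighted direct-interaction score from the worked example above, assuming the network is stored as a dictionary of neighbor weights. The gene names (`q1`, `q2`, `x`, `g`, `isolated`) and the function name are hypothetical, chosen only to mirror the cartoon.

```python
def direct_interaction_score(gene, links, query):
    """Fraction of a gene's total link weight that goes to query genes.

    `links` maps each gene to a dict of {neighbor: link weight};
    `query` is the set of genes in the query list.
    """
    neighbors = links.get(gene, {})
    total = sum(neighbors.values())
    if total == 0:
        return 0.0  # no neighbors at all, so no evidence
    to_query = sum(w for n, w in neighbors.items() if n in query)
    return to_query / total

# Hypothetical gene "g" with three neighbors, two of them in the query
# list; link strengths 1, 0.5 and 0.5 as in the example.
links = {
    "g": {"q1": 1.0, "q2": 0.5, "x": 0.5},
    "isolated": {},
}
query = {"q1", "q2"}

weighted = direct_interaction_score("g", links, query)  # 1.5 / 2

# With every edge given the same weight, the score reduces to the
# proportion of neighbors that are in the query list (2 of 3).
equal_links = {"g": {"q1": 1.0, "q2": 1.0, "x": 1.0}}
unweighted = direct_interaction_score("g", equal_links, query)

print(weighted, unweighted)
```

Only immediate neighbors are consulted, so a gene two steps away from the query list scores zero under this scheme no matter how many paths lead to it.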
Let's say these genes in the query list, they get a score of one. Okay. And, you know, genes not on the query list start off with a score of zero. Okay. So now we choose this gene. We're going to calculate the score. I said the score is two-thirds. This gene gets a score of one-half. So now when we look at this gene, its score is going to be the average of the scores of its neighbors. So what's the average of one-half and two-thirds? Seven-twelfths? Is that right? It's something between one-half and two-thirds. Okay. And then this would get a score of like a half. Okay. And then these things still get a score of zero. So now we have an initial score for these genes. Now when we go through them again, we're going to update that score. And we continue updating each score by just taking the average of its neighbors until those scores stabilize, so they don't change anymore. And the way the algorithm is set up, the scores are guaranteed to stabilize at some point. And not only are they guaranteed to stabilize, using a little bit of linear algebra you could actually figure out what their stable values are. Right? So you don't have to do these updates. You just have to solve a linear system of equations. So that's what label propagation does. You can think of it like this: because you're iteratively updating these scores, information about the labels propagates through the network. But for these three genes down here, which don't have any links to the genes in the non-zero part of the network, there's no way that they can get a score that's not zero. You can also get this feeling that the score decreases as you move away from these positive genes. Well, since I developed the label propagation algorithm, I have a strong opinion about this. So in general, you don't get to choose, because each gene recommender system uses one of these things.
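The iterative updating and the "solve a linear system" shortcut can both be sketched on a toy network. This is a simplified illustration, not the exact formulation used by any particular tool. Query genes are held at a score of one, every other gene is repeatedly set to a damped weighted average of its neighbors, and the `loss` term in the denominator is an assumed stand-in for the loss at each node in the heat analogy. Without some loss of that kind, plain averaging would eventually give every gene connected to the query list a score of one. The function names and the four-gene network are made up for the example.

```python
import numpy as np

def propagate_labels(W, query_idx, loss=1.0, tol=1e-10, max_iter=10_000):
    """Iteratively update scores until they stop changing."""
    n = W.shape[0]
    is_query = np.zeros(n, dtype=bool)
    is_query[query_idx] = True
    scores = is_query.astype(float)      # query genes 1, all others 0
    degree = W.sum(axis=1)
    for _ in range(max_iter):
        # Damped weighted average of neighbor scores for every gene.
        updated = (W @ scores) / (degree + loss)
        updated[is_query] = 1.0          # query genes stay fixed at 1
        if np.max(np.abs(updated - scores)) < tol:
            return updated
        scores = updated
    return scores

def solve_labels(W, query_idx, loss=1.0):
    """Same stable scores, obtained by solving one linear system."""
    n = W.shape[0]
    is_query = np.zeros(n, dtype=bool)
    is_query[query_idx] = True
    u = ~is_query                        # the unlabeled genes
    # (D_u + loss*I - W_uu) s_u = total link weight from u to query genes
    A = np.diag(W[u].sum(axis=1) + loss) - W[np.ix_(u, u)]
    b = W[np.ix_(u, is_query)].sum(axis=1)
    scores = np.ones(n)
    scores[u] = np.linalg.solve(A, b)
    return scores

# Hypothetical network: gene 0 is the query, 0 - 1 - 2 form a chain,
# and gene 3 has no links at all.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])

iterated = propagate_labels(W, [0])
solved = solve_labels(W, [0])
print(iterated, solved)
```

On this chain the score falls off with distance from the query gene, and the unconnected gene stays at zero, which matches the behavior described for the cartoon. The value chosen for `loss` controls how quickly the scores decay.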
Sorry, can you speak up? Do we trust our predictions? Well, we have benchmarked them extensively. And actually, every time we come up with a new network database, we re-benchmark all the predictions that we do. And our ability to recover gene function does improve every time we add new networks. Yeah. So we benchmark our predictions by saying, okay, what if you gave us a subset of the genes annotated to this GO functional category? How well could we recover the other genes that were in it? So, what, sorry? Yeah, all of our GeneMANIA publications have this type of benchmarking. And so our initial publication, which describes the algorithm, came out in 2008. And then we've had two database publications in NAR that describe the web interface in various versions of it. And then all the other tools that I told you about, the Query Runner tool and the Cytoscape plugin tool, they have their own publications. And then we've also developed new versions of our algorithm to integrate different networks together, and there's a Bioinformatics publication about that as well. And each one of those contains benchmarking on how well we're doing. That's true. That's not intentional. I've been using these slides for about four years. So you've got a pretty good eye. All right. This is our cartoon, obviously. Nothing. Probably what happened is this edge didn't get moved to where it was supposed to be, and it got selected or something.