So, today I'm going to talk to you about a tool that I developed with Gary Bader that we think is very useful, and about the general topic of what I'm going to call gene function prediction. That idea encompasses a lot of different ideas, so we're going to break it down into two different use cases. I also want to introduce some network concepts that are important in gene function prediction, which Lincoln might have touched upon but which I'm going to go into in a little more detail.
All right, these are our learning objectives today. Really, I want you to understand three basic concepts. One is what's called a functional interaction network. Has that come up yet? Has Gary talked about that, or Lincoln? We've talked about interaction networks, so I'm going to return to that concept with functional interaction networks. The second is guilt by association. Has that concept come up yet? Okay, great. So I'll talk about guilt by association as a way of trying to infer gene function, and then about gene recommender systems.
The basic overall goal here is to answer two questions. The first is: some gene came up in my screen, so what does this gene do? Sometimes you can just go look it up in some sort of database, but for most of the genes in the human genome, there's not much known about them, or there are just a few things known about them, maybe their biochemical function. But can you take all the existing data that people have been collecting in genomics and functional genomics, all the interaction networks and all the expression data, and say something a little bit more about the gene's function, beyond the handful of papers by someone who decided to look at that gene in more detail? That's what we're going to be talking about today: how to use that existing database of oodles and oodles of genomics data to try to get at something about the gene's function.
And then the other type of question that we're going to try to answer is this: you come up with a list, hopefully a short list, of genes that show up in your functional screen or whatever you're doing to establish a gene list. Can you say something a little bit more about that gene list? What I talked about before was: can you find pathways that are abundant in that gene list? What I'm hoping to do today, with a tool like GeneMANIA, is to possibly answer the question of why that list came up. What is the overall theme that connects all these genes together? Both of those concepts fit into this general model of gene function prediction.
So I'm going to talk about something called context-specific network schemes. I think I'm going to change my slides; that's kind of a complicated term. But you'll see what I mean by that in a second. And then there are a lot of concepts involved in taking a network, finding where the genes that you're interested in are in the network, looking around in the network for other nearby genes, and saying something about the gene that you're interested in using that network, right? There are two ideas here: do you use just the genes that are directly connected to that gene to say something about your gene of interest, or do you get some more general property of the locality of the gene? That's direct interaction versus label propagation. I'll make those ideas clearer as I go through this talk.
And I want you to be able to use gene recommender systems like GeneMANIA, though there are others which I will also talk about, to answer two types of questions about gene function: "What does my gene do?" and "Give me more genes like these." Okay, great. So here's the outline. Wow, I think I might trip. Okay. So we'll get right into functional interaction networks.
We're going to talk about the concepts, then we're going to talk about some more theory, and then I'll start to show you GeneMANIA, which is a fun thing to play around with. And then we'll also talk about our competitor, STRING, and other functional interaction networks in general.
Okay. So I introduced this idea to you. Here you are: your government has spent billions of dollars collecting functional genomics data. How do you make use of that in telling you something about the genes that you're interested in? Say there's one gene you're interested in. Say there's a list of genes that you're interested in. How can you make use of all that data that's out there? The reason it's not easy to make use of is that the data is in different formats. It's not entirely clear what a genetic interaction network might be, or what it means if two genes are co-expressed. So how do you resolve all of these issues to get the best use of the data that you have? The first step to making use of that data is trying to put it into some sort of consistent format, so that the different data types can be queried in the same way, right? And that's where this idea of the functional interaction network comes from.
Have you seen this before? The functional interaction network contains nodes that correspond to genes. I'm always going to look at this screen because I'm left-handed. Okay, so these nodes here correspond to genes, and the links between genes say something about the degree of functional interaction. It's kind of hard to explain what they mean. It's like saying: how likely are these two genes to have the same function, or to share some aspect of their function? And which aspect of their function you're looking at depends on the data type that you used to generate the network in the first place, okay?
It's just a very abstract notion of whether these genes are functionally linked in some way, right? And I think you've seen a lot of different networks. You've seen protein-protein interaction networks: are these genes physically linked, or are they likely to be in the same protein complex? Here what I'm showing is a co-expression network. So let's take a set of different microarrays that measure, in this case, yeast gene expression under a variety of different conditions. This is pretty early stuff. And let's use this heat map to show which genes are up- and down-regulated under which conditions; each row corresponds to a gene. You can see there are groups of genes that have very similar patterns of expression across those conditions. When that happens, that's called co-expression, and it provides some evidence that those genes are involved in the same pathway or share some similarity of function, okay? So you can make a functional interaction network out of an expression study by computing some measure of the correlation or the similarity of those profiles, and then translating that similarity into the weight on the edge between each pair of genes.
So why would doing that be valuable? Well, in this original study, which was probably one of the first large microarray studies, all they did was take what's called the expression profile, so the relative expression of the genes across all the different conditions, and sort the genes in the list so that genes with similar expression profiles were near each other, using hierarchical clustering. And lo and behold, when they wrote down the function of those genes (you can't read the text here because it's pixelated; these were the 90s, I guess we had fewer pixels back then), you see that they all have the same function. I mean, at least the text looks similar.
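To make the step from expression profiles to a network concrete, here is a minimal sketch in Python. It assumes Pearson correlation as the similarity measure and a simple cutoff (`min_corr`) for deciding which pairs get an edge; the gene names, the expression values, and the cutoff are all made up for illustration and are not taken from the study shown on the slide.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_network(profiles, min_corr=0.5):
    """Return {(gene_a, gene_b): weight} for pairs correlated at or above min_corr."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= min_corr:
            edges[(a, b)] = r  # the correlation becomes the edge weight
    return edges


# Toy data: relative expression of four hypothetical genes in five conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # same pattern as geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],   # opposite pattern
    "geneD": [1.1, 2.2, 2.9, 4.1, 4.8],   # close to geneA
}
network = coexpression_network(profiles)
```

In this toy example, geneA, geneB and geneD end up linked to one another, while geneC, whose profile runs the opposite way, gets no edges.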
Why is that valuable? Well, not every gene has some assigned function, right? But if you find some unknown gene that's highly linked to all these genes that have some cell cycle function, and it's co-expressed with those cell cycle genes, that provides you some evidence that that gene in fact shares that function. And this is what we call guilt by association, right? We don't know what the gene does, but it's associated with a lot of other genes that have a particular function. So that's evidence that the gene has that function. That might not work in a court of law, but it works here, right? So if this association is strong enough, we have pretty compelling evidence that this gene is involved in cell cycle and this gene is involved in protein degradation.
Yeah. So, in this case, once you have the network, you can use one of these network layout algorithms to locate the nodes in some sort of two-dimensional space. And the idea with many of these network layout algorithms, especially the force-directed ones (Veronica, I don't know which ones you got to talk about), is that they put genes that have strong links close together on the screen. So when you look at it, you can kind of see that this unknown gene is close to here because it has strong links with the genes there. Does that make sense? Okay, great.
So that's the basic idea of what a functional interaction network is. You can directly measure functional interaction, for example in the protein-protein interaction networks that I talked about earlier, or you can infer functional interaction based on co-expression. That's all listed here. So, there are the directly measured interactions. Those are protein interaction networks.
And here in Toronto, because I'm in the Donnelly Centre, and Brenda Andrews and Charlie Boone do a lot of work here, we know a lot about synthetic lethal, or pairwise genetic, interactions between genes. So that's another type of directly measured interaction that you can come up with. Or you can infer interactions from single data sources, like the co-expression networks that I've been talking about.
You can also get networks by combining interactions from multiple data sources, right? If you see a pair of genes, and that pair physically interacts, so they're in the same complex, and they're also co-expressed with one another, that might give you a stronger belief that those genes are functionally associated. So one way you can think about that is: you take all the networks that you have, and then for every pair of genes you come up with a composite network. Mathematicians say edges, other people say links, so I'm going to go between link and edge because I'm kind of a mathematician. In the composite network, the link weight, or the edge weight, for a pair of genes would be the sum of that pair's edge weights across all the networks. That's one way of combining networks together, and I'll talk about different ways of doing it.
Okay. So now we have this network, or these networks. What are we going to do with them? There are two things you can do, or two things that we and other people make available to you. The first question is: what does my gene do? I've got some vague description, potentially, of what the gene does, based on the fact that it has a conserved protein domain that I know has a particular enzymatic or biochemical function. But that doesn't necessarily tell me what biological pathway it's a part of, right? And I want to expand on that.
Some genes don't even have protein domains that make any sense to anybody, or they have protein domains that aren't necessarily associated with a particular function. So, what does my gene do? One way of trying to answer that question is to take the gene, look at all the networks that you have for that gene, find who its interactors are, that is, which genes are functionally associated with it, and ask what their function is. And then just use guilt by association to say: well, I'm not quite sure what this gene does, but it's co-expressed with these five genes, and I know that they're involved in RNA splicing, and it's linked with these six genes that are also members of the spliceosome. So I'm pretty sure that this gene is involved in RNA splicing, or at least I have some indirect evidence of that.
The other type of function prediction is "give me more genes like these". So what does that mean? Well, say you're trying to set up a smaller-scale screen, and you're trying to find more genes in the Wnt signaling pathway, which is something I did a few years ago. You come up with the list of all the genes that you know of in the Wnt signaling pathway, and you ask: what other genes are functionally associated with these ones? And then you take your assay and see whether knocking out or over-expressing those genes has any impact on Wnt signaling, right?
That's a different type of question. And the reason it's a different type of question is that when I give you a list of genes, you're actually able to say more about which networks are relevant. If I just give you a single gene, you don't necessarily know which networks are relevant to answering the question of what this gene does. But if I give you a list of genes, you can ask: well, what networks are these genes linked together in?
Those are probably the important networks for figuring out what links this list of genes together. I'll make that idea a bit clearer in later slides.
So, how do you answer the question of what my gene does? Your input is all the networks that you're interested in, plus all the profile data that you're going to turn into a network. And then you have a query list, which here is the gene that you care about. You plug those two things into what I'm going to call a gene recommender system. It's kind of like Amazon: you read this book, so these are the other books that you might like. If you like this gene, maybe you should consider these genes. The output is the gene and, in this case, the 20 genes that it's most highly linked to in the networks that you put in as input. What I've done here is take a picture from the GeneMANIA interface, where you have a way of coloring genes based on their assigned GO function. I just colored these genes in, and you can see that a lot of the genes that this central gene, CDC42, is linked to also have a similar function, based on the fact that their nodes are colored the same way.
So that's one thing that a gene recommender system like GeneMANIA or STRING or others can tell you. The networks that you use there are all the ones that I've referred to before. We and other people maintain large network databases, so that you just type your gene in and then you can choose which networks you're interested in.
One thing that we provide that is much rarer is what I'm going to call a context-dependent network. So what's a context-dependent network? Say you want to ask the question: what does p53 do? Well, it kind of depends on what type of question you're asking, right? Are you asking about what biological process it's involved in? Are you asking about what its molecular function is, where it lives in the cell, what its regulatory targets are?
Describing the function of a gene requires answering a lot of these questions, and there are different answers depending upon which question you're asking. And as you might imagine, some of the data that gets collected is better at answering some of these questions than others, right? So when you're asking these questions, you actually want to be using the right data. And unless you're an expert in all the functional genomics data, you might not know right off the top of your head what's relevant. Maybe you do. Great. But ideally, you'd like the gene recommender system to figure out what question you're asking and then answer that question.
One of the first times I talked about this concept, I was giving a talk in Memphis. Or Nashville. Oh, sorry, Memphis, whatever. Memphis and Nashville are two cities; Memphis is a city and Tennessee is a state, right? But what is Memphis? Well, if you want more things like Memphis, it depends what you're asking. Memphis is a city in Tennessee, but Memphis is also a city in Egypt, right? There used to be a now-defunct tool from Google called Google Sets, where you'd give it a list of things and Google would give you more things like that list. That's gone now. But if you just put in Memphis, Google doesn't know what you're talking about. If you say Memphis and Knoxville and Nashville, well, Google's like: okay, you're talking about cities in Tennessee, I'll give you more cities in Tennessee. But if you put in Memphis and Alexandria and Cairo, you get more cities that are in Egypt, right? So by giving lists of things, you can answer the question "what am I asking?" And that's what "give me more genes like these" does, right? And so, oops, right?
So if you put in a list and you put in the network data, what a context-dependent gene recommender system is going to do is choose which networks are relevant to answering questions about that list, as I said before. And then in the output, it gives you not only the networks, but also more genes like those in the list. The networks that it looks in to find those genes are the ones that already link the genes in this list closely together, okay?
This is an image from the GeneMANIA interface. And I think at this point I usually press escape and show you GeneMANIA. Okay, so this is what I've done here. You can go to genemania.org. You can also wait until you do the assignment. What I've done here is just put in the list of genes. So let's start off with... I hope CDC42 is a... oh, good, it is a human gene. Okay, so you can choose one of nine different organisms now to look at, and then you just type in the names. And if you don't have a gene that you care about, you can just press "e.g.", which gives you a list of genes that are good ones. And then you just press the button.
What this is doing is answering the question "what does my gene do?" It takes this gene. It takes a pre-specified set of networks that we think are important for answering most questions about the biological function of genes. It combines them into a composite network by adding together the edge weights. And then it finds the 20 most highly associated genes, so these are the ones that have the strongest links to CDC42. Then if you click down here, it tells you all the Gene Ontology functions that are enriched in this gene list of 21 genes. We give the FDR and the coverage. What the coverage means is: there are 111 genes in the genome that have this function, seven of which are represented here.
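As a rough illustration of the coverage figure and of what "enriched" means, here is a short Python sketch. It uses a hypergeometric tail probability, which is one common way to score enrichment; whether GeneMANIA computes its statistic exactly this way is not stated in the talk. The background size of 20,000 genes is an assumption made purely for illustration, and the multiple-testing correction that produces an FDR is not shown.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in a random
    sample of `sample` genes drawn from `population` genes, of which
    `annotated` carry the function of interest."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Numbers from the example: 111 genes carry the function, the result list has
# 21 genes (the query plus 20 related genes), and 7 of those carry the function.
ASSUMED_GENOME_SIZE = 20_000  # hypothetical background size, not from the talk

coverage = 7 / 111
p_value = hypergeom_enrichment_p(ASSUMED_GENOME_SIZE, 111, 21, 7)
print(f"coverage = 7/111 = {coverage:.3f}, p = {p_value:.2e}")
```

With a background that large, seeing 7 annotated genes among 21 by chance is extremely unlikely, which is why a function with such modest coverage can still come out as strongly enriched.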
And then just by clicking on these things, you can color the map. So CDC42 is involved in the regulation of cell projection assembly, and it's involved in the positive regulation of cell projection organization. Yes, those functions are very similar, but you already know about this problem, and that's why you learned to use enrichment maps. Right, okay, great.
The other thing you can do... unfortunately I have now erased the list that I so carefully typed in at the beginning, so I'm just going to take the example list. So now I have a list of genes. Now look over here: if you can see the bars, those tell you the different types of evidence that were used to link the genes together.
So let's take a step back and go back to CDC42. Remember, because I'm only asking about one gene, it doesn't really know what question I'm asking, so it's just going to use a default list of networks to try to find other things that interact with CDC42. Most of those networks are physical interaction networks. If you click on this arrow, it tells you all the physical interaction networks that we use. Basically, we download all the large network databases that have been published up until the point at which we refresh our data. To date, we have about 2,500 networks in this. Not all of them are for human, but many of them are. A lot of them, about half of them, are co-expression networks, each from a co-expression study. So click here and you get all the networks that were included; click on a network and you get the publication that presented it, in this case the co-expression study, and you can link out to it by pressing that. We also have some way of labeling what the network is, and then you can link out from there as well. The way we get the network data is that we download it from the Gene Expression Omnibus, so you can link out to that if you want more information. Yeah?
So that's how much weight we give to the network. I'll talk about this a little bit later, but one way to come up with a composite network, so to combine together all this network data, is to just add up all the edge weights between each pair of genes. But if you think some types of information should be weighted differently than others, then instead of just adding up all the edge weights, you can scale the edge weights in one network relative to another. So if you think one network is 10 times more important, you can scale it up by 10, right? The percentages that we give are how we're scaling the networks. They add up to 100%, and each one is how much overall weight we give that network.
Now, how do we figure out these weights? Well, we figure them out by asking: what would be the best set of weights if we were trying to recover everything that's known about the biological function of genes, as it's encoded in the Gene Ontology database? So you take the Gene Ontology database, you go across all the categories, and you look at all the genes that are in the same pathways. And you say: what I want to do is strengthen the links between genes in the same pathway as much as possible, and weaken the links between genes in a pathway and genes that aren't in that pathway. How should I weight the networks? What's the best weighting to maximize the difference between the within-pathway weights and the weights between genes inside and outside the pathway? Does that make sense? Ideally, I'd like all the genes in a pathway to be linked together and not to be linked to any of the genes that aren't in the pathway.
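The scaled sum described here can be sketched in a few lines of Python. This only shows how a composite network is formed once the weights are known; it does not show how the weights themselves are chosen. The network names, the gene names (`g1`, `g2`, `g3`), and all the numbers are hypothetical.

```python
def normalize(raw_weights):
    """Scale raw network weights so they sum to 1 (percentages adding up to 100%)."""
    total = sum(raw_weights.values())
    return {name: w / total for name, w in raw_weights.items()}


def composite_network(networks, weights):
    """Sum each gene pair's edge weights across networks, scaling every
    network's edges by that network's weight."""
    combined = {}
    for name, edges in networks.items():
        for (a, b), edge_weight in edges.items():
            pair = tuple(sorted((a, b)))  # treat edges as undirected
            combined[pair] = combined.get(pair, 0.0) + weights[name] * edge_weight
    return combined


# Two toy networks over three hypothetical genes.
networks = {
    "physical": {("g1", "g2"): 1.0, ("g1", "g3"): 0.5},
    "coexpression": {("g2", "g1"): 0.4, ("g2", "g3"): 0.8},
}

# Treat the physical network as 10 times more important than co-expression.
weights = normalize({"physical": 10.0, "coexpression": 1.0})
combined = composite_network(networks, weights)

for pair, weight in sorted(combined.items()):
    print(pair, round(weight, 3))
```

Setting every raw weight to the same value recovers the plain "add up all the edge weights" composite, up to a constant factor.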
Now, obviously, I'm not going to be able to achieve that, but I want to get as close to that goal as possible by coming up with some weighting of the networks, and that is how those weights were determined. So they're determined based on Gene Ontology. I'll show you later on how to adjust the network weights if you want to weight networks in a different way.
Now, on the other hand, just look at the way this looks now, right? Big red bar, medium purple bar... actually, you know what? I can just open up another GeneMANIA instance. Okay, oh, good. Now here, let's go back to that list that I so carefully typed in before. I pressed go, and you can see that the physical interactions are much less important for this list, right? Now co-expression gets much more weight.
What happened here is that I have a long list of genes, so I ask the same question that I asked for the other ones. For the other ones I said: let's take all the known pathways and try to make the weights between members of the same pathway as high as possible, and the weights between members of different pathways as low as possible. Now I have a long list, and that's like a new pathway. So I say: let's try to make the weights within this list as high as possible, while making the weights between genes on this list and genes not on this list as low as possible, okay? That procedure tells you which networks specifically link the list together. Does that make sense? Yeah, okay. So this is context-dependent weighting, because it depends upon what genes you put in, and this is context-independent: it doesn't matter what gene you put in, it's always going to use the same weights.
Yeah. So we have a website and we have a Cytoscape plugin.
The issue with the website is that we don't want people to put really big gene lists in. That's not because our servers can't handle the inference. It's because what you see here, which I'm playing around with online, is basically a mini Cytoscape, and that whole object has got to be sent back to you when you use the website. So if you put in a long list of genes, it takes a long time for that object to get sent back to you, because it requires a lot of bandwidth. So we do put a restriction on it. I think the restriction is about 100, but I'm not entirely sure; I can't remember. But if you have a really long gene list, just pump it in and see what happens. You almost certainly won't crash our servers. They're fairly robust, though this is a new interface, and it's supposed to be able to handle larger gene lists. I think about 100 is our limit, but you could go up to 300 and see what happens.
On the other hand, if you use the Cytoscape plugin, there's no limit, because we don't have this problem. Everything's already living on your computer, so we don't have to send the Cytoscape-like thing back to you. And in fact, the network browser that you're playing around with here is called Cytoscape.js. It was developed in Gary Bader's lab by a programmer, Max Franz, who did a lot of great work on this with the help of... I'm just really bad with names... okay, Christian Lopes. So if you are kind of a webby type of person, you can actually use Cytoscape.js to put network interfaces on your own databases. And a lot of people actually do, including our competitors, STRING.
Okay, so I'm going to go back and explain some of the things that I went through quickly. I'll answer more questions about this, and you have a lab where you can work with the interface. Right. So I just took you through this kind of quickly when I showed it to you.
So there are some slides that show you what all the different things on the interface are. Feel free to click on anything that looks clickable and learn how it works. You're not going to break it, so don't worry about that.
So, for selecting query networks: what I did is I just clicked on this thing right here, and that produced this, and now there's a list of networks. If people want to, they can just open it up on their computers and play around with it. The first thing I was trying to show you is that this is the box where you just type all the genes. You can also click on this, and it gives you access to all the networks that are available for human. We've sorted them into different categories. The co-expression networks I've already talked about, and here's the list of all the co-expression networks that we have. We have more than 200 of them. We've selected 20 by default; those are the 20 most useful ones for predicting biological process. But if there are specific ones that you want, you can go through that list and choose them. You can find out more about each co-expression network by just expanding it out. You can click on it to add it. If you want to use all the co-expression networks for some reason, you just click on that and it selects all of them. So right here, it's telling you that you're now including 287 out of 287 co-expression networks. Let's go back to default. This is the new version of the interface and I don't quite know how it works yet. Right, I don't know how to go back to default.
Question? That's a good question: how do we make co-expression networks? So we compute the correlations. And then, in terms of doing guilt by association, low correlations aren't useful in general. So what we do is, for each gene, we only keep the top 100 highest correlations. Right, and then we rescale it.
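The post-processing just described (keep each gene's strongest correlations, then rescale so that its strongest one becomes one) could look roughly like the Python sketch below. Dropping non-positive correlations and leaving the result asymmetric (gene A may keep gene B without B keeping A) are assumptions made for illustration; the talk does not say how those cases are handled. The toy call uses `top_k=2` rather than 100 so the effect is visible on a tiny example.

```python
def sparsify_and_rescale(correlations, top_k=100):
    """For each gene, keep its top_k highest positive correlations and divide
    them by the largest one, so the strongest kept link has weight 1.

    correlations: {gene: {neighbor: correlation}}
    """
    result = {}
    for gene, neighbors in correlations.items():
        ranked = sorted(neighbors.items(), key=lambda item: item[1], reverse=True)
        kept = [(n, c) for n, c in ranked[:top_k] if c > 0]
        if not kept:
            result[gene] = {}
            continue
        strongest = kept[0][1]
        result[gene] = {n: c / strongest for n, c in kept}
    return result


# Hypothetical correlations for one gene; keep only its two strongest links.
toy = {"g1": {"g2": 0.9, "g3": 0.6, "g4": 0.3, "g5": -0.2}}
sparse = sparsify_and_rescale(toy, top_k=2)
print(sparse)
```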
So the highest correlation that a gene has with any of its neighbors is one. It's basically the co-expression network with a few slight changes that we found help improve the quality of guilt by association.
So, for yeast, it's easy, because we look at the yeast GFP collection. In the yeast GFP collection, Erin O'Shea, a few years ago now, GFP-tagged most of the proteins in yeast and then, based on looking at microscope images, decided which subcellular compartments the proteins were in. So for yeast, that's what co-localization means. In human, we don't have the equivalent data, so by co-localization we mean "is expressed in the same tissue". So how is that different from co-expression? Often the co-expression studies are about being expressed under the same conditions or cellular stresses. I mean, it's a kind of narrow distinction. But one of the things that I found is that tissue co-expression is often a much better predictor of shared function than any other type of co-expression. So we separated it out. Great question.
If you guys are done playing around with GeneMANIA, I'll go back and explain that a little bit further. So, it doesn't need to be. It doesn't need to be, and it depends on how you do the guilt by association. Okay. Don't worry, you have lots of time to play with GeneMANIA later on, I promise you. And you're even allowed to play with it after you leave this classroom. In fact, I encourage you to do that.
Okay. I already talked about context-dependent networks. That stuff is all here. This is how we do the network weighting. Again, these are all just a bunch of notes that you can use later on, but there's one point I want to make here. I talked about how, when we assign weights to networks, we assign them based on how relevant they are to the query, so how much they connect together genes that are in the same list.
But we also down-weight networks for redundancy, meaning that if networks provide the same type of information, if they are too similar to one another, we don't want to give that information high weight over and over and over again. So for example, if you were to take the same network and repeat it ten times, the total weight assigned to all ten copies should be the same as the weight that network would get if it appeared only once. So we consider redundancy when we do the network weighting as well. I was particularly worried about this because there's a limit to what co-expression can tell you. Often co-expression tells you a lot about growth, as in whether or not two genes are up-regulated or down-regulated when the cells are dividing. So many co-expression data sets give you the same type of information: they link together genes that are involved in proliferating cells. So redundancy goes into determining the network weights as well.
Okay, I'll get to your question. I'm just going in the order that the slides are in. You didn't get to this, unless you clicked around randomly, but I'll show you how to get to it: you can decide how to weight the networks. The first thing you can choose is how many extra genes to show. So if I give you a gene and I say "give me more genes like this" or "what does this gene do?", what it does by default is find the 20 most highly linked genes. But you can increase 20 to something bigger; the maximum looks like it's around 100. I'm sorry, this is a new interface, so I haven't used it that much. Or you can make it go all the way down to zero if you don't want any extra genes. There are also attributes, which I will show you later, and you can control the number of attributes you get.
But down here, you can change the network weighting. By default, we use a scheme that's labeled as the automatically selected weighting scheme.
What automatically selected means is, if you have a small list of genes, we don't have enough information to determine what networks are relevant. So we use the default, which is networks that are good for recovering pathways, in general. But if we have more information, then we actually determine what networks are good for recovering the list. Okay. And so, if you want to explicitly choose the latter, which is what networks are good for recovering the list, then you choose "assigned based on query genes." If you just want us to automatically choose between those two options, you choose the automatic one. I think that's the best way to go. But if you have more specific questions, you could use three different types of gene ontology weightings. So one is the default weighting for short lists, which is networks that are good at recovering biological pathways. So pathways would describe gene biological function. But you can also choose networks that are good for describing gene molecular function. So what's the enzymatic activity of the gene or the protein product. Or you can choose networks that are good at recovering the cellular component. So that means where the gene goes in the cell, or where the protein goes in the cell, and what protein complex it is involved in. Okay. Or the last options are you can equally weight networks, if you don't want to use one of our fancy weird schemes for coming up with what the network weights are, and there you have two more choices. One is each network gets equal weight. And one is each data type gets equal weight. The reason that's important is, at least for human, half our networks are co-expression networks. So if you don't want your networks to be overwhelmed by co-expression networks, you can do equal by data type. So attributes, I actually don't have any slides about attributes. You guys just want me to show you what attributes are? Oh, here. I'll show you. Yeah. Okay. So click this thing to get the networks, and then down there it says customize advanced options. There you go. Okay. 
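The difference between the two equal-weighting options is just arithmetic, and a small sketch may help. The network names and the counts here are made up; the only thing taken from the talk is that co-expression accounts for about half the networks.

```python
from collections import Counter

def equal_by_network(network_types):
    """Every network gets the same weight."""
    n = len(network_types)
    return {name: 1.0 / n for name in network_types}

def equal_by_data_type(network_types):
    """Every data type gets the same total weight, split evenly inside the type."""
    counts = Counter(network_types.values())
    n_types = len(counts)
    return {
        name: 1.0 / (n_types * counts[dtype])
        for name, dtype in network_types.items()
    }

def total_for_type(weights, network_types, dtype):
    """Sum the weights of all networks of one data type."""
    return sum(w for name, w in weights.items() if network_types[name] == dtype)

# Hypothetical collection: half of the networks are co-expression.
network_types = {
    "coexp_1": "co-expression",
    "coexp_2": "co-expression",
    "coexp_3": "co-expression",
    "physical_1": "physical interaction",
    "genetic_1": "genetic interaction",
    "domains_1": "shared protein domains",
}

by_network = equal_by_network(network_types)
by_type = equal_by_data_type(network_types)

print(total_for_type(by_network, network_types, "co-expression"))
print(total_for_type(by_type, network_types, "co-expression"))
```

With equal weight per network, co-expression holds half of the total weight in this toy collection. With equal weight per data type, it holds a quarter, the same as each of the other three types.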
Who wants to see attributes? All right. Okay. So let me just start it, and then I'll tell you where attributes came from. So right up till now, everything in GeneMANIA except the attributes is a network, but sometimes genes have features. You know? These genes are in this pathway. That's a feature. That's not a network. These genes are on this chromosome. That's a feature. It's not a network. And so we came up with attributes as a way of dealing with that. So we have five different types of attributes. These are all attributes that are compiled by Gary Bader's lab, and these are all resources. A lot of them are from GSEA, and essentially they're pathways or gene sets, and then we use those pathways or gene sets to find other genes like those. Okay. So let's see what this comes up with. I think it's a little bit slow because I did one of the unadvisable things and I turned on all the gene expression data sets, and so there's like hundreds of networks to download that information for right now. It's going to take a little while. And so when this comes up, what you'll see is we represent these attribute nodes slightly differently. They're diamonds instead of circles, and they say things like a given protein domain that the gene has. Okay, so here's an example of an attribute. So this is an attribute. It's a pathway. This is a description of where the attribute comes from, and like I said, these are all collected by Gary's lab, and if you click around on top of the node you can see, oops, how did that? You can see which genes it's linked to. Here's another pathway, inactivation of CDC42. Should be no surprise to you that there's some genes that are linked to CDC42 that have that attribute. Let's see if we have, like, okay, so we have something called the BioCarta MAL pathway. I'm not really sure what that is. These are pathways from different sources. Some of the other attributes, none of which came up here. Let's just turn co-expression off and let's use this. 
Some of the other attributes are like "has this transcription factor motif in its promoter region" or "is a target of this microRNA." You can find out more about what attributes we have access to by just clicking on customize. Oops, let's go back to attributes. And so these are the attributes that we have. The consolidated pathways, those are the ones you're seeing the most of. Those are pathways drawn from a number of sources that were published in 2013 along with Gary's enrichment map paper. Also this drug interaction. In that case the attribute would be the drug, and it would be linked to all the genes that have an interaction with that drug. InterPro, those are protein domains. So conserved protein domains. So have you guys heard about these? Everyone knows what a protein domain is. Okay, great. And then there's the microRNA target predictions and the transcription factor target predictions. So if you have a list of genes you think are co-regulated, you can put them in, and one of the ways that you can try to see if they have a common regulator is to use their attributes. A slightly better way, which Michael Hoffman is going to tell you about tomorrow, looks more in depth at transcription factor targeting. Okay. So those are attributes, and you can control the maximum number of attributes we show, and you have to do that because there's like 20,000 attributes. So if you don't put a control on the maximum number, you can end up with a lot of attributes. Okay. Alright, more questions. Yeah. So can I ask you a favor? So like I said, Max is the one who did this. He did a great job, and he responds really well to user criticism and not as well to my feedback. So you can get that if you press on report, so here's the legend for the function color down here. But yeah, what I think you want would have to be the legend on the network image itself. Yes. So you don't have the legend on this network image that you can export. 
So if you email me and/or email GeneMANIA asking for that, then I can forward it to Max and it doesn't come from me. Oh, I'm being recorded, right? Okay. You can cut that part out, right? Okay, yes. Yes, I'm known as the guy who asks for too many features. So yeah. Are there any other questions? Yeah. Thank you. Thank you. Right, so where is the upload button now? Where did you find that? Oh, good. Okay. Upload network. So there is just the two-column format. Tab-delimited. So gene, and basically every line corresponds to one interaction. So you put one of the genes involved in the interaction and the other gene involved in the interaction. And basically GeneMANIA will recognize any of the gene names that it recognizes when you put in a query. And we tried to design it to be as easy as possible, so that we'll identify gene names in a variety of different formats. That said, we only recognize unique identifiers. So there's a lot of gene names that actually correspond to two different genes where only the case differs. So if you're like a Drosophila geneticist, there's going to be some genes you're not going to be able to put in because they differ only by case. There's like smaug and Smg: Smg is capitalized with an S, and smaug is smg but with a lowercase s, and there's nothing we can do about that. Okay, but that's the format that we recognize. Yeah. Okay. So let me just tell you a couple more things and I'll get out of your way. Okay. I've told you all the stuff about the network weighting schemes. Now I want to answer your question about whether the query gene has to be connected to all the other query genes. And so that's the algorithm for doing guilt by association. And it's one of our learning outcomes. So if you look at this network here, you can see that there's four red genes, and they're directly linked to some genes in the network. 
They're indirectly linked, say, to this gene. There's no direct connection to the red genes, but there's a path. And then these genes are not linked to these red genes at all, by either a direct or indirect path. So there's two main algorithms for doing guilt by association, for finding the highly linked genes. One is direct interaction, so the genes that you would return with direct interaction have to be linked to the query genes. The one we use is subtly different. It's called label propagation, and basically it allows indirect connections by down-weighting based on how far away you are, using the heat diffusion equation. But it doesn't matter. The most important thing is that you can get genes that aren't directly connected to the query genes, but you'll get them much less often than the directly connected ones. It goes an arbitrary number of genes, as long as they're connected. So you'll never get these ones, because there's no connection here, so no heat can flow from here to here. It'll go down, but because we're only showing the top 20 genes, if there's like a long chain and they're not connected to anything else, you'll get the whole list. But most of the time, by default we look at 200 networks, so there's often a large connected component with the genes. So you very rarely get indirect genes, but sometimes you get indirect connections where the two genes share a lot of neighbors and there's a bunch of different paths by which the heat can propagate between them. Yes. Yes, you can. Well, so I'm answering a different question, but you can certainly, on here, if you take these two genes out that are directly connected, you can click on this and find out what the links are that link all the genes together. But we don't provide an explanation for how the heat got there. But you might be able to do it yourself by just looking at the network. Okay, so, and these are more notes and details. I can go through this if you want, but I think I've explained most of 
this, and then I have a longer label propagation example. So the advantage of considering the indirect connections is not only that you can get genes back that aren't directly connected to your query list, but indirect connections can sometimes allow you to distinguish between direct but spurious connections and communities of genes that are highly linked to one another. So this is before. These are like four query genes. And after label propagation, the size of the node here tells you basically which genes we would return in GeneMANIA, like how highly linked those genes are to the query genes, right? So you can see here in this network, each one of these nodes is in a group of very highly connected nodes called modules or communities, and they tend to appear for genes that share a lot of functions. So pathways of genes are often connected together in a module, and one of the things you're trying to do when you're doing this gene function prediction is to find these modules of genes. And label propagation works a little bit better at that because it considers not only direct connections but indirect connections. So what do I mean here? Well, if you have a collection of genes that are all highly linked to one another, you won't necessarily get direct connections between everyone in the group, but everyone in the group will share a lot of indirect connections, right? So if you're in a clique, you know, you often share a lot of the same friends, right? And actually it turns out that that type of information, the number of friends you share, sometimes is a better predictor of gene function than whether or not two genes are closely linked together. And so people try to come up with explanations for why that's true, and one of the explanations is potentially this jerk in the middle here. Let's not call them a jerk, let's call them a social butterfly, who's a member of all these different communities, right? So if you're trying to identify this 
community of genes that are all highly linked together, and you have one of these multi-function or pleiotropic genes, they can be linked to a whole bunch of different communities, and if you start looking at direct neighbors you're going to get these ones as well, right? But if you rely on not only direct connections but sharing a lot of connections, or having a lot of the same friends, that better finds you more members of the same community. Okay, so that's what label propagation is good at. Sorry I spent so much time on that, but this is something that I do algorithmic work on. So yeah, there we go. So what is GeneMANIA? So we have a large, automatically updated collection of interaction networks. We're about to have a new data release that we're just debugging right now. Unfortunately our last one was about two years ago, but that's pretty up to date as these things go. Also we have this query algorithm that I've been talking about that finds genes and networks that are functionally associated with the gene list. And then we have this network browser that has a lot of linkouts, so that you can find out where we got our information from, like what papers we got our information from and where to go back to look for it. And so where do we get data from? So our gene expression data largely comes from GEO, the Gene Expression Omnibus, and whenever we do a new data collection, basically we go through the Gene Expression Omnibus and find all data sets that have a minimum number of conditions and build co-expression networks out of them. They have to have a minimum number of conditions. They also have to be from a microarray format that we recognize, so we know that they're not something weird. We get our attributes from Gary Bader's lab, we being me and Gary. And we get our physical interactions from an organization called iRef. What they do is they compile physical interactions that annotators around the world are generating, and we use a way of interacting with iRef called iRefIndex to get our 
physical interactions. We get all those physical interactions and we separate them out by the PubMed ID, or the publication that they came from. We also have predicted interactions. And so what are those? Largely those are what are called interologs, so we predict an interaction between two genes if the orthologs of those genes interact in another model organism. And so we get our interologs from I2D. We also have a network that tells you shared protein domains, so we link two genes together if they share a lot of protein domains. The idea there is they're likely to have the same biochemical function. And so we get our protein domains from InterPro, and we get genetic interactions from BioGRID. In terms of gene ID mappings, like I said, we try to recognize any way that you can type the gene name in, but we only recognize unique identifiers, and we generally don't recognize gene names with spaces in them. So we have preferred gene identifiers, and they're usually not what you want to use, because they're things that are very unique, like Ensembl IDs or Entrez Gene IDs. If we have those, we'll recognize those, and those are very useful. But we recognize almost all gene symbols, because those are almost always unique, and then we recognize a number of aliases for genes, but not everything, especially if it's not unique. And we get gene ontology annotations from the standard places. I guess I'm out of time here. And like I said, we also have a Cytoscape plugin. This is a very old version of the Cytoscape plugin, but this is the paper here that describes it. This is designed by Jason Montojo and it has all the GeneMANIA functionality. We have versions of it for Cytoscape 3. One thing that people actually end up doing a lot is they use it to add new organisms. So there's a lot of organisms that we don't support, because it takes a lot of work for us to add a new organism, but we can take you through the process of adding a new organism to GeneMANIA through the Cytoscape plugin. And you can of 
course integrate the GeneMANIA networks with other Cytoscape analyses, and it's also a source for getting interactions into Cytoscape. So you can download all the GeneMANIA data releases from Cytoscape. They're big, they're like gigabytes in size, it takes five minutes or something, but then you have all the interactions there, and then you can use those interactions in other types of network analyses. And yes, we do still tell you what the source is that the interactions came from. The plugin interface has the functionality that the GeneMANIA website does, or the vast majority of the functionality. I think we only color a node one color. The fact that we have these beautiful pie charts, that's something that Max has just added, and I don't think our Cytoscape plugin has that, but the other stuff about linking to the data source is still there. The other important thing is we only have the latest release available on our website. If you want to reproduce GeneMANIA analyses done on an older data release, you have to use the Cytoscape plugin, because all the previous data releases are available to you there. So you can reproduce the analysis that you did on our website in Cytoscape, even if you did it like two years ago or three years ago. We have a number of other tools that are associated with this. So one of them is called QueryRunner, and what we've used that for in the past is that if you have a new network, we can tell you how much that network adds to the overall knowledge about gene biology, by saying this is how much we could recover known pathways given all the data up to now, and this is how much more we can recover about known pathways when you add your network in. So it's like the added predictive value of the network. One of our competitors is STRING. This is our most major competitor. What differs between us and STRING, and they're friends more than 
competitors, is that STRING is more focused on proteins and we're more focused on genes. So we'll include things like genetic interactions, or we'll talk about things at a gene level, where STRING will talk about things more at a protein level. We give you a lot more precision in the networks that you can choose. We do label propagation where STRING doesn't, they do just direct interactions. And until recently, and I'm not sure if they do it now, they don't weight networks, though they do allow you to turn networks on and off. But I would highly recommend looking at both GeneMANIA and STRING if you want to see what you get out of the same query. Okay, and then we have a long list of comparisons here if you're interested. And that's it. So yeah. Not always now. So yeah, we do. It's actually on the integrated assignment. So it's called GRN. Have you heard it before? You'll find out more about it. New network. So you can use it in two ways. So one way is to say, okay, let's take the whole genome and let's try to predict the function of every gene of unknown function, or of weakly known function. So QueryRunner can do that. QueryRunner can also tell you how much you've improved or decreased your ability to predict gene function, to recover known function of genes, using the new network that you add to the network database. That's what I tried to say before. Yeah, so we developed this for a number of reasons, but in 2010 there was a new yeast genetic interaction network that was published, and we wanted to say, well, this is an important interaction network because it tells you much more than you used to know, even if you took all the functional genomics data that had been gathered up until that point in time. Yeah.
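Since the direct interaction versus label propagation distinction keeps coming up, here is a minimal sketch of the two on a toy network. It is not GeneMANIA's implementation. It uses a generic label propagation, scoring genes by solving (I - alpha * S) f = y on a degree-normalized network S, as a stand-in for the heat-diffusion idea. The gene layout, the `alpha` value, and the function names are all assumptions made for illustration.

```python
import numpy as np

def direct_neighbors(adjacency, query):
    """Genes with a direct link to at least one query gene."""
    linked = np.flatnonzero(adjacency[query].sum(axis=0) > 0)
    return {int(i) for i in linked} - set(query)

def propagate_labels(adjacency, query, alpha=0.5):
    """Generic label propagation: solve (I - alpha * S) f = y.

    S is the adjacency matrix normalized by node degree, and y is 1 for
    query genes and 0 elsewhere. Every gene needs at least one link.
    """
    degree = adjacency.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    s = adjacency * inv_sqrt[:, None] * inv_sqrt[None, :]
    y = np.zeros(len(adjacency))
    y[query] = 1.0
    return np.linalg.solve(np.eye(len(adjacency)) - alpha * s, y)

# Toy network: genes 0 and 1 are the query, gene 2 is directly linked to
# both, gene 3 is reachable only through gene 2, and genes 4 and 5 form a
# separate pair with no path to the query.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (4, 5)]
n_genes = 6
adjacency = np.zeros((n_genes, n_genes))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

query = [0, 1]
neighbors = direct_neighbors(adjacency, query)
scores = propagate_labels(adjacency, query)

print("direct interaction returns:", neighbors)
for gene in range(n_genes):
    if gene not in query:
        print(f"gene {gene}: propagation score {scores[gene]:.3f}")
```

Direct interaction returns only gene 2. Label propagation also gives gene 3 a positive score, lower than gene 2's, and leaves genes 4 and 5 at zero because there is no path along which anything can flow to them.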