OK, welcome back. You know who I am? You know about the licensing? So I'm going to talk to you right now about gene function prediction. Lincoln mentioned a little bit of what I was going to talk about in his module four yesterday. Essentially, this is about how to use the high-throughput data on gene expression and networks that's been generated for model organisms in your own research. In module four, Lincoln described ways to use data that's been generated by other people to help you define modules or sets of genes. In this module, I'm going to tell you about another approach to doing that. That approach is going to focus on a tool that Gary Bader and I built called GeneMANIA. But there are a lot of other systems that perform similar functions. Some of those systems are linked through the wiki, and we're going to discuss one in particular, called STRING, besides GeneMANIA. Essentially, what this module is going to be about is what I've started to call a gene recommender system. Everybody knows what a recommender system is, because you use them all the time. So if you go to Amazon, you buy a couple of books, and Amazon starts suggesting books that you might like. Or you go to one of these internet radio stations, you play a few songs, and then the internet radio station starts to figure out what type of music you like. So we're doing the same thing, but we do it with genes. You give us a set of genes, and we give you more genes like those. And we call that gene function prediction. So that's what this module is about. OK, and I'll briefly talk about functional interaction networks, but Lincoln talked about that in module four already, so you've already been introduced to this idea. And then I'll talk to you about the concepts in gene function prediction, why we think we can build gene recommender systems, and what we think it means when we recommend you additional genes.
Now, there are some differences in how you recommend the genes, and that's what I'm going to talk about under scoring interactions between genes by guilt by association. I'm going to introduce GeneMANIA to you. I'll give you a short demo of it. You'll be working more with it in the lab. And then I'll give you a short explanation of network weighting schemes, and what that is will become clear as we go through the talk. And then I'll also give you a short introduction to STRING. GeneMANIA is probably the second most popular gene recommender system, and STRING is the most popular one. OK, and so where this all comes from is this idea that there have been millions of dollars spent on generating genome-wide data. So cell wiring diagrams, networks of protein interactions, networks of genetic interactions, ChIP-seq data that tells us about interactions between transcription factors and DNA. There's microarray expression data. And so the idea is that whenever you're doing an analysis within a particular organism for which all this data is available, you should be able to use that data within the context of the questions that you're asking yourself. But it's actually a little bit hard to use this data in the way that it's represented. These data sets are large. They're incomplete. Because all these measurements are done on a fairly large scale, no one individual measurement is especially reliable. And sometimes it's hard to think about how to phrase questions in a way that lets you use these data. And so the solution that we and others have come up with is to use these data within the context of function prediction. And we mean this in a very broad sense. But you can think about it as asking two types of questions. If you have a single gene, you want to know what that gene does.
What we mean by that is looking at the genes that it interacts with and seeing if you can say something about its function based on the interactions that it has with other genes. And these interactions could be with genes that it's co-complexed with. These could be genes that it has some sort of epistatic interaction with, so a genetic interaction. These could be genes that it's co-expressed with. And then what type of data is important might depend on your question, or you might just want to know in general about all the interactions this gene might have. Then another type of question is, give me more genes like these. And this is a slightly different question from the first one. So you buy one book on Amazon, and Amazon might suggest a whole bunch of other books, but you buy five or six books, and now they're starting to build a profile of you. And they know more precisely what you want. And I think maybe Amazon's not the right example here. I think internet radio is probably the better example. And I'll show you some examples that distinguish between these two types of questions. But really, answering these questions is the way that we can query all the interaction and genome-wide data that's been generated for the model organisms that you might be working on. OK, and all this is driven by something called the guilt-by-association principle. And where this comes from, this is one early example of it. This is one of the first microarray papers. And what they did in this microarray paper is that they just profiled these genes under a whole bunch of different conditions. So under deletion mutants, under a bunch of different environmental conditions. And this is a very familiar picture to people now. Each one of these rows corresponds to the expression pattern of that gene across the various different conditions. And red means up-regulated or down-regulated. And green means the opposite direction. And I can't remember which is which.
And there's a very simple insight if you just start to group these genes by similarity of their profiles, and then you look at that list of genes. Oh, yeah. Last module, I learned how to use this laser pointer. Let's put that knowledge to use. OK, but if you just group those genes by these profiles, and then you write down some representation of each gene's function beside them, you find that the functions group in the same way that the genes cluster by their expression profiles. Often. And so you can represent this type of information in a network setting, where each node of the network represents a gene. And the strength of the interactions in that network, or the thickness of the lines here, the links between the nodes, is proportional to the similarity of their expression profiles, like the correlation coefficient. Now, if you lay that network out in such a way that highly linked genes are in the same spatial location, and say here you've got two genes that you don't have a function for, or they have some sort of more vague function assigned to them, simply by looking in the local neighborhood, looking at their neighbors, you might be able to assign function to these genes. And that's the guilt-by-association principle: that the function of a gene can be inferred from the other genes it interacts with. And this is how to use it with microarray data. Sometimes you measure interactions explicitly by doing something like a yeast two-hybrid assay, or you do some sort of affinity purification assay where you're looking for co-complexed proteins, or you measure some sort of epistatic interaction. OK. And so ideally, you take all the data that's available when you're answering that question. And so I have a query list here, and let's say this is a single gene. And so the question is, what does my gene do?
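The co-expression network described above, where link strength follows the correlation between expression profiles, can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names, the expression values, and the 0.8 cutoff are invented for the example, and nothing here is taken from how any particular tool builds its networks.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_network(profiles, threshold=0.8):
    """Link every pair of genes whose profiles correlate at or above threshold.

    The threshold is a hypothetical choice made for this sketch.
    """
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges[(g1, g2)] = r  # edge weight = correlation coefficient
    return edges


# Invented expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.2, 7.9],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "unknown1": [0.9, 2.1, 2.9, 4.2],
}
network = coexpression_network(profiles)
```

In this toy data the uncharacterized gene `unknown1` ends up linked to `geneA` and `geneB` but not to `geneC`, so by guilt by association its neighbors' annotations would be the first place to look for its function.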
Well, you take that gene and all the data that's available for the organism, you put it through your gene recommender system, and you find out, in this case, the 20 genes most highly interacting with that gene. Each of the colors here represents a different type of interaction. You can see that this gene has a lot of different types of interactions with its most highly connected neighbors. So note, these genes not only physically interact, they're co-expressed. In some cases, there's a genetic interaction between them. They have similar protein domains. They also have a conserved interaction in other species. That's very strong information that there's some sort of interaction between those genes. And interaction in these ways often means interaction in terms of function, so they share function. So you just take this gene, and what we do is take the 20 genes interacting most with it, which we identify using label propagation, which I'll describe later. And then you just do an enrichment analysis within this little local neighborhood to find the functions that are enriched. And this is the function that's most enriched in this local neighborhood. And so that's one way to try to assess the function of a gene. In this case, we already knew that the gene had that function. But you can do a similar analysis for genes that don't have an assigned function, and often that sort of thing will work. So now, if you have a little bit more information, you can refine what data you're going to use to answer the question. So what do I mean by that? Actually, I came up with this example when I was giving a similar talk in Memphis. And so which Memphis do you think I was in? I was in Memphis, Tennessee, but there's another Memphis. There's the original Memphis. So if I just told you Memphis, based on the fact that we're in North America, you'd think that means Memphis, Tennessee. But that's not the only Memphis.
But if I gave you a set of three cities, I said Memphis, Knoxville, and Nashville, and I asked you to give me more cities like that, you'd know that I'm in Tennessee, and you could give me more cities from Tennessee. I don't know if you all knew those, but some people did. But if I told you Memphis and Cairo and Alexandria, you could give me more Egyptian cities. And so by giving a list of genes, you say what type of question you're asking or what kind of context you're asking in. So if I give you a few kinases, you could give me more kinases. But kinases are involved in a lot of pathways. So if I give you members of the same pathway, then it becomes clear that the question that you're asking is a question about the pathway. And then based on the question that you're asking through your list of genes, you should be able to select the data that's most relevant to that question. So if I give you a few members of a complex, the most relevant data to try to complete that complex would be physical interaction data. But that's different than if I give you a list of genes that have similar protein domains. OK. So to answer this question, give me more genes like these, you take a query list. And actually, this query list is on the wiki, because we're going to play with it later. And you take the gene list and the network data and push them into the gene recommender system. And then you get out a list of genes. You get out the ones that have the dashed lines here. Those are the ones that were in the original query list. The ones without the dashed lines through them, those are the ones that are related. I've just annotated these with the functions enriched in this set. And I've shown the networks here. But the other thing that this gives you is a relative weight for each of the networks based on how good they are at answering the question, give me more genes like the ones in this list.
I'll make that a little bit clearer later on. But if you ask a question like give me more genes like these, that provides you the information that you need to say what types of data are relevant to answering this question. OK. And so here's GeneMANIA. At this point, I was going to give you a motivating demo. So let us hope. Yeah, I guess. OK. I have the gene list. So this is the gene list right here, and I've copied it. OK. Paste the gene list in. I've chosen human. And the easiest way to interact with GeneMANIA is just to press go. OK. So what happens now? What GeneMANIA is doing in the background is it takes your list. And because the list is sufficiently long to give it enough information to say what type of question you're asking, GeneMANIA is actually able to weight the different sources of data according to how well they reproduce the genes on the list. So if I take this list and I say, well, how do these genes interact with one another? And most of the time they interact through co-expression interactions, that suggests that if I want to add more genes to this list, co-expression is a good type of data to use to try to find those genes. And so that's how we assign the weights to these various sources of data. The black genes here are the genes on the list. And the gray genes are the 20 genes most highly interacting with these query genes based on networks of this type. And so in the network display, you can click on genes. It gives you information about the genes. You can link out to Entrez. If you've never heard of that gene before, this links you to more information on the gene. This grid right here, let me find one that's actually got stuff filled in in the grid. So this tells you what functions that gene has. So this is called the function grid. These are Gene Ontology annotations. And these are the ones that are significantly enriched among the nodes here against the background of the entire genome at an FDR of 0.05.
You can have a better look at those functions over here on this side. You can annotate the graph. You can color the nodes according to those that actually have that function. You can also look at the data that was used to try to produce genes on this list. So this is one source of data that it used. This is a co-expression network that was generated using data from this study right here. You can find out more about that study by clicking through here. That takes you to the PubMed link. It's data we downloaded from the Gene Expression Omnibus. You can find out more about it by clicking through there. So these are the weights that are given to the various co-expression data sets. A lot of these genes share protein domains. So these NOS, I think that's nitric oxide synthase genes. They are all connected together because they presumably have the same protein domains. These are other genes that share those protein domains, or their protein products share those protein domains. As Lincoln talked about yesterday, a number of these genes also are in the same pathway. And genes in the same pathway are connected by the pathway links. A number of these genes are expressed in the same sets of tissues. They're connected by co-expression links. A number of these genes interact. And those are connected by these links here. You can find out more about individual links by clicking on them. So these are all the co-expression data sets in which these two genes are co-expressed with one another. That's the basic interface. And then the lab goes through various aspects of that interface. And so what do you get out of this? Well, let me reset the... First of all, there's some graph layout that's going on. So genes that are highly interacting with each other are spatially located next to each other. So if you have a long list, that can help you identify clusters of genes that share a lot of interactions within the list.
You can also find out how the genes in the list interact with each other. So most of the genes in this list, they're largely co-expressed with each other, and they share a lot of protein domains. Again, it's a gene recommender system, so it gives you other genes that are like these genes. You can control how many genes are provided by changing that in the advanced options panel. There'll be more on that in the lab. You can also specifically adjust what network data it uses by going in to individual networks and checking on or off the networks that it will look at to determine which are the relevant networks. OK. Questions about the interface? Now, we'll get back to it in the lab. Yeah. So we were coloring the nodes ourselves. So by default, let me clear all these things up. When you first get results back for the list, the black nodes are the ones that were in your query, and gray nodes are the ones that are the most highly interacting genes. Those are the ones that GeneMANIA found. When you go to the Functions tab, which I did over here, you can color the nodes according to those that have the function. So these are all the nodes among this list that have this molecular function that's defined here. So the node colors are things that you yourself add. Now the problem here is, and nobody has a solution to this, if you see I'm switching functions here, there's a lot of overlap in the genes that have those functions. And Gary this morning talked about ways of dealing with that using Enrichment Map. In terms of coloring, there's an order in which the nodes are colored. So I've colored all the nodes that have this function, but some of those nodes also have that function. So that color trumps that color, like it colors over it. But you can change the order of the coloring down here, as such. It represents something kind of complicated. Do you guys want me to go through it? It'll take like two minutes.
And I do talk about this later in the talk, so we'll get back to this if you don't understand the first time through. Essentially, what that weight represents is 100 divided by the number of neighbors that that node has in the full network. We're not showing you the full network here. We're only showing you the nodes in the list. But in the full network, the full Pfam network, this is a network where nodes with a lot of shared protein domains are linked together, where those protein domains are defined by Pfam. These nodes have two neighbors, because it's 50, right? It's 100 divided by 2, right? And so you can see that number changes. So let's go up here. So apparently in the full network, these nodes have like 20 neighbors. It's slightly more complicated than that, because in the full network, this node could have more neighbors than this node, right? So what do you put in there? Well, we put the geometric mean in. It's like the arithmetic mean, but taken on the logs. And so we take that weight to represent some sort of measure of functional interaction, the idea being if you only have one neighbor, you probably share a function with it. But if you have 50 neighbors, maybe you share a function with only one of those neighbors, right? It's kind of a vague way of doing that, obviously. But that's the way we did it. It becomes slightly more complicated when you look at co-expression. For co-expression, and actually for shared protein domains, the links in the original network are already weighted by correlation, right? So then we divide by the sum of the correlations of all edges that go into a node so that everything adds up to 100. So things will be weighted slightly higher in co-expression networks if they were more correlated with one another. That's why it's a little more complicated. All right, more questions.
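The displayed link weight described above can be written out as a small calculation. This is a hedged sketch of the arithmetic as stated in the talk, not the tool's actual code: the function names are hypothetical, and using the geometric mean of the two nodes' summed correlations in the weighted case is an assumption that simply mirrors what was said for the unweighted case.

```python
from math import sqrt


def unweighted_link_weight(neighbors_a, neighbors_b):
    """100 divided by the geometric mean of the two nodes' neighbor counts."""
    return 100.0 / sqrt(neighbors_a * neighbors_b)


def weighted_link_weight(correlation, strength_a, strength_b):
    """Correlation-weighted variant (assumed form).

    strength_a and strength_b are the sums of the correlations on all
    edges going into each node in the full network.
    """
    return 100.0 * correlation / sqrt(strength_a * strength_b)


two_neighbors = unweighted_link_weight(2, 2)       # both nodes have 2 neighbors
twenty_neighbors = unweighted_link_weight(20, 20)  # both nodes have 20 neighbors
uneven = unweighted_link_weight(1, 4)              # geometric mean of 1 and 4 is 2
```

With two neighbors each the weight is 50, matching the number read off the screen in the demo, and with about 20 neighbors each it drops to 5.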
So if there are no connections, you'll just see a separate node. Most genes have connections. That said, we do what's called sparsification of the networks. What that means is, in a co-expression network, you can measure co-expression between any pair of genes. But we only make a link between a gene and, I think it's the top 100 genes most co-expressed with it. So in a co-expression network, the most interactions that a gene will have is about 100. And so it's rare for any given pair of genes to interact. So the fact that there are a lot of interactions here suggests that this is a highly interacting group. If we put a set of random genes in, there may be a few interactions that come up. But there certainly won't be as many as what we're seeing here. OK, great. So let's go back to the lecture and go through the concepts that we introduced here. And you know, you can play with it. It's really fun to play with. It's designed to be fun to play with. So click on things. It should have behavior that seems intuitive. There are a lot of things you can click on. And you'll see it's fun. OK, so let me go through. This is the advanced options panel. I showed you how to open it. It was just clicking here. It said open advanced options panel. Now that it's open, it says hide advanced options panel. Forgot about my tool. OK, and so here, there's a default. We selected the networks to search by default so that you don't have to think that much. But once you become more adept in its use, you can change the networks that it considers when it does its searching. Now GeneMANIA is going to assign its own weights to these networks. But you can certainly say, only consider these networks. And the way you do that, the networks are classified into different groups. And then once you click on a group, it opens up and shows you all the networks. Here, these are all the co-expression networks.
The networks, in general, are labeled by the first author and last author of the study that we used to generate the network. And if you click on this triangle here, it gives you more information. It'll give you the title of the study. It'll give you a link to the PubMed entry. And in some cases, there are tags that we've associated with the study that vaguely say what the study is about. But it's actually just better to read the abstract. So you can click in for any of those. The number here tells you how many networks are available. And the number here tells you how many networks you've selected. By default, we select all networks, except that we only select 20 of the co-expression networks, because we feel more co-expression data can become redundant unless you have a longer list. And there are some of what are called predicted networks that we don't select. So predicted means two things. In the first kind of predicted network, genes are linked if they're observed to interact in a different species. And usually that interaction means physical interaction. So if you have a protein-protein interaction network in, like, mouse, our colleagues at I2D, this is Igor Jurisica's lab, what they've done is they've identified the corresponding human genes. And they put an interaction between two human genes if their orthologs in mouse interact. We call those predicted interactions, because they're not directly measured. There's another type of predicted interaction network that we have. And these are functional interaction networks, like the network that Lincoln told you about in module four. And so what people do is they do the same thing that GeneMANIA does. They collect some subset of the data. They combine it all together. And then they calculate some measure of functional interaction that's computed based on how much those genes interact in the various data sets.
We include those functional interaction networks that other people have generated. We call them predicted interaction networks. But because we're drawing from the same data sources and doing something similar to those, we think it would be circular to include them by default. But you can do the search yourself if you want by checking those networks. Questions? OK. And again, the fraction is the number of networks selected out of the total. You can click in here. And so here, we've opened this up. And like I said, this is a link to PubMed. And this tells you the citation. This tells you how many interactions you have. OK. So what you get at the end, as I showed you, is what I call a composite interaction network. These are networks that are made up of interactions of various different types. And so if you put a single gene into GeneMANIA, or if you use another one of these gene recommender systems, you get what I call a query-independent composite interaction network. What that means is the contribution of each one of these data types to this final network, which is what you're going to be searching to identify the related genes, is fixed. It doesn't depend on the question that you ask. OK. And so one way to fix it is just to take each of these networks separately and then sum the weights on the edges together. So the weight on the link between two nodes is equal to the sum of the weights from each of these networks. And that actually works pretty well. So if you're just doing a one-gene query, you don't know what question you're asking. So you want to find a set of weights that works well in general. So networks vary in the quality of the measurements that people have used. They vary based on how much information is in the network.
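The composite network described above, where the weight on a link is the sum of that link's weights across the individual networks, can be sketched as follows. The network names and edge weights are made up for illustration, and the optional per-network `weights` argument is an addition of this sketch so the same function also covers the weighted case discussed later.

```python
from collections import defaultdict


def combine_networks(networks, weights=None):
    """Sum edge weights across networks into one composite network.

    networks: {network_name: {(gene_a, gene_b): edge_weight}}
    weights:  optional {network_name: network_weight}; defaults to 1.0 each,
              which gives the plain sum.
    """
    if weights is None:
        weights = {name: 1.0 for name in networks}
    composite = defaultdict(float)
    for name, edges in networks.items():
        for (a, b), w in edges.items():
            key = tuple(sorted((a, b)))  # treat links as undirected
            composite[key] += weights[name] * w
    return dict(composite)


# Invented example networks.
networks = {
    "coexpression": {("A", "B"): 0.5, ("B", "C"): 0.2},
    "physical": {("B", "A"): 1.0},
}
composite = combine_networks(networks)
reweighted = combine_networks(networks, {"coexpression": 2.0, "physical": 0.0})
```

Here the link between `A` and `B` is supported by both networks and gets a composite weight of 1.5, while the `B`-`C` link keeps its single co-expression weight of 0.2.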
So if I take a predicted interaction network from yeast, where I've taken yeast genes that interact and I try to find the human homologs, that's probably less reliable than if I've got a mouse gene network, where I've taken the human orthologs of mouse genes that I've seen interact in mouse. So you want to weight based on the quality of the network and the type of questions people ask in general. That's one situation. And so yeah, obviously you want to ask different questions, so you should weight networks differently. And again, some networks might be better than others. And so here are the rules that I think you should use when you weight the networks. The way we do it is we use a linear regression procedure, but the details of the algorithm aren't important. What's important is, first, networks should be weighted by their relevance. So how relevant they are to predicting the function of interest. In this case, how relevant they are to predicting membership in the query list. And the second thing is redundancy. What that means is a lot of times different labs are measuring the same thing. We take the networks from both labs at the same time, because we think it's useful to have both those types of information available, but we want to recognize the fact that this is redundant information in general, so the individual networks should be down-weighted a little bit. One way to think about that is if I took a network and I just duplicated it five times, and then I was using equal weights on all of the networks, the network that I duplicated five times is going to contribute more to the final network than if I just had that network represented once. So we want to be able to detect cases like that, or near duplications, and down-weight appropriately.
So in the final weighting, for those five networks, the sum of their weights should be equal to the weight that the one network would have gotten if it hadn't been duplicated in the first place. OK, and so that's where we get these query-specific weights from, and that's what these numbers here mean. We actually do assign numbers in the query-independent case too, and those are numbers that we've computed by asking how well those networks reproduce shared GO annotation. So GeneMANIA has one network weighting scheme that it uses by default, but if you want to control how it does its weighting, the advanced options panel is where you do that. So the default is to automatically select the weighting method. If you don't do anything and your gene list is six or more genes, we try to assign the weights based on the query gene list. So based on how often those networks are connecting together genes in the query list. If your query list has fewer than six genes, so five or fewer, we use what's called Gene Ontology-based weighting, biological process. What that means is that we use fixed weights. So we'll use the same set of weights regardless of the gene list that you provide, but those weights are set based on how well those networks reproduce patterns of GO annotation for biological process terms. If you're asking a question about the biochemical activity or molecular function of these genes, you might want to select that, which uses a different branch of the hierarchy. Or if you want to ask a question more related to where they are in the cell or what protein complex they're part of, you might want to select this weighting. We also have equal weighting. So you assign each network an equal weight. You can weight equally by network, but some data types have more networks than the others. So you can also weight equally by data type, so each source of data contributes the same amount. OK.
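The weighting options just listed can be summarized in a short sketch. The six-gene cutoff and the two equal-weighting variants come from the talk; the function names, the returned labels, and the example network names are hypothetical, and the regression-based query-specific weighting itself is not shown because its details are not given here.

```python
def choose_weighting(query_genes, min_genes_for_query_specific=6):
    """Pick the automatic weighting method from the query list length."""
    if len(set(query_genes)) >= min_genes_for_query_specific:
        return "query_specific"
    return "go_biological_process"  # fixed, precomputed weights


def equal_by_network(groups):
    """Every network gets the same weight, so large groups dominate."""
    total = sum(len(nets) for nets in groups.values())
    return {net: 1.0 / total for nets in groups.values() for net in nets}


def equal_by_data_type(groups):
    """Every data type gets the same total weight, split among its networks."""
    weights = {}
    for nets in groups.values():
        for net in nets:
            weights[net] = 1.0 / (len(groups) * len(nets))
    return weights


# Hypothetical grouping: three co-expression networks, one physical network.
groups = {
    "co-expression": ["coexp1", "coexp2", "coexp3"],
    "physical": ["ppi1"],
}
by_network = equal_by_network(groups)
by_type = equal_by_data_type(groups)
```

With equal weighting by network, co-expression contributes 0.75 of the total here simply because it has three networks; with equal weighting by data type, co-expression and physical interactions each contribute 0.5.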
So, questions about that? That's how the network weights are assigned, and those are the options that are available to you. OK, the last thing is, given a network, how the genes are scored. And so there are two different ways that this happens. One is what I'm calling direct neighborhood, or often people just call it guilt by association. And it's very easy to figure out what the weight is here. So this is the network, and this network is what's called disconnected because there's a part of the network that's not connected to this part here. OK. These genes are red. These are the query genes. And then given this network, we want to figure out what weight to assign the other genes in here, so the seven other genes. So in the direct neighborhood method, basically what you do is you weight yourself according to the fraction of your neighbors that are query genes. So here, two-thirds of the neighbors are query genes. Here, one-half of the neighbors are query genes. So this is slightly larger than that. OK, but what this doesn't do is propagate information to things that aren't directly connected to the query genes. A disconnected network like this would never happen in GeneMANIA, but this is just for illustrative purposes for the algorithm. Yeah. What happened here? Oh, god. Sorry. I've been using this slide forever, and this is the first time anyone's noticed that. Oh. All right. Yes, it does. Yeah. Thanks for that. I'll fix it. Not right now, though. OK. All right. OK. And so this is not the algorithm that we use. This is the algorithm that most of the other systems use. It works actually quite well. We like to use a slightly different algorithm called label propagation. And I'll give you two reasons for why we do that. The first one is this idea that genes that aren't directly connected to the query genes but have indirect connections can still get weights.
And here, there's a disconnected component, so, OK, these genes shouldn't get any weight at all because they're not part of this. But these genes that are further away from the query gene should get smaller weights. The way the algorithm works is pretty straightforward. Basically, it iterates the weights. So each node looks at its neighbors and takes an average of the average of its neighbors and its own state. So this node is 0. And in the beginning, its neighbors are 0. So the average of 0 and 0 is 0. This node starts at 0. It's got a neighbor of 1 and a neighbor of 0. So this node gets the average of 0 and, I guess, a half. So this would be 0.25, right? And so direct neighborhood, that's the first iteration. So now we're going to redo this computation here. So now this node has neighbors that have non-zero values. So it's going to get a non-zero value. Now, in the next iteration, this node has neighbors that have non-zero values. So it's going to get a non-zero value. So the weight propagates through the network, or the label propagates through the network. So these are just details. So the direct neighborhood score depends on the strength of the links to the query genes and the number of query gene neighbors. Label propagation depends on these two things, but it also depends upon whether or not the node is in a cluster of nodes with the query gene. The next slide points this out in a little bit more detail: whether it has a number of shared neighbors with the query gene. And so what this allows is for indirect links to query genes to impact scores. So it often brings up clusters of nodes. And let me show you why that's important. So say this is your network here. What this network consists of is four separate clusters.
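The two scoring schemes just described can be compared on a toy network. This is a minimal sketch of the iteration as it is explained on the slide, not GeneMANIA's actual implementation: the graph is invented, links are unweighted, holding query genes fixed at 1 is an assumption, and a small fixed number of iterations is used (if this plain averaging were run to convergence, every node connected to a query gene would drift toward 1).

```python
def direct_neighborhood(adjacency, query):
    """Score each non-query node by the fraction of its neighbors in the query."""
    scores = {}
    for node, neighbors in adjacency.items():
        if node in query or not neighbors:
            continue
        hits = sum(1 for n in neighbors if n in query)
        scores[node] = hits / len(neighbors)
    return scores


def propagate_labels(adjacency, query, iterations=5):
    """Each node repeatedly averages its own state with its neighbors' mean."""
    state = {node: (1.0 if node in query else 0.0) for node in adjacency}
    for _ in range(iterations):
        new_state = {}
        for node, neighbors in adjacency.items():
            if node in query:
                new_state[node] = 1.0  # assumption: query labels stay fixed
            elif neighbors:
                neighbor_mean = sum(state[n] for n in neighbors) / len(neighbors)
                new_state[node] = (state[node] + neighbor_mean) / 2
            else:
                new_state[node] = state[node]
        state = new_state  # synchronous update: all nodes change together
    return state


# Toy network: a chain Q - a - b - c plus a disconnected pair x - y.
adjacency = {
    "Q": ["a"],
    "a": ["Q", "b"],
    "b": ["a", "c"],
    "c": ["b"],
    "x": ["y"],
    "y": ["x"],
}
query = {"Q"}

direct = direct_neighborhood(adjacency, query)
one_step = propagate_labels(adjacency, query, iterations=1)
few_steps = propagate_labels(adjacency, query, iterations=5)
```

After one iteration node `a` scores 0.25, the same arithmetic as on the slide (own state 0 averaged with a neighbor mean of one half). After a few iterations `b` and `c` pick up smaller, non-zero scores even though direct neighborhood gives them 0, while the disconnected pair `x` and `y` stays at 0.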
And so what's important in this case is that if you are in a module of genes that have similar function, often you're not only directly connected to nodes in that module, but you also have indirect connections to nodes in that module, because you share neighbors. And the number of shared neighbors you have with another node is often a better predictor of the likelihood that you share function with that node than a direct connection. So if you only consider direct connections, you can't incorporate that type of information. But here, because you're propagating the label around, what happens is in the first step you propagate that label to here, and then you propagate that label to here. And now this node, when it reassigns its label, gets information not only from the direct connection but also from the indirect connection, because you propagated there in the last round. So the links within this module reinforce each other during the label propagation. And so things tend to be more spread out. So what happens is that it pulls out modules, clusters of nodes, a bit more often. And you'll see that when you use GeneMANIA. OK. All right, so what do we have? We have a large, automatically updated collection of interaction networks. And this is what you should expect from any gene recommender system. We have a query algorithm to find genes and networks that are functionally associated with the query list. So again, gene recommender systems will do this for genes, but not for networks. And we have this interactive client-side browser with link-outs, so you can go and find out more information. And STRING does this very well too. And I'll show you STRING in a second. OK, so where do we get our data from? Right now we have the largest collection of data that's available. So if you go to Cytoscape and you look at where to get network information from, we have six times more links than anybody else does. And that's largely because we get data from everybody.
So all the physical interactions are collected together in something called iRef. This is an organization that shares physical interactions among themselves. So we get all their physical interactions. Plus, we add genetic interactions from BioGRID. We query the Gene Expression Omnibus every once in a while to find expression data sets that are large enough for us to build reliable co-expression networks out of. Then there's our data on what are called interologs. This is where we use information about whether or not the orthologous genes in another species physically interact with one another. That comes from I2D. We use InterPro to tell us about shared protein domains. We also collect some organism-specific data sets. So for example, if you know yeast, there's what's called the yeast GFP collection, which tells you about the subcellular localization of proteins. We have that information. We get our gene ID mappings from Ensembl. We get our gene annotations from the Gene Ontology. OK, and these are all the gene identifiers. This is our current status. So there's also a Cytoscape plugin, which you've installed. This is what it looks like. This is a bit more of an advanced thing, but we have something called QueryRunner. And what QueryRunner is all about is evaluating the added value of a network. So say you want to ask the question, what does this network that I've just generated add to global knowledge about gene function? The way that we answer that question is we say, OK, let's take all the network data that's available out there already. Let's try to reproduce all the gene annotations that have already been done. And let's see how well we do that, measured in this case using something called the area under the ROC curve. And then let's ask how much that decreases if we remove this network from our collection. So what's the added value of this new data, given all the available data?
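The added-value question can be made concrete with a small Python sketch. This is not QueryRunner's code or interface; it only illustrates the comparison described above. The function names `roc_auc` and `added_value` are hypothetical, the prediction scores are invented toy numbers, and the AUC is computed with the simple pairwise definition (the fraction of positive/negative pairs that are ranked correctly, with ties counted as half).

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def added_value(scores_all_networks, scores_without_network, labels):
    """Drop in AUC when one network is left out of the collection."""
    return roc_auc(scores_all_networks, labels) - roc_auc(scores_without_network, labels)


# Invented example: four genes, the first two truly carry the annotation.
labels = [1, 1, 0, 0]
with_all = [0.9, 0.8, 0.3, 0.1]       # predictions using every network
without_new = [0.9, 0.2, 0.3, 0.1]    # predictions with the new network removed
drop = added_value(with_all, without_new, labels)
```

A positive `drop` means the predictions get worse when the network is removed, so the network contributes something the rest of the collection does not. Repeating the comparison per annotation category would show whether the network helps for some types of function more than others.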
So you can use information like that to tell you about the quality, or the informativeness, of the interaction data set that you've generated, but also whether an interaction data set helps you predict some types of function better than other types of function. So we have tools to do that. So then STRING is another gene recommender system that has some functionality similar to GeneMANIA's. We focus more on genes. STRING's focus is more on proteins. And so here's a similar interface. You can put in the protein name. STRING automatically detects the organism, so it will ask you a question to confirm which organism you meant. You press Go. And if you do that, these are the results. I think I searched for interactors of RAD50. STRING uses a similar representation for the data sources that link things together. STRING has eight separate types of data. So the network is very similar to the one that you might see in GeneMANIA. They actually use our network display tool to do their network displays. And you can find out information about the predicted functional partners. This table here tells you how things are interacting. Here, this is a score that tells you how likely they are to functionally interact with the gene that you put in. And you can get more information by clicking on these various links. So what STRING does extremely well is help you track the source of an interaction. If you click on an interaction in GeneMANIA, you get the study that it came from, as you do in STRING. But we don't go into great depth for small-scale studies, that is, studies that report fewer than 100 interactions. So it's definitely worth trying out. The other thing that's different is the meaning of the weights on the links. So I gave a somewhat vague description of what the link weights mean for us. In STRING, they can be interpreted as the probability of a functional interaction. So what does that mean?
What that means is the proportion of pairs of genes having this link that share a GO annotation. So that has a very specific meaning. OK, STRING also has very large organism coverage. That doesn't mean they have experimental data on every organism, but a lot of their interactions come from some sort of phylogenetic analysis or some sort of gene fusion analysis. So the focus that STRING started with was that these types of information are very useful for prokaryotes. And I can't remember if anybody here is working on a prokaryote, on bacteria. But if you are, I'm happy to talk offline about why this would be useful for looking for interactions between genes. OK. Right, and so those are the links for GeneMANIA. The link for STRING is string-db.org. But if you put "STRING" and "functional" into Google, you'll find it.
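To make the "proportion of linked pairs that share an annotation" reading concrete, here is a tiny Python sketch. It only illustrates the interpretation given in the talk and says nothing about how STRING actually calibrates its scores. The gene names, annotation terms, and the function name `fraction_sharing_annotation` are invented for the example.

```python
def fraction_sharing_annotation(linked_pairs, annotations):
    """Fraction of linked gene pairs whose annotation sets overlap."""
    if not linked_pairs:
        raise ValueError("need at least one linked pair")
    shared = sum(
        1 for a, b in linked_pairs
        if annotations.get(a, set()) & annotations.get(b, set())
    )
    return shared / len(linked_pairs)


# Invented annotations and links, for illustration only.
annotations = {
    "g1": {"DNA repair"},
    "g2": {"DNA repair", "meiosis"},
    "g3": {"meiosis"},
    "g4": {"translation"},
}
linked_pairs = [("g1", "g2"), ("g2", "g3"), ("g1", "g4"), ("g3", "g4")]
probability = fraction_sharing_annotation(linked_pairs, annotations)
```

Here two of the four linked pairs share a term, so under this reading a link of this kind would carry a weight of 0.5.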