Hi, everyone. I'm back. Okay. So, welcome to the last day of the Pathway Workshop. It's nice to see everybody. Today I'm going to be talking about gene function prediction. Really what I'm going to be talking about is a specific tool called GeneMANIA. We, being myself and Gary Bader, designed the tool as a way of giving people access to all of the genomics, functional genomics and proteomics data that's been generated over the last 10 years for the model organisms, so that you can make use of it in your research without having to have a PhD in computer science. So how did we do that? First I'll tell you about the outline. The first thing to understand is that, in order to use all this data that's been generated, we need the concept of a functional interaction between genes. That concept remains the same across all the different types of data that have been generated over time, and we'll go through those different types of data. Once you have defined this concept of a functional interaction network, you can say things about predicting gene function, either for a single gene or for multiple genes. That's what we think this data is most useful for, and I'll show you some ways in which you can make use of the data in your own research. There are some concepts in gene function prediction that I want to quickly go over: guilt by association and gene recommender systems. Then I'm going to explain GeneMANIA and show you a short GeneMANIA demo. We have two things. We have a website for people to use, and that's the way I use it most of the time, because it's easy to use. You can do a quick query.
But if you want to do really hardcore stuff and analyze large gene lists, we also have a Cytoscape plug-in, which everybody has already installed on their machine. It has the same functionality as the website, and it can deal with slightly larger gene lists. The other thing is that it's integrated into the whole Cytoscape framework, so you can use other tools on networks that you generate using GeneMANIA. Okay. So we'll have a lab today, and we'll be using the Cytoscape plug-in to do an analysis. And then, finally, there are a few technical details that people always ask questions about that I'm going to explain. Those are the network weighting schemes; it will become clear what that means in a second. And then I'm going to introduce you briefly to another tool that does something similar to GeneMANIA, called STRING. Okay. So the question is, how do you use all this genome-wide data that people have been generating over the last decade or so in the lab? What makes this difficult is that there are all sorts of different types of data. There's genetic interaction data, there's protein-protein interaction data, there are pathways that people have generated, there's protein domain similarity, there's ChIP data, so protein-DNA interactions, and there's a whole lot of microarray expression data. They're all different data modalities. They also differ from a lot of data that people are used to working with in that they're a little bit noisy. There are false negatives, and there are false positive interactions. So how are you going to make use of that? The idea is simply that you take each one of these data sets and you define something called a functional interaction between genes. So what's a functional interaction? A functional interaction is evidence that two genes share some function or do similar things. Right?
Now that could be a little bit complicated, because genes are multifunctional, so they could be sharing different aspects of their function. But if you overlay all the different functional interactions together, what you should hope ultimately is that the genes that really do have a strong shared function will show functional interactions under a lot of different circumstances. That's essentially the concept that we're using to make use of this data. I'll make this concept a little bit clearer as I give the presentation. Okay. So one of the places the idea of functional interactions came from is one of the first large-scale microarray data sets. This is data that was published in yeast, and the analysis that I'm showing in the picture up here was from 1998. Essentially what they did is they took microarray expression profiles of yeast cells for all the genes, or for some subset of the genes, which are listed down the rows of this heat map, under a set of different conditions that represent different types of environmental stress. Then they just clustered these expression profiles. What they discovered was that genes with similar expression profiles were annotated with similar functions, which was a very nice discovery, because you can then use that information to say things about genes that you don't know the function of. One of the ways that you can represent that is as a network in which each gene corresponds to a node, and the weight of the link between two genes corresponds to the strength of evidence, in this case in the co-expression data, that they share function, which you might get from correlation.
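As a minimal sketch of the network representation just described, the snippet below links two genes when the Pearson correlation of their expression profiles is high and uses the correlation as the link weight. The gene names, the toy expression values, and the 0.8 cut-off are assumptions made for illustration; the talk does not say how links are thresholded.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression evidence
    return cov / sqrt(var_x * var_y)


def coexpression_network(profiles, threshold=0.8):
    """Return {(gene1, gene2): weight} for strongly correlated gene pairs.

    `threshold` is a hypothetical cut-off chosen for this example.
    """
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges[(g1, g2)] = r
    return edges


# Toy expression profiles (rows of the heat map), one value per condition.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],  # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
}
network = coexpression_network(profiles)
```

Here only geneA and geneB end up linked, because geneC is anti-correlated with both of them.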
And then the idea is simply that if you find a cluster of nodes that are all very strongly linked together, three of which you know the function of, maybe you should say this other one is cell cycle, based on this observation. And down here there are three that you know the function of, so maybe this gene is involved in protein degradation. Now you wouldn't necessarily say this from one microarray data set alone, but if you see the same thing over and over again, that starts to give you confidence that what you're looking at is real. And it's certainly a way of generating hypotheses about gene function. Okay, so when I talk about function prediction and interacting with all these data in this way, there are actually two types of questions that you can ask. The number one question is: what does my gene do? I have some gene; it's shown up in some screen, for example, or maybe it's got a SNP that's associated with some disease that I care about. Can I find out anything about its function by seeing what other genes it has a lot of functional interactions with? And there's another type of question, which comes up if you're trying to, say, set up a small-scale screen: I have a list of genes that I know have some shared function, say they're all involved in the Wnt signaling pathway. Can you find me more genes like these genes? Those are two different types of questions. Both of those types of questions you can ask of the GeneMANIA interface and of other what I call gene recommender systems. Okay, so what does my gene do?
The way this analysis works is that you take all the network and profile data that's been generated over the last few years and put it all together in some way. You have a query list that consists of the one gene that you care about. You put this query list together with the network and profile data, you find genes that have a lot of functional interactions with this single gene, and then you take that set of genes and do an enrichment analysis to find out what functions or pathways are enriched in the genes that this one is interacting with. What I'm showing here is a network that was actually generated on the GeneMANIA website (you can generate similar networks in the plug-in), and I've just colored nodes by the most enriched function. So here this is whatever that says. Okay, so then the second type of question. This is great and everything, but genes are multifunctional and you might be asking different things about what the gene does. You might want to know what biological process it's involved in, or you might want to know what its biochemical function is, and those are sometimes two different things. Sometimes you might want to know where it's localized in the cell, which does tell you something about its function, but it's not exactly the same thing. You can think of this in terms of how the Gene Ontology splits up gene function into biological process, molecular function, and cellular component. But we also know of a lot of genes that have a lot of different functions. So you can think about this in the following way. I came up with this when I was giving this talk in Memphis, Tennessee. If you just say Memphis, and you say give me more cities like Memphis, nobody knows how to answer that question until you give them a little bit more context, right?
So if I say give me more cities like Memphis, Knoxville, and Nashville, well, you have these other two cities which are in Tennessee, believe me. But if you say Memphis, Alexandria, and Cairo, then you're talking about Memphis in Egypt. So your query list can define the question that you're asking. Give me more genes like these: the query list allows you to ask a more specific question. And again it's the same sort of thing. You take all the network and profile data, you take your gene list, you plug it into the gene recommender system, and it gives you the network that connects these genes together. It also gives you genes that are well connected to this gene set. But now that you've defined the query list, GeneMANIA, or the gene recommender system, can do something a little bit different. It can figure out what type of data is most relevant to the question that you're asking. If you just give it one gene, it doesn't know what question you're asking, so it has to weight all the data in a way that doesn't depend upon which gene you give it. But if you give it a list, it can say, in this case, microarray data is more relevant than physical interaction data. Let me give you a more specific example. Say you're interested in the biological process that a gene is involved in, and that gene responds to salt stress. Possibly the most relevant data for figuring out which genes respond to salt stress is a microarray time series that stresses the cell using salt. Presumably some of the genes that are already known to respond to salt stress will respond in that experiment, and other genes that respond to salt stress will have expression patterns correlated with those. So in that case, the microarray data on salt stress might be the one that comes up as the most relevant to the question.
Now on the other hand, if you're interested in what protein complexes the gene is involved in, presumably the physical interaction data is going to be the most relevant to that question. So in this made-up example, if your query list consists of genes that respond to salt stress, the gene recommender system can ask what it is about this query list that is the same, and it might be able to identify the fact that they're all co-expressed in this one data set. Or if the query list instead consists of parts of a protein complex that you're interested in, it has the ability to identify that what's similar about the genes in the query list is that they're all connected in a physical interaction data set. So GeneMANIA does both of those things. Essentially, and I'll show you in a second, you first say, okay, these are the interactions that I think might be relevant, and then of those interactions, GeneMANIA determines which ones are most relevant. You can control that, but you don't need to. Of course, if you provide more information about what you think is relevant, it makes its guessing task easier, so it's more likely to get it right. Okay. All right, so again, here's the interface. I guess this means that this is where I'm supposed to stop and give you an example, so that's what I'll do. Okay, so here's the website. You just type in GeneMANIA. Right now we provide this for eight different model organisms, so you can just choose the model organism that you're interested in. Let's say we're interested in yeast for some reason. In this box, you type in your gene list. You can also copy and paste over from Excel. I don't want to type in the whole gene list, so I can just type in genes, and it looks them up, and the check mark tells you whether or not the gene is actually in the database.
We do our best to recognize the gene identifiers that people use. We don't recognize probe set identifiers, but we do try to recognize all the gene symbols. We can find Entrez Gene IDs. We can find Ensembl IDs. There are a lot of different gene IDs. But the way we do it, the identifier has to be unique. It has to refer to only one gene, and there's a surprising number of identifiers that aren't unique. The other thing is that it has to refer to a protein-coding gene at this point. It can't refer to a pseudogene. It can't refer to a gene whose status is uncertain. It can't refer to a non-coding RNA yet. We're working on that, but right now we only recognize genes that are confirmed to be protein-coding. Okay, and to make your life easier, you can press this example link, and it gives you an example gene list. There's also an advanced options panel that allows you to select which networks and what network weighting to use, and we're going to go over that in the talk in a second. I'm going to close that and ignore it for the time being, just to give you an example of an analysis. So now I'll press the Go button. Because I gave it a gene list, what GeneMANIA is doing is taking all the yeast data that I said is relevant (we have a default set of networks that we use) and asking which networks are most relevant to the specific question I'm asking with the gene list. It answers that question by finding the networks where the genes in this list are highly connected to one another, but not really connected to other genes outside the list. Then it weights those networks based on this measure. The networks are grouped into different types of data. These are genetic interactions, these are physical interactions, and this is "other", so there are other types of data here. These are genes that often appear together in PubMed abstracts. These are genes that respond similarly to chemicals.
You can find out more information about where a network comes from by opening up this box and clicking through to PubMed, and these weights reflect the relevance of the networks to your genes. Okay, so now here's your network. The black nodes are the genes that were in the query list, the gray ones are the other genes that are most connected to those, and if you open up the functions tab, it has already done a Fisher's exact test for you to find Gene Ontology biological process categories that are enriched in this gene list, and you can color the genes according to these categories. All right, so that's the basic way that the interface works. You can also find out more about the specific genes in the list. You can go through here, and this tells you all the enriched functions, and by hovering over the little box, it tells you the name of the specific function. So these are all the enriched functions that this gene is assigned to. These are enriched functions that that gene is not assigned to, but that are enriched in the rest of the list. The interface is designed to be fairly intuitive, so you can play around with it, and often it works the way you want it to. Okay, so let's go back and I'll tell you some specific things about the way that the interface works. The first thing I showed you was the advanced options panel that you can get from the query page. I opened it up and then I shut it quickly. The advanced options panel is for when, to answer your question, you want to look only at certain types of interactions. In that case, you have the interactions grouped by different categories. There's co-expression, there's co-localization, there are genetic interactions, physical interactions, and so forth. There's also this new type of interaction that we put up there called attributes. Those are the pathways that the gene is assigned to.
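The enrichment step mentioned above can be sketched as a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, and the talk does not describe GeneMANIA's choice of background set or any multiple-testing correction, so none is shown.

```python
from math import comb


def enrichment_p_value(n_list_in_cat, n_list, n_cat, n_total):
    """One-sided Fisher's exact test for over-representation.

    Probability of seeing at least `n_list_in_cat` genes from a category
    of size `n_cat` in a list of `n_list` genes drawn at random from a
    background of `n_total` genes (hypergeometric upper tail).
    """
    max_overlap = min(n_list, n_cat)
    total_ways = comb(n_total, n_list)
    tail = sum(
        comb(n_cat, k) * comb(n_total - n_cat, n_list - k)
        for k in range(n_list_in_cat, max_overlap + 1)
    )
    return tail / total_ways


# Toy numbers: 5 of the 5 genes in the result list fall in a category
# that contains 5 of the 10 background genes.
p = enrichment_p_value(n_list_in_cat=5, n_list=5, n_cat=5, n_total=10)
```

A small p-value means the category is over-represented in the list relative to the background.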
And you can use those pathways to try to predict more genes like those. Those pathways themselves were actually collected by Gary's lab as part of the pathways they make available for analysis through enrichment maps. You open the advanced options panel by clicking that phrase here in the query screen. You can also get to it from the network display. There are also pre-selected sets of networks. There's a default set, which is essentially all the interactions that we trust and some proportion of the microarray co-expression interactions. We don't include all of them because often they give redundant information. Or you can choose all networks, or you can choose none of the networks. Okay. Clicking the checkboxes turns all the networks in a specific category on or off. The fraction here tells you how many you've selected. If you click through here, that opens up and displays all the networks. Let me, maybe this will be easier to see if I... So here I just opened the advanced options panel from the network display, and here are all the different networks. You can turn the networks off, so now zero of the one network in co-localization is selected. You can turn all the networks on, or you can open this up and click specific networks on or off. If you click here, that tells you where the network came from, so you can click through to read the PubMed listing. It also tells you the source of the information. And we also tag the networks based on keywords that were assigned to them in PubMed. So you have a lot of choices here. And this is the attributes category. These are InterPro attributes. Let's choose human. Let's use the example. In human, we have a lot more of these attributes. So these are the pathways. You can look at drug interactions. You can look at microRNA target predictions. Just have fun with that.
Okay. So I've already gone through this by showing it to you on the interface itself. This is just for your notes so that you see how this works. If there are any questions about interacting with the website, maybe I could briefly answer them now. And then I'm going to talk to you about an aspect of the interface which you need to understand, which is how the networks are weighted. Yeah. So you want to upload two different lists? Yeah. I would really like you to be able to do that, and this is something we're actually proposing to do. We're asking for funding to expand it in that way so that you can provide a background list. It's extremely difficult to do that with the web interface. You could try to do that with the Cytoscape plug-in by manually removing genes or overlaying genes in some way. I cannot tell you how to do that myself, but if you email me, there are people in the lab who can tell you how to do that. Veronica, do you know how to do that? Oh, right. Okay. Yeah. That's a great idea. Yeah. Okay. So every functional interaction is associated with a weight. If it's co-expression, that weight is derived from the Pearson correlation coefficient. When we compute the final network, we assign a weight to each functional interaction. The weight of the interaction in a network, like the co-expression network, let's say it's five, is multiplied by that network's percentage weight, let's say it's 50%, so that becomes 2.5. And then we add up the weights from all the networks that are included to get the final functional interaction weight. Exactly. The composite network. Okay. More questions? Okay. So now I'm going to explain to you how these weights that you see are computed. The reason I'm going to explain this is that you actually have some choices that you can make in the interface, and it's important for me to explain how to make those choices.
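The arithmetic for the composite network can be written out as a short sketch. The co-expression numbers (an interaction weight of 5 and a network weight of 50%) are the ones from the talk; the physical-interaction numbers and the function name are assumptions added so that the sum has more than one term.

```python
def composite_weight(interaction_weights, network_weights):
    """Weighted sum of one gene pair's interaction weights across networks.

    interaction_weights: {network name: weight of this gene pair's link}
    network_weights:     {network name: fraction assigned to that network}
    """
    return sum(
        network_weights[name] * weight
        for name, weight in interaction_weights.items()
    )


# One gene pair, seen in two networks.
interaction_weights = {"co-expression": 5.0, "physical": 2.0}
network_weights = {"co-expression": 0.50, "physical": 0.25}

# 5.0 * 0.50 + 2.0 * 0.25 = 2.5 + 0.5 = 3.0
final_weight = composite_weight(interaction_weights, network_weights)
```

A network the pair does not appear in simply contributes nothing to the sum.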
So here's the idea. You have three different types of networks in this case. You have to assign them all a weight, which reflects how much they're going to be used to generate the final network. How do you determine what those weights are? The simplest possible thing that you can do is assign them all equal weights. You can say, we don't know one way or another which network is better and which is not. The problem is, first of all, that not every network is equally useful for every question. If you've given me more information about what question you're asking by giving me a gene list, obviously I should up-weight the networks that are most relevant to that gene list. On this slide I've explained that by saying that gene function could be a whole bunch of different things. The other thing that can happen is that it's not only a question of relevance to the question: some networks end up being redundant. What does that mean? Well, often what happens, especially with microarray data, is that different labs are doing the same assay at the same time. If you get 10 data sets that are all querying one aspect of function for a set of genes, you don't want to assign each of them the same weight as a different data set that queries a completely different function, because then all your results are going to be focused on that one specific function. If you just assign equal weight to every network, you're sensitive to that. So we have two rules for network weighting. I mean, we use equations, obviously, but the two general rules are that a network should be relevant to the question that you're asking, and it shouldn't be redundant with the other networks that are already there.
And the way in which we assign those weights takes both of those things into consideration. We use a specific type of linear regression to assign the weights, and the weights it assigns reflect both relevance and redundancy. So again, if you give us a long list, we can come up with query-specific weights that reflect which networks are relevant to answering the question. Now by default, if you don't do anything, GeneMANIA selects between two different ways of doing the network weighting. One is query-specific, but it only uses that if your list is sufficiently long. I think it has to be five or six genes. The reason is that if the list is short we just don't have enough information, so we're going to make bad calls. We still might make bad calls if it's five or six genes, but as the gene list gets longer you can become more certain that the query-specific weights reflect what you need. The other way, which is used if the gene list is short or if you only give us one gene, weights networks based on their ability to recapitulate the co-annotation patterns in Gene Ontology. By that I mean that if a network tends to link together genes that are assigned the same function, it gets a greater weight, as long as it is not redundant with some other network. So essentially you can think of the Gene Ontology-based weighting like this: we take all the gene sets in Gene Ontology, we compute the query-specific network weights for each of them, and then we average them together, so that the weight just reflects overall predictiveness for biological process annotation. Does that make sense? By default we use biological process because that's what most people are asking about, but you can switch over to molecular function or to cellular component.
And then if for some reason you don't want to use any of these weighting techniques, there's the last category, which is equal weighting. You can equal weight by data type, so you can say that the weight assigned to all the co-expression networks in total is going to be the same as the weight assigned to all the co-complex networks. So if there are 10 co-expression networks and one co-complex network, each of those 10 will get one tenth of the weight of the one co-complex network. Or we can just assign equal weights to all the networks. So those are your options in terms of network weighting. If you're not happy with the automatic choice that we've made, you can change that choice. Okay, the last little bit of explanation I'm going to give you is, once we have the network, how we decide which genes are the ones that interact the most with those in the list. Let's say, for example, this is the network that we've come up with. It's not a very interesting looking network, but this is the network. These are the genes that are on the query list, so those are the genes that define the question that we're asking, and I'm using color to reflect how highly the genes are scored. Okay, so there are two main ways of scoring nodes. One is that for each of the genes you look at its neighbors, and you set its score to be the average of the scores of its neighbors. So what does that do?
Well, that means that this gene, which has one neighbor that's in the original list, is going to score less than this gene, which has two neighbors in the list. And these genes get no score at all, because they have no neighbors in the list. But what this type of scoring doesn't reflect is that this group of three genes is not connected to anything in the list at all, while these genes down here have indirect connections to things in the list. Under some circumstances you might even have two genes that aren't directly connected but have a whole lot of indirect connections. By indirect connection I mean that there is a path of length two between the two nodes. So this gene here is indirectly connected to this gene, it's indirectly connected to this gene by a path of length two, and it's indirectly connected to that gene. We want to reflect the fact that these two genes, even though they're not direct neighbors, are in some sense closer in the network than those genes over there, where there's no path from the genes in our query list. These things are important if, say, you're defining protein complexes, because there are a lot of false negatives in these functional interaction networks. If you have a protein complex, what you'd hope is that everything in the complex is connected to everything else by direct physical interactions, but often it's not. Often, though, the members share a lot of neighbors: they interact with the same sets of genes. And sharing a lot of neighbors is often a better indicator that you share a function than a direct interaction between yourself and one neighbor. Does that make sense? Okay, so the method that we use to score genes once we have a network is called label propagation, and essentially what it does is reflect the difference between these two things. You can think about it as follows: you first compute the direct
neighborhood, you score the genes in that way, and then you continue re-averaging the genes. So now you look at this gene and take the average of its neighbors, and it's going to be a bit smaller. You look at this gene and take the average of its neighbors, and it's going to be a bit smaller. None of these genes is connected by any path to these red genes here, so they're never going to get any score at all. You can think of it as heat flowing through a network with some heat sinks that cause you to lose label in some way, or you can think of it as paint running through a network. Okay, you don't have to understand the algorithm itself. The important thing to understand is that it can distinguish between situations where genes are just indirectly connected to the query list and those where there are no connections at all, and it scores genes highly if they have a lot of indirect connections to the query list. Yes, exactly. Yeah. So the problem with a lot of these interaction networks that people define is that there are false positives, interactions that aren't real but get detected. There are few of those; they're not that bad. But there is a big problem with false negatives, interactions that should have been detected but weren't. There might be another network that gets uploaded at a later date that actually has that direct connection, or maybe that direct connection never gets observed by anybody. But various people have noticed that when you have these false negatives, interactions that you should have detected but didn't, you can often find signatures of them in the fact that the two things that should have been connected share a lot of their neighbors. And that's what we're trying to take advantage of with this algorithm. Precisely. Do you have an example?
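The re-averaging idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not GeneMANIA's implementation: the extra 1 in the denominator is a stand-in for the "heat sinks", query genes start with label 1 and all other genes with 0, and the function name and toy network are hypothetical.

```python
def label_propagation(edges, query_genes, n_iter=200):
    """Score genes by repeatedly re-averaging over weighted neighbors.

    edges:       {(gene1, gene2): weight} for an undirected network
    query_genes: genes that define the question (initial label 1.0)
    """
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w

    labels = {g: (1.0 if g in query_genes else 0.0) for g in neighbors}
    scores = dict(labels)
    for _ in range(n_iter):
        updated = {}
        for gene, nbrs in neighbors.items():
            incoming = sum(w * scores[n] for n, w in nbrs.items())
            degree = sum(nbrs.values())
            # The "+ 1.0" leaks some score at every node (assumed heat sink).
            updated[gene] = (labels[gene] + incoming) / (1.0 + degree)
        scores = updated
    return scores


# Toy network: q1 and q2 are the query genes.
edges = {
    ("q1", "a"): 1.0,   # a has one neighbor in the list
    ("q1", "b"): 1.0,   # b has two neighbors in the list
    ("q2", "b"): 1.0,
    ("a", "c"): 1.0,    # c is only indirectly connected (path of length two)
    ("x", "y"): 1.0,    # x, y, z have no path to the query genes
    ("y", "z"): 1.0,
    ("x", "z"): 1.0,
}
scores = label_propagation(edges, {"q1", "q2"})
```

In this toy network b scores higher than a, a scores higher than c, c still gets a non-zero score through its indirect connection, and the disconnected cluster x, y, z stays at zero.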
I mean, I can give you a citation. Can you email me after, and then I'll just give you some citations? Yeah, so I'm saying two things. Genes that have more indirect connections score more highly than genes that have few indirect connections. And the other thing I'm saying is that, under some circumstances, if you have a whole lot of indirect connections, you can score more highly than something that only has a single direct connection. Great. There are some details here which I've already explained but which might help you with your note taking. The take-home is that because the label propagation we use allows indirect links to query genes to impact the scores, often what happens is that it will pull up a whole cluster of nodes, because clusters of nodes that have a lot of interconnections also have a lot of indirect connections with one another. Okay, so here's a label propagation example. Here's a network. Before you run label propagation it looks like this, and after you run label propagation it looks like this. You can see that these genes that all have a lot of connections with one another strengthen each other's scores, so these are all fairly high, but there are not very strong scores for the other genes. This gene here is in the original query list. And that's important because there are a lot of genes in these networks that are called hub genes, and a lot of hub genes are very multifunctional. p53 is a perfect example of this. It's linked to a lot of different functions, so if you're linked to p53 you don't want to inherit all the functions that p53 has. You're linked to p53 because you share some aspect of its function. So by also relying on these indirect connections, you can distinguish between direct connections that occur simply because you're linked to one of these pleiotropic genes that have a lot of functions, and the specific functions of genes that have a small number of functions. Okay, so GeneMANIA is
three things. It's a large, automatically updated collection of interactions and networks, and that's what a lot of people use GeneMANIA for: they use it to actually download the networks that connect their gene set together. Right? So even if you don't want to use gene function prediction, you can still use GeneMANIA for that. It's a query algorithm that finds genes and networks that are functionally related to your gene list. Okay. And it also has this interactive client-side network browser with extensive link-outs. I love using the website, because what I do in talks is, when people mention a gene, I type it into GeneMANIA and I try to guess what they're going to say the gene function is, and I can go through and link through and see all the other genes that it interacts with. It's fun.
Okay. And so, where the data comes from. We collect data from a lot of different data sources. Right now we have almost 2,000 networks in the GeneMANIA website and in the plugin. Our co-expression data comes from the Gene Expression Omnibus. We download all the gene expression data sets that come from a platform that we recognize and that have a sufficient number of samples in them that we're confident that the co-expression values that we compute are relatively accurate. We get genetic and physical interactions from BioGRID. So what iRefIndex does is this: people who either generate large sets of physical interactions, or who go through and curate physical interactions reported in papers, put their physical interactions into something called iRef, which is an overall body to kind of control the annotation of the physical interactions that they generate. And the downloads, we try to do them about quarterly, about once every three months, but often we end up doing it about once every six months. Okay. We also get predicted interactions. These are what are called interologs: interactions that have been observed between orthologs in a different species. We also look at
shared protein domains, and recently we've added attributes, and these are compiled by Gary Bader's lab. We also have some organism-specific databases that we use. And we get our gene annotations from Gene Ontology. We ignore electronic annotations because those are unreliable.
So one thing that I actually haven't talked about: if we're missing a network that you want, you can upload that network to our database. Uploading the network is dead simple. You just put it into an Excel spreadsheet, and each row is the two genes that interact. So say you have a hundred interactions: you have a hundred rows, and in each row you have the gene identifiers for the genes that interact. If you want, you add another column, which is the interaction score. If you don't have a score, we assume that the interaction strength is the same for all the genes. You've got to output that Excel spreadsheet, and then you use that network in exactly the same way that you would use any of the networks that are already in our database. These are things you can do on the website. You can also do them with the Cytoscape plugin, and you have a lot more functionality in the Cytoscape plugin, so you can upload a series of networks, or you can make your own network databases, for example.
Okay. One more thing I want to tell you about is gene identifiers. We try to recognize all the unique gene identifiers that we can. Here's a list of the ones that we get. The problem, of course, is that a lot of people call genes by names that are synonymous with other genes, and it's terrible. There's something called SMG, which refers to two distinct genes in Drosophila; one is called smog and I can't remember what the other one is called. There are a lot of situations where you're using an alias for a gene that's not unique. And because we want to simplify the gene input process, we don't ask questions to try to figure out which sense of the gene you mean. What we do is identify cases where the alias is non-specific, and then under those circumstances
you can go and try to look up another name for the gene that is unique. But besides that, we try to recognize any identifier that people use, so you don't have to do this identifier mapping step. Okay. And the other thing is we use the Ensembl database, which we mirror, but again, we update about once every three months, every six months, so it's possible that we're not going to recognize a gene that's in the Ensembl database because it's a new gene. If that happens, just email us to let us know about it, and we'll try to do something for you.
Okay. So currently we have eight organisms. Which one is missing? Maybe we only have seven right now. Yeast. All right. And we have about 2,000 networks, and we have the network browser. We're going to look at the Cytoscape plugin, though I encourage you to fool around with the network browser as well. The other thing that we have is a lot of offline command-line tools, for example to build up network databases for another organism, and people have certainly done that. Currently, once you build up those network databases for different organisms, you can access them through the Cytoscape plugin. What we're trying to do now is make it easier for you to set up those network databases and make your own GeneMANIA instance for that organism. All right. But again, we're waiting for funding on that, and it would happen maybe in the next year or two. What we also have done is set up instances for people that we've collaborated with on specific organisms. So far we've set up instances that we're working on for, like, cricket. Gary set up an instance with a group at York, and I can't remember what organism; it's some single-celled eukaryote. Which? Yes, yes. And so there's that. Currently we're just about to release GeneMANIA for E.
coli, which is surprisingly hard to set up. And we also have tools, one of which is called QueryRunner, where you can evaluate how much your new network contributes to overall knowledge about functional interactions, or you can make a series of gene function predictions for, for example, all the GO categories. You can run these things offline and then assess them through cross-validation.
Okay. So our major competitor is the STRING database. They've actually been around for about seven years longer, and STRING has much of the functionality that I've talked to you about already. The functionality that they don't have is different ways of weighting networks, and their focus is slightly different from our focus. But one thing that's very good about STRING is that they cover a whole lot of different organisms. Right? So if the organism you're interested in isn't one of the special eight, STRING can probably help you.
Okay. And so these are the types of networks that you get out from STRING. I think my query gene here might have been RAD50. Right. You can see they have kind of a nice interface. They actually now use the network display tool that we developed for GeneMANIA, or at least the Cytoscape Web plug-in that we developed. And then, if there's a structure for the corresponding protein, you can click through and see that structure. You can also see that, like ours, the color of the links indicates the source of the evidence. And they're much better at tracking what the individual evidence is for individual interactions. So you click through those interactions, and in many cases you can click through to the specific paper that reported the interaction, even if that paper only reported one interaction. We can provide that to you if the paper reported a hundred or more interactions, but for single interactions that were reported by a paper from 20 years ago, we're
not going to be able to help you with that. You know, they also give you the predicted functional partners for a gene. I don't think they do GO enrichment yet, but they continue to work on it. They have a much better representation of the text mining, so they're much better at recovering proteins whose names have been co-mentioned in PubMed abstracts. So they're very good for looking up what's going on in the literature with individual genes. The thing to know about STRING is that they're protein focused; they're not focused on genes so much as on proteins. We're focused on genes, so, for example, we include things like genetic interactions. We're a lot more permissive about the types of interactions that we permit. Because we have a very flexible way of doing network weighting, we can do a lot of things on the fly; any type of functional interaction we can incorporate. STRING is a little bit more specific about the types of functional interactions that they allow, and that might give them maybe a little bit more, I don't know. And so what they do is they come up with eight pre-computed networks for each organism. These networks are a combination of all the networks together. We have a little bit more flexibility, in that you can turn on or off the networks, out of the database of thousands, that you want to be included in the query, so you have a little bit more precision there. The last thing that's a little bit different is that they use direct interactions to score nodes, so they don't do this label propagation that I've been talking about with the indirect connections. But nonetheless, there's no reason not to try the same query in both and see what you get, because the types of things that we report are often complementary.
So, E. coli. Hopefully that's coming soon. You can find it, actually, if you want to play around with it: you just go to the beta site, beta.genemania.org. E. coli is actually up there right now. So if you like E.
coli, you can fool around with it. Some things we don't include; that has to do with the fact that I don't think they contain a lot of functional information yet. And we want to incorporate more phenotype information, so disease associations of genes. And right now we rely on other people to do our orthology mapping between organisms, and as time goes on we're going to start doing a lot of that orthology mapping ourselves, so we can map orthologous co-expression relationships, for example. Okay. So these are the URLs.
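For note-taking purposes, the label propagation idea from this talk (scores that take indirect connections to the query genes into account) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy gene names, the unit edge weights, the function name `propagate_labels`, and the particular formulation, which solves (I + L) f = y on a weighted network by simple repeated averaging. The talk does not give the exact equations, normalization, or bias terms that GeneMANIA uses, so this should not be read as its implementation.

```python
def propagate_labels(edges, query_genes, iterations=200):
    """Toy label propagation (an illustrative assumption, not GeneMANIA's code).

    Repeatedly applies
        f[i] = (y[i] + sum_j w[i][j] * f[j]) / (1 + degree[i])
    which converges to the solution of (I + L) f = y, where L is the
    graph Laplacian and y is 1 for query genes and 0 for all other genes.
    """
    # Build a symmetric weighted adjacency map from (gene, gene, weight) rows.
    neighbors = {}
    for gene_a, gene_b, weight in edges:
        neighbors.setdefault(gene_a, {})[gene_b] = weight
        neighbors.setdefault(gene_b, {})[gene_a] = weight

    labels = {g: (1.0 if g in query_genes else 0.0) for g in neighbors}
    scores = dict(labels)
    for _ in range(iterations):
        # Each new score is computed from the previous round's scores.
        scores = {
            g: (labels[g] + sum(w * scores[o] for o, w in links.items()))
            / (1.0 + sum(links.values()))
            for g, links in neighbors.items()
        }
    return scores


# Hypothetical network: "q" is the only query gene.
# "x" reaches q through three intermediates; "z" through only one.
toy_edges = [
    ("q", "a1", 1.0), ("q", "a2", 1.0), ("q", "a3", 1.0),
    ("a1", "x", 1.0), ("a2", "x", 1.0), ("a3", "x", 1.0),
    ("q", "b", 1.0), ("b", "z", 1.0),
]
scores = propagate_labels(toy_edges, query_genes={"q"})
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

In this toy network neither `x` nor `z` touches the query gene directly, yet both receive a nonzero score, and `x`, which has three indirect routes to `q`, ends up above `z`, which has one. A method that scores nodes only from direct interactions would give both of them nothing.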