we can go and look at the output that we got. Here, the black genes are the ones that were originally in the list. The links are colored according to the data source that supports them. Some links are supported by both physical interactions and co-expression: these two genes have been shown to physically interact in all these different studies, and they are also co-expressed in this one study. You can tell that this line is thicker than this one because there's more evidence for physical interaction than for co-expression.

Over here we see the relative weights that have been assigned to the different data sources. Co-expression gets about 80% of the weight and physical interactions get about 20%. We can look further down at the weights of individual co-expression studies, and just by hovering over one of them I see the links from that study. Don't worry about following along with this, because we're going to have a lab about this tool anyway, an assignment that takes you through it. The point I want to illustrate about the weighting is that these weights are assigned based on how well each network connects together the genes that were originally in the list, which are the black boxes.

Now you can go through and look for enriched functions. It turns out this is a list of genes that are involved in DNA recombination. The nodes that are red with the dashes through them are the ones involved in DNA recombination that were also originally on the list. The smaller nodes are the ones that GeneMANIA has added, and these are other genes that are also involved in DNA recombination. This one is a bit more specific: they're involved in double-strand break repair, and that function shows up in there as well.
Okay, so that's to show you where the weights come from. Here I consider a lot of other different types of data. This is the gene list from the previous slide, and these are the weights that were assigned to all the various sources of information for this gene list. The black nodes, or the nodes that are colored with the bars through them, are the ones in the original list, and everything else is a recommended or related gene that GeneMANIA has found.

I showed you opening up the advanced options panel to select the networks. Sorry, this is still part of the answer to your question: how the networks get selected in the first place. If you don't select a network, it gets assigned zero weight. Once you select a network, GeneMANIA assigns it a weight either based on your gene list or based on some other measure of the relevance of that data source. That other measure is how well the network reproduces known annotations for genes. If you take all the pathway databases and the GO annotation databases, and you assign weights to networks so that you do a good job of reproducing those GO annotations, that's how the weighting gets done, for example, for single genes. So I've told you about two weighting schemes, and you can actually select from a number of other weighting schemes if you go down here to the network weighting. I'll explain that later in the presentation.

Okay, so you saw me go through this. I clicked this open to get to the advanced options. This is the network panel. I clicked on check boxes to select all or some of the networks. This fraction here indicates what proportion of the networks is selected, and if you click through on it, as you saw, you get down to the individual networks.
So here, if you click through on co-localization, there's one co-localization network that we have in yeast. This is a famous study that does protein localization based on GFP-tagged proteins in budding yeast, and you can click it on or off to decide whether to include that data.

As I said, you can come up with a single network that combines all the data sources in a fixed way. You could do that just by adding them together: the strength of the link between two nodes is then just the sum of the strengths of the links in all the data sources where a link has been measured between those nodes. That's a surprisingly effective thing to do. But you can also assign each data source a weight, as I said before, based on how well it matches the co-annotation patterns that you see in Gene Ontology. That's what most people do, and that's what GeneMANIA does. And, as I said before, if your list is long enough, you can weight networks based on how well they reproduce that list, and that's what I'm calling the context-dependent network.

Basically, by asking how well the network reproduces the gene list, you're setting weights based on two rules. First, is the network relevant for this function? Meaning, are the genes in the query list more often connected to one another than to other genes? Second, is this network redundant? Meaning, are there two networks that essentially have the same set of connections? If there are, you shouldn't assign them collectively more weight than a single version of that network would get. So if we had two networks that were exactly the same, the weights they get assigned should add up to the weight that the single network would get just by itself. Right?
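The two rules above, relevance and redundancy, can be illustrated with a small least-squares fit. This is only a sketch under stated assumptions, not GeneMANIA's actual weighting code: the function names, the toy networks and the choice of target (1 for a pair of query genes, 0 for any other pair) are all invented for illustration. It relies on the fact that `numpy.linalg.lstsq` returns the minimum-norm solution, so two identical networks split the weight that one copy would have received.

```python
import numpy as np

def make_network(n_genes, edges):
    """Build a symmetric 0/1 adjacency matrix from a list of gene-index pairs."""
    net = np.zeros((n_genes, n_genes))
    for i, j in edges:
        net[i, j] = net[j, i] = 1.0
    return net

def fit_network_weights(networks, query_idx):
    """Hypothetical helper: least-squares weights so that the weighted sum of
    networks approximates a target that is 1 for query-query gene pairs and
    0 for every other pair."""
    n_genes = networks[0].shape[0]
    rows, cols = np.triu_indices(n_genes, k=1)   # visit each gene pair once
    in_query = np.zeros(n_genes, dtype=bool)
    in_query[list(query_idx)] = True
    target = (in_query[rows] & in_query[cols]).astype(float)
    design = np.column_stack([net[rows, cols] for net in networks])
    # Minimum-norm least squares: identical columns end up sharing one weight.
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights

query = [0, 1, 2]                                # toy query list (gene indices)
net_a = make_network(5, [(0, 1), (0, 2), (1, 2), (2, 3)])  # links the query genes
net_b = make_network(5, [(0, 1), (0, 3), (1, 4), (3, 4)])  # mostly irrelevant

w_single = fit_network_weights([net_a, net_b], query)
w_dup = fit_network_weights([net_a, net_a.copy(), net_b], query)

# Fixed combination: the weighted sum of the individual networks.
combined = w_single[0] * net_a + w_single[1] * net_b
```

In this toy case the relevant network `net_a` receives most of the weight, and when it is supplied twice the two copies each receive half of what the single copy got, so the duplicated data source is not counted twice.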
So you want to make sure that you're not over-weighting redundant networks. The reason this is important is that it's really easy to generate networks from co-expression studies. Co-expression studies are cheap, there are thousands and thousands of them in databases, and often they provide redundant information. Often, if you make a co-expression network, what you find is that genes involved in growth or cell division get linked together very strongly. That's because one of the first things that happens when you start perturbing cells is that you change how quickly they're growing. So when you look for differential expression, or for genes that respond in the same way to the perturbations, you're looking at genes that are involved in the same growth functions. That's going to be your strongest signal. There are other signals in co-expression networks too, but you want to make sure you don't spend all your time reproducing this growth signal over and over again. So you have to remove this redundancy between the networks.

In GeneMANIA, we automatically select which of these weighting schemes to use. If your gene list is short, we don't know what question you're asking. If you give me one gene, you could be asking a lot of different questions, so we don't know how to weight the networks. By default, then, we use the set of fixed weights that comes from how good each network is at reproducing known function. If you give us a longer list, we use the query-dependent scheme that I told you about. That being said, we give you multiple options. We also allow you to weight each of the data sources equally. If you don't believe our algorithms, you can just assign every data source you selected an equal weight, so that each contributes equally to making these inferences. Okay.
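The selection logic just described can be written down as a small sketch. The function names, the scheme labels and especially the cut-off of five genes are assumptions made for illustration; the lecture only says that short lists get the fixed, annotation-based weights and longer lists get the query-dependent ones.

```python
def choose_weighting_scheme(query_genes, min_query_size=5, user_choice=None):
    """Pick a weighting scheme name for a query.

    min_query_size is a hypothetical threshold; the real cut-off is not
    stated in the lecture. user_choice lets the user override the default.
    """
    if user_choice is not None:
        return user_choice
    if len(set(query_genes)) < min_query_size:
        # Short list: the question is ambiguous, so fall back to fixed
        # weights based on how well each network reproduces known function.
        return "fixed_annotation_based"
    # Longer list: weight networks by how well they reproduce the list itself.
    return "query_dependent"

def equal_weights(selected_networks):
    """Give every selected network the same share of the total weight."""
    share = 1.0 / len(selected_networks)
    return {name: share for name in selected_networks}

# Placeholder gene names, not real identifiers.
single = choose_weighting_scheme(["GENE1"])
longer = choose_weighting_scheme(["GENE1", "GENE2", "GENE3", "GENE4", "GENE5", "GENE6"])
manual = choose_weighting_scheme(["GENE1"], user_choice="equal")
weights = equal_weights(["co-expression", "physical", "genetic", "co-localization"])
```

With equal weighting, networks that were not selected simply do not appear in the dictionary, which matches the earlier point that an unselected network gets zero weight.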
Also, if you want to define your question a little bit more, you can say: am I asking about the biological process of the single gene, the molecular function of the single gene, or the cellular component, that is, where it sits in the cell? Then you can choose from one of these three weighting schemes. I told you we weight networks based on how well they reproduce co-annotation patterns in Gene Ontology. Well, there are three different hierarchies in Gene Ontology: biological process, molecular function and cellular component. So you can look at co-annotation in just one of these hierarchies.

All right, I've told you about all this so far. We have combined networks together to ask questions, and I've told you how you might weight the contributions from different networks, either automatically or based on your query list. So now, once you have the network, how do we find the guilty associates of a gene? How do we find which genes are most heavily associated with the query list that we put in?

Let's say that this is our network. It's a pretty small network, with eleven genes in it. Four of these genes are in the query list and seven aren't, and the question is which genes are most associated with the query list in this network. There are two main ways of answering that, and what we're going to do is use these two main algorithms to assign a score to every gene based on how associated it is with the query list. Red means a high score and white means a low score. Okay, the first method is called direct interaction.
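The two scoring strategies introduced here, direct interaction and label propagation, can be contrasted on a toy network. This is a generic sketch, not GeneMANIA's implementation: the gene names, edge weights, the damping factor `alpha` and the symmetric normalization are all assumptions chosen for illustration. Gene `a` has three links, two of which lead to query genes; gene `b` only reaches the query list through `a`; genes `d` and `e` sit in a separate component.

```python
import numpy as np

genes = ["q1", "q2", "q3", "a", "b", "d", "e"]
query = {"q1", "q2", "q3"}
edges = {                      # hypothetical weighted links
    ("q1", "q2"): 1.0, ("q2", "q3"): 1.0,
    ("q1", "a"): 1.5, ("q2", "a"): 1.5,
    ("a", "b"): 1.0,
    ("d", "e"): 1.0,           # component with no path to the query genes
}

index = {g: i for i, g in enumerate(genes)}
W = np.zeros((len(genes), len(genes)))
for (u, v), w in edges.items():
    W[index[u], index[v]] = W[index[v], index[u]] = w
is_query = np.array([g in query for g in genes])

def direct_scores(W, is_query, use_weights=True):
    """Fraction of a gene's (weighted) links that lead to query genes."""
    A = W if use_weights else (W > 0).astype(float)
    total = A.sum(axis=1)
    to_query = A[:, is_query].sum(axis=1)
    return np.divide(to_query, total, out=np.zeros_like(total), where=total > 0)

def propagate(W, is_query, alpha=0.5, n_iter=200):
    """Iterate f = y + alpha * S f: query genes act as constant heat sources
    and alpha < 1 plays the role of heat loss at every step."""
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0                      # guard against isolated genes
    S = W / np.sqrt(np.outer(deg, deg))      # symmetric degree normalization
    y = is_query.astype(float)
    f = y.copy()
    for _ in range(n_iter):
        f = y + alpha * (S @ f)
    return f

direct = dict(zip(genes, direct_scores(W, is_query)))
direct_unweighted = dict(zip(genes, direct_scores(W, is_query, use_weights=False)))
propagated = dict(zip(genes, propagate(W, is_query)))
```

Under these assumptions, `a` scores two out of three when links are merely counted and three out of four when the weights are included. Direct scoring cannot tell `b` apart from `d` and `e`, since all three get zero, whereas propagation gives `b` a small positive score and leaves the disconnected genes at exactly zero.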
Basically, you score a gene by looking at its neighbors in the network and asking how many of them are in the query list and how strong the links to them are. We have different ways of combining this type of information, but an easy way to think about it is to take the sum of the weights of the links to a node. Take this node here: it has three links. How many of those links lead to genes in the query list? Two of the three. And maybe you include the weights, so if these weights are stronger than this one, then your score becomes more than two out of three; maybe it becomes something like three out of four, because you sum the weights of the links to the things in the query list.

There's a variety of methods that all use the same strategy of looking at your neighbors and seeing how many of them are on the query list. They vary in whether they include the weights and in how they combine the weights within a neighborhood: do they multiply them together or do they sum them? The point I want to make is that these types of algorithms are great, but they can only assign scores to genes that directly interact with the query list. You can see here that there are five other genes that the method is unable to distinguish between. They all get the same score because they don't directly interact with any genes in the query list. But in this case you might think that this gene should get a slightly higher score than this gene here, because even though it's not directly connected to the query list, it has a lot of indirect interactions with it: it indirectly interacts with three other genes.

It's extremely important to include these indirect interactions, because the data sources you're looking at often have a lot of false negatives; they're incomplete. Moreover, sometimes they have false positives, so they say two pairs of genes interact
when they don't. It's important to detect these false positive interactions, and one way of detecting them is to check whether they're supported by a lot of indirect interactions. If two genes actually interact, they tend to have a lot of friends in common; they have a lot of indirect interactions between them. This is a general property that people have discovered in these biological networks: they have what's called a high clustering coefficient. Interactions tend to occur within groups of genes that are highly interconnected.

So another way of computing the guilty associates is what's called label propagation. In label propagation you allow the score to propagate through the network along links, getting a little bit weaker the further away you get from the query list. In this case, for example, one way of computing the score of this gene here is to compute the scores of these genes first; this score is derived from the scores of its neighbors, and this score in turn is derived from its neighbor here. People have likened what happens when you use these algorithms to heat propagation. Imagine that these are sources of heat, that you have a little bit of heat loss at each one of these nodes, and that the sources are continuously giving out heat. This is the amount of heat that might arrive at each one of these nodes. But notice that no heat ever arrives at this part of the network, because there are no connections at all that can transmit the heat. So in this part of the network, which has no connection at all to the component that contains the query genes, the score is zero, allowing you to distinguish these two genes from these genes here.

Here is a little bit more information about each of these scoring algorithms. An example algorithm within the space of direct interaction methods is called Naive Bayes, and then an
example algorithm in label propagation, obviously, is GeneMANIA. There are a lot of other algorithms in this space, like HotNet, but in general they have the same property: they propagate the direct-neighbor scores so that indirect links to the query list can have some influence on the scores.

Here's another label propagation example that shows, I think a bit more clearly, that groups of genes with a lot of strong interactions among them end up getting high scores. Here the score is indicated by the size of the node. This is what you start from: these four genes are on the query list. The scores of the nodes within this cluster, which is highly interconnected, all reinforce each other when you do the label propagation. But the fact that this gene here, which has a lot of functions, is connected to other clusters of genes doesn't make for a lot of propagation of the label to those other clusters. So even though a direct interaction approach would give high scores to these genes, because they're connected to something that has this function, label propagation kind of cleans that up a little bit.

So there are three parts to GeneMANIA and other gene recommender systems. The first is an up-to-date collection of interaction networks. We go through the literature and the internet, and periodically we pull down all the interaction networks we can get our hands on. Our role is to include all the data that's readily available when you're doing your functional predictions. Other methods, like STRING, do the same thing: they're constantly updating their interaction networks with what other people have made available. Second, there's a query algorithm that finds genes and networks that are functionally associated with your query list. You find genes by finding the ones that are highly interacting with your query list; you find
networks by just looking at the network weights after you do the GeneMANIA query. Say you have a list of genes and you want to know more about it. You can find other genes that belong on the list, but you can also find out how the genes are most connected to one another. Is this list of genes actually co-expressed? If it is, you can look at the co-expression networks and find out which study gets the most weight, which tells you under what conditions this list of genes is co-expressed. Third, we have this network browser that we make available to you. You'll see it more when you do the assignment, but there are a lot of link-outs that let you click through and find out more about the studies that the networks came from and about the genes in the network.

So what data do we collect? We compile data from the Gene Expression Omnibus: we pull in all the co-expression data that satisfies some basic quality controls that we use to make sure the data is well formed. We pull in genetic interactions from a database called BioGRID. We pull in physical interactions from a database called iRefIndex. iRefIndex pulls physical interactions from all the various databases that compile them, like MINT and IntAct. These are annotation efforts where people go through the literature and put a physical interaction in their database if it's supported by some paper they've read. So there are a number of parallel efforts that are somewhat overlapping but mostly independent, and what iRefIndex does is group all these networks together. These networks are also available separately through what's called a PSICQUIC server, but I won't go into that. The other source that we use for finding functional interactions is what are called interologs. Does anybody know what an interolog is?
An interolog basically encapsulates this idea: if two genes interact in mouse, and they have orthologs in human, then those orthologs probably interact in human too. It's interaction by orthology; that's what interolog stands for. If you have an interaction network and you're able to identify orthologs in a different species, then interologs say that those orthologs probably interact as well. It's not a perfect source of information, and it can be a weak one, but for organisms where there's very little interaction data it might be incredibly useful. Under some circumstances it's actually a great source of information: if you find that two genes interact in human and you also find that their orthologs interact in yeast, that's probably an interaction you can rely on.

There are two other data sources that I want to talk about. One is protein domains. People know what protein domains are, more or less? There's a database, called Pfam, or now sometimes InterPro, that looks for sub-sequences within protein sequences that represent conserved domains. A zinc finger domain is one type of domain, and an SH2 domain is another. These are parts of the protein that fold independently into structures that are conserved across various organisms. You can scan through a protein sequence and infer where these domains might occur, and that tells you a lot about its biochemical function. So one way to generate a network is to ask whether a given gene has the same set of protein domains as something else that we know has a given biochemical function. That's another source of data that we use.

Most recently, we've also included data that has been compiled by Gary Bader's lab, including a lot of data that came from MSigDB, which is hosted at the Broad, that assigns
annotations to genes. So what's an annotation? It's like the gene sets that we've talked about previously. Are the genes all on the same chromosome? Do they all have the same transcription factor binding site in their promoter? Do they all have the same domain? Are they all associated with the same disease? Are they annotated targets of a drug? Are they predicted targets of a microRNA? We pull in as many of those annotations as we can. We don't turn this on by default, because the annotations include gene sets. If you're asking the question "what does my gene do?", you might be getting circular information: you would be using the fact that your gene is annotated in a certain way to find more genes that have the same annotation. But you can turn it on by opening the advanced options panel and clicking it on, and it gives you a lot of information about your gene list.

We do our best to identify every gene ID that we can. We're restricted in one way. As Gary said in the first talk, sometimes genes have essentially the same name, and he gave you an example where the symbols of two genes differed only by capitalization, which is crazy; you're not necessarily even going to remember what the capitalization is. What Gary showed you on that list with all the weird-looking identifiers, those are like your social insurance number. You can also have gene symbols, which are the somewhat human-readable names of the genes. And then there are also things called gene synonyms; that's like when someone says this gene's name is smog or, if you're a fly biologist, this gene's name is aubergine. So genes have these longer names too. We do our best to recognize everything we can, based on what annotation is available. But we don't recognize identifiers that map to
more than one gene. We pull our gene annotations from Gene Ontology, and we also get some from organism-specific databases; you just have to click around for each organism to see what's available.

Right now we cover eight organisms. These are the organisms that have enough high-throughput data that we think it's useful to have a large publicly available system to cover them: human, mouse, rat, zebrafish, C. elegans, Drosophila, Arabidopsis, yeast and E. coli. These are like the eight major organisms in terms of the amount of data available for them. If you're working on a model organism that's not one of these, you have a variety of things you can do. One is to find the orthologs of the genes you're interested in in the closest model organism. We also make it possible to build a GeneMANIA instance for your own model organism. You just have to provide a couple of things, and you won't be able to access it through the website, but you can access it through the Cytoscape plugin, which has all the functionality of the website.

Right now we have more than 2,000 networks; I would say about 1,500 of them are co-expression networks. Of course we make this web network browser available, but you can also browse through Cytoscape, because we have a plugin that has the identical data and functionality of GeneMANIA.

The other useful thing about the plugin is this. On the website we only make our most recent data release available. Like I said, periodically we update our network databases by pulling in all the data that's available, and we also have to update our gene annotations, because these change all the time. You wouldn't believe how much they change. You think that things are stable, you think that this is the name of the gene, and it's not, because it changes all the time in the annotation databases. So sometimes we're a little bit behind in the annotation and we don't recognize
a given gene name. But if you take your gene and look it up, you can find synonyms for the gene or other gene identifiers associated with it.

So on the website we can only make the most recent data available, but in the Cytoscape plugin you can pull in all the older data releases. The reason we do that is reproducibility. If you ran a GeneMANIA analysis on a previous data release and somebody wants to reproduce it, they can't do it through our website, but they can do it with the Cytoscape plugin. And, as I said, you can add new organisms: if you're able to construct a bunch of data sources for a given organism, you can get access to that through the Cytoscape plugin. It's not easy, but we have the ability to do that and we make the tools available to let you do it. You can also integrate GeneMANIA networks with other Cytoscape analyses.

The other thing is that on our website we have a restriction on the length of the query list. We don't take query lists longer than, I think, 100 to 200 genes. The reason is not that we can't do the analysis. It's that the network browser, where I was moving stuff around, contains a lot of information. When GeneMANIA finishes the analysis, it constructs that view for you and sends it to you over the web, and that can take a long time. Also, right now it's implemented in something called Flash, which is pretty slow, so once you get more than about 100 genes in the network browser, things slow down a lot. We don't want people to get very bad performance on the website, so we don't let you use long query lists there, but you can use long query lists in the Cytoscape plugin. Hopefully within the next few months, and we already have a working demo of this, we will have replaced the Flash-based network browser with a new one written in a new language, which is
JavaScript and HTML5 compliant, at which point things should get a whole lot faster; indeed, it looks like they will. So we'll remove that barrier. The other thing is that right now GeneMANIA doesn't run on iPads, because Steve Jobs didn't like Flash. You can't run GeneMANIA on iPads and you can't run it on your phone. When we release the new version that uses the new type of network browser, you should be able to run it on an iPad or on a phone.

One last thing that we include is something called Query Runner. So what's Query Runner? Because GeneMANIA takes all the network data in the world that's easily accessible, we can say something about how good all the high-throughput data generated to date is at reproducing what's already known about gene function, as discovered through other methods. That's kind of cool in itself. It's like saying: suppose I didn't know that much about gene function and I wanted to reconstruct it; could I do it using just the high-throughput data? But another type of analysis is this: I've generated this big genetic interaction network; how much have I added to the global knowledge of gene function that's available in these networks? How much better am I able to reconstruct the Gene Ontology categories now that I include this new network, compared to all the networks that were available before? We wanted to do this analysis for a paper that we participated in about five years ago, with a large genetic interaction network. We asked: with all the genetic interaction networks that were available, what's the gain from adding this network, compared to all the other genetic interaction networks and all the other data that was available at the time? So we can assess the added predictive value of new data.

Okay, so the major complementary gene recommender system is something called
STRING. STRING has actually been around a lot longer than GeneMANIA: GeneMANIA has been around since about 2010, and STRING since about 2000. STRING has much of the functionality that GeneMANIA does; they have some functionality that we don't have, and they lack some things that we do have.

What's different about STRING? There are two major differences. One is that they have a much larger organismal coverage, so they cover hundreds of organisms. Most of this coverage comes from interologs, but you can ask questions about pretty much any organism that's in Ensembl through STRING. The data coverage might be pretty small for some of the organisms, but those organisms are there. The other major difference is that STRING is focused on proteins, not genes. What that means is that they collect different types of information; it doesn't necessarily make sense to talk about genetic interactions among proteins. What they also do differently from us is put a lot more effort into curating networks. They have something like eight network types that are much more curated, whereas we just pull everything in and let the algorithms sort out which network data is relevant and which isn't. Or we let you decide for yourself, by turning networks on and off and seeing whether or not you believe the answers.

They have this nice interface, which is actually based on the browser that we developed for GeneMANIA. You can click through here and, I don't know if you can see this, but there are proteins that have structures associated with them, which I think is super cool. And then these are the predicted functional partners. They use a direct interaction scheme to find the most likely partners, which means that they get different answers than we do, also because they use different data. They also have a lot of other really interesting analyses that you can
do. They can look for gene fusion and co-occurrence events, and they can look at gene neighborhood. You can ask me questions about this, but a lot of these analyses, and a lot of the STRING coverage, come from prokaryotes. In prokaryotes there are a lot of other sources of data about gene function that aren't as useful in eukaryotes. Those are things like: are these two genes in the same operon? Obviously, if they're in the same operon, they're going to be co-expressed and they're probably involved in the same function. Are these two genes separate in some organisms and fused in others? If so, then in the organisms where they're separate there's probably a protein-protein interaction between them, and the fusion gives evidence for a shared function of these genes. The other thing that's incredibly useful in prokaryotes is: do these sets of genes co-occur in the same sets of prokaryotes? If you have a tail, or whatever that thing is called, all the genes that are involved in making that thing run are going to be found in the same set of prokaryotes. So this type of information, whether genes are in the same operon, whether they fuse, or whether they co-occur, is useful for defining gene function in prokaryotes, and much less so for eukaryotes. STRING gives you access to all that type of information.

Okay, I've written out the comparison here, but we've already gone through it. So have we done the learning objectives? Yes, I hope. We've talked about functional interaction networks, guilt by association and gene recommender systems. I've talked extensively about different ways of weighting networks or types of data. I've talked to you about two different algorithms for predicting gene function once you're given a network: one is just to look at the neighbors of the genes, and one is to
propagate information about function through the network. You should now be able to use a gene recommender system to answer two types of questions. If you can't now, you certainly will be able to once the next two hours are done, and we're going to go into more of this next. You should also be able to select the appropriate network weighting scheme to answer your questions. Okay, so we're on a coffee break, but do we have any questions before we start our break?