Hi, I'm Quaid Morris. I'm a member of the Morris Lab. In fact, that's my lab. Sometimes it's easy to forget that in lab meetings. I'm a computational biologist from the University of Toronto. I've been working in this field, and I've had my lab, for going on 12 years now. As Anne said, I'll be talking about gene function prediction today. You've heard a lot about networks, interaction networks, and modules, and we're going to try to put some of those ideas together to say something about gene function. So this is another way, in addition to the sorts of things that Robin told you about, to use interaction network data in your research with your gene lists.
So what are our learning objectives? Okay, this is the one. All right, start down here. Basically, we're going to move towards answering two types of questions about gene function. One is, what does my gene do? What is its function? And the other is, give me more genes like these. So if I have a list of genes, say those that are involved in Wnt signaling, can you find more genes like these Wnt signaling genes? The data that we're going to use to find this out are these interaction networks. The reason I'm posing the questions like this is that I think it's an easy way to interact with these interaction networks, which can be kind of complicated. They can be incomplete, and their meaning is not always entirely clear. I'll say more about those ideas as I give the talk.
Answering these questions involves understanding some key concepts. The first is the functional interaction network, which I'm sure Robin went over, but it would be helpful for me to explain again. The second is guilt by association. That's the technique we're going to use to try to infer some things about gene function.
And basically the idea is that if you interact with genes of a given function, that's evidence that you yourself have that function. And then the last concept is the gene recommender system. Those are the interfaces that I'm going to talk about today. I'm going to focus on one particular interface called GeneMANIA that was developed in my lab and Gary Bader's lab. But there are a number of other interfaces that make use of the same concepts, give access to the same data, and have similar functionality, and we're going to talk about one of those, called the STRING interface.
Okay. So when applying guilt by association, there are two ways in which guilt can be inferred. One way is by direct interaction: do you directly interact with genes that have this function? The other way is something that I call label propagation, but it came up in Robin's lecture in the form of modules. The idea is, can you identify groups of genes that are all interacting with one another? If you're a member of one of these groups of genes, these modules, you're more likely to share a function with the genes that are already in that module. So label propagation is one of the ways in which you can identify modules, is one way to think about it.
And then finally, when we're answering the question, give me more genes like these, we're going to want to use the network data in a different way than when we ask, what does my gene do? When you ask, give me more genes like these, you're asking a question about a specific type of function, and some networks might be better at inferring that function than others. So you will want to reweight the networks when you look for evidence of interaction. Okay, I just wanted to give you a general overview.
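The two flavors of guilt by association described above can be contrasted in a minimal sketch. Everything here is an assumption for illustration: the gene names, the edge weights, the `alpha` mixing parameter, and the simple iterative propagation rule, which is a generic label propagation scheme and not necessarily the algorithm any particular tool uses.

```python
# Toy functional interaction network: weighted, undirected edges.
# Gene names and weights are made up for illustration.
edges = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("C", "D"): 0.7,
    ("D", "E"): 0.1,
}

def neighbors(edges):
    """Build an adjacency map {gene: {neighbor: weight}} from an edge dict."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def direct_scores(adj, query):
    """Direct guilt by association: sum of edge weights to query genes."""
    return {
        g: sum(w for n, w in nbrs.items() if n in query)
        for g, nbrs in adj.items() if g not in query
    }

def propagate_scores(adj, query, alpha=0.5, n_iter=50):
    """Simple label propagation: scores spread along edges, so genes that
    are only indirectly linked to the query can still pick up a score."""
    label = {g: (1.0 if g in query else 0.0) for g in adj}
    score = dict(label)
    for _ in range(n_iter):
        new = {}
        for g, nbrs in adj.items():
            total = sum(nbrs.values())
            spread = sum(w * score[n] for n, w in nbrs.items()) / total
            # Mix the gene's own label with the weighted average of its neighbors.
            new[g] = (1 - alpha) * label[g] + alpha * spread
        score = new
    return {g: s for g, s in score.items() if g not in query}

adj = neighbors(edges)
direct = direct_scores(adj, {"A"})         # only B, the direct neighbor, scores above zero
propagated = propagate_scores(adj, {"A"})  # C, D and E also get (smaller) scores
```

With the single query gene `A`, the direct score only rewards `B`, while propagation also gives decreasing scores to `C`, `D`, and `E`, which is the module-like behavior the lecture describes.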
If the concepts that I've talked about so far aren't that clear, they're going to become clear through my lecture, because these are our learning objectives. And by going through the learning objectives, I've pretty much given you the outline. So we're going to talk first about what a functional interaction network is, remind you what it is, go over these concepts that I introduced, and talk about algorithms for scoring guilt by association. I'm going to introduce GeneMANIA to you, then I'm going to give you a quick demo of the website. It's a pretty easy website to use, but I'll go through and press all the buttons for you so you know what they do. And I would encourage you to fool around with the website yourself. It's been designed to be user friendly. Then I'll go into what I'm calling these different network weighting schemes, where you can evaluate networks based on what type of evidence they present about gene function. And then we're going to talk about STRING, which is another gene recommender system.
Okay, so what I'm trying to illustrate in this slide is that if you want to make use of all these different sources of data in your own research, it becomes a bit complicated. These databases are large, they're incomplete, and their relationship to each other isn't necessarily that clear. This is why we define these concepts. One of the first concepts, illustrated in the slide, is the idea of a functional interaction network. What's shown on the left is a figure of microarray expression data from what is now a classic paper in the field, from Mike Eisen et al., Pat Brown's lab, in 1998, where they made a microarray and profiled gene expression under a variety of different conditions. This is probably a figure that is familiar to most of you by now. What this matrix, or array of numbers, shows is expression, using a false-color heat map.
So green meaning up and red meaning down, or vice versa. I can never remember which is which. It doesn't really matter for this case. The rows correspond to genes and the columns correspond to different conditions or different cellular stresses. What's shown here is the expression profile of a given gene across those conditions. And the genes have been sorted using hierarchical clustering, so genes with similar expression profiles sit beside each other. These two are blowups of different regions in this plot. The observation that they made was that if you look at the assigned function for these genes, the assigned function correlates with the gene expression profile, meaning that genes with similar expression profiles had similar functions.
The value of that observation is illustrated in the network figure beside it. You arrange the genes in a network where the nodes represent genes and the thickness of the edges corresponds to how highly correlated those genes are. Then you annotate the genes in the network that have known function, and say you have two genes in the network that have unknown function. Well, if those unknown genes interact strongly with genes of known function, you can have a pretty good initial guess about what their function actually is.
And so this network here, where the edges represent the degree of correlation between gene expression profiles, is in general called a functional interaction network, meaning that there's some correlation between the strength of the link between genes and the likelihood that they share at least some aspect of their function. Now, function is a big word. Function can mean a lot of things. So what is the aspect of function that's being represented here?
Well, it's some aspect of their function. Genes that are co-expressed in this case, which was, I think, a variety of cellular stresses and deletion mutants, probably represent a shared stress response in the cell. But in general, we don't necessarily know what type of function a network is representing; we just know that the link gives us some hint of functional interaction.
The next slide talks about the different types of functional interaction networks. There are directly measured interactions. You've seen some of those: you can directly measure whether or not a pair of proteins interact, or whether or not there's a complex of proteins that all co-purify together. That's called a protein interaction network. You can measure synthetic pairwise genetic interaction networks. You can infer interactions from a single data source, and we saw that with the co-expression network. There are other gene expression profiling studies from which you can infer these types of things. You could also infer functional interaction by sequence similarity: genes with similar protein sequences probably have similar biochemical function. And then there was a bit of an industry, especially in the early aughts, of inferring interactions by combining data from multiple data sources. That's partially what we're doing here in GeneMANIA. What people have published, including the network that Robin told you about in the last module, are networks that combine data from a variety of different sources to get a kind of composite measure of functional interaction.
Okay. So what we're going to be trying to do here is use these interaction networks as the data sources that we query to answer the specific questions that I talked about. So one is, what does my gene do?
So the idea is, you have a gene that's shown up in a screen, and you don't really know that much about it. Maybe there's not much in the literature. But you want to get some idea of what functional role that gene might be playing. So there the goal is to determine a gene's function, or say something about it, based on who it interacts with. This is the guilt-by-association case that I talked about: you know what a gene does by what the genes it interacts with do.
But there's another type of question, which is, give me more genes like these. Say you're trying to set up some sort of medium-scale screen. Maybe you're interested in, for example, the Wnt signaling pathway, as one of my early collaborators was. You want to find more kinases, you want to find more members of a protein complex, you want to find more disease genes. That's a different type of question, and it's one we can also try to answer using functional interaction data.
So we want to answer the question, what does my gene do? We take the input, which is all the network and profile data we can get our hands on. We take a query list, which in this case is just a single gene, and we push those two through a gene recommender system. And so what's a gene recommender system? It's a system that recommends genes. There we go, in case it wasn't clear. It's like Amazon: you like this book, well, you're going to like these books too. It's the same sort of idea. If you are interested in this gene, these are the other genes that interact with it, and these are probably the ones you want to look at.
What's a false positive? It's something that you say is true, but it's not true. You know, that's a very hard question to answer, because this is just a single gene. If it's just a single gene, you don't know what question is being asked. That's a bad answer.
So, I mean, in GeneMANIA, as I'll show you later, you can adjust the question that's being asked by saying what data sources you're willing to query for answers. But if you don't put anything in, then we just find genes that are likely to have the same biological function, as defined by the Gene Ontology biological process hierarchy, which you heard about probably two days ago. Is that right? Yeah. Okay, good. Thank you.
So here, the best thing you could do is run this through the gene recommender system, or at least GeneMANIA, and then do some sort of enrichment analysis. You heard about that yesterday. So here's the gene, and these are its 20 closest associates. Are there any pathways or categories of gene function that are enriched among its associates? If there are, then that might say something about the function of this gene. That gives you additional information.
Also, as you can see here, there are connections of different types and different colors between that gene and its associates. Those are telling you how those genes are linked together, what the data is that says they are interacting. Some of it is physical interaction data, some of it is predicted interactions, because the orthologs of those genes interact in a different organism, and some of it is co-expression.
So the black ones? The one with the diagonal lines is your query, and all the other ones are the ones that are inferred. They're colored based on whether or not they've been assigned one of these three functions. Yeah, what is the black? Black means it has not been assigned any of those functions. But black with the diagonals means that it's part of your query list. Okay, any other questions about this? Okay.
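The enrichment question asked above (are any categories over-represented among a gene's closest associates?) is commonly answered with a one-sided Fisher's exact test, which is the same as a hypergeometric tail probability. The following is a minimal sketch under that assumption; the function name and the counts in the example are hypothetical and are not taken from the demo.

```python
from math import comb

def enrichment_p_value(n_genome, n_annotated, n_list, n_overlap):
    """One-sided Fisher's exact (hypergeometric) p-value: the probability of
    seeing at least n_overlap annotated genes in a random list of n_list
    genes drawn from a genome of n_genome genes, n_annotated of which
    carry the annotation."""
    total = comb(n_genome, n_list)
    upper = min(n_annotated, n_list)
    tail = sum(
        comb(n_annotated, k) * comb(n_genome - n_annotated, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: a genome of 20 genes, 5 annotated with some category,
# and a returned list of 4 genes, all of which carry the annotation.
p_all_four = enrichment_p_value(20, 5, 4, 4)
p_at_least_one = enrichment_p_value(20, 5, 4, 1)
```

In practice each category gets its own p-value, and the p-values are then corrected for multiple testing, for example with a false discovery rate procedure.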
Okay, and so when you're answering these types of questions, you can use anything you have, basically. And again, there's this issue that Francis brought up: if you put too much data in, maybe you'll get a mishmash of different things. But generally speaking, you get a pretty good initial guess at what's going on. At least you can see the other genes that it interacts with. And then you can iterate: you can take those genes, put them back in, and ask a different type of question to try to refine what you're asking about the gene's function. I'll show you how to do that in a second.
Okay. So now, back to this question. When you ask about function, like, what does p53 do? Well, depending on who you ask, you get a lot of different answers to this question, right? p53 is a transcription factor, everybody knows that. It's the most famous tumor suppressor. It plays a role in apoptosis, it plays a role in responding to DNA damage. It's got a lot of different functions. It's probably the gene with the most functions assigned to it. So if you want to say what p53 does, you could say, well, this is the aspect of function that I'm interested in. And it's important to say what aspect of function you're interested in, because some of the networks might be better for some types of gene function than others.
For example, one of the networks that people generate is a network of protein sequence similarity. Often what that tells you most about is whether or not two genes have the same biochemical function, whether they have the same enzymatic activity. But that doesn't necessarily mean that an organism is going to behave the same way, or have the same phenotype, if those two genes with high sequence similarity are knocked out, because they could be expressed under different conditions, for example. Okay. So one way of, of, okay.
And so in order to answer these types of questions, you need what I'm calling a context-dependent set of networks. That means a set of networks that is designed to answer the question about function that you're interested in, not an arbitrary set. And so how do you define the question you're asking? Well, one way is to define the query by providing context. So Memphis is a city in Tennessee, and Memphis is also a city in ancient Egypt. If you just put Memphis in, you might get information about both versions of Memphis. But if you mention, say, a couple more cities in Tennessee, well, I can give you more cities like Memphis in that way. If you put in a couple more cities in ancient and modern Egypt, you can get more cities that way. So you provide context. That's the idea behind give me more genes like these: by providing that gene list, you say, this is the aspect of gene function that I'm interested in.
And so this is how this works. You have network and profile data. You have a query list. You plug them into your gene recommender system. And now I've colored things a little bit more. As before, we have the genes in the query list on this network, which is really hard to see. I'm sorry about that, but you'll see it online in a second. The black ones are genes that aren't in the query list but are the guilty associates. I've annotated the functions as a way of helping to navigate through the network that you have here. And the individual networks are indicated by colored links, which maybe came through in the hard copy. It's not coming through on this projector, but it will come through on the interface itself. Okay.
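One way to picture how a query list supplies context is to weight each network by how well it connects the query genes to one another rather than to other genes, and then combine the networks using those weights. The sketch below uses a simple ratio heuristic as a stand-in; the function names, gene names, network names, and edge weights are all hypothetical, and this is not the weighting method of any specific tool.

```python
def network_weights(networks, query):
    """Weight each network by the fraction of its query-touching edge weight
    that stays inside the query list, then normalize the weights to sum to 1."""
    raw = {}
    for name, edges in networks.items():
        inside = sum(w for (u, v), w in edges.items() if u in query and v in query)
        touching = sum(w for (u, v), w in edges.items() if u in query or v in query)
        raw[name] = inside / touching if touching else 0.0
    total = sum(raw.values())
    return {name: (r / total if total else 0.0) for name, r in raw.items()}

def composite(networks, weights):
    """Combine networks into one: each edge weight is the weighted sum of
    that gene pair's edge weights across the individual networks."""
    combined = {}
    for name, edges in networks.items():
        for pair, w in edges.items():
            key = tuple(sorted(pair))  # treat (u, v) and (v, u) as the same edge
            combined[key] = combined.get(key, 0.0) + weights[name] * w
    return combined

# Hypothetical data: G1-G3 are the query genes, X1 and X2 are other genes.
networks = {
    "coexpression": {("G1", "G2"): 0.9, ("G2", "G3"): 0.8, ("G3", "X1"): 0.1},
    "shared_domains": {("G1", "X1"): 0.9, ("G2", "X2"): 0.9, ("G1", "G2"): 0.1},
}
query = {"G1", "G2", "G3"}
weights = network_weights(networks, query)
combined = composite(networks, weights)
```

Here the co-expression network mostly links the query genes to each other, so it receives most of the weight, while the shared-domain network mostly links query genes to outside genes and is down-weighted.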
And so now all the other genes that have come along with this query list are probably more genes like the ones you're interested in. And you can again use enrichment as a way of saying something about the shared function in the list. So I guess this is the point at which I show one of the demos. Great.
Okay. So this is just our website. If you type in GeneMANIA, we still have the trademark, so nobody else is called GeneMANIA yet. I'm just going to click here to get an example gene list. In general, you can just type genes in there. And I'm going to press search. It's going to take about a minute for it to load. No, I thought it was pretty good. Okay, great. Oh, that's interesting. Someone has adjusted the default options. Just a second here. Okay. I want 20 other genes. So what happened there is it just gave me back my query list and showed me how those genes were interacting with one another in the network view in the middle, but I didn't get any other genes. So I was unhappy with that. Okay. And now I'm happy. Now I've gotten what I expected.
As I indicated before, these are the genes in the query list, the ones with the lines, and these are the genes that came along for the ride. I think the query list was drawn from human genes involved in DNA damage response, because I see BRCA in this list. I don't know. It's a query list that makes a nice-looking network, is how we chose it initially.
Okay. So let me just take you through the things that are on the side, and then I'm going to press all the buttons, and then I'm going to go back to my presentation. So here are the networks. You can make this panel go away if you think it looks too messy, but I like it. These are the networks that are being shown here. They're organized into six different categories. The predicted networks, these are predicted interactions.
They weren't directly measured or observed in human. They're predicted because the orthologs of the genes that are connected by these links interact in other organisms. You can hover over there, you see it, and click it on and off; the lines appear and disappear. This little arrow also expands it out. This is a category of networks, and these are all the various networks that are contributing to that category. You can see individual networks by just hovering over here. Okay. And it's the same thing with the physical interactions. These are all the individual networks, and I'll explain in a second what all these weird names mean.
Here are the shared protein domains. This is a way of measuring sequence similarity, but we filter it first by looking at whether or not the proteins have the same Pfam or InterPro domains. Did anyone present Pfam and InterPro to you? No? Okay. So who knows that proteins are made up of domains? Okay. All right. And so with Ensembl, and there are various bioinformatics organizations, they converge around the idea that you can describe those domains using what's called a profile HMM. It's something with which you can scan a new protein sequence and score it according to how likely you think it is that it has that domain. So any time that you submit a genome to Ensembl, there's a whole analysis pipeline it goes through. One of the things that happens is they score all the protein sequences based on these domains and then assign those domains to them.
When we started, there were like two different sources for those domains. There are now probably about eight different sources for what are called domain models, which are the scoring systems, and the major ones are Pfam and InterPro. And so for these shared protein domain networks, this tells you whether or not the connected genes seem to share predicted domains. Any questions about that? Exactly.
It's sequence similarity, but at the domain level. Yeah. One of these days we should actually put sequence similarity in, but we don't have it yet. Okay. And then these are just co-expression networks. The way we do co-expression networks is that every few years we go to Gene Expression Omnibus, download all the expression data that we recognize as being on an Affymetrix array or an Agilent array, and automatically make a new co-expression network. We have some minimum size; I think it's something like 20 experiments that have to be in that data set. But we constantly update the co-expression.
Pathways. Genes are connected together here if they've been assigned to the same pathway. This is one of the pathway databases; we have multiple pathway databases. And co-localization is kind of a weird category. In yeast it means that the protein products of the genes co-localize in the same part of the cell. In human and mouse it means that the genes are expressed in a similar set of tissues. Okay. So that's the network tab.
I can also hover over a gene, and that tells me all the genes it interacts with. I can click on a gene and get some information about it, including a link out that will take me to the gene description page. So you can get a description of the gene from PubMed. The other thing you can do is, if you find a gene that you think should have been in your list, you can click on it and say add, and it'll restart the process with your expanded query list. As I was saying before, if you start with a single gene, you want to know more about its function, and you find its 20 closest associates, then you can click the ones that have the function or aspects that you're interested in to change the query list. Okay, so let me just press all the other buttons for you so you've seen them all pressed. I'm going to stay away from this button because I'm going to explain it later.
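As an aside on the co-expression networks mentioned above: the basic construction, linking genes whose expression profiles are highly correlated, can be sketched in a few lines. This is a minimal illustration only. The gene names, the expression values, the use of Pearson correlation, and the `min_corr` threshold are all assumptions, not a description of the actual pipeline.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.
    Assumes neither profile is constant (otherwise the denominator is zero)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def coexpression_network(profiles, min_corr=0.8):
    """Link every pair of genes whose profiles correlate at or above min_corr;
    the edge weight is the correlation itself."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= min_corr:
            edges[(g1, g2)] = r
    return edges

# Hypothetical expression values across five conditions.
profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 11],  # rises with geneA
    "geneC": [5, 3, 4, 1, 2],   # falls as geneA rises
}
coexpression_edges = coexpression_network(profiles)
```

Only `geneA` and `geneB` end up linked here, because `geneC` is negatively correlated with both of them.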
You can change your organism. There are eight different organisms, is that right? Nine. Nine different organisms that we support. What does this button do? Oh, okay, that's the search button. Every organism has its own query list.
Okay, so these three buttons just re-lay out the graph. This is one of my favorite ones. With this layout, these genes here on this side are all the query list, and, oh, what happened here, and then these ones are all the guilty associates. These are the genes that were returned, and they're sorted by how guilty they are. So this is the one that has the strongest association with the query list, and this is the one with the weakest association among those in this list. In this case we're only showing the top 20, and you can change that if you want. If you don't like lines and you like circles instead, you can use this layout. It does the same thing as this one, but with circles.
And then this layout is called, um, what? No, it's not. Right, it's a force-directed layout. Thanks, Brian. What this means is that the nodes for genes are closest to the nodes for the genes they most highly interact with. And because there's a little bit of randomness when you run this layout algorithm, it can end up in a different place if you start it from a different place. You can redo the force-directed layout to get a similar layout that's slightly different just by pressing it. It's not random. No, it ignores query versus non-query genes. Yeah. Okay.
And what do these buttons do? What does this one do? Oh, thank you. Okay, and here's the information. And this is if you want to save any of this information. You can get an online report, you can download these images, and you can also get a text file with the data about all the
networks that are shown in this list. So play around with those. I mean, it should be SVG. Oh, huh, I guess it's only JPEG. Wow, okay. We used to give SVG, but I didn't like that because I didn't have anything to display SVG easily, so now it's JPEG. But I think if you go to the report... no, the report's not SVG either. Okay, sorry, JPEG it is.
And then, remember the nodes were colored? Well, this is how you color the nodes. What's happening here is we take this list and we do a gene set enrichment analysis, the Fisher's exact test version of it, and then you can color nodes according to the functions that they have, just by clicking on them. The gene sets are sorted by their false discovery rate, and the coverage says what proportion of the genes in the set, genome wide, are shown. There are 151 genes genome wide that have the GO annotation DNA recombination, and 23 of them are shown here. That's what that means. Okay, any questions about that? I'll go back and finish off the concepts that I wanted you to understand. Great. And we didn't go through that other button, but I'll go through it later.
Okay. So one of the things you saw is network weights. What do those network weights mean, and how do we get them? I'm distinguishing between two different ways of weighting networks. One way is what I call context independent. So regardless of your gene list, all the networks get the same weight... sorry, let me say it again, because what I said initially was wrong. The weight that a network gets assigned doesn't depend on your gene list, but some networks can get weighted more if, in general, they just have better data in them or are more informative about gene function. So that's what I call context-independent network weighting. You can pre-combine the networks by simple addition or by predetermining the weights. So, like I said in the first slide where
I introduced the idea of a functional interaction network, I said, okay, here's the functional interaction network that represents co-expression. So if we find a common scale for all these links, say they're weighted zero to one, zero meaning there's no interaction at all and one meaning we're almost certain that all the function these two genes have is shared, we can make a composite network. It looks a little bit ugly here, but you can imagine the weight between two genes is just equal to the sum of the weights between those two genes in each one of the networks. That's the easiest way to combine these together. So if you see an interaction between a gene pair in multiple different ways of measuring functional interaction, that gives you strong evidence that these two genes share function.
Now, you can take that idea and expand on it a little bit by changing the weight depending upon how reliable or informative you think a data source is. Most gene expression networks are not very informative, so on average they'll get a smaller weight. Genetic and physical interaction networks tend to be much more informative, because they tell you about complexes, and function is usually shared among genes in complexes, so they'll get a little bit more weight. And there are people, including GeneMANIA, who have precomputed weights for all these networks based on how good they are at recovering what's known about gene function. So those are context-independent networks.
But again, you have this problem that who knows what question you're asking when you put in p53. So the other option is context-dependent weighting. The idea here is, once again, you assign a weight to each network. You take the weight of the interaction in the network, you multiply it by the weight assigned to the network, and then you sum up, for a
given pair of genes, these weights. You take a weighted average of the links between them as a way of inferring evidence for shared function. And where do these weights come from? Well, these weights come from looking at the gene list itself. If you give me a long list of genes, you can ask the question, in what networks are these genes well connected to one another and not well connected to other genes? Does that make sense? If you find a network where the genes in your list are all linked together, that network probably provides pretty good information about other genes that might be associated with them. Now, you don't want the network to link all genes together, because that's not specific information. But if the network largely links the genes in your list together and doesn't link them to too many other genes, that's good evidence that the network is capturing the aspect of gene function that's represented by that list. That's essentially how we're assigning weights in a context-dependent manner. I mean, the way we actually do it is using linear regression, but the actual technique used doesn't matter so much as the intuition for how it works.
So basically there are two rules that get satisfied by the means that we use to weight the networks, and other people use similar ideas. One is relevance: the network should be relevant to predicting the function of interest. That's the test I just told you about: are the genes in the query list more often connected to one another than to other genes? The other important rule, which we discovered when we started including a lot of co-expression data, comes from the fact that it's really easy to get co-expression data. We have hundreds of co-expression networks in our interface and only a relatively small number of physical interaction data sets. So in that case you want to make sure that the information your network provides is not
redundant with other data sets. This is particularly a problem with co-expression. The test here is, do two networks share many of their interactions? If you see a co-expression network and it looks a lot like another co-expression network, well, maybe those two are providing redundant information. Ultimately, if you're weighting the networks correctly, and I took a network and assigned it a weight, let's say a weight of five, and then I repeated that network ten times and reassigned the weights, then the total weight those 10 repeats get should be equal to five, because adding these additional networks isn't telling you anything new.
Okay, so those are the concepts behind the network weighting schemes. In the GeneMANIA interface, we give you access to choosing what weighting scheme you're going to use. We have a default, because I like it when you just put the list into a system, press the button, and it has good behavior. I think that's a nice way to design things. So by default we choose between two weighting schemes. If you don't give us enough genes, or if you only give us one gene, we have a default way that we weight things: basically, we weight the networks based on how well the group of networks that you selected recovers shared Gene Ontology biological process function. So two genes should have a higher weight in our interaction network if they share a lot of their biological process functions. We can compute that on the fly if you change the selection of networks, but if you use our default selection of networks, it's already cached, so it's a bit faster. So if you have one gene, or fewer than, I think, five genes, that's what we do. You can change it, but that's the default behavior. But if you have a longer list, like six genes or more, then we
actually try to do this context-dependent weighting, so we actually try to infer which networks are most relevant to your query. So you put in a long gene list, you get the list of networks that are relevant to your query, and you'll get their weights. From my point of view, that information alone is very informative. If you have a list you don't know anything about, what's happening when we weight networks is that we're telling you which networks best connect that list together. So you take a list, you don't know where that list is from, and we can be like, wow, look, these are highly co-expressed, and they're highly co-expressed in this particular study that has this high weight. And you can go and look at that study and see what it was that they were actually measuring the co-expression for. So the weights themselves, when you have a long gene list, can be informative.

All right, so just to take you through what these not very carefully selected terms mean. Query-dependent weighting: this is the default, the automatically selected weighting method, and I told you there are two different defaults. Now, if you're not happy with this, you can choose "assigned based on query genes", which forces the interface to do this gene-list-dependent weighting. We don't suggest that for shorter lists because of what's called overfitting: basically, there's not enough information in a short gene list to give you reliable network weights, so sometimes the network weights are due to some degree of randomness or some degree of luck. But if your list has, like, 10 or more genes, the network weights become much more reliable. So you can force it based on that, or if you just use the default, it's going to default to "assigned based on query genes" for six or more genes. If you want to be a bit more conservative, you can use Gene Ontology-based weighting. The biological process based one is what I told you the default was when the gene list is small, so we try to
weight networks based on how well they recover shared biological function. But you can change your mind and say something about shared molecular function (do they have similar biochemical activities?) or shared localization (are they in the same cellular compartment, or expressed in the same set of tissues?). So that's the other way that you weight networks. All three of these ignore the query list. They don't care about the query list; they're only looking at patterns of shared annotation for the genes across the genome.

And the last type of weighting is called equal weighting. There are two ways you can equally weight networks. You can say, look, I don't know anything, but I want all the data types to have equal weights, so physical interactions should be weighted the same way that co-expression is and that genetic interactions are. That will force each of the categories to have equal weight, so that they're contributing the same amount to your final measure of interaction. But you can also weight equally by network. Some of those categories have more networks than others: co-expression has probably 300 networks, and physical interactions probably has, like, 80 networks. So if you weight equally by network, the co-expression category as a whole gets about 300 over 80 times more weight than the physical interaction category, because the amount that a category contributes scales with the number of networks that are in it. Okay, any questions about that?

Yeah. So the question is, with a small gene list, could you get around the problem of overfitting with something like a simulation? We haven't tried that, because when we designed the website in the first place, and we've tried to maintain it this way, we wanted something that's responsive, that takes in general less than a minute to respond to you. So the algorithms we selected were ones that we felt were pretty good but were also fast
enough that they could be done on the fly, and a simulation would take a much longer time. So, for example, maybe I should show you the extra button now, so you can get a sense of why we made the types of decisions that we did. So here's where you can choose the networks. Now it's back to networks. Okay, so here are all the networks that we have. I guess there are a lot more physical interaction data sets now than I remember. This is just a way to go through and select the various types of networks, so let's go through and select the co-expression networks. By default, we thought the co-expression wasn't very informative, so by default we just include the 20 most informative networks, but we can turn them all on. I can't remember how to do that. I think it's like this. Yeah, so that turns them all on; click again, and it turns them all off. Or we can just take the first three, because they're from labs that we trust. Open one up here and you get a little bit more information: this is the study, and a link out to where the co-expression data was drawn from, and this is a link out to the entry in the Gene Expression Omnibus database. And then these are labels that we've automatically assigned to this data set based on analyzing the PubMed entry. So here you can choose an arbitrary collection of networks every time you do the query. What that means is that all the network weighting that we have to do has to be done on the fly, even the schemes that look at co-annotation patterns. So that's where our constraints in choosing the algorithms came from.

The other thing that you can do is upload your own data. If you have a network, basically in a three- or two-column format, where you have gene, gene, and then a weight if you want (or they're all weighted equally if you don't), you just upload that network and include it in all your queries in the same way that you include other
data. If you do that, you probably want to use one of these weighting schemes where you force its weight to be non-zero, because if your network is not very informative for your query list, it's going to get a weight of zero. So I've changed the networks that I'm looking at. In fact, let's just look at protein interaction networks, so I'm only interested in how the genes in this list are physically interacting with one another. I don't want that many genes back, and I'm going to assign them all equal weight.

Okay, thank you. So Francis's question is: can we select for protein interaction data that comes from more than one study? In one way we're already implicitly doing that, because, as you can see here, if we click on the edge between them, for each one of these links we're telling you what the support is. So some of these links have a lot of support. Now, I should say one caveat is that some of these networks are overlapping and redundant with one another. Okay, I might as well say that; I don't have too much more material left, so let me go into a little bit more detail. Where you see this, it indicates a specific study. But the other thing is that there are at least eight or nine groups that go through and curate protein interaction studies, and they put together their own networks, so we download those too, and those are indicated by iRef. These are different versions of protein interaction data sets that have been curated by various groups around the world, and we include those because maybe you want to just look at the IntAct interactions or the BioGRID interactions, so you can just include those. And also, any study where there are more than 100 protein interactions reported gets its own entry, so you can choose whether or not to
include them. But there are a lot of studies that just report a handful of interactions, and those are curated by the various people listed here after iRef. For those, we don't want to give them all their own network, because it'd be kind of a boring network: it would have, like, 10 interactions and be mostly empty, sort of boring. So we group all those together into what we call the small-scale studies, and again there are small-scale studies from two different data sources. iRef is a large organization, which Francis probably knows more about than I do, that groups together about seven of these groups that are independently curating different protein interaction data sets, and BioGRID is a competitor to them that also puts together its own interaction data sets. And we don't care; we just take data from everybody. Okay, any questions about that? Great.

So I've shown you network weighting, I've pressed all the buttons, colored the nodes a little bit. Let's color some nodes; this is really fun. Okay, so now I've colored the nodes based on their annotations. Okay, let's go back and finish off our concepts.

So far I've told you about gene recommender systems, I've described what a functional interaction network is, and I've told you about how you could answer questions about gene function by looking at composite networks that are made up of weighted combinations of networks from various different sources. I've shown you one interface that does that, the GeneMANIA interface, which is one I think is particularly good because I had a hand in making it. And now I'm going to tell you how, once you're given a network, you find genes that are highly interacting with your query list. This is the guilt by association idea, and there are two ways of evaluating guilt by association. Okay, so here's the query list: these are four genes in a query list, and let's say this is our network. I
don't know why that's blue. Every once in a while I fix the slides so it's not blue anymore, but it really wants to be blue. Sometimes the world just tells you what it wants. Okay, and so you can see there's one network here, and this is what's called a connected component, meaning that you could walk between any pair of nodes in this network. But you can't go from here to here, because there's no link from here to here. All right, so these are the query list, and we want to find other genes that are highly interacting with the query list. We have two ways of going about that. We're going to score nodes based on the strength of the interaction, so red is the highest, and it means you're in the query list. Whoa, I went backwards.

So one of these algorithms is what I call direct interaction. Basically, you take every node and you look around it at how many of its neighbors are in the query list and how strong its links to those are, and then you compute its score based on that measurement. So you can see here that these nodes that don't directly interact with the query list have a score of zero, and so do these nodes that aren't even part of that network.

The other way of doing it is called label propagation. People have described how it works in various ways. One is that this is like water and these are pipes, and each one of these has a little sink that goes down, and you want to measure the flow. Or it's heat diffusion, or what's called random walk with restart. In fact, this kind of label propagation algorithm was the first algorithm that Google used as a way of ranking search results, as a random walk with restart. And basically it's pretty straightforward. The first step is you do the direct interaction and you
score these nodes, and then the second step is to redo the scoring. When you redo the scoring, this node is now interacting with nodes that have a non-zero score, so it's going to get a non-zero score. And you redo the scoring over and over again; you keep iterating, and after some point it stabilizes. And this is what label propagation stabilizes at. You can see that this component here, this non-connected component, doesn't ever get any of the label because there's no path there, and the strength of the label decreases as you get further and further away from the query genes.

All right, so this slide here just goes through everything that I just described, and it's largely for your notes. But one important point about this algorithm, and you'll see this in the next slide as well, is that if your query genes are in a group of genes that are highly interacting with one another, what has been introduced to you as a module, then when you do this iterative update of the scoring they all act synergistically, and everybody gets their score increased together at the same time. So it's a way of propagating to identify these modules.

So this is a label propagation example. Here's a network on the left; each one of these dots is a node, and the edges are links, and you can see there are, like, four modules in this network. The size of the node indicates its score. Here the query genes start with an initial score of one, and we're going to update the scores not only of the query genes but of all the other nodes in this network. Now, if we did direct interaction, these nodes here would get a very high score, because they're directly linked to a query gene. And then these ones might not get as high a score; some of them are directly linked to a query
gene, but not all of them are. In fact, all of them seem to be. No, this one's not linked to all the query genes. Okay, good. And so after label propagation, you can see that the score for this one has gone down, because there are no other nodes around to support it. In fact, it's getting a lot of sort of negative information from these nodes that are saying, okay, well, your score should actually be closer to zero. But the scores of these genes are all supporting each other, so it's growing out to identify the modules, and that's what label propagation style algorithms do. So the other way to think about it is that when you're using guilt by association with label propagation, if there are multiple indirect paths between you and one of the query genes, you get a higher score. Having multiple indirect paths between you and the query gene is kind of like saying you're in the same social clique. Having multiple indirect paths in addition to a direct path is the easiest way of identifying whether something is in the same module or not. So not only do you know each other, but all your friends, or most of your friends, are also friends of that other person; well, now you're in the same social clique. Whereas if you're just friends but you don't have any other shared friends, that largely means that it's a more distant friendship. I don't have a good explanation, so I'm just going to stop. "You have a drinking problem." What, sorry? "A drinking problem." I have a drinking problem? No, sorry, I know what you mean there. Yeah, there's someone you met at the bar, you hardly remember them, but now you're Facebook friends. Yeah, okay, that's never happened to me.

Okay, so those are the three parts of GeneMANIA. We have a large, automatically updated collection of interaction networks.
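
To make the contrast between direct interaction and label propagation concrete, here is a minimal sketch on a toy network. It is an illustration only, not GeneMANIA's actual algorithm: the gene names, the edge weights, the `alpha` blending parameter, and the simple neighbor-averaging update rule are all assumptions chosen to show the idea that scores spread along paths and never reach a disconnected component.

```python
from collections import defaultdict


def build_graph(edges):
    """Turn (gene_a, gene_b, weight) triples into an undirected adjacency map."""
    adj = defaultdict(dict)
    for a, b, w in edges:
        adj[a][b] = w
        adj[b][a] = w
    return dict(adj)


def direct_scores(adj, query):
    """Direct interaction: a node's score is the summed weight of its links to query genes."""
    scores = {}
    for node, nbrs in adj.items():
        if node in query:
            scores[node] = 1.0
        else:
            scores[node] = sum(w for nbr, w in nbrs.items() if nbr in query)
    return scores


def propagate_labels(adj, query, alpha=0.5, iters=200):
    """Label propagation (assumed update rule): repeatedly blend each node's
    initial label with the weighted average of its neighbors' current scores."""
    init = {n: (1.0 if n in query else 0.0) for n in adj}
    scores = dict(init)
    for _ in range(iters):
        new = {}
        for node, nbrs in adj.items():
            degree = sum(nbrs.values())
            spread = sum(w * scores[nbr] for nbr, w in nbrs.items()) / degree if degree else 0.0
            new[node] = (1 - alpha) * init[node] + alpha * spread
        scores = new
    return scores


# Hypothetical toy network: Q1 and Q2 are the query genes, A touches both,
# B is only reachable through A, and C-D is a separate connected component.
edges = [
    ("Q1", "Q2", 1.0),
    ("Q1", "A", 1.0),
    ("Q2", "A", 1.0),
    ("A", "B", 1.0),
    ("C", "D", 1.0),
]
graph = build_graph(edges)
query = {"Q1", "Q2"}

direct = direct_scores(graph, query)
propagated = propagate_labels(graph, query)

for gene in sorted(graph):
    print(gene, direct[gene], round(propagated[gene], 3))
```

In this sketch, B scores zero under direct interaction because it has no link to a query gene, but it picks up a non-zero score under propagation through A. C and D stay at zero either way, because there is no path from the query genes into their component.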
There are other interfaces like this, like the STRING interface, which I'm going to introduce in a second. We have a query algorithm that finds genes and networks that are functionally associated with your query gene list, and being able to weight networks dynamically is something that's unique to our interface. And we have an interactive client-side network browser with extensive link-outs. So after I pressed go and all that stuff came back, I was fooling around with that browser, but I didn't need to use anything online; all of that is downloaded onto your computer. And you also installed the GeneMANIA plugin for Cytoscape as well, right? Yeah, okay.

Okay, so where does our data come from? Like I said, there's this organization of, like, seven or eight groups that are all going through papers and doing the hard work of curating the interactions reported in these papers, and then they compile them together into these large interaction data sets. We use them extensively, and we use iRefIndex, which is a simple way of interacting with the iRef group. We use BioGRID, the evil competitor, because they have genetic interactions, and they also happen to have physical interactions. They are... okay, there we go. Ian Donaldson was the one who made the iRefIndex that we use. Yeah, yes, not everything's good in Canada. We get our co-expression data largely from Gene Expression Omnibus, and these are automatically updated. We get our shared protein domains from InterPro now, who are the people automatically scanning these things. We get predicted interactions from I2D, and I have used this word before: I've described them as interologs. That means that the orthologs of those genes interact in another organism; that's what an interolog is. We add some organism-specific databases, and there are some legacy databases that we put in. We use gene ID
mappings from Ensembl and Ensembl Plants. I want the interface to be as easy as possible, so we recognize most gene identifiers as long as they're unique; basically, we ignore non-unique identifiers. We get gene and network descriptions from Entrez Gene and PubMed, and we link out to them, and our gene annotations come from Gene Ontology, GOA, and the model organism databases. And, you know, we haven't updated in two years, and sorry about that, but we have an update coming out in a couple of weeks. I've just looked at the update and it looks like it's good; we just have to go through and double-check a few things, but we'll have new data soon.

Okay, gene identifiers. We try to recognize anything, but if you give the gene symbol, that's the best identifier for the gene. It's the one that uniquely identifies a gene, unless you're, like, a weird Drosophila researcher: in Drosophila there are gene identifiers that are spelled exactly the same way and are only distinguished by capitalization. I don't know why they decided to do that, but there you go. So there are some gene symbols, including one of my favorite genes, that we can't identify in Drosophila from the gene symbol, because we ignore capitalization.

Yeah. So maybe, or maybe not; it depends. We recognize UniProt identifiers, but only a subset of UniProt identifiers: those are the specially blessed ones that are unique. And with mass spec data, sometimes you get a mix. So we can recognize a lot of what you give us, but not necessarily everything. Just try it out and see what we can do. And as I said, we're two years out of date, and that problem will be resolved in a couple of weeks, so we might not get all the gene mappings. But if you're having trouble identifying a gene and we're not recognizing it, you can go and try another
way of identifying the same gene, and we can often get it.

Okay, we have a Cytoscape plugin. It has all the same GeneMANIA functionality. The nice thing about it is that it can take longer gene lists, up to 500 genes, or probably up to the size of the organism, and you can use it to access older GeneMANIA data releases. This was one of the things that Robin was talking about as well: if you want reproducible experiments, you can refer people to the Cytoscape plugin. It's going to behave in the same way, but you can access old data releases that are no longer on the website. And again, you can interrogate GeneMANIA networks with other Cytoscape analyses, and it supports longer query lists. And there is a way to add new organisms, so you can make GeneMANIA for, like, horse if you want. It's not straightforward, but it's possible with this Cytoscape plugin, and this paper describes the Cytoscape plugin.

We also have something called QueryRunner. We're doing gene function prediction all the time, so maybe what you want to do is just run through and make predictions, in each Gene Ontology category, of which other genes should be in that category. So we have an offline thing called QueryRunner that does that sort of thing. What we use this for is assessing the added value of a new network. In one of our collaborations, with Brenda Andrews's and Charlie Boone's labs, they generated a new genetic interaction network and they wanted to say, well, how have we changed the world? So you can say, well, with this new network you can recover this much more of what was already known about gene function. So it's the added value of the network to recovering what people had figured out about gene function in a very laborious and painful way.

Okay, and so another great gene recommender system to use is the
STRING database. They're great, and they've been around a lot longer. Their focus is more on proteins, whereas ours is more on genes. Of course, we have genes and proteins in there, but we include some interactions that are a bit more gene specific, like genetic interactions, whereas if you focus more on proteins, you're focusing more on things that are true for proteins in particular. This is an example of a STRING output. One of the nice things that STRING does is that where there's a structure for the protein, and I don't know if you can see it in the nodes, they show the structure. That's pretty awesome. So here's their list. They don't use label propagation to score genes; they use direct interaction, and there's their network. So this is the score from direct interaction, and they want that score to be interpreted as the probability of having the same function, or sharing at least some aspect of function. We don't provide a translation for our score; we just say these are the 20 most highly interacting genes. And they have, like, seven different networks, but they pre-combine all the networks for you, so you don't have your choice of which networks to turn on and off, and I don't think you can yet upload a new network to them.

Okay, so here's the comparison of STRING and GeneMANIA. One of the things that STRING is particularly good at is that they have very large organism coverage. They're essentially computing anything that there's an Ensembl genome for, so they cover 2,000 organisms, largely bacterial, and a lot of their interactions are bacteria specific, using things like, let's see if I can find it here. Yeah, so they use things like the fact that in bacteria, genes with the same function are on the same operon, as a way of saying whether or not two genes have shared function. They also look, especially in bacteria, for genes
that are physically interacting with each other and are obligate interactors; often they just get fused. So if you see a gene fusion, meaning that it just becomes one long protein, that's often a good indicator for shared function. And they also use this idea of co-occurrence. The idea of co-occurrence is that you look across all the bacterial species, and you look at the different phenotypes that they have. Like, do they have that squiggly tail whose name I forgot? All the bacteria with that tail... no one's going to help me, right? But everyone knows what I mean. If there are genes that are specifically in those bacteria, that gives a strong indication that the presence of the squiggly tail is related to these genes. So genes that are in the same set of bacteria often have some aspect of shared function. Thank you. And so that's what they're particularly good at. Those types of arguments don't work as well, at least in our experience, in mammals, and I want to say in eukaryotes, but I don't really know. I certainly know that in mammals those types of things don't work as well. Certainly there aren't operons, except in, I guess, C.
elegans. But these other kinds of co-occurrence things are not as strong a signal as they are in bacteria, in my experience. Also, they include text mining and we don't: they ask whether or not pairs of genes occur in the same abstract, and this is another useful indicator of shared function.

Okay, great. So we talked about functional interaction networks: those are networks where the strength of the link between two nodes tells you about the likelihood that they share a function. We talked about guilt by association: if you're highly linked to genes that have a specific function, it's likely that you have that function too. We talked about gene recommender systems: you put genes in, and it gives you back more genes you think you'll like. We talked about context-specific network weighting schemes: in particular, if you have a long list and you want to know what networks are relevant for predicting genes that are similar to those in the list, you find networks where those genes are highly connected to one another. We went into the difference between direct interaction and label propagation. With direct interaction, you can only get a score if you are linked to a gene in the query list, but in a label propagation type setting, your score depends not only on direct interactions but on the number of shared indirect interactions, and it's slightly better for identifying modules because of this property.

You can now use gene recommender systems to answer two types of questions: what does my gene do, or give me more genes like these. If you can't now, you certainly will be able to once you go through Veronica's assignment. And maybe you are able to select the appropriate network weighting scheme to answer your questions about gene function. If that wasn't clear, certainly I'm around and Veronica's around; ask questions about that, and it's in your notes. Okay, great.
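
As a rough sketch of the relevance rule behind context-specific network weighting, the snippet below scores each network by how much of the query genes' link weight stays inside the query list, then builds a weighted composite network. This is a crude stand-in, not the linear regression mentioned in the talk, and it deliberately ignores the redundancy rule. The network names, gene names, and the `relevance` heuristic itself are assumptions for illustration.

```python
def relevance(edges, query):
    """Fraction of the query genes' link weight that connects two query genes
    (assumed heuristic: high when the list is linked to itself, not to outsiders)."""
    within = sum(w for a, b, w in edges if a in query and b in query)
    outside = sum(w for a, b, w in edges if (a in query) != (b in query))
    total = within + outside
    return within / total if total else 0.0


def weight_networks(networks, query):
    """Normalize relevance scores so the network weights sum to one."""
    raw = {name: relevance(edges, query) for name, edges in networks.items()}
    norm = sum(raw.values())
    if norm == 0:
        # No network links the query genes together: fall back to equal weights.
        return {name: 1.0 / len(networks) for name in networks}
    return {name: r / norm for name, r in raw.items()}


def composite(networks, weights):
    """Combine networks into one by taking a weighted sum of each link's strength."""
    combined = {}
    for name, edges in networks.items():
        for a, b, w in edges:
            key = tuple(sorted((a, b)))
            combined[key] = combined.get(key, 0.0) + weights[name] * w
    return combined


# Hypothetical data: one network links the query genes tightly together,
# the other mostly links them to genes outside the list.
networks = {
    "coexp_study": [("G1", "G2", 1.0), ("G2", "G3", 1.0), ("G1", "G3", 1.0), ("G3", "X", 1.0)],
    "noisy": [("G1", "X", 1.0), ("G2", "Y", 1.0), ("G3", "Z", 1.0), ("G1", "G2", 1.0)],
}
query = {"G1", "G2", "G3"}

weights = weight_networks(networks, query)
combined = composite(networks, weights)
print(weights)
```

Here the network that mostly connects the query genes to one another ends up with the larger weight, which matches the intuition that the weights themselves tell you which data sets best explain a gene list.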
So we're on a coffee break and networking session.