Okay, good morning everybody. I'm going to continue with pathway analysis and talk about the Enrichment Map. We learned yesterday that we can use an enrichment test to find which pathways are enriched in a list of genes. It's a great idea that has been used to interpret gene lists in thousands of papers. If you go to the front page of the DAVID website, you'll notice that they're touting the number of citations they've had, and that's just one tool. Thousands of people use this approach. What you get back, as we saw yesterday, is a list of Gene Ontology terms or other pathway terms with p-values.
One thing you've probably noticed in these lists, and we saw it yesterday even in the glioblastoma list, is that a lot of the terms are related to each other. In this particular list, I have terms like B cell mediated immunity, myeloid cell differentiation, and immune effector process. As a biologist, you probably recognize that those are somehow related. But if you weren't familiar with that area of biology, you might not realize that those terms all relate to the immune system. So there's a major burden in relating these similar gene sets or similar pathways.
So if we have a table of data and there are relationships between the rows, what's a good way of dealing with that information? Anybody? What's a good way of visualizing data in a table that has relationships? A network? Yeah. Okay, great. That's exactly what we do for Enrichment Map. We take that information and plot it as a network, and by doing this it ends up being a lot easier to see the functional themes that come out of the enrichment analysis.
Instead of having a set of related terms spread out all over a big list, all the related terms are put together in one region of the network. As I mentioned before, in an enrichment map every circle is a gene set. I'll tell you more about how Enrichment Map works later, but the key point is that it makes it a lot easier to quickly see the functional themes that come out of enrichment analysis.
In the following examples, I'll be using GSEA. I mentioned briefly yesterday that GSEA takes as input a ranked list of genes. In this case, it's gene expression data ranked from the most over-expressed genes to the most under-expressed genes, and the ones in the middle are not changing. Otherwise, it's just another type of enrichment test. As output, we get gene sets or pathways that are enriched in the upper part of the gene list, clustered near the top, and also ones that are clustered near the bottom. Here they're colored red and blue.
So we take these lists and convert them to a network. Each node is a gene set, so cell cycle here, cell cycle here. Edges, or connections between gene sets, represent overlap, and the thickness of the edge is proportional to the number of genes that overlap. So if two pathways are related and have some of the same genes, the more genes they share, the thicker the line will be. The size of the node, which I forgot to show here, reflects the number of genes in the gene set: a bigger node has more genes in it. And finally, the color of the node is proportional to the enrichment score. So if you had DAVID results, you could have a p-value or a corrected p-value mapped to the color of the node.
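As an illustration of the network construction just described (nodes are gene sets, node size is the number of genes, edges reflect shared genes), here is a minimal Python sketch. It is not the Enrichment Map implementation. The Jaccard coefficient as the overlap measure, the `min_overlap` cutoff of 0.25, the function names, and the toy gene sets are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared genes divided by all genes found in either set (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_enrichment_map(gene_sets, min_overlap=0.25):
    """Return (nodes, edges) for a toy enrichment map.

    nodes maps each gene set name to its size (drawn as node size).
    edges is a list of (set1, set2, overlap) tuples (overlap drawn as edge thickness).
    min_overlap is a hypothetical cutoff below which no edge is drawn.
    """
    nodes = {name: len(genes) for name, genes in gene_sets.items()}
    edges = []
    for (name1, genes1), (name2, genes2) in combinations(gene_sets.items(), 2):
        overlap = jaccard(genes1, genes2)
        if overlap >= min_overlap:
            edges.append((name1, name2, overlap))
    return nodes, edges


# Toy gene sets, invented for illustration only.
gene_sets = {
    "cell cycle": {"CDK1", "CCNB1", "CDC20", "PLK1"},
    "mitosis": {"CDK1", "CCNB1", "PLK1", "AURKA"},
    "translation": {"EIF4E", "RPS6", "RPL3"},
}

nodes, edges = build_enrichment_map(gene_sets)
for name1, name2, overlap in edges:
    print(f"{name1} -- {name2}: overlap {overlap:.2f}")
```

In this toy input, only the two cell-cycle-related sets share genes, so they are the only pair connected, which is how related terms end up grouped in one region of the map.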
And for DAVID, because we only have one set of p-values, the nodes will just be colored one color. With GSEA, there are two sets of scores, one for up-regulated pathways and one for down-regulated pathways, so we have two colors.
There are three major uses of Enrichment Map. The first is the simple case, similar to the enrichments that you did yesterday. In these examples, we reanalyzed a published gene expression data set that examined estrogen treatment of breast cancer cells. In this experiment, they took breast cancer cells, treated them with estrogen, and measured gene expression, and they also measured gene expression in an untreated control. They had three samples at 12 hours, and they did the same experiment at 24 hours, and they were interested in the changes in gene expression between these two time points. So estrogen-treated gene expression was compared to control, and p-values were calculated that correspond to the significance of differential expression at each of the two time points. So we have two gene expression rank lists. I'll focus on the 24 hour one, just one time point.
We ran this through GSEA and loaded the results into the Enrichment Map software, which is a Cytoscape plugin that we'll play with during the lab, and you get something like this. One thing to note is that we've manually annotated this map a little bit. What you get by default in Cytoscape is just the nodes, the edges, and the colors. What we've done is look at the nodes, read the terms associated with each node, come up with a summary label, and then circle that group manually.
So these labels and these little circles we've manually added to this picture for publication purposes. If you're going to publish one of these, it's nice to annotate it a little bit more. In this case, you can see that there are a lot of gene sets related to translation and RNA transport. There are a lot of functional themes that are going up and a few that are going down. The more blue the gene sets are, the more enriched they are in the down part of the gene expression list, and the more red they are, the more enriched they are in the up part. So this is much easier to read, because you can very quickly look at it and see the major functional themes.
Yeah? Yeah, they're very similar, but they are different sizes. It's probably hard to see because they're quite similar. Like here, they're very thin. In fact, this one's so thin it's hard to see on the screen. And here's a much thicker one. So it's a little bit hard to see that they are different.
Here's a zoom in, so you can see the actual names of the terms. In the previous map, we removed the names of all the terms. When you see this in the original map in the plugin, you'll see all the term names. We've summarized these terms as microtubule and cytoskeleton, but they have names like microtubule organizing center, centrosome, spindle, and other things like that. So all of these pathways are related to each other.
Yeah? Yeah, so for publishing, what we usually do with these maps is choose the most important functional themes and just present those. We might even delete other ones that we don't think are as important or that we aren't focused on.
So we either present the whole map or just a focused subset. For that, we do something like this where we circle the functional themes and label them. Then in the supplementary material, I like to show a PDF, a vector-based image of the whole map with all the labels, so that you can zoom in and read all of them. And if we're able to with the journal, we also provide a Cytoscape session file, so people can download it and explore the map interactively. That's the ideal, I think.
The second use of Enrichment Map is comparison of two enrichments. This is quite difficult to do if you just have enrichment results listed in tables. If you wanted to look at enrichment of terms at 12 hours and 24 hours and compare them, and you had two tables from DAVID, you'd have to manually match up all of the terms. What we've done in Enrichment Map is visualize one time point as the color of the center of the node and the other as the color of the border. So 12 hours is the node center and 24 hours is the node border. We can quickly compare the enrichments, like two enrichment maps but on the same map.
So if we're comparing 12 and 24 hours, any circle that's all red means that that particular gene set is enriched in the up-regulated genes at both time points. Gene sets that have a white border or a white center were not found to be enriched at that time point. So a white center here means that these ubiquitin-dependent protein degradation terms or pathways were not found to be enriched at 12 hours, but at 24 hours they were enriched in sample versus control. And these little guys are the reverse.
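The center-versus-border coloring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the plugin. The result dictionaries, the `alpha` significance cutoff of 0.05, the function names, and the p-values are all hypothetical.

```python
def enrichment_color(result, alpha=0.05):
    """Map one gene set's enrichment result to a color name.

    result is None when the gene set was not reported at that time point.
    alpha is an assumed significance cutoff.
    """
    if result is None or result["p_value"] > alpha:
        return "white"  # not found to be enriched
    return "red" if result["direction"] == "up" else "blue"


def node_style(gene_set, results_12h, results_24h, alpha=0.05):
    """Node center shows the 12 hour result, node border shows the 24 hour result."""
    return {
        "center": enrichment_color(results_12h.get(gene_set), alpha),
        "border": enrichment_color(results_24h.get(gene_set), alpha),
    }


# Hypothetical enrichment results, invented for illustration only.
results_12h = {"translation": {"p_value": 0.001, "direction": "up"}}
results_24h = {
    "translation": {"p_value": 0.002, "direction": "up"},
    "protein degradation": {"p_value": 0.01, "direction": "up"},
}

for name in ("translation", "protein degradation"):
    print(name, node_style(name, results_12h, results_24h))
```

With this toy input, "translation" comes out all red (enriched at both time points), while "protein degradation" gets a white center and a red border (enriched only at 24 hours).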
So very quickly, if we're interested in the differences between 12 and 24 hours, we can see that all of these are up. They're important, but they're not changing between 12 and 24 hours. Only this little group here and this little group here are changing, so you can zoom in on this group here.
Enrichment Map is valid for any type of enrichment analysis. But if you have gene expression data, or information that's like gene expression data, you can load that into Enrichment Map and view heat maps. You can click on one of these nodes and see the genes associated with that node and a heat map that represents the level of expression across your different experiments. So these are the three samples for treated at 12 hours, three samples for untreated at 12 hours, and then the same for 24 hours. Green means up and purple means down.
You can see here for this term that protein degradation is up in both treated and untreated at 12 hours, so there's no difference between treatment and control at 12 hours. But at 24 hours, the proteasome is down in treated, so there's a big difference between treated and control. So a red color here doesn't necessarily mean that gene expression is up. It just means that the gene set is enriched in the rank list when you compare treatment to control. Control could be up and treatment could be down, or they could be equivalent, like here, both up or both down. Here's the reverse example, the replication fork. You can really clearly see these patterns: a very big difference between treated and untreated at 12 hours and much less difference at 24 hours.
So this really helps you zoom in on the interesting pathways that are changing in the data set.
Yeah? This is all in Cytoscape. When you try the lab, you'll see that the expression data is not visualized quite like this. For publication, we put the heat map in this little box, but in Cytoscape the heat map comes up just underneath the network. But this heat map is from Cytoscape. And we also added this for publication. In Cytoscape, you'll have column headings and you'll have to know what those mean, so there's no nice coloring for that.
Okay, so the third thing we typically use Enrichment Map for is query set analysis, and I'll show you an example. Basically, query set analysis allows you to compute an enrichment map, visualize it, and then add another set of genes on top of the enrichment map afterwards to see how that set of genes is related to the pathways that are enriched. There are different biological questions that you can answer with this. In the autism example that I showed you yesterday, we were using query set analysis to examine known disease-associated genes and how they were related to pathways that we saw enriched in the copy number variant analysis.
In this case, we took gene expression data from a mouse where there was a knockout of a microRNA, miR-1-2, in the heart. They measured gene expression in heart tissue and compared it to control. We did enrichment analysis and visualized it as an enrichment map, and we can see all the different biological processes that are changing when you knock out a microRNA. Then we took the predicted targets of that microRNA, maybe 100 or 200 genes that are predicted to be targets of that microRNA.
For people who aren't as familiar with microRNAs, microRNAs down-regulate the expression of other genes, so they're negative regulators. We expect that the targets of a microRNA are normally down-regulated when the microRNA is expressed. If you remove the microRNA from the system, they might go back up, or at least no longer be down-regulated.
This triangle represents the set of genes that are known or predicted to be targets of this microRNA, and the lines represent overlap of this gene set with one of these other gene sets. As expected, a lot of the predicted targets of the microRNA seem to be present in gene sets that were going up, which is exactly what you'd expect. For some gene sets that were going down, there doesn't seem to be a connection between this microRNA and those gene sets. Also, some of these gene sets don't have microRNA targets in them, so maybe those are not directly regulated by that microRNA. So the interpretation of this map would be that the pathways with the strongest connections to the predicted microRNA targets are probably the ones most likely to be directly regulated by that microRNA. So this is another use of Enrichment Map: we computed the enrichment map and then added another gene set afterwards to look at how it was related to the pathways.
As I mentioned yesterday, we used a lot of these things together in this enrichment map that was used to study this autism spectrum disorder copy number variant data set. These little autism and intellectual disability triangles here are query set analysis. These are the known genes, and we were looking at how they overlapped with other pathways. In this case, we did a few extra things. The circles that are colored white to red are gene sets that are enriched in the copy number variant data.
And then these other yellow shapes are pathways that are enriched in the intellectual disability genes and in the autism genes. We computed overlaps between all the sets and showed them here, so this puts a lot of different things together. Then we annotated the network by putting these circles around the groups and labeling them using a drawing program, Adobe Illustrator. I showed you this zoom in yesterday.
OK, so for the autism data set, we used all the Gene Ontology gene sets and also pathways from KEGG, NCI, and Reactome, and Pfam domains. Many of these gene sets are present in DAVID, but I'm not sure if all of them are. We filtered gene sets so that we kept those that had more than five genes and fewer than 700 genes. This led to 6,000 gene sets, and the ones that we had actual data for were only about 3,500 gene sets. If you're interested in how this was done and other work that was done behind the scenes here, you can look it up in this paper that was published last year.
That actually reminds me of one question that I was asked yesterday. With certain types of genomics data, there might be additional biases in the data set that you need to correct for in enrichment analysis. For this copy number variant project, some copy number variants affect a specific region where genes from a particular pathway are all clustered together. A common example of this is olfactory receptors, which are all clustered together in the same part of the genome. Similarly, HLA genes related to the immune system are also clustered. So if you have copy number variants that hit those regions, you'll sometimes get a very strong enrichment in olfactory receptors or something like that. But that will be mostly due to the fact that the genes are clustered together and not spread out over the whole genome.
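The gene set size filter mentioned above (keep sets with more than five and fewer than 700 genes) can be sketched as follows. This is a minimal illustration, not the code used for the autism analysis. The function name and the toy gene sets are hypothetical.

```python
def filter_by_size(gene_sets, min_genes=5, max_genes=700):
    """Keep gene sets with more than min_genes and fewer than max_genes members."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_genes < len(genes) < max_genes
    }


# Toy gene sets with placeholder gene identifiers, invented for illustration only.
gene_sets = {
    "tiny set": {f"G{i}" for i in range(3)},
    "medium set": {f"G{i}" for i in range(50)},
    "huge set": {f"G{i}" for i in range(900)},
}

kept = filter_by_size(gene_sets)
print(sorted(kept))
```

Very small sets tend to give unstable statistics and very large sets tend to be too general to interpret, which is the usual motivation for a filter of this kind.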
So some gene sets you might have to remove due to these biases in certain data. In general, most enrichment analysis methods don't help you consider these, so you have to do extra work. We did extra work that's described in the supplementary material of this paper, so if you're interested, you can go look at that.
I also noticed that just a couple of months ago, a paper was published that tries to do enrichment analysis while considering certain biases that are relevant for cancer genomics. It's called PathScan, and I'll put that paper on the wiki. They have an additional check for a potential bias that could come from very long genes. If you're looking at somatic mutations in genes, the longer the gene, the more somatic mutations you expect by chance in that gene. So if you're counting up somatic mutations and including that in your pathway analysis, you might want to consider that longer genes are expected to have more mutations than shorter genes, and maybe that should be part of the enrichment analysis statistics. So that paper tries to develop a statistical method to consider that extra bias. Dealing with these biases is actually an active area of research. Just be aware that sometimes gene sets might come up that are related to a characteristic of the type of data you're working with.
So here's what Enrichment Map looks like. It's implemented as a Cytoscape plug-in. This panel allows you to load data, so the DAVID results that you saved yesterday could be loaded here. The enrichment map is visualized here. If you have gene expression data loaded, which is optional, you can visualize that as a heat map here. And this panel has slider bars that allow you to play with the thresholds of the p-values, for instance.
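To make the gene length bias mentioned above concrete, here is a small Python sketch of the expected number of mutations per gene when mutations fall uniformly along the sequence. This is not the PathScan method. The uniform-rate assumption, the function name, and the gene lengths and mutation counts are all hypothetical.

```python
def expected_mutations(gene_lengths, total_mutations):
    """Expected mutations per gene if mutations land uniformly along the total length.

    Under this assumed null model, each gene's share is proportional to its length.
    """
    total_length = sum(gene_lengths.values())
    return {
        gene: total_mutations * length / total_length
        for gene, length in gene_lengths.items()
    }


# Hypothetical gene lengths (in bases) and observed counts, invented for illustration.
gene_lengths = {"GENE_A": 1000, "GENE_B": 9000}
observed = {"GENE_A": 3, "GENE_B": 7}

expected = expected_mutations(gene_lengths, total_mutations=sum(observed.values()))
for gene in gene_lengths:
    ratio = observed[gene] / expected[gene]
    print(f"{gene}: observed {observed[gene]}, expected {expected[gene]:.1f}, ratio {ratio:.2f}")
```

In this toy case the long gene has more mutations in absolute terms, but the short gene has more than expected for its length, which is why raw counts can mislead a pathway analysis.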
So if you were interested in just looking at the gene sets that were very strongly enriched, you could slide the slider bar to one side and the map will dynamically update. You can see just the gene sets that are very enriched and see how the map changes using more stringent or more liberal thresholds. You can also make two-color enrichment maps and do query set analysis.
About these yellow genes: if you're familiar with GSEA, it has an additional concept called leading edge genes. The leading edge genes are the genes that contribute most to the enrichment out of the entire gene set, so those might be very interesting genes in your data set too. We color those yellow here if you've loaded GSEA data. So that's Enrichment Map, and we'll get a chance to look at it in the lab.
Just to mention some future work, here is how we're thinking of expanding Enrichment Map. Right now, enrichment analysis in general, and Enrichment Map in particular, give you a very nice global view of your data set. You can see all the functional themes that are enriched, quickly find things that you might be interested in, and delve into more detail. Now often, when you want to go into more detail, you might want to go down to a level that's deeper than the gene set and see, for instance, how genes are connected to each other, or map the results onto a pathway like you saw DAVID does for KEGG pathways. For instance, in this enrichment map, there's a little section of apoptosis-related gene sets that we were interested in for one particular paper. One of these gene sets came from Reactome, which is a pathway database that has a lot more detail about gene sets, as I mentioned yesterday. So we were able to go back to the pathway database and visualize the pathway as a network.
In this view, the circles represent gene sets. What I've done is take one gene set, which represents a pathway, the apoptosis pathway from Reactome, and blow that up so that in this view I actually see genes. So here the circles represent genes, and the connections between the genes are interactions that are related to the pathway, like protein-protein interactions or reactions. Then I've overlaid the gene expression data on the pathway. In this case, it was actually protein expression data. I noticed that it wasn't the entire pathway that seems to be differentially expressed. There are two little parts of the pathway that were differentially expressed. Zooming in a little bit further, you can see that there might be interesting areas of this particular complex that are changing. That allows you to zoom into the gene level and eventually maybe make some hypotheses about the specific biochemical mechanisms that are being altered in your experiment. We'd like to have this working seamlessly in a future version of Enrichment Map.
Another thing that we do with Enrichment Map is draw circles and label them manually. It would be nice to have that done in an automated way. It's difficult to do that because different people might want to emphasize different parts of their enrichment map, so they're actually making a conscious decision about what to emphasize. That's impossible for a computer to do unless you tell it what to emphasize. But we're exploring different ways of doing this. One summer student a couple of years ago, actually I guess it was last year, built a Cytoscape plug-in called the WordCloud plug-in. If you select a set of gene sets in Enrichment Map, you can pull up the WordCloud plug-in.
It will show you all the words that are associated with these gene sets, and the bigger the word, the more frequently it occurs. So this cluster is mostly signaling, and then there are different types of signaling pathways. It has a little algorithm to cluster the words into related words, and there are a few different ways of doing this in WordCloud. So WordCloud is a free plug-in that might help you quickly summarize a set of nodes in Cytoscape in Enrichment Map.
And then finally, Ruth Isserlin, who developed the Enrichment Map plug-in, was really happy with the result, so she decided to bake an Enrichment Map cookie and bring it to lab meeting. We ate this cookie at lab meeting, and it was really good. So enrichment maps are not only useful, they're also really tasty.
OK, if I can get people's attention, I'm going to talk about the last section of the pathway analysis part of the workshop, which is gene function prediction. This also makes use of the knowledge about networks that we learned about yesterday. I'll talk about general concepts in gene function prediction and what it is. I mentioned briefly yesterday this idea of guilt by association. Then I'll talk about the GeneMANIA software that we've developed to help you do this. It's also good software for converting a gene list into a network.
OK, so as a general introduction, a lot of data is being generated by lots of different types of genomic methods, from expression data to chromatin IP to protein interactions. Generally, this huge amount of data is fairly difficult to combine and use all together. But one of the main questions that people often have is: what's the function of my gene, or what's the function of my gene list? Gene function prediction helps you collect all of that information and use it to answer that question.
The GeneMANIA software that I'll talk about is a little bit like this Google Sets thing. How many people have played with Google Sets or a version of Google Sets? A few people. This is actually a very old part of Google where you can type in words, and Google will try to extend the list with similar words based on the words that you give it. So if you give it a set of colors, Google will try to find additional colors, and it does that just by looking at how words occur in web pages. Here we typed in three names, Memphis, Knoxville, and Nashville, which you might recognize as cities in Tennessee. If I click Google Sets and ask Google to find me similar words, Google finds Chattanooga, Morristown, Jackson, and these are all other cities in Tennessee. So it's actually doing a pretty good job of extending this list to additional cities. Now, Memphis is also a city in Egypt. So if I type Memphis, Alexandria, and Cairo, which are other cities in Egypt, and I click Google Sets, Google now predicts Luxor and other cities in Egypt. So even though Memphis was in both lists, Google has used the other words in each list to figure out what type of words those are and extended them accordingly.
Now, this would be great for biology, because you could type in a set of genes and it would give you other genes that are like those genes. That's a type of gene function prediction. So if I have genes that are part of the same pathway and I want to find new members of that pathway, I could type in the genes that I know and find new genes. Or if I have genes that I know are involved in breast cancer and I want to find new genes that are involved in breast cancer, I should be able to do that as well. Or any kind of query like that.
Yeah? No. You have three days to try that. So it's a good thing we captured our screenshots before this.
So we're not actually going to use Google Sets for this. It's not really that important. The reason is that we actually tried this with Google Sets: you can type in a set of genes that are part of the same complex and ask Google Sets to predict additional genes, and it does a terrible job. There's not enough power in Google Sets for biological data, so we need to do something more for biology. So Quaid Morris, who's another faculty member at the Donnelly Centre, and I have created a system called GeneMANIA, which answers this question for gene lists. You can type in a set of genes that are related to each other, and GeneMANIA will predict additional genes that belong with that list. It will also give you a network showing how those genes are connected. GeneMANIA is at genemania.org, so you can already try it out. I'll quickly show you a demo of GeneMANIA, and then I'll talk about the concepts behind it.
OK, so here's the genemania.org website, and I can choose to find genes in different species. Right now, there are seven species, seven major model organisms, present in the system. If I'm interested in a set of genes in human, I can just type in those genes there. If I type in a gene that GeneMANIA doesn't recognize, it will tell me. It says there's an unrecognized gene symbol, and you have to fix that. So let me try another one. That one's recognized, and that one's recognized. You can type in a few more here, and you can also paste in a larger gene list, like your own gene list. I think the website supports up to a few hundred genes in a list, and there's another system I'll mention that supports bigger lists if you have really big ones. OK, so I've typed in my gene list. If you don't want to type in your own gene list, you can also click this button here to load an example. And then I can press Go.
GeneMANIA searches a lot of different genomics data and finds all the connections between the genes that I gave it. I gave it five genes, and you can see that the resulting network contains a lot of circles, which represent genes, and edges, which represent different types of connections between the genes. These connections are here. If you want a legend of what the edges mean, you can click on this button here, Networks Legend, and you can move it around to here, or maybe down here. So co-expression is purple, co-localization is blue, genetic interactions are light blue, and pathways, physical interactions, and predicted interactions are all listed here.
The genes that I entered are the biggest circles, colored gray here. GeneMANIA has predicted additional genes that are similar to these genes. If I click the Genes tab here, I can see what those genes are. Here are my query genes, the five genes that I entered, and here are the additional genes that are predicted to be similar to them. RAD51 is the most similar to the set of genes that I entered, so it's the biggest of those circles, and then the other genes are listed below it. They have these numbers associated with them that are useful for ranking the genes based on their similarity, but the numbers aren't really useful for anything else. You can't use them to compare between GeneMANIA queries, but they are useful for ranking genes within a list. So genes at the top of this list, which have the biggest scores and are also the biggest circles, are the most similar to the genes that I entered, and the ones at the bottom are the least similar in this list. I can also ask for more genes to come back. The default is 20, but you can ask for more.
One nice thing about GeneMANIA is that you can interact with this network.
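The ranking of related genes described above follows the guilt-by-association idea: a candidate that is strongly connected to the query genes is likely to be similar to them. The Python sketch below illustrates that idea with a deliberately simple score (the sum of edge weights to query genes). It is not the GeneMANIA algorithm. The function name, the edge weights, and the placeholder genes GENE_X and GENE_Y are hypothetical, and as with the real scores, these numbers are only meaningful for ranking within one query.

```python
from collections import defaultdict


def rank_candidates(edges, query, top_n=20):
    """Rank non-query genes by total edge weight connecting them to query genes.

    edges is a list of (gene1, gene2, weight) tuples.
    top_n mirrors the default of 20 returned genes mentioned in the talk.
    """
    scores = defaultdict(float)
    for gene1, gene2, weight in edges:
        if gene1 in query and gene2 not in query:
            scores[gene2] += weight
        elif gene2 in query and gene1 not in query:
            scores[gene1] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical weighted connections, invented for illustration only.
edges = [
    ("BRCA2", "RAD51", 0.9),
    ("XRCC2", "RAD51", 0.7),
    ("BRCA2", "GENE_X", 0.2),
    ("GENE_X", "GENE_Y", 0.8),  # no query gene involved, so it adds nothing
]
query = {"BRCA2", "XRCC2"}

ranked = rank_candidates(edges, query)
for gene, score in ranked:
    print(f"{gene}: {score:.1f}")
```

In this toy network, RAD51 ranks first because it is connected to both query genes, and GENE_Y is never scored because it has no direct connection to the query.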
Just like in Cytoscape, and this is actually a web-based version of Cytoscape called Cytoscape Web, which is useful for website developers who want to embed a network in their website, you can select a set of genes and move them around. I can click on genes to find out more information about them, and I can go to Entrez. I can click on connections between genes, and that will tell me where that data came from. So there's a link here between BRCA2 and XRCC2 that came from co-expression: these two papers that published gene expression data have these two genes co-expressed in them. Similarly, you can look at these other edges. Here's a pathway edge that comes from data sets from the Reactome group and from NCI-Nature, so you can click to get back to those. If you're interested in looking at multiple genes at the same time, you can drag these little tooltips around and have them all up at once. So I'm going to close those.
And then finally, there's a Functions tab that computes a Gene Ontology enrichment for the genes that are in this network. If you're interested in seeing which genes those are, you can just move your mouse over a particular term and it will highlight the corresponding genes. If you want to keep this color, you can click this little plus here, and I'm going to click that. So now all the DNA repair genes are red. Now I have another color to work with, and I can choose cell cycle arrest. And now I have another color to work with, and I can choose another one of these, like nuclear matrix. So now I'm coloring my network based on functions that I'm interested in. You can only show one function, one color, on a node at a time, and one of these functions might be more important than another.
So if I want one to take precedence, I can just drag it and move it up in this list. That's basically it for GeneMANIA. It's a very simple website, but there's a lot of power in it. I'll show you some advanced options here. The Help tab is very simple; it just shows you a couple of pictures of how the interface works. Oh yeah, so if I have functions that I've selected here and I'm working with genes, I can choose the functions legend, and this shows me the colors that I picked. If I get rid of one of these, it updates automatically here. You can also save a report. If I create a report, it will open in another page, and it takes a little while to generate. This report has all the information about the network. This is a publication-quality vector graphic figure that, once you've arranged it the way you like, you can save at high resolution. Here are the legends that I showed you, and here are the search parameters. If I don't really want to see something in the report, I can just remove it and it will disappear. If I want to put it back, I can go back up here, where I removed it. So I can select different things to add to and remove from the report. When I'm finished, I can just press this big Print button, and you can either print it or save it as a PDF. There's tons of information here about all the networks that were searched, and it's actually really big, so you might not want to save everything, but maybe only the network and the list of genes. So you can customize the report. Yeah? So can you choose which network? Yes, so the question is, can you choose which networks to search? That's under advanced options. If you click Advanced Options, you can select among the different networks that are searched.
By default, a reasonable set of networks is searched, but you can click to enable all of the networks, or you can go in here and look at pathways. I can look at pathways from HumanCyc or Reactome, and it tells you how many connections there are from that pathway database. Here are co-expression data sets, and you can click here to get more information about the papers that the co-expression was calculated from; if I click on this, I'll go to PubMed to get access to that paper. You can also upload your own networks. So if you have your own gene expression data and a correlation matrix, or your own protein interaction data set, or your own curated data set, you can upload your own data there. There's a little help button here that tells you the format, which is very simple; you can just construct it in a spreadsheet and then upload it. There are other tools like GeneMANIA. STRING is another one, which works for more organisms than GeneMANIA and searches slightly different networks. One of the main differences is that GeneMANIA is the only tool of this kind that allows you to choose the networks that you want to search and to upload your own networks. The reason is that the GeneMANIA algorithm, developed by Quaid Morris' lab, is the fastest algorithm available, and it's able to combine and search all the networks within a few seconds, whereas other tools often have to be run offline to pre-compute everything because they take longer. There are also other options here. Here you can choose to get more or fewer genes. If I don't want any additional genes to come up, I can select zero, and this will give me back just the connections within my gene list without adding any new genes.
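
As a rough illustration of what asking for zero additional genes gives you, this Python sketch keeps only the connections whose two genes are both in the query list. The edge list, its weights, the placeholder gene `GENE_X`, and the function name `connections_within` are made up for the example; it is not how GeneMANIA itself is implemented.

```python
def connections_within(query_genes, edges):
    """Keep only edges whose two endpoints are both in the query list."""
    query = set(query_genes)
    return [(a, b, w) for a, b, w in edges if a in query and b in query]


# Hypothetical toy edge list: (gene A, gene B, weight). Weights are invented.
toy_edges = [
    ("BRCA2", "XRCC2", 0.9),
    ("BRCA2", "RAD51", 0.8),
    ("XRCC2", "GENE_X", 0.7),   # GENE_X is a placeholder, not a real gene
    ("RAD51", "GENE_X", 0.5),
]

# Query of two genes, zero additional genes requested:
within = connections_within(["BRCA2", "XRCC2"], toy_edges)
for gene_a, gene_b, weight in within:
    print(gene_a, gene_b, weight)
```

Only the BRCA2 to XRCC2 edge survives here, because every other edge touches a gene that was not in the query.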
That might be useful if you're just interested in your gene list. Or you can get a lot more genes, like a hundred, and you'll get a much bigger network with more similar genes. I'll explain these in a second. Related to getting more and more genes, there's a new feature that we just added a couple of days ago, which allows you to select a set of genes that you might be interested in here and then click this button to redo the search with just these genes. Or you can go to this query menu here and do the same thing: rerun the query adding the selected genes to the query genes, or removing the selected genes from the query genes, or rerun the query with only the selected genes. If I do that, it will rerun the analysis, searching the networks that I've chosen, and now it's only using the genes that I selected. Okay, so there are a few other options in these other menus that you can try, like changing the layout or turning off labels. There's this little box here that you can move around, and it allows you to zoom in and out or move the network around if you have a bigger network. Yes, so for Cytoscape, in terms of dragging the network, I can just quickly show you. There are two ways of dragging the network in Cytoscape. One is, if you have a three-button mouse, clicking the middle mouse button will allow you to drag the network around. The other way is using this bird's-eye view, which you can move around. So it's a slightly different way of doing it, but it does the same thing. Okay, so advanced options. GeneMANIA has a couple of different ways of combining networks, and by default it will do something that's very similar to Google Sets if you give it enough genes. Just like Google Sets, you have to give it a certain number of examples. You can't just say Memphis, otherwise it won't know if you're talking about Tennessee or Egypt.
So you have to give it a few examples for it to learn what you want, and then it will give you more of what you want. The limit for GeneMANIA is that it needs at least five genes that are similar to each other to learn what you want. If you give it fewer than five genes, it will use Gene Ontology biological process to find similar genes, so it will find genes that have similar biological processes to the ones that you gave it. I'll go over that a little bit more in the slides. Oh yeah, one additional piece of information is the ranking of these networks. In this particular query that I ran, the most informative networks were predicted interactions and pathway interactions, then co-expression, and then physical interactions. Co-localization was very limited, and no other types of networks came up. This percentage here tells you how much information from this type of data GeneMANIA is using to find similar genes. The way that works for larger queries, of five genes or more, is that GeneMANIA looks for networks that have a lot of connections between the genes that you gave it. So if you give GeneMANIA a list of, say, 10 genes, and it finds a network in which those genes aren't connected at all, that network will get a low weight and you won't ever see it. So GeneMANIA tries to give a high weight to the networks that have a lot of information about your genes, where those genes are highly connected. Okay, any questions? I went to File and then Create Report. I can also save different types of information as text files.
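
The intuition behind those network percentages (networks in which the query genes are well connected get more weight) can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions, not GeneMANIA's actual weighting algorithm; the scoring rule, the network contents, and the name `network_weights` are all invented for the example.

```python
def network_weights(query_genes, networks):
    """Score each network by the total weight of edges that connect two
    query genes, then rescale the scores to percentages."""
    query = set(query_genes)
    raw = {
        name: sum(w for a, b, w in edges if a in query and b in query)
        for name, edges in networks.items()
    }
    total = sum(raw.values())
    if total == 0:
        # No network connects the query genes at all.
        return {name: 0.0 for name in networks}
    return {name: 100.0 * score / total for name, score in raw.items()}


# Hypothetical toy networks: lists of (gene A, gene B, edge weight).
toy_networks = {
    "co-expression": [("A", "B", 1.0), ("B", "C", 1.0), ("A", "X", 1.0)],
    "pathway": [("A", "C", 2.0), ("A", "B", 4.0)],
    "co-localization": [("A", "X", 1.0)],  # query genes not connected here
}

weights = network_weights(["A", "B", "C"], toy_networks)
for name, percent in weights.items():
    print(f"{name}: {percent:.1f}%")
```

In this toy case the co-localization network gets a weight of zero, because it has no edge between any two query genes, which mirrors the "you won't ever see it" behavior described above.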
For instance, I can save the network as a text file, and it saves it as a GeneMANIA network here. This network I can load up in Cytoscape if I want. It's hard to see here because it's wrapped, but it has different columns: gene A, gene B, the weight of the connection between the genes, the type of connection, such as co-expression, et cetera. I'll show you later that you don't really ever need to load this into Cytoscape, because we have a Cytoscape plugin that runs GeneMANIA queries inside Cytoscape, so if you're in Cytoscape you can just pull these networks in directly from GeneMANIA. Any other questions? Yeah? So I'll explain a little bit. We try to update the database every few months, and I'll explain more in the coming slides about where we get all the data from. We pull data from other databases. All the gene expression data, and all the papers associated with it, for instance, come from the GEO database, the Gene Expression Omnibus, and whenever that's updated, if there's a retracted paper, they have to deal with it. We don't do any curation; we just pull data in from all the databases. Okay, so I'll explain a few more aspects of that in a little bit more detail. I just wanted to give you a demo of GeneMANIA. So GeneMANIA is using this guilt-by-association principle that I explained yesterday. Basically the idea is that you have a set of genes that are connected to each other: all these genes are involved in the cell cycle, here's a bunch of genes that are involved in protein degradation, and here are two genes that are unknown, unknown gene one and unknown gene two. Now look at this network, which is a co-expression network.
Edges here represent co-expression in this microarray expression data, where all the rows are genes and all the columns are different conditions. If I look for genes that have similar patterns across all the conditions, and I draw a line that reflects how strong the similarity is, that's my co-expression network. Usually these are calculated using Pearson correlation. Here I've labeled genes that I know are involved in the cell cycle, and this is an unknown gene. I would predict this unknown gene to be part of the cell cycle if it's co-expressed with other genes that are in the cell cycle, and this gene to be involved in protein degradation if it's co-expressed with other genes that are involved in protein degradation. So that's the guilt-by-association principle that I explained yesterday. There are two types of functional prediction that I alluded to. One is "give me more genes like these": I have some genes that are in the cell cycle here, and I want to find more genes like that. Well, if I have this network, the first gene that I'll find is probably unknown one, UNK1, right? So I would predict that it should be part of the cell cycle list. If I have the cell cycle list and I want to extend it, I can work my way outwards from the cell cycle genes in this network, finding genes that are highly connected to cell cycle genes. That's the first way of doing it, and it's what you would do if you had a set of genes that you know something about and you wanted to extend the list. The other way of doing it is focusing on an individual gene and asking, "What does my gene do?" Then you're focusing on this unknown gene, you want to know what it's connected to, and you would predict that it's involved in the cell cycle. So it's just two ways of looking at the same network. So, "give me more genes like these."
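
To make the co-expression network and the guilt-by-association idea concrete, here is a minimal Python sketch. It assumes a tiny invented expression table (genes as rows, conditions as columns), an arbitrary correlation cutoff of 0.8, and a simple "sum the correlations per known function" vote; the gene names, values, cutoff, and function names are all hypothetical, and this is a much simpler procedure than what GeneMANIA actually does.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, threshold=0.8):
    """Connect every pair of genes whose correlation reaches the cutoff."""
    genes = sorted(expression)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expression[g1], expression[g2])
            if r >= threshold:
                edges.append((g1, g2, r))
    return edges


def predict_function(gene, edges, known):
    """Guilt by association: pick the known function whose genes are most
    strongly co-expressed with the gene of interest."""
    votes = {}
    for a, b, r in edges:
        if gene in (a, b):
            neighbor = b if a == gene else a
            label = known.get(neighbor)
            if label is not None:
                votes[label] = votes.get(label, 0.0) + r
    return max(votes, key=votes.get) if votes else None


# Invented expression values: rows are genes, columns are four conditions.
expression = {
    "CC1": [1.0, 2.0, 3.0, 4.0],
    "CC2": [2.0, 4.0, 6.0, 8.0],
    "PD1": [4.0, 3.0, 2.0, 1.0],
    "PD2": [8.0, 6.0, 4.0, 2.0],
    "UNK1": [1.5, 2.5, 3.5, 4.5],
    "UNK2": [5.0, 4.0, 3.0, 2.0],
}
known = {
    "CC1": "cell cycle",
    "CC2": "cell cycle",
    "PD1": "protein degradation",
    "PD2": "protein degradation",
}

edges = coexpression_edges(expression)
print("UNK1:", predict_function("UNK1", edges, known))  # cell cycle
print("UNK2:", predict_function("UNK2", edges, known))  # protein degradation
```

The same edge list serves both questions from the talk: starting from the known cell cycle genes and walking outwards finds UNK1, and starting from UNK1 and looking at its neighbors predicts cell cycle.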
The GeneMANIA system is basically doing this, just like the Google Sets example that I mentioned. It takes all of the information that we have on your query list and recommends additional genes that should be part of that list. In that sense, it's kind of like a gene recommender system. There are lots of different biological questions that you can answer using a system like this, and I mentioned a few of them. If you're planning an RNAi screen and you don't want to do a full-genome screen, you only want to study 1,000 genes, or if you're doing sequencing and you only want to sequence 1,000 genes, you need to figure out which 1,000 genes to choose. You might be interested in a particular area, like a particular type of cancer, and you know genes that are associated with that cancer. Put all of those genes in a gene recommender system and ask for the 1,000 genes that are most similar to them. GeneMANIA will do that for you, for instance, and then you can take those 1,000 genes and say, okay, I'm going to study these. Then the second question: what does my gene do? In this case, you would enter just a single gene, like CDC48, and you'd use the GeneMANIA type of recommender system to find additional genes that it connects with, and then you'd use enrichment analysis to see which Gene Ontology terms or pathways are enriched in that network. In this case, we typed in CDC48, and CDC48 interacts with all of these other genes. I looked in the Functions tab and found that proteolysis is enriched. So I would predict that my gene is involved in proteolysis if I didn't know anything about it, okay? So GeneMANIA handles those two questions. I guess it handles one other question, which is not really related to gene function prediction but is still very useful.
A side effect of having GeneMANIA query all of these networks is that, if you just have a gene list and you want to find all the protein interactions involving that gene list, you can paste the list into GeneMANIA and ask it to search just protein interactions. You get all the connections between the genes in the list, and then you can use that in Cytoscape or do other types of analyses with it. You could even type all the genes in the genome into GeneMANIA, and it will find all the connections, say all the protein interactions, between all genes in the genome, and give you that huge network, which you might be interested in analyzing. So GeneMANIA has three parts. It has a large, automatically updated collection of interaction networks. Yeah? Yes. Yeah. And so that ends up actually being fairly useful, and I'll talk about that a little bit more with the GeneMANIA Cytoscape plugin. So there are three parts of GeneMANIA: this automatically updated database of interaction networks, the query algorithm that searches the interaction networks, and this interactive website that allows you to browse the results. There are a lot of link-outs, so if you want to find out more information about genes or networks, you can connect back to the original data. So this is a slightly more detailed answer to the question earlier about how often things are updated. GeneMANIA currently pulls data from these different databases.
We have co-expression and co-localization data from GEO. Pathways come from Pathway Commons, and physical interactions and genetic interactions come from Pathway Commons and BioGRID. Pathway Commons, as I mentioned yesterday, collects data from a number of different databases: MINT, Reactome, NCI, the Human Protein Reference Database, and others. We get shared domain networks from Ensembl. A shared domain relationship between two proteins occurs if those two proteins share domains. So a protein with a kinase domain will be connected to other proteins with kinase domains, and proteins with a similar set of domains will be connected with an even stronger connection. Predicted interactions come from a few other sources, like I2D, and then there are some other ones. Depending on the species, there might be some specific networks that we've added. For mouse, for example, we were able to get a phenotype correlation network for genes, so two genes that have similar phenotypes when knocked out would be connected. Those are interesting functional relationships. This system is run automatically, and we update it every few months. Usually when we update it, we'll send a message to the announcement list and post something on the GeneMANIA website saying that the dataset has been updated. The gene identifiers that GeneMANIA recognizes are official gene symbols, Entrez Gene IDs, Ensembl IDs, UniProt identifiers, and also some synonyms and organism-specific names if they're non-ambiguous. I mentioned yesterday that you don't really want to type in P53, because it's not an official gene name; the official gene name is actually TP53. But GeneMANIA will recognize P53, because we've put all the synonyms in there as well, and we've just removed synonyms that are ambiguous.
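
The synonym handling can be pictured with a small Python sketch: a synonym is kept only if it points to exactly one gene. Apart from the P53/TP53 pair mentioned in the talk, the genes and synonyms below are placeholders, and the case-insensitive matching and the function names are assumptions made for this example rather than a description of GeneMANIA's real identifier mapping.

```python
def build_synonym_index(gene_synonyms):
    """Map each synonym to its gene, dropping any synonym that has been
    attached to more than one gene."""
    owners = {}
    for symbol, synonyms in gene_synonyms.items():
        for synonym in synonyms:
            owners.setdefault(synonym.upper(), set()).add(symbol)
    return {
        synonym: next(iter(genes))
        for synonym, genes in owners.items()
        if len(genes) == 1
    }


def resolve(name, official_symbols, synonym_index):
    """Return the official symbol for a name, or None if it is unknown or
    ambiguous. Assumes official symbols are stored in upper case."""
    key = name.upper()
    if key in official_symbols:
        return key
    return synonym_index.get(key)


# Hypothetical synonym table; GENE_A, GENE_B and XYZ1 are placeholders.
gene_synonyms = {
    "TP53": ["P53"],
    "GENE_A": ["XYZ1"],
    "GENE_B": ["XYZ1"],  # same synonym on two genes, so it is ambiguous
}
official = set(gene_synonyms)
index = build_synonym_index(gene_synonyms)

print(resolve("p53", official, index))   # TP53
print(resolve("XYZ1", official, index))  # None, ambiguous synonym dropped
```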
Basically, synonyms where the same name has been attached to more than one gene we've thrown out, and we've kept the rest. So in certain cases, when the name is non-ambiguous, it will recognize the common name for a gene. GeneMANIA currently has seven organisms and about 1,200 interaction networks, half of which are co-expression. I showed you the web network browser. There's also a plugin for Cytoscape. The GeneMANIA website is meant for people who are just generally interested in doing a quick search, but it's limited because web browser technology is limited, so it won't be able to visualize very large networks. The site will start getting slow. If you type in 200 or 300 genes, it will probably work, but we've set a limit, which I can't remember exactly; it's something like 300 or 400 genes. So if you have lists that are bigger than a few hundred genes, you should use the Cytoscape plugin. The Cytoscape plugin is only limited by the power of your computer. You can type in as many genes as you want, and you can also ask for as many genes as you want to come back, so it's not limited to just 100. You can assign similarities to the entire genome if you're interested. The only issue with the Cytoscape plugin is that, the way it's currently implemented, before you use it you have to download the entire database for the organism that you're interested in. The database is pretty large, so it might take 15 or 20 minutes to download on a fast connection. That's a little bit annoying, but once you've downloaded it that one time you don't have to download it again unless there's an update, and you can check for updates. But this is a great way in Cytoscape to get networks from a gene list. And this is what it looks like.
You can just paste your genes in here, select the networks to search here, and away you go. You get results like this, which look very similar to the GeneMANIA website. It has functions and genes here, and the network is colored the same way. So it's mirroring the website; it's just more powerful. There's also something for power users who are interested in running GeneMANIA many times, for instance automating it in some kind of pipeline, say if you're doing gene function prediction with a new genome that you've sequenced, or you just want to run it thousands of times. The GeneMANIA plugin includes a bunch of command-line tools that allow you to completely automate any GeneMANIA analysis and save the results as text files that you can work with in another tool. There are some very powerful tools in that set of command-line options for the plugin. We're going to be adding additional organisms, like E. coli, and also non-coding genes, microRNAs, and we'll be adding regulatory networks from chromatin IP, and microRNA networks. We'll be adding additional phenotypic information from OMIM, and eventually we'll be adding more orthology information, so if you have a network in one species, you can map it over to another species. That's already done somewhat in the predicted section, since a lot of the predicted networks are predicted between species, but we're going to do this in a more systematic way. GeneMANIA is at genemania.org. If you want to try new features before they come out (there aren't always new features there), you can try the beta.genemania.org site. And sometimes in a class situation, if lots of people are using genemania.org, this is another, independent server that is often useful.
I included some extra slides that go into more detail about how the algorithm works and about the other kinds of network weighting, so I'll just briefly mention them. The only thing I want to mention here is that there's a section of the advanced options called Network Weighting, and there are a couple of options under equal weighting: you can choose equal by network or equal by data type. What that allows you to do is bypass GeneMANIA's weighting of the networks. Normally GeneMANIA will try to give you the most informative networks for your gene set, and it will remove other networks that are less informative. But maybe you want to find every possible connection between your genes, in which case you should choose equal by network. GeneMANIA will then treat all networks equally, and you'll get them all back. So that would be the setting to use if you wanted to convert your gene list into a network: you just select equal by network, and all of the possible networks of the type that you selected will come back. There are other slides here that you can go through, but I'm not going to go through them in detail; they just talk more about the algorithm and how the weighting works. So, any questions? For GeneMANIA, we didn't have a lot of time to go through an official lab, but we did include, and it was handed out today, an OpenHelix tutorial. I heard you learned about OpenHelix tutorials a couple of days ago. Basically it's a company that makes tutorials, and we asked them to make one for GeneMANIA, and there are a couple of pages there.
If you're interested in following a kind of protocol for GeneMANIA, you can follow through those pages. It goes through all of the things that I mentioned, but in more detail, plus a few additional things. So that's basically it for the section. Any questions? So, to summarize, I tried to teach you some of the most common and practical pathway analysis methods and tools that are out there. Enrichment analysis is useful for summarizing gene sets and finding which pathways are enriched. That's the most common pathway analysis method, and there are lots of tools out there for doing it, like DAVID and others. We've developed the Enrichment Map plugin to help visualize the results, and that's the way we prefer to do enrichment analysis these days: we always visualize the results with enrichment maps, because it's just easier to interpret that way. Then I showed you Cytoscape, which is useful for network analysis. I could really only give you a brief flavor of Cytoscape, so you learned how to use it briefly and saw how plugins work, but there are lots of other plugins available. I tried to give you some pointers on certain slides that can point you in different directions if you're interested in particular types of analyses, so you can learn more about them. And then finally, GeneMANIA is a website that we've developed for, I guess, three things. The first two are gene function prediction: extending a gene list, the gene recommender system, and finding out what function a single gene might have. The third is converting a gene list into a network. Okay, so, yeah?
So the question is about the clustering chart in DAVID. You looked at that in the DAVID tutorial yesterday. The clustering results from DAVID are basically a way of making an enrichment map, but in text format. They've tried to create little clusters of related gene sets, and you can choose different clustering parameters and group gene sets based on those parameters. The enrichment map is a similar idea: it's grouping similar gene sets together, but it's providing a visualization instead of just a straight list of gene sets. So those are conceptually similar, just two different ways of doing it. Yeah? So there is an R package that allows you to control Cytoscape; it's called RCytoscape. It allows you to do a fair number of things. For instance, if you have a network in R, you can visualize it in Cytoscape. But it's not as connected to R as it could be, so not everything in Cytoscape is available in R. One of the goals of the next version of Cytoscape, which we're working on, is to more closely link scripting languages with Cytoscape, so that you can control all aspects of Cytoscape from them. So there is actually a fair amount of scripting support in R, but it doesn't have complete coverage, so you might download a plugin like Enrichment Map and find that it can't be controlled from R. But you could mix and match the tools: if you used R, or Bioconductor, for your gene enrichment analysis and saved the results, you could load them into the Enrichment Map plugin and visualize them. That would be a way of combining them, not really controlling it from R, but just inter-converting data.
Bioconductor in R has lots of tools, including enrichment analysis. If you're doing that in R, and it's just a personal preference, you can save the results and visualize them in an enrichment map. Enrichment Map supports GSEA, DAVID, and BiNGO output by default, so that's specific output that it can load right away. There's also another option, generic, which is for any other tool that's out there, as long as it saves its results in some format. You might have to reformat a little bit, but then you can just load it up as a generic format. Any other questions? Okay.