All right, now we will change gears a little bit. We assume that you have mapped all your genomic variants from cancer genome sequencing to some really interesting genes, and you want to map these genes further to networks to understand what they might be doing in the context of cancer biology, and maybe find some clinical relevance there. Also, mutations tend to be sparse, and you can use the idea of pathways and networks to group meaningful variants together into biological themes. All right, some of the learning objectives of the module. Genes have many names, and it can be a complex maze to navigate, so we need to understand how to use identifiers for genes and how to convert between those identifiers. Also, genes have been studied to a greater or lesser extent over the years, and therefore they have many annotations from these studies: we have ideas of what genes may be doing, and in which cell or compartment they may be acting. Then we will learn about the gene set enrichment test and how we would use it for our purpose of interpreting genomic data. Then we will learn about multiple testing corrections and why one would use them, and finally we will learn about gene set visualization and a little bit about enrichment maps. Okay, so this is a very classical scenario in the era of high-throughput biology. You do a cool experiment, maybe a sequencing assay, and you come out with a thousand hits. Now what? You don't really want to analyze these hundreds or thousands of hits one by one, so what you usually do is perform some sort of statistical or mathematical analysis. You come out with a ranking or clustering of genes and end up with these long lists of genes. And to better interpret your long list of genes from your experiment, you need to use prior knowledge.
So genomic databases, pathways, networks, literature — and you'd better use these with dedicated analytical tools, because otherwise it's a hassle, and in the end maybe you'll find something new about your favorite molecular mechanism or gene and publish a good paper. One way of doing this is using a literature database: you have 100 genes of interest, each one of them will have maybe 10 associated papers, some genes are famous — P53 will have 100,000 papers — and then you just do a little bit of reading one by one, and you'll eventually end up in the same spot, but maybe you don't have all the time in the world to do that. So this is why you want to use pathway and network analysis tools: to do this in a more systematic, statistically sound way, and come up with a faster solution. So pathway enrichment analysis, the way we describe it, comprises two major components. On the one hand you have genomic activity profiles, which is a very broad way of saying anything that you can measure in a high-throughput experiment. For example, you could measure all the somatic mutations in a set of tumor samples, or you could measure the gene expression profiles of all the genes in the cell, so you'll get maybe a list of genes, or a ranked list of genes, or perhaps genes with scores. And on the other hand we have abundant knowledge about what genes have been observed to be doing in various experiments, accumulated over years and years of research in various pathway databases, network databases and genomic databases. Using these two sets of information — gene annotations and the genes themselves — you can use various tools that statistically integrate gene information and annotations, and then you come up with, say, pathways or networks that are enriched in your genes of interest, or maybe some structural relationships or gene regulatory networks, and so on.
So this is a very general description of a family of methods that can be applied. But before you go into that analysis, you want to call genes by the right names. This is really important because there are now even meta-analyses showing how everyone's favorite tool, the Excel spreadsheet, tends to generate wrong gene names just by opening up text files, which apparently affects up to 20% of the scientific literature in high-impact journals — which is scary, right? But besides these very obvious things, such as OCT4 being converted to the date October 4th, there are other, more subtle issues. Because people have studied genes over time, they have tended to call them different names, and these different aliases still float around and can cause confusion in your bioinformatics pipelines. Also, different genomic databases name genes according to their own identifiers, and often there is no clear one-to-one mapping between them. Another issue to keep in mind is that your genomic experiment comes with its own IDs: perhaps you did a microarray experiment, and the microarray probe sets come with their own IDs. When you map these probe set IDs to genes, there is no one-to-one mapping in many cases, and you want to use automated tools to do that. Moving on to some of those ID challenges, one of them is one-to-many mappings — a particular identifier corresponding to multiple genes — with identifiers having been assigned over time. People who talk about genes sometimes don't talk to people who talk about proteins, so gene identifiers and protein identifiers do not overlap cleanly, and then there are major efforts to bring everything together. For example, gene symbols are supposed to be a standardized, unique set of names. However, even the gene symbols from the HUGO Gene Nomenclature Committee changed by about 10 percent over a few years, as we recently found out.
Actually, the Excel error I just mentioned — OCT4, the stem cell regulator — is a very well-known example, but there are others that are affected too. And there will always be problems reaching 100 percent coverage of all the genes in the genome, so if you really want to achieve that, you should go through these cases one by one and perhaps try different resources — GeneCards, Ensembl, UCSC — to make sure that the gene you're talking about maps to the same identifier across multiple sources. There are automated tools that do identifier mapping for you. This is part of g:Profiler, which was part of my PhD back in the day. It's called g:Convert, and it allows you to convert a set of gene identifiers between various different sources; there are probably a few dozen different types of identifiers supported for each species — for human, it's more than 100. Now, let's imagine that you have mapped all your gene identifiers and you're happy with the gene set; you need to start interpreting them in the context of pathways and networks. Pathways and networks are really abstractions of cellular systems and biology. So here on this slide, on the one hand, you'll see a pretty three-dimensional picture of the cell, and on the right-hand side you'll see the database description of that cell as far as we know it, using our current knowledge. So various components of the cell, various processes of the cell, and molecular functions ultimately boil down to lists of genes. This is the simplest type of pathway or functional characterization of biology: just lists of genes associated with some sort of function, tagged with a particular term or a description or a sentence. Obviously, biology is much more complicated than that, because genes interact with one another: they activate or inhibit one another or form physical complexes.
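The alias problem described above can be illustrated with a minimal sketch: a lookup table that maps informal gene names to official symbols. In practice you would use a service such as g:Convert rather than a hand-built table; the alias entries below are a hypothetical illustration, not a real database dump.

```python
# A minimal, hypothetical alias table mapping gene name variants to
# official HGNC symbols. Real pipelines should use a maintained
# resource (e.g. g:Convert) instead of a hand-curated dictionary.
ALIASES = {
    "POU5F1": "POU5F1",   # official symbol
    "OCT4": "POU5F1",     # common alias, famously mangled by Excel into a date
    "OCT-4": "POU5F1",
    "TP53": "TP53",       # official symbol
    "P53": "TP53",        # informal alias
}

def to_official_symbol(name):
    """Map a gene name or alias to its official symbol; None if unknown."""
    return ALIASES.get(name.upper())

print(to_official_symbol("Oct4"))  # POU5F1
print(to_official_symbol("p53"))   # TP53
```

The point of the sketch is the normalization step (upper-casing, one canonical target per alias); any real mapping must also handle one-to-many cases, which a plain dictionary cannot.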
But in the simplest form, when we talk about functional annotations, we just talk about gene sets and gene lists. Absolutely — that is one of the major problems, the redundancy of the various pathway lists. But these aren't pathways — aren't gene sets different from pathways? Yes, it depends on how you analyze them. The easiest way to do pathway enrichment analysis is to treat the pathway as a set of genes without interactions. That is the very simplest abstraction of the biology. You add a layer of interactions and it becomes much more complex, much more meaningful, but you also reduce coverage — not every pathway that we know of has very detailed interactions. What I'm saying is that even a gene set can be more... Absolutely. Yes. More on that in the next slides. So what are pathways? I think everyone who works in biology, or even in pathways, will have a different interpretation of what a pathway is. In the context of this lecture, a pathway is mostly a set of genes that are working towards the same function in some way, ultimately coming from a database where human curators or curation algorithms have accumulated that knowledge over time. One of those databases is called the Gene Ontology. The Gene Ontology's biological process branch is a hierarchical structure with thousands of sets of genes that have been associated with various pathways and processes. Other pathway databases include Reactome, developed here at OICR, which generally provides more detailed information, also including interactions and post-translational modifications — the various details we need to know about how genes interact in order to carry out a function. But if you have a set of genes, then pathway enrichment analysis starts to look like a hammer with a bunch of nails: you can use that type of approach to do many other things.
Besides the Gene Ontology biological processes, you can also look at cellular components, hypothesizing that maybe many of the genes you detected are enriched in a particular cellular component — maybe that has something to do with your experiment. You can look for disease annotations, or positions within a chromosome, or perhaps transcription factor binding sites that all seem to be near your genes of interest. So the statistical framework is relevant and applicable to much more than just sets of genes associated with pathways. So what is the Gene Ontology? This is the major resource that you are likely to use when you do pathway enrichment analysis in this sense of testing gene sets. The Gene Ontology generally contains two things. First of all, it's a very complex and detailed dictionary of biological phenomena. It's a set of words and concepts and phrases from biology, connected into a hierarchical structure. At the top of the hierarchy, you may have something like "biological process", which is further distinguished into cellular processes and extracellular processes and so on, all the way down to the bottom-most levels of very detailed processes and functions that may be associated with only one gene. So point one: the Gene Ontology is a dictionary. And the second part is that it's a set of annotations. For many genes that we know of, we have little arrows that connect them to these particular words in the dictionary — in other words, we have found that a gene is associated with a particular process in the Gene Ontology. And the Gene Ontology is general: it applies to the biology of prokaryotes and eukaryotes, humans and mammals and fish — everyone alive should essentially be able to be mapped to the Gene Ontology. Here's an example of that structure that I described earlier.
This is a directed acyclic graph, or an ontology, where the terms are more general towards the top and more specific towards the bottom. They are connected through various logical relationships: a biological process can be part of another biological process, or it can be a more specific manifestation of it. These different terms describe multiple levels of detail of gene function. And as we already started to discuss, genes will be annotated at various levels of organization here, and therefore there's an immense level of redundancy: one particular gene can be associated with hundreds or even thousands of Gene Ontology terms, because these terms are interrelated through the hierarchy. What does GO cover? The Gene Ontology has three different branches — it looks like a tree upside down — and those three major branches represent cellular component, molecular function and biological process. Biological process is usually the most informative when you do pathway enrichment analysis, because it covers all the cancer-related processes, such as cell cycle, proliferation and differentiation. The molecular function branch leans more towards biochemistry, and cellular components are what they are, so it's more difficult to interpret them directly in the context of cancer. Okay, so as I discussed, there are the terms that make up the Gene Ontology, and there are the annotations. Where do the terms come from? The terms generally come from two major sources. Literature curation is carried out by experts in the field: the Gene Ontology editors at the EBI in Cambridge, and also species-specific databases that maintain their own Gene Ontology annotations, but these ultimately converge into a Gene Ontology database that is updated very frequently — I think it's updated every day. Terms are generally added by request, and the experts help with major developments, because this is a live organism, so to say.
It changes every once in a while, and sometimes there are major changes, like bigger branches being erased or created. And here are some numbers: it's evolving pretty rapidly, and when you use an online tool to do Gene Ontology analysis, it's worth paying attention to when it was last updated, because that may actually change your results quite a bit. The second part of the Gene Ontology is the annotations, and these are even more dynamic, because people discover new gene functions all the time. Genes are linked or associated with GO terms by trained curators or, more often, by algorithms that associate these genes with various processes and pathways. These are known as gene associations or GO annotations, and as we already mentioned, there are multiple annotations per gene. So why are there multiple annotations? One reason is that genes work in multiple pathways, obviously, but the other is more technical: because all of these pathways are hierarchically related, so are their gene contents. If a particular gene is part of a pathway, it is also part of any parent pathways. Here's an example with Aurora kinase B, which is associated with B cell apoptosis, a particular form of apoptosis. In addition to that specific annotation to B cell apoptosis, it is automatically also part of any other, more general forms of apoptosis and cell death, all the way up to the root biological process term. So you can see that when an expert curates Aurora kinase B into one process or pathway, the annotation automatically gets replicated to dozens of other pathways above that particular pathway. That creates a lot of information for algorithms to crunch through, or to visualize. Annotations come in different flavors, and that also determines their quality in terms of the scientific knowledge behind them. The most foolproof is manual annotation curated by scientists: it is generally high quality, but it also makes up a much smaller fraction of the total number of annotations.
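The upward propagation of annotations described above — the Aurora kinase B example — can be sketched in a few lines. The parent links below are a toy hierarchy for illustration, not real GO term relationships.

```python
# Toy GO-style hierarchy: each term maps to its parent terms.
# These links are illustrative only, not real GO identifiers.
PARENTS = {
    "B cell apoptosis": ["apoptosis"],
    "apoptosis": ["cell death"],
    "cell death": ["biological process"],
    "biological process": [],   # root of the branch
}

def propagate(term):
    """Return the term plus all of its ancestors (the 'true path' rule:
    an annotation to a term implies annotation to every parent term)."""
    terms = {term}
    for parent in PARENTS.get(term, []):
        terms |= propagate(parent)
    return terms

# Annotating AURKB to "B cell apoptosis" implicitly annotates it
# to every more general term above it:
print(sorted(propagate("B cell apoptosis")))
```

This is why a single curated annotation multiplies into many term memberships: the set returned here would grow with every extra level of the real hierarchy.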
So only so many annotations can be curated by experts, especially these days, when there are many high-throughput experiments with supplementary tables of genes associated with who knows what. Many times, automated procedures go through these supplementary tables and associate the genes with functions — these are called electronic GO annotations, and they actually make up the majority of annotations. In some tools, you have the opportunity to exclude electronic annotations and stick only with the higher-confidence annotations when you perform your analysis. So the key point is: once you find your favorite genes or favorite annotations, make sure to go back to the original annotations, or at least the quality scores of those annotations, and see what type of evidence was actually provided to conclude that a gene is associated with a process. Evidence types have been formalized, and you can see here that they range all the way from a direct mutation experiment, to gene conservation, to gene expression similarity, to things like author statements or unknown or electronic annotation. That just tells you that certain results should be taken with a bigger grain of salt than others. And there are certain visualizations, including in the g:Profiler software, that give you a quick glance at what type of evidence supports your findings in terms of pathways and networks. Here's an example. The various types of evidence that support the relationship between a gene and a pathway are color-coded. The darker, redder colors represent high-confidence experiments, and the lighter and bluer it goes, the more likely the evidence is high-throughput, maybe curated by an algorithm rather than a person — and that tells you how confident the finding is in the context of known information.
I've mentioned this a couple of times now, but it's worth mentioning again: the Gene Ontology, and especially the gene annotations, are evolving very rapidly, so information gets past its best-before date quite quickly. This is a study that we performed in our lab, where we went back into different releases of the Gene Ontology and essentially counted how many terms existed in the vocabulary in different years, and how many gene annotations existed in different years. In this figure, you'll see that the number of terms — the thickness of the vocabulary of biomedical knowledge — basically doubled over the past seven years, from 10,000 to 20,000 in mouse and human, and other species showed similar growth. But where it really matters is when people analyze their data using public resources, because it turned out that many of the online web-based tools that carry out Gene Ontology analysis hadn't been updated for years. So we went on to investigate how that affects the results. It turns out that there's an elephant in the room, denoted by this big red bar. The height of the bar shows the number of citations of that particular software in 2015, two years ago. It seems that it captured the majority of citations in the pathway analysis space, yet its annotations were more than six years old at that point. That means that a lot of the knowledge from that tool that went into the scientific literature is likely a little bit out of date, or at worst invalid. And we looked into that more comprehensively by analyzing a list of cancer driver genes from a brain cancer, just to compare what it means for one study to have used annotations from five years ago and another study to have used very recent annotations.
And then this network visualization shows in purple the information that you get only from up-to-date pathway analysis, and in yellow the information that you would get from both the up-to-date version and the out-of-date version. So that's about four or five times more information when you use an up-to-date database compared to a popular yet outdated database. And in terms of cancer biology, this pathway analysis revealed some pathways that are now in clinical trials for cancer drugs, which weren't apparent in the earlier annotations because these pathway targets hadn't been discovered yet. So whatever you do when you do a pathway enrichment analysis, it's worth paying attention to when the information was last updated. Has DAVID been updated since? No, it was updated after the bioRxiv preprint went viral. So it's fair to say that DAVID has been updated, but you could speculate that some of that happened because people noticed that it was really out of date. Yeah, but there's a big problem — everybody knows there is a good solution to this; they probably just don't have the funding, and they would need to update it consistently. It's a big point, though, that there are very few funding sources for the maintenance of databases. Okay, so what are pathways? Everyone will have their own idea. Pathways are simplifications of molecular biology where we attempt to depict the mechanistic details of various systems — metabolic systems, signaling systems, other systems. Pathways are accurate to the level of detail that we know. They're curated by humans. We try to capture cause and effect, and perhaps the most powerful aspect is visualizing these different relationships in a human-interpretable way. However, pathways cover the genome more sparsely than gene sets do. It's easy to pile genes into sets, saying that someone somewhere found that these genes all work on the same function.
However, defining their relationships and the various kinds of molecular interactions is much more difficult. And also, once you have a pathway laid out as an interaction diagram, you need more complex statistical and mathematical ways of interpreting that data in order to say something about, say, the prevalence of cancer mutations. Pathways can be static models, or they can also be dynamic models, but in order to build dynamic models we really need good, high-resolution data. So here's a small example of a signaling pathway from the KEGG database. You'll see that it's immensely complex to understand. Given a set of mutations, you could start to figure out what an upstream mutation is doing to the downstream genes, perhaps, but you really need good molecular data to measure that. In the same samples, you'd like to measure maybe protein levels and mutations and RNA levels and all these other things that you often don't have access to. A good resource for these pathway diagrams is the Reactome database, which also provides several analytical tools to understand the statistical significance of your findings; these will be discussed tomorrow morning in another lecture. But here we are discussing the simplest types of tests, the gene set enrichment tests, which describe your pathway as a set of genes, describe your experimental list of genes, and aim to find whether pathway genes are more present in your experimental results than expected by random chance. So I already showed you this figure earlier: on the one hand we have some activity profiles, perhaps somatic mutations from cancer genomes, and on the other we have prior knowledge about gene sets — genes that are involved in various processes or pathways.
And using statistics, we can determine which of those prior-knowledge gene sets are actually overrepresented in our molecular information — activity profiles or somatic mutations or gene expression values and so on. So a typical enrichment test looks like this. On the left, in yellow, you'll see your experimentally derived positive genes. These may be genes that, for example, had a detectable amount of positive selection in your cancer genomes — they had more than the expected number of mutations. The bigger box in tan represents all the genes that you analyzed. Perhaps you were looking only at the protein-coding segment of the genome: you looked at 20,000 genes, and out of those, 100 showed more than the expected number of mutations. This defines your background and foreground sets — these are all the genes that you studied. Now you have a black box called the enrichment test, which takes the gene set databases as additional input. It will crunch through the different gene sets and decide that some of them are particularly representative of your experimental data — maybe spindle function and apoptosis were among them — and they come with significant p-values. You report the p-values, visualize the results and publish. And what does the p-value really tell you? One of the more common p-values we compute is a Fisher's exact test p-value, or a hypergeometric test p-value. This p-value assesses the probability that, by randomly sampling genes from the detectable set of genes, you would get an overlap with a particular pathway that is as large as the overlap you observed. To compute the Fisher's exact test, you essentially build a contingency table where you record, for every gene, whether it is present in the gene set, yes or no, and whether it is present in your significantly mutated list of genes, yes or no.
And you estimate the probability of seeing that overlap by chance using the hypergeometric distribution. If the gene set is more overrepresented than you would expect by chance, you can report that that pathway may have some significance in your experiment. One really important thing to pay attention to is the background set. The set of experimentally detectable genes is usually set to the entire set of protein-coding genes in any pathway tool, and that assumes you were actually able to measure every detectable gene in the genome. Perhaps this is the case in a standard sequencing study where you actively sequenced all the exons of all the genes. However, imagine a situation where you're looking at an older data set in which only 100 genes were sequenced — maybe a gene panel of very well-established cancer genes — and every patient in that set was sequenced only for those 100 genes, so the remaining almost 20,000 genes have no signal whatsoever. If you now run your statistical enrichment test using the entire set of coding genes as background, you will get very inflated enrichments, because any gene that comes out with a meaningful score can only be sampled from the 100 that comprise the original panel you sequenced. Another example: maybe you're doing a phosphoproteomic analysis, looking for the proteins that get phosphorylated under a particular cellular condition. It turns out that only about two thirds of proteins can ever get phosphorylated, and the other proteins are not phosphorylated at all. Therefore, if you do a pathway enrichment analysis that considers all the genes in the genome, you will naturally get an enrichment for phosphoproteins, because by experimental design these are the only ones that can ever get a signal. So when you do a pathway enrichment analysis, think about this a little.
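The hypergeometric test described above can be written out directly from its definition, using only the standard library. The numbers below (20,000 background genes, 100 hits, a 40-gene pathway, 8 hits in the pathway) are hypothetical, chosen to mirror the lecture's example; a real analysis would usually call a statistics library instead.

```python
from math import comb

def hypergeom_enrichment_p(k, K, n, N):
    """One-sided enrichment p-value: probability of drawing >= k pathway
    genes when sampling n hit genes from a background of N genes,
    K of which belong to the pathway (hypergeometric tail)."""
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# Hypothetical numbers: 20,000 detectable genes, 100 significantly
# mutated genes, a pathway of 40 genes, 8 of which are among the hits.
p = hypergeom_enrichment_p(8, 40, 100, 20000)
print(p)  # far below 0.05: the expected overlap by chance is only 0.2 genes

# The background matters: if only a 100-gene panel was sequenced,
# N should be 100, not 20000 — the p-value changes dramatically.
```

Note that shrinking `N` to the genes actually assayed (the panel, or the phosphorylatable proteins) is exactly the "conservative background" correction discussed in the lecture.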
Did all the genes in your genome have a chance to get a signal, given the experimental design? If not, then you need to select a more conservative pathway test where the background gene set is explicitly defined as the list of genes that were part of your experiment. So, multiple testing corrections. I'm sure that many of you have heard of them and maybe actively use them. They are essential in the era of genomics because we measure many things at the same time. We conduct hundreds of thousands of statistical tests, and therefore it's very likely that one of those tests will win the p-value lottery just by chance, and you'll get a result that looks meaningful but can be noise. Here's a visualization of what this Fisher's exact test looks like. Say we have a bowl of balls representing genes. Most of them are boring genes — they're red — and some of them are really interesting genes, say the known cancer driver genes — these are black. If you just take one random draw, you're quite likely to sample mostly red balls and none of the black ones. But because in genomics we carry out thousands of tests — for example, in pathway enrichment analysis we can easily look at 10,000 Gene Ontology terms — we sample and sample and sample, and ultimately, a few thousand draws later, we pull out of the bowl a handful of balls, or genes, that are all really interesting, all black ones. Because of that, we need to apply multiple testing correction. No p-value can be taken at face value, because if you conduct a series of tests, it's likely that at least some of them will look really significant even though there's no real signal. In essence, you can expect a random draw to show the observed enrichment once every 1/p draws: if you run a good number of tests, one of them will look very significant just by chance.
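The p-value lottery is easy to demonstrate with a small simulation: under the null hypothesis p-values are uniform on [0, 1], so with many tests, roughly a fraction alpha of them will look "significant" even when there is nothing to find. The test counts here are made up for illustration.

```python
import random

# Simulate 10,000 null tests (no real signal anywhere).
random.seed(0)
m, alpha = 10_000, 0.05

# Under the null hypothesis, each test's p-value is uniform on [0, 1].
null_pvalues = [random.random() for _ in range(m)]

hits = sum(p < alpha for p in null_pvalues)
print(hits)  # close to m * alpha = 500 "significant" results, all noise
```

This is the quantitative version of the bowl-of-balls story: run enough draws and you will fish out a handful of "interesting" results by chance alone, which is what the corrections below are designed to control.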
Another way of looking at it is not sampling the same bowl, but sampling different bowls every time: one time you sample black versus red genes, another time square versus round genes, another time apoptosis genes versus schizophrenia genes. If your experiment comprises a large number of those tests, you need to take more care in interpreting the p-values. The simplest p-value correction is the Bonferroni correction. It is very stringent, and it's also a little out of date, so it's here mostly for teaching purposes. If you have M tests, and each one of them provides a p-value, then you multiply each p-value by M — say 100 — to get the corrected p-value. This is very stringent: suddenly 0.001 becomes 0.1 and you can't report anything. The guarantees of this correction are also stringent: the corrected p-value is greater than or equal to the probability that at least one of the observed enrichments is due to random draws — at least one. You are controlling the family-wise error rate, which is a stringent form of correction. So, as we already discussed, you may actually have very nice-looking and valid results, but if you apply Bonferroni across a large number of tests, you are quite likely to wash out any important signal and be left with no significant results. Therefore, in genomic studies we are often willing to accept something a little weaker: we accept more false positives among our results, but we actually gain results, with that caveat in mind. This is called the false discovery rate, or FDR — I'm sure that everyone has encountered it. The FDR is the expected proportion of the observed enrichments that are due to random chance. So instead of controlling for even a single false positive, you say that, for example, 5% of the results could be wrong.
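The Bonferroni step described above is a one-liner; here is a minimal sketch using the lecture's own numbers (100 tests, a nominal p-value of 0.001).

```python
def bonferroni(pvalues):
    """Bonferroni correction: multiply each p-value by the number of
    tests, capping at 1. Controls the family-wise error rate."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# The lecture's example: 100 tests, one of which had p = 0.001.
pvals = [0.001] + [0.5] * 99
corrected = bonferroni(pvals)
print(corrected[0])  # 0.001 * 100 = 0.1, no longer significant at 0.05
```

The cap at 1 matters: a p-value is a probability, so 0.5 times 100 must not be reported as 50.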
So if you have 100 results, maybe five of them will be wrong. Typically, the FDR correction is calculated using the Benjamini-Hochberg procedure, which is the standard in genomics, though there are many variations of the procedure for various situations. Let's walk through an example. This is a pathway enrichment analysis where the nominal p-values coming out of a Fisher's exact test give you the probability of seeing each observation by random chance, and the pathways are ranked by significance, starting from 0.001 all the way to 0.99. The Benjamini-Hochberg correction works as follows. First, you multiply each p-value by the total number of tests divided by its rank in the list. There were 53 tests altogether in this example, so the first p-value is multiplied by 53 over one, the second by 53 over two, and so on — they get adjusted, similarly to the adjustment in Bonferroni. Then the q-value — the final corrected p-value, or FDR value — corresponding to a nominal p-value is the smallest adjusted p-value assigned to p-values with the same or higher rank. So it becomes like a staircase: instead of a strictly increasing series of values, a run of pathways shares the same FDR value, then further down another run of pathways shares a value, and so on down the list. Now, based on that correction, you filter at the 5% threshold and see that only the first four remain significant, while the rest are corrected away. Compare that to the first column, where all of the visible pathways had a significant nominal p-value: the corrected p-value is much more conservative. So as you see, when you're really interested in getting significant p-values but you still run a large number of tests, you can't get around it.
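The rank-adjustment and staircase steps described above can be sketched directly. The eight p-values below are invented for illustration, not the 53-test slide example.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg q-values: adjust each p-value by m/rank,
    then enforce monotonicity from the bottom up (the 'staircase'):
    each q-value is the smallest adjusted value at its rank or below."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ranks by p-value
    q = [0.0] * m
    running_min = 1.0
    for steps_from_end, i in enumerate(reversed(order)):
        rank = m - steps_from_end            # 1-based rank of p-value i
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical nominal p-values, already sorted for readability:
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print([round(v, 3) for v in benjamini_hochberg(pvals)])
# [0.008, 0.032, 0.067, 0.067, 0.067, 0.08, 0.085, 0.205]
```

Note the run of identical q-values (0.067) in the middle: that is the staircase, where the smallest downstream adjusted value propagates upward.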
The more tests you run, the more conservative you have to be about each individual p-value. So one way to get around this is simply to run fewer tests. You may want to look at the set of hypotheses you're testing a little more carefully, filter out things that you think are low confidence, and then you'll naturally have fewer tests to go through, and each p-value will carry more weight on its own. One way of doing this is to filter your input data. For example, if you are looking at gene expression data from RNA sequencing, people quite often filter out genes that are expressed at a very low level: because their expression is so low, you don't have good confidence whether they were truly expressed or whether it was noise, so you can filter them out. In pathway enrichment analysis, you may want to take a hard look at the pathways you're analyzing. For example, there may be thousands of pathways that are associated with just one gene because they are very specific, and you may want to ignore all these very small pathways and only focus on the larger ones. That will also dramatically reduce your false discovery corrections. Finally, once you have detected your interesting enriched pathways, almost as essential is the visualization of that data, because as we discussed earlier, there's a lot of redundancy: mostly because genes are involved in different pathways, but also because pathways are given different names at different levels of specificity. One way of dealing with that is to build an enrichment map. The problem we already discussed: there are many gene sets, and there are different definitions of pathways. Moreover, when you analyze, say, two pathway databases at the same time, such as Gene Ontology and Reactome, they will overlap to a large extent. So here's an example.
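The size-based filtering suggested above can be sketched like this. The gene-set collection and the size cutoffs here are hypothetical; in practice, tools often default to something like a minimum of 10–15 genes and a maximum of a few hundred.

```python
# Hypothetical gene-set collection: pathway name -> member genes.
gene_sets = {
    "apoptosis":  {"TP53", "CASP3", "BAX", "BCL2", "CASP9"},
    "tiny_set":   {"GENE1"},                        # one gene: too specific
    "huge_set":   {f"G{i}" for i in range(600)},    # too broad to interpret
}

def filter_gene_sets(gene_sets, min_size=5, max_size=500):
    """Keep only pathways in a useful size range before testing.
    Fewer tests means a lighter multiple-testing burden."""
    return {name: genes for name, genes in gene_sets.items()
            if min_size <= len(genes) <= max_size}

filtered = filter_gene_sets(gene_sets)  # keeps only "apoptosis"
```

Dropping the one-gene and 600-gene sets before running the enrichment tests means those tests never enter the FDR calculation at all.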
You started by analyzing 100 genes and you thought you could boil them down to three pathways, but it turns out you boil them down to 500 pathways. What do you do? You can rank them by p-value and take the top 10 pathways, but because they're all redundant, it can turn out that the top 10 pathways represent exactly the same thing, something that for some reason is quite strongly represented in your data, while the more interesting and non-redundant results come further down the list. The way we often visualize these pathway enrichment results is called an enrichment map, developed here at the U of T in the lab of Gary Bader. It's a network visualization, but as opposed to a classical network where genes are nodes connected to other genes through edges, this is a network of pathways. Each node represents a particular pathway, and it's connected to another node, another pathway, if those two pathways share many genes. So if pathway A and pathway B share many genes, then maybe they are also functionally related, or they're doing the same thing in cells, or maybe they're just two different descriptions of the same biological phenomenon. One example of using pathway analysis together with enrichment maps in a nice way comes from a paper on ependymoma. Ependymoma is a pediatric and adult nervous system cancer of the brain and also the spine. Usually pathology is the primary means of understanding these tumors, which means microscopes and tissue slides and so on. However, there are now new methylation-based biomarkers developed for classification. And when you analyze these methylation patterns and gene expression patterns, it turns out that ependymoma is not a single disease but comprises multiple different subtypes, which have distinct molecular alterations, clinical features, age characteristics, sex characteristics, and so on.
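The core idea of an enrichment map, connecting two pathway nodes when their gene sets share many genes, can be sketched with a simple overlap score. This is an illustrative sketch only; the actual EnrichmentMap app uses its own combination of Jaccard and overlap coefficients, and the cutoff here is an assumption.

```python
def overlap_coefficient(a, b):
    """Overlap coefficient between two gene sets:
    |A intersect B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

def enrichment_map_edges(gene_sets, cutoff=0.5):
    """Connect two pathways with an edge when they share
    enough genes to clear the cutoff."""
    names = sorted(gene_sets)
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = overlap_coefficient(gene_sets[names[i]],
                                        gene_sets[names[j]])
            if score >= cutoff:
                edges.append((names[i], names[j]))
    return edges

# Two hypothetical pathways sharing 3 of 4 genes get an edge;
# the unrelated third pathway stays disconnected.
sets = {"A": {"g1", "g2", "g3", "g4"},
        "B": {"g1", "g2", "g3", "g5"},
        "C": {"g9", "g10"}}
edges = enrichment_map_edges(sets)
```

In the resulting network, clusters of densely connected nodes correspond to redundant descriptions of the same biological theme, which is exactly what we want to collapse visually.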
What we did in this analysis was to understand the biological underpinnings of these different subtypes using pathway enrichment analysis and the network visualization shown here. In addition to displaying the various pathways as nodes, networks, and clusters of pathways, we also used color annotation: different colors correspond to the different cancer subtypes. If a node has multiple colors assigned to it, then that pathway is highly representative of multiple subtypes of ependymoma. So where we used to have nine lists of genes, each with hundreds or thousands of genes, we now have dozens of biological themes that represent those nine lists. We're going from individual genes to a more biological interpretation using textbook knowledge rather than alphabet soups of gene symbols. Okay, so in the final chapter of this lecture, I'd just like to introduce a little bit of the Cytoscape software and what it takes to build a network. A network is ultimately a set of relationships, right? And you can build a network from a very simple two-column table, where the first column is node number one, or gene number one, and the second column is node number two, or gene number two. If they occur in the same pair, then they are interacting, and you can generally assign a weight to any of these edges. There are various ways of visualizing a network; you can also visualize the same network using a heat map, such as the one shown on the left. Cytoscape is a freely available, open-source, Java-based application. In the tutorial that follows, we will use Cytoscape to build a network of pathways, but you can build a network of almost anything. One recent example is a network of wine and cheese pairings that was published by my former mentor Gary Bader just before the Christmas season. So it's a very powerful tool for visualizing knowledge and interactions.
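The two-column table just described is the simplest possible network input format. As a sketch (the gene pairs and column names here are hypothetical), here is how such a table maps onto an adjacency structure:

```python
import csv
import io

# Hypothetical two-column interaction table, one edge per row,
# in the format described above.
table = """geneA,geneB
TP53,MDM2
TP53,CDKN1A
MDM2,MDM4
"""

def read_network(text):
    """Build an undirected adjacency dict from a two-column
    edge table: each row says its two genes interact."""
    adjacency = {}
    for row in csv.DictReader(io.StringIO(text)):
        adjacency.setdefault(row["geneA"], set()).add(row["geneB"])
        adjacency.setdefault(row["geneB"], set()).add(row["geneA"])
    return adjacency

net = read_network(table)  # e.g. TP53 connects to MDM2 and CDKN1A
```

A third numeric column could carry the edge weight mentioned above; Cytoscape accepts exactly this kind of delimited table as network input.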
Some key ideas about network visualization: when you first build a network, and it's a large network, it will likely look like a hairball, very difficult to interpret. So there's a whole battery of network layout algorithms that arrange the network in two-dimensional, and sometimes three-dimensional, space, and it becomes much more organized depending on which algorithm you use. The other thing Cytoscape is really powerful at is assigning visual attributes to your network. As we saw earlier, we have pairs of nodes, or genes, and their optional weights, and you can immediately start to map these features onto the visualization. For example, thicker edges can mean a stronger interaction between genes; a larger node can mean that gene is somehow more important or more central to the network, and so on. There are probably several dozen features you can assign to nodes or edges, and the result is visually very appealing. Here's an example of a network layout before and after: when you first load your network into Cytoscape, it looks like a hairball that is difficult to interpret, but once you lay it out, it starts to show clusters. Maybe all the blue genes interact with one another far more often than with genes of other colors, and you can start to build hypotheses just by visualizing and looking at the data from various angles. Loading visual attributes into the network is very easy in Cytoscape; it's quite intuitive and very powerful. The input is often just a spreadsheet, and you can use the information in that spreadsheet in various ways: it can be numeric, or it can consist of discrete classes, and you can define gradients to color your various visual attributes. So that completes the lecture part.
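The continuous mapping from edge weights to edge thickness described above is just a linear rescaling. As a rough sketch of what such a visual mapping computes (the width range is an arbitrary assumption; Cytoscape lets you configure it interactively):

```python
def edge_widths(weights, min_width=1.0, max_width=8.0):
    """Linearly map interaction weights onto edge widths,
    mimicking a continuous visual mapping: the weakest edge
    gets min_width, the strongest gets max_width."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights equal: no contrast to show.
        return [min_width] * len(weights)
    scale = (max_width - min_width) / (hi - lo)
    return [min_width + (w - lo) * scale for w in weights]

widths = edge_widths([1.0, 2.0, 3.0])  # weakest -> 1.0, strongest -> 8.0
```

The same rescaling idea applies to node size, color gradients, or any other numeric visual attribute.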