examples of where we've successfully applied pathway and network analysis to real data and real projects, so you can get a sense of some of the things you can do with the topics we're going to discuss today. As I hope everybody saw in the intro video that we made available, all these types of analyses usually start with some genomics data, a screen or a gene expression study. The first thing that's interesting is that the screen or the experiment worked and produced a lot of results, but then the question comes up: now what? How do I interpret this data? With genomics data, so much is being produced that it's a challenge to interpret, and the trend is that the data keeps getting larger. In single-cell transcriptomics, for example, a single experiment provides information like 5,000 RNA-seq experiments in one. So what we want to know is: what's interesting, what's novel, what new discoveries did I make, what did I learn from this data? One of the first things we normally ask is what's interesting about the set of genes, or other entities such as metabolites or proteins, that comes out of a large-scale omics analysis. Usually, after you rank the data or cluster it somehow and get a gene list, one of the main ways to answer this question is to find out whether the genes are enriched in known pathways, complexes, or functions. So can we learn anything about cellular mechanism from known information, just by automatically cross-referencing it? Using automated methods saves time compared to the traditional approach, where you'd have to go through the genes one by one and do a literature search on each. You usually still have to do some of that, but the tools we're learning in this class help you do it much faster.
So pathway and network analysis, in my view, is any type of analysis that helps you gain mechanistic insight into omics data. That might be identifying a master regulator or drug targets, characterizing pathways that are active in a sample, inferring a regulatory network or a broader network, looking at phosphorylation sites, any type of mechanistic insight. It's also any type of analysis that involves pathway or network information, and we'll talk more about that later. As I mentioned, it's the most common thing to do once you have a large set of genes from an omics data set. The most popular type is pathway enrichment analysis, but many others are useful, and we'll talk about those as well. So, on to some examples. The first example I've selected is a project that we worked on more than 10 years ago, on autism spectrum disorder. Autism is a genetic disease that's known from twin studies to be highly heritable. At the time the study was started, only a small amount of the heritability could be explained by individual gene mutations that were linked to rare single-gene disorders. De novo copy number variants had also been reported in a number of cases, so there was some thinking that de novo copy number variants were important in this disease. So Stephen Scherer, who studies autism genetics at the Hospital for Sick Children in Toronto, initiated a study to map rare copy number variants in autism spectrum disorder, with roughly 1,000 cases and 1,000 controls, and processed the data to generate the highest-quality set of rare copy number variants.
These were variants with a frequency of less than 1%. The team focused on rare copy number variants because it's unlikely that common variants would explain this rarer disorder, and they had previous evidence that rare copy number variants were one of the important causes. They found a lot of copy number variants, and they found in general that autism spectrum disorder cases carried more copy number variants, especially de novo copy number variants, which are not inherited from the parents. But there weren't many genes linked to copy number variants that were seen repeatedly. It looked like there were different copy number variants in many of the different individuals. So we did a pathway enrichment analysis on this data. The variants are all rare, but some are rarer than others; some are present in just one individual sample and not seen in any other. We took all of that data, all the genes associated with all the copy number variants, mapped them to pathways, and then asked the question: given a pathway that is defined as a set of genes, is the pathway associated with the cases more than we'd expect compared to the controls? The way we figured that out was to look at the number of cases that were affected in a gene that was part of the pathway. Let's say we have a kinase regulation pathway with 100 genes in it. We asked how many cases have a copy number variant affecting any of the 100 genes in this pathway, and how many controls. So we get a number for each; let's say 10 cases are affected and two controls are affected.
Then we shuffled the case and control labels thousands of times and kept asking the same question, to understand what level of difference between cases and controls we'd expect by chance. From that we computed a p-value and a false discovery rate, and that's what this color represents. Each circle here represents a pathway that's enriched in the autism cases compared to controls. The colors represent the false discovery rate that I mentioned, with lower false discovery rates being more significant and colored more red, and the green lines connect pathways that share genes. You can think of this as pathway crosstalk, or just redundancy in the pathway databases. So all these circles represent all the pathways that were more enriched in autism cases compared to controls. As you can see, a lot of interesting pathways came up, and a lot more information than we had from just looking at the individual genes. A number of these pathways were not previously known to be connected to autism. So one of the things we did was collect all the autism genes and all the intellectual disability genes that we knew about, and we did a pathway enrichment analysis on those as well. That's what these triangles here represent, and this parallelogram here.
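The label-shuffling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's actual pipeline: the function names, the toy gene labels (`G1` to `G5`, `OTHER`), the cohort size of 50 cases and 50 controls, and the one-sided "cases minus controls" statistic are all made up for the example. The 10-versus-2 counts follow the hypothetical numbers used in the talk. The false discovery rate step, which would be applied across all tested pathways, is not shown.

```python
import random

def count_affected(samples, pathway_genes):
    """Number of samples with at least one affected gene in the pathway."""
    return sum(1 for genes in samples if genes & pathway_genes)

def permutation_p_value(case_genes, control_genes, pathway_genes,
                        n_perm=2000, seed=0):
    """Empirical p-value for 'more cases than controls hit in this pathway'.

    case_genes / control_genes: one set of affected genes per individual.
    The statistic (hypothetical choice) is affected cases minus affected controls.
    """
    rng = random.Random(seed)
    observed = (count_affected(case_genes, pathway_genes)
                - count_affected(control_genes, pathway_genes))
    pooled = list(case_genes) + list(control_genes)
    n_cases = len(case_genes)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffling individuals is the same as shuffling labels
        diff = (count_affected(pooled[:n_cases], pathway_genes)
                - count_affected(pooled[n_cases:], pathway_genes))
        if diff >= observed:
            as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)

# Toy data: 10 of 50 cases and 2 of 50 controls carry a variant in the pathway.
pathway = {"G1", "G2", "G3", "G4", "G5"}
cases = [{f"G{i % 5 + 1}"} for i in range(10)] + [{"OTHER"} for _ in range(40)]
controls = [{"G2"}, {"G4"}] + [{"OTHER"} for _ in range(48)]

observed, p_value = permutation_p_value(cases, controls, pathway)
print(observed, p_value)
```

Note that each affected case in the toy data carries a different gene from the pathway, so no single gene recurs often, yet the pathway-level count still separates cases from controls.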
So all these pathways are known to be enriched in autism, and all these triangles are known to be enriched in intellectual disability, and we used these connection lines to show how many genes are common between the pathways. What we found was that quite a few pathways, even though they weren't necessarily mutated in our case-control study in a known intellectual disability gene, were related to pathways involved in intellectual disability and autism, in that they have genes in common. So this gave us a whole bunch of insight into the mechanism of autism spectrum disorder that we didn't have before and that was not possible to get from just looking at individual genes. Just to emphasize one of the interesting things: when we looked at an individual pathway to find out which genes were linked to it and how many samples they were affected in, we found that many of the genes linked to a pathway were seen in only one individual. So we had a lot of genes from the pathway spread out across individuals. If you took, let's say, 10 genes from a pathway, they might be mutated in 10 different individuals. By looking at the genes, you wouldn't be able to see any repeated pattern, but by looking at the pathways, you could, and I'll talk about this more in a bit. Okay, so the second project is ependymoma pathway analysis.
Ependymoma is a pediatric brain cancer, the third most common type of brain cancer in children. The most common and morbid location for it in childhood is the posterior fossa, which is at the back of the head, around the brainstem and the cerebellum. The tumor can occur in many different parts of the brain, and people had previously judged how serious it was based on the anatomy of where it occurs: if it occurred at the back of the brain, people knew it was the most dangerous type. Looking specifically at this posterior fossa anatomical location, Michael Taylor, a neurosurgeon at Sick Kids Hospital, again in Toronto, who led this project, had previously identified two subtypes of this disease based on clustering of gene expression data. The A subtype, or posterior fossa A, affects the youngest individuals and has a terrible outcome, and posterior fossa B affects the oldest individuals and has an excellent outcome. So even though anatomically people lumped together all cases that affected the back of the brain and said this is going to be a serious cancer, Michael found that there's a subtype that actually has an excellent outcome and maybe shouldn't be treated as aggressively as the ones that have a terrible outcome. This is important because there is no targeted treatment for ependymoma, only radiation and surgery, and radiation and surgery targeted to the brain are devastating for children. Anybody who undergoes this treatment and survives will not have a good quality of life for the rest of their life. So the goal of the field is really to avoid radiation and surgery as much as possible and to come up with more targeted drugs.
So even just identifying a subtype that has an excellent outcome might help with this, but Michael wanted to learn more about the disease and, through whole-genome sequencing and exome sequencing, was looking for mutations that might identify a cause. Unfortunately, we discovered almost no mutations; there were maybe two or three that were repeated between any of the samples. This is very unusual for cancer, because cancer is thought to be a disease of genome instability, filled with mutations, but it's not totally surprising given that this is a pediatric cancer. It's true that cancer is a disease of many mutations, but most cancers are adult cancers. We know that mutations build up over time in our bodies, so the number of mutations in any cell in your body is correlated with age, and when you're younger you have fewer mutations in general. Other pediatric tumors have also been shown to have very few mutations. So unfortunately this still didn't give us any more insight into the disease. But Michael looked at methylation data and found that DNA methylation was able to cluster the data into these two subtypes perfectly. That suggests another potential mechanistic insight, which is that DNA methylation or epigenomic processes might be important as a cause of this disease. In particular, the A type, posterior fossa A, was found to be more transcriptionally silenced by CpG island DNA methylation. This affected about 2,000 genes, and standard pathway enrichment analysis on these 2,000 genes didn't pull up any identifying pathway.
We used a larger pathway database, which we discuss in this course, and a more appropriate statistical test that was more sensitive for this sort of data, which I'll mention in a bit. That was able to identify a very strong signal of enrichment for a particular set of mechanisms connected to the Polycomb repressive complex 2, or PRC2. This bar plot shows the significance of enrichment. The numbers are on a log scale that transforms the p-value so that the bigger the number, the more significant the p-value. Basically, all of these pathways here were significant in the group A tumors, and there was nothing really significant in the group B tumors. And all of these pathways are really related. SUZ12 and EED are subunits of the PRC2 complex, so this basically means it's all the same thing. It's just saying that the places on the genome where PRC2 is known to bind or target were very enriched: there was a big overlap between those known targets and the 2,000 genes we saw differentially methylated. PRC2 is known to methylate histones, and then DNA is methylated. It's interesting in this case because it was the first mechanistic insight into this disease, that maybe this complex was somehow causing it. It's also interesting because the methylation enzymes involved had already been studied, and people have tool compounds and drugs that target this process, and these preferentially killed cells in cell lines and in the mouse model for this disease. And again, as I mentioned before, there was no known targeted treatment for this disease, so it could only be treated with radiation and surgery.
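To make the idea of "a big overlap between known targets and the 2,000 genes" and the log-scaled p-values on the bar plot concrete, here is a minimal sketch. It uses a one-sided hypergeometric test, which is a common choice for overlap enrichment but is an assumption here: the talk does not say which test was actually used for this analysis. The function names and all numbers other than the roughly 2,000-gene list (the 20,000-gene universe, 500 known targets, overlap of 120) are hypothetical.

```python
from math import comb, log10

def hypergeom_enrichment_p(n_universe, n_set, n_list, n_overlap):
    """P(overlap >= n_overlap) when n_list genes are drawn at random from a
    universe of n_universe genes, of which n_set belong to the gene set."""
    total = comb(n_universe, n_list)
    upper = min(n_set, n_list)
    tail = sum(comb(n_set, k) * comb(n_universe - n_set, n_list - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

def neg_log10(p):
    """Transform a p-value so that bigger numbers mean more significant."""
    return -log10(p)

# Hypothetical numbers: 20,000 genes in the universe, 500 known targets of a
# complex, a 2,000-gene list, and 120 genes in common (50 expected by chance).
p = hypergeom_enrichment_p(20000, 500, 2000, 120)
print(p, neg_log10(p))
```

Plotting `neg_log10(p)` for each gene set gives the kind of bar plot described, where a taller bar means stronger enrichment.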
When we found out that this Polycomb complex, PRC2, might be involved, that there were known drugs, and that they specifically killed these cells, the physicians involved immediately searched for a drug that might be available to treat patients. They found one called azacytidine, also called Vidaza, which is just a general anti-DNA-methylation drug. And they were actually able to try it on a patient on compassionate grounds. This was an individual at the Hospital for Sick Children whose ependymoma had metastasized to the lung, and this is showing the tumor here. After two months, it had doubled in size, and unfortunately there was nothing left to do for this patient. So on compassionate grounds, they said, let's try this on-the-market drug. The drug had been made for a type of blood disorder and had never been tried in any kind of neurological disorder. One course of treatment actually stopped the tumor from growing, and the effect lasted for 15 months before it started growing again. The patient regained their energy and was able to leave the hospital. So it was actually an amazing result that we were able to go from a genomic study, where nothing was really known about the disease, to a mechanism, a drug, and trying it in a patient within about two years. And for the past few years this drug has been part of a clinical trial to test its efficacy more generally. Okay, so that's my best example of how pathway enrichment analysis, at least in a project that we worked on, was successful in moving very quickly through that process. This is another example, just to show you different types of visualization that we can do with this type of data. This is molecular classification of ependymal tumors, so this is, again, ependymoma, but not just focusing on posterior fossa A ependymoma.
It turns out there are nine different subtypes, not just A and B, but a whole bunch of others, supratentorial and others, mostly based on their anatomical region. We did pathway enrichment analysis based on gene expression data for all of these tumors and visualized them like this. Again, the circles here represent pathways, each of which is basically a set of genes, and the lines between them represent overlap between the pathway gene sets. So these pathways have genes in common, they're all grouped together, and then we label them in this way, and we've colored them here based on which subtypes of ependymoma they occur in. As you can see, some pathways are very specific to particular subtypes. This ion homeostasis pathway is only present in this one cyan subtype, same thing with this one, here are some pathways specific to another subtype, and these pathways for neuron development are present in almost all the subtypes. So displaying it like this gives a very nice overview of the biology of all of these different subtypes. I should have mentioned, feel free to interrupt if you have any questions. I can't see the Slack, unfortunately, but I'll try and turn on the chat here. And I know we're saying go to the Slack, but if you have a question and you want to... We actually turned off the Zoom chat, so don't try to look for it. But definitely interrupt Gary if you need to, if you have a question. Yeah, and he's very generous with the time, so we're happy to support that. Yeah. Yeah, so sorry, I can't quickly get Slack up at the same time I'm presenting. That's quite right, yeah. That's why we have the rest of us to help you. If we see something that deserves an interruption, we will definitely interrupt you. Okay. Great. Thanks. Okay. So here's a fourth example.
This is an example based on single-cell RNA sequencing of five healthy livers. This is a typical plot from a single-cell transcriptomics experiment, where each dot represents a cell and the cells are organized in this map. In this case, it's a t-SNE map. The important point is that the cells are organized so that cells with similar gene expression profiles are grouped together, and then clustered and colored. It turns out that when you do that, you identify a whole bunch of cell types. In this particular map, we identified 20 different cell types that were mostly known, although one of them, these inflammatory macrophages, was newly discovered in the liver through this subtyping. People didn't know that there were two types of macrophages. This is based on over 8,000 single cells. One of the questions we had was about the hepatocytes: we found a whole bunch of different clusters of them, and we didn't really know what to make of this. We know that there are some anatomical gradients in the liver with hepatocytes, but the question is, do these hepatocytes have some kind of specialized function? So we did a pathway enrichment analysis on that. The result is similar to the other plots that I showed, which we call enrichment maps and which we'll be talking about later today during this workshop. Again, it's the same representation, with pathway gene sets represented as circles, grouped by redundancy and labeled according to their pathway name. I don't have the cluster numbers here, but all of these clusters are hepatocytes. We ordered them based on where they occur anatomically, and then we showed the pathways that were enriched in each set.
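The representation described here, with gene sets as circles and lines between sets that share genes, can be sketched as follows. This is a simplified illustration only: the Jaccard index, the 0.25 threshold, the function names, and the toy gene sets are assumptions made for the example and are not taken from the talk or from any particular tool's defaults.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared genes divided by all genes in either set (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def enrichment_map_edges(gene_sets, threshold=0.25):
    """Return (pathway_1, pathway_2, similarity) for pairs of gene sets whose
    overlap is at least `threshold`. Pathways are the nodes of the map."""
    edges = []
    for (name1, set1), (name2, set2) in combinations(sorted(gene_sets.items()), 2):
        similarity = jaccard(set1, set2)
        if similarity >= threshold:
            edges.append((name1, name2, similarity))
    return edges

# Toy gene sets with made-up gene labels.
gene_sets = {
    "neuron development": {"A", "B", "C", "D"},
    "axon guidance": {"B", "C", "D", "E"},
    "ion homeostasis": {"X", "Y", "Z"},
}

edges = enrichment_map_edges(gene_sets)
print(edges)  # [('axon guidance', 'neuron development', 0.6)]
```

Pathways connected by edges end up grouped together in the layout, which is what lets redundant or closely related gene sets be read as one theme.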
There's some overlap here, so these boxes overlap a bit. But there's also a lot of specialization, which was interesting. This showed us that the hepatocyte clusters probably have different cellular functions. Some of them had more of the metabolic functions that are known to be related to hepatocytes, but these functions were spread over the clusters. So that was a nice visualization. I'm going to give you one more example, which gets a little more into why pathway analysis is useful. This is the case of a genome-wide association study. Here I'm not using a specific example; I'm just giving you a toy data set. Let's imagine we're doing a genome-wide association study and we have genotypes for five cases and five controls, listed here. In a genome-wide association study, you map mutations in all of your cases and controls and look for repeated patterns that are associated only with the cases or only with the controls. In an ideal situation, you might have SNP A present in all the cases and none of the controls, so it's perfectly associated with cases. And you might have SNP D present in all the controls and none of the cases, so it's perfectly associated with controls. This would get a perfect p-value, saying it's perfectly associated with either cases or controls. The reality is that when we look at real GWAS data, it's much more often like this, where each of the mutations is present in a different case. This is more similar to the autism study, where we didn't see the copy number variants occurring over and over again in the same position across the cases very much. Instead, we saw the mutations spread out all over the genome.
If you were using traditional GWAS statistics, the way we'd assess the p-values in the ideal situation, which would be with a chi-square test or a Fisher's exact test, then you wouldn't be able to do anything with this more realistic situation, because for each SNP the cases would have only one count and the controls would have zero counts. There would be no statistically significant association of any of these mutations with cases or controls. So that's a problem. It basically means that this data would traditionally be thought of as not successful. However, suppose you looked at the same data using a pathway analysis view. You take all those SNPs, A through F, and you recognize that they're part of the same pathway; maybe the pathway is called apoptosis. Now you can collapse them all, so that you can say that all the cases are affected in the apoptosis pathway and none of the controls are affected in the apoptosis pathway. I realize that I have some ones here, and I shouldn't have put those here, so let's say those are all zeroes, sorry. So in this case all of the cases are affected in apoptosis and none of the controls are. Now we do the exact same statistical test, like a chi-square test, which measures the chance of having five cases affected versus zero controls, and we would get, again, a perfect association. So what happened here? We were able to increase our statistical power by doing two things. One is aggregating the counts; that's the main thing we did. We took all of these individual counts and combined them into one count, and that makes the signal stronger.
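The effect of aggregating counts can be checked with a small worked example. The sketch below uses a one-sided Fisher's exact test written with the standard library; the genotype layout (which case carries which SNP) is an assumed toy arrangement in the spirit of the slide, not the slide's actual table, and the function name is made up for the example.

```python
from math import comb

def fisher_one_sided(cases_hit, n_cases, controls_hit, n_controls):
    """One-sided Fisher's exact test: probability of seeing at least
    `cases_hit` affected cases, given the total number of affected individuals."""
    total = n_cases + n_controls
    hits = cases_hit + controls_hit
    upper = min(hits, n_cases)
    tail = sum(comb(n_cases, k) * comb(n_controls, hits - k)
               for k in range(cases_hit, upper + 1))
    return tail / comb(total, hits)

n_cases = n_controls = 5

# Assumed toy layout: each SNP is carried by a single case and by no control.
snp_carriers_cases = {"A": {0}, "B": {1}, "C": {2}, "D": {3}, "E": {4}, "F": {0}}

# SNP level: one affected case versus zero affected controls, for every SNP.
snp_p = {snp: fisher_one_sided(len(carriers), n_cases, 0, n_controls)
         for snp, carriers in snp_carriers_cases.items()}

# Pathway level: a case counts as affected if it carries any SNP in the pathway.
affected_cases = set().union(*snp_carriers_cases.values())
pathway_p = fisher_one_sided(len(affected_cases), n_cases, 0, n_controls)

print(snp_p["A"])   # 0.5
print(pathway_p)    # 1/252, about 0.004
```

No single SNP can reach significance with one carrier out of ten individuals, while the same data collapsed to the pathway gives five affected cases versus zero affected controls.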
The second thing, which I didn't really explain, is that we have six SNPs here to test, so we have to do six tests. If we do six tests, there's a chance that one of them looks significant just by chance, and you have to correct for that with multiple testing correction, which we'll talk about later. When we move to the pathway level, there's only one test. So we don't have to correct as much when we're working in the space of pathways, just because there are typically fewer pathways than genes or, in this case, SNPs. Another useful thing we gained by converting the data into a pathway view is that we now know that apoptosis might be related to this disease somehow. That means we've gained some potential mechanistic understanding, or at least generated a mechanistic hypothesis, for instance that apoptosis relates to the case phenotype, that we didn't have before. If we had wanted to find that otherwise, we would have had to take whatever signal we found from the SNPs, which was not even possible in this case, look at which genes the SNPs might be affecting, and then which pathways those genes would be affecting. This sort of automatically gives you the mechanistic hypothesis. Okay, so I think the last example is another theoretical example, again just to illustrate some of the analysis we'll be learning about in the class today. It's the case of a gene expression or transcriptomics experiment where we have a set of differentially expressed genes.
Let's imagine we have a thousand differentially expressed genes between samples. In this heat map view, the columns represent samples, the rows represent genes, and the colors represent the strength of the differential expression. Usually you have cases compared to controls, so you might identify genes that are more expressed in your experimental condition of interest than in the controls, and those will be in, let's say, red. So the more red they are, the more they're expressed in the cases, and the more blue they are, the more they're expressed in the controls. So we have a set of genes that's expressed more in cases and a set that's expressed more in controls, about a thousand of them, and we want to know what this tells us about the experimental condition we're studying. We can do pathway enrichment analysis on that, and that can tell us something like what I've shown you. But another type of analysis we can do is master regulator analysis. Here we take known sets of transcription factors or microRNAs, or whatever master regulator we're interested in, along with a database of their known targets, and we test whether those targets overlap the genes in our list in a statistically significant way. If they are enriched, then maybe this transcription factor is an explanation for why these genes are differentially expressed. That might identify a regulator that's important, and then we could test that by, for instance, perturbing the regulator in an experimental model and checking whether we get the same phenotype, or whether our phenotype is reversed or otherwise affected. So again, it's focused on helping us gain mechanistic insight into our data, mostly through hypothesis generation. Okay, just to summarize the benefits of pathway analysis versus transcripts, proteins, and SNPs.
Pathways are typically easier to interpret because they work with familiar concepts like apoptosis or the cell cycle, compared to SNPs, for instance, or genes. Pathway analysis helps identify possible causal mechanisms. It can be used to predict new roles for genes, so we might identify genes that are linked to a disease, for instance, and it improves statistical power in the way that I explained. Another useful thing people have found is that it tends to be more reproducible in general. For instance, say you have two different cohorts, two different data sets: two different people did an experiment on the same experimental condition. Let's say it's a transcriptomics experiment, so each person collected 50 samples, measured transcriptomics data on each of them, and then tried to identify a set of differentially expressed genes that might be used as a biomarker, for instance, to predict cases versus controls. What people have found is that those biomarkers tend not to be reproducible across studies, for various reasons. The main one is that there are lots of confounding factors that are not known or not possible to control perfectly in a genomics study. There was a lot of excitement at the beginning of genomics and proteomics that we would identify these biomarkers and create a new precision medicine that would revolutionize all of our treatment and diagnosis. And that really happened a little bit, but not as easily as people thought in the beginning, and the main reason was that these biomarkers were not reproducible. However, when mapped to pathways, they tend to be more reproducible, because when you look at the genes involved in these biomarkers, even though they weren't necessarily the same genes, frequently they were affecting the same pathways. So pathway-based biomarkers have been shown to be somewhat more reproducible than gene-expression-based biomarkers in that case.
Another thing that's useful is that you can integrate multiple different samples and data types. Say I do a pathway enrichment analysis on proteomics data, gene expression data, and metabolomics data. All of the results are represented as pathways, and it's the same set of pathways, so I can compare them. Think of the pie chart view that I showed you for ependymoma. You could make a pie chart view that is not just for different subtypes of ependymoma, but for, say, gene expression, protein expression, and other modes of omics analysis on your data. Okay, oops. So the typical pathway analysis workflow that we'll be focusing on in this course starts with omics data collection. We don't cover how to normalize and score that, but typically the normalization and scoring methods are standard for the type of data you're working with, and almost all of them will generate a gene list. Then the goal with pathway and network analysis is to learn about the underlying cellular mechanism, and we'll be talking more about that during the class. There's a much bigger flow chart that covers things in more detail, and we'll go through it in the course. In particular, the workshop will cover pathway enrichment analysis, which is useful for summarizing and comparing data. I focused mostly on pathway enrichment analysis in my intro because that's the most commonly applied technique by far. It's usually the first thing that people do with a gene list. We're also going to talk about network analysis. I didn't include a slide here on the difference between pathways and networks, but I think it was in the intro video that everybody watched.
But the idea here is that pathways represent models of biological processes that are developed over time through many studies, and they're usually described in a mechanistic, stepwise fashion that might include reactions or regulation events. I think everyone knows what a biological pathway is; there are many examples, like glycolysis or the TGF-beta pathway. So pathway analysis is more focused on that type of information, and frequently the pathways are mapped to sets of genes. Network analysis works with networks. I showed you some networks here, but they're pathway relationship networks. You can make many different types of networks, like protein interaction networks, where you're capturing the relationships between lots of different proteins or genes. Network analysis is useful for the same types of things that I mentioned pathway enrichment analysis is useful for, gaining mechanistic insight into your data. It's just that there are pros and cons to each one, which we'll talk about later. Some of the things we use network analysis for are to predict gene function, identify new pathway members, and identify functional modules and new pathways that might not be obvious. One of the advantages of pathways is that they work with well-studied concepts. One of the disadvantages is that, because they're well studied, they don't cover all of the genes in the genome, since we don't know the function of some genes. So network information, yeah? Yeah, sorry, Francis here. Can you say that networks come more from large-scale studies and pathways more from single-gene studies, in general? Yeah, I was just about to say that, and I think that's the case. One of the reasons network analysis is sometimes useful for discovering new things is that we map networks using large-scale experiments, as Francis just mentioned.
If you're mapping protein interactions at a large scale with mass spectrometry, you map all types of genes, not just the ones that have been studied previously in a focused way that's led to understanding how they work in a pathway. Because networks cover more of the genome, it's useful to use that information. So these pathway and network types of information are related and somewhat complementary. Does that make sense? Okay, and then we'll also talk about regulatory network analysis, which is more like the master regulator analysis that I mentioned, to find and analyze master regulators or controllers in the system.