Hi everybody. As we discussed earlier during the introduction, and I think there are some new people, I'm Gary Bader. I'm a faculty member at the University of Toronto, and I'm going to give an introduction to what we're going to work on and learn about during this course. As a reminder, Nia mentioned that all of the content here is freely shareable, so you're welcome to reuse it, even to run your own workshops, and also to modify and improve it.
Okay, so why are we here? Most of us are interested in genes and what they do, their function. This is not so hard to study if you're interested in one gene. But if you have many genes to work on, whether they're from a screen, or across evolutionary space, or from any kind of big genomics experiment, we're usually stuck with figuring out what they all do. This is a typical problem in any kind of genomics where we collect data on many genes. Generally we're trying to understand what's interesting about these genes: what do they have in common, what do they do, how are they related to a disease or phenotype that I'm studying? Frequently the starting point for analyzing genomics data is that you rank the genes by some score that you have, like a gene expression score, or you cluster them to find genes that have similar patterns to each other, because if genes have similar patterns, maybe they're working together in some way. But then what happens next? In this course we're going to talk about what happens next. One way that people have found extremely useful to answer the question of what's interesting about these genes is to find out if they're related to known biology: known pathways, complexes, and functions.
We'd like to automate that, because if you wanted to do it manually, say you had 1000 genes and wanted to find out what pathways they were involved in, and you had no other resources and were just looking at the literature, it would take a long time to go through those genes. Pathway and network analysis is the area of computational biology or bioinformatics that tries to automate this type of activity, with the goal of helping to gain mechanistic insight into big gene lists and, frequently, genomics data. The difference between a gene list and genomics data is that genomics data often has measurements available with the genes, so it's not just a list of genes, it's also gene expression values or other things like that. We might like to identify a master regulator or drug targets, or characterize pathways that are active in a sample, just to better understand the biology that's occurring in the context we're studying. In the context of this course, any type of analysis that involves pathway or network information is what we're calling pathway analysis. The most popular type is pathway enrichment analysis, which we'll talk a lot about today, but there are many others that are available and useful, and we'll cover some of those as well.
In the next few slides, I'm going to give you some success stories of pathway and network analysis from our research, mostly collaborative, that have led to interesting insights and discoveries. The first example I wanted to mention is work that we did with a lab that studies autism spectrum disorder. The lab is Steve Scherer's lab at the Hospital for Sick Children, nearby here, and they've studied the genetics of this disorder. You might know that autism spectrum disorder is a spectrum of phenotypes that includes a number of different aspects.
They range from trouble with social situations all the way to really severe developmental delays and intellectual disability. It is known to have a genetic component. Before the study that I'll tell you about, we knew that there were some rare single-gene disorders and chromosomal rearrangements that caused autism spectrum disorder. This study was published in 2010, so it's a while ago, but before that people had a general understanding that copy number variants that were de novo, or not inherited from the parents, might be involved in explaining, say, 5 to 10% of ASD cases. But in general, not much was known about the genetics of this disease at the time.
So the Scherer lab was very interested in this finding, new at the time, that copy number variants, especially de novo copy number variants, were important. They measured copy number variants in about 1000 cases and controls. Cases are individuals that had severe autism, and controls were individuals that didn't. They used a SNP array to measure SNPs across the genome and then converted those to copy number variants. If you have a deleted region of the genome, there are no SNP measurements in that region, so you can say, okay, that's a deletion, and similarly you can call gains. They were interested in particular in de novo ones that were rare, which they hypothesized were involved in the genetics. So they selected copy number variants from that list that were rare, less than 1% frequency. They found that in general ASD individuals had more de novo CNVs than expected. But they didn't find a lot of genes that were associated repeatedly in the roughly 1000 cases. Only about 10 genes were associated from this analysis.
So we thought we could do a pathway analysis of all the genes that were associated with all the copy number variants. Interestingly, we found a rich set of pathways that seem to be involved in this disease. We'll learn how to make maps like this later. This is called an enrichment map, and it's a method that our lab developed; Ruth is actually one of the original developers of the software that made this. All the circles represent pathways, the size of the circle represents the number of genes in the pathway, and the lines that connect them, also called edges, represent crosstalk or overlap between the pathways, so if pathways share the same genes they get a strong green line connecting them. The white-to-red color represents how strongly the genes in that pathway are linked to autism spectrum disorder cases versus controls. We found lots of interesting pathways: cell projection, motility, and different types of CNS or brain-related pathways that were involved in neuronal processes like CNS development and brain development. That was interesting, and kind of expected given that autism affects the brain. However, a lot of the pathways were new. So one of the questions asked by our collaborators was how this relates to genes that we already know affect autism and intellectual disability, which is a related phenotype. So we took, let's say, 100 intellectual disability genes and 100 autism genes, and we plotted the pathways that were enriched in those gene lists on this map. That's what these other symbols are: triangles are intellectual disability and parallelograms are autism spectrum disorder. We plotted those in relationship to all the copy number variant enriched pathways, and there was a bunch of overlap, which is interesting, as I'm showing here.
But one particularly interesting thing to me was how much more biology we found from doing pathway analysis compared to looking at individual genes. When we looked at some of these individual pathways, we found that, yes, there were more autism spectrum disorder individuals that had affected genes in those pathways than we expected, but it wasn't the same genes each time. It was different genes in different individuals, but it was always the same pathway. So when we looked at the gene level, we couldn't see repeated patterns; we just saw that everybody had different genes that were affected. When we mapped those genes to pathways, we found it was the same pathways over and over. Just that mapping of genes to pathways, just bringing in prior knowledge about pathways, made a huge difference in the study and in being able to find the patterns that we think are important in the biology of this disease. I'll come back to that general concept a few times.
The second pathway analysis example, which is my favorite, was work that we did with Michael Taylor, who's a pediatric neurosurgeon, also at the Hospital for Sick Children. His lab studies ependymoma. Ependymoma is a cancer of the ependyma, which is the lining of the central nervous system. It's the third most common brain cancer in children. Fortunately, brain cancer is not very common in children, but among brain cancers in children it's the third most common. People had known for a long time that, depending on where the brain cancer occurs anatomically, it has different outcomes. In particular, the most common and morbid location in childhood is the posterior fossa, which is the back of the brain. It contains the brain stem and the cerebellum.
Gene expression analysis that we had done, and Ruth was also involved in analyzing this data, identified that there were two classes of this disease. Previously, anatomically, everybody thought that if it's in the back of the brain it's bad, and otherwise it's not as bad. But among just the ones in the back of the brain, it turns out there were actually two types: one that affected the youngest individuals and had a terrible outcome, and another that affected the oldest individuals, still children, but had an excellent outcome. That was determined just by gene expression clustering. We wanted to find out more about the genetics, and in cancer we typically expect mutations to be important. So we searched for mutations with whole exome and whole genome sequencing, but there were basically no mutations. That was very surprising to me, but it turns out pediatric cancers have a much lower mutation burden than we expect from adult cancers, so it was not actually that surprising given the pediatric context. Moving on to another genomics data layer, DNA methylation, we found that posterior fossa type A, which affects the youngest individuals, was more transcriptionally silenced with CpG methylation compared to the B type, which has a better outcome. There were about 2000 genes that were differentially methylated, and it was not very easy to figure out what those genes had in common. If you just look through them in the standard way, it's hard to know. So we ran a pathway enrichment analysis with a large database that we collected from many other databases.
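As a rough illustration of what a pathway enrichment analysis computes, here is a minimal sketch of a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test), followed by the common conversion of the p-value to a negative log10 score. This is not necessarily the statistic used in the study; the function names and all the counts are made up for illustration.

```python
from math import comb, log10


def enrichment_p_value(n_background, n_pathway, n_list, n_overlap):
    """One-sided hypergeometric p-value.

    Probability of seeing at least `n_overlap` pathway genes when
    `n_list` genes are drawn at random, without replacement, from
    `n_background` genes of which `n_pathway` belong to the pathway.
    """
    upper = min(n_pathway, n_list)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_list)


def neg_log10(p_value):
    """Convert a p-value to a score where larger means more significant."""
    return -log10(p_value)


# Hypothetical counts: 1000 background genes, a 50-gene pathway,
# a 100-gene list of interest, 15 of which fall in the pathway
# (about 5 would be expected by chance).
p_value = enrichment_p_value(1000, 50, 100, 15)
print(f"p = {p_value:.3g}, -log10(p) = {neg_log10(p_value):.2f}")
```

A pure standard-library version is used here so the sketch runs anywhere; in practice a statistics library would usually be used, and the p-values would also be corrected for the number of pathways tested.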
What we found was that there was one pathway that was really strongly enriched in the 2000 genes that were differentially methylated. It was a pathway that we collected from the GSEA MSigDB database, which had collected a set of genes related to the PRC2 complex. PRC2 is a complex that methylates histones, and then DNA gets methylated. The targets of that complex have been mapped, and this bar here represents that gene set. They had also mapped the targets of subunits of the complex: SUZ12 and EED are proteins that are part of the PRC2 complex. The bar plot here represents the significance, measured by converting the p-value into a number by taking its negative logarithm. So the longer the bar, the more significantly the pathway is enriched in the list of 2000 genes compared to what is expected. We'll tell you all about how to do those statistics later today.
But the interesting thing biologically was that basically no mechanism had been known to be important in ependymoma. There were no drugs or therapies other than surgery or radiation. So this PRC2 complex represents the first time that a molecular mechanism had been identified in this disease, in particular in the serious type A posterior fossa ependymoma. Not only that, people had been studying the PRC2 complex for a while and had identified various chemical inhibitors of it. Those were tested in cell lines and mouse models, and they specifically killed the ependymoma cells. So that was very interesting, and the mechanism was basically validated through those experiments. Then clinicians on the team searched for drugs already on the market that target similar processes, and they found a drug called Vidaza, or 5-azacytidine, that inhibits DNA methylation, a related pathway.
Because this safe drug is on the market, they were actually able to test it in an individual. This is a child who came in with posterior fossa type A and had reached the end stage of their treatment, where the tumor had metastasized to the lung. This is a picture of the tumor in the lung. Over two months, the tumor had doubled in size. At that point there were no more treatments left for this patient; they were just going to be in hospital for the rest of their life. So on compassionate grounds they gave this patient a course of treatment with this on-the-market drug. It stopped the tumor growing, and the patient regained their energy and was actually able to leave the hospital. That lasted for 15 months before the tumor started growing again. But it was enough information to start a couple of clinical trials. Those are still ongoing, and lots of patients are now taking this therapy, which is definitely helping them.
So this is a great story for us, because in a short period of time we went from knowing very little about the mechanism of the disease to identifying a mechanism, then a drug, and then actually treating a patient, basically within two years. It's a great success story, I think, for genomics and pathway analysis. And because there was only one pathway that came up, we were able to say conclusively that if it wasn't for pathway analysis, if it wasn't for collecting all the data in our database and identifying that pathway, we wouldn't have made that link, and we wouldn't have found that important mechanism.
Okay, so a couple more examples I'm going to tell you about more quickly. The third example is studying these ependymoma tumors more broadly.
It turns out that there isn't just posterior fossa type A and B; ependymoma occurs all over the central nervous system. We had lots of gene expression data for these, and we wanted to find out if there were any differences between the known clusters of this disease. Previous work had found by this point, a few years after the ependymoma story I mentioned, that there were nine types of ependymoma by gene expression clustering. We did a pathway analysis similar to the one I showed you before and visualized it as this enrichment map view, where the circles represent pathways and the size of the circle is the number of genes in the pathway. You can see links between pathways that are similar. Then we grouped them into bubbles that we labeled as major categories of pathways, and we colored them based on how prevalent these pathways were in each subtype. You can see that it gives a really nice view of the biology of all the different subtypes. There are certain pathways that are specific to particular subtypes, and there are other pathways that are present in multiple subtypes. So this summary provides a really nice overview of the biology of the whole disease.
The last example I'm going to tell you about is more recent: looking at single cell transcriptomics data of five healthy livers, which we worked on with the liver team here in Toronto, and which initially revealed 20 different cell types. You're definitely familiar with these plots if you work with single cell genomics data. Each little dot is a cell, and they're placed in space so that cells that have similar transcriptional profiles are close to each other. When you do that, you see that there are groups of cells that have similar transcriptional profiles, and it turns out those are major cell types like hepatocytes or different types of immune cells.
We've labeled those all here. One of the interesting things we noticed was that there was a whole bunch of different clusters that we called hepatocytes. So we did a pathway analysis on this. It's maybe hard to see all the little dots here, but we also had some information about where the cell types occur anatomically in the liver. So we made these little enrichment maps per cell type. There are different clusters here, cluster 5146, which represent these clusters here, like 4216 here. We ordered them along an anatomical axis from periportal to pericentral, which helps to define where blood flows through the liver sinusoids and lobules. I'm not showing you exactly how that works; the take-home message is that we had prior knowledge of how to order things. Then we visualized the pathways in a kind of anatomical view, and we found that there was an interesting division of labor among the different hepatocyte cell types. If you zoom in on these pathways, you can see that there are a lot of drug metabolism pathways, which is an important function of hepatocytes, but hepatocytes in different parts of the anatomy do different things. So that was an interesting description of the data that helps us identify where certain functions occur anatomically.
Any questions so far about any of the things I mentioned?
Okay, so I'm going to go through a little more general background now. Those are just a few examples from our own work that we're excited about, and hopefully it's motivating to see that these are real-world examples where pathway analysis was important.
You can use it to answer different questions, and we'll teach you during this workshop how to do those analyses and other ones that might be related to your work.
Okay, so coming back to a point that I made about autism spectrum disorder. I mentioned that the genes we identified as being important were not identified repeatedly across individuals, but the pathways they were in were identified repeatedly across individuals. So I want to give you a conceptual view of how that works: statistically, why you get signal with this prior knowledge when you didn't have signal in your data before. Let's imagine we're doing a genome-wide association study, where we have cases and controls, and we measure SNPs or mutations for all the individuals. These are five cases, five individual people, and five controls, another five individual people, and we measure a bunch of SNPs, A to F here. If the individual has a SNP in one form it's called one, otherwise it's zero. The ideal situation for a GWAS, a genome-wide association study, is to identify a repeated pattern that's present in all the cases and none of the controls. A in this case is present in all the cases and none of the controls. And there's another SNP, D, that's present in all the controls and none of the cases. If we saw a pattern like that, it would be a very strong signal and would get a very good p-value using GWAS statistics. We might conclude or predict that that SNP might be causal for the phenotype we're studying. But in a realistic situation, whenever you do these studies and whenever we look at genomics data, it's very messy. Much more realistically, each individual here has a different SNP, and there are no repeated SNPs.
So now you're at the other extreme, the worst-case scenario. I showed you the best-case scenario, where the pattern is repeated and it's obvious and easy to see. Here there's nothing to see other than that everybody has a different SNP; that's the pattern, and you can't really do much with that. However, if we map the genes to pathways, and let's say all the SNPs that we identified were part of the apoptosis pathway, we can collapse all of those single numbers down to make our perfect pattern again. So now we have a pattern where we've collected all the ones from all the SNP measurements. We know that the SNPs are on the pathway. We rewrite this table so that instead of looking at SNPs we're looking at pathways, and we're looking to see if there's a SNP in a given pathway in cases versus controls. And so we recover that signal.
The increased statistical power that we get comes from two things, basically. One is aggregating the counts. We take all of these counts that were spread out, where we didn't see any relationship between them, and we merge them together, and now we see a strong relationship. That's probably the most important statistical technique that we gain from using prior information about how genes or SNPs are related to each other. The other thing, which I'm not showing you very much here and which we'll talk about more later today, is that it reduces multiple testing correction problems. If you have multiple association tests, like we're testing if SNP A is related to cases, and then we're testing if SNP B is related to cases, and so on, let's say we have millions of SNPs. If we're just repeating the test over and over again, there's a chance that we'll get a significant p-value randomly.
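The aggregation idea described above can be sketched with a toy table. The data below is made up to mimic the worst-case scenario (five cases each carrying a different SNP, five controls carrying none), the mapping of all SNPs to an apoptosis pathway is the same hypothetical used in the talk, and the one-sided Fisher's exact test is an assumption chosen for simplicity rather than the statistic a real GWAS would use.

```python
from math import comb


def one_sided_fisher_p(case_hits, n_cases, control_hits, n_controls):
    """One-sided Fisher's exact test: probability that at least
    `case_hits` of the carriers are cases if carrier status were
    unrelated to case/control status."""
    carriers = case_hits + control_hits
    total = n_cases + n_controls
    tail = sum(
        comb(carriers, k) * comb(total - carriers, n_cases - k)
        for k in range(case_hits, min(carriers, n_cases) + 1)
    )
    return tail / comb(total, n_cases)


# Made-up genotypes: each set holds the SNPs an individual carries.
cases = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}]
controls = [set() for _ in range(5)]

# Hypothetical prior knowledge: every SNP lies in one pathway.
pathways = {"apoptosis": set("ABCDEF")}

# Test each SNP separately: six tests, none with a repeated pattern.
snp_p = {
    snp: one_sided_fisher_p(
        sum(snp in person for person in cases), len(cases),
        sum(snp in person for person in controls), len(controls),
    )
    for snp in "ABCDEF"
}

# Collapse SNPs to pathways: an individual is "hit" if they carry
# any SNP in the pathway. One test instead of six.
pathway_p = {
    name: one_sided_fisher_p(
        sum(bool(person & members) for person in cases), len(cases),
        sum(bool(person & members) for person in controls), len(controls),
    )
    for name, members in pathways.items()
}

print("per-SNP p-values:", snp_p)
print("per-pathway p-values:", pathway_p)
print("tests run:", len(snp_p), "SNP tests vs", len(pathway_p), "pathway test")
```

With this toy data no single SNP separates cases from controls, while the pathway-level row recovers the "all cases, no controls" pattern, and the number of tests that would need correcting drops from six to one.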
So when we're making many tests we need to correct for that, and there are standard statistical methods for doing that. It's called multiple testing correction, and we'll talk about it later, but especially if you have lots of data it reduces your ability to identify significant signals. In pathway analysis, by merging all the SNPs together into a smaller number of pathways, we reduce the number of tests that we do and increase our ability to identify signal without getting confounded by randomly significant results.
The third interesting thing conceptually is that pathway analysis gives us some mechanistic insight, ideally, into what we're studying. If we're just looking at a bunch of SNPs, it's like, okay, SNP A is associated with cases. But with pathway analysis we say apoptosis is associated with cases, and that is a lot easier for us to follow up on. Those are the key fundamentals behind everything that we're going to be doing in the course. Anytime you see an opportunity to do this, in general, you can use these techniques to improve your statistical power and the interpretability of your results.
Okay, so just one additional example, in terms of mechanistic understanding. I didn't have a good publication example of this. It's an example of master regulator analysis, and this is another way of explaining what gaining mechanistic insight means: we want to explain the data in some way. Let's say we have a bunch of samples. Each column here is a sample, each row is a gene, and all the colors are measurements of the genes. We have a bunch of expression data, with red being high and blue being low.
We've grouped all of these genes. We've identified, say, hundreds or thousands of genes that are following some kind of pattern here, and we can identify the pattern very clearly in our data, but we don't have a mechanistic explanation for it. One of the things we might be able to do, just as an example, is find out that all the genes that have a similar pattern are regulated by the same transcription factor. If that known set of transcription factor targets is enriched, meaning they're present in this list of genes more than we expect by chance, then this transcription factor might be controlling these genes, and maybe that explains this big pattern that we see. This is what we did with the PRC2 complex example, in a way. And we'll talk about that as well.
Okay, so just to summarize, and I'll go through a couple more slides. This is what we'll do for the rest of the workshop. The benefits of pathway analysis versus analyzing individual SNPs or transcripts are that the results of pathway analysis are hopefully easier to interpret because they use familiar concepts. We identify possible causal mechanisms, like the transcription factor example I mentioned. We might be able to use it to predict new roles for genes; I didn't talk about that very much, but we'll talk about gene function prediction later. It improves statistical power in the way that I mentioned. And one benefit that I didn't mention is that it can be more reproducible across cohorts or studies, because you can imagine doing a GWAS like the one I mentioned with A to F.
If every individual has a different SNP, and I look at 100 individuals with all different SNPs, and then I do that in another population, I'll probably find that everyone is affected by yet other SNPs. But the pathway information, if it explains the biology, is likely going to be more reproducible across those studies. It can also facilitate integration of multiple data types, like if you have different genomics layers: gene expression, protein expression, and other types of proteomics data.
Okay, so finally, the last bit of information I want to give you is a preview of the pathway analysis workflow that we're going to be covering in the course. The general idea is that we have some genomics data or experimental data that generates a lot of information about genes. We normalize and score it using standard methods, which we're not going to cover here but can answer questions about. That gives us a list of genes, potentially with scores. Then we learn about the underlying cellular mechanism with pathway and network analysis; that's the ideal. That would involve visualizing and identifying interesting pathways, trying to find one that's interesting, like it's novel, and then we drill down on that and look at it in more detail, and we might be able to publish that specific thing. Blowing this out into more detail, we have lots of different types of genomics data up here and different types of normalization methods. They all generate a gene list. Then we have ways of finding interesting pathways, finding interesting networks, and doing this mechanistic drill-down. Each of these little boxes is a type of analysis, and each highlighted yellow box is a tool. We'll be talking about almost all of these tools in the workshop, so we'll come back to this. Okay, so that's it. The workshop outline.
As I mentioned, just to summarize, we'll be working on pathway enrichment analysis, where we can summarize and compare. Network analysis can help us predict gene function, find new pathway members, and identify new pathways. We'll also be talking about regulatory network analysis, which helps us find and analyze master regulators. And that's it for the intro. Right now we're on a 30-minute coffee break and networking session, so we can chat amongst ourselves.