Morning. I'm going to give you an introduction to the different things we're doing in this course, and I'm going to cover a lot of background material. Some of you may know this material already, but not everybody does, so the goal is that after this lecture the basic knowledge and concepts will be clear for everybody, ideally.
OK, so the basic idea of this course starts when you've done a large-scale experiment and it worked, which is great. But then, all of a sudden, you have data overload, because you have thousands of data points from all the genomics or proteomics data that you've collected. Often these get translated into molecules of some sort, usually genes. So this course is trying to answer: now what do I do? Don't panic; there are lots of useful things we can do with that. The main thing people want to know is what's interesting about this set of genes. If I have 1,000 genes from a gene expression experiment, tell me what's interesting about them. One of the most commonly performed analyses asks whether that list of genes is enriched in any known pathway, protein complex, or gene function.
So this workflow tends to have some kind of pre-processing step where you convert your raw genomics data into a list of genes. There can be many steps involved in that, and it can be quite complicated, actually, but we're starting with this gene list. Then we want to compare it to pathways and ideally find some interesting new biological process that can be targeted by a drug, or something like that. Pathway and network analysis was created to help automate the traditional approach to this problem.
So if you didn't have all the tools that we'll cover today, you'd have to go through the gene list manually, look up each gene in the literature, and try to learn about the genes you've identified. So it's basically saving time compared to that traditional approach. That's not to say it completely replaces it. It's not going to do all your homework for you, and it's not going to read the literature, but it provides a lot of useful information.
The general idea of pathway and network analysis is to help gain some kind of mechanistic insight into your genomics data. It might help you identify a regulator that's important in the system. It might help you identify drug targets, or just characterize pathways that are active, which might help you understand the mechanisms that are at play or important in your system. I like to use a very general definition of pathway and network analysis: it's really any kind of analysis that involves pathway or network data. I'll talk about what pathways and networks mean. When I say networks, I mean molecular networks like protein interactions. Pathways are signaling pathways or metabolic pathways. Pathway and network analysis, as I mentioned, is probably the main way that people analyze gene lists. The most popular type is enrichment analysis, but many others are useful. I'll mention enrichment analysis again because it's so widely used, and we'll spend a lot of time on it today. The basic idea is that you look for categories of gene function, like pathways, that are more enriched in your gene list than you'd expect.
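To make "more enriched than you'd expect" concrete, here is a minimal sketch of an enrichment calculation. It is an illustration, not the method used by any particular tool. The numbers are assumptions chosen for the example: a genome of 20,000 genes, a pathway covering 5% of it, and a list of 100 genes of which 50 are in the pathway. The helper name `hypergeom_upper_tail` is hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, genome_size, pathway_size, list_size):
    """P(X >= k): chance of seeing at least k pathway genes in a random
    gene list of size list_size drawn from the genome (hypergeometric)."""
    upper = min(pathway_size, list_size)
    favourable = sum(
        comb(pathway_size, i) * comb(genome_size - pathway_size, list_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(genome_size, list_size)

# Assumed example numbers (not from a real data set).
genome_size = 20000   # genes in the genome
pathway_size = 1000   # cell cycle genes, i.e. 5% of the genome
list_size = 100       # genes in our experimental gene list
overlap = 50          # genes in the list that are cell cycle genes

# Fold enrichment: fraction in the list divided by fraction in the genome.
fold_enrichment = (overlap / list_size) / (pathway_size / genome_size)
p_value = hypergeom_upper_tail(overlap, genome_size, pathway_size, list_size)

print(f"fold enrichment: {fold_enrichment:.1f}")
print(f"p-value: {p_value:.3g}")
```

The fold enrichment says how over-represented the pathway is, and the tail probability says how surprising that overlap would be for a randomly chosen gene list of the same size.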
So if you have 1,000 genes or, say, 100 genes, and 50 of them are cell cycle genes, then half of your gene list is cell cycle genes. If you look in the genome and only about 5% of the genome is involved in the cell cycle, you have a 10-fold enrichment of cell cycle genes in your list. That is a very significant result, and it's telling you that the cell cycle is somehow important in your gene list.
OK, so I'm going to go over a couple of successful examples from work that we've done that illustrate the utility of this. The first example is related to autism spectrum disorder. This was work that Daniele Merico, who was a postdoc in my group at the time, did with Stephen Scherer, who is an autism genetics researcher at the Hospital for Sick Children here in Toronto. What I learned through this project is that autism is actually quite heritable. Autism spectrum disorder, for people who don't know, is a disorder of social interaction with other people, but it can also involve movement disorders. There are a number of phenotypes involved, and it's on a spectrum. It's quite heritable, so identical twins have a high rate of both having autism. A certain amount of the heritability is explained by rare single-gene disorders, and a few years ago people discovered that copy number variants, rare copy number variants in particular, are important in autism spectrum disorder. So what Stephen Scherer's group did is they collected roughly 1,000 cases and 1,000 controls. They used a SNP array to measure the genotypes of all of these individuals, and then from the SNP intensities they computed copy number variants. The way that works is this: a copy number variant can be a deletion or a gain of a genomic region in part of your DNA.
If an individual has a deletion, then when you run a SNP chip to genotype them, none of the SNPs in a row along that region of the genome will have any DNA present, and so they'll all have zero intensity on the chip. So if you order all those SNPs along the genome and you see a big region that doesn't light up at all, that means there was no DNA in the sample for that region, and so that's a deletion. You can do the same thing with gains. Once they had these mutations for all the individuals, they did a genome-wide association study, which is a fairly standard technique for identifying mutations that correlate with a phenotype. In this case, they looked for copy number variants that are present more frequently in the autism cases and less frequently in the controls than you'd expect. That's the standard analysis, and they found a few copy number variants that were significant according to their statistics. That was good; they found some new information. But I think they found only two or three that were really significant.
So we looked at that data from a pathway perspective, and we found a very rich set of pathways that seem to be more frequently mutated than you'd expect. We'll go through plots like this and how to make them, but basically, all the circles here that are colored from white to red, so some of them are pink, represent pathways that are enriched in autism cases. Instead of looking on a mutation-by-mutation basis or a gene-by-gene basis, we looked on a pathway-by-pathway basis. We took a pathway, and we represented that pathway as a set of genes.
So say the Wnt signaling pathway has 100 genes. We looked to see if that pathway had more deletions in it than you would expect by chance, in cases compared to controls. For people who are interested in the statistics, you can compute a p-value with a Fisher's exact test, and then we computed a false discovery rate by shuffling the case and control labels thousands of times and repeating the analysis; each time, we checked how frequently we got that level of enrichment by chance with that shuffling procedure. That false discovery rate is mapped to the color in this plot. A lot of these pathways were a little bit general, like kinase activity and GTPase/Ras signaling. Some of them were very specific, so there was a central nervous system development set of pathways.
Sorry, I should explain this plot a little bit. Each circle is a pathway, a set of genes, since we've represented the pathway in a very simple way. The lines between the circles represent crosstalk, or genes that are part of more than one pathway. And all the circles, as I mentioned, are pathways that were enriched in the autism cases versus controls. When we presented this to our collaborators, not all the pathways were obviously related to autism. So we had to... yeah, a question?
You said you only found two or three significant variants, but then a huge number of pathways?
Yeah, so I'll explain how that works. That's a key point of this example. This study focused on rare copy number variants. They removed the common ones because they thought the common ones are less likely to explain a rare disorder, and they had previous knowledge that the rare ones were important in the genetics.
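The label-shuffling idea mentioned above can be sketched roughly as follows. This is a simplified illustration for a single pathway that produces an empirical permutation p-value; it is not the full false discovery rate procedure from the study. The toy data (8 of 10 cases and 2 of 10 controls carrying a variant in the pathway) and all names here are assumptions.

```python
import random

# Hypothetical toy data: True means the sample carries at least one rare
# variant hitting any gene in the pathway of interest.
case_hits = [True] * 8 + [False] * 2      # 10 cases, 8 with a pathway hit
control_hits = [True] * 2 + [False] * 8   # 10 controls, 2 with a pathway hit

def count_difference(hits, labels):
    """Number of hit cases minus number of hit controls."""
    cases = sum(h for h, lab in zip(hits, labels) if lab == "case")
    controls = sum(h for h, lab in zip(hits, labels) if lab == "control")
    return cases - controls

hits = case_hits + control_hits
labels = ["case"] * 10 + ["control"] * 10
observed = count_difference(hits, labels)

# Shuffle the case/control labels many times and count how often chance
# alone gives a difference at least as large as the observed one.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_permutations = 5000
at_least_as_extreme = 0
for _ in range(n_permutations):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    if count_difference(hits, shuffled) >= observed:
        at_least_as_extreme += 1

# Add-one correction so the empirical p-value is never exactly zero.
empirical_p = (at_least_as_extreme + 1) / (n_permutations + 1)
print(observed, empirical_p)
```

Shuffling the labels keeps each sample's mutations intact and only breaks the link to disease status, so the shuffled results show what level of enrichment arises by chance in this particular data set.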
The issue with that is, if you only have a few mutations per patient, per sample, it's difficult to see variants that are frequently mutated across the samples. There are some, but those frequencies are still low: five or six or seven, even at the maximum. Pathway analysis helps address that. The way that these pathways became significant when we looked at them is by taking information from multiple patients, multiple samples. So looking at an individual pathway like cell projection organization here, which was quite significant, it wasn't a single variant that was mutated frequently. Across the population of a thousand cases, the hundred or so genes in the pathway were mutated more frequently than you'd expect, and each person that you looked at had a different mutation in the pathway. That's the kind of signal that would be very difficult to see at the gene level, but at the pathway level you can see it.
I'll illustrate it with one more example. Say you have a genome-wide association study with 10 cases and 10 controls, and you look at the cases and everybody has a mutation in a different gene. If you want to identify a gene that's frequently mutated, you can't, because none of them are. All the genes that are mutated are mutated only once, so you'd never see any signal, and you couldn't do much with that result. However, suppose you knew that all of those genes were part of the same pathway, and instead of thinking about things at the gene level, you did the analysis at the pathway level and asked how frequently this pathway is mutated. It's mutated in 10 out of 10 cases and zero out of 10 controls.
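The 10-case, 10-control thought experiment can be written out as a small sketch. It uses a one-sided Fisher's exact test implemented directly with the standard library. The gene names, sample names, and the helper `fisher_one_sided` are hypothetical and only serve the illustration.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table
        cases:    a mutated, b not mutated
        controls: c mutated, d not mutated
    Returns P(at least a mutated cases) under the null hypothesis."""
    n_cases, n_mutated, total = a + b, a + c, a + b + c + d
    upper = min(n_cases, n_mutated)
    favourable = sum(
        comb(n_mutated, i) * comb(total - n_mutated, n_cases - i)
        for i in range(a, upper + 1)
    )
    return favourable / comb(total, n_cases)

# Toy data: each case carries one mutation, each in a different gene,
# and the controls carry none. All ten genes belong to one pathway.
cases = {f"case{i}": {f"gene{i}"} for i in range(1, 11)}
controls = {f"control{i}": set() for i in range(1, 11)}
pathway = {f"gene{i}" for i in range(1, 11)}

def n_hit(samples, genes):
    """Number of samples with a mutation in any of the given genes."""
    return sum(1 for mutated in samples.values() if mutated & genes)

# Gene level: one test per gene, each gene is mutated in 1 of 10 cases.
gene_p = {}
for gene in sorted(pathway):
    a, c = n_hit(cases, {gene}), n_hit(controls, {gene})
    gene_p[gene] = fisher_one_sided(a, len(cases) - a, c, len(controls) - c)

# Pathway level: one test, the pathway is mutated in 10 of 10 cases.
a, c = n_hit(cases, pathway), n_hit(controls, pathway)
pathway_p = fisher_one_sided(a, len(cases) - a, c, len(controls) - c)

print("best gene-level p-value:", min(gene_p.values()))
print("pathway-level p-value:  ", pathway_p)
```

No single gene is anywhere near significant, while the one pathway-level test is, and only one test has to be corrected for instead of ten.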
And so, just by knowing that pathway information and that the genes are part of the pathway, you've collected all of the single counts into one big, strong count. That actually helps you with statistical power in two ways. One, it has collected the small counts into a big count, and the big count makes it easier to get a signal. The second is that it reduces the number of tests that you have to do. In the gene case, you have to do a test for every gene. In the pathway case, for this example with one pathway, you do one test. So say you've reduced your tests 10-fold. That helps you with multiple testing correction, which we'll explain later as well. Does that make sense? Okay.
I wanted to briefly explain some of the rest of this plot. When we presented our results to our collaborators, they didn't quite know what to make of all of the pathways, because a lot of these pathways they hadn't seen before and, apart from the nervous system development ones, they weren't obviously related to the autism phenotype. So what we did was look at known genes in autism and in intellectual disability, which is a related disorder. We took all the known autism genes, which was about a hundred, and all the intellectual disability genes, which was about 150 or something like that, and we did a pathway enrichment analysis on those as well. We added symbols to the plot for them: all these little triangles are pathways that were enriched in known intellectual disability genes, and similarly this parallelogram shape was used to represent pathways that were enriched in known autism genes.
What we found was that even though the genes in the known gene lists were not the same as the genes mutated in this case-control study, when you looked at the pathways that were enriched, there was a lot of overlap. In the middle here, you can see intellectual disability. All of these pathways, cerebral cortex, cell migration, central nervous system differentiation, are triangles, so they are only enriched in the known intellectual disability genes. The circles come from this new data set, and the circles have a lot of genes in common with the triangles. So even though it wasn't the same genes, they're somehow focusing on the same process.
One of the interesting things is that we generally found pathways that were less well annotated when we did this experiment, whereas the pathways related to known disease genes were very well annotated. That usually happens because the literature and the information in the databases is biased toward things that people have studied frequently. People are going to study obvious disease genes much more frequently than they study other genes. When you do a new genomics experiment, you might find that you're uncovering genes that nobody has ever really studied much before and that don't have a lot of literature on them. In this case, that manifested in identifying pathways that were a little bit more generic, because they hadn't been studied as well. That's interesting and a good thing to keep in mind. Sometimes when you do a genomics experiment, especially when you're using new technology, you branch out into areas of biology that other people have never really explored very much. That's exciting but also challenging. Okay, any other questions? So I'm going to give you the second of two pathway analysis examples.
The two that I've chosen here, the autism one and this one right here, were successful examples where we really learned a lot, so they illustrate the power of the approach. This is a project that was done in collaboration with Michael Taylor. I noticed John is here, I guess, from Michael's lab, so someone from Michael's lab is here. Michael Taylor is a neurosurgeon at SickKids who runs a large lab. They do a lot of genomics on brain tumors. One of them is ependymoma, which is the third most common childhood brain tumor. Ependymoma affects the ependyma, which is the lining of the central nervous system that helps form the blood-brain barrier. It can occur in many places in the brain. Prior to the genomics era, pathologists and surgeons, who have been treating ependymoma by surgery and radiation because there's no known treatment other than that, figured out that depending on the anatomical location where the ependymoma occurs, it can be more serious or less serious. The most common and most serious location was the posterior fossa, which is the brainstem and cerebellum. So out of all the different places it can occur, if it occurred there, then the surgeon and pathologist would say this is the most serious type, and they might change treatment based on that.
Michael wanted to look at the genomics of this. Nobody had really looked at it much before, so he collected gene expression data using microarrays around 2010, 2011, and found that for this one anatomical region that everybody lumped together, there were actually two classes of gene expression. The gene expression for all the samples very clearly clustered into two types. One type, called posterior fossa A, affected the youngest individuals and had the worst outcome, and posterior fossa B affected the oldest individuals and had an excellent outcome.
So even though doctors were lumping these together, basically saying that if you find this in this anatomical region it's serious, it turns out that there are actually two different possible diseases in that region: one that has a good outcome and one that has a bad outcome. It's very important to know that, because previously people were treating everybody the same, thinking everybody would have a bad outcome, and maximizing the radiation and surgery. That's not good, because if someone is going to have a good outcome, they're going to be affected negatively by radiation to the brain, especially developing children.
So Michael wanted to look further into this disease to understand more about the mechanism and why these classes are different, and with collaborators he did a number of exome sequencing and whole-genome sequencing experiments. I think these numbers are a little bit out of date, actually, I noticed. There's actually quite a lot of whole-genome sequencing, and strikingly, they found basically no mutations. There were two or three mutations per sample and no recurrent mutations that were new. There were a couple of known copy number variants on the B side that people already knew about, but otherwise no mutations. This was very surprising because, as everybody is taught, one of the major hallmarks of cancer is genome instability. You should have a lot of mutations if you have a tumor. In this case, there were basically no mutations. One of the reasons for that may be that some of these patients were very young, babies basically. We know that mutation rate correlates with age, even in tumors, so you'll have tumors with a higher mutation rate if you're older. Maybe there wasn't enough time to accumulate a lot of mutations. In any case, the exome and genome sequence data did not help in interpreting this tumor type.
So they looked at epigenetics, specifically at DNA methylation of CpG islands using a methylation array readout, and they found that the A type was more transcriptionally silent: the CpG islands, or the promoters, were a lot more methylated in A than in B. There were about 2,000 genes that were differentially methylated. When they took these 2,000 genes and used the standard tools that were out there at the time, they didn't find any pathways that were enriched. I shouldn't emphasize this, it's not important, but the main thing is that when we looked at this by pathway analysis, we found that there was a single pathway that was extremely enriched in these genes: targets of the PRC2 complex. PRC2 is Polycomb repressive complex 2. It's involved in methylating histones, so it's involved in the methylation process, and DNA gets methylated after that. These others, SUZ12 and EED, are parts of the PRC2 complex. That was really interesting because it identified the first mechanism that had been associated with this type of ependymoma, and now you can think about drugs. Maybe these cells are dependent on this overactive PRC2, so if you can find a drug to shut it down, maybe that will help. It turned out that as soon as Michael started talking to people, he realized that people are studying this. Epigenetic drugs are a hot topic in the drug discovery arena, and there are drugs available. In mouse models, the drugs were working, and at the end of the project they were actually able to treat a patient who was at the end stage of their disease. There were no more treatment options left for this patient, and the tumor had metastasized to the lung and in two months had doubled in size, from this thing here to this thing here. So they decided to use an on-the-market anti-DNA-methylation drug, and they treated the patient,
and the patient regained their energy, and the tumor stopped growing, and that effect lasted for 15 months. That was very exciting, and for the purpose of this example, it illustrates the power of pathway analysis, in particular the ability, if you're lucky, to get really good mechanistic insight that you didn't have before. In this case, it really uncovered something important for the tumor type. And the other interesting thing... yeah, I'll skip the other thing.
Okay, so to summarize the benefits of pathway analysis compared to looking at things based on transcripts, proteins, SNPs, genes, et cetera: I'm not saying you shouldn't look at your data by those things. Obviously you should; you should start with genes, proteins, et cetera. But once you've looked at your genes and proteins, it's very useful to also do a pathway analysis, because the results that you get back are usually easier to interpret. You are using the language of pathways, which we biologists are usually very familiar with. It can identify possible causal mechanisms, like the ependymoma story I told you about. It can predict new roles for genes: if you have a gene that is unknown, but it's connected to a phenotype that you're studying, or to other known genes that are behaving the same way, then that might give you some additional information about it. It improves statistical power in the way that I explained, by aggregating data and reducing the number of statistical tests. It tends to be more reproducible, or I should actually use the word comparable. That is illustrated in the autism case, where we looked at the pathways that were enriched in the copy number variants and the pathways that were enriched in the known autism genes. There were no overlaps at the gene level, but there were overlaps at the pathway level. So it just illustrates that in a disease that is oriented around a pathway,
basically the idea is that the disease is caused by the malfunction of a pathway or system, and there might be many different ways of causing that malfunction. You could mutate a gene, you could overexpress a gene, and there might be 50 genes that you can change to affect the pathway. If the pathway stops functioning, you get a disease, and there are lots of different ways of stopping the pathway from functioning. By looking at the gene level, you might not see that overlap, because every time you look, it's a different way, but at the pathway level, you see it all converging on that pathway. It also helps facilitate integration of multiple data types. If you are doing an experiment that involves multiple different types of omics data, you can do a pathway analysis on all of them and then compare that data at the pathway level. That's often a very useful approach to integrating data. It's not the only approach, but it's useful.
So I mentioned pathways and networks. We use the terms a little bit interchangeably in this course, and we'll talk more about the difference here. Yes, any question?
Do you think the mutations were different in different samples? Could you argue that maybe you should look at different gene mutations in the same way, following the earlier argument?
Yeah, we might, but we tried that and it didn't work. I should have mentioned that. There were too few mutations to even get any signal there. So we don't know why PRC2 is active in this case. There could be many reasons, but we couldn't find a reason in the whole-genome sequencing.
So even when you did pathway analysis with whole-genome sequencing?
Yeah, in that case it didn't work. So it doesn't always work; these are good examples where it's working.
So PRC2 is an enzyme itself?
It's a complex that has methyltransferase enzyme activity.
So the activity of that was increased?
In the tumor type, in the one type.
We couldn't figure that out by...
We didn't see it in the genome sequence data. We didn't see any mutations related to that in the genome sequence data. So we don't know why that's increased. It could be caused by other cells. It could be an environmental cause, like a virus or something; we don't know. And also, whole-genome sequencing, even though it says whole genome, usually covers probably 85% or 90% of the genome. It doesn't cover everything. It misses repetitive regions, telomeres, centromeres. It misses a bunch of DNA that's difficult to sequence.
So even if you look at mRNA expression, you wouldn't see it?
Well, in mRNA expression we saw a big difference. The original study identified the original classes, A and B, based on gene expression data. So we had a big difference in gene expression data, no mutations, and a big difference in methylation. We would argue that methylation is causing the difference in gene expression. We don't know why the methylation has changed, but at least we have quite a number of layers that we can think about. Any other questions? Yeah?
So when you say there are no mutations, do you mean you don't have the power to detect them, or they were not there at all?
They were not there at all, yeah. There were a few, but they were really, really minimal.
And the second question: you say improved statistical power, but the number of genes is less, but...
Usually there are fewer pathways than genes. It doesn't have to be that way. If you use a very big pathway database that includes tons of stuff, you might find more pathways than genes. We'll talk about that later in the course, but there's a statistical...
One of the properties of pathway information is that there's a lot of redundancy, and if you reduce that redundancy, you don't have to do as many tests. How to approach that problem is not totally figured out, as you'll see later in this course. But usually when we do pathway analysis, we might work with 1,000 to 7,000 or 8,000 pathways. The examples I mentioned are human cases. Human has about 20,000 genes, and if you also consider microRNAs and lincRNAs, then you might even get 50,000 or 60,000. And we're usually working with 1,000 or 2,000 pathways, in the thousands range. For bacteria or other organisms that have fewer genes, you'll usually have fewer pathways as well. Any other questions?
Okay, so we talk about pathways and networks. Pathways are what you already know as pathways: a series of ordered molecular events. Usually when we say pathways, ideally we'd have a lot of information about how the pathway is working, so we understand the mechanism a little bit better. Networks are just connections between genes or proteins. The connection can be anything, and we'll talk about that as well. You could have A represses B, A binds to B, A genetically interacts with B, or A is related to B somehow. There tends to be less detailed molecular information here, but because the relationships are a little bit more generic, you can often find more relationships. So there are pros and cons. Sometimes you can generate these networks from the genomics data itself, and then you can use them where you don't have pathway information.
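As a rough sketch of what "networks are just connections between genes or proteins" looks like as data, the relationships listed above can be stored as a list of typed edges. The gene names A, B, C and the relationship labels are placeholders for illustration, not a standard file format.

```python
from collections import defaultdict

# Each edge is (gene_1, gene_2, relationship type). Placeholder names.
edges = [
    ("A", "B", "represses"),
    ("A", "C", "binds"),
    ("B", "C", "genetic_interaction"),
]

def build_adjacency(edge_list):
    """Index the edges by gene so neighbours can be looked up quickly.
    Direction is ignored here for simplicity, even though a relationship
    such as 'represses' is directional in biology."""
    adjacency = defaultdict(set)
    for gene_1, gene_2, kind in edge_list:
        adjacency[gene_1].add((gene_2, kind))
        adjacency[gene_2].add((gene_1, kind))
    return adjacency

def neighbours(adjacency, gene):
    """All genes connected to the given gene by any relationship type."""
    return {partner for partner, _kind in adjacency.get(gene, set())}

network = build_adjacency(edges)
print(sorted(neighbours(network, "A")))
print(sorted(network["C"]))
```

A pathway record would typically carry more than this, such as the order of events and the mechanism of each step, which is the extra detail that networks tend to lack.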
Pathway information usually comes from a lot of experiments published in a lot of papers, and then, over time, a consensus has been reached on a model for how the pathway is working. Network data can sometimes come very quickly from large-scale genomic data. You can also map pathways to networks if you want.
Okay, so there are three major types of pathway and network analysis. This is from a review. I think we should have referenced it here, but I'll make sure that it's in the pre-reading or notes. The enrichment analysis that I mentioned is the first type: enrichment of fixed gene sets, which represent pathways. A lot of people in the literature use the term gene sets. I always say pathways, because what's a gene set? It's actually confusing. Give me a set of genes. What is it? Is it a pathway? Is it a set of co-expressed genes? Is it genes that are orthologs of something? So I like to use the word pathway. The other reason I always focus on pathways is that there might be multiple types of gene sets, but pathways are the ones that usually have the highest level of interpretability, because they're concepts that we recognize.
The second type this review calls de novo subnetwork construction and clustering. The basic idea here is that you have a big network of relationships and you map your data, your gene expression data for instance, to that network. Then you can find regions of the network that are differentially expressed and pull those out. We'll talk about that tomorrow. And then people have come up with lots of different, fancier methods that this review calls pathway-based modeling.
It considers a little bit more of the mechanistic details, like whether this phosphorylates that, or particular patterns of biochemical reactions. We're not going to cover that too much in this course, even though there are methods available and published. Actually, we'll probably cover one of them, PARADIGM, tomorrow as well. Traditionally, this last category hasn't taken off as much. The first category is what everybody does. If you just look at citations of enrichment, you get tens of thousands of people using it. For de novo network clustering, hundreds of papers are using it. And for pathway-based modeling, the tools have traditionally not been at the level where they're really user-friendly. But I think we'll hear tomorrow that the Reactome group, who are based here, have been developing an easy-to-use version of this PARADIGM method based on their database. So Robin might cover it tomorrow. It's new, so I don't know if they're covering it.
Okay, so this slide explains what I talked about, in terms of the questions that each type answers. The first one is: what biological processes are altered? This is taken from a cancer perspective. For the first type, with the known pathways, one of the good things is that it can identify pathways that are easy to interpret. The disadvantage is that it relies on knowledge of pathways. There are usually thousands of genes in any given genome that we don't know much about, and they may not be considered in the analysis. Network construction sometimes covers a lot of those genes, so you can identify regions in the network that include genes that don't have a lot of pathway annotation or information, and you'll still find them.
Now, the challenge there is that once you get that network section or module, you still have to interpret it, so you'll have to look at the genes in it, even if they're not well known. You can also cluster your data by pathways and networks, and I'll skip that last one. Okay, so the workflow that we are covering today is summarized here, and in the next slide I'll go into more detail on each of these. The basic idea is that you collect some kind of genomics data. This course tends to focus a lot on mRNA expression data, because it's traditionally the most common type of data that we see in genomics, from microarrays or RNA-seq, but there are lots of other types of genomics data. All of the concepts that we're teaching in this course are applicable to all of those other data types, as long as they result in a gene list. And even for metabolomics: is anyone here working with metabolomics data? So a couple of people. We won't touch on that too much, but there are tools that do all the same stuff for metabolomics data, so we can talk. One of them is called MetaboAnalyst, which you might know about. It's one of the main ones, I think. So once you have your data, you have to process it, and this involves scoring and normalization. For instance, with gene expression data, you have to normalize it, and then you usually have to compute differential expression. You compare your sample versus your control. We're not gonna cover how to do that in this workshop. There are other workshops in this series that do cover that, but we will mention some of the basics. The important thing here is to make sure that this step is working to reduce noise and increase signal, which is the main motivation for doing it. Often there are standard workflows for this, especially for any kind of omics data that is well-established. 
Usually there's a standard workflow that you just run. Once you have that, you generate your gene list somehow, and we'll talk about that, and then you ideally get some insight. You'd like to learn about the mechanism, for instance, so you might wanna visualize and identify interesting pathways and networks. Once you've found an interesting one, you drill down and try to understand the mechanism, and then you can publish the model. When you're drilling down, that's where you might go into the literature again. So pathway analysis will often help you get a very broad overview of your data. It will show you all the pathways. Even then it might be too many pathways, but at least the number of pathways is far smaller than the number of genes, so you can go through those pathways, identify interesting ones, and then zoom into those. That's a general strategy. If you do it that way, you don't get overloaded: you use pathways to avoid getting overloaded by genes, and you zoom into only one pathway of interest, or a few, rather than all of them. Zooming in is where you can really focus on the mechanism of that one pathway. Okay, so this is a blow-up where each of those boxes gets expanded: the blue box here and the yellow box here get expanded into blue boxes at the top and yellow boxes. At the top we have all of the different types of genomics data that are standard, and depending on the type of genomics data that we collect, the gene list might mean different things. The gene list from gene expression data often relates to pathways that are being regulated. That's the way we think of the cell, that it's expressing genes when it needs to, so if you see a bunch of genes that are differentially regulated, you expect that they're somehow important for what's happening to the cell in its environment. 
But you can also do proteomics to find protein interactions, and in that case the list of proteins is the list of interactors of your target, or microRNA, or transcription factor. Association studies find you genes that are linked to a phenotype. So depending on the class of data, your input gene list might mean different things, and it's good to think about that. For instance, your gene list could be all of the things that are present in the same tissue. That's not the same as being associated with a phenotype. So that, as I explained, is the top part here. Now the second part here is processing this data, and if you read these little boxes, you can see: score methylation of gene promoters, compute differential expression, and then you get your gene list. The important thing, as I mentioned, is to use standard normalization procedures that increase signal and reduce noise. One of the things that you probably need to think about is the size of your gene list. It can't be too small or too big, but we'll get into that later. You also need to make sure that your gene identifiers are compatible with your workflow, which I'll talk about as well. So that's these boxes here. Okay, so any questions so far? Okay, so the next thing I want to talk about is what biological questions you can ask of a pathway analysis. We found that if we just talk about gene sets and pathway analysis, it's so generic that it's good to think about the actual questions that people have, and this list summarizes the major questions that are answered by pathway analysis. One thing that you can do is summarize your data. It's a bit descriptive, but you'll get a description of the pathways that are active, and obviously that might help you identify interesting pathways to follow up on. 
You can perform a differential analysis, so you can see which pathways are differentially active in your samples, and you can use that to classify your samples as well, so you might be able to classify your samples by pathways instead of by genes or mutations. You can find a controller for a process. People are often interested in finding a transcription factor or a microRNA that's an important regulator in their system, because that's a target: they can go knock it down or overexpress it, so there's a clear experiment that you can do to test whether it's an important regulator. We'll talk about that a little bit. You can find new pathways. Network analysis can sometimes find new pathways that you didn't really know much about, or new pathway members, so we'll talk about that, and about discovering gene function with GeneMANIA, later. And you can correlate pathways with a disease or phenotype, like I explained with the autism case, and you might be able to find a drug target. That's one of the other advantages I should have put on the pathway advantages slide: not only does it give you insight about mechanism, but once you have some insight about mechanism, you can start thinking about perturbing the system. There are a few ways of thinking about that. You can perturb it experimentally to try to prove your hypothesis, or you could find perturbations that are useful, like drugs. Okay, so today we're talking about pathway enrichment analysis, which focuses on the summarize, compare, and classify questions. We'll get started on network analysis a little bit today, but we'll spend much more time on it tomorrow, on day two, and then regulatory network analysis, about transcription factors and such, is on day three. Okay, so I talked about the data, the blue boxes: you normalize it and score it, and then you get a gene list somehow. 
We'll talk about different ways of doing that from your data, but the basic idea is that you can either use all of your genes, ranked by some score, or choose a score cut-off and say, I want the top thousand. Those are the two ways of doing it. Then you can do pathway analysis here or network analysis here, and both ideally identify some pathway that you might wanna drill down on to get mechanistic insight. Then you can go into this box and see what you can use to drill down. Each of these little yellow highlighted labels is the name of a software tool that helps you at each of these stages, and most of them we'll cover during this workshop. We put together this workflow as a reference for you, to try to integrate all the different things into one workflow. Okay, so any questions, yeah? The ranking can be formed in different ways. For gene expression data, you usually rank by the level of differential expression between your samples and your controls, but you could rank by the p-value of association in your GWAS, for instance. Any way of ranking that makes sense is a valid way of using it. Okay, so I talked about pathway enrichment analysis. I won't really go into this because I think I covered it earlier. The main point I want to make now is that I'm gonna give you an introduction to some background concepts that are very basic. Hopefully it's not too boring, or, to put it more positively, hopefully it's informative for people here who don't know about it already. 
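The two ways of building a gene list described above (keep every gene in ranked order, or apply a cut-off and keep the top N) can be sketched in a few lines of Python. This is only an illustration: the scores are made-up numbers standing in for something like a differential expression statistic, and the function names `ranked_gene_list` and `thresholded_gene_list` are hypothetical, not part of any tool covered in the workshop.

```python
# Made-up per-gene scores, standing in for e.g. a signed differential
# expression statistic (positive = up in samples, negative = down).
scores = {"TP53": -4.2, "MYC": 3.1, "GAPDH": 0.1, "EGFR": 2.5, "ACTB": -0.3}


def ranked_gene_list(scores):
    """Option 1: keep all genes, ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def thresholded_gene_list(scores, top_n):
    """Option 2: keep only the top_n genes by absolute score (a cut-off)."""
    by_magnitude = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return by_magnitude[:top_n]


ranked = ranked_gene_list(scores)
top = thresholded_gene_list(scores, top_n=3)
print(ranked)
print(top)
```

Ranking by absolute score in the cut-off version is one possible choice; ranking by p-value, as mentioned for GWAS, would work the same way with a different sort key.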
Basically, pathway enrichment analysis needs, as input, a gene list from your experiment and a list of pathways, and then you run some statistical scoring system, which we'll talk about later today, maybe even this morning, that gives you enriched pathways, each with a p-value. The two tools that we talk about are GSEA, or Gene Set Enrichment Analysis, and g:Profiler, and we might talk about others. I just wanted to talk about gene identifiers, because when you put your gene list into these tools, you need to know something about gene identifiers and there are some technical points, and also about where the pathways come from, with at least a couple of sources of them. Okay, so, not spending too much time on gene identifiers: they are ideally unique, stable names or numbers. In general, when you hear the word identifier in a technical context, like related to computers, it means the label or number that is used to track database records. So your driver's license number is tracking you in a government database of people who have driver's licenses. The Entrez Gene ID 41232 is a number in the Entrez Gene database from the NCBI that is ideally uniquely and stably keeping track of that gene. However, it gets complicated because gene information is stored in many different databases, so genes have many different identifiers. Genes can also be thought of as DNA, RNA, or protein, if they're protein-coding genes, and each of those can have different identifiers, and you can have different identifiers for different transcripts, the alternative splice variants. 
So it's important to recognize what you're dealing with when you're working with identifiers. Make sure, for instance, that you're working with a gene identifier if you're using a gene-based analysis. If you need to keep track of transcripts, you can't use that, because it's just gonna lump all the transcripts together. As another example, gene records don't store the sequence, they just store the gene, and even genes can have multiple sequences because they can have different transcription start sites, you might have different versions of the promoter, and stuff like that. So that's just a technical detail. Here's a list of examples of identifiers for different types of gene-related concepts. The ones we've highlighted in red are basically the ones we recommend. One of the issues with these gene identifiers is that sometimes databases change the identifier, and then your identifier becomes old, and when you go and search for it in the database you can't find it anymore. That's annoying, because now you've lost data. It happens especially when you start a new project and somebody gives you data that was generated five years ago and says, we need to analyze this with the pathway analysis that you just learned at your workshop, and you have all these problems mapping the identifiers because they're out of date. So there are a number of tools that recognize all the identifiers that are known and map them to current ones, or to some of the standard ones. The other issue is that there are lots of identifiers, and the tools that we use on a regular basis usually only recognize a few, so there's a need to map your identifiers between types. 
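As a rough illustration of what mapping identifiers between types involves, here is a minimal Python sketch. Everything in it is an assumption made for the example: the mapping table is a tiny hand-written dictionary with placeholder IDs (a real table would come from a mapping service), and `map_identifiers` is a hypothetical helper. The point is that a mapping step should report which identifiers could not be mapped, for example because they are out of date, and which map to more than one target, rather than silently dropping them.

```python
# Hypothetical mapping table from platform-specific or old identifiers to
# current gene symbols. In practice this would be downloaded from a mapping
# service, not typed by hand; the IDs here are placeholders.
id_map = {
    "probe_001": ["TP53"],
    "probe_002": ["MYC"],
    "probe_003": ["GENE_A", "GENE_B"],  # one input ID with two targets
}


def map_identifiers(ids, id_map):
    """Split input IDs into cleanly mapped, ambiguous, and unmapped groups."""
    mapped, ambiguous, unmapped = {}, {}, []
    for identifier in ids:
        targets = id_map.get(identifier, [])
        if len(targets) == 1:
            mapped[identifier] = targets[0]
        elif len(targets) > 1:
            ambiguous[identifier] = targets
        else:
            unmapped.append(identifier)  # e.g. a retired or unknown ID
    return mapped, ambiguous, unmapped


mapped, ambiguous, unmapped = map_identifiers(
    ["probe_001", "probe_003", "probe_999"], id_map
)
print(mapped)
print(ambiguous)
print(unmapped)
```

Reviewing the ambiguous and unmapped groups by hand is the part that protects you from losing data without noticing.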
This need to map often shows up when you use some kind of technology that has its own identifiers. People used a lot of Affymetrix microarrays, for example, and each Affymetrix probe had its own identifier, which was different per chip, so you always had to map them to genes. There are ID mapping services like g:Convert and Ensembl BioMart and lots of others that help you map from one identifier to another. Another issue, other than identifiers becoming out of date, is using an identifier that's not standardized and not guaranteed to be unique, like a gene name. We have to recognize that, technically, some of the gene names that we use are unique and some of them aren't. When they're unique, we often call them a gene symbol. That special word, symbol, says that the gene name has been standardized. Usually the people who maintain an organism's genome standardize the gene names, and for human it's HUGO, the Human Genome Organisation, through its Gene Nomenclature Committee. They're responsible for naming genes and they assign gene symbols, which are ideally stable, but they actually do change occasionally, which is annoying. But if you just go to the literature and look at protein and gene names, it's a Wild West. People can name things anything they want, and names can overlap. So you might see a name like LFS1, but there might be 10 genes called LFS1, and actually TP53 is the symbol for that, so the name and the symbol don't even have to be related. You also have to be aware of one-to-many mappings, which can happen, for instance, if you're converting genes to transcripts, and that can get messy if you're trying to think in terms of one gene, one ID. Another important note is that if you use Excel as a spreadsheet, it can introduce errors in your gene list. How many people have had problems with this in the past? 
So every time I ask this question, more and more people put up their hands, which I think is good because more and more people are doing this analysis. The issue is that sometimes gene names, or gene symbols, get recognized by Excel as special sets of characters, and it will convert them to a date, for instance. So Oct4 is a particularly important transcription factor for stem cells, and if you just copy and paste it into Excel, it will convert it to October 4th, because Excel's made for people who usually want to type in Oct4 and get a date. This is okay if you just have a few genes and you notice it, but if you paste a thousand genes into Excel and it changes one all the way at the bottom, off the screen, then you don't notice it and you've ruined some of your data, basically. And interestingly, if you go to the database for Oct4, or some of the other genes that Excel auto-converts, those auto-converted names have sometimes gotten back into the database via the literature or something like that, because someone's published them without noticing. So it's just messy. The way to get around this is to always make sure you paste as text. That's the main way in Excel. Yeah? Does g:Profiler have ID mapping? g:Profiler has a number of tools. One of them is called g:Convert, and that helps you with the mapping. Okay, so just to emphasize the take-home message, here's an example of a paper that was published in 2003, when microRNAs had kind of just been discovered and there was a lot of buzz about them. These people had reported one of the first mechanistic analyses of a particular microRNA, and they said that HES-1 is targeted by microRNA-23. 
And they had to retract the paper, unfortunately, because when they did their search, there were two genes linked to the name HES-1, and they used the wrong one, and they used it differently in different parts of their experiment. So their results were basically not right at all. Unfortunately, you have to be careful to keep track of everything like this. Okay, so the general recommendation, for proteins and genes when you're not considering splice forms, is to map everything to Entrez Gene IDs or official gene symbols using a spreadsheet and the tools that you have available, and to review it to make sure you're not introducing errors. This doesn't always work 100%. Sometimes the identifiers that you use are for genes that we really don't know much about, where the genome annotation isn't even finished for that gene, and every time you download a new genome build you'll get a different version of that gene, sometimes considered a gene and sometimes not. So there's a class of identifiers that are just difficult to work with, and the mapping likely won't be 100% complete. Yeah? Yeah, they did their experiments on the second gene but then made conclusions about the first one. They could have avoided it by keeping track of the sequence and using a better identifier. They used the gene name to do all their searching, which was not a standard identifier. If they had used an accession number or an Entrez Gene ID, they would have known what they were working with, because that's a unique identifier. It's very straightforward to avoid this by using identifiers instead of gene names to keep track of things. Okay, so that was a quick overview of identifiers. I'm gonna move on to the next thing, which is pathways. 
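Before moving on, one way to catch the Excel date problem described earlier is to scan a gene list for entries that look like dates after a round trip through a spreadsheet. The Python sketch below is a heuristic under stated assumptions: it only looks for short forms such as "4-Oct" or "Mar-01" with English month abbreviations, the pattern and the function name `find_date_like_entries` are made up for this example, and it cannot tell you what the original symbol was.

```python
import re

# English three-letter month abbreviations that a spreadsheet date might show.
MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"

# Matches day-month ("4-Oct") or month-day ("Oct-04") with -, / or space.
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}[-/ ]({MONTHS})|({MONTHS})[-/ ]\d{{1,2}})$",
    re.IGNORECASE,
)


def find_date_like_entries(gene_list):
    """Return entries that look like dates rather than gene symbols."""
    return [g for g in gene_list if DATE_LIKE.match(g.strip())]


genes = ["TP53", "4-Oct", "2-Sep", "OCT4", "Mar-01"]
suspicious = find_date_like_entries(genes)
print(suspicious)
```

An intact symbol such as "OCT4" is not flagged because it has no separator between the letters and the digit; anything that is flagged should be checked against the original data.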
We'll learn more about pathways and where they come from later, but I wanted to go over one source of pathways that's very common, and that's the Gene Ontology. How many people here know about the Gene Ontology? Okay, so a good fraction, but not everybody. So I'm gonna review the Gene Ontology to make sure everyone knows. The Gene Ontology is one of the major sources of biological pathway information for gene set enrichment analysis. Almost every enrichment analysis tool that you go to online will use the Gene Ontology primarily, often supplemented by other databases, but it's the biggest source of gene sets that are related to function. Basically, the Gene Ontology is a bunch of biological words or phrases called terms. Each one has a definition. They might say protein kinase, or apoptosis. The word ontology means a formal system for describing knowledge. So it's not just a list, it's not just a dictionary for biology. It also has a structure. The terms are related in a hierarchy. The top of the hierarchy has very general terms and the bottom has very specific terms, so it can describe gene function at multiple levels of detail. Here you have B cell apoptosis at the bottom, and it's a type of apoptosis, which is a type of programmed cell death, et cetera. The Gene Ontology currently covers three major areas of biology: where things are in a cell, called cellular component; what the enzymatic or biochemical function of the gene is, which is molecular function; and biological process, which is where the pathways are. For almost every pathway that you know about, like the Wnt pathway or the cell cycle, you can go find a Gene Ontology term, and it will be associated with a set of genes. So terms are the first of the two concepts that the Gene Ontology captures. There are tens of thousands of terms, as shown here in the statistics. They add to them all the time, so it's always getting updated. 
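The hierarchy described above, with specific terms at the bottom and general terms at the top, can be pictured as a small graph where each term points to its parents. The Python sketch below uses only the three terms from the slide example; the `parents` dictionary is a toy fragment written for illustration, not a copy of the real ontology, and `ancestors` is a hypothetical helper name.

```python
# Toy fragment of a term hierarchy: each term lists its direct parents.
# Only the terms named in the example are included; the real ontology
# continues upward from here.
parents = {
    "B cell apoptosis": ["apoptosis"],
    "apoptosis": ["programmed cell death"],
    "programmed cell death": [],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen


print(ancestors("B cell apoptosis", parents))
```

A gene annotated to the most specific term is, logically, also a member of every term that this walk returns, which is how one annotation can describe function at several levels of detail.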
The second part of the Gene Ontology is annotations, which is where curators and other systems take the terms and link them to genes. This can be done by a trained curator who's reading the literature and says, this paper shows an experiment where this gene is part of the cell cycle, so I'm gonna take this gene identifier and link it to the cell cycle term. When you do that, you also capture some of the evidence, like what the paper is and what experiment they used: did they use a physical interaction experiment, or did they knock the gene out? That kind of stuff is captured in the Gene Ontology. But a lot of Gene Ontology annotations are created automatically without any human review, so it's important to understand that. The hierarchical nature of the Gene Ontology, or GO, can sometimes create some headaches for you, especially if you take a gene and map it not only to the terms it's directly annotated to but also to all of the parents of those terms. That can create an explosion of terms associated with your genes. Usually the enrichment analysis tools handle this for you, so you don't have to worry about it too much these days. The nice part of the Gene Ontology is that there's a lot of manually curated, very high quality annotation, but that's time-consuming to create, so they also have automated systems. One type of automated system is computational analysis that automatically assigns terms to genes, and there are two types of that. One is the type that's reviewed by people, who make sure that the system is working well, and some of that computational analysis can be extremely accurate, while some is not. For instance, you can identify membrane proteins extremely accurately using computational tools. There's a tool that looks for transmembrane regions in proteins and it's like 98% or 99% accurate. Some other tools are less accurate. 
So the curators just review that it's working well. And there's also annotation derived without any human validation, which tends to be lower quality. People consider it lower quality, but sometimes you have to use it because you don't have any high quality information. This happens in every organism, but it's especially true if you're using an organism that has recently been sequenced and that people don't do a lot of experiments on, so a non-model organism. Then you usually have to map your data from another organism, and that's electronic mapping. So the key point is to be aware of the annotation origin. The useful thing in GO, as I mentioned, is that they keep track of the evidence for annotating a term, and there are all these evidence codes which you might see popping up, like TAS, which is Traceable Author Statement. The one that's unreviewed is IEA, Inferred from Electronic Annotation. Often people will remove that if they can, which means that they keep only the other, direct evidence sources. As I mentioned, there's a lot of variability in the coverage of the different organisms, and here's a plot from the Gene Ontology statistics page that shows the number of blue, or non-experimental, annotations and the number of green, or experimental, annotations. You can see that human has the most green and rat has the most blue, but actually there are quite a lot of organisms that are annotated. Just for your information, a lot of databases contribute to this system. It's not important which ones, but just so you know, there are a lot of people involved. The Gene Ontology also maintains a concept called a slim version. Yes? What's non-experimental? Non-experimental is what they call inferred from electronic annotation. So somewhere an experiment has been done, but the information is being copied. It wasn't an experiment on that gene. It may have been an experiment on a gene that had sequence similarity to it, or something like that. 
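Removing the unreviewed IEA annotations, as described above, amounts to filtering annotation records by evidence code before building gene sets. The Python sketch below shows the idea on made-up records; the tuple layout (gene, term, evidence code), the placeholder gene names, and the helper names `drop_electronic` and `gene_sets` are all assumptions for illustration and do not reflect any particular file format or tool.

```python
# Made-up annotation records: (gene, GO term, evidence code).
annotations = [
    ("GENE_A", "cell cycle", "TAS"),
    ("GENE_B", "cell cycle", "IEA"),
    ("GENE_C", "apoptosis", "IEA"),
    ("GENE_C", "apoptosis", "TAS"),
]


def drop_electronic(annotations, excluded=("IEA",)):
    """Keep only annotations whose evidence code is not in `excluded`."""
    return [a for a in annotations if a[2] not in excluded]


def gene_sets(annotations):
    """Group genes by term, giving one gene set per term."""
    sets = {}
    for gene, term, _evidence in annotations:
        sets.setdefault(term, set()).add(gene)
    return sets


all_sets = gene_sets(annotations)
reviewed_sets = gene_sets(drop_electronic(annotations))
print(all_sets)
print(reviewed_sets)
```

In this toy data GENE_B disappears from the cell cycle set once IEA is excluded, while GENE_C stays in apoptosis because it also has a reviewed annotation. That trade-off between coverage and quality is the thing to be aware of, especially for organisms with few experimental annotations.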
Does that mean that green is more credible than blue? Yes, green is more credible than blue. I should have used the same color scheme everywhere to make this clear, and I'll do that next time, but the red part here is the higher quality, more credible evidence types, and the blue one here is the inferred from electronic annotation. As I mentioned, sometimes it's good to leave those out. I think on g:Profiler you can just say no IEA, right? Anyway, I think there's a button that says don't include the electronic annotation ones. Yes, they're curated. The green ones are curated or reviewed computational analysis. So they're all looked at by a person, but some of them are curated in great depth and some in less depth, and then the blue ones are not looked at at all. Nobody looks at them, yeah. So you might be looking at data that was completely generated by a computer program, and nobody ever bothered to check that it was correct. I mean, obviously the developer wants to make sure that it's as correct as possible, but there could be systematic errors in that data, or a lot of wrong information. But it is useful, as I mentioned, if you're working with an organism, for instance, that hasn't had a lot of experiments done on it. These are for genes, yeah. So these are annotations, which are the terms associated with the gene, yeah. Okay, so just a quick comment: one of the issues people sometimes have is that there are too many GO terms, like 26,000 is too many. So the Gene Ontology also maintains a slim version, which has fewer than 50 terms or something. There's a generic one and one for plant and yeast, and maybe human now. There are a lot of tools that support the Gene Ontology, so you'll see it used in a lot of different software. The one that I like for just exploring the Gene Ontology is called QuickGO. 
So that's why I put this here, just for your information. You can go look at QuickGO and browse around the Gene Ontology. You should just know that there are other ontologies; it's not just the Gene Ontology. There are ontologies for tissue names and lots of different things. Okay, so pathway information. The Gene Ontology is a very rich source of pathway information, but there are also hundreds of other pathway databases, and even other sources of gene sets, and I just list some here. MSigDB is made available by the Broad Institute, and it collects pathways and also other types of gene sets. One of the other types might be a signature for a disease that was published, where you don't know what the pathways are but it's used to classify disease versus not. Pathway Commons is a project that we work on that collects a number of major pathway databases into one location and is trying to be a convenient single point of access for pathways, and we're building it over time. We'll talk more about that tomorrow, because Lincoln Stein, who runs the Reactome database, which is one of the major pathway databases and is headquartered here, will be talking about it in more detail. Okay, so there are lots of other types of annotations that you could use to create gene sets, like chromosome position, disease association, or whether a protein has a particular protein domain or not. You could use these things as gene sets in your enrichment analysis as well. I really focus on pathways. As I said, I like them as the starting point in an enrichment analysis because they're most easily interpretable, but you could use other types of gene sets. I'm not gonna go into them in too much detail, but there are a lot of places you can get this information. One of them that I like is called BioMart. 
BioMart, this slide just shows you that if you select genes and your organism, as long as the organism is in Ensembl, which has quite a few, then you can choose a set of genes in your filters. You can say, I want all the genes on chromosome one, or you can put in the thousand genes that you have in your gene list, and then you can download information about those genes, like their sequences or their protein domains or their Gene Ontology terms. Sometimes that's useful if you just wanna download a lot of information for a set of genes. Okay, so what have we learned? Just to summarize, there is a lot of information about genes in various databases. I like to start thinking about pathway analysis with pathways. The biological process part of the Gene Ontology is a good source of that information, as are various pathway databases like Reactome, which we'll learn more about. So that's good background for everybody. Okay, so that pretty much wraps this part up. I'll just end by summarizing this workflow again, to emphasize it. This course is really about taking genomics data that you have and converting it somehow to gene lists, and we can talk about that. I should say that we often focus on gene expression, but not always; there are some cancer genomics data sets that we're also using. One of the advantages of this course is that you can put up your hand during the lab sessions and speak to instructors and TAs to get recommendations for your data if it doesn't fit the kind of examples that we're using. We're happy to answer as many questions as we can about your data and your experiments, and to try to understand whether you can use a particular type of analysis or not, and where you fit on this complicated map, for instance. 
Okay, so I also included something for people who are interested and have never looked at gene identifiers. I think I forgot to share this text file with the course material, and I'm not gonna go through this, but we'll make a gene list available that you can use, if you don't have one, to try out gene identifier mapping systems. I might put some more information about this on the online notepad, because I realized I forgot a URL here as well. Okay, so that's it for the morning intro. Any questions? Yep? So the question is, does the size of the database affect enrichment analysis? Yes, the sizes of your list and of the database do affect enrichment analysis, and we'll talk about this in more detail later. If you give it one gene, it's not gonna find any enrichment, because it just can't, right? You don't have a lot of dynamic range there to work with, so you need more than a few genes, like five, 10 or more, probably. Usually we have in the hundreds. And if you give it the whole genome, it's not gonna find any enrichment either, although you will give it the whole genome for ranked lists. But I'm getting into too much detail and we're covering it all later, so I'll just say that the size of your gene list affects things. The size of the database, meaning the number of pathways that you have, affects the multiple testing correction that you have to do, so the final p-value that you get out of the results will be affected, and that will also be covered later. Okay, yeah? In the self-construction of the gene, so as of this... It goes into one, but because of the way the ontology works, where the term at the bottom is a subclass of the one higher up, logically it also fits in the other ones. 
So even though Gene Ontology doesn't say this gene is part of B cell apoptosis, apoptosis, programmed cell death, et cetera, it just says this gene is part of B cell apoptosis, but the computer can just follow this path upward and it can add these labels. Sorry, where's my mouse pointer here? It can add all of these labels to the gene. The advantage of that is that sometimes you have a set of genes and some of them are annotated to specific terms and some of them are annotated to general terms. So if you only consider the specific term, you'd miss out on the genes that are annotated only to the more general terms. Moving up the hierarchy allows you to collect them all into the more general terms. So if you put the gene to the green box, what does it do? It gives you both. In QuickGO, it will give you the green box, but then you can click to find the rest of them. Do you want to know which is the most important one? The most important one is always the most specific one that's actually directly annotated. It's always the green one. Yeah, it won't be green all the time, but in this case. Any other questions? Yeah? I just had a question. So presumably genes can, same gene can have different function in like different cell or different genes. Yes. Depending on its binding factor. So do these databases also take that into consideration? Yeah, so the question is, sometimes genes can have multiple functions. Thank you for asking that question, because I forgot to emphasize it around this section here. There are multiple annotations per gene, so the gene can have multiple functions, and the databases definitely consider that, but they might not have all the functions for the gene. So there are a few ways that they can miss information. One, the data can be unknown, and we just don't know that a gene is important in the brain, for instance, when it's been studied mostly in the heart or something like that.
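The "follow the path upward" step from the ontology answer above can be sketched in a few lines. This is a toy illustration, not how any particular tool implements it: the hierarchy contains only the three terms named in the lecture, the gene names are hypothetical, and real Gene Ontology terms can have several parents, which the code allows for.

```python
# Toy is_a hierarchy: child term -> list of parent terms.
PARENTS = {
    "B cell apoptosis": ["apoptosis"],
    "apoptosis": ["programmed cell death"],
    "programmed cell death": [],
}

# Direct annotations only, as a database would store them.
# GENE_A and GENE_B are hypothetical gene names.
DIRECT = {
    "GENE_A": ["B cell apoptosis"],  # annotated to a specific term
    "GENE_B": ["apoptosis"],         # annotated to a more general term
}


def ancestors(term, parents):
    """Return every term reachable by walking upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct, parents):
    """Map each term to the genes annotated to it directly or via a subclass."""
    term_to_genes = {}
    for gene, terms in direct.items():
        for term in terms:
            for t in {term} | ancestors(term, parents):
                term_to_genes.setdefault(t, set()).add(gene)
    return term_to_genes


gene_sets = propagate(DIRECT, PARENTS)
for term in PARENTS:
    print(term, sorted(gene_sets.get(term, set())))
```

After propagation, the general term collects both genes, while the specific term still holds only the gene that was directly annotated to it, which is why the directly annotated term stays the most informative one.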
Two, it can be present in the literature but not yet present in the database. In that case, the databases are failing us because they're not complete. They don't have all of our knowledge in there. Those are the two main ways. And that's the same for any of the functions of the gene. There was a good example recently, you may have even seen it, because it was related to brain expression. I think it was a histone. I can't remember now. It was some type of gene that was very well studied in a particular context, I think it was actually related to chromatin modification or histone regulation, and you would never guess, but people found recently that it had a completely different function in the brain, related to neurotransmission or something. Somebody here might remember; it was a Nature paper within the last year. So the gene had only been studied in relation to chromatin. It was called chromatin binding factor or something like that, or histone something. But it had some kind of enzymatic function. I don't remember the details and I'm inventing things, but it was a totally different function in the brain that they only discovered now. Once they discover that, it will go into the database, but you have to be careful not to judge the gene too strongly based on the information we know about it. I need to be careful how long I'm spending on questions, because we have break time now. But a general take-home message is that genomics gives us the opportunity to be unbiased and make new discoveries that we couldn't make if we were only thinking about the genes that we know. We have to be careful that we don't fall into the trap of just thinking about things the way that we know about them. And it's hard to do that, because we don't know about the other functions.
But that's a good take-home message: when you see data coming out of genomics, it's nice to think, even if it doesn't make sense, how could this relate to what I'm thinking about? Because it could be that the genes that are coming out are important, but they've all just been studied in another context. And you could make a big discovery that way, just by questioning the databases. Yeah? Yes, so the question is, is the HUGO gene symbol the same as the official gene symbol? Yes. Depending on your organism, it'll have different names. Yeast genes are handled by the yeast genome database, so it's the yeast genome database symbol. For human, it's the HUGO gene symbol. Yeah. Any other questions? Yeah? You say about the blue bar is that it looks at how... It talks about modeling and how... Yeah, it would be important. The reason I didn't focus on it is because it's not the most common. It would be very nice if we could use those tools, but they aren't as well developed at this stage in 2016, although, you know, it's an interesting concept to think about. So basically this is an emerging research area in computational biology. You'll hear about some of it in the course, but it's just not widely accessible yet. People haven't figured out all the details about how to make it work well. Give an example of a name. Well, PARADIGM is one that was recently implemented in ReactomeFIViz, which people will talk about tomorrow. Yeah, PARADIGM is a fairly popular one. Okay, maybe... This particular function is only in the right... Gene Ontology may provide some information about that, but probably most of the time not. My approach to that type of problem, which is related to the general idea of what the functional context of the gene is, is that if we can get context information from genomics data, then use that.
So for tissue expression, depending on your organism, for mammalian systems, there's GTEx, the Genotype-Tissue Expression consortium. They've done RNA-Seq for dozens of tissues, and that data is publicly available, so you can put your genes into that system and then see which tissues they're known to be expressed in. And similarly, there are atlases at the protein level. So there are proteomics atlases that tell you, among 30 or 40 different tissues, have we seen this protein expressed in this tissue? So that's the approach I usually like to use. Try to use the genomics data if it can give us information about the context. It's not perfect, but... Okay, so we're on break.