Good morning, everybody. It's great to see so much diversity in all the different experimental platforms and gene lists that people have brought to the class. So what I'm going to do now is give you a general introduction and a little bit of an overview of what we're going to talk about in the course. Most of what we end up focusing on is gene expression type examples, though not always. Actually, a couple of the pathway analysis examples that we'll give you use copy number variation and DNA methylation data. But a lot of the labs are focused on gene expression. The reason for that is that gene expression has traditionally been the majority of the genomics data that people have been analyzing with pathway analysis, and a lot of tools have been developed for gene expression analysis. But there are also a lot of additional tools. There are tools for metabolomics and other things like that that we know about and can talk to you about, but they might not come up too much in the examples in the lecture. The concepts that we're going to be talking about are very general, though, and can be applied to any type of gene list. And that's what we've tried to focus on because of the diversity of possible experiments that people do. We've tried to be general and concept oriented to allow you to translate that knowledge to whatever you're working on. OK, so the motivation for this course is the problem that people often have when they start using large scale omics data: genomics, proteomics, metabolomics. It takes a long time when you're initially starting out to organize everything and get it working, unless you have a nice core facility. Once you do that, you're happy that it's working. But then you might get a lot of information coming out of these systems. You might get thousands of hits, thousands of genes. And then the question is, how do I deal with all of this information?
Am I really going to have to go through each of these genes one by one and look them all up in the literature? So one of the traditional ways that people over the past 10 years or so have been trying to analyze this type of data is to work with software and algorithms that take a gene list and tell you something about what's interesting about that gene list. And generally, interestingness means enriched in known pathways, complexes, and functions. The reason why those are usually interesting is that when people do a genomics or omics experiment, they usually want to know something about the mechanism that's underlying what they see. So if they see a lot of changes in gene expression and a tool can tell them that the molecule responsible for all those changes is a single transcription factor or microRNA, that would be very valuable for understanding the mechanism underlying the observation of all those genes changing. And that's also a hypothesis that you can go test by overexpressing or knocking down a microRNA. So people often start with a lot of data. They might rank it in some way or cluster it. Either way, you've got a gene list. The gene list could be ranked if you're using some kind of ranking technique; for example, the genes might be ranked by differential expression. And then the general idea is that we want to combine all the previous knowledge that we have about genes and come up with some interesting finding in this list. And it obviously saves time compared to the original approach where you're looking in PubMed for each gene one at a time. So as I was saying, and this is a little bit of a summary, pathway analysis helps you to gain some mechanistic insight into genomics data. It might involve identifying a master regulator or drug targets, or characterizing pathways that are active in a sample, which is a little bit more descriptive.
But it might just tell you something about what's going on in that sample: are the cells growing, are they dying? For the purpose of this course, pathway and network analysis is any type of analysis that involves pathway or network information, and I'll talk more about what that information is. It's commonly applied to help interpret lists of genes. The most popular type is pathway enrichment analysis, which is what we'll focus on today, but there are many others that are useful, as you'll see. Pathway enrichment analysis, and this is just a very brief version of it because we'll come back to this many times today, is trying to see if there are pathways that are enriched in your gene list more than you'd expect by chance. So if you have 100 genes and 50 of them are involved in the cell cycle, and you look in the genome and only 1% or 5% of the genes in the genome are involved in the cell cycle, then the fact that you have 50 out of 100, or half of your gene list, in the cell cycle is highly enriched compared to what you would expect if you were just picking genes randomly from the genome. So that's a very important principle: statistical enrichment of some signal in your data, where that signal in this case is some pathway or network information. OK, so a couple of examples, just to give you a sense of the utility and the power of some of these types of analyses. These are two really nice examples that showcase the method. There are many out there, and these are two that we were involved in. This first example is studying autism spectrum disorder with Steve Scherer, who's an autism researcher at the University of Toronto, at SickKids. Autism is highly heritable. Depending on your stringency of diagnosis, there's a high twin concordance. And when this project was started, there were 5% to 15% single gene mutations that were known to be related to autism, including chromosomal rearrangements.
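As an aside on the cell cycle example a moment ago (50 of 100 list genes in a pathway that covers about 5% of the genome), here is a minimal sketch of how "more than you'd expect by chance" could be quantified. It assumes a one-sided hypergeometric test, which is one common choice and not necessarily the test used later in the course. The genome size of 20,000 genes is also an assumption made only for illustration, and the function name `enrichment_p_value` is hypothetical.

```python
from math import comb


def enrichment_p_value(genome_size, pathway_size, list_size, overlap):
    """One-sided hypergeometric tail probability.

    Probability of seeing at least `overlap` pathway genes when
    `list_size` genes are drawn at random, without replacement, from a
    genome of `genome_size` genes that contains `pathway_size` pathway genes.
    """
    total = comb(genome_size, list_size)
    tail = sum(
        comb(pathway_size, i) * comb(genome_size - pathway_size, list_size - i)
        for i in range(overlap, min(list_size, pathway_size) + 1)
    )
    return tail / total


# Assumed for illustration: a 20,000-gene genome, 5% of it in the cell cycle.
GENOME_SIZE = 20_000
CELL_CYCLE_GENES = 1_000
# From the lecture example: 100 genes in the list, 50 of them in the cell cycle.
LIST_SIZE = 100
OVERLAP = 50

expected = LIST_SIZE * CELL_CYCLE_GENES / GENOME_SIZE  # genes expected by chance
fold = OVERLAP / expected                              # fold enrichment
p_value = enrichment_p_value(GENOME_SIZE, CELL_CYCLE_GENES, LIST_SIZE, OVERLAP)

print(f"expected by chance: {expected:.0f} genes")
print(f"fold enrichment: {fold:.0f}x")
print(f"p-value: {p_value:.3g}")
```

Under these assumed numbers, only about 5 cell cycle genes would be expected in a random list of 100, so observing 50 is a tenfold enrichment with a vanishingly small tail probability.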
But about seven or eight years ago, people discovered that copy number variation is very important in autism genetics. In particular, there was some evidence that de novo or rare copy number variants were involved. So what this team did was analyze copy number variants, focusing on rare ones. Copy number variants are whole regions of the genome that are deleted or amplified. They used a SNP chip that has about a million SNPs marking different regions of the genome, spread quite evenly across the whole genome. And there were about 1,000 cases and controls. If there's a series of SNPs in the genome that are not detected in the DNA sample, that would be a deletion. Otherwise, if there's a higher intensity at that chip marker, you'd expect that that might be an amplification. They were looking for rare copy number variants, which means that they only looked at ones that were present at 1% or less frequency in the whole population. When they did this, they were initially looking to see how copy number variants affected genes. And they had just a handful of genes that were affected by these CNVs and were associated with cases. So what we looked at is how copy number variants are affecting pathways. And when we did that, we actually found a rich set of pathways that are affected by copy number variation. There are all sorts of pathways here. We'll be teaching you how to make these types of diagrams later today. Each one of these little symbols, triangle or circle, is a pathway, and they're organized into groups, like cell proliferation. One of the groups that was quite interesting was central nervous system development. A lot of the pathways that came up were already known to be related to intellectual disability and autism.
So this provided a much richer picture than looking at things just gene by gene. One of the interesting things with this analysis is that when we looked at individual pathways, we found that the same gene in that pathway wasn't mutated over and over again in different samples. That's to be expected, because we're focusing on rare copy number variants. Instead, you had 20 or 30 or 100 genes in the pathway, and that pathway could be affected differently in every patient. Each patient had a different way of deleting a gene in that pathway. A different gene was deleted. And only when you put all of that information together, recognizing that those genes are part of the pathway, did you see that there's this really strong signal. Looking at individual genes, each gene's different, so I don't see this recurrence. There's no signal. But when you put all of that information together, you see a dozen or so patients all affected in a particular central nervous system development pathway, because a gene is deleted in that pathway somehow. So please interrupt with questions if you have any. Just put up your hand. I'm just going to go through this, and I'm happy to take questions in the middle. OK, so this example really illustrates how you can gain statistical power using a pathway analysis approach. In this case, it was able to collect rare counts of information from different genes and put them together into one bin. Then you can do one statistical test on a pathway and you get a much stronger answer, because instead of working with one count at a time in different patients, now you're working with a dozen counts. So that's a very important concept that's very useful. The second example is a more recent study. This one is also brain related. It's work with Michael Taylor, who's a neurosurgeon, also at the Hospital for Sick Children. He studies cancer.
One cancer in particular that he studies is ependymoma, which is a cancer of the ependyma, which is a lining of the central nervous system. There are a number of different types of this cancer now known. One of the most common anatomical locations is the posterior fossa, which is the brainstem and the cerebellum. It's the third most common brain tumor in children. A few years ago, Michael used gene expression analysis to find that there were two major types of posterior fossa ependymoma: type A, which affects the youngest patients and has a really bad prognosis, and type B, which affects the oldest patients, still children, but actually has an excellent prognosis. So even though the pathologists, when they look at these samples, know they're coming from the same brain region and can't really differentiate them, when you look at gene expression it turns out that there are basically two different diseases. One has a terrible outcome, and one has a great outcome. It's really different. We also looked at the pathways involved in this and saw that there were very different pathways activated in these two types. So that's another use of pathway analysis: to support your statement that you have two different biological functions, for instance, in the samples that you're looking at. So if you sort cells with a flow cytometer, for instance, and you do some gene expression on two different cell populations, and you see very different pathways coming out from each one, that might support that you have two different functional cell populations. So that's another example. Michael and colleagues followed up on this by doing a whole bunch of genome sequencing. And unfortunately, and this is the first time we've ever seen this, there were basically no mutations found.
So especially in this type A, the serious type, there were no recurrent mutations, and the patients only had one or a couple of mutations per patient, even with whole genome sequencing. It's the first time that's really been seen in cancer, because a hallmark of cancer is genome instability and you usually see lots of mutations. But in this case, you don't see that. One of the reasons might be that the children are very young, and they haven't had a lot of time to acquire mutations. But it also might be that this cancer is different from other cancers. So even doing pathway analysis on this doesn't really find anything, because there's no signal at that level. So he moved on to look at methylation of DNA at CpG islands, which are concentrated in promoter regions of genes. The idea is that if the promoter region of a gene is highly methylated, the gene's expression is likely to be silenced. And the result of this was actually a very clear signal separating these two subtypes. So it seemed that methylation was an important aspect of the difference in gene expression that was seen. There were about 2,000 genes that were significantly differentially methylated. His lab actually looked at this list with standard pathway analysis methods, and it didn't really show much. We ended up doing a little bit more detailed work on it. One, we used a bigger database of pathway information that we collect in our lab, and we can tell you about this during the lab, just for convenience. We collect a lot more information than may be available in standard tools. And two, we used a more appropriate statistical test that was tailored to the data. In this case, it was also rare information, low counts. So we can talk a little bit about that.
We won't go into too much detail on choosing different statistical methods if your pathway analysis isn't working, unless you have specific data that we see needs that. But the interesting thing here was that a single pathway came out. We searched 10,000 to 15,000 pathways, and only one was significant. It was targets of a protein complex called PRC2, Polycomb Repressive Complex 2, which is involved in methylating histones and subsequent methylation of DNA. So it definitely seems to be relevant. And when we told Michael that, and he started looking into this complex, he realized that people are really interested in this complex from a drug discovery point of view. GlaxoSmithKline is building a molecule that inhibits the methylation enzyme in this complex, EZH2. And there are drugs on the market that repress DNA methylation. That's quite interesting because this tumor doesn't have any known chemotherapy. Basically, the standard treatment is radiation and surgery. It's very debilitating to do brain surgery on anybody, but especially children, so they really want to avoid that. So this represents the first rational target that was discovered in this tumor. That highlights this idea that you might get some mechanistic insight into your data by doing pathway analysis. In this case, the mechanistic insight is that this protein complex seems to be activated. And in this case, we were really lucky: only one thing came up, which is very unusual. And they were actually able to get a drug and try it out in a patient. This patient had a tumor that had metastasized to their lung and was doubling in size in two months. They gave one course of the anti-DNA-methylation drug that was on the market, and it stopped the tumor growing basically right away. The patient actually felt better and was able to start running around.
And that lasted for over 15 months, which was basically 15 months more than would have been expected for the kid. Yeah? Yeah, it's a very different kind of switch drug than that. Yes. So Quaid is going to talk all about Q values and P values after this in the morning. Yeah. So this is hopefully motivating you guys to see the power of this, and then you'll be happy to learn about Q values. And we chose the instructor whose name starts with Q to talk about that. Quaid, sorry, bad joke. OK. So those are two examples that illustrate some exciting discoveries that you can make with pathway analysis. To summarize the benefits of pathway analysis versus analyzing your data transcript by transcript or SNP by SNP: it tends to be easier to interpret because it works with familiar concepts, pathway information that biologists learn about. It identifies possible causal mechanisms, which can be very useful. It can predict new roles for genes. So you might see a gene that you don't know anything about, and it's acting similarly to a whole bunch of other genes that are all in the same pathway or network region. Quaid will talk about that later when we talk about GeneMANIA and gene function prediction. It improves statistical power, as I mentioned. Generally, you have thousands or tens of thousands of elements. You might even have millions if you're working with GWAS data, like millions of SNPs. And when you're doing certain tests, each one of those needs to be analyzed with its own statistical test. When you're working at the level of pathways, you have many fewer tests, and that reduces the problem of multiple testing correction, which Quaid will talk about. And as I mentioned, it aggregates data from multiple genes into one pathway. So those are two different ways that it improves statistical power. It's often more reproducible.
So one of the things that people noticed when they were trying to use gene expression data to create biomarkers that are diagnostic of particular diseases or predictive of outcome is this. Different groups who, for instance, were studying breast cancer and wanted to predict whether the breast cancer was going to metastasize would collect different samples, and they would all collect gene expression, and then they would create a biomarker. And the biomarker genes that group A found were totally different from the ones that group B found. However, if you look at the pathways that they relate to, the pathways were often very similar. So that's what I mean by it might increase reproducibility across data sets. And it facilitates integration of multiple data types. You might be collecting different types of data, and you can look at everything at the pathway level and see how they relate, which might not be as easy if you're treating all the data sets independently and not looking at the pathway level. OK, so any questions so far? So we talk about pathway and network analysis. Many of you probably know about pathways. The standard pathway that we learn about in biology is a kind of process diagram. This binds to this, and then this happens, and there are a lot of details. This phosphorylates that. So it's often very detailed, high confidence information. It comes from potentially decades of literature. The EGF receptor pathway has been studied for 65 years or so. On the other hand, network information looks like this. You might have just connections between genes. Each of these circles represents a gene. You might have some information about which gene represses another gene or which gene activates another gene, but you might not. It might just be A binds to B or A is co-expressed with B.
And typically, this information is a little bit more noisy because it often comes from large-scale studies. However, it often has broader coverage of the genome. If we just want to focus on pathways that are very well studied, we might only be able to cover 25% of the genome. But if you consider all the high throughput data that's being published, you might get closer to 100% of the genome. That might be important when you're working with genomics or omics-type data, because omics-type data often gives you information about the whole genome, and so you might get information about parts of the genome that people haven't studied before. You won't find that information in a pathway database, but you might find it in a network database. So this figure is from a review that's going to be published very soon in Nature Methods, I think. It summarizes three main types of pathway and network analysis. One is enrichment of fixed gene sets. That's what we'll focus on today, and that's what tools like GSEA, Gene Set Enrichment Analysis, and g:Profiler, which we're going to cover, do. The idea here is similar to the autism case that I mentioned, and to the cell cycle example where you have overrepresentation of the cell cycle in your list of 100 genes. Basically, what it does is find known biological processes that are altered in your sample. This slide is from a more cancer-oriented perspective, because this review article, when it's published, is a little bit more cancer-oriented. The second type we call de novo subnetwork construction and clustering. This uses network information, whereas the first one uses pathway information that's represented as gene sets. We'll talk about this more, but we basically take information like this, a detailed pathway, and throw away all the details. We just say this set of genes is in a pathway.
Like, here are the genes that are in the cell cycle. That's the way that we represent pathways for this first type. For the second type, we use a network. That network might be very big. It might involve 10,000 or 15,000 genes. We layer on the data that we have, like gene expression data, and then we try to see if there's some region of the network that's significantly changing in our data. So if it's gene expression data, we would look for differential gene expression, but in a region of the network that's densely connected. What that's interesting for is that it might identify new pathways that you don't know much about. Sometimes we might call those modules, or systems, if they're connected to each other more than you would expect. We might not know much about them, and this type of analysis would give you those modules. And then there's also pathway-based modeling, which is a little bit more detailed. So how are pathway activities altered in a particular patient? It's a little bit more mechanistic. We're actually not going to cover this in too much detail. These methods are a little bit more tricky to get working these days. They're a little bit newer, and they require more information. For instance, they often require multiple types of genomics data. One pretty powerful one is called PARADIGM. With that you can get, I think, more mechanistic information and look at it on a per-sample basis. So we'll focus on other ways of getting that information, and you can ask us questions about these more advanced methods. So I'm now going to go over a workflow that we've created that tries to summarize a lot of the information in the course. It doesn't include everything. In particular, it's focused mostly on the first two days of the course. The third day is more about gene regulatory network analysis, and there are many elements of that related to this. But this workflow is focused on days one and two.
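Going back for a moment to the second type of analysis, where data is layered onto a network to find a connected region that is changing, the idea could be sketched roughly as follows. This is a deliberately crude stand-in, not any published method: it simply groups changed genes that are linked to each other in the network. The network, the gene names, the `min_size` cutoff, and the function name `changed_modules` are all hypothetical.

```python
from collections import deque


def changed_modules(network, changed_genes, min_size=3):
    """Return groups of changed genes that are connected to each other.

    `network` maps each gene to the set of genes it interacts with.
    Only links between two changed genes are followed, so each returned
    set is a connected region of the network made up of changed genes.
    """
    seen = set()
    modules = []
    for start in sorted(changed_genes):
        if start in seen:
            continue
        seen.add(start)
        module = {start}
        queue = deque([start])
        while queue:
            gene = queue.popleft()
            for neighbor in network.get(gene, ()):
                if neighbor in changed_genes and neighbor not in seen:
                    seen.add(neighbor)
                    module.add(neighbor)
                    queue.append(neighbor)
        if len(module) >= min_size:
            modules.append(module)
    return modules


# Hypothetical undirected network: gene -> interaction partners.
network = {
    "G1": {"G2", "G3"},
    "G2": {"G1", "G3"},
    "G3": {"G1", "G2", "G4"},
    "G4": {"G3", "G5"},
    "G5": {"G4", "G6"},
    "G6": {"G5"},
}
# Hypothetical genes called differentially expressed in the experiment.
changed = {"G1", "G2", "G3", "G6"}

modules = changed_modules(network, changed)
print(modules)
```

Real tools in this category also weigh how strongly each gene changes and how dense the region is; the sketch only captures the basic step of combining a gene list with network connections.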
The general idea of pathway and network analysis is that you collect some kind of omics data. You normalize and score it. So in gene expression, you have to normalize it and compute the differential expression. In GWAS, you have to take your SNP array and you have to normalize it and then do genotype calling. In proteomics, you have to take the mass spectra and you have to search a database using software to identify proteins. And then they're ranked by a particular statistical method. So each genomics data typically has its own way of normalizing and scoring things. So we're not going to talk about how to do that in this course. We assume that you're coming in with a knowledge about how to do that. We can talk about it if you want. But those types of normalization and scoring are often very standardized, especially for established types of omics techniques. If you're working with a new omics technique that doesn't have an established normalization method, that's a method development problem that somebody has to solve to get that omics data set working well for lots of people. The outcome of this is the gene list. That's sort of where we're starting from in this course and where we're trying to focus on. And as we talked about already, the goal is to learn something about the underlying cellular mechanism. So breaking that down in this little box, this is what we're going to focus on here. We can visualize and identify interesting pathways and networks, so we showed some examples. And then often once you find, you might get a lot of pathways coming out of your analysis if you're lucky. There's lots of information in your data set. And then now instead of having 1,000 genes, you might have 500 pathways to look at. A lot of the pathways might be related to each other, so we'll show you methods of how to visualize that and reduce the redundancy. But often what you want to do at that point is figure out which pathways look interesting for further drill down. 
So you might see a pathway that is well known. Oh, I'm studying cancer and the cell cycle is active. I know that. That's pretty standard knowledge. You expect to find that, and you find it. It's actually a good validation that your system is working. Or you might find a pathway that is really novel, one we've never seen come up with this type of analysis or this type of experiment. That might be really interesting, but you might not be able to follow up on it too easily, because maybe lipid biosynthesis comes up and you don't know anybody who studies lipid biosynthesis. So there might be a whole range of different pathways that come up, and often what people will do is pick an interesting pathway that relates to their hypotheses, their way of thinking. Ideally you would pick the best pathway, the strongest signal in your data, and I would recommend going for it, but sometimes it's not practical to study certain areas of biology. So that's an interesting thing that happens with genomics: if you ask such a broad question, you can get back any possible answer, and ideally you should be prepared to follow up on it. But it's a pretty tall order to say to someone, okay, I'm going to tell you about any area of biology, and now you have to go and do some experiments to validate that. So keep that in mind, I guess. It's something we notice. And then ideally you'd publish some model explaining the data. These mechanistic interpretation steps give you some hypotheses. You might be able to take those hypotheses, work them up experimentally, and come up with a mechanistic model, and you might find a drug target. So that's our goal. Okay, so now I'm taking all these boxes and I'm going to expand them further in the next slide.
So this is a much busier slide, but what it shows you is the various paths that you need to take depending on the data that you have. One of the comments that we've gotten in previous versions of this course is that we're trying to teach the concepts, so we talk about genes and we talk about enrichment analysis, but then people say, well, how do I do enrichment analysis with protein expression? How do I do enrichment analysis with DNA methylation? There are all sorts of different ways of getting a gene list out of your data. So these blue boxes at the top here talk about where the gene lists come from, the different types of data, and you can just look at them. And so, oops, my slide is messed up here. Let me quickly fix it. Looks fine on my, there it is. Weird. Okay, so often the experimental methods that generate gene lists are molecular profiling data. Gene expression data is pretty common. Protein expression data is a little bit more resource-intensive to get access to, but there are increasing amounts of that data. The basic version is that you identify a bunch of genes or proteins, and that generates your gene list. You can also quantify some aspect of those genes. I mentioned differential expression for gene expression, and for proteins you might have absolute values of the protein concentration in the sample. You can rank your genes based on some score that you might have. It might be the score of statistical association, in a GWAS study, of the markers associated with that gene for your cases versus your controls. So when you're thinking about your gene list, think about two types: just the set of genes, or, if you have a score associated with each gene, a ranked list. That's important, because you can rank your genes based on that score, and that is additional information that you have about your gene list.
And we'll talk about methods that use those two types of information differently. If you have more information, this ranking information, you might be able to get better results. Another way that people get gene lists is by taking a lot of different data samples and clustering them to find patterns that are similar across the samples. So if I have 100 gene expression experiments and I cluster the results, I might see that 20 genes are acting very similarly across those 100 experiments. They all go up at the same time, and they all go down at the same time. So those 20 genes, I would say, are related somehow. They're following the same pattern. They might be involved in the same pathway, or in a set of pathways, some larger process that's regulated in the same way across the experiments. Clustering will identify those, and that list of 20 genes is itself a gene list that you can analyze with pathway analysis, and you might be able to get some information about why those genes are going up and down. Protein interactions, or molecular interactions, are another general type of data. So we have protein interactions, which a couple of people are studying, transcription factors binding to DNA and the genes that they might be binding to or regulating, and microRNA targets. So there are a lot of ways of getting targets of a gene, and this usually just gives you a list. So I have a protein, and I want to find out what it binds to. I think people mentioned BioID, and that's, for instance, one way of getting access to proteins that are in the neighborhood of your protein. That would define a list, and it also gives you network information, and we can use that network information as well.
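The two kinds of gene list mentioned a moment ago, a plain set of genes versus genes ranked by a score, can be sketched in a few lines. The gene names, the scores, and the cutoff of 2.0 below are all made up for illustration; a real analysis would take them from whatever normalization and scoring method fits the data.

```python
# Hypothetical differential expression scores (positive = up, negative = down).
scores = {
    "GENE_A": 4.2,
    "GENE_B": -3.1,
    "GENE_C": 0.4,
    "GENE_D": 2.7,
    "GENE_E": -0.2,
}

THRESHOLD = 2.0  # assumed cutoff on the absolute score

# Type 1: an unordered gene set, keeping only genes that pass the cutoff.
gene_set = {gene for gene, score in scores.items() if abs(score) >= THRESHOLD}

# Type 2: a ranked list of every gene, most up-regulated first.
ranked_list = sorted(scores, key=scores.get, reverse=True)

print(sorted(gene_set))
print(ranked_list)
```

The set throws away the scores and everything below the cutoff, while the ranked list keeps every gene and its relative position, which is the extra information that ranking-based methods can use.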
So there are many other ways. I think I've mentioned these, and there might be other examples in the class as well. The point that I'm trying to make is that gene lists come from different places, and these blue boxes give you a sense of some major examples. If you look at that, you might come up with some ideas about how the experiments that you're working on can be converted into a gene list. So the next thing is the meaning of the gene list. Depending on your experiment, your gene list might mean different things. If I'm doing a gene expression experiment, there's a notion that in the cell, genes are not regulated randomly. They're regulated for a purpose. If the cell's going to turn on a bunch of genes, it's for something: it's to respond to a stimulus. So there are some systems that it will turn on, and it won't necessarily keep those systems on all the time because that's wasteful of energy. The idea is that the cell likes to turn things on when it needs them, and it turns on whole systems to react to some effort. So if I see genes changing, it goes back to that concept and says there must be some pathways that are being turned on. So if you do a gene expression experiment, you might get information about biological systems. However, you might also be doing some kind of genetic screen or some other type of experiment that doesn't give you information so much about pathways. It might give you information about a cellular location, like I'm doing some purification and I see a whole bunch of things that are in the mitochondria. Or I'm doing a linkage analysis or genome-wide association study, and I find that a whole chromosomal region is associated with my disease, and I'm not sure which gene is involved. There might be a hundred genes in that chromosomal region. So the point of the slide is just that you should think about the types of information that are present in your gene list, and what you expect based on biological assumptions.
I also mentioned that once you have your raw data, you have to normalize it. For the purpose of this workshop, we assume, as I mentioned, that you're doing standard normalization, background adjustment, and quality control, and that you're using statistics that increase your signal and reduce your noise. Even though we're not covering it, people often ask what kind of concept they would use for, say, DNA methylation. So for instance, if you're focusing on DNA methylation, you might want to score methylation of gene promoters, which relates to silencing of genes. And I talked about these other examples. So these orangey boxes here just provide some examples of how you go from your raw data to a gene list, to give you a sense of different ways of doing it. Okay, so then you have your biological question, and this relates to the types of information that are in the gene list. There are a few different types of questions that you can answer with a gene list once you have it. One thing you can do is summarize all the biological processes. As I said, it's fairly descriptive, but it might help differentiate the functions of different samples. You can identify pathways that are different between samples, and that might give you some mechanistic insight into the differences, like apoptosis is active in the sample. You might be able to find a controller for a process, and that's mostly what we're going to talk about on day three. You might be able to find that a pathway is regulated by a transcription factor. You can find new pathways or new pathway members and discover gene functions, so we'll talk about this. And you might be able to find a drug, like I mentioned earlier.
So just to summarize, just to give you a, I think I mostly said this, but today we're focusing mostly on pathway enrichment analysis, which addresses this kind of summarize and compare idea, so summarize your data, compare between samples. Tomorrow we're gonna focus more on network analysis, so that's useful to find new pathway members, identify functional modules, predict gene function, and then the third day is about regulatory network analysis, which is more about finding and analyzing controller molecules. And that's sort of summarized by these green boxes here, so what we've tried to do is show you that you can take a gene list and you can send it to pathway analysis type of, pathway type analysis. This is more of a network type analysis, and these yellow highlighted words here are names of software that can be used for each type of analysis that's present in the box. So we're gonna talk about this one, for instance, today. Visualize and identify interesting pathways, and at the bottom here we have mechanistic drill down, so for instance, once you drill down to mechanism, you might wanna overlay, you might find a pathway that's, say you take your gene list, you run a pathway analysis, you've got 100 pathways that result, and you visualize them in a way that we're gonna tell you about today, and you identify your interesting pathway, then you zoom in and you say, okay, now I wanna see what's going on with this pathway and my data, so it'd be nice to overlay your data onto that pathway, for instance. 
So PathVisio is software that helps you take your genomics data and overlay it onto a diagram of a pathway so you can reason over it, and you might find some genes that you're not familiar with, so you can look them up, try to predict their function, and integrate additional information. There's a lot of complexity here, and we're not gonna talk about every detail of this during this workshop, we're not able to cover it all, but the purpose of this is to give you a big picture view of the process, a way of thinking about this workflow, and hopefully it's useful and you can use it to follow aspects of the course. Okay, so any questions so far? Okay, so it's an introduction, it's pretty clear. I'm now gonna talk about some background information that we all need to know to really go into this pathway enrichment analysis idea that Quaid is gonna cover in more detail. So, as I have mentioned multiple times, the pathway enrichment idea is that you have a pathway that's statistically enriched in your gene list, more than you would expect given the frequency of the genes in that pathway in the genome. This sort of summarizes it here, and Quaid is gonna go into this in a lot more detail: each time you do one of these analyses, you take your gene list and some pathway information, and you take each pathway one by one and do that statistical test to see if it's enriched. And so the things that you need for that are your gene list and pathway information. And then the statistics are in this box, so Quaid's gonna cover that. But just for introductory purposes, usually about half the class knows some of this material and the other half doesn't, and we wanna get everyone on the same page with some of the basics. So we're gonna cover some general concepts and tips related to the basics of these two inputs.
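To make the "take each pathway one by one and test it" idea concrete, here is a minimal sketch in Python. It assumes a one-sided hypergeometric test, which is one common choice and not necessarily the statistic covered later in the course. The function names and the toy gene and pathway names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(genome_size, pathway_size, list_size, overlap):
    """Probability of seeing at least `overlap` pathway genes when
    `list_size` genes are drawn at random from `genome_size` genes,
    `pathway_size` of which belong to the pathway (one-sided test)."""
    max_overlap = min(pathway_size, list_size)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(genome_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def enrich_pathways(gene_list, pathways, background):
    """Test every pathway against the gene list, one pathway at a time."""
    background = set(background)
    genes = set(gene_list) & background
    p_values = {}
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(genes & members)
        p_values[name] = hypergeom_enrichment_p(
            len(background), len(members), len(genes), overlap
        )
    return p_values


# Toy data (hypothetical): a 20-gene "genome" and two 5-gene pathways.
background = [f"g{i}" for i in range(20)]
pathways = {
    "pathway_A": ["g0", "g1", "g2", "g3", "g4"],
    "pathway_B": ["g10", "g11", "g12", "g13", "g14"],
}
gene_list = ["g0", "g1", "g2", "g3"]
p_values = enrich_pathways(gene_list, pathways, background)
```

In this toy case all four genes in the list fall in `pathway_A`, so its p-value is small, while `pathway_B` has no overlap and gets a p-value of 1. Each pathway is a separate statistical test, which is why the number of pathways tested matters later on.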
So, one is the gene list, some more specific tips about how gene lists work, and the other is where pathway information might come from. Okay, so there are just a couple of important concepts to know about when working with gene lists. One is the way that you name your genes. There are actually many different ways of naming genes, as you guys probably know. This doesn't have to be genes, it could be metabolites as well, lots of different names. The best kind of names for gene lists are ones that are unique and stable, names or numbers that help track that gene stably across different versions of databases. And if you give it to your friend, they know what you're talking about. Kind of like a social insurance number, or the Entrez Gene ID. But because gene and protein information is stored in many databases, genes have lots and lots of different identifiers. So, I'm gonna say ID for identifiers. And it's also important to note that a gene relates to DNA, RNA, and protein, and there are different database records for those different categories. So, you might have an identifier for the DNA location of the gene, the RNA transcript, and the translated protein product, with a different database number for each of those concepts. And also one for the gene. So, it's important to recognize the correct record type because different tools expect different types of information. If you have all protein identifiers and those protein identifiers relate to different splice variants of the protein, some tools might not be able to translate that nicely to genes and so you might have to do that yourself. It's also important to note that gene records, like the ones in the Entrez Gene database, which is at NCBI, the people that make PubMed, at the National Library of Medicine, don't store sequence.
They just have a name of the gene and some information about the function of the gene. It's like they're storing information about the concept of the gene. They don't actually say what the sequence is because it might actually have different sequences in different contexts. And then they link to the sequence in other databases. So, there's lots of different identifiers. Some common ones are listed here. The ones in red we would recommend using because they are more likely to be unique and stable. And I'll tell you about some of the problems that might happen if you don't use these types of things. So, in my group, for instance, what I recommend is to use Entrez Gene IDs if you're working with genes, RefSeq IDs if you're working with RNA transcripts, and UniProt or RefSeq if you're working with proteins. And then there are many species-specific gene symbols. So, human has the HUGO Gene Nomenclature Committee, and they have a unique name that is usually a nice human-readable symbol for a gene. These still change sometimes but they don't change as much as uncontrolled gene names. And it's just good to recognize these different things and try to pick one that's standard. Okay, so just to mention that if you need to translate identifiers between naming schemes, so you want to take the gene symbol and translate it to an Entrez Gene ID, or take the gene symbol and translate it to a protein ID, there are mapping services that help you convert an identifier from one type to another. And we're not going to go into too much detail about that, but at the end of this lecture there's a do-it-yourself lab that we're not going to go through that talks about that a little bit more. Okay, so I think I actually have a slide out of order there so I'm going to talk about that a little bit more. So one of the, actually maybe I'll just do it right now.
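As a rough picture of what translating identifiers involves, here is a small Python sketch. It assumes a hand-made lookup table instead of a real mapping service. The table, the symbols, the placeholder IDs, and the function names are all hypothetical. It keeps symbols that resolve to exactly one gene and sets aside symbols that resolve to no gene or to more than one gene, because both of those cases need a look before any downstream analysis.

```python
# Hypothetical lookup table: symbol -> list of stable gene IDs.
# The lookup is an exact, case-sensitive match.
EXAMPLE_TABLE = {
    "GENE_A": ["gene:0001"],
    "GENE_B": ["gene:0002", "gene:0003"],  # one symbol, two different genes
    "GENE_C": ["gene:0004"],
}


def map_symbols(symbols, symbol_to_gene_ids):
    """Split symbols into uniquely mapped, ambiguous, and unmapped."""
    mapped, ambiguous, unmapped = {}, {}, []
    for symbol in symbols:
        ids = symbol_to_gene_ids.get(symbol, [])
        if len(ids) == 1:
            mapped[symbol] = ids[0]
        elif len(ids) > 1:
            ambiguous[symbol] = ids
        else:
            unmapped.append(symbol)
    return mapped, ambiguous, unmapped


def coverage(symbols, mapped):
    """Fraction of the input symbols that mapped to exactly one gene."""
    return len(mapped) / len(symbols) if symbols else 0.0


symbols = ["GENE_A", "GENE_B", "GENE_C", "gene_a"]
mapped, ambiguous, unmapped = map_symbols(symbols, EXAMPLE_TABLE)
```

Here `gene_a` ends up unmapped because the match is case sensitive, and `GENE_B` ends up in the ambiguous group, so only two of the four symbols can be used as they are.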
Okay, so here's an example of an ID mapping service, g:Convert, and there's another one, Ensembl BioMart. As you can tell, you input your identifiers, you choose the type of identifier and your organism, you choose the type of output, and then you get results. In this case, this g:Profiler g:Convert tool might give you some hints if you provided an identifier that is ambiguous, which means that in databases it's pointing to two different genes. That's very bad, because if you have that in your gene list you might make a mistake in downstream analysis, and the best example of that is this paper that was retracted in Nature in 2003. So this paper was about a gene called HES1, and it turns out that they did all these experiments and then they found out that a very early database search error led to them using the wrong gene, because there's another gene with the same name and they had the wrong one. They're both named the same thing. One actually has a capital E versus a lowercase e, and that kind of thing, when you have case sensitivity in your gene name, is not gonna be recognized by many computer systems and it's just a bad idea. You should never ever do that because you'll have problems like this. So what these guys did is they did all their experiments on the wrong gene, they published a Nature paper, and then a few weeks later they had to retract the paper. So that's just a really bad, tragic example. There are different types of errors. The ambiguity that I mentioned is something to be aware of. So don't use identifiers that can map to more than one gene, because you'll have a problem like that HES example. There are often people who wanna use gene names. I find this often when people are working on protein data: proteins often have different names than the gene name.
It's best to just use the gene name, because the protein names often aren't standardized, and people would say p53 when the gene name is TP53. If you just type p53 in, you might match lots of genes. TP53 is the standard symbol for the gene, the one that will uniquely determine that gene. Another thing that happens is that the tools that you use to manage your gene list might introduce errors, and the biggest problem is Excel. How many people have noticed when you type genes into Excel that it might automatically change them to something like a date? If you type information into Excel it tries to be smart, and if you type in Oct4, which is a pretty important transcription factor in stem cells, it thinks you're talking about October 4th, because Excel is made for accountants, and October 4th is much more relevant to them than the Oct4 gene. So how many people have seen this in Excel? Yeah, so it's definitely a problem. You can make Excel stupid by default. Exactly, exactly. So by default Excel does this, but you can disable these smart options, or you can choose to paste as text instead of general. The default format for a cell is general, and that tries to guess what the format is, and so it would guess a date. If you just say it's text, that I'm just talking about text or just a number, then it won't. And the problem that comes up often is if you're working with a big gene list and you just paste it into Excel and you didn't notice that it actually made a change, because it made the change off-screen. You have a thousand genes and somewhere at the bottom of the list it's changing it.
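Because these changes often happen off-screen, it can help to scan a gene list for entries that look like Excel got to them. The Python sketch below is a minimal illustration and not a complete check. It assumes the converted cells were saved as text that looks like `4-Oct` (a date) or `2.31E+13` (a number in scientific notation). Those two patterns and the function name are assumptions for this example.

```python
import re

# Entries such as "4-Oct" or "1-Sep": a day number, a dash, a month abbreviation.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$",
    re.IGNORECASE,
)
# Entries such as "2.31E+13": an identifier that was read as a large number.
SCI_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$")


def find_suspicious(identifiers):
    """Return (row number, value) for entries that look auto-converted."""
    hits = []
    for row, value in enumerate(identifiers, start=1):
        text = value.strip()
        if DATE_LIKE.match(text) or SCI_LIKE.match(text):
            hits.append((row, value))
    return hits


gene_column = ["TP53", "4-Oct", "1-Sep", "2.31E+13", "BRCA1"]
suspicious = find_suspicious(gene_column)
```

Running this over every row of a long list catches a converted entry even when it sits at the bottom of a thousand genes. The original symbol still has to be recovered from the source file, since the conversion itself loses information.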
This is funny because there was a paper published a few years ago that kind of tracked these things, this paper here and actually some of the Excel auto conversions have made their way back into the databases as the gene name because people are submitting this back to the database. There's also problems reaching 100% coverage. If you have a lot of genes from your RNA-seq experiment for instance, some of those genes might not be very well studied. In fact, some of them might not be real genes at all or people are currently debating whether it's a gene and every time you go check the database the debate is, you know, moved on to the not a gene and then back to the it is a gene and back to the not a gene so it actually will change over time. So some of those, especially those ones that are less well studied you might not be able to get identifiers in all of your databases because the gene may be brand new. And so, but if you do have a desire to do this identifier conversion and want to make sure all of your genes are covered then you can go to different sources to increase your coverage. Okay, so these are some recommendations that really try to help with this. So this is only relevant for protein and genes. It doesn't really consider splice forms. Just a note about splice forms. Some experiments give you a lot of information about splice variants of genes. Most pathway analysis systems don't consider that differentiation between different splice forms because all of the databases we have pretty much I think it's safe to say that all of the databases we have about pathways and gene function is really focused around the gene and it will often consider the longest transcript. And it's a very active area of research to push those databases to update themselves so that they, as we know more information about functional differences between splice variants that that information gets captured and we see that that very important information biologically is actually captured. 
But right now the state of the field is very gene oriented, and all the splice variants kind of get collapsed, and probably even if they have very different functions, all those functions will just get collapsed to the gene level, and we say that gene is involved in these functions. So mapping all of your data to Entrez Gene IDs or the official gene symbol, using a spreadsheet, is good. I talked about 100% coverage and I talked about these Excel auto conversions. So you can turn your auto conversions off or format the cells as text before pasting. Okay, so just to summarize, we just finished covering some of the basics of working with gene identifiers. It's not such a problem if you only have a few, but when you have thousands of them these are good housekeeping things to know about because they help reduce errors. Okay, so I'm now moving on to the pathway side of this. These are the two inputs into pathway enrichment. And pathway information is quite varied and available from many different sources. One of the most popular sources is the gene ontology. How many people know about the gene ontology? Okay, a few people know about the gene ontology. And so I'm gonna talk about the gene ontology because it is a very important source of pathway annotation, and not just pathways, some other types of information as well, for genes across many organisms, and many pathway analysis tools use the gene ontology. There are some complexities with it that need to be explained to really understand how it's working. So the gene ontology is an ontology, and the word ontology means a system for describing knowledge. So it's a way of representing knowledge, in this case biological concepts, which are terms or phrases, and the relationships between those concepts. And in this case these are applied to genes. So protein kinase, apoptosis, membrane, those are three terms in biology that would be part of the gene ontology.
It also is a dictionary because each term has a dictionary definition associated with it. So it's actually very useful just as a biological dictionary if you wanna look up a term. And it's made by the Gene Ontology Consortium, which is a group of databases, model organism databases and UniProt, which is a big protein database, that are responsible for annotating the function of genes and proteins. So gene ontology is a hierarchical structure with the most specific terms at the bottom of the structure, and if you go up it gets more and more general. So here in this example we have B cell apoptosis, and then if you move up, apoptosis, then programmed cell death, cell death, death, and physiological process, and all the way up to the top to more and more general things. There are a couple of different types of relationships between these terms. The major ones are "is a" and "part of". So B cell apoptosis is a type of apoptosis, which is a type of programmed cell death, but B cell apoptosis, as you can see right here, is also part of B cell homeostasis. In the same way, you can say the nucleus is part of the cell. So those are two different major types of relationships. It's organized this way so that it can describe gene function at multiple levels of detail. And it's important to know that this tree-like structure can have cases where a gene ontology term has multiple parents. And that's important for a reason that I'll mention in a sec. So there are three different major aspects of gene ontology. One is biological process, which is where pathways are defined, and also more general processes. Molecular function is, for example, an enzymatic function, and cellular components are different parts of the cell. So those are three different aspects. Gene ontology has two different parts.
It has the terms that I talked to you about, and these terms are defined by database curators whose job it is to review the literature and type information into databases so that we can use it. Terms can be added by request, so if you wanna add one, you can. And there are experts that help with major development of new terms, but it's growing over time and there are tens of thousands of terms. So this is the number of terms in each of these categories, and you can see there are 28,000 biological process terms that are defined, each one with a definition. The second part of gene ontology is annotation. What the database curators do after they've defined a term is they link it to a gene. So if I have a gene, I can tag it with any number of these terms. So the gene is localized in the mitochondrion, it's involved in energy metabolism, and its enzymatic function is a particular function. So that's the idea of annotations, sometimes called associations, gene associations, or GO annotations. There are multiple annotations per gene, so the gene can have multiple functions. Some of the annotations are made manually and some of them are made electronically. Some of the ones that are made electronically are reviewed by people and others are not. So there are different quality levels of these annotations, which is important to mention, and I'll go over that in a bit. Just a quick note that the hierarchical nature of this gene ontology means that if you have a particular gene annotated to one of these terms, you can infer that it automatically gets annotated to all of the terms above it. So once you make this link of a term to a gene, it automatically gets all these other terms, and it can also be linked to multiple terms.
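The point about a gene automatically picking up all the more general terms can be sketched in a few lines of Python. This is a toy illustration, not the real gene ontology. The small parent table only contains a few terms from the example, it treats "is a" and "part of" links the same way, and the gene name and function names are hypothetical.

```python
# Toy piece of the hierarchy: each term lists its parents.
# "B cell apoptosis" has two parents, so the structure is not a simple tree.
PARENTS = {
    "B cell apoptosis": ["apoptosis", "B cell homeostasis"],
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["cell death"],
    "cell death": [],
    "B cell homeostasis": [],
}


def ancestors(term, parents):
    """Collect every term reachable by walking up from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents):
    """Add all ancestor terms to each gene's directly annotated terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


direct = {"geneX": ["B cell apoptosis"]}
full = propagate(direct, PARENTS)
```

With a single direct annotation, `geneX` ends up linked to five terms in this toy hierarchy. That is why a gene usually carries many more terms than a curator actually typed in.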
So often a lot of gene ontology terms are associated with a gene, and that can sometimes make working with it a little bit difficult. For instance, if you just wanna take a thousand genes in your gene list and make a pie chart that shows where in the cell the genes are, you have to recognize that a gene can be in multiple cellular compartments, maybe every gene is in multiple cellular compartments, and that makes it harder to just come up with a quick summary. That's because the terms are not unique to genes. Okay, so more importantly, coming back to this point about annotation sources: some of the annotation is very high quality, curated by scientists. These annotations are typically smaller in number because it's time consuming to create them. They also include reviewed computational analysis. Some types of computational analysis for genes are very accurate. So for instance, if you take a protein sequence and you run it through a transmembrane domain predictor, those predictors are like 97% accurate or more. So they're very, very good at picking out transmembrane domains in proteins. And as soon as you have a transmembrane domain, you can say it's part of the membrane somewhere. So you already get information about the function of that protein. And so that kind of annotation might be very high quality. In addition, some annotation is mapped from other species by sequence similarity. So if I have a new model organism that I'm studying, I know someone was talking about bullfrog, and other people taking this course are studying bees or other things, where they haven't completed a genome sequence fully. And I guess frog has the Xenopus genome sequence, but there might be other organisms that don't. Often all the information will come based on sequence similarity to the closest genome. And ideally someone would review that. And if they review it, then they might identify false positives and remove them.
So that's the manual annotation. The other side is fully electronic. It's considered lower quality. And a key point is to be aware of the annotation origin. So there are a bunch of different annotation types. When the curator takes a gene ontology term and links it to a gene, they don't just say, okay, that's it, they add evidence. Why did I do that? So it might be that they found a paper that said that this gene is related to this function. And so they'll actually say TAS, traceable author statement, and they'll put the PubMed ID. Or if they're saying that this gene is part of this function because it interacts physically with a bunch of other proteins that are part of that function, they can say inferred from physical interaction, and they can actually tell you what the other proteins are that it physically interacts with. So there's a lot of information in these annotation files. And these evidence codes tell you the quality of the annotation. So IEA is inferred from electronic annotation, that's the lower quality one. These are the higher quality ones in red here. So this is for your information. One issue is coverage. All major eukaryotic model organisms and human are quite well covered. Several bacterial and parasite species are covered. New species are added to the list. You can go to the gene ontology website and see the list of species. Here's a list of the most annotated organisms in gene ontology. It includes chicken and other things that are not always standard model organisms. The green is experimental evidence codes and the blue is non-experimental. So the point here is that experimental means that somebody did an experiment in that species. So it's likely to be better quality. The blue here is probably electronic annotation.
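Being aware of the annotation origin often comes down to filtering on the evidence code. Here is a minimal Python sketch of that step. It assumes annotations are held as simple (gene, term, evidence code) records. The records, gene names, and function names are hypothetical, and real annotation files carry more columns than this.

```python
from collections import Counter

# Hypothetical records: (gene, term, evidence code).
# TAS = traceable author statement, IPI = inferred from physical interaction,
# IEA = inferred from electronic annotation.
ANNOTATIONS = [
    ("geneA", "mitochondrion", "TAS"),
    ("geneA", "energy metabolism", "IEA"),
    ("geneB", "membrane", "IPI"),
    ("geneC", "membrane", "IEA"),
]


def filter_by_evidence(annotations, excluded_codes=("IEA",)):
    """Drop records whose evidence code is in `excluded_codes`."""
    excluded = set(excluded_codes)
    return [record for record in annotations if record[2] not in excluded]


def evidence_counts(annotations):
    """Count how many records carry each evidence code."""
    return Counter(code for _, _, code in annotations)


reviewed_only = filter_by_evidence(ANNOTATIONS)
counts = evidence_counts(ANNOTATIONS)
```

In this toy set, dropping IEA removes `geneC` entirely. The same thing happens on a larger scale for species where most annotation is electronic, so it is worth looking at the counts before deciding what to exclude.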
And so you can see that chicken, which is probably the least common model organism here, it's not really as much of a model organism as some of these other ones, is mostly blue, probably because people aren't doing as many experiments in chicken as they are in human. So the point here is that, one, be aware of the annotation origin, but you also might be working in a species that doesn't have experimental annotations. All the evidence is taken from another related species. And so in that case, you have to use electronic annotation. You might want to review it yourself if you're worried about it, but you may be forced into that. So that electronic annotation sometimes is useful. Yeah, right. So the curators are well-trained scientists that wouldn't make that kind of mistake if they're doing it manually. But if it was based on some kind of electronic text mining, then it might make that mistake. Sorry? Text mining might be used by these curators to help them queue up information that they'll review manually. It might also be used in the purely electronic annotations. So that's a reason that the electronic annotations, because they're not reviewed by anyone, might have mistakes like the one you mentioned. The evidence code tells you how the evidence was collected. And if it's a paper, it will actually have the PubMed ID and you can go and look at that paper. So yes, in gene ontology annotations. Generally, if the databases are curated, I'd say almost all databases, because they're responsible, will tell you at least the paper that they read to get the information. And they might tell you more, like we found the information in this figure, or they might say that they collected a bunch of papers and learned about the pathway that they were studying and tried to make their best version of it. So they might even be kind of writing a review article almost, at the highest levels.
But yes, you should be able to go back to that experimental evidence. Some of it may be high throughput, in network databases. Like Quaid will talk about GeneMANIA later. And one of the things that you can do, if you have gene expression data, is go and see which gene expression data set was used and which paper generated it, and so you can find that information. It's not always easy to do. So it is a good question. Sometimes it's easier than others. Does that answer your question? Yeah, any other questions? Okay, so this is just a quick slide for your information. There's a bunch of databases that contribute to this worldwide effort. A couple of additional gene ontology related concepts. There is this idea of a GO slim set. So instead of having 28,000 terms for biological process, it will slim it down to maybe 100, and that makes it a little bit more manageable. So I mentioned this pie chart example. This is kind of a standard thing that sometimes people want to do. If you have too many terms, it's hard to make a pie chart, for the one reason that I mentioned earlier, but also because it might create too many slices of the pie. So you want fewer slices of the pie, and the GO slim set might help you do that. So there are a couple of official reduced sets that exist. Just a minor point. There are also many tools and resources available that use gene ontology. And so by looking at gene ontology tools, you might be able to find interesting analysis methods. One tool that we recommend is a website called QuickGO. So this is at the European Bioinformatics Institute in Hinxton near Cambridge, England, next to the Sanger Institute. And they make available this nice browser for gene ontology. So you can type in a gene ontology term and you'll see this hierarchy here and you'll get the definitions, et cetera.
You can also find the annotations, because they load those up from all the gene ontology annotations, and you can actually filter the annotations. You can say, just give me annotations for worm where the term is macromolecular complex or any of its children, and I don't want to include electronic annotations. And so you can set these filters and you'll get a result. And that query that I just described will give you all of the protein complexes that have experimental support in C. elegans: all of the terms that are associated with that, and the genes and proteins that are associated with those terms. There are other ontologies out there. Most of them are not very well used, but some of them are. For example, the human phenotype ontology is an emerging ontology that covers a lot of information about human diseases. And increasingly that's starting to get used, and it's present in g:Profiler, which is one of the tools we'll talk about. Okay, so that covers gene ontology in a little bit of depth. To summarize, it's an important source of gene function annotation and pathway information, especially the biological process part of gene ontology. But there are many other sources of pathway information, mainly pathway databases. So one of the sites that we have developed is called PathGuide. My lab keeps track of as many pathway databases as we can. And the last update, last year, increased the number of pathway databases to 550. So there's actually a lot of pathway databases. There are a lot of specialized pathway databases as well. So there are pathway databases that just focus on HIV-human interactions, or just focus on innate immunity, for instance, or on specific categories of information like transcription factors and their targets. MSigDB is a pathway database where all the pathways are represented as gene sets.
And that is used by the GSEA software, Gene Set Enrichment Analysis, which we'll talk about as one of the most common pathway enrichment tools. And they make available a database that's fairly extensive for their tool. Pathway Commons is a project from my group, in collaboration with Chris Sander at Sloan Kettering Cancer Center, to collect pathway information from many different sources and try to make it easier to use. And so there are right now 18 major pathway databases in Pathway Commons. Okay, so I just wanna do a time check. I think I might possibly be ending early here. Okay, so I've covered pathways, which are gene ontology biological process, and also pathway databases like Reactome. So Robin, who's here in Lincoln, he'll talk tomorrow, and he'll go into more detail about the Reactome database. It's one of the premier pathway databases. It's mostly focused on human, and it's developed, led out of the center actually. But there are many other types of annotation. So I mentioned gene ontology molecular function and cell location. So if you have gene lists that are relevant for that type of information, for instance you're doing purification of some part of the cell and you identify a lot of genes that are all present in a particular cell location where you expect them to be, then you can do your enrichment analysis with the cell location aspect of the gene ontology instead of biological pathways. You should probably do these separately anyway. Sometimes people throw all the gene ontology terms together when they do their pathway analysis. I don't think that's a good idea, for two reasons. One, it generates a lot more redundant information. Two, it worsens your statistical power, because each one of those terms represents a different statistical test, and you have to correct for doing multiple testing, which we will talk about.
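To see why extra terms cost statistical power, here is a small Python illustration. It uses the Bonferroni correction, which is the simplest multiple testing correction and is chosen here only because it is easy to read. The lecture does not say which correction the tools apply, and the term counts below are made up for the example.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the overall error rate at `alpha`
    when `n_tests` separate tests are run (Bonferroni correction)."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical counts: testing only the relevant category of terms,
# versus throwing in another large category as well.
focused = bonferroni_threshold(0.05, 1_000)
everything = bonferroni_threshold(0.05, 1_000 + 15_000)
```

With 1,000 tests a pathway needs a p-value below 0.00005 to pass, and with 16,000 tests it needs to be sixteen times smaller again. A real pathway that clears the first cutoff can be lost under the second, just because unrelated terms were included.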
So it's best to think about what your gene list represents, and then focus on analyzing it using the information that's most relevant. If you don't expect to see enzymatic terms like dehydrogenases from your screen, don't include that whole category of 15,000 terms in your search. Or if you're doing something where you really expect chromosome position to be important, you may want to use a gene set database that only relates to chromosome positions, and MSigDB actually has one. I'll give you an example where that was kind of interesting. A student whose thesis committee I'm on was doing some gene expression analysis and found that olfactory receptors were really strongly enriched in the analysis. Whenever I see olfactory receptors, I immediately think there might be some issue, because olfactory receptors are often very highly clustered on the genome: they're all next to each other. So if you have some kind of problem with that segment of the genome, like an amplification or deletion, that will affect your gene expression results, and you will get a whole bunch of olfactory receptors differentially expressed in your sample. And because they're all olfactory receptors, they're all part of the same pathway. So when you run your pathway analysis, you'll get olfactory receptors and all the related pathways, like neurotransmission and a whole bunch of general terms related to sensing odor, and you'll get a really strong signal. So what I told her is that, because these are olfactory receptors, that could be what's going on. There are a few pathways like this that we know are clustered in the genome: histones also, and some adaptive immunity related molecules. So typically, when we see these pathways, we want to check that it's not just because a segment of the genome is affected.
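One way to sketch that check is a one-sided hypergeometric test asking whether a gene list contains more genes from a single chromosome band than you would expect by chance. The numbers below are invented for illustration, and the choice of a hypergeometric test is an assumption here, not a description of what any particular tool does internally.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(population, in_band, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a genome of `population` genes, `in_band` of which sit in the band."""
    total = comb(population, list_size)
    upper = min(in_band, list_size)
    favourable = sum(
        comb(in_band, i) * comb(population - in_band, list_size - i)
        for i in range(overlap, upper + 1)
    )
    # Exact integer arithmetic first, then a single conversion to float.
    return float(Fraction(favourable, total))


# Illustrative numbers: 20,000 genes, 40 of them in one band,
# a 200-gene list, 10 of which fall in that band.
p_clustered = hypergeom_tail(20_000, 40, 200, 10)
print(f"p-value for 10 genes from the same band: {p_clustered:.3g}")
```

A very small p-value for one band suggests a chromosome position effect, such as an amplification or deletion, rather than a true pathway signal.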
So you could go to the chromosome position part of MSigDB, the GSEA database, and run that against your gene list to see if any chromosome positions are highly enriched in your gene list, which might indicate that there's some chromosome position effect. Similarly, you can use disease associations, if you think that might be relevant, or any of these other sources. The issue with these other sources is that they're quite varied. Fortunately a lot of them, like the first ones here, gene ontology, chromosome position, and disease association, are present in systems like Ensembl. How many people know about Ensembl? Ensembl is a genome browser, like the UCSC Genome Browser. It's run by the Sanger Institute and the European Bioinformatics Institute. I really like Ensembl because it has a nice tool called BioMart, which allows you to make quite advanced queries of the system. You can give it a gene list, and it works like this. You go to BioMart and you select your database; Ensembl Genes is the gene database. You select an organism, so in this case I selected Homo sapiens. Once you've selected your genome, you can select filters. One of the filters is protein domains, for instance: I only want genes to come back that have this protein domain. Or if you click and open this gene box here, you will see that you can type in a gene list, and it will ask you what type of identifier it is, and that will select all those genes. Once you've selected your genes out of the whole genome, there's a little count button, which I didn't explain here, but you can press that to check that it recognized all of your information. And then you can go shopping, which is the BioMart idea, and select a whole bunch of attributes to download. You can download gene ontology annotations, disease annotations, and external identifiers. So this is one way of converting your gene IDs from one type to another.
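The same BioMart steps (pick a dataset, filter by a gene list, choose attributes) can also be written as an XML query instead of clicking through the web page. The sketch below only builds the query text and does not contact any server. The dataset, filter, and attribute names are what I believe Ensembl BioMart uses, but treat them as assumptions and confirm them in the BioMart interface. The two gene IDs are just example inputs.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filter_name, gene_ids, attributes):
    """Build a BioMart-style XML query string: one dataset, one filter, several attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # The gene list goes in as a comma-separated filter value.
    ET.SubElement(dataset_el, "Filter", {"name": filter_name, "value": ",".join(gene_ids)})
    # Each attribute becomes one column in the downloaded table.
    for attribute in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": attribute})
    return ET.tostring(query, encoding="unicode")


query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",            # assumed name of the human genes dataset
    filter_name="ensembl_gene_id",              # assumed filter: input IDs are Ensembl gene IDs
    gene_ids=["ENSG00000141510", "ENSG00000012048"],  # example input list
    attributes=["ensembl_gene_id", "external_gene_name", "go_id"],  # assumed attribute names
)
print(query_xml)
```

Saving a query like this is mainly useful for reproducibility, since you can rerun exactly the same filters and attributes on a new gene list.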
You can find the protein domains that are part of that gene. You can download all the sequences. Quite a lot of information can be downloaded about your gene list, and you can save it as a spreadsheet and use it for analysis. Entrez Gene is a major source of gene attributes as well. Model organism databases often have information, and there are others that we can discuss during the lab. So to summarize, there's a lot of information out there about the function and attributes of genes. The major sources are gene ontology and some pathway databases that are a little bit more active and common. I talked about the gene ontology in depth, and this sort of summarizes some of the main take-home messages, but there are other sources of data. Genome browsers, like the UCSC Genome Browser for human, or Ensembl, which handles lots of different organisms, are a good source of additional information about your gene list. Okay, any questions? But these don't cover everything, so there are always additional sources. Sometimes the only way of getting the information you need is actually going to the literature and taking a paper. There might be a new paper that was just published. For instance, a few months ago the Roadmap Epigenomics project published a huge amount of epigenomics information. It's only relevant for people studying human, but did anyone see these Roadmap Epigenomics papers? A few people. So they published a massive amount of information about DNA methylation and chromatin immunoprecipitation, and that data is only now filtering into the databases, so you might be able to access it early if you go out and look for it. Okay, so I'm almost done. Just to summarize, coming back to this analysis workflow to go over the concept again: the idea is that you have your raw data that you've collected somehow, and you normalize and score it.
One thing I didn't mention is that sometimes you might be lucky and have a core facility that helps you do this. Often a core facility is running next-gen sequencing to do RNA-seq or something like that. How many people get their data from a core facility? Okay, so some people do it themselves, and some people might have their own mass spectrometers in their lab. You can do your own normalization, and the standard techniques are generally available and widely used. But I personally feel it's good for the people who are generating the data to do the normalization, especially if they're doing a lot of it, because they usually have a sense of the biases that might be coming out of the technology they're using. For instance, when you're doing next generation sequencing, you could have lane effects, different batch effects per lane. They might be putting your samples in the lanes in a particular way that is actually useful for the normalization method, and they are the best placed to know that information and the normalization methods that are matched with it. Similarly, if you're dealing with mutation information that is getting called from next generation sequencing data, there's a huge amount of bioinformatics work that goes into calling these mutations, and those mutations are not always called very accurately. If you're thinking about single nucleotide variants, the best pipelines only call those at around 80% accuracy. So a lot of the single nucleotide variants coming out of these big studies are actually noise. And it's even worse if you're looking at indels and copy number variants. I think indels are something like 40 or 50% accurate, but it might be quite variable between different sources. So usually there are a lot of technical details involved in that, and the people doing the sequencing are best positioned to use the latest state-of-the-art methods.
And so that's why I recommend doing that. Sometimes people come to us and they've worked with a core facility, and the core facility charges them extra to do the normalization. Usually it's not that much, and I personally say it's worth it. Just let them do it, unless you're really confident that you can do it and it's very easy. Gene expression by microarrays, for example, is very easy, because it's 15-year-old technology and everybody does it the same way. So that might be easier, but otherwise I recommend that you involve someone who's very knowledgeable in that normalization, and that might be you. Okay, so raw data, normalize, and then you have to create your gene list somehow. It might be differential expression that defines the gene list, or scoring methylation of promoters, or predicted targets of microRNAs, or you have a ChIP experiment that identifies potential target genes of a transcription factor or DNA binding protein. Those are different ways of generating a gene list, and there might be multiple ways of generating a gene list from one experiment. For instance, if you have a lot of gene expression data, you can compute the differential expression between two samples or between two classes, cases versus controls. Or you might have multiple different samples, so you can do all the two-way comparisons, or you can cluster the data like I mentioned, and each cluster is a gene list. Those different ways of creating a gene list will mean different things. They'll be asking different questions. If I'm doing tumor versus normal, or disease versus normal, I want to know what's specific to disease. But I might have two subtypes of disease that I'm comparing, in which case it's not disease versus normal anymore. It's a different question. Okay. And then drilling down. So what we'll do after this is have a break, I think. Is that right?
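As a small illustration of the gene list step described above, here is a sketch that turns normalized expression values into a ranked list and a thresholded list using log2 fold change between cases and controls. The expression values, gene names, pseudocount, and cutoff are all made up, and a real analysis would use a proper statistical test rather than fold change alone.

```python
from math import log2
from statistics import mean

# Made-up normalized expression values per gene for two classes of samples.
EXPRESSION = {
    "geneA": {"cases": [120.0, 140.0, 130.0], "controls": [30.0, 35.0, 32.0]},
    "geneB": {"cases": [50.0, 52.0, 48.0], "controls": [49.0, 51.0, 50.0]},
    "geneC": {"cases": [10.0, 12.0, 8.0], "controls": [80.0, 90.0, 100.0]},
}


def log2_fold_change(cases, controls, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return log2((mean(cases) + pseudocount) / (mean(controls) + pseudocount))


def ranked_gene_list(expression):
    """All genes, ordered from most up-regulated to most down-regulated in cases."""
    scores = {
        gene: log2_fold_change(values["cases"], values["controls"])
        for gene, values in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def thresholded_gene_list(ranked, min_abs_lfc=1.0):
    """Only the genes whose change exceeds the cutoff, in either direction."""
    return [gene for gene, lfc in ranked if abs(lfc) >= min_abs_lfc]


ranked = ranked_gene_list(EXPRESSION)
changed = thresholded_gene_list(ranked)
print([gene for gene, _ in ranked])
print(changed)
```

The ranked list keeps every gene with its score, while the thresholded list is a plain set of hits. Those are two different kinds of gene list from the same experiment.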
When we come back from the break, Quaid will talk about the statistics behind this enrichment analysis, or pathway enrichment analysis. Just another note: I always say pathway enrichment analysis, because pathways are usually the type of information that people want to look at, since it gives them some mechanistic understanding of their data. However, if it's chromosomal location gene sets, it's not pathway enrichment analysis, it's chromosomal location enrichment analysis. I still use the term pathway enrichment analysis because I think it's clearer than just saying gene set enrichment analysis. What's a gene set? You don't learn that in biology. You do learn about pathways. Okay, so Quaid's going to go over the statistics of these things, and we're going to focus on GSEA and g:Profiler, and then there's going to be a lab where you do the enrichment analysis using these tools yourself. Veronique is a TA for the course who's helping lead a lot of the labs. Then in the afternoon I'm going to come back and talk about Cytoscape, which is a network visualization tool that's fairly general for networks. The way we've structured this course, it's also useful for creating a graphical representation of the enrichment analysis called an enrichment map, like the autism example I showed with all those bubbles. That is a tool in Cytoscape that you can use. We're not going to cover too much of this drill-down, but we're noting the names of the software here so that you can go and look at them and see how they work. Some of them are very simple. I think PathVisio is the only one in this workflow that we're not covering. Another one that might be useful is GREAT. GREAT is an enrichment analysis tool that takes in genomic regions, so it also works for non-coding regions that you might have.
So if your data includes genomic regions, not just gene lists, GREAT will take the genomic regions and convert them to a gene list for you using some rules that are specified. And then tomorrow we'll be mostly focused on this part of the workflow, identifying interesting networks. Okay. Any questions? There's just one more quick slide, which is a lab. If you're interested in playing with gene IDs and translating them, you can use this demo gene list, which is a bunch of yeast genes, and convert the genes to Entrez Gene IDs using g:Profiler and Ensembl BioMart. I just put this up here as something to try if you want to play with the tools during the lab session. Any of the instructors will help you with these tools if you want to do that, but it's optional. Okay, so we'll finish a little bit early, but we're on a break.
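To give a feel for what converting genomic regions to a gene list involves, here is a deliberately simplified sketch that assigns each region to the nearest transcription start site within a fixed window. This nearest-gene rule is my own simplifying assumption and is not GREAT's actual rule set. The coordinates and gene names are made up.

```python
# Made-up transcription start sites: gene -> (chromosome, position).
TSS_BY_GENE = {
    "geneX": ("chr1", 100_000),
    "geneY": ("chr1", 500_000),
    "geneZ": ("chr2", 200_000),
}

# Made-up genomic regions: (chromosome, start, end).
REGIONS = [
    ("chr1", 90_000, 92_000),
    ("chr1", 480_000, 490_000),
    ("chr2", 900_000, 901_000),   # too far from any gene on chr2
    ("chr3", 10_000, 11_000),     # no genes known on this chromosome
]


def regions_to_genes(regions, tss_by_gene, max_distance=50_000):
    """Assign each region to the nearest TSS on the same chromosome,
    if it lies within max_distance of the region midpoint."""
    genes = []
    for chrom, start, end in regions:
        midpoint = (start + end) // 2
        candidates = [
            (abs(midpoint - position), gene)
            for gene, (gene_chrom, position) in tss_by_gene.items()
            if gene_chrom == chrom
        ]
        if not candidates:
            continue
        distance, gene = min(candidates)
        if distance <= max_distance and gene not in genes:
            genes.append(gene)
    return genes


gene_list = regions_to_genes(REGIONS, TSS_BY_GENE)
print(gene_list)
```

Whatever rule is used, the choice of window size changes which genes end up in the list, so it is worth recording it alongside the results.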