Ready to go? Okay, welcome everybody. I'm going to give the introduction to the course. We'll go over why we want to do pathway analysis, what pathway and network analysis is, and some of the basic concepts that we need for the rest of the course. Please interrupt me if you have any questions. The basic idea for this course is that we are trying to interpret large gene lists that frequently come from genomics experiments. When I say genomics I basically mean any kind of omics: any kind of experiment that generates a lot of information about molecules in the cell, whether it's transcripts or proteins or even metabolites. Although we're generally focused on gene-related lists, a lot of the concepts will apply to anything that has a large amount of information. So you've done some large genomics experiment, like RNA-Seq or a big genetic screen, and you've produced a bunch of information. Now what do I do? How do I interpret the information that I have? There's a lot of it, and ideally we'd like to automate the interpretation and speed it up. One of the main ways that we can interpret information coming from these large-scale experiments is to ask: tell me what's interesting about these genes in terms of pathways, complexes, functions, any kind of information that's known about the gene. Is there a particular pattern coming out? If I do a gene expression experiment, do most of the genes that are differentially expressed relate to the cell cycle? And is that particular biological pathway important in my experiment? So that's the general idea. I'll use this pointer; can everyone see the pointer?
So usually in a genomics experiment you generate a bunch of information about genes. In this course we're going to use gene expression as the example genomics data type, but it's not the only one. There are many others, and the concepts that we're talking about are very general and apply to the other types too, with some specific modifications. We'll just use gene expression as an example because it's quite popular. Say we collect gene expression data. We can take the information and rank the genes, so we can say these genes are more expressed than other genes. Or we can compare expression between a condition of interest and controls, identify differentially expressed genes, and rank genes by how differentially they are expressed compared to control. If we have a lot of gene expression data we can cluster it. We're not going to cover ranking or clustering in this class, but many people do that on a regular basis. Clustering identifies groups of genes that work together, or have similar patterns across multiple conditions, and that can also generate a list. Any way that you generate a list, you can take that list and then ask: are there any pathways that are present in this list more than expected? That's the basic idea of pathway analysis. This whole area was invented because it saves time compared to the traditional approach. If you didn't have pathway and network analysis tools to help you search gene lists, you'd have to do that yourself, and if you had a thousand genes you'd have to go through each of those genes one by one, look them all up in the literature and learn about them, and obviously that's time consuming. So pathway and network analysis was invented to help you with that. The general idea is that it helps you gain mechanistic insight into genomics data or omics data.
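As a rough illustration of how a gene list can come out of comparing a condition with controls, here is a minimal Python sketch. The gene names, expression values, pseudocount and cutoff are all made up for illustration, and a real analysis would use replicates and a proper statistical test rather than a bare fold change.

```python
import math

# Hypothetical mean expression per gene: (control, condition). Values are invented.
expression = {
    "GENE_A": (7.0, 63.0),
    "GENE_B": (31.0, 15.0),
    "GENE_C": (20.0, 21.0),
    "GENE_D": (3.0, 15.0),
}

def log2_fold_change(control, condition, pseudocount=1.0):
    """Log2 ratio of condition to control; the pseudocount avoids division by zero."""
    return math.log2((condition + pseudocount) / (control + pseudocount))

# Score every gene, then rank by the size of the change in either direction.
scores = {gene: log2_fold_change(*values) for gene, values in expression.items()}
ranked_genes = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)

# Keep genes that at least double or halve (assumed cutoff) as the gene list.
gene_list = [gene for gene in ranked_genes if abs(scores[gene]) >= 1.0]

print(ranked_genes)
print(gene_list)
```

Either the full ranking or the thresholded list could then be taken forward as the gene list for pathway analysis.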
Mechanistic insight means some insight about mechanisms within the cell. A pathway, in general, is a mechanism in a cell, but you could have other, more specific types of mechanisms. For instance, on the last day of the course we're going to talk more about transcription factor targets and predicting regulators, so you might be able to gain insight into a particular transcription factor or a microRNA that is an important regulator in your system. You can consider that a pathway or just a specific aspect of cellular mechanism. However you want to consider it, the general idea of pathway and network analysis is to pull information of the mechanistic type out of a genomics experiment. In my view, this general area of pathway and network analysis involves any kind of analysis that incorporates pathway or network information, and there are many different types. It's most commonly applied to interpret gene lists, and the most popular type is pathway enrichment analysis, but many others are useful. So we'll be talking about pathway enrichment analysis. I'll give you a short introduction this morning, and then later Quaid is going to talk about enrichment analysis in more detail, going over all the statistics. Just to start, I'm going to give you two examples from our own research. These are the best examples from our own research of where pathway and network analysis really helped us understand a genomic system much better than if we didn't have it. The first example is in the area of autism spectrum disorders. This is a collaboration with Steve Scherer, who is an autism researcher at the University of Toronto, at the Hospital for Sick Children, which is close by here. In general he's interested in studying the heritability, or inheritance, of autism spectrum disorders.
Ideally that means identifying genes that potentially cause autism or relate to it, or identifying pathways that are important in autism. This disorder is highly heritable. People know from twin studies that it's at least 50% heritable, although the severity could be different between siblings that inherit the disorder. Certain rare genes are known to cause severe types of autism spectrum disorder, and in the recent history of this field people have found that copy number variants, and in particular de novo copy number variants, which are ones that aren't inherited, explain some aspect of the heritability. Genome-wide association studies, which look at SNPs, have really not identified much. So what this project entailed was studying copy number variants in autism spectrum disorder. They had collected about a thousand cases and a thousand controls, and had identified copy number variants using a SNP array. The SNP array measures each SNP across the genome. You have a million SNPs on this array, and each SNP gets measured for whether it's one version of the SNP or another, and also for its relative intensity. You can use that to identify a region of low intensity, which means a segment of the genome is deleted, or a region of high intensity, which means a segment of the genome is gained by some process. A standard analysis of this study, gene by gene or copy number variant by copy number variant, identified a few genes or copy number variants that are associated with the autism spectrum disorder cases. This was worked on by Daniele Merico, who was in my lab and now works elsewhere in Toronto. We looked at the same data using pathway information, and in contrast to just a few genes we found a rich set of pathways that seem to be affected in this disorder.
And so, we'll learn how to make these maps in this class, but this is an enrichment map. The circles indicate pathways, the size of the circles indicates the number of genes in the pathway, and the connections between the circles indicate pathway crosstalk. So if two pathways share a lot of genes they'll get a strong link. Pathways that are related get grouped together. I should also mention that the color of the circles, or nodes, is proportional to the strength of association with the cases: the more red it is, the more strongly associated it is. So if we zoom in here to one of the places with very red nodes, it's pathways that are involved in central nervous system development, and that makes sense given the biology of the disease, which affects the brain. This analysis was also a bit more complicated than typical, because it merged different types of pathway analysis. In this case, a lot of the pathways didn't contain genes that were known to be involved in autism from previous studies. So there's a question: are these new pathways, or are they wrong? So we also did a pathway analysis on all of the known genes. There are about 150 genes that were previously implicated in autism spectrum disorders. These triangles here represent pathways that are enriched in genes that were previously associated with intellectual disability and also autism. Some of the pathways didn't overlap. But even in this zoomed-in central nervous system development section, you can see that a lot of the pathways that are enriched in known autism spectrum disorder genes are also central nervous system development pathways, and they share a lot of genes.
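To make the idea of links between pathways concrete, here is a minimal Python sketch that scores pathway pairs by how many genes they share and keeps the strong links as edges. The Jaccard index and the 0.25 threshold are assumptions chosen for illustration (the talk does not say which similarity measure or cutoff the map used), and the pathway names and gene sets are invented.

```python
from itertools import combinations

# Hypothetical pathways and member genes, invented for illustration.
pathways = {
    "neuron_development": {"G1", "G2", "G3", "G4"},
    "synapse_organization": {"G3", "G4", "G5", "G6"},
    "cell_cycle": {"G7", "G8"},
}

def jaccard(genes_a, genes_b):
    """Shared genes divided by all genes found in either pathway."""
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0

def build_edges(pathway_genes, threshold=0.25):
    """Return (pathway, pathway, similarity) for every pair at or above the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(pathway_genes), 2):
        score = jaccard(pathway_genes[name_a], pathway_genes[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges

edges = build_edges(pathways)
for name_a, name_b, score in edges:
    print(f"{name_a} -- {name_b}: {score:.2f}")
```

With these toy sets, only the two nervous system pathways are linked, which is the grouping effect described above.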
So even though the same genes were not frequently identified in the copy number variant data versus the previous type of data, the pathways that they were part of were very similar. And so that helped validate the pathway analysis. One interesting thing about the study is that when you looked in each individual pathway, it wasn't the same gene mutated each time in many cases. We kind of expected that, because we didn't see very many genes strongly associated with cases. Instead, it was multiple genes within the pathway mutated in different individuals. That illustrates a strength of pathway analysis: you can take rare or sparse information. In this case, you have genes that are infrequently mutated, but when you look at them in terms of a pathway, the pathway is frequently mutated. We might have 20 genes in the pathway, and there are 20 different individuals that have mutations, and each one has a mutation in a different gene. Normally, if you look gene by gene, you just see one mutation, one mutation, and you can't really say much about that. However, if you know that they're all part of a pathway, you can say, oh, 20 cases are mutated at the pathway level. And these were all gene deletions, so we predicted that they would have an important effect on the pathway. So that illustrates how pathway information can help improve statistical power: we've basically grouped all of the single counts together into a bigger count. The other way that it helps improve statistical power, which we'll talk about later, is that it improves multiple testing correction. Basically, we expect that there are fewer pathways than genes, or certainly fewer pathways than SNPs or mutations.
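The two points about statistical power can be sketched in a few lines of Python. The first part mirrors the 20 genes, 20 individuals example: each gene is hit once, but the pathway is hit 20 times. The second part uses a Bonferroni threshold purely as an illustration of why fewer tests allow a less stringent cutoff; the talk does not name a correction method, and the numbers of genes and pathways tested are assumed values.

```python
from collections import Counter

# Toy data: 20 cases, each with a deletion in a different gene of one pathway.
case_deletions = {f"case_{i}": f"GENE_{i}" for i in range(1, 21)}
pathway_genes = {f"GENE_{i}" for i in range(1, 21)}

# Gene by gene, every gene is seen in only one case.
cases_per_gene = Counter(case_deletions.values())
max_cases_for_any_gene = max(cases_per_gene.values())

# At the pathway level, the single counts add up to one larger count.
cases_hitting_pathway = sum(
    1 for gene in case_deletions.values() if gene in pathway_genes
)

def bonferroni_threshold(alpha, number_of_tests):
    """Per-test significance cutoff under a Bonferroni correction."""
    return alpha / number_of_tests

# Assumed test counts: many more genes than pathways.
gene_level_threshold = bonferroni_threshold(0.05, 20000)
pathway_level_threshold = bonferroni_threshold(0.05, 2000)

print(max_cases_for_any_gene, cases_hitting_pathway)
print(gene_level_threshold, pathway_level_threshold)
```

With ten times fewer tests, the per-test cutoff in this sketch is ten times less stringent.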
And so working with fewer statistical tests increases the power, because we reduce the number of tests and we don't have to apply such a stringent filter given the number of tests. Okay, so the second example that I'm going to talk to you about is the best example that we've worked with, where we've really understood something quite interesting from pathway analysis. This is a collaboration with Michael Taylor, who's a neurosurgeon, again at the Hospital for Sick Children. He studies pediatric brain tumors, and one of the tumor types that he studies is called ependymoma. Ependymoma is a tumor of the ependyma, which is the lining of the central nervous system. It's the third most common brain tumor in children. It's still rare, fortunately; cancer in general is rare in children. People had previously known that you could predict something about the tumor based on where it occurred anatomically in the brain. In particular, if it occurred in the posterior fossa, which is the back of the head, the brain stem and the cerebellum, it was predicted by pathologists to be the most serious type, and those patients get the most aggressive therapy. There's no targeted therapy and no chemotherapy available for this tumor, so it's radiation treatment and surgery. And that's quite devastating. We want to limit the amount of that type of surgery; it would be much better if you could get more specific, targeted treatments. So Michael has been studying this cancer type. A number of years ago, he collected gene expression data from about 100 subjects, focusing in particular on this very serious posterior fossa type.
But what he found is that it's not just one type. It turned out, when he clustered the data, that there were two types, type A and type B. Type A affected the youngest patients and had a very terrible outcome. Type B affected the oldest patients and had an excellent outcome. So this is very interesting: people would group everybody together and say a very poor outcome is expected, and give the most severe treatment. But actually, about half of the individuals are expected to have an excellent outcome. So it's basically two different diseases that had previously been grouped together as one. Focusing on this disease further, he collected a lot of whole genome sequencing data and exome sequencing data. And strikingly, there were no recurrent mutations found. Actually, in each individual there were only a few mutations, like two or three, and mostly I'm talking about single nucleotide variants in that case. That was basically the first time that had ever happened in cancer research. We know that genome instability is a hallmark of cancer biology, and here's a cancer that basically doesn't have genome instability. One of the reasons for that might be that it's a pediatric tumor. Pediatric tumors are known to have fewer mutations, because mutations correlate with age: the older you get, the more mutations you get, and older tumors have more mutations. So that could be part of the reason why we don't see many mutations, but it doesn't explain what's going on here. So Michael's lab, his student Steve Mack, measured DNA methylation genome wide, and found that there were about 2000 genes that seem to be silenced by methylation. They had very high levels of methylation in the promoter regions, in the CpG islands of the genes.
And doing a standard pathway analysis didn't really identify anything interesting, any mechanisms, from these 2000 genes. So Scott Zuyderduyn in my group did a pathway analysis using a more appropriate statistical test that was a little bit more sensitive, and also a bigger database of pathway information that we collect, and found a very strong signal for these 2000 genes: we predicted that they are targets of a particular protein complex. The complex is PRC2, polycomb repressive complex 2. It methylates histones, and then DNA gets methylated, so it seems like it's related to methylation. I forgot to say that the methylation data also clustered into the two groups, A and B, so that was very clearly related to the gene expression results from before. In any case, PRC2 has subunits, and the subunits have been studied individually; EED and SUZ12 are subunits of PRC2. So all of the pathways that came out are really just saying that they're all related to the same thing, this protein complex. This plot, I'll just explain briefly. The length of the bar here corresponds to the strength of enrichment. It's the negative logarithm of the p-value, which means that the longer the bar, the more significant it is. These ones here are very significant in group A, and nothing significant came out in group B. When Michael started looking into this complex further, he realized that this is a hot topic in drug development. In general, a lot of drug companies are very excited about epigenetic drugs that target methylases: DNA methyltransferases and protein methyltransferases. And so drugs have actually been developed for these processes in general.
And so they were able to try a number of these chemical probes and drugs in various models, and found that they specifically killed the tumor cells in models derived from patient samples. So that was very interesting. This is basically the first time that a mechanism has been identified in this tumor. Previously, there was just nothing known about it, and any kind of chemotherapy that had been tried had failed. The only treatment was the hundred-year-old treatment of surgery and the decades-old treatment of radiation, which is generally applied, not very specific and very damaging. And so here's a potential mechanism that can be targeted. And not only that; very interestingly, a patient was actually able to be treated based on this information. In this case, there was a patient who had reached the end stage of the disease. There were no more treatment options for them. The tumor had metastasized to the lung and it had doubled in size; that's what this picture represents. So they decided to take a general on-the-market drug that was the closest drug available. It's a drug that generally targets DNA methyltransferases, and so it's expected to remove DNA methylation. One course of this treatment stopped the growth of the tumor, and the child regained their energy. That actually lasted for 15 months, which is quite amazing, because there were basically no more treatment options for this child. So now this result has led to the development of a couple of clinical trials, which are recruiting patients. The take-home message here is that this is a really good example of where we were able to take a bunch of different genomics data. We had information from the transcript level, the DNA mutation level and the methylation level.
And within a short amount of time of collecting the DNA methylation data, like a year or so, we had a mechanism identified, and very fortunately for us there are all these drugs that people are using to study those mechanisms. And then, even more fortunately for patients, one of these drugs that's on the market worked, and now they're studying this further. So again, that illustrates that you can get information about mechanism, and then it's sometimes actionable. It doesn't happen frequently, but when it happens, it's really great. Okay, so in general, what are the benefits of pathway analysis versus analyzing individual genes, transcripts, proteins, SNPs, mutations or other things that come from genomics or omics data? It's easier to interpret, because it generally works with familiar concepts like pathway names, which people learn in undergraduate biology. It can identify mechanisms that are potentially causal. It predicts new roles for genes. I mentioned that it can be used to improve statistical power. And it's often more reproducible. For instance, people who have studied biomarkers have identified signatures based on gene expression data. The earliest results on this came from the field of breast cancer analysis: people collected gene expression from different breast cancer cohorts, and they found that the signatures they learned from each cohort were non-overlapping; there were no genes in common. However, if they looked at the pathway level, a lot of those genes were part of similar pathways. That illustrates how mapping things to the pathway level can sometimes be more reproducible. It also facilitates integration of multiple data types, because it provides a skeleton that you can layer different types of information on, and that's something that we'll learn about later as well. Okay, any questions? So I've talked about pathway and network analysis.
Often I use the term pathway analysis; I just think it's easier. There's different terminology used, but a pathway is a concept that biologists know about. I don't know if anybody could really specifically define it other than to say it's a process, but I consider any kind of mechanistic information to be pathways. There are different ways of representing pathway information, and there are two main types. The typical pathway way of representing data is as a stepwise process, and this is what most people learn starting in high school biology if you study glycolysis. There are these enzymes and these steps, and this conversion leads to that conversion, so pathways typically have this sort of stepwise process. There's also a lot of information that we get from large-scale studies that is not easy to represent in the stepwise process manner. For instance, people have been spending a lot of time mapping protein interactions at a large scale, and we don't know exactly the steps that are involved in pathways there, so we just represent it as a network of connections. There's a lot of network information available, and it can be very useful for helping us understand more about the mechanism of a particular process or condition that we're studying. We'd like to use all of this information together, but just so you know, when we use these terms this is what we really mean. Okay, so there are many types of pathway analysis, as I mentioned.
The standard one that almost everybody uses is pathway enrichment analysis. Usually that means we take a pathway and we just take the genes in the pathway; we forget about how those genes are connected. So we just take a list of the genes: these are the genes involved in the cell cycle, or in glycolysis. Then we see if that list is over-represented, or present in our own gene list more than we expect, compared to how many genes are annotated to that process in the genome. I'll go over this a few times, but the example is: I have a hundred genes in my list, and 50 of them, so 50 percent of them, are cell cycle genes. That means half of my gene list is cell cycle. When I look in the genome it might only be five percent, so there are 10 times more cell cycle genes on my list than I expect. So obviously that's statistically enriched. That's the general idea of enrichment analysis.
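The arithmetic in this example can be written out as a minimal Python sketch. The fold enrichment follows directly from the numbers above (50 of 100 genes versus 5 percent of the genome). The hypergeometric tail probability is one common way to attach a p-value to this kind of over-representation and is used here as an assumption, since the statistics are covered later in the course; the genome size of 20,000 genes, and therefore 1,000 cell cycle genes, is also an assumed value.

```python
from math import comb

def fold_enrichment(hits_in_list, list_size, hits_in_genome, genome_size):
    """Observed fraction in the list divided by the fraction in the genome."""
    return (hits_in_list * genome_size) / (list_size * hits_in_genome)

def hypergeometric_upper_tail(hits_in_list, list_size, hits_in_genome, genome_size):
    """P(X >= hits_in_list) when list_size genes are drawn without replacement."""
    most_possible_hits = min(list_size, hits_in_genome)
    favourable = sum(
        comb(hits_in_genome, i) * comb(genome_size - hits_in_genome, list_size - i)
        for i in range(hits_in_list, most_possible_hits + 1)
    )
    return favourable / comb(genome_size, list_size)

# From the example: 100 genes in the list, 50 of them cell cycle genes.
# Assumed: 20,000 genes in the genome, 5 percent (1,000) annotated to cell cycle.
enrichment = fold_enrichment(50, 100, 1000, 20000)
p_value = hypergeometric_upper_tail(50, 100, 1000, 20000)

print(f"fold enrichment: {enrichment:.1f}")
print(f"p-value: {p_value:.3g}")
```

The sums use exact integer arithmetic, so the very small p-value is computed without intermediate rounding until the final division.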
One of the benefits of that analysis is that it's very easy to do, and that's why it's the most popular one. One of the disadvantages is that it only uses information that we know about. We take the genes from a known pathway like glycolysis, but many genes are not known to be part of pathways. We don't know what their function is, but they may be part of big networks that people have collected. So the second type tries to use this network information to find regions of the network that are highly connected and also differential, or changed somehow, in the condition of interest. If we're studying gene expression data we can find a region of the network that's differentially expressed, or if we're studying mutations we can find a region of the network that's highly mutated. That might identify a part of the network with genes that we don't know much about, so it's less biased. On the other hand, you then have to figure out what those modules, those genes, do, but at least you know that they may be working together somehow. And then there's a third type, more detailed modeling, that involves a more detailed representation of the data. We might have anything from the relationship of the mutation to the gene expression, to the protein expression, to protein post-translational modifications. We incorporate all that data and try to say, for instance, does this mutation explain the expression change? That's using additional information, like exactly how mutations in the DNA are related to expression. This is from a review that we did on cancer, so these are cancer related, but this third type is useful if you have a specific mutation that you want to understand, like in a specific individual that you're studying, or how
it relates to the processes that you see are altered. Let me say that today we're going to focus mostly on this first one, tomorrow we'll talk more about the second, and we're not really going to cover the third type. There are tools available, but they're not as frequently used, because as you go from one to two to three you generally need more information. Some of these methods require information on multiple levels of omics, and not everybody has that; a lot of people just have RNA-seq data. They don't always need that, but they may require more detailed pathway information, which we have a lot of, but again not as much as we have in the general network and gene set area. Okay, any questions? Okay, so this is the overview of the pathway analysis workflow that we're going to cover, mostly today and tomorrow; the third day mostly focuses on transcription factor analysis, or regulator analysis. The general idea is that you've collected some kind of genomics data, like gene expression data. We don't cover in this course exactly how you collect that data, but if you have questions about anything related to what we're talking about, you can always ask. Once you have that information you need to normalize it and then score it somehow. For instance, with gene expression data people often compare condition to control and compute differential expression, and depending on the platform and exactly how you're doing things there might be different ways of doing that. That usually generates a gene list. The gene list could be the list of all of the genes that are differentially expressed, or it might be genes coming from a cluster, as I mentioned before, if you're clustering. After you have the gene list is where we really want to focus in this course: we want to learn about the underlying cellular mechanism using pathway and network analysis. There are three main steps. One is to run a
typical pathway analysis, which we'll talk about, and that helps you visualize and identify interesting pathways and networks. So what does interesting mean? It depends what you're interested in, and unfortunately that part can't be automated, at least unless you invent a machine that does your homework for you. What a lot of these tools try to do is use signatures of things that we think are interesting. For example, if a pathway is enriched in the gene set, then it's probably relevant for the condition that I'm studying. Whether that's interesting to you might depend on whether it's novel or known. It might be very enriched and you say, oh, everybody knows that, let's skip that one, but then the next one might be something interesting and new. So these tools help you identify a list of potentially interesting pathways. Then we also have ways of visualizing the results, and there's an aspect of exploration: you want to explore the results to try to figure out this difference between novel and known, to really identify things that are interesting. Once you've found something of interest, you can drill down to better understand it. You might identify a pathway that looks interesting, and at that point you just know that this pathway is potentially interesting. You could then overlay your gene expression data on a diagram of that pathway, and you might see more details, like the positive regulation part of the pathway is up and the negative regulation is down, or something like that. That might give you a little bit more insight. There might be genes that you don't know anything about, and so we can use gene function prediction to identify potential functions. We'll talk about that tomorrow in the GeneMANIA lab. And then
ideally, once you have a model, that goes into a report or paper. Okay, so that's the overview. Here's a more detailed version that tries to cover a lot of different cases. I'll just go from the top here. The blue boxes are all related to different types of data, and you can see how many different types of omics data there are. The orange boxes represent different ways of scoring and normalizing depending on the data type, and then they all point to a central box that produces a gene list. Once you have a gene list, you identify interesting pathways or interesting network modules, and there are different approaches to this. I've added little yellow boxes that talk about different software tools that relate to each area, most of which we'll cover in this workshop, but not all of them. And then at the end, these ones relate to the mechanistic drill-down that I mentioned. Okay, so I'll go through this in more detail. Gene lists are not all the same; they come from different types of experiments. We talk a lot about gene expression in this course, but you could also be doing a protein interaction screen, and the gene list might be a set of proteins that are binding to my protein of interest. Or if you do a ChIP, chromatin IP, you might find regions of DNA that bind to the DNA-binding protein of interest, or do a similar analysis for microRNA molecular interactions. You could be doing a genetic screen, or an association study in a cohort of individuals to identify mutations that are correlated with a phenotype. Each of these things has a different meaning. We might be expecting that our gene list identifies aspects of the biological system like pathways or complexes, but it also
could be a screen that identifies tissue location or cell type. Or, for some of the genetic screens, we might be identifying regions of the genome, and that isn't gene specific: we basically will include all the genes in the region even though we know that maybe not all of them are providing signal. So it's important to understand that, and that's this top box here. I put this together because I frequently got questions like, okay, you're talking about gene expression data, but I have protein expression data, or I have GWAS data, and how does that relate? So these arrows here try to explain that a little bit. A big aspect of confusion is how you convert the raw data to a gene list. Sometimes it's very easy and sometimes it requires more steps, so these little yellow boxes try to make that a little bit clearer. Depending on the data type, you can go in directly. Like this one: if you're mapping protein interaction networks you can go directly into network analysis, because your experiment is generating networks. However, if you are sequencing genomes, you have to identify and filter variants, and then you have to identify significant or recurring variants, or score them somehow. So there are two steps before you can get to a gene list, and you also have to link those variants to genes, which is sometimes challenging, especially if you're working with non-coding variants. The key point here, which again we're not going to cover the details of, but you're welcome to ask questions about, is that we assume that you are using standard techniques for normalization, background adjustment and quality control, and using statistics that will increase signal and reduce noise. Frequently these are very standard, and increasingly they are handled by core facilities. When genomics, when omics, started out, a lot of people were
measuring gene expression using their own hand-built microarrays, cDNA microarrays that they built themselves in their lab, and there were tons of issues with that: you had to basically build the whole system and figure out how to analyze it. These days, usually you take your sample to a core facility, they collect the genomics data of interest for the main types, like RNA-seq or whole genome sequencing, and then frequently they have the capability to process the data using standard pipelines. I highly recommend that you take advantage of that. Usually the people at the core facility are maintaining their pipelines, and they identify new software when it comes out, evaluate it and integrate it. They also know about problems with their machines. So there might be batch effects, where they got a batch of a particular reagent from the supplier and that batch was acting differently than others. You won't be able to figure that out if you're doing it yourself and you just did it one time, but if they're doing it a hundred times, they can see that there's some shift and they might be able to correct for that. So generally, as long as the core facility is able to do all of these things, I recommend taking advantage of it. Sometimes they charge you extra, but I think it's worth it. But it's definitely possible to do these things yourselves as well, and a lot of the methods are standard. If you take a recent publication that has done some kind of analysis and they describe their methods well, say, "we used various R packages and we did this workflow", or various software tools, you can do that yourself. It's straightforward enough, but compared to a core facility it usually takes more time, unless you're going to do it a lot. Okay, so that's this middle section here, and then the last section here is really what we're focusing on in
this course. It's about the biological question: what do you want to accomplish with your list? The simple version is that we just want to discover pathways, or other aspects of gene function, that we didn't know about before that are related to the condition that we're studying. So if we're studying a model organism in response to some perturbation, or a cohort of humans or animals, and we're interested in seeing which processes play a role in a particular condition, then simply summarizing the pathways that come out of the gene list is a great first step. In fact, it's frequently a great first step for most kinds of genomics data. But there are a lot of other types of analysis that you can do. Differential analysis is also very common: what pathways are different between samples? You can also, as I mentioned, try to find a controller for a process, like a transcription factor or a microRNA, and we'll talk about that again on day three. You might be able to find new pathways and new pathway members; that's the network analysis part that I mentioned, which is going to be a focus of tomorrow. And you can discover new gene function; we'll talk about how to predict gene function tomorrow as well. 
You could correlate the pathways that you have with a phenotype, and that might help with candidate gene prioritization. So if you've identified a region of a genome, a locus that is associated with a disease, and it has 20 or 100 genes in it and you don't know which one is relevant, usually you have to go through them and figure out how they might be related to the phenotype. But if you've identified pathways that are enriched across your whole genetic experiment, you might identify specific members of the pathways, and that will help you prioritize. And finally, you might be able to find a drug. We did that with the ependymoma study. There are databases of known drugs and their targets, and you can use those databases to ask, given a pathway or genes that you see are interesting, are there any known drugs that target those? It's generally a simple lookup. Okay, so I mentioned this a few times already; I repeat myself sometimes just to emphasize key points. So today we're going to focus on pathway enrichment analysis, the summarize and compare part; tomorrow is more about network analysis, including gene function prediction; and day three is more about regulatory network analysis. Okay, so that's all of this part here. This doesn't cover regulatory network analysis so much, but that will be the focus on day three. Okay, so in the next part I'll talk about pathway enrichment analysis in general, to get us set up for the second presentation this morning, and I'll cover basic concepts as well. Some of you may know about them, but we're going to cover them just to make sure that everybody knows about them, and hopefully you'll learn something new in any case, even if you have already worked with this type of data. So again, the pathway enrichment analysis idea is that you have a gene list from your experiment, and that's
represented by this blue circle. This is a Venn diagram, which is usually used to represent sets and overlapping sets. So we have a list of genes from my experiment and a list of genes from a pathway, in this case, say, neurotransmitter signaling, and they overlap somehow, so there's some number of genes that are in both lists. Now, how do I know if this overlap is significant or if it's just expected by chance? If I have really big lists, I'm definitely going to get some overlap by chance, even if only one of the lists is really big. The bigger either or both of these circles are, the more genes they have, the more likely it is that I'll get overlap, right? So there are a number of statistical tests that measure this. The standard one is Fisher's exact test, which is related to the chi-square test, and Quaid is going to go over this in more detail, but there's actually a range of statistical tests for this type of analysis. So, without going into too many details, I'll just mention this as a general idea. Okay, so what we need to do this enrichment is a list of pathways, so that's these circles. This is looking at my gene list overlapping with one pathway, but if you have a thousand pathways, you do this a thousand times, and you need a database of pathways. So we need to get that information from somewhere. It also requires that you've treated this list in a particular standard way. In particular, you have to use gene names or gene identifiers that other people use, because you can't get overlap if you can't even match up what you're talking about. If one person uses gene name A and another person uses a different name for that gene, you're not going to get a match, right? So you need to understand something about how people name genes and how people
represent genes in databases, so that's what I'm going to cover first. I'm basically going to cover these two blue boxes in the intro, and then after the break we'll actually learn about the concept of how you discover enriched pathways and the statistics behind it. Okay, so gene and protein identifiers. This is something that many people know, but you may not know all the details. An identifier is ideally a unique, stable name or number that helps keep track of database records. You could think of your social insurance number or your driver's license number. A typical gene identifier is an Entrez Gene ID. Entrez Gene is the gene database maintained by the US NCBI. The NCBI is the National Center for Biotechnology Information; it's part of the National Library of Medicine and runs PubMed, so everybody knows about that. They maintain lots of different databases, probably over a hundred, and each one of those databases has its own type of identifier. So that means that there are lots of different types of identifiers for a gene. And it's not only genes that are represented: when we say genes, we often mean the DNA region, the RNA transcript that's expressed from that region, different versions of that RNA transcript, different versions of the gene that have different start sites, and all the proteins. You could imagine that each protein that's post-translationally modified somehow, that's cut up or chemically modified, is a different molecule, so you could keep track of each one of those, and each one gets its own identifier. That creates a complicated array of potential identifiers for these things in different databases. So it's important to understand that there are multiple identifiers for a gene and that there's usually a type of molecule associated with each one. So if you're working with Entrez Gene, it's
really talking about genes. You can't find the sequence in the gene database; it just tells you this is a gene. There's not necessarily one sequence; it's an information concept. When you go to the transcript database, then it has the actual sequence, and likewise in the protein database. Okay, there are a lot of identifiers. Just for your information, these are some examples, so you can look through those and see how people represent these things. Usually, once you see enough of these, you start recognizing them, but sometimes you can't really distinguish them. For instance, a number of databases just use integers, and those you won't be able to distinguish: if someone gives you a list of numbers and another list of numbers, you won't necessarily know what database each is from. The ones highlighted in red are ones that we recommend. Entrez Gene is a really good one because it's fairly stable and unique and it focuses on the gene. I'll talk about recommendations in a sec, but the ones that are highlighted in red are the ones I recommend thinking about. Okay, so there are a lot of identifiers, and it's also important to recognize that some software tools only recognize certain types of identifiers. As time goes on, this gets better. It used to be a really big problem a long time ago, especially when people were working with Affymetrix arrays, because each version of the array came out with its own identifiers and you had to map them, and it was very complicated. Now with RNA-seq it's easier, but you still sometimes see issues with identifiers that are linked to the genomics technology itself. This slide goes over four main uses of identifiers. I won't really go through them, since I've already explained why identifiers are important, but you can think about these. Fortunately, there are identifier mapping services, so if
you need them, they are available, and often these are integrated with the tools themselves. So for pathway enrichment, if you have a list of identifiers that is not the default one recognized by the tool, you can convert it to the one that is. You do have to be aware of ambiguous identifier mappings. In this case, this g:Profiler tool, which we'll cover in this workshop, alerts you to say that one of the identifiers that you put in your gene list could mean two different genes, so which one do you mean? Okay, so there are some challenges relating to this, and just to emphasize the point... Yes? Sorry, I didn't hear the last part of the question. [Audience question, partly inaudible.] I'm not saying here that you shouldn't use a particular identifier type. If it's convenient for you, that's good. It's just that these ones are the ones that are most frequently recognized by pathway analysis tools; that's the main reason why they're recommended here. [Audience: For some reason people prefer Ensembl. Is it because there are no duplicates?] Right, there are no duplicates; Ensembl keeps track of their identifiers very carefully. But this is not meant to be comprehensive. For certain reasons, a lot of pathway analysis tools have used Ensembl identifiers, and actually the real reason is that Ensembl provides a lot of nice APIs, systems that allow people to build software on top. The tools could just as easily have been built on UCSC or something else, but people have taken advantage of those APIs and so they happen to use the Ensembl identifiers. [Audience comment, partly inaudible.] Yeah, I mean, actually Entrez Gene, I think, is the biggest one; that's why this is bolded here.
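The identifier conversion step described above, including the check for ambiguous mappings, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the mapping table, the identifiers in it and the function names `build_lookup` and `map_identifiers` are all made up for the example, and it is not how g:Profiler or any other tool implements its conversion.

```python
from collections import defaultdict


def build_lookup(pairs):
    """Build a dict from alias to the set of target IDs it can refer to."""
    lookup = defaultdict(set)
    for alias, target_id in pairs:
        lookup[alias.strip()].add(target_id)
    return lookup


def map_identifiers(gene_list, lookup):
    """Split a gene list into uniquely mapped, ambiguous and unmapped names."""
    mapped, ambiguous, unmapped = {}, {}, []
    for name in gene_list:
        targets = lookup.get(name.strip(), set())
        if len(targets) == 1:
            mapped[name] = next(iter(targets))
        elif len(targets) > 1:
            # One input name points at several genes: the user has to choose.
            ambiguous[name] = sorted(targets)
        else:
            unmapped.append(name)
    return mapped, ambiguous, unmapped


# Hypothetical mapping table (alias, target ID); all values are invented.
TOY_PAIRS = [
    ("GENE_A", "1001"),
    ("GENE_B", "1002"),
    ("ALIAS_X", "1003"),
    ("ALIAS_X", "1004"),  # the same alias is used for two different genes
]

query = ["GENE_A", "ALIAS_X", "NOT_A_GENE"]
mapped, ambiguous, unmapped = map_identifiers(query, build_lookup(TOY_PAIRS))
coverage = len(mapped) / len(query)

print("mapped:", mapped)
print("ambiguous:", ambiguous)
print("unmapped:", unmapped)
print(f"coverage: {coverage:.0%}")
```

Reporting the ambiguous and unmapped names separately, instead of silently dropping them, is the design choice that matters here: it makes it visible when a list does not reach full coverage or when one name could mean two genes.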
So I'll get into recommendations in a sec. So there are some challenges to mapping identifiers. One of the main challenges, as I mentioned, is that if you make a mistake with identifier matching, you could make a few different types of mistakes when you're comparing your data to pathways. One, you could just miss a gene: if I'm using an identifier that's not recognized, I won't match to the pathway, so that's obvious. Worse is if you match the wrong pathway, and that means that you're actually going to make a mistake; you might get the wrong information coming out. In general, the gene name that's used colloquially in the literature is often not a good identifier because it hasn't been standardized. And I do distinguish between symbol and name: the gene symbol is usually standardized by a community group. For the human genome, there's the HUGO Gene Nomenclature Committee; for model organisms, it's the model organism databases. They say, okay, this is the official symbol that we're all going to use for the gene. But even then, when you have names, it's sometimes difficult. One case that many people may be familiar with is that if you use Excel, which is very commonly used as a spreadsheet to manage gene lists, Excel automatically converts certain gene names because it recognizes them as dates or other things. How many people have seen this? Yeah, so most of the class is familiar with this. OCT4 is an important transcription factor in stem cell biology, and Excel will just convert it to October 4th. This is really quite insidious, and people have written papers about this. In general, another challenge is that you might
not reach a hundred percent coverage. So you might have a hundred genes in your list and only 95 of them can be mapped. Sometimes there might be good reasons for that. The genes in your list come from a genomics experiment and are based on an annotation that you used at one particular moment in time, and the genome annotation changes over time. So you might have a gene in your list that's no longer considered a gene; it's been moved to pseudogene status or something like that. There are a good hundred genes in the human genome that, every time we check, are flipping back and forth between gene and pseudogene, and sometimes they're just erased completely when people realize that they're probably not really genes. So because of that, you may not really expect to always have a hundred percent coverage, and it gets worse if you're working with older data compared to the current genome annotation. And just on the Excel thing, there are a couple of papers. There's a paper in 2004 that talked about how bad the problems are with Excel. They found that a lot of databases have started to internalize names like "October 4th" for OCT4 as a synonym, because people kept on loading data in which October 4th was mapped to an Entrez Gene ID. And then even just in the past eight months, I think it was the end of 2016, a paper came out, "Gene name errors are widespread in the scientific literature", and it basically said don't use Excel, just don't use it. There are ways of using Excel, though. You can't turn off that feature, so you have to remember to paste your gene list as text, because if it's pasted as General then it will automatically try to recognize the type, and one of the types is date, and that's where the problem comes in. Anyway, just one little story of warning that's kind of interesting: here's a paper that was published quite a while ago in Nature that was
retracted later because they made a mistake with their gene identifiers. There are two genes that were called HES-1, and they wrote a whole paper about HES-1, and then soon after it was published people said, "That's not the HES-1 that you really mean." It was actually that they made a mistake in the database search, and they had to retract the paper, unfortunately. So it does get to that stage. Okay, so these recommendations are for genes and proteins. In general, most of the pathway analysis methods that are out there, in fact most bioinformatics methods, don't really consider splice forms. Splicing is obviously important and it can be considered, and we have increasing information about it, but in general we still don't have a lot of information about splice forms. Frequently we don't know how to distinguish the function of one splice form versus another, and so mostly all splice forms get lumped together with a gene when you look up a pathway. So the Wnt pathway will have a gene in it, and one of the splice forms from that gene might not be involved in the Wnt pathway, but it will still be lumped together. That's a limitation of the resolution of where we are in biology today. Not exactly, because you could really study the splice forms if you want, but a lot of genomics technologies don't give you very comprehensive and accurate information at that level. So again, this course focuses on genes for that reason. Okay, so the main recommendation is to map everything to Entrez Gene IDs or official gene symbols using spreadsheet software, preferably not Excel, but most people use Excel anyway; you can use it, just be careful. If you really want to have 100% coverage and you can't map one identifier to another, you can just manually look it up and see if you can figure out why it's not there, and you can curate it. And then the
recommendation of this last paper in 2016 was to use an alternative spreadsheet system, like Google Sheets on the web or OpenOffice, which reportedly don't have this date conversion issue. Okay, so just to summarize: genes and their products can have many different types of identifiers. Genomics can sometimes require the conversion of one identifier type to another. ID mapping services are available, increasingly as part of pathway analysis tools. And if you use standard identifiers like gene symbols and Entrez Gene IDs, it will reduce headaches for identifier mapping, but it won't eliminate them all because of some of the issues that I mentioned. Any more questions on that? Okay. So the next part is thinking about pathways and trying to learn a little bit about where pathway information comes from. Pathway information basically mostly comes from databases. There are other types of attributes for genes that you can use in pathway enrichment analysis, and some of these are really not pathways. So you'll notice that some of the pathway enrichment analysis tools are called gene set enrichment analysis tools, and the reason for that is that they will work with any gene set. It doesn't have to be a pathway gene set: it could be a set of genes at a chromosome position, or a set of genes that are associated with a disease, or a set of genes that are the targets of a drug. But the reason why I focus on pathways is that I find that's typically the first thing that everybody wants to do when they have an omics experiment, to learn about pathways, and those other things are often more detail-oriented things that you can do afterwards. Okay, so focusing on pathways: pathway information comes from databases, and a lot of these databases, like Reactome, which we'll learn about tomorrow, store a lot of detailed information about pathways. All the biochemistry of the pathways is specified in a high level of detail. But remember that
when we use that pathway information for standard pathway analysis, we forget about all the details and we just say which genes are part of which pathway. And there's a really good source of information that stores pathways at that level, and it's Gene Ontology. How many people know about the Gene Ontology or have used it? Okay, so quite a few. So I'll just go over this quickly so everyone's on the same page. Gene Ontology has two parts. One part is a set of biological phrases, or terms, that describe gene function, and they're applied to genes. So you can have the term protein kinase: a gene could be a protein kinase. Apoptosis: the gene could be part of the apoptosis process. Membrane: the gene could be part of the membrane or localized in a membrane. It also is a dictionary, because each of these terms has a definition, so it's actually quite useful as a dictionary for biology. And it's also an ontology, and an ontology is a formal system for describing knowledge. The first versions of this date back to the ancient Greeks, with Aristotle trying to categorize the world as earth, air, fire and water; that's like the first ontology. The Gene Ontology is a lot more scientifically oriented and a lot more detailed. So here's an example of an ontology. Each box here represents a term, and the terms are related to each other. In general, the terms at the top of the hierarchy are very general terms and the ones at the bottom of the hierarchy are specific terms. So here's a term at the bottom, B cell apoptosis, and it's a type of cell death, it's a type of homeostasis, and so you get different branches going up, and then right at the top it's a type of biological process. Biological process is so general that everything underneath is a type of biological process. So within the hierarchy,
these lines indicate different types of relationships, like apoptosis is part of this other process, or it's a type of this process. So it describes gene function at multiple levels of detail, and terms can have more than one parent. Usually when a gene is annotated to one of these terms, it automatically means that it is also annotated to all of the terms above it. What that means is that Gene Ontology often associates a lot of terms with a gene, and sometimes that causes issues because you have to figure out how to deal with all of these terms. This course will definitely teach you one way of dealing with that. Okay, so Gene Ontology covers three major aspects of biology: cellular component, which is where things are in the cell; molecular function, which is the enzymatic or chemical function of a gene; and biological process, which is basically pathways. Usually when we do pathway analysis, we focus on biological process. I find that it's better to just focus on biological process in the beginning, because if you include molecular function or cellular component, you include a lot more information and it doesn't give you as much insight as the biological process part. When I see the results of a pathway enrichment analysis and it has a whole bunch of cell locations, that's not going to tell me as much as biological processes, pathways like apoptosis and the Wnt signaling pathway and other things like that. So instead of including those other things and having them come up and add a bunch of extra noise and information you have to deal with, just start with biological process as the pathway source. So again, that's a general recommendation that I have with pathway analysis: make sure the set of things that you're analyzing in your database of gene sets are pathways, at least to start with. The other things you can
add on later, and they will help with interpretation. Okay, so there are two parts of Gene Ontology. There are the GO terms that I explained, and these are added by people who work in the Gene Ontology project. You can request new terms, and they add new terms over time. I noticed between last year and this year that they added very few terms, even though before that they were growing rapidly, so I think they're reaching a slightly more stable point. They have tens of thousands of terms, almost 45,000 terms in Gene Ontology, which is a lot. And then the second part is annotations, and this is the part that's still very incomplete. When you take one of the terms from the dictionary and you put it on a gene, like "gene A is part of the cell cycle", that's called an annotation. When you do that, there's additional information that the curator adds, including the evidence for why they made that link. And a key point, and I'll talk about the evidence, is that a lot of the annotations are created by people, or by people reviewing electronic systems, and some of them are created automatically without any human review. Okay, so I mentioned hierarchical annotation: a gene can be part of multiple terms, and even if it's part of one term, it automatically gets all the parent terms, so that can create, as I said, many terms for a gene. And I also mentioned that there are different types of annotation. There's annotation curated by scientists, which is expected to be higher quality, but unfortunately it's smaller in number because it's time-consuming to do. There's reviewed computational analysis, where at least somebody's looking at the output of the computer program. And then quite a lot of annotation is electronically annotated without human
review, and these are typically thought to be low quality compared to the manual ones. Now, that's not always exactly true. Sometimes automatic computational prediction methods can be extremely accurate. An example is transmembrane domain prediction: if you have a protein sequence and you want to predict if there's a transmembrane domain, and thus that it's going to be located in the membrane somewhere, that's extremely accurate. But others are very inaccurate. So a key point is to be aware of annotation origin. And how do you do that? Fortunately, Gene Ontology has made a bunch of evidence types, and they're all coded. We won't go through these in detail, but the ones at the top are codes that indicate that there's experimental evidence, these are codes that indicate reviewed computational analysis, and these are codes that basically indicate somebody read a paper and typed in the information from the paper. And then IEA is all the other ones that are not human reviewed. So often you'll see a button to exclude IEA in a pathway analysis tool, and I actually recommend that you do that in the beginning, because you want to focus on pathways that we have higher confidence in. The only case where I don't recommend that you do that is if you are working with an organism that hasn't had a lot of curation. Some model organisms are curated better than others, as I'll talk about in a second, but also if you're not working with a model organism, then you usually are using data that's all inferred from electronic annotation, and then you have to work with that information. Sometimes, as I said, it can be good, but you just have to be aware of where it comes from. Okay, so I briefly alluded to the point that annotation is dependent on a number of factors, including species, so some organisms are better annotated; some
organismal genomes are better annotated than others. Gene Ontology is applicable to any organism. All major eukaryotic model organism species and some bacterial and parasite species have curated annotation, and the full list is available on the Gene Ontology website. But it's important to recognize that there's variable coverage, so depending on the organism that you're working on, you can have better or worse coverage of Gene Ontology. Also, fewer or more genes could be covered only by IEA, the inferred from electronic annotation code, so you might have varying coverage of the higher-confidence annotations or the less confident annotations. This is an example to show you what the variability is for different model organisms, so you can see some of them have very few experimental evidence sources. Just to mention that a lot of databases contribute to this, and depending on the community that you're in, there are people you can actually communicate with to get things changed in Gene Ontology if you want. Okay, so just a couple of additional concepts with Gene Ontology that you might come across. One is the idea of a slim version of Gene Ontology. Gene Ontology has tens of thousands of terms. Sometimes people want to summarize the information and they don't want to use all those tens of thousands of terms to summarize their data. For instance, you're making a pie chart and you just want to know what fraction of your proteins are in the membrane versus the cytoplasm. If you use the cellular component part of Gene Ontology, you have thousands of names for different parts of the cell, and you're not going to make a pie chart with thousands of slices. So how do you reduce that to a simpler set? GO slim offers an official reduced set of GO terms that has a much more manageable number, for tasks where you
need to do some quick summarization without dealing with all of the complexity of the ontology. There are also lots of different Gene Ontology resources that are freely available, and you can find these online. One that I recommend, if you're interested in the Gene Ontology, is QuickGO. It's basically a search engine for GO, and it's pretty fast and easy to use, so you can just browse the ontology using that. Also, just as a message, there are lots of other ontologies, but Gene Ontology is pretty much the only one that we typically see because it's the most popular, most widely used, most comprehensive one, although depending on what your task is, you should know that there are other ones. Okay, so Gene Ontology is a major source of biological pathway information, in particular the biological process part of it, so again, that's the part that I recommend focusing on. And actually, when we do pathway analysis, I recommend starting with pathways that are curated by people. So I recommend focusing on pathways, for instance biological process and also the pathways in the databases that we'll talk about, and also removing the IEA evidence codes from Gene Ontology, that's the lower quality annotation, except, as I mentioned, if you just don't have any high quality annotation for the species that you're working with. Okay, so now on to pathway databases. Pathway databases are another source of pathway information. There are actually hundreds of them. We put a website together that just lists a bunch; it's called pathguide.org. There are also some pathway analysis tools, like the very popular GSEA, or Gene Set Enrichment Analysis, tool, that maintain their own database of pathways. It's called MSigDB, which originally was focused on collecting signatures, so it's called the Molecular Signatures Database; increasingly they're focusing on keeping track of pathways. And then Pathway Commons is a website
that I'm involved in developing that tries to collect major databases together to be a kind of one-stop shop, so that's just of potential interest. There are also lots of other types of annotations, as I briefly mentioned, like disease associations and drug targets and different protein properties, like whether a protein has a particular domain, like it's a protein kinase. These you can get from genome browsers like Ensembl or from model organism databases, and if you're interested in these you can ask about them during the lab sessions. Just to highlight one of them: how many people have heard of Ensembl? Ensembl is a reasonably popular genome browser. UCSC is another popular one; UCSC is the most popular one for human. Ensembl provides a lot of different organisms and also provides more services that you can search, like this BioMart service. BioMart is kind of like a shopping mall for gene annotation. I'll just cover this very briefly. I include this here because it's difficult to understand how to use it when you just go there; it's very non-intuitive to start, but once you understand the basic thing then it's very easy to use. The basic thing is that you have to select the type of information you're thinking about, like genes, and then you have to select the genome, and you have to kind of wait, and then it updates and it will say, okay, I recognize that you're talking about human. Now you can do two things after that. One, you filter the genome down in some way. You can say, I only want genes that are protein kinases, and you can give it a gene ontology term for that, or I only want genes on chromosome one, or I only want genes that are close to this other gene. So you create these filters, and each time you create a filter you can click a little button that says count, and it
will tell you how many genes match that filter. Then once you do that, you can download information. You can download gene ontology terms, you can download whole sequences so you can make a sequence database, you can download structures, you can download gene identifiers. So this is a way you can download a bunch of information in a spreadsheet about genes of interest. It's quite powerful, so I just cover it briefly here. Okay, so let's see, how are we doing for time? We started a bit early, so I think we're going to end a bit early as well. I'm basically mostly done, but I'm just going to summarize what we've learned, and then we can maybe spend more time asking questions, in general just any kind of question, because I think officially on the schedule we're supposed to end at 11 this morning, so we'll end 15 or 20 minutes early. Okay, so we've learned that pathway information is available from many different sources, and sometimes accessing all these sources can be daunting, so during this course we'll talk about recommendations. One of the things that we maintain in our lab is a gene set database that we use regularly. We basically wrote a script to collect the information from many different databases and filter it in the way that we like. It's only available for human and mouse and maybe rat, and not others, just because we haven't worked on those organisms, but at least we provide all the documentation and you can see what's there. Later we'll talk about that, and that's an easy place to go to access gene list data for certain types of analysis. Otherwise, gene ontology biological process is usually the starting point for most people, and it varies widely depending on the organism that you're studying. Human is the most well annotated for pathway databases; databases like Reactome, which we'll hear
about tomorrow, focus on collecting pathway information for human, and for some reason most pathway databases historically have just really focused on human. However, gene ontology historically came out of the model organism databases. So if you study yeast, for instance, gene ontology annotations for yeast are the best of any organism, because the yeast database was one of the creators of the gene ontology, and they have actually gone through every paper that has ever been published for yeast and annotated every gene against that, so it's completely comprehensive. A lot of the genes are still unknown, but they actually say it's unknown when they've verified that it's unknown, because they couldn't find any papers about it, and they keep that up to date. Now, that's easier for yeast because there are only tens of thousands of papers in yeast. For human that's much more difficult, because there are millions of papers for human, and so there's a lot of information that's still present in the literature only and won't be in the pathway databases. So I guess one interesting thing that we can discuss is that if you are doing pathway enrichment analysis and the information that you want is not there, you can actually change the pathway databases. You can create your own gene set whenever you want, and you can also add genes to or remove genes from a gene set. The gene set databases are text files that you can edit in a text editor, so they're actually fairly easy files to look at and edit. The only issue with them is that they can be big, so it can be a challenge to work with them if you're not used to working with very big text files. But the point I'm trying to make is they are editable, you can create your own, and sometimes that ends up being very valuable. Certain analyses that we've done have depended on someone coming in and saying, well, gene ontology or any pathway
database doesn't cover the biology of interest for me, so I'm going to make three gene sets that are really important, and then I'm going to incorporate those into the database, and sometimes those come out as enriched as well. You can also test one gene set against your gene list. This whole course will focus on taking a gene list and searching a bunch of pathways, like thousands of them, but you can also just take one pathway or gene set, which could be a gene signature or something, it doesn't have to be a pathway, and compare it to your gene list. That asks a slightly different question, which is: does my gene list match some known gene list? Okay, so if you have questions about specific databases that might be available for your area of interest, you can ask myself or Veronique or others who are teaching the class, and we might be able to recommend some, but sometimes we'll find that there just aren't any. I work with a researcher who studies a stem cell model organism that I always forget the Latin name of, but it's basically this little flatworm that regenerates. What's the... planaria, yes, sorry. So he studies planaria, and there are no gene ontology annotations for planaria, so basically he mapped them via orthology from another organism. He used mouse, even though mouse is far away compared to other organisms. Like, you might think, well, C.
elegans is a worm, and planaria is a kind of worm, so you might take the information from worm. However, we found that mouse had better ontology annotation, and a lot of the genes are conserved, especially in the stem cell related processes that he was interested in. So he took that data for mouse, and the way he took that data is he mapped the information using orthology. It's a little bit more advanced; there's not an easy tool out there that does this for you that I'm aware of, so sometimes it requires some scripting, but it's relatively straightforward. You can download a list of orthology relationships; Ensembl is one place that makes those available, and probably the organism you're studying is present in Ensembl somehow, or you can find those somewhere else. If worst comes to worst, you have to compute them yourself, which is also possible but is starting to get more advanced. And then you can convert the genes basically from one organism to another via orthology. Has anyone done that? One person, okay. So, yeah, BioMart has a tool to do that. Oh, they do have a tool to do that? I don't know how many species it covers, but it's there. Okay. You type in genes in BioMart, the thing that I mentioned? Yeah, you put in all the gene IDs, and then when you ask for annotations you can say, give me the human orthologs. Yeah, right, okay. Any general questions about pathway data, where it comes from, problems with that? Okay. So, back to this workflow, just to summarize once more. This is the sort of standard workflow: we want to learn about underlying cellular mechanisms using pathway analysis, identifying and visualizing the pathways to find interesting ones, and then drilling down, once we find something interesting, to a specific model. It's not the only thing you can do with pathway
analysis here. I guess because we have some additional time, I'll give some additional examples. And just so you know, the last slide that I have is kind of a lab that you can do yourself if you're interested in working with gene identifiers. It just gives you some pointers to try that yourself if you have a gene list, so I recommend playing with those tools just to learn them a little bit. We're not covering them in any labs. One other thing about pathway analysis, and I should actually emphasize this more in the introduction, is that oftentimes pathway analysis gives you a hypothesis. It's frequently hypothesis generating; it doesn't give you the actual answer that you publish. If you just publish the pathway analysis without doing anything else, oftentimes it's just like publishing a bunch of hypotheses, and then someone else is going to have to follow up on those hypotheses. So it's important to know that pathway analysis is often a first step whenever you want to generate hypotheses. It's very valuable for any kind of exploratory analysis, but obviously you'll get more understanding of the system if you actually take those hypotheses and follow them up with additional experiments. When we show the people that we work with pathway analysis results, I often find that there are different ways people react to the pathways. Sometimes they say, oh, too many pathways, but we'll talk today about how to filter and identify interesting ones and how to manage the results. But also, in defining what's interesting, sometimes people say, oh, the top of the list, the most significant pathways, those are by default the most interesting, and if it's a novel pathway at the top that's very significant, I should probably try to follow up on that. Sometimes people say, oh, I can't really
follow up on that, because it's in an area of biology that I can't easily access with experiments that I know about. Like, sometimes strange metabolic pathways come up, and then you have to have specialized biochemistry skills to analyze those pathways. I actually encourage people to try to think about how they can follow up on those, because otherwise you keep yourself in a little area of knowledge that is not branching out. But I don't want to be too harsh, because frequently there are lots of interesting hypotheses that come out and you can take different paths, and sometimes people select the path that's going to be the easiest. Just be aware that there's a bit of a trade-off there between the easiest path and the most impactful path in terms of scientific discovery. The other thing is, there are ways that you can use pathway analysis as a kind of result in itself, not only as hypothesis generating. One way that we've found quite useful is if you're comparing two or more different conditions. In our research we frequently have to do this with cancer subtypes. So in this ependymoma example that I mentioned, when the gene expression data was originally collected and clustering identified two different types, type A and type B, clearly they had a lot of genes that were differentially expressed between these two types, but after that there wasn't really much else that you could say in terms of how different they were. So doing a pathway analysis on each one separately, and showing that there are very different pathways that are active in those two subtypes, added support for the statement that they're biologically different. In that way we could actually take the pathway analysis results and publish them as a result, and it is
hypothesis generating, because it generated a bunch of hypotheses for pathways to study, but at the same time the actual pathway analysis was publishable to support a statement in a paper. Okay, so that's just some additional insight. Any questions in general about pathway analysis, or other things that people want to mention? Using GO analysis, gene ontology analysis, is it possible to merge the molecular function and biological process in the same analysis? Yeah, so the question is, can you merge different aspects of gene ontology together, and yes, you can. It's very easy to do that; most analysis tools will just allow you to select them or deselect them. My recommendation, just to clarify, is to start with biological process, because it gives you the most value in the results, and then if it's not giving you value, or it's not what you want to do, you can easily just select the other boxes and include molecular function or cellular component. It's very easy to add those or subtract them. Does that make sense? Yeah. And just a note on the name, because it's a bit confusing: people say GO as "go" or "G-O", which is understandable, but people might know there's also a GEO database, that's the Gene Expression Omnibus. So sometimes people say "geo" and they're talking about GO analysis, but the other person thinks they're talking about Gene Expression Omnibus, which is where all the gene expression data is stored, and you have to submit it there when you publish a paper. People might know that, but it's a point of confusion that comes up sometimes. Any other questions? So thanks for reminding me about that. I have another one: is it possible, for a biological process in GO, to see if it's an inhibitor or activator of the pathway or not? So the question is, can you learn about inhibition and activation from this type of analysis. The answer is sometimes you can, if the pathway is annotated as the inhibitory part of the pathway and the activation part of the pathway, and gene
ontology does have a lot of terms about inhibition of pathways and activation of pathways, and increasingly those terms are linked to genes and they're annotated, but it's not as widely annotated as the main biological processes. So frequently you don't see that information, but it is possible to see it, and over time it will get better. That's one of the things that usually happens in this mechanistic drill down. So theoretically it is possible, but in practice we usually think about that after we've identified a pathway, and then you look at the genes in the pathway that are actually responsible for the enrichment. In GSEA there's a specific name for those, which we'll cover later, which is the leading edge genes. Those are the genes that contribute the most to the fact that the pathway is enriched, and if you look at those on the pathway you might find that they're all inhibitors of the pathway. That's pretty important to know, because you thought the pathway was active but it's actually the inhibitors of the pathway that are active, so it's good to know that. Yeah? A question regarding protein-protein interactions. When you use pathway analysis, you were referring to your first study, where you had different mutated genes in different individual patients, but then you were able to find that there were certain pathways that were clearly significant. Is it similarly possible that in the gene list there's, say, a gene that's a protein-protein interactor and should be accounted for as belonging to a specific pathway, and is it possible to do that with the pathway analysis software that you were describing earlier?
So the question is, can you consider protein interactions when you are doing the type of analysis that I focused on today. There are a few answers to this question. Protein interactions are considered in gene function annotation: gene ontology will annotate a gene to be part of a process if there's a protein interaction, and there's even a specific evidence code for that called IPI, inferred from physical interaction, and there are other ones that are similar for different types of interactions. So some of those are incorporated, but those are generally the well-known ones. There's a vast amount of information on protein interactions and other types of interactions, which we'll cover mostly tomorrow. The general notion is functional interaction, any kind of interaction between genes, which could be of different types, like protein interaction, genetic interaction, co-expression, similar protein domains, things like that, that allow you to know that two genes might be functioning similarly. That concept we'll spend a fair amount of time talking about tomorrow, but in general that information is not considered in the standard gene set based pathway enrichment analysis that we focus on today. It is possible to consider it. You could create your own gene sets that are the set of genes that interact with another gene, and you can use that. People don't do that too frequently; it sometimes generates too many gene lists. You could sort of do that in this mechanistic drill down section here (where's my pointer?), but actually the better way of dealing with it is usually in this network analysis type of thing that we'll start focusing on tomorrow. So one of the things that we do tomorrow is this Reactome functional interaction network analysis that
allows you to download a very comprehensive network, at least for human, that includes lots of protein interactions, and you can also include your own ones. Then if you take mutations, it will identify regions on the network that are highly mutated and connected, so that is a different kind of approach. So the quick answer is not really; networks are not really considered too much in the gene set world, except where they're integrated into the gene set annotation, and most people use different tools to handle rich network information. Sorry, just one additional question: is there tissue specific data in these software tools? So the question, and it's a very important question, is whether there is tissue specific data in the software tools. Generally not. I should definitely mention this in the slides, but generally all the data that we use in pathway enrichment analysis, and even the network enrichment analysis, is what I call context free. You can consider it as the set of things that could happen at any given time in an organism, over all developmental stages, and it does not specify what happens in an individual tissue or at an individual developmental stage. The reason why that's the case is that even though there's a lot of knowledge in the literature about which gene is expressed in which tissue, it's far from comprehensive. So even if you took the time to extract that information from the literature, and already that would be a huge amount of work, what you'd be left with is a patchwork of information where most genes don't have any information, because they haven't been studied in that particular context. And there are thousands or an infinite number of contexts, so it's not really scalable to specify them all. So what people do to approach that is you actually use the genomics data to help
you define what's active in a particular tissue. So if you're studying brain or liver and you have gene expression data or protein expression data, that's actually the best definition of what's expressed in the tissue. That's going to change a little bit as studies come out, and there are some reference maps. The big one is GTEx, which people might have heard of. For anyone who's interested in tissue specific expression, at least for human, they take normal human tissues and they do a lot of RNA-seq on them, like thousands of RNA-seq experiments, and try to make a catalogue, like an atlas, of tissue specific expression for all human genes. And these days there's a new project called the Human Cell Atlas, which seeks to use single cell RNA-seq to do the same thing, to identify all cell types in the body at all developmental stages and get an RNA-seq profile for every cell type. We don't know how many cell types there are; it could be tens of thousands or millions. But single cell RNA-seq technology, and other types of single cell omics technologies, now allow you to actually do that at the single cell level and get a very detailed profile per tissue. People in Toronto are doing that for liver, actually. We don't cover single cell technologies here, but my lab happens to do a lot of work in that area, so you're welcome to ask questions about it. I just have a question about the methods to map our gene list of interest to the gene ontology database or others. Sorry, can you repeat the question, because I didn't quite understand the middle part? I have a question on ontology data. Okay, sorry. So the question is, how do you map gene ontology terms to your gene list. The answer is, actually I should have made this clear, but it's
that the mapping's already done for you and you just look it up in a database. You don't actually have to do that step; it's all automated as part of these analysis tools. Mental note to myself, I should give an example of actually running through one of these pathway analyses, but we'll see it soon. You just give it your gene list and it does everything for you. The gene ontology mapping is done by the curators who make these association files, and they consider information like protein domains and things like that, and then your gene is already there with its gene ontology annotations, so it's a database. Does that make sense? Okay. Yeah? Oh, best BLAST hit, okay. Yeah, sorry. So if you want to map your own gene ontology terms, the best thing to do is via orthology mapping. And what you said, that was the point that I missed: best BLAST hit is a good way of doing that, but there are actually better ways of defining orthology, so it's good to use the advanced orthology methods. Best BLAST hit can give you some issues, because sometimes there's gene expansion and the functions of those genes are very different, and BLAST will not filter that. But the orthology mapping tools, which have a couple of additional rules on top of BLAST searching, will help you refine that, and for instance give you a one-to-one orthology versus a one-to-many orthology, and they will also identify paralogs. So when you get into that level, there are actually tools that help you define orthologs in a more evolutionarily correct way, based on evolutionary theory, and that's useful for gene function prediction, because when genes duplicate they usually diverge in function. So you don't always want to use sequence; sometimes the function is known to diverge quite far even for sequences that
might be similar. So these orthology mapping systems are the best to use, and usually they're pre-computed. But if you want to compute them yourself you use a tool; one of them is called InParanoid and another one's called OrthoMCL, and these are software packages that you can use to compute those yourself. That's a time consuming compute operation that usually requires a powerful computer, or it might take a few days, and so that's why it's good to use pre-computed ones if you can find them. And then if that doesn't work, you can do your own gene ontology, gene function prediction, but that's a much more in-depth type of thing that requires more bioinformatics software to come together in a pipeline. That's a good question to ask again after tomorrow when we talk about GeneMANIA, because GeneMANIA covers gene function prediction and it can actually be applied broadly, and the exact way that it works will be covered tomorrow, so think about that question again.
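
The orthology-based annotation transfer discussed above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the gene names, the ortholog table, the orthology-type labels, the GO annotations and the `transfer_annotations` function are all hypothetical and made up for illustration, not the output or API of any particular tool. It keeps only one-to-one orthologs and drops IEA annotations, following the recommendations in the talk.

```python
from collections import defaultdict

# Hypothetical ortholog table: (target_gene, source_gene, orthology_type).
# The identifiers and type labels are placeholders for illustration only.
orthologs = [
    ("plan_0001", "Piwil1", "one2one"),
    ("plan_0002", "Cdk1", "one2one"),
    ("plan_0003", "Hoxa1", "one2many"),
    ("plan_0004", "Hoxa1", "one2many"),
]

# Hypothetical source-organism annotations: (gene, GO term, evidence code).
source_annotations = [
    ("Piwil1", "GO:0007283", "IMP"),
    ("Piwil1", "GO:0003723", "IEA"),
    ("Cdk1", "GO:0007049", "IDA"),
    ("Hoxa1", "GO:0009952", "IMP"),
]


def transfer_annotations(orthologs, annotations,
                         allowed_types=("one2one",),
                         excluded_evidence=("IEA",)):
    """Copy GO terms from source genes to their orthologs in the target organism.

    Annotations with an excluded evidence code are dropped first, then only
    ortholog pairs whose type is in allowed_types are used for the transfer.
    """
    # Index the source annotations by gene, skipping excluded evidence codes.
    go_by_source = defaultdict(set)
    for gene, term, evidence in annotations:
        if evidence not in excluded_evidence:
            go_by_source[gene].add(term)

    # Transfer terms across the accepted orthology relationships.
    transferred = defaultdict(set)
    for target, source, orthology_type in orthologs:
        if orthology_type in allowed_types:
            transferred[target] |= go_by_source.get(source, set())

    # Return only target genes that actually received at least one term.
    return {gene: sorted(terms) for gene, terms in transferred.items() if terms}


result = transfer_annotations(orthologs, source_annotations)
for gene, terms in sorted(result.items()):
    print(gene, terms)
```

Restricting to one-to-one orthologs is the conservative choice here, since duplicated genes often diverge in function; passing `allowed_types=("one2one", "one2many")` would trade some of that caution for more coverage.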