Okay, welcome everybody. As I mentioned earlier, my name is Gary Bader, and our research is focused on pathway and network analysis, so we can answer a lot of different questions about pathway and network analysis of any kind of genomics data. This workshop tends to focus mostly on analyzing transcriptomics data because it's the most common type of data that people have, but we try to make the concepts very general, so you can translate them to any type of genomics data and any type of organism. For people working on data types that we don't cover, because we can't cover every data type in every way, we can answer your questions during the breaks and whenever we have opportunities to chat. We've probably analyzed the type of data that you have, and we can point you to papers where we've run these types of workflows on different types of data. Okay, not every type of data and every type of organism is easy to analyze, because the available tools are sometimes more focused on human or E. coli or some other major model organism, and if you're working with a newly sequenced genome or something like that, there might be more challenges, but we can help you with that as well.
Okay, so I'm going to start by talking about a number of basic concepts. The aim of this initial section is to bring everyone up to the same speed. Some people may know some of these concepts, but they're probably new for other people, so we're going to go over everything and start from the beginning. The point of this workshop is to help people analyze data that they've generated from genomics experiments, and usually that ends up being some kind of gene list or metabolite list or some list of items. It's great when you get that data and it worked, it's amazing, and then you have 1,000 or 5,000 different results, and now what do you do? That's a very common problem, and this whole workshop is focused on helping you with it. The major question is: what's interesting about the set of genes or other molecules that you have? The standard approach of ranking or clustering the data, if you have multiple samples, can help you identify things that have the strongest signal or patterns that are similar across your data, but it doesn't help you interpret the results in a biological way. One thing that people have found incredibly useful for interpreting large-scale data is pathway analysis and network analysis. If you ask what's interesting about these genes, you can ask, well, maybe they are enriched in some type of pathway that we know about, like, oh, the genes that I found are mostly related to the cell cycle, what does that mean? That's much more interpretable and easy to understand. The idea is that if you didn't have some kind of analysis tool to help you with this data, you'd have to go through those genes one by one, looking at the literature and putting all the pieces together yourself.
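As a rough sketch of what "enriched in a pathway" can mean numerically, the snippet below computes a one-sided hypergeometric p-value: the chance of seeing at least a given overlap between a gene list and a pathway if the list had been drawn at random from the genome. The function name `enrichment_p_value` and all gene counts are hypothetical, chosen only for illustration, and the hypergeometric test is one common choice rather than necessarily the exact test used in the workshop.

```python
from math import comb


def enrichment_p_value(genome_size, pathway_size, list_size, overlap):
    """Probability of drawing `overlap` or more pathway genes in a random
    gene list of `list_size` genes (upper tail of the hypergeometric)."""
    max_overlap = min(pathway_size, list_size)
    # Count the ways to pick a list with i pathway genes, for every i >= overlap.
    favourable = sum(
        comb(pathway_size, i) * comb(genome_size - pathway_size, list_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    # Divide by the number of all possible gene lists of this size.
    return favourable / comb(genome_size, list_size)


# Hypothetical numbers: 20,000 genes in the genome, 1,000 of them (5%) in the
# pathway, and a list of 100 genes from the experiment.
p_enriched = enrichment_p_value(20000, 1000, 100, 50)  # half the list in the pathway
p_expected = enrichment_p_value(20000, 1000, 100, 5)   # about what chance predicts

print(f"50 of 100 genes in the pathway: p = {p_enriched:.3g}")
print(f" 5 of 100 genes in the pathway: p = {p_expected:.3g}")
```

A list where half the genes fall in a pathway covering 5% of the genome gets a very small p-value, while an overlap close to the 5% expected by chance does not. In practice this is repeated for every pathway in a database, and the resulting p-values need correcting for multiple testing.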
So pathway and network analysis in general is a type of method that helps you gain mechanistic insight into large-scale data. It may be identifying a master regulator or drug targets, or characterizing pathways that are active in a sample. In my view, it's any kind of analysis that involves any kind of pathway or network information. It's most commonly applied to help interpret lists of genes, but you can use it for lists of small molecules and anything else, you know, regions. The most popular type is pathway enrichment analysis, and we'll talk about that today, but there are many others that are useful, and we'll go over different ones in different parts of the lecture. Next, I want to go over two examples of pathway analysis that were particularly successful. These are examples from our own research projects in collaboration with others, but they illustrate some of the things that you can do with this type of analysis. The first example is an analysis of autism spectrum disorder. Autism spectrum disorder is a mostly genetically heritable disorder, or spectrum, that affects people starting at a young age. When I started working on this I didn't realize how heritable it was, but it's highly concordant among identical twins and it's more than 50% heritable. One of the things that people have discovered about this disorder is that rare de novo copy number variants play an important role in the genetics. This is work done with Steve Scherer, who's at the Hospital for Sick Children here and who focuses on the genetics of this disorder. The Scherer lab collected about 1,000 cases, very severe autism spectrum disorder cases, and 1,000 controls, genotyped everybody with SNP arrays, and used those to call copy number variants.
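To make the idea of calling copy number variants from array intensities a little more concrete, here is a toy sketch: runs of consecutive probes with unusually high signal are flagged as gains, and runs with almost no signal as deletions. The function name `call_cnv_runs`, the thresholds, and the intensity values are all assumptions made up for illustration; real CNV callers use more careful statistical models than a fixed cutoff.

```python
from itertools import groupby


def call_cnv_runs(intensities, low=0.2, high=1.8, min_run=3):
    """Label each probe by intensity, then keep runs of at least `min_run`
    consecutive probes that share the same non-normal label."""

    def label(value):
        if value >= high:
            return "gain"
        if value <= low:
            return "deletion"
        return "normal"

    calls = []
    position = 0
    # groupby yields maximal stretches of consecutive probes with the same label.
    for kind, group in groupby(intensities, key=label):
        length = len(list(group))
        if kind != "normal" and length >= min_run:
            calls.append((position, position + length - 1, kind))
        position += length
    return calls


# Hypothetical normalized intensities along one chromosome (1.0 = expected signal).
probes = [1.0, 1.1, 2.0, 2.1, 1.9, 2.2, 1.0, 0.9, 0.0, 0.1, 0.0, 1.0, 2.0, 1.0]
cnv_calls = call_cnv_runs(probes)
print(cnv_calls)
```

Requiring several probes in a row is what separates a plausible copy number change from a single noisy probe, such as the isolated high value near the end of the toy data.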
So SNP arrays measure the intensity of DNA signal at different positions. If you have a bunch of high-intensity positions in a row, that could be a gain of copy number; if you get no signal across a range of SNPs on the chromosome, that could represent a deletion. So you can use this information to identify gains, amplifications, and deletions in the genome. They found a number of copy number variants that were associated with autism, but not that many, about 10 genes in total from all their copy number variants, if they just looked at genes that were enriched in deletions, for instance, in the cases versus controls. So we looked at this data, and we tried to find not just whether a specific gene is particularly associated with cases compared to controls; we looked at whether specific pathways were affected compared to controls. What we found, let me see my mouse pointer here, was a rich set of pathways that were enriched in deletions in cases compared to controls. All these little circles here, each one represents a pathway, a set of genes, and the lines connecting the circles are overlapping genes that are shared between the pathways. We can zoom in on one of these, and we can see that a lot of the pathways, for instance, were involved in central nervous system development, which makes sense given the biology, because this is a disorder that affects the brain. What was interesting here is that it wasn't the same gene that was affected over and over again. If we looked at a set of 20 genes in the pathway, you might have one gene deleted per individual, but when you looked at the pathway, you found it was constantly getting affected by deleted genes, and in a statistically significant way. It's a pattern that you couldn't see at
the gene level, but at the pathway level you kept on seeing the same pathways over and over again, though not the same genes. So that's where pathway enrichment analysis can help when you have a situation like that in your data, which is fairly common. The other little symbols on this plot represent pathways that were enriched in known intellectual disability genes and also known autism genes here. So there was overlap: even though the genes known to be involved in intellectual disability and autism were not seen to be mutated over and over again, the pathways that they affect and the pathways affected by the newly mapped copy number variants were very similar. That was also interesting and helped validate the results. The second example is a cancer example, an analysis of ependymoma. Ependymoma is a brain cancer; it's the third most common brain tumor in children. It affects the ependyma, which is the lining of the central nervous system, and there's no known therapy for this disease other than radiation and surgery, which is devastating because it affects young children, and brain surgery in young children basically leads to a poor quality of life. So ideally you would be able to find some better, more targeted therapy. The only thing that people knew about this disease, and they didn't know anything about any mechanisms, was that depending on where it occurred in the brain it would have different outcomes. The most serious outcome is if the tumor appears in the posterior fossa, which is the back of the head, the brain stem and the cerebellum. People knew for many decades that if it occurred there, that was bad, and that also happens to be the most common location for this tumor.
So Michael Taylor, who is a neurosurgeon, also at Sick Children's Hospital here, collected transcriptomics data on all these tumors and was able to cluster them, which means that you identify samples that are similar and group them together, and two major groups appeared. One, called posterior fossa type A, affects the youngest patients and has a terrible outcome, and the other, type B, affects the oldest patients and has an excellent outcome. So even though people thought, just based on anatomical location, that this tumor is very serious, it turns out there are actually two different diseases in the same anatomical location. One is really bad, and that's where the bad signal came from, and the other actually has an excellent prognosis, so based on this they can already start tuning therapy. But we wanted to know more about the mechanisms. Michael collected whole-genome sequence and exome data on a number of these tumors, and interestingly there were no recurrent mutations identified; each sample had up to three mutations or something, so it's very silent. There may be various reasons for that, for example because it's a pediatric cancer and there hasn't been time for mutations to develop, but unfortunately it didn't tell us anything about the mechanism of this tumor. So then Michael and the team moved on to look at DNA methylation, again with an array, looking at CpG island methylation. If you have a whole series of very strongly methylated signals in a promoter region, there's a good chance that the gene is going to be silenced. It turns out that the serious A type is much more transcriptionally silenced than the B type, and there were about 2,000 genes that were differentially methylated between A and B.
So we looked at that. The standard pathway analysis methods didn't actually work right away with this, but we used a more statistically appropriate test, because the data was very sparse, and we also used a much bigger database of pathways than is typically available, and we'll tell you about how to do that. This was worked on by Scott Zuyderduyn, who's a postdoc in my lab. Interestingly, the only pathways that were enriched in these 2,000 genes compared to what you'd expect were basically pathways related to the PRC2 complex. PRC2 is the Polycomb repressive complex 2. It's involved in methylating histones, and then DNA gets methylated, so it's an epigenetic regulator. I should explain this plot here. The length of these bars is proportional to the significance of enrichment of the pathway. Each of these pathways here, EED targets and SUZ12 targets, these are actually proteins that are subunits of the PRC2 complex, and this is another one that combines a bunch of different targets of the PRC2 complex. So all of the pathways here that were enriched past a threshold were related to the PRC2 complex, and as you can see there was hardly anything that came up in group B. Okay, so this is really interesting because it represents the first molecular target that we know about for this disease, at least in the group A type. People have actually developed small molecules that inhibit the methyltransferase in the PRC2 complex. These were tried in cell lines and mouse models associated with this tumor, and they showed promising results. Even more interestingly, there was a patient at the hospital here who had reached the end stage of this disease. The tumor, this type A ependymoma, had metastasized to the lung, and in two months this lung metastasis here had doubled in
size, and there were no other treatment options for the patient, so, as I said, they had basically reached the end stage. So on compassionate grounds this patient was treated with an on-the-market anti-DNA-methylation drug called 5-azacytidine, and one course of treatment resulted in the tumor stopping its growth and the patient regaining their energy and feeling better, and that effect lasted for 15 months. That was really amazing, because within a short amount of time we were able to move on from very little being known about this disease: we collected a few different layers of genomics data, identified first of all that there are two types, then identified a molecular mechanism that's important in one type, and fortunately there were multiple drugs available, including an on-the-market drug, and we were able to see an effect in a patient. Now there are two clinical trials funded, and they are showing a very good effect. Yeah, so in this case we had a very clear signal, because there was just one pathway that was enriched in the differentially DNA-methylated genes. Normally, if you get lots of pathways, you don't know which one to target. In this case there's only one methyltransferase that is known to be in this complex. People know how this complex works, and that methyltransferase is important for the major function of the complex, which is methylating histones. So you can inhibit that, and it's an enzyme, which is easier to inhibit, and there are drugs available. So that was the first and only one that you would check. Normally, if you're interested in drugs, and I don't think we're covering this in the lecture, we can definitely talk about annotating pathway enrichment analysis with known drugs. Say you have a pathway that has 20 or 50 genes in it: there are databases of
known drug targets, and you can just annotate those pathways with all the known drug targets. That will at least quickly give you a set of drugs that are connected to the pathway. A drug could be knocking the pathway down or it could be promoting it; drugs act in different ways, so there's more work to be done to understand which drug you might use in an experiment or how it works, but at least you can very quickly get to the point of having a list of drugs. Here there was just one option, so it was the clearest pathway enrichment result we've ever seen, and we usually don't see examples so clear. Okay, so those are two examples that illustrate the kind of things you can do with pathway enrichment analysis. There are many other things you can do, but those are good, informative examples. Okay, so in general, the benefits of pathway analysis compared to analyzing data at the level of transcripts or proteins or SNPs are as follows. The results are usually easier to interpret because they deal with familiar concepts like cell cycle or metabolism. They identify possible causal mechanisms that you can follow up on, so the analysis generates a lot of testable hypotheses. That's very important: all the analyses that we're covering in this workshop are basically hypothesis generating, but they often generate fairly specific hypotheses. You could use this to predict new roles for genes, so you might find a gene that is similar to genes involved in a known pathway but wasn't known to be part of that pathway. The analysis can sometimes improve statistical power. I'll illustrate this with an example: say you have a genome-wide association study where you look
for mutations: you genotype individuals, and say you have 10 cases who have a disease and 10 controls who don't. If you're looking for mutations that associate with the disease, your perfect signal would be finding a mutation that's present in all 10 cases and none of the controls. That would be perfect, but you never see results like that. Usually it's closer to each individual having a different mutation, and then you can't really do anything with the results. However, if you realize that those mutations in the cases are all part of the same pathway, and you've done an analysis at the pathway level where you ask which pathway is associated with the cases, then you can say all 10 cases are potentially affected by mutations in this pathway and none of the controls are, and you can compute statistics to see how significant that is. That takes us from a situation where we have no signal to a situation where we have perfect signal. The way that works is a central concept: the pathway information is able to aggregate those single counts into one stronger count of 10. So instead of having a bunch of counts of one, you have one count of 10, and that's just a stronger signal. The other thing it helps with is reducing multiple testing, because usually there are fewer pathways to test than genes. If you test SNP associations in a GWAS, you have to correct for the number of tests that you do, and we'll talk about that later, but that correction reduces your statistical power. So the fewer tests you do, the more power you have, and pathways usually require fewer tests because there are only thousands of pathways, not millions of SNPs. Okay, so those are concepts that are used over and over again, and we'll talk about them again. Pathway analysis can sometimes be more reproducible, because there might be lots of ways
to affect the pathway, and each time you do an experiment you find a different way, but they all connect to the same pathways. So if you look at things at the pathway level, you might see more reproducibility across your data. It also facilitates integration of different data types, because you can analyze all of your different genomic layers with pathway analysis, all using the same pathways, and you can combine them and ask which pathways you see affected in each genomic layer. They're all talking the same language at the pathway level, and you can visualize them all together. Okay, so we talk about pathways and networks; what's the difference? They're both representations of biological processes or mechanisms that are occurring in the cell. Pathways tend to be more detailed, high-confidence consensus models of a particular process in the cell. They might contain biochemical reactions and small molecules or a mix of molecules. Usually there's less information available at this highly detailed level; it's more textbook knowledge. Networks are more simplified. Usually it's just, you know, A connects to B or A regulates B, and you don't know as much about what's going on in the cell, but you have large-scale information about how things are connected. For example, you might have a large protein interaction screen that gives you thousands of protein interactions in a database, and you just want to use all of that information. So both of these types of information are useful. The benefit of working with pathways is that you might get more mechanistic understanding at a more detailed level. The disadvantage is that it only works with known pathways, so if you have genes that are not part of known pathways, it doesn't cover them at all, so you have to
look at those separately. Networks might cover more genes, and you can get a nice network that connects genes that are, say, differentially expressed in your samples, but then when you get that network you have to interpret it. What does that network mean? Is this a pathway that I know about, is this a new pathway, is this something that's not a pathway? That's extra work that's required. So there are different types of pathway and network analysis. The first type, which we're going to spend most of today on, is pathway enrichment analysis, which as I mentioned is the most popular method. There are tens of thousands of papers, and almost everyone who runs a genomics analysis today applies pathway enrichment analysis right away at the end, if they're producing some type of list of genes or molecules. This basically asks: what biological processes are active in my sample? We can also look at de novo subnetwork construction and clustering, which we're going to talk about tomorrow with the ReactomeFIViz analysis. This looks at networks instead of known pathways, and as a result it might find new things that you didn't know about, but again you have to interpret the networks that come out. And then there are more detailed types of pathway analysis and modeling that are less commonly used because they have high data requirements: you need very detailed pathway models and multiple layers of genomics data to run them. We don't cover them in this workshop, but we can talk about them if you'd like. That might be, you know, looking at how mutations affect a phosphorylation site in a pathway and predicting the effect of that mutation, and things like that. Okay, so the general pathway analysis workflow: as I mentioned, we collect genomics data, and we normalize and score it according to what's standard for that type of
genomics data. For instance, for RNA-seq there's a standard pipeline that people run to take the reads, align them to the genome, identify counts of transcripts, and then normalize those counts, and the result is a list of genes for a sample. If you have multiple samples, you can compute differential expression between those samples, and that's yet another analysis method. We don't cover those in this workshop; there are other workshops that cover them. The result of all of those different methods, no matter what type of analysis you're doing, is a gene list or a molecule list, and we start from there. But it is important to know that there's a wide range of these normalization methods. For established data types like RNA-seq, and many others, it's very standard how to do this. Usually the core facility that processes your sample will run these analyses and provide you with a report, so you usually don't have to worry about doing it yourself, although you should understand how it works. If you're working with a data type that doesn't have that support, you might have to learn and do these things yourself. Often these scoring systems are developed in a statistical programming language like R. So those are some general points. The one important take-home message is that you need to know that those methods are working well, and you have to look at and understand their results. For instance, if you compute differential expression between your sample and control and you get some differentially expressed genes, but they're very weakly differentially expressed, you're probably not going to get a good result from your pathway analysis, because you don't have a lot of signal in your data. There aren't a lot of significantly differentially expressed genes, and that sometimes happens. To find out why that happens, you might have
to go upstream and look at earlier parts of the process. Okay, so you generate your gene list, and this workshop is focusing on the green part. We want to learn more about the biology, the cellular mechanisms that are important in the experiment. We can identify interesting pathways and networks and visualize them, we can drill down to understand the mechanism, and eventually develop some model and publish it. So that's the general idea. The more detailed version of this is this map that we put together, which includes, in the blue boxes, more detail about the different types of genomics and omics data that you can produce. We made this very explicit so that you can see what steps, in orange here, are required to get to your gene list. Some methods, like protein interaction screening, generate the gene list right away, because you identify the proteins directly, whereas other methods require multiple levels of scoring and, for instance, differential expression analysis before you can get a gene list. Once you have your gene list, you can look for pathways, which are known pathways, or you can look for interesting networks, which might not be known, and there are different tools for doing this. Pathway enrichment analysis is what we're going to talk about today; again, that's the most popular one. Actually, we're going to talk about all of these different ones here, and also about visualizing and identifying interesting pathways, and then in the other parts of the workshop we'll talk about thinking with networks, including transcriptional regulatory networks. After your analysis, you might see 20 pathways or 100 pathways that are significant, and you have to filter through those to figure out which ones are interesting. Some of them are going to be known: oh, I know that, I
know that. So those are good because they're like positive controls: you know that the analysis is working if you find things that are known. Others might be interesting, like, oh, what's going on with that, I didn't know that autophagy, or something, was important in this process. So you could focus more on that: look at the genes that are involved in that pathway, visualize them, look at more detailed information like more detailed pathway maps for that pathway. If there are genes coming up that don't have a known function, you can predict the function of those genes; we'll cover that with GeneMANIA tomorrow. And you can put all of these things together, so this workshop tries to cover all of these bases. I usually start with the easy thing, which is pathway enrichment analysis, which we're covering today. That's a good question, and I should mention this in the lecture: the reason we do that is that working with known pathways gives the results that are easiest to interpret, so you can most quickly get to something like a story, something that you can think about. If you were to go directly to networks and you see these networks come up, sometimes you can be overwhelmed again by the networks: now what do I do with them? So we usually start with this simple, easily interpretable pathway enrichment analysis, then we identify interesting pathways, and then we drill down on those. But we also understand that pathway enrichment analysis doesn't cover some set of genes. We call that the dark matter of the genome, you know, genes that are not annotated, and there are a lot of them, usually almost half of any given genome. Those we also look at separately, and we look for any strong signal, like strongly differentially expressed genes in that section that don't have
any pathways associated with them, and those could then go for more detailed literature searching. Maybe they're just not in our pathway database but people know something about those genes, or maybe we know nothing about those genes, but you can look at what they bind and what they connect with in networks, and that might tell you something about them. That's GeneMANIA, which we'll talk about tomorrow. Okay, any other questions? Okay, so a quick point about where gene lists come from. This workshop, again, is focusing on gene lists and molecule lists, whatever types of lists you have, but it's important to know where this data comes from, because you'll be able to answer different questions based on where the data comes from. You can have gene lists that come from molecular profiling, like mRNA or protein, and you might just want to identify all of the proteins in your sample, for instance, with proteomics. You might be able to quantify them, so that's another level, where you know the expression levels of the proteins or genes. Then you might want to look at differential expression and rank your genes based on how differentially expressed they are, and if you have lots of samples you can cluster things to find out if there are natural groupings in your data. Those are all biostatistical analysis methods, but again very standard. If you work with protein interaction data or any kind of molecular interaction data, and I didn't hear too many people mention that here today, or if you're analyzing microRNA targets or transcription factor binding sites, you immediately get a list of things that interact with your molecule of interest. A genetic screen like a CRISPR screen will again identify a set of genes that are sensitive when knocked out, and there are also genetic association studies, which find things that are associated with disease. So you have to understand what the gene list means,
so if I'm doing a protein pull-down, I'm going to get proteins that are probably in a complex with my protein of interest. If I'm doing some other kind of assay, I might find things that are in the same tissue or cell, or if I'm doing a genetic analysis, I might find things that are related because they're in the same chromosomal location. But most genomics data does tell us something about pathways that are active. Gene expression data tells us about pathways. Why is that? We think that's because the cell, or the biological system, has evolved to express the genes that it needs when it needs them, in an efficient way. So if you see a whole bunch of genes expressed at the same time as each other, those might be part of the same system. Now, multiple systems could be turned on at the same time, so it's not a one-to-one relationship, but that is the reason why we think gene expression data can tell us something about pathways. If you think about your data like that, and about what type of information you expect to get out of your data based on first principles in biology, you can think about what the gene list means. Okay, so I mentioned that there are lots of different types of genomics data that can generate gene lists. Again, before analysis you have to do all the normalization and quality control that you normally do, and as I said, if that's not working then your pathway analysis won't work: garbage in, garbage out. You have to think about a few things. You have to use statistics that will increase your signal versus noise, and that's standard if you have a standard workflow. You might think about gene list size: if your analysis results in three genes, that's not going to give you much power to do pathway analysis. If you
get all the genes in the genome, you're not going to be able to say anything specific. So there has to be some sweet spot in the middle, usually tens to hundreds or thousands of genes. If it gets to half the genome you could still do some analysis, but it's going to reduce your ability to get signal out of a pathway analysis. And you also have to make sure your gene identifiers are compatible with the software that you're using. Sometimes this is more of a problem than others; I'll talk about it more later. So that's this part right here, and this is sort of what we assume, and just some tips for that.
Okay, when you actually get to your analysis, as I kind of explained already, and this is just repeating in a little bit more detail, you have to understand what you want to accomplish with your list. Hopefully that question is part of the experimental design, but some of the things that you could do are: summarize the biological pathways that are active in your sample; do a differential analysis, which might find pathways that are different between samples; find a controller molecule for your process, like a transcription factor or microRNA that we think is important as a master regulator; find new pathways or pathway members and discover a new gene function; or correlate a pathway with a disease or phenotype and find a drug. So I think we've kind of mentioned all of these so far. This workshop will help answer all these types of questions. Today we're doing pathway enrichment analysis, which is like summarizing the data and comparing sample A to sample B, or case to control. Tomorrow we're going to get more into network analysis and predicting gene function, and then day three is focused on regulatory network analysis and also the integrated assignment. We used to have the
integrated assignment in the evenings, but that wasn't compatible with everybody's schedule, so we moved it to the day; it's the third day in the afternoon. Hopefully that works out better.
Okay, so we're going to go into more detail on how this works later, but just a quick intro to pathway enrichment analysis. The idea with pathway enrichment analysis is that, given a set of genes that you've found from your experiment, like you have a thousand genes that are, let's say, differentially expressed in tumor versus normal, if you find that half of the gene list is genes that are involved in the cell cycle, that's an unexpectedly high number of genes, because if you look at the genome, half the genes in the genome are not cell cycle genes. It's only about 5% of the genes in the genome that are cell cycle genes. So having 10 times more cell cycle genes than expected in your list of a thousand is enriched, and you can compute a statistic for how enriched that is, and you can get a p-value that says this is really enriched; it's very, very unlikely to occur by chance according to this p-value. And so this Venn diagram is meant to represent genes that you have in your list, and then you have a pathway, like the cell cycle that I mentioned, and these are all the genes that are involved in the pathway. Then you look at the overlap between these two, and you look at the significance of that compared to the background universe of all possible genes, like all the genes in the genome. And you do this for every pathway that you have. So if I have a thousand pathways, I'd run the same thing over and over again for each pathway, and then I compute p-values for each of these, and then I can rank the pathways by those p-values. There's a little bit more to it than that, and I will get into it in more detail, but the end result is a list of pathways that are enriched in your data. Excuse me.
Okay, so, I
talked about cool examples of pathway enrichment analysis, like our really great ependymoma story, and now you know the very basics of pathway enrichment analysis. There are sort of two parts to this: one is your gene list, and the other is pathways (where do you get the pathway information?). And then you put those two inputs into a pathway enrichment analysis method like GSEA or g:Profiler. So I'm just going to cover some of the basics of these two types of inputs quickly.
So when you have a gene list, there are a few things that you need to know about. One is that genes are identified by some name or number. These are identifiers, or IDs. Ideally they're unique, stable names or numbers that help keep track of these genes in databases. A social security or insurance number, or an Entrez Gene ID: these are examples of unique identifiers. So that's great if you have that. There are problems, though, with working with genes. One is that gene and protein information, and information on any kind of molecule, is stored in not one database but lots of different databases, because everybody can create their own database, and they all create their own identifiers. And now you might have a problem: if you get a gene identifier from your friend and you're working with a different database, how do you know that you're talking about the same thing? You have to map one to the other. The other thing that is a problem is that sometimes you're using names that are not standard identifiers. You might be using the common name for a gene that no database uses but that's used in the literature, and that might not be the best name, because maybe two genes actually have that name, and then you're stuck because you don't know which gene you're talking about. So that's why we don't use common names of proteins, for instance; we try to use identifiers that are standardized, and
the last thing is that it's important to understand that there are different types of identifiers for different types of molecules. So even if we're talking about genes, and this course is all about genes, we know that genes are encoded in DNA, which is a molecule where we're thinking about a region in that case, and then an RNA is expressed and it's translated to a protein. So the DNA region and the RNA and the protein have different identifiers, and there are different databases for different types of information associated with those molecule types. So Entrez Gene at NCBI doesn't store sequence information; it just stores the concept of the gene and what it's all about, the function of the gene, and then it links to other databases that have the other types of information.
So here's a list of different types of identifiers, just so you can see the variety. The ones that are underlined and in red are the ones that we recommend. It's not always possible to use these, because you might be using an organism or newly sequenced genome that doesn't have these identifiers and might only have identifiers that you've defined yourself. But in general these are the ones that we recommend, in particular Entrez Gene and species-specific symbols. For human, people frequently use gene symbols, which are different from gene names. They sort of seem like a name, but the difference is that the gene symbol is standardized, so there's only one symbol per gene and we know exactly what gene you're talking about if you use that symbol. So this could be very confusing if you have an alphabet soup of these things, so it's good to use standard ones that are commonly used. Sometimes people don't have a problem with that; sometimes it's a big problem. So there are lots of different identifiers, and if you have this problem you
should learn about something called identifier mapping, and there are identifier mapping services. For example, in g:Profiler there's a tool called g:Convert, so you can copy and paste identifiers of one type and ask to get another type, and it will convert them. It doesn't always do a perfect job, for a couple of reasons. One is that there might be ambiguity in identifiers, because biology is not perfect. Even the human genome is not finished, and with every version of the genome annotation that comes out, genes are still changing. Sometimes they're called a pseudogene, and then they get promoted to a gene, and then they go back to a pseudogene and back to a gene, and it takes years before people really figure out that some of these genes are really encoding something and working in a particular way. And so if you get different versions of those databases, they'll disagree on what's a gene. So that's one reason, and so you might not always get perfect matching, but usually you can get a good enough matching for practical use. If you want a perfect matching, you have to go into detail and manually look at all those issues yourself.
So I mentioned you need to be aware of ambiguous mapping. Again, there could be cases where one identifier in one database maps to two identifiers in another database. For instance, sometimes people think one region of a genome is a gene and then they realize there are actually two genes. This happens relatively frequently over time, so you'd have cases like that. Okay, so to avoid errors you need to just be aware of these things and check: be aware of one-to-many mappings, and use identifier types that are standard and unique per gene so that you reduce the problem of gene name ambiguity. Also, if you're using Excel, you probably know that it automatically tries to guess types of data, and sometimes it guesses that certain gene names are dates or other things. How many people have had this
problem? Okay, so OCT4 is a pretty important transcription factor in stem cells, and Excel thinks it's October 4th. So you have to paste as text and make sure that you know the column type. If the column type says General, it's just going to guess the type, but if you set it as Text, or if you paste as text, it will work properly. This is a major problem if you have thousands of genes, because you can copy and paste them around and it's not even visible on your screen what's happening for all the thousands of genes. So you might only find out later that Excel introduced a bunch of errors because now your pathway analysis isn't working. So that's why it's important to be aware of it before you encounter that problem. And I mentioned there are problems reaching 100% coverage.
Okay, so here's a really bad example, just to scare you into thinking this is important. This is a paper from Nature from quite a while ago now, where people were studying HES-1 as a target of a microRNA, and unfortunately they had to retract the paper because it turns out they were working on totally the wrong gene. They did a database search, they pulled out a gene, and there are two genes named HES-1 (they had different cases). One is homologue of ES-1 and one is hairy enhancer of split. So unfortunately they were using the wrong one and had to retract their Nature paper. So it does happen.
Okay, so recommendations for proteins and genes: map everything to Entrez Gene identifiers or official gene symbols using some spreadsheet, and if you're having problems getting coverage, you can manually curate the missing mappings or use multiple mapping services. One thing to note is that these recommendations usually don't cover splicing. In general, this whole workshop is focused on genes, and a gene can result in multiple splice forms, and those splice forms can have different
functions, but unfortunately we don't know a lot about those functions and those splice forms. So yes, we have quite a bit of information about transcripts that are encoded in genes, but that information is not great; it doesn't have a lot of good coverage. And none of the pathway databases have information at the splice variant level. They're basically at the protein level, and even though they list a specific protein, it's usually the longest protein that's expressed from the gene or something like that. Occasionally you would have things that are focused on the function of splice variants, but because it's a rare occurrence, at this stage in 2018 all of the gene and pathway analysis doesn't work with that. If you do have splice variant data, where you're interested in specific transcripts and you have a data source that is transcript-aware, you can use the same systems that we're talking about; just note that the databases don't really have a lot of that information. Okay, any questions so far? Okay, how's my timing? We're supposed to have a break at 10:30? 11? Okay.
Okay, so we've learned about gene identifiers, and that's one part of pathway analysis, the gene list part. So now I'm going to talk about pathways: where they come from and how to use them. So remember, in our pathway enrichment analysis we had to analyze each pathway one by one, so we have to get those pathways somewhere. Those pathways are generally available in databases, and there's more than one database that exists. So for instance, Gene Ontology is a popular place to get information about gene function, and it includes information about biological processes, which are pathways, basically. And then there are pathway databases like Reactome, which is the best example currently, that store a lot of detailed information about all of the genes
and proteins and how they work together. Pathways are one type of gene annotation, but there are a lot of other types of gene annotation. So Gene Ontology also has molecular function, which is like enzymatic function, and cellular location. You can have disease associations; you can have protein properties, like whether a protein has a given domain; you can have information about interactions with other genes, or transcription factor binding sites. So there are a lot of different types of annotations on genes. Annotation just means you kind of decorate the gene: you associate some information with the gene. It's kind of like a tag: you tag the gene with whatever information you have. We're focused on pathway information. Some databases, like Gene Ontology, store pathway information and other information, so you should just be aware that when you're using these databases, sometimes you're pulling along more information than you might want. We recommend starting these types of analyses with just pathway information, because it's easier to interpret, and if you don't get any results or if you want to go deeper, you can start branching out into these other areas.
Okay, so pathway information; I'm going to talk about that next. How many people know about the Gene Ontology? Put up your hand. Okay, so a pretty good fraction of the people in the class. So I'm just going to go through the Gene Ontology fairly quickly and tell you what it is. Gene Ontology is a set of biological phrases, which are called terms, which are applied to genes. So for instance, protein kinase is a term, and it's applied to a gene: this gene is a protein kinase, or it has protein kinase molecular function or activity. Interestingly, each term actually has a full definition associated with it, so it's a dictionary. People probably don't see that as much, but you can use Gene Ontology as a full dictionary of tens of thousands of
biological terms, so it's useful. And it's also an ontology, which means it's a formal system for describing knowledge: it defines concepts and relationships between concepts. And this is the website. So here's an example Gene Ontology structure. This top box says gene ontology, and then it says biological process, and then physiological process, and homeostasis, tissue homeostasis, immune homeostasis, B cell homeostasis, and then B cell apoptosis. So as you can see, this organizes the terms in a hierarchy that goes from more general to more specific, so the bottom term here, B cell apoptosis, is very specific. And any gene that's part of B cell apoptosis is also part of any of these other terms, based on relationships like the ones here. And there are different types of relationships: B cell apoptosis is a type of apoptosis, and it is part of B cell homeostasis, so there are "is a" and "part of" relationships. So it describes gene function at multiple levels, and terms can have more than one parent, which is important, and I'll mention it in a sec. (Sorry, I just wanted to check if I included that slide.) Okay, so one of the issues that you should know is that because a gene is associated with one term, and that term has parents, the gene is also automatically, or logically, associated with all the parents all the way up. So this can create a lot of redundancy, and you have to deal with it somehow.
So Gene Ontology covers three aspects of gene function: where a transcript or protein is expressed, which is cellular component; molecular function, which is like the enzymatic activity type; and biological process, which is pathway information. It's called biological process because it could be quite generic, like metabolism, but it's all organized into different types of metabolism all the way
down to very specific terms. Okay, so there are two parts of Gene Ontology. There are the terms that I talked about. GO terms are added manually by trained curators that work at different databases, like model organism databases or UniProt. They can be added by request, so you could say terms are missing for my gene, and experts help with major redevelopment. I didn't update this table, but it's pretty stable these days: there are tens of thousands of terms, and biological process has about twenty to thirty thousand terms. Okay, so those are the terms.
The second part of Gene Ontology is annotations. Annotations are the actual type of information that we use for pathway analysis, which is the link between the terms and the genes. So if I have a gene and it's part of the cell cycle, I'm going to make an annotation: I'm going to write down in the Gene Ontology text file (you don't need to do it, the Gene Ontology curators have done this) that this gene is part of the cell cycle. These are known as gene associations or Gene Ontology annotations. There are multiple annotations per gene, so you can have multiple terms associated with a single gene, and some of the Gene Ontology annotations are created manually and some of them automatically, so I'll talk about that. There are three types of terms; an annotation just links the term to the gene, and you can link as many terms of whatever type you want to your gene. Okay, so I talked about this already.
Okay, so the annotation information is not just "gene A is part of the cell cycle"; it has more information associated with it. It has the evidence associated with it of why someone made that connection. In fact there are a lot of different types of evidence, so it's important to understand a little bit about these types of evidence. For the purposes of using these for pathway enrichment analysis,
the most important thing to know is that some of the annotations are manually created and some of them are electronic annotations. For the manual ones, there are two types. There are those curated by scientists; they're very high quality, but there aren't as many of them because it's a time-consuming process, and there's too much literature for the available curators to go through. So they also have computational methods that help them automate this task, and they manually review the results of some of these computational methods. Some of the computational methods have extremely high accuracy. For instance, for many years people have been able to very accurately identify transmembrane regions in proteins based on amino acid properties and frequencies, and it's like 99% accurate. So applying that computational method can identify all the membrane proteins, and those will all get tagged with a membrane cellular component term in GO, and you can be pretty sure that those are accurate. Some other computational methods are not as good, but that's why people try to review the analysis results.
But there's also a category called electronic annotation, which is annotation derived from computational methods without any human validation. It just all goes in, and the accuracy varies; again, some of the computational methods are better than others. So you have to understand where the information is coming from if you want to understand the quality level. But people put them in there because they might be useful. So say you're working on an experiment where you don't find any pathways that anyone knows about, but you look at the electronic annotation and you see interesting patterns that come up. At least that gives you a handle, a lead that you can follow, where you couldn't follow anything from the standard analysis. So it can be useful. But there are two key points. One is just to understand that these evidence codes exist, and that there's this electronic annotation, which is
typically lower quality, and so we actually recommend starting without using this. So start with just the manual annotation, and then if it doesn't work, you could extend it to include the electronic annotation. So here are all these different evidence codes. They actually have names. The electronic annotation is called IEA, inferred from electronic annotation, but there are other ones, like traceable author statement, which will have a publication reference associated with it, and there's inferred from physical interaction, where, if you look at the annotation, it will say, I annotated this to the cell cycle because it interacts with this cyclin protein that's very well known. So you can have this in your notes just to understand the different types. IEA is the one that we recommend removing from the initial analysis, unless you're working in an organism that doesn't have any of these other things.
So that's the next point, which is that all major eukaryotic model organisms and human are pretty well covered by Gene Ontology, as are several bacterial and parasite species, and there are always new species in development. But there is variation. It's hard to see here, but each one of these is a different species, so this is rat, human. This is out of date now; actually it's only two years old, but this changes over time. So you can see that the blue is the sort of curated data, and the green is the data that comes from evidence codes associated with experiment, and the blue are the predicted ones. This is the number of annotations, and the main thing you can see is that there's variation between species. So some species are much more annotated than others, and if you're working with an organism that doesn't have any annotation, or a newly sequenced genome, you have to use inferred from electronic annotation, and the way that it's inferred is always by taking all of the Gene Ontology annotation from the closest well-covered species
and just transferring it all over by orthology to the species that you're working with, and so those will all be considered electronic annotation that's unreviewed.
In humans, does it tell you what the experiments are, like in which cells or cell lines?
It does give you information, not a lot of detailed information like that, but it will say it came from this paper or something like that. So we don't usually go into that level of detail unless there's a problem. Occasionally we find misannotated genes or something like that, and then we'd report them.
So just for your information, here are some of the databases that contribute to the Gene Ontology: all the major model organism databases. Okay, so there's one other thing that's useful to know about Gene Ontology, which is that there's something called GO slim sets. Gene Ontology has too many terms for some uses. Sometimes you just want to summarize a whole bunch of genes in terms of function, like which cellular locations are present in my gene list, and if you just use Gene Ontology you'd get hundreds of terms, probably. So GO slim is an officially reduced set of GO terms that exists to make a simplified view of higher-level terms. We don't use it too frequently, but sometimes it's useful. Any other questions so far?
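The two ideas from this part of the talk, dropping IEA annotations for a first pass and the fact that a gene annotated to a term is logically annotated to all of that term's parents, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the annotation tuples, the toy parent table and the function names are hypothetical, and real Gene Ontology files have more fields than shown here. Only the evidence codes (IEA, TAS, IPI) and the term names come from the talk.

```python
from collections import defaultdict

# Toy parent links (child term -> parent terms), loosely following the
# hierarchy on the slide. Hypothetical data, not read from a real GO file.
PARENTS = {
    "B cell apoptosis": ["B cell homeostasis"],
    "B cell homeostasis": ["immune homeostasis"],
    "immune homeostasis": ["homeostasis"],
    "homeostasis": [],
}

def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def build_gene_sets(annotations, parents, exclude_codes=frozenset({"IEA"})):
    """Turn (gene, term, evidence_code) tuples into term -> set of genes.

    Annotations whose evidence code is in exclude_codes are skipped, and
    each kept annotation is also applied to all ancestors of its term.
    """
    gene_sets = defaultdict(set)
    for gene, term, code in annotations:
        if code in exclude_codes:
            continue
        for t in {term} | ancestors(term, parents):
            gene_sets[t].add(gene)
    return dict(gene_sets)

# Hypothetical annotations: two manual (TAS, IPI) and one electronic (IEA).
annotations = [
    ("GENE_A", "B cell apoptosis", "TAS"),
    ("GENE_B", "B cell homeostasis", "IPI"),
    ("GENE_C", "immune homeostasis", "IEA"),
]

manual_sets = build_gene_sets(annotations, PARENTS)
all_sets = build_gene_sets(annotations, PARENTS, exclude_codes=frozenset())
```

With the default setting GENE_C is left out everywhere; passing an empty exclude set brings the electronic annotation back in, which is the fallback suggested for organisms with little manual annotation.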
Okay, so the schedule is really 11, and then break to 11:30? Okay, so we're still good.
Okay, so there are also a lot of Gene Ontology software tools. These are freely available to anyone without restriction. I should have mentioned this in the beginning: this course focuses on freely available resources. Everything that we talk about, basically, is free for anyone to use without restriction; you can download it and use it yourself and change it if you want. So Gene Ontology is also like that. There are the ontologies, there are the gene associations, which are files, and then there are tools that are developed by Gene Ontology. Lots of groups have developed tools that use Gene Ontology. One tool that's pretty useful if you're interested in looking at the Gene Ontology is called QuickGO. This is just a search engine for Gene Ontology: you can type in terms and you'll get information about them.
Just a quick conceptual mention that there are other ontologies as well. So here are some examples: there's a cell type ontology that's being developed now, and there's a human phenotype ontology that captures all diseases. So there are lots of them, and sometimes you might come across them in your analysis, although infrequently, and we don't cover them in the rest of the class.
Okay, I talked about Gene Ontology; that's a very important source of pathway information. The one important thing to know about Gene Ontology compared to the other types of pathway resources that exist is that Gene Ontology can only define gene sets. It doesn't tell you anything about how the genes are connected to each other, like A regulates B; it doesn't have that information. It just says this gene is part of the cell cycle, and then another gene is part of the cell cycle. So if you look at the cell cycle, you can ask what genes are annotated as part of the cell cycle, and you'll get a list of 100 genes. So that represents a set of genes, and that's the input for the basic type of
pathway enrichment analysis that we'll talk about today. But there are also a lot of different types of pathway databases that have more information. Pathguide is a website that we've put together that lists, actually now, 700 pathway-related databases. MSigDB is a database that's made by the Broad Institute, the group that also made GSEA, the gene set enrichment analysis software which we'll talk about today, and MSigDB is a database of gene sets. Pathway Commons is one resource that we happen to work on that collects major pathways. And, as we'll talk about later, my lab collects a bunch of pathway databases together and makes a gene set database that's easy to use with these pathway analysis tools. We update it every month. It's only available for human and mouse, but that's another set that exists that's sort of friendly.
So I'm actually almost done. I think we started a little bit early, 10 to 15 minutes early in the schedule, so we'll have a little bit more time for the break. We're going to go into more detail about Reactome tomorrow, because Lincoln Stein, who is one of the PIs for Reactome (so we have a Reactome developer in the room), is going to teach tomorrow, and he's going to talk about Reactome.
So say you do the analysis and you have a gene set that turns out to be enriched, and it depends on where it is in the tree, on how specific it is. Let's say you get cell adhesion, for example. Do you know if it's like a pathway? Do the genes interact with each other? What kind of level of information do you get from that?
So the question is, if you get some general term coming out, like cell adhesion, what do you do with it? We'll talk about that in more detail in the pathway enrichment analysis section next, but the basic idea is that we like to focus the pathway enrichment analysis on terms that are the most useful. Terms that have lots and lots of genes associated with them, like a thousand genes, are usually very general, so we
apply a filter that says don't include pathways that have more than two or three hundred genes associated with them, and pathways that only have two or three genes associated with them, or some small number, we also remove. So we have a sort of sweet spot of like 10 to 200, or 20 to 300, or something in that range, and that usually identifies the pathways that are the most useful to see as a result.
Yeah, so that's another question we'll talk about later in more detail, but the question is, and this is applicable to transcriptomics for instance, you have genes that are upregulated and also genes that are downregulated, and also genes that are not differentially regulated. So there are kind of two halves of that list, and the question is, do you analyze the top half separately from the bottom half, or all together? And the answer is that you can choose. The basic types of analysis by default do the analysis separately, but one of the disadvantages of that is that you might have genes that are both positive and negative regulators within one pathway, and this analysis will separate those out and you might not get the strongest signal. So what you can do is combine those together by taking the absolute value of the scores and just having everything ranked from differentially expressed to not differentially expressed. So that's an option that you can use, but the default is thinking about things as up- and downregulated. The advantage of that is that the pathways then have a little bit more meaning to them, because you could say this pathway is going up, this pathway is going down. Or, it might not be going up and down; you could say this pathway is enriched in the genes that are upregulated, this pathway is enriched in the genes that are downregulated. And usually we might say members of that pathway are definitely getting expressed or downregulated, and then
what that means would depend on the pathway, whether it's a positive regulating pathway or a negative regulating pathway. Does that make sense? Do you have a question too?
I just wanted to ask, do you consider Gene Ontology analysis as a part of pathway analysis, or is it more like...
Yeah, so yes. As I said in the beginning, pathway and network analysis for us is any kind of analysis that involves pathways, however they're represented. So Gene Ontology has three parts. I'd only consider the analysis with the biological process part to be pathway analysis. The other two types, cellular component and molecular function, that's cellular component analysis and molecular function analysis. So that's why it's confusing sometimes. A lot of tools just throw all the Gene Ontology together. It's not a good thing, in my opinion, to do that, because it messes up the results. You combine all those concepts together and you'll see things that are really related, but they're just creating a lot more terms for you to deal with. And also, because you use more terms, it worsens the multiple testing correction problem, which we'll talk about later, so it reduces the statistical power. And a lot of those are very redundant: often you'll get a molecular function term where all the genes are the same as a pathway term that just happens to be the only pathway that's associated with that molecular function. So yeah, we recommend just focusing on the biological process terms of Gene Ontology. And then, to answer your question, everything that has anything to do with cellular mechanism of any kind, like a process, we consider pathway or network analysis. I usually just use the term pathway analysis, because every biologist knows what a pathway is. Many biologists don't know what a network is, or they think of different types of things when they think of networks. But in this class we'll definitely talk about networks and teach you about networks. Any
other questions? Okay, so just to continue and finish off the last few slides of this intro. There are lots of different pathway databases; you can go to pathguide.org to see what types of pathway databases exist. And then there are other types of annotations that are not pathways, like chromosome position, which is actually apropos of the question that was just raised. Sometimes this is useful if you're working with data that can be affected by chromosomal changes, which includes transcriptomics data. Sometimes you'll get pathways that come up in a transcriptomics experiment, like olfactory receptors, and you'll also get related pathways like GPCR signaling and sensing and things like that. Some of those sound interesting: GPCR sounds interesting, it's a signaling pathway. But sometimes that comes up because there's an amplified region of the genome that happens to amplify a set of genes that are all next to each other on the genome. Olfactory receptors are all next to each other on the genome, and if they get amplified, then hundreds of genes that are all in the same pathway get amplified. All those genes will look like they're overexpressed, and you'll get a really strong signal of the olfactory receptor pathway coming out, and you'll say the olfactory receptor pathway is enriched. So that's true, but why is that happening?
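A quick way to start looking at that question is to check whether the genes driving an enriched pathway all sit in the same genomic region. The sketch below is only an illustration and not part of the workshop material: the function name dominant_region, the gene names, and the region labels are all made up for the example, and a real analysis would take gene positions from an annotation database.

```python
from collections import Counter


def dominant_region(hit_genes, pathway_genes, gene_to_region):
    """Return (region, fraction) for the region holding most pathway hits.

    hit_genes: genes from the experiment (for example, overexpressed genes)
    pathway_genes: set of genes belonging to the enriched pathway
    gene_to_region: dict mapping each gene to a chromosomal region label
    """
    # Keep only hits that are in the pathway and have a known position.
    overlap = [g for g in hit_genes if g in pathway_genes and g in gene_to_region]
    if not overlap:
        return None, 0.0
    counts = Counter(gene_to_region[g] for g in overlap)
    region, n = counts.most_common(1)[0]
    return region, n / len(overlap)


# Hypothetical toy data: four receptor genes share one region.
gene_to_region = {
    "OR_A": "regionX", "OR_B": "regionX", "OR_C": "regionX", "OR_D": "regionX",
    "SIG_1": "regionY", "SIG_2": "regionZ",
}
hits = ["OR_A", "OR_B", "OR_C", "OR_D", "SIG_1"]
olfactory_pathway = {"OR_A", "OR_B", "OR_C", "OR_D"}
signaling_pathway = {"OR_A", "SIG_1", "SIG_2"}

for name, pathway in [("olfactory", olfactory_pathway), ("signaling", signaling_pathway)]:
    region, fraction = dominant_region(hits, pathway, gene_to_region)
    print(name, region, fraction)
```

If nearly all of a pathway's hits fall in one region, as with the toy olfactory set here, that is a hint that a chromosomal change rather than pathway regulation may be behind the signal. It is only a hint, and it does not replace a proper enrichment test.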
Well, one thing that you can do is analyze your data based on chromosome position, using chromosome position gene sets. These are gene sets that put genes together if they're in similar regions of the chromosome. There are different ways of doing this, of course: you could have standard chromosomal locus numbers or just whole arms of chromosomes. Then you can see if any of those positions are also enriched in your data. So it's a totally different question. It's not asking if there's a pathway that's changing, it's asking if there's some chromosomal position information that's explaining my data in some way.
Transcription factor binding sites are another example. Say we have a database of known binding sites. Actually, one of the sections of MSigDB includes gene sets that are predicted targets of transcription factors and also microRNAs. If you use that as your gene set database, you can identify genes that are enriched in the targets of a transcription factor. You might find that a transcription factor's targets are really enriched in your genes that are up-regulated, so you might say, oh, that means maybe that transcription factor is a regulator in my system, and you can go test it.
So you can use this type of analysis to answer different questions, but it's very important to think about what database you're using. A lot of tools just have default databases, and many of those defaults don't make sense to me, because, for instance, they will just use all of the gene ontology together, which, as I explained, I don't think makes sense. But many of them will also allow you to choose which databases you want, so you can choose depending on your question. I would not recommend just clicking everything, because you'll get less statistical power for your analysis and many more terms, and then you'll just be overwhelmed with
terms. You can always do that analysis, but I recommend for starters just working at the pathway level, and then you can branch out.
Okay, so there are lots of these other types of information about genes. I mentioned transcription factor binding sites, and some other things are in MSigDB. Genome annotation databases like Entrez Gene and Ensembl, and UniProt at least for proteins, have a lot of information about genes that you can download in bulk. For instance, there's a tool called BioMart. I'm not going to go through this in too much detail, but you have it in your notes if you want to try it out. You select a gene database and then an organism, and then after you wait a few seconds you can select filters. You can say I only want genes on a particular chromosome, or I only want genes that are limited to my list. Then you can select attributes, which are like annotations. These could be gene ontology terms, but they could also be protein domains, gene lengths, all sorts of information like chromosomal positions. So you can use this to go shopping for information about your genes in bulk. Of course, the information is available on a per-gene basis if you search these websites one gene at a time, but this helps you get information in bulk, so sometimes that's useful.
Okay, so we've learned that pathways, and attributes or annotations in general, are available from databases, and there are lots of different databases. We'll be focused here on common databases that are applicable to most things, but during the breaks we can talk about any other resources that might be useful for particular projects. I would recommend that you just poke around on these URLs and see what they're all about, just to understand that they exist and what they do.
Okay, so again, you know the rest of this. We talked about the general pathway analysis workflow. For the next session, after the break, we're going to talk about list-based
pathway enrichment analysis, where you have a list of genes and that's it. We're going to talk about rank-based pathway enrichment analysis, where you have a ranked list of genes, so not just a list, but you've ranked them by something like differential expression. Then this afternoon we're going to talk about this box here: visualizing your results and drilling down to identify interesting pathways. We don't talk about all of these tools in the workshop, but we talk about at least something related to each of these boxes. These boxes here will be more of the focus tomorrow and the next day. Tomorrow will be Reactome and GeneMANIA, and here's GeneMANIA as well, which helps you predict gene function. And then one thing that's not on here as a tool, but is listed here as transcription factor targets: day 3 will be focused on gene regulatory networks and thinking about those.
Okay, so this is a lab that we used to do as an actual lab, and it's now just included as something you can do if you want, to learn about identifiers and BioMart. You don't have to do this, it's optional, but it at least gives you a little bit of information.
Okay, so I think we're running early, so maybe I can take general questions from everybody, and then after that we can go on a break and maybe come back at a slightly earlier time. So, any other general questions about anything at all? Okay, we'll be around during the break, so we can answer questions. Let's say it is quarter to 11, and I think that makes sense because we started about 15 minutes earlier, and so should we come back 15 minutes