 Okay, good morning everybody. Nice to see everybody, including some familiar faces, people we've worked with and people we've seen before in other workshops. I am going to give you an introduction to the course and lay some groundwork for some of the concepts that we will use in the later parts of the course. Some of you may be familiar with some of these things that we'll talk about, but we wanted to make sure everybody's on the same page. As Michelle mentioned, all the slides are Creative Commons. Okay, so just to give you an introduction to why we want to do pathway and network analysis and what that means, I'll tell you a little bit more about it. As most of you know, the main motivation is that you are doing some kind of interesting experiment, usually a genomics experiment, something high throughput. A lot of data is generated, you get a huge list of genes, and now what do you do with the genes? Generally you want to interpret the experiment and figure out what it is telling you: tell me what's interesting about these genes. One of the most popular and useful methods for finding out what's interesting about genes is to compare them to known information about cellular mechanism: pathways, protein complexes, functions of genes like enzymatic functions, protein domains, lots of different information that we have about how the cell works. For instance, you might find that you're studying cancer and the cell cycle is really strongly enriched in your gene list. That's maybe not so surprising, but it's useful information. Okay, so why does this work? Why are pathways the right way of thinking about this? Not the only way, but a useful way. Well, the bigger picture is that the genome is specifying a lot of information about how the cell works. Proteins are expressed and they come together and form pathways and mechanisms in the cell. 
And so if you're studying genomics, if you're studying DNA sequence type information and you're collecting mutations, the mutations that are in the genome might affect how these systems are working. If you look at a list of mutations, you might see some signature and say those mutations are associated with a particular disease. That disease may basically be a pathway-related disease. Most diseases are somehow tied to a given pathway or a set of pathways. If you're studying cancer, for instance, there are the famous 12 hallmark pathways of cancer, and multiple sclerosis and various other diseases target the immune system. So often the mutations that you see associated with the disease or the phenotype that you're studying are targeting the same pathway over and over again. They might touch different parts of it, but there will be a signal with that pathway. So finding out the pathway might tell you something very interesting. Also, the cell when it's working is generating a lot of state information. The cell is changing state over time. When we measure state information, molecular phenotypes like gene expression data, it's the level of all of the genes that are expressed at a given time. You might have that information across different time points or different conditions or different disease states. And obviously, to everyone who's a biologist in the room, you know that these pathways will often be coordinately regulated. So a lot of differentially expressed genes might be related to each other and related to a particular pathway. There might be multiple pathways that you have signal for in your gene expression data, but thinking about it from a pathway perspective really simplifies things. 
You might also have other type of phenotype information, survival information, family phenotype information from, you know, genetics, and that information can also be incorporated into this analysis. So you might be able to find links between pathways and survival, for instance. There's also the environment. We're not really talking about that here, but obviously the environment plays a role and also touches pathways and pathways react to the environment. The information that's coming from, so it's very useful to kind of just know as much information as we can about pathways in the middle here. And this information comes from many sources, databases, literature, experiments, your own experiments, for instance. You might be mapping this information yourself. You want to use this information and study it. So we'll talk about all of these things in the course. Okay, so pathway and network analysis in my mind is a very general concept relating to any type of analysis that involves pathway or network information. Sometimes we call this prior information or just knowledge about the system. Whenever we're using that knowledge about cellular mechanism, I think of it as pathway and network analysis. It's very, very commonly applied to help interpret lists of genes, as I mentioned. The most popular type is pathway enrichment analysis, which we'll talk about this morning. And in general, as I mentioned, it helps gain mechanistic insight into omics data. So that's one of the primary goals. You have a whole bunch of genomics data. You want to figure out some mechanism that is either causing or explaining the cellular state or is caused, the change in that mechanism is caused by changes in the genotype. Pathway data is also very useful for improving statistical power. So if you were just looking at the raw data that you've produced, usually it's in the form of transcripts or SNPs or proteins. 
And so thinking about things in terms of pathways instead of those simpler data types is very useful for improving statistical power, because there are fewer statistical tests that you have to consider if you're thinking about pathways. Pathways collect a lot of information about transcripts, for instance, all together in one concept. When you do a statistical test about that one concept, you get more power than if you do a statistical test for each of the hundreds of transcripts that might be part of that pathway, because of multiple testing correction, which we'll talk about this morning as well. Generally, pathway analysis is also more reproducible. There are lots of famous stories about how people have looked for gene expression signatures that predict the outcome of breast cancer. That was one of the first diseases that was studied with gene expression signatures. What people found is that even if they were studying the same types of breast cancer, the gene expression signatures that they derived were different and basically didn't overlap in terms of genes. But when people looked at the pathway level, often it was the same pathways that were being affected. So then you can find relationships and similarities across data sets. It's also easier to interpret because it's in the form of familiar concepts like the cell cycle or apoptosis. And as I mentioned, very importantly, it can help explain mechanism. It can help get at the cause of why you're seeing gene expression changing or why all these mutations are targeting this set of genes. That could really help you in understanding your system. We'll see a number of examples of this and we'll really get into some of the details of this in the course. And I'd like to talk now more about how pathway information helps us find the causes, or get a more causal understanding of the system, whereas if you didn't have pathways, you'd be limited to correlation-type analysis. 
So genome-wide association analysis associates a mutation with a phenotype, but that doesn't say that mutation causes the phenotype. However, if you're able to say that this mutation is affecting a pathway and that pathway is clearly related to the phenotype that you're studying, then that's potentially giving you some information about the cause, and it can really help you narrow down to useful things to follow up on. Same thing with gene expression data. The pathway explains the gene expression signal, and if we get a very simple explanation of that gene expression signal, we can basically have an understanding of what's happening in our system. So just to continue along the track of explaining pathway analysis and its different steps, the first thing that we assume everybody coming into this course knows is how to normalize their data. We're not going to talk about normalization in the course, but if you have questions about it, many of the instructors and TAs can help answer them. Generally, each data type that you're working with has its own standard. If it's very well-established, like microarrays or RNA-seq, it has a standard workflow for normalizing data. The newer the data, the less standard it is. It takes a certain number of years before normalization techniques are very standardized. But in general, the people that you get the data from, and often people get their data from a core facility these days, are the best people to talk to about what the state of the art is for standardization and normalization. The output of that is normalized results, which we can convert into a gene list, and that's what we expect to start with here. There are other workshops that talk about the statistics of normalization, but just to let you know, that's kind of the background. Here's the workflow that we're thinking about for pathway analysis, which is general to pretty much all pathway analysis. 
We're collecting genomics data like mRNA expression. We, as I mentioned, normalize and score. So scoring might involve computing the differential expression. So we ask which genes are different, differentially expressed between normal and disease state or phenotype of interest. We generate a gene list. This might be the top most, you know, this might be, you know, all the genes in your list that are scored and ranked according to differential expression. Or if you have a lot of samples, you might be clustering your data and you might find that there's a certain number of clusters of genes that act similarly. And so each one of those clusters defines a list of genes that you could analyze and figure out what pathways are enriched in that list. And generally, after that, you want to learn about underlying cellular mechanism using pathway network analysis. So that involves visualizing and identifying interesting pathways in networks, often drilling down to understand more about the molecular mechanism and then developing some model that ideally you publish. So the interesting, this is where we really spend most of our time in visualizing and identifying interesting pathways in networks. What interesting is and how you define what's interesting is, you know, dependent, there's a number of ways of doing that that we'll talk about. And we'll also talk about drilling down to understand more details about molecular mechanism. So just to give you an example of a pathway analysis study that we were involved in that ended up being a very good example of how statistical power was increased in the analysis. I'm going to talk about a paper that was published a few years ago now led by Steven Scherer, who's a geneticist, a human geneticist in the University of Toronto at Sick Kids Hospital, who mostly studies autism spectrum disorder. 
And at the time, and so when I got involved in this project, I was surprised to learn how heritable this syndrome is and for people that don't know, this is a syndrome that affects, usually, appears in childhood and is a problem with social interaction and some other aspects of, there's other aspects of the phenotype. But I was very interested to see that it's actually highly heritable. I didn't realize how much heritability there is in this disease. The twin concordance is up to 90%, although there's a spectrum of phenotypes, so you might have one twin with very mild or almost non-existent disease or disorder and some one twin with very severe disorder. So people had previously studied this disease and found a number of rare, single gene mutations that were correlated that were thought to be involved in autism spectrum disorder. But more recently at the time, people had noticed that a lot of copy number variants, including rare de novo copy number variants were important and so they wanted to study this further. So they collected approximately 1,000 cases and controls and used a SNP array to identify copy number variants. So the copy number variants are identified from SNP arrays by looking at a series of SNPs that form a genomic region that the SNP array measures the level of DNA in the sample so you can see if there's a deletion. There's no DNA measured for a whole track of the whole length of SNPs that form a region and so you call that a deletion and similarly for a gain or amplification. They filtered out large chromosomal rearrangements and anything common focused on these high quality rare copy number variants that they thought were de novo. And they looked to see which genes were affected by these copy number variants were associated with autism spectrum disorder. And they found a few, but very few. 
And the reason for this is they had so many thousands of copy number variants and SNPs that they could look at that they lost a lot of statistical power in their multiple testing correction, which we'll talk about. So we looked at things from a pathway perspective. This was work done by Daniele Merico, who was in my lab at the time and now works at Sick Kids in Steve Scherer's group. Instead of looking at genes associated with autism spectrum disorder, we looked at pathways associated with autism spectrum disorder. So instead of looking at, say, tens of thousands of genes, we're looking at hundreds to thousands of pathways, and we found a very strong signal of lots of different pathways that were associated. Interestingly, when we looked at some of these pathways, we realized that the genes that were in the pathways weren't mutated at a high frequency in our cases. But the pathway was mutated at a high frequency. There were different genes in the pathway mutated in every person, so you could never tell by looking at individual genes that there was a signal, because you only found one person out of a thousand that had that gene mutated. There's no way you could distinguish that from background. But when you collected the mutations by pathway, we found that there might be 10 or 20 patients out of a thousand that had that pathway affected by deletions. That illustrates how integrating pathway information can increase your statistical power, in this case especially, because the mutations were very rare and there was no way you could really distinguish them from background. A lot of these pathways were involved in functions that made sense, like central nervous system development. That also helps verify that the experiment is working and that you're getting interesting biological signals from it. 
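To make the gene-level versus pathway-level counting concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the case names, gene names, pathway names and membership are toy values, not data from the study, and the Bonferroni threshold is used only as a simple stand-in for multiple testing correction, not as the method the study used.

```python
from collections import Counter

# Hypothetical toy data: each case carries one rare deletion hitting one gene.
case_to_gene = {
    "case_01": "GENE_A",
    "case_02": "GENE_B",
    "case_03": "GENE_C",
    "case_04": "GENE_X",
}

# Hypothetical pathway membership (not taken from any real database).
pathway_to_genes = {
    "toy_cns_development": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "toy_other_process": {"GENE_X", "GENE_Y"},
}


def count_cases_per_gene(case_to_gene):
    """Number of cases in which each individual gene is hit."""
    return Counter(case_to_gene.values())


def count_cases_per_pathway(case_to_gene, pathway_to_genes):
    """Number of cases in which any member gene of each pathway is hit."""
    counts = {}
    for pathway, genes in pathway_to_genes.items():
        counts[pathway] = sum(1 for g in case_to_gene.values() if g in genes)
    return counts


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


gene_counts = count_cases_per_gene(case_to_gene)
pathway_counts = count_cases_per_pathway(case_to_gene, pathway_to_genes)

print("cases per gene:", dict(gene_counts))
print("cases per pathway:", pathway_counts)
# Fewer tests means a less stringent per-test threshold.
print("threshold for 20000 gene tests:", bonferroni_threshold(0.05, 20000))
print("threshold for 1000 pathway tests:", bonferroni_threshold(0.05, 1000))
```

In this toy data no single gene is hit in more than one case, but the first pathway collects three cases, which is the pattern described above.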
Okay, I might have time later to tell you about another really interesting story that was more recently published that goes more into mechanism and actually translation. If I have time, I will go into it. Okay, so gene lists come from a range of places and depending on where your gene list comes from there might be different types of analyses that you might do. So molecular profiling, looking at mRNA or transcript or protein expression. Some people may have seen a couple, I guess it was last week, two papers in Nature publishing the protein expression atlas across dozens and dozens of human tissues. So this is actually getting to be more, protein expression measurements are getting to be more mainstream. This allows you to identify transcripts and genes or transcripts and proteins that are relevant to your experiment. So you'll get a gene list out of that. You also can quantify the information so you might not just look at whether it's there or not but actually measure the level of it and that gives you a gene list plus some values. And those values can be useful in pathway analysis for specific types of pathway analysis. So mostly you'd want to rank and cluster your data, as I mentioned before. If you just have a gene list, if you have a gene list that is just identified, the genes present or absent, that's actually very clean because you have a gene list that's well-defined and you can go and do your pathway analysis on that gene list. If you have a list of genes that are ranked by differential expression, some score that you have, then that becomes a little bit more, there's a couple of additional steps that are required. For instance, you need to think about this ranking. Are you going to take the top set of genes and the most differentially expressed genes and analyze those, or how are you going to set that cutoff? 
And actually what we prefer to do is not set a cutoff and use all of the genes that are in the list, because setting that cutoff is usually very arbitrary. You don't usually have a good scientific rationale for choosing 0.5 or 1.0 or whatever value you're going to use for your threshold. So it's better if you can use statistics that consider all the information, and those statistics are available and widely used, and we'll teach you about those. Another way of getting gene lists is by clustering, as I mentioned. We're not going to talk about it in too much detail, but many people who have lots of data cluster the data, with unsupervised clustering, to find groups of genes that act similarly across samples, and that defines a straightforward gene list. People also do protein interaction measurements. They identify microRNAs and transcription factor binding sites, and increasingly there are technologies available to do these, like chromatin IP followed by DNA sequencing. Did you have a question? No, I think it's best to work with the quantification data if you have it. It's just that sometimes you don't have that information. So for instance, if you're clustering your data, even if you have, well, I guess a better example is chromatin IP. You have a transcription factor and you want to see where it binds in the genome, and you measure all of the binding locations. That's it: you have the binding locations, you link them to genes, and you have a list. So you don't have any kind of natural score. There's really only confidence in there, but assuming that you're making good confidence calls, which are much easier to make with a scientific rationale, you use statistics to say these results are confident and I'm going to go forward with these. Once you do that, you have a list. Does that make sense? Right. Yeah, so the question is, if you have quantitative data like gene expression measurements and you want to use a tool that doesn't use it, is that useful? 
Yes, it can still be useful. Generally, we try to go through workflows and tools in this course that make use of all the data that you have, but you might come across a tool that doesn't consider all the data that you have. It's usually still useful for you to use that tool, and often you'll get similar results. Generally, when you use the quantitation information, you'll get better, more sensitive results. You might get more information coming out of it, but sometimes, just taking the top 100 genes that are differentially expressed and putting them into a simple pathway analysis method, you might get some strong signal there that will be the same as with the method that's more advanced. The method that's more advanced is likely more sensitive and gives you more information. If you're going to do pathway analysis and you have a choice, you should use the more advanced method, and we'll recommend methods that are appropriate, but if you just want to try out different tools, there's no reason not to throw the data in and see what happens. Does that make sense? But then you have to think about the cutoff, and if you're going to publish that, then you have to argue for it somehow. That still works fine, but over time it will be increasingly difficult because more and more people will ask, you know, why aren't you looking at 1,000 genes, not 100 genes? You might have some natural way of doing it that works. Any other questions? Feel free to interrupt with questions. Okay, so genetic screens. You might be working with a model organism where you're looking at perturbations of genes with microRNAs or gene knockouts, and you might have those linked to phenotype. So you might be doing a screen to see which genes are involved in a phenotype, and that defines a list of genes. 
Association studies, as I mentioned, genome-wide association studies that associate genetic markers with a phenotype, generate a list of genes. Anyone here who has gene lists that aren't covered by these points? These cover most things that people do, but sometimes there's... Okay, right. But then you still relate those to genes somehow, and we'll go over a method that's useful for methylation, called GREAT, so you'll see that later. I think it's in the assignments, right? Yeah, so that's a good point, which I didn't make explicit here: pathway analysis depends on a gene list, right? So somehow you need to get from your raw data to a gene list. Depending on the type of data you have, that might be easy or complicated. You might get a direct readout of genes, like in gene expression analysis, or if you're doing methylation mapping, then you have these regions of the genome, and copy number variants are the same thing. How do you relate those to genes? In CpG island methylation, there's usually an idea that if a promoter region is highly methylated, then it's going to be silencing the gene that's downstream, and so that's how you might relate it. But there is probably a lot of methylation that's not explained by that and not covered. And that's actually a good point to mention about pathway analysis. Because we're working with genes, on the positive side, we get all this useful mechanistic information. On the negative side, we don't know everything about all pathways, and we don't know the function of all genes. In fact, for half of the genes in most genomes that are studied, we don't have a very good function. We'll talk later in the class, on, I guess, Friday, about how to think about gene function prediction. But pathway analysis only works with the genes that we have pathway information for. 
So you should always be aware that you might have very strong signals coming from genes with no known function. Those should be followed up because they might be very important for your particular study. Yeah? They might not. Yeah, so then... And that's actually an interesting point for later. You'll see in the pathway enrichment analysis lectures that most pathway enrichment analysis tools don't consider the difference between an activator and a repressor. So you might find your activators and repressors are in different parts of your gene list. One way to address that is to not consider the sign of the change. You just consider whether it's changing or not. But that's an interesting point that is not generally considered very much in pathway analysis. We might have a chance on Friday to talk about the future of pathway analysis. It's still an active area of research, and the goal of that research is to integrate as much of this knowledge as we can. You'll see some of the methods are very simple, but that doesn't mean they're not useful. They're still quite useful. Ideally, we would incorporate more and more of this information over time. I also keep on using gene expression as an example. That's just a common example, but the answer to the last question applies to any other type of data that you have. And we'll use different types of data in the course. Okay, so biological gene lists, because they come from different places, might mean different things. If I'm looking at differential gene expression, the gene lists that I get out of that are probably very much related to pathways. However, if I'm doing a mass spec experiment to identify everything that's in the nucleolus, now there are going to be pathways that are in the nucleolus, but my gene list is really related to location, to what part of the cell I'm studying. 
Also, we have chromosomal location, which I just talked about, where we can add methylation to this list. You have a list of genes that might be all the genes that are in a region of the chromosome. So it is important to just understand what that means, because the different... just think about it, obviously I think that's obvious to most people, but there might be different choices that you make with pathway analysis, depending on this. And so the questions that you... the question that you want to answer is important. So some of the standard typical questions that people ask when they're doing genomics experiments is what biological processes or other aspects of gene function are enriched in the gene list? So that's one question that we'll answer with pathway enrichment analysis. What pathways are different between samples? So you might have two different samples that you have some idea that they're different, but you might not really know how different they are, but if you see that there's very different pathways active in two different samples that could help tell you that they're different. You could be interested in finding a controller or a master regulator. So we will review all of these points. You might be wanting to discover new pathways or new members of pathways, discover new gene function. So you might find that there is a gene that's really important, really associated. You get a really strong signal. We know nothing about it in the typical databases, and you might want to learn about that. So we'll talk about that. Or you might want to correlate pathways with disease or some phenotype. So we will talk in this workshop about regulatory network analysis to find controllers, pathway enrichment analysis to summarize and compare, and network analysis to predict gene function, find new pathway members, and identify functional modules, which might be new pathways. Okay. 
So going into a little bit more background information, but in more detail, on pathway enrichment analysis. One of the standard methods, which Quaid will be talking about later this morning, is pathway enrichment analysis, where we have our gene list and we have a whole set of pathways, and we basically ask, are there any pathways that are enriched in my gene list? So say I have a thousand genes and half of them are related to a particular pathway. That's much more than I'd expect if I look at the whole genome, where only 5% of genes are related to that pathway. So 50% versus 5% is a big enrichment. This is the simplest type of pathway analysis, but a lot of pathway analyses have the same parts. They have your gene list as input, which you define, and we talked about that. And you have your pathway databases, which could also be extended with other types of gene function, not just biological processes. So I'll just go over some basics of working with gene lists and then some information about useful sources of pathway information, and more of that will be discussed later on in the course. Okay. Some people might be very knowledgeable about this if they've been working with their gene lists a lot, but one of the important concepts from the informatics of working with gene lists is that you have to, obviously, figure out what your gene is. Usually there's a name or a number associated with that gene, and these names or numbers are called identifiers. For them to be useful, they need to be unique and stable, and usually they're kept track of in databases. Entrez Gene, from the NCBI, records gene identifiers as numbers. So that's great. The problem is that there are thousands of databases that store information, and often they'll have their own numbers that they assign. 
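The 50% versus 5% comparison described above can be turned into a p-value. The following is a minimal sketch, assuming a one-sided hypergeometric test (one common way to score over-representation, and not necessarily the exact statistic used by the tools in this course). The function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value: P(X >= k).

    k: genes in the gene list that belong to the pathway
    n: size of the gene list
    K: genes in the background (e.g. genome) that belong to the pathway
    N: size of the background
    """
    denominator = comb(N, n)
    numerator = 0
    # Sum the probability of seeing k or more pathway genes in the list.
    for i in range(k, min(n, K) + 1):
        numerator += comb(K, i) * comb(N - K, n - i)
    # Python integer division stays accurate even for very large integers.
    return numerator / denominator


# Toy numbers (assumed): 10 of 20 list genes are in the pathway (50%),
# while 100 of 2000 background genes are in the pathway (5%).
p_value = hypergeom_enrichment_p(k=10, n=20, K=100, N=2000)
print("enrichment p-value:", p_value)
```

A small p-value here says that seeing this many pathway genes in the list would be unlikely if the list were drawn at random from the background. In practice the test is repeated for every pathway, so the p-values then need multiple testing correction.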
So you have one gene, you might have thousands of numbers associated with it or names. There's also record databases for genes, DNA, RNA, and protein that are different, so even though you're thinking about genes and you're thinking, okay, it doesn't matter if I'm thinking about gene or DNA or RNA or protein that comes from that gene, it's all the same concept to me right now for pathway analysis, but there will be different identifiers and you need to understand the correct type. So a gene symbol, the name of a gene, is really about the gene. It's not about the DNA location of the protein, because there are multiple proteins and multiple transcripts, so you can't have a unique identifier for a gene symbol if there's multiple things coming out of it. So gene databases don't store the sequence, they just link to the sequence, they store the concept of the gene, so that's just an important concept to know. NCBI has dozens of databases that are really interlinked a lot by these identifiers. These identifiers are referencing different databases. Just for your information here's an example of a lot of different identifiers. The ones that are underlined and highlighted in red are we are recommending for general use because they're generally widely recognized by tools, in particular entree gene IDs are quite useful and gene symbols, but gene symbols are specific for specific organisms, whereas entree gene is general for everybody. And these blue numbers illustrate the variety of identifiers you can see and also the potential for confusion, you might have an entree gene ID and an RGD the rat genome database, they both use numbers, so if I give you a bunch of numbers you won't be able to tell what identifier it is. So that's confusing. So you just really have to be careful about remembering what ID you're using, what ID type. Sometimes that's a problem if you have these big spreadsheets that you get from someone with lots of columns and lots of different IDs. 
Maybe those IDs aren't even listed in that spreadsheet, so it's good to keep track of them, the types. Okay, so there's tons of identifiers. Most software tools and pathway analysis only recognize a handful of these types, and so usually if you have some specialized type of identifier that usually comes from a platform, an experimental platform that you're using, this was more of an issue with microarrays, so gene expression microarrays have each company has their own set of names for all the probes on the microarrays and you have to convert those to gene names that are recognized by pathway tools, and so there are online tools for converting those things. You might also have a gene identifier type that's not recognized by a pathway tool even though it's a standard type, and so you need to convert. So doing these conversions, these identifier translation is one kind of main thing that you might need to do and obviously these are also useful for searching for your favorite gene. If you're searching a database that doesn't recognize your identifier you won't find any results but that might not mean that gene is not there, it might just mean that you have the wrong name for it, so just to, you know, you probably, most people probably know this, but just to try different if you have a set of genes, you don't find a gene in some resource, but you expect to find it, maybe you're not using the right type. Gene identifiers are used for linking, so if you go to a website often there'll be links to other websites and usually they use these identifiers and merging from different data sources, so if you have multiple data types that you're combining ideally you have them all with the same identifier and you can easily merge them, but if you don't, you'll have to convert them to entree gene IDs for instance. 
Okay, so one of the challenges of working with identifiers is basically avoiding errors. If you're working with a few genes, it's not really a problem; you can always detect if something went wrong. But if you have a list of 20,000 genes and you're just copying and pasting them into spreadsheets, there might be some big problems happening that you don't see. That's why we recommend using stable, unique identifiers. Gene symbols count as that, but gene names and protein names don't. There are a lot of names for genes and proteins that are not standardized but are used in the literature. In cancer, people say p53, but the gene symbol is TP53. That's pretty similar, but sometimes you get very different names, like LFS1, used in a paper, and what is that? Well, if you type LFS1 into a pathway analysis tool, it might recognize it as a different gene, not p53, even though that's what you meant. So it's always better to use the standard gene symbols for that reason. People also may have noticed that Excel can introduce errors. Excel is very commonly used in biology, and especially if you're pasting a lot of genes, some gene names are recognized as dates. How many people have seen this? Okay, so quite a few. Oct-4 is an important transcription factor, and if you paste that into Excel, just by loading up Excel and pasting it, it will think it's October 4th. So that's not good, and there are dozens of genes like this. So whenever you're pasting into Excel, you should paste as text: use Paste Special and choose Text, not General, because General tries to be smart and it's not smart for biology. There are also sometimes problems reaching 100% coverage. You might have a thousand genes that you're interested in working with, say with Affymetrix identifiers, and you want to convert them to Entrez Gene IDs.
But you might only get 950 of them converting, and 50 of them are missing. There are reasons for that. Usually it's a version issue: there might be out-of-date data in one place, and then that doesn't map correctly, for the reasons I mentioned before. In that case, what you can do is take the 950 that worked and set them aside, take the 50 that didn't work, and try different resources and different ways of converting those, such as converting them to another identifier and then back to Entrez Gene IDs. Or, if worse comes to worst and it's important for you, you can manually search for them in Entrez Gene, and then you'll have a complete list. That's something that happens fairly frequently. Here's a paper that talks about all of these problems with using Excel. It's kind of interesting. There are also some high-profile cases, like this paper from Nature from over 10 years ago now, which was retracted because of a gene identifier problem. They were interested in a microRNA, and the target that they thought they were looking at was homolog of ES1, HES1. But they actually looked at the transcriptional repressor Hairy enhancer of split, also called HES1. Because they searched for HES1 and it wasn't a unique ID, they made a mistake, and the whole paper was based on this, so it had to be retracted. So that's just a good example of what not to do. The ID mapping service that we'll be talking about is g:Convert. It supports a lot of different organisms. I noticed a couple of people in the audience are working with prokaryotic systems. I don't think this deals with prokaryotic systems, so we'll have to look at that. But the basic idea is that you type in a set of genes as, say, gene names or gene symbols, and you can convert them to Entrez Gene IDs. We'll look at this in the assignment, and there's an optional lab.
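The two checks discussed above, spotting gene names that a spreadsheet has turned into dates and chasing the identifiers that failed to convert, can be sketched in a few lines of Python. This is only a sketch under assumptions: the probe names and both mapping tables are hypothetical stand-ins for real conversion resources, and the date pattern only covers the day-month style such as 4-Oct.

```python
import re

MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
# Matches strings such as "4-Oct" or "Oct-4", which is what a gene name
# can look like after a spreadsheet has interpreted it as a date.
DATE_LIKE = re.compile(
    r"^(\d{1,2}-(" + MONTHS + r")|(" + MONTHS + r")-\d{1,2})$",
    re.IGNORECASE,
)


def looks_like_excel_date(identifier):
    """Flag identifiers that look like spreadsheet-mangled dates."""
    return bool(DATE_LIKE.match(identifier.strip()))


def convert_with_fallback(ids, resources):
    """Try each mapping resource in turn on the IDs that are still unmapped.

    Returns the combined mapping and the IDs that no resource recognized,
    which are the candidates for a manual search.
    """
    mapped, remaining = {}, list(ids)
    for resource in resources:
        still_missing = []
        for identifier in remaining:
            if identifier in resource:
                mapped[identifier] = resource[identifier]
            else:
                still_missing.append(identifier)
        remaining = still_missing
    return mapped, remaining


pasted = ["TP53", "4-Oct", "1-Mar", "BRCA1"]
suspicious = [g for g in pasted if looks_like_excel_date(g)]

# Hypothetical probe IDs and two hypothetical conversion tables.
probe_ids = ["probe_a", "probe_b", "probe_c", "probe_d"]
primary = {"probe_a": "7157", "probe_b": "672"}
secondary = {"probe_c": "1956"}

mapped, unmapped = convert_with_fallback(probe_ids, [primary, secondary])
coverage = len(mapped) / len(probe_ids)

print(suspicious)
print(f"coverage: {coverage:.0%}, still unmapped: {unmapped}")
```

Anything left in `unmapped` after all resources have been tried is what you would look up by hand if the gene matters to you.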
We're not going to go through a lab on that. So the general recommendations for proteins and genes are to map everything to Entrez Gene IDs or official gene symbols using a spreadsheet, to aim for 100% coverage as I described, and to be careful of the Excel auto-conversions, especially when working with large gene lists. That will generally make your life easier. Just note again that we're talking about genes. Genes encode transcripts and splice variants, multiple forms of that gene, and sometimes those forms are very different in function from each other. However, in general, at this point in time we don't have a lot of information about transcripts and splice variants. You might have a lot of information from your particular area of study, so you might be able to use that, but generally pathway analysis methods just consider genes and lump all the transcripts together. The reason is that pathway information is not generally available at the resolution of splice variants. Yeah? Right, so that's a good question. The question is, what do you do if you have a nice list of gene symbols and you put them into a software tool and it doesn't recognize them? The reason for that is usually a version problem: the software tool is using a different version of the database than you used, and databases are getting updated all the time, so there's a chance of that. One thing you can do is convert those gene symbols to an identifier type that the tool prefers; if you look in the pathway tool, they might have a primary type that they've used. For instance, some databases are based on Ensembl, and they might be better at recognizing Ensembl identifiers. Generally, Entrez Gene IDs are pretty good, but that's always going to be a problem. If you really want to get full coverage, then you have to investigate a little bit more and maybe try different translations.
Probably the practical thing to do would be either to read about the resource to see what IDs they recommend, though often people don't recommend an identifier, or to just try a few different common identifiers: try mapping to Entrez Gene ID, Ensembl ID, RefSeq, or UniProt, and maybe those work better. It's always a problem, and sometimes you just can't solve it unless you're the person developing the tool and you update its database, which is not possible for most users. So the next question, I think, is what if your gene is not a protein-coding gene, and how do you deal with that? There are really two questions there. One is about protein-coding genes, and I should have specified further that the genes generally considered by pathway and network analysis tools are only protein-coding genes. There are a lot of other genes being described now. RNAs are sometimes included in pathways, but generally not at this point; that will come over time. There are lincRNAs, thousands of them, and all sorts of genes that are expressed, and for some of them we don't even know what that type of gene does. In general, the more well studied a gene is, the more likely it is to be incorporated in pathway analysis, so over time, as microRNAs are better studied, they'll be more and more incorporated.
The other question was more about the context of your experiment. If you have a gene that is maybe expressed but never translated to a protein, and you're linking it to protein interactions, that doesn't make sense biologically, because that protein will never be expressed and now you're considering it in a protein complex. Right. The pathway databases don't really try to incorporate any context, so they don't know about that. They just try to get all the information they can about anything that could happen in the cell, and generally your experimental method, say gene or protein expression, will help provide that context. So if you have that information, if you know that the protein won't be expressed, then you can use it. Otherwise, whenever you're working with transcripts, you're left with that open question of how it gets converted to a protein. We know that in general mRNA level is correlated with protein level, but any individual case could be completely opposite: you can have a high transcript level and no protein, or vice versa, because of many, many factors. People have studied mRNA expression versus protein expression in general, and they find that, yes, it's correlated, and often if there's no mRNA there's no protein, things like that. But that correlation is not great, so there are lots of problems with it, and any particular experiment will be very different. That's why protein expression data would be great if we could have it; then you could really get at that. But since we don't have it, we're limited to transcript levels, because they're easy to measure. So it's not what we want to measure, it's what's easy to measure in that case. And so a problem in genomics in general is what
chemical species we are measuring. So it's a very good question, and I'll mention it, actually, tomorrow afternoon I think, in terms of what the pathway and network databases don't cover. We'll talk about that a little bit. Yeah? So the question is, if you have genomic regions, in general points on the genome, how do you convert those to genes? We don't really talk about it too much. We are going to mention a tool called GREAT that is useful for pathway analysis of regions, and GREAT has a lot of options for how to convert those regions to genes. I think it's in the integrated assignment, right? So when we get to that point in the assignment, you can see how they're doing it. But I agree, there's no standard, and it's a problem. Most people will choose some reasonable way of doing it, but that's probably missing a lot of important information. Again, part of the reason is that we don't exactly know how SNPs relate to genes. Clearly, if it's in the coding region, it's easy, and everybody will include that. But the other extreme is an enhancer region 100 kb away from your gene, and we just don't know about that. Is there another question? Okay, I should move on, because we actually have another section here about pathways. We talked about the gene list, and now we're going to talk briefly about pathways. We'll come back to this in a few places in the course, but this is just to give you some background about a couple of sources of gene pathways and other gene function attributes. In general, there's a lot of information in databases. Pathways come from pathway databases like Reactome, where Robin works and which we'll talk about, and also from the Gene Ontology, and I'll mention what the Gene Ontology is. There are also a lot of other types of annotations: chromosome
position, disease association, interactions with other genes, and we'll consider those later. So, Gene Ontology. How many people know about the Gene Ontology? Okay, so about a third to a half of people, so I'll just go over it. Gene Ontology is a dictionary of biological terms or phrases that are applied to genes, like protein kinase, apoptosis, membrane, all sorts of attributes of gene function. It's a dictionary because each of these terms has a definition, and it's also an ontology, which is a formal system for describing knowledge, because there are relationships between the terms. There's a hierarchical structure where more general terms are at the top of the hierarchy and more specific terms are at the bottom, and the relationships include things like "is a" and "part of". So you might have the nucleolus as part of the nucleus, as part of the cell, or you might say that protein kinase is a type of kinase and apoptosis is a type of cell death. So multiple levels of detail of gene function are described, and an individual term can have multiple parents. When a gene is associated with any Gene Ontology term, you infer that it's also associated with all the parent terms in the hierarchy. Gene Ontology covers three aspects of function: cellular component, which is where the gene is expressed in the cell; molecular function, which is what the gene does at a molecular level, like enzymatic function; and biological process, which is the pathway that the gene is part of, or the global process. The biological process side is usually what we're most interested in for pathway analysis. There are two parts of Gene Ontology: terms and annotations. Terms, which I've been talking about, are added by editors at database groups and by request from users, and experts help with developing the ontology. As of a couple of years ago there were almost 40,000 terms with definitions, so it's a big dictionary, and most of
them are biological process terms. The second part is annotations. This is where terms are linked to genes. So I have a gene, and I want to say that it's part of cell division, that it has glucose-6-phosphate isomerase activity, and that it's in the cytoplasm or something. I will take my gene and link it to those terms in Gene Ontology, and when I do that, there's some information associated with it, which is the evidence that I used to do that, maybe a PubMed ID. There are different types of evidence, and there are actually a bunch of little parts that go with describing the evidence. These links are known as annotations or gene associations, and often there are multiple annotations per gene. Sometimes this can be problematic, because it's difficult to work with multiple functions per gene. Some Gene Ontology annotations are created automatically without human review, so that's very important to know. In general, there are a lot of annotations that are manually curated by scientists, which are high quality. They're time consuming, so there are not that many of them compared to the rest. There are a lot of annotations that come from reviewed computational analysis. Sometimes computational analysis can be very good at predicting the function of a gene, and people will review the results to make sure it's not broken. For instance, if a protein has a transmembrane region, it's very likely to be a membrane protein, and there's 99% accuracy at finding those transmembrane locations, so that's purely predicted but very high accuracy. There's also a section called electronic annotation that's not reviewed. It's annotation derived without any supervision, basically computational predictions that have varying accuracy, and in general it's considered lower quality than the manual codes. So it's important to understand this and be aware of the origin of the data. The practical aspect of this is that when you do your gene list analysis, we're going to start by excluding the electronic annotations.
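To make the two ideas above concrete, here is a minimal Python sketch that drops electronic annotations and then propagates each remaining annotation up to all of its parent terms. The tiny hierarchy reuses the spoken examples (nucleolus, nucleus, cell and so on) with term names instead of real GO accessions, and GENE_A and GENE_B are hypothetical. IEA and IDA are GO evidence codes (electronic annotation and direct assay, respectively); everything else is an assumption made for illustration.

```python
# Toy "is a" / "part of" hierarchy: child term -> list of parent terms.
PARENTS = {
    "nucleolus": ["nucleus"],
    "nucleus": ["cell"],
    "protein kinase activity": ["kinase activity"],
    "apoptosis": ["cell death"],
}

# (gene, term, evidence code) triples; the genes are hypothetical.
ANNOTATIONS = [
    ("GENE_A", "nucleolus", "IDA"),
    ("GENE_A", "apoptosis", "IEA"),
    ("GENE_B", "protein kinase activity", "IEA"),
]


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    found, stack = set(), list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def gene_terms(annotations, parents, exclude_evidence=frozenset()):
    """Map each gene to its annotated terms plus all inferred parent terms,
    skipping annotations whose evidence code is in exclude_evidence."""
    result = {}
    for gene, term, evidence in annotations:
        if evidence in exclude_evidence:
            continue
        result.setdefault(gene, set()).update({term} | ancestors(term, parents))
    return result


manual_only = gene_terms(ANNOTATIONS, PARENTS, exclude_evidence={"IEA"})
everything = gene_terms(ANNOTATIONS, PARENTS)

print(manual_only)  # GENE_B disappears: it only had an electronic annotation
print(everything)
```

Comparing the two results shows the trade-off: excluding electronic annotations gives a smaller, more trustworthy set, while including them gives more coverage at the cost of varying accuracy.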
We'll talk about that. The evidence code is IEA, inferred from electronic annotation. There are actually a lot of different evidence types, and IEA covers all the maybe lower quality ones. So often what we do is start with just the manual ones. If we don't get a lot of good information coming out, then we can extend our search to include the electronic annotations, which you're kind of forced to use if you don't have a good analysis coming out, or which you could use if you just want to spend more time exploring potentially unknown space. Depending on what organism you're studying, there's different coverage of Gene Ontology annotations. All major model organisms and human are well covered, as are a number of bacterial and parasite species. The current list is on the web, so you can look at it, but coverage is variable. For Saccharomyces cerevisiae, yeast, every gene has been looked at to see whether it could be assigned a GO term or not, and they keep that up to date, whereas some of these other organisms will only have electronic annotation. That's a case where you're forced to use electronic annotation, because there are no manually curated annotations for that organism. So that's interesting to know. There are a lot of contributing databases, which I won't go over. One useful concept in Gene Ontology is what's called a slim ontology. Gene Ontology has all these tens of thousands of terms, and sometimes that's too many terms for certain uses. Sometimes people put a pie chart as a figure in their paper to summarize, in this case, the locations of all the expressed genes or proteins, but if you had thousands of terms you'd have thousands of pie slices, and it just wouldn't work as a visualization. So the GO slim sets were created to map a lot of specific terms to a smaller set of general terms. You might see these in pathway tools that use slim versions of Gene Ontology. There are
some generic ones, and also ones available for certain species. Okay, so GO resources are freely available to anyone without restriction. In general, everything that we've chosen for this course is like that. That includes the ontologies, the associations, and the tools. There are lots of tools that use Gene Ontology; in fact, almost all pathway enrichment analysis tools use Gene Ontology, which is why we go over it. You can look up Gene Ontology terms on QuickGO, which is a site that I recommend for looking up GO terms. There are lots of other ontologies, but generally they're not found in pathway analysis tools; I mention them just so you know that Gene Ontology is not the only one. Okay, so we talked about Gene Ontology. It's a very good source of information about the pathways that your genes are involved in. There are also lots of pathway databases, and I'm not going to go into too much detail here, because we'll cover them in more detail later in the course. We maintain a website called pathguide.org that tries to collect links to all of the pathway databases that exist. We just updated it last year, and there were 550 databases that have some information about pathways or cellular mechanisms, whatever that might be. There's actually a broad range of types of information: protein interactions, signaling pathways, metabolic pathways, gene regulatory networks, and small molecule interactions, whether those small molecules are metabolites or drugs. So there's tons of information about cellular mechanism. Unfortunately, it's not easy to just say, give me all the data for my organism, for human, give me everything that you know about cellular mechanism, because these databases have not really all coordinated and they're not standardized. Increasingly they are becoming standardized, and because of that, systems like Pathway Commons, which we developed, can collect the major ones. So this is a good general source for pathways and for seeing what pathway databases are active, and it's always being
developed, so it will become more useful over time. We're going to talk more about specific pathway databases as well, in particular Reactome, later. I'll mention very briefly that there are lots of other annotations, like chromosome position. Sometimes you might be interested in getting a lot of information for your genes, whether it's used for pathway analysis or not. You might have a list of genes and want to know the chromosomal locations, so you just look them up in a database. A good source for this information is the genome browsers or large genome databases. The two major genome browsers are the UCSC Genome Browser, from the University of California, Santa Cruz, and Ensembl. I usually use Ensembl because it has a really nice search tool called BioMart, which we can talk about. Entrez Gene is also really good, and if you're working with a given organism, like a model organism, usually there's a go-to location for that organism. If you're interested in others, we can discuss them during the lab. I'm not going to go over BioMart too much, but one of the things that I found with using it is that it's really hard to understand how it works when you just start using it, but once you figure out the couple of steps that are required to get it working, it's really useful. This slide just goes through those steps. I'm not really going to go through this in too much detail, and you can try it out: you select the Ensembl genes for your genome of interest, then you select your filters, and then you select the attributes to download. If we get time in the lab, you can ask questions about it, but when you go to the website, just following these steps you'll probably figure out how it works. Okay, so we've learned about gene lists and pathways, and we've very briefly talked about sources of this information and issues that you might have to deal with. Again, some people might have known some of these issues, but now everybody is on the same page, and hopefully you learned something as well. So there are many
attributes, and checking out this BioMart system is really useful for getting attributes for genes, so you can try it out. Okay, so coming back to this workflow. We talked about where the data comes from, assuming normalization is working, and we talked about your gene list. The rest of the course is really focused on actual methods for these two things here: visualizing and identifying pathways and networks, and drilling down to understand molecular mechanism. Then you have to interpret the results to come up with some model explaining your data. This year we created a nice workflow that is a lot more detailed than the one we've had in previous courses, in response to questions that people had. In the past we had a workflow that really just focused on the pathway analysis part, and people were always wondering where their data comes into the system and which path to take. So this workflow tries to list all of the major types of data that you have, and if there's a data type that you're looking at that is not listed here, then you can let me know. DNA methylation, for instance, is here. These yellow-orange boxes are the parts that you have to do to normalize your data and convert it to gene lists. Like I mentioned, some experiments just give you gene lists, so you have your gene list right away. Others require a lot of normalization and scoring and linking your data to genes, which is where we kind of start. One side will be identifying interesting pathways, and there are different ways of doing that; eventually you'll find some interesting pathways. Another side is thinking about things in terms of networks. We'll talk about the difference between pathways and networks later, but pathways and networks all tell us some information about cellular mechanism; they're just different ways of representing it. Pathways, just briefly, are usually what we understand to be a system of steps, a series
of steps, like a metabolic pathway, whereas networks are just connections between genes that are related. So they don't really tell you as much information as a pathway, but they're still useful for learning about cellular mechanism. Once you've found interesting pathways and networks, then you drill down, focus in on more specific versions of them, and look up genes of interest, and eventually you're interpreting your results. I will go over all of these things in the course. The tools that we are talking about in the course are highlighted in yellow, to help you orient where they are in the big picture of the analysis. So hopefully this will be useful. If you have any feedback, that would be useful, because this is the first time we're using this. Okay, so there's also a lab here that we're not going to go over, but you can try it yourself if you're interested. We don't go over this lab because usually a third to half of the people already know this stuff, so anyone else can try it out. It just gives you some pointers about using g:Profiler and the tools that I mentioned. Okay, so I think the coffee break is supposed to start right now, but we'll take some questions. Yeah? So that's a great question: what are the success stories for how research is being enabled by this? The autism one is one that I mentioned. The other one, which I didn't have time to include, is a new story where we actually found a drug based on pathway analysis, and that drug seems to be working in cancer in one patient. It's an amazing story that we could do that. Does anybody want to hear that story? Okay, so I knew that I wouldn't be able to fit it into an hour, so I didn't include it, but I will just quickly tell you about it. I think this is a really good success story of pathway analysis. It relates to ependymoma, which is a type of brain tumor that affects the ependyma. The ependyma is the lining of the brain, formed by glial cells, and
this is work done in collaboration with Michael Taylor, who's a neurosurgeon and physician-scientist at SickKids. He studies ependymoma in children; it's the third most common type of brain tumor in children. I'll make these slides available on the wiki, because you don't have them. The most common location is the ependyma and cerebellum. Previously, using gene expression analysis, he noticed that there are two classes of ependymoma. Pathologists studying this can't really tell the difference between these two, but gene expression is very different, and outcome is very different as well. Type A affects the youngest patients and has a terrible prognosis; type B affects the oldest patients and has an excellent prognosis. So they're basically two different diseases as far as outcome. To look into this further, Michael and collaborators did a lot of exome sequencing and whole genome sequencing to see if they could find mutations that could explain these differences, and strikingly there were no recurrent mutations, especially in class A. There were some copy number aberrations in class B, but in the serious class A there were basically no recurrent mutations. It's basically mutationally silent, which is the first time anyone has seen that in a cancer. People say a hallmark of cancer is mutation and genome instability, and in this cancer there does not seem to be genome instability. However, when looking at CpG island DNA methylation arrays, there was a clear clustering into these two classes, and so they determined a long list of genes, about 2,000 genes, that were differentially methylated. We looked at that; this is work by Scott Zuyderduyn, who's a postdoc in my group, and Steve Mack is the person who led the work in Michael's group. We found that in general the serious class A was very transcriptionally silenced by CpG island methylation. Looking at those genes with normal pathway analysis
methods on the web didn't really find any pathway that was enriched; nothing came up. So we looked at it using a more appropriate statistical test that's useful for rarer signals, which was the case here, and we used a much bigger pathway database that we've collected for our lab's use. I think it's also referenced in the assignment, so you can see that pathway database. One specific pathway came up really, really strongly, and basically no other pathways: it was targets of Polycomb repressive complex 2, PRC2, which is a hot topic now. This was an extremely strong signal in group A. SUZ12 and EED are subunits of this complex, so really it was just this complex that seemed to be explaining why these genes are differentially methylated. In talking to people in the epigenomics group here, Michael realized that there are a lot of drugs targeting the methyltransferase of PRC2, and they did a lot of experiments to show that these are specifically killing ependymoma. What's important about that is that ependymoma doesn't have any therapy other than surgery and radiation, which is the worst type of therapy for cancer, especially in children, and this represents the first mechanism that could potentially be targeted in this disease. Not only that, there are people developing drugs to inhibit this complex; GlaxoSmithKline is developing one, and there are also drugs on the market that generally inhibit DNA methylation. So DNA methylation and also this complex seem to be playing a big role in silencing a bunch of genes, and that seems to be causing this really serious type of tumor. Because there's no known therapy, a clinical trial can be set up very quickly. But also, a patient at SickKids who was in very advanced stages of this disease, with a metastasis to the lung, was given an on-the-market DNA methylation inhibitor. Before this, in two months
this tumor had doubled in size, and then after three cycles of this DNA methylation inhibitor, and actually now more than five months, the tumor has not grown anymore, and the kid felt fine basically very soon after getting treated with this drug. So at least in one person there's a response. It'll take a long time to figure out what the result is, but this is a really great example of how pathway analysis helps make that link, starting from this big list of 2,000 differentially methylated genes, where we had no idea what they were doing just by looking through the list, then searching a big database of information about cellular mechanism and pinpointing this mechanism. Also, because there was only one pathway that came out, we were able to track back where this data came from. It came from one of the databases that we pulled in, MSigDB, and we know those people, so we talked to them. Arthur Liberzon is the person who typed in this information, from something that he read in 2005. Had he not done that, we never would have made this link. So that's a really excellent example of how having a lot of information, and trying to collect as much information as we can about pathways, can help make discoveries, and it motivates us to collect more. This is one of the best examples that I know of, definitely from our work. There are many examples where people use pathway analysis. There are probably 20,000 papers at least that use pathway analysis at some level, just based on citations that I've looked at recently, maybe more than that. Whether pathway analysis was really vital for making the discovery or not is hard to tell. The reason I talk about this case and the autism case is that they're really good examples where you needed the pathways; otherwise you would not have made the discoveries.