All right. Good morning, everybody. Ready for a second day? You'll see that today is quite different. First, a quick recap of yesterday; I don't know if you had dreams about variants, patients and disease. The day started with Mike, who talked about phenotypes and the importance of phenotyping and ontologies. Then Mathieu and I went over all the steps of variant calling and variant annotation. I hope we gave you some sense of the complexity of doing that: there are lots of steps, lots of ways you can improve variant calling, and different choices in how you do the annotation. After that, Mike came back and showed how, once you have that list of variants, you can go back and try to interpret it in the context of disease. And at the end of the day, Carl put that in perspective by showing how you actually use it in clinical tests. So all of that was really focused on genetic variants: how do you call variants, and how do you interpret them? Today is going to be completely different, because we're going to focus on other types of omics data sets, epigenetic and expression data, and how they can also be used to explore disease. My module, which starts the day, has the following learning objectives. We're going to try to understand why epigenetics and epigenomics are important in the context of genomic medicine. I'll do a quick overview of some of the profiling technologies, just so that everybody has the same background. And then we'll go into actually using some of these resources: knowing what's available and trying to use it. The lab that comes just after this intro is all web-based, so no command line like yesterday. So, why is it important to talk about epigenomics at this workshop?
I already touched on that a little bit yesterday, but of course genes only account for about 2% of the genome. There's a lot more going on around these genes in terms of how they get turned on, turned off, and so on. The genome is full of these regulatory elements, and understanding those components will really help us understand mechanistically how some variants might be associated with disease. I showed you this slide yesterday already. Some of you work in cancer, but this is also true for GWAS. If you look, the majority of mutations in the context of cancer are outside of coding genes. And the same is true for GWAS that have identified variants associated with a disease: the majority of the variants that are identified end up being outside of genes. So how can we do better at understanding which of these variants is important and what exactly it might be doing? This is really one of the applications that follows on from what we covered yesterday: how can we better annotate non-coding variants, just like we were annotating the coding variants yesterday? The other important area where epigenetics is useful in the context of genomic medicine is this type of study. This is a famous paper by Chuck Perou from the early 2000s, one of the first microarray studies, which showed that there are actually different subtypes of breast cancer that can be defined based on expression patterns. So you can really classify patients based on their expression pattern, and this has proven to be very useful for coming up with, in some cases, targeted therapies that work well in particular subgroups. This type of work is exactly what we're going to be doing in some of the other modules, by Andrea and also Anna.
So, it's really about using these epigenomic data sets to better understand cancer and its different subtypes, but also, and I think Andrea's example does this, simply comparing disease to normal, so you have these two groups. That's the other broad area of application of these technologies in genomic medicine, and again, most of the other modules are going to be covering that. In terms of the technologies that we're using, I'm not going into any of the details here, because there's actually a full workshop that's focused exactly on epigenomics, epigenomic technologies and downstream analysis, and if you're interested in that, I encourage you to look it up. That's another bioinformatics.ca workshop. But one of the key technologies is really ChIP sequencing. For this technology, what you're doing is enriching for DNA fragments that are associated with a particular protein of interest. In this case, all the DNA fragments that are associated with p53 binding regions get pulled down and sequenced, just like you would sequence normal DNA. But because you're enriching for regions of the DNA that are bound by the particular protein that you're pulling down, what you get are regions that have more reads than the rest. So here we're not calling variants, or at least that's not the main objective. The objective is really to look at the regions that are enriched for DNA fragments once we map them back onto the genome, and that gives us a sense of where p53, in this case, binds. This is useful because now, no matter where p53 is bound in the genome, we'll find its binding sites. And thinking again about annotating non-coding variants, if you see a non-coding variant in a p53 binding site, that might give you a sense that it's a more important variant than others. So, we can do this.
That's if the antibody is targeting p53; we can also target various histone marks, and that's another way of profiling regions that correspond to enhancers, or that are repressed, and so on. So ChIP sequencing lets you profile the chromatin and get a sense of where enhancers or other regulatory elements might be in the genome. The other obvious one is RNA sequencing, which I'm sure you're familiar with. Again, this is a very powerful method, and depending on how you prepare the RNA, you can profile different types of RNA transcripts, whether polyA transcripts or total RNA. You sequence these, and you get a sense of which transcripts are expressed in different contexts. The last type of data that I'll briefly talk about is methylation profiling. Here, using either microarrays or, again, sequencing, the key is that the DNA is processed with a bisulfite treatment, which affects the methylated Cs differently from the unmethylated Cs, and after the sequencing there's a way to extract, or deconvolve, which sites were actually methylated. So, without going into much detail here, methylation sequencing and methylation arrays are another way of looking genome-wide at which regions are methylated. I'm finishing up with my introduction, but why is all of this relevant? I'm showing one example from the NIH Roadmap, which did this kind of profiling systematically across lots and lots of tissues. This is a complicated graph, and it's tiny. On this list here, you have various diseases, and each of these is actually a GWAS study that identified regions in the genome that are associated with obesity, or LDL cholesterol, and so on.
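As a rough illustration of the bisulfite idea mentioned above, here is a minimal Python sketch. It relies on the standard behaviour of bisulfite treatment: unmethylated Cs are converted and read as T, while methylated Cs are protected and still read as C. The function name, the toy reference and the toy reads are all made up for illustration; the sketch assumes forward-strand reads that are already aligned with no indels, which real methylation callers do not assume.

```python
from collections import Counter

def methylation_levels(reference, reads):
    """Estimate per-cytosine methylation from bisulfite-converted reads.

    reference: forward-strand reference sequence (hypothetical toy input)
    reads: list of (start, sequence) pairs, assumed aligned, no indels
    Returns {position: fraction of informative reads showing C}.
    """
    methylated = Counter()
    covered = Counter()
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos >= len(reference) or reference[pos] != "C":
                continue  # only reference cytosines are informative
            if base == "C":
                # C survived the treatment, so it was methylated
                methylated[pos] += 1
                covered[pos] += 1
            elif base == "T":
                # C was converted and reads as T, so it was unmethylated
                covered[pos] += 1
    return {pos: methylated[pos] / covered[pos] for pos in sorted(covered)}

# Toy example: reference cytosines sit at positions 1 and 4
reference = "ACGTCGA"
reads = [(0, "ACGTTGA"), (0, "ACGTCGA"), (1, "CGTTG"), (0, "ATGTTGA")]
levels = methylation_levels(reference, reads)
for pos, level in levels.items():
    print(f"position {pos}: {level:.2f} methylated")
```

In this toy data, the first cytosine is mostly protected and the second is mostly converted, which is the kind of per-site signal that gets summarized genome-wide.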
As I mentioned, most of the regions that were associated with disease were non-coding. The nice thing is that the enhancer regions have now been profiled in a number of cell types. So on this axis, you have various cell types, and the profiling basically defines, in each particular cell type, where the enhancers are. What you see, and it's tiny so it's hard to see, is that the LDL cholesterol GWAS hits, for instance, were enriched almost only in this tissue, which is the liver, which makes sense. If you look at enhancers in the liver, that's where all of these GWAS hits are enriched. So there's an encouraging correspondence between specific GWAS hits and regions that are in open chromatin in the relevant tissue. Yeah? Let's say there's a variation at a C. At that site, what you see could be due to methylation, or it could be due to a single nucleotide variation. Yes, that's right. It could be that there's no genetic effect, just a pure epigenetic effect that's associated with the disease because of an environmental factor, absolutely. But when you analyze this data, which one is it? Yes, sure, but in this case, we really use the GWAS, the genetics, as the basis, so we know that there's a genetic factor here, right? We just don't know in which tissue, and so on. But then if you look, there's usually a very good correspondence between the disease and the regions that are open, the accessible or enhancer regions, in the right tissue as opposed to any other tissue. So that's, I guess, a positive result. Just to drive that point home a little more, this is, I think, a very nice example of how you can use epigenomics to better understand why a particular region, a particular variant, might be associated with a disease.
Because, again, what the GWAS gives you is that there's a variant in that region that's associated with the disease, but you have nothing in between, right? You don't know at all mechanistically why that variant might be associated with the disease. So, FTO is a gene. There were a number of GWAS done on weight and obesity, and they identified a haplotype within that gene that was really associated with the disease. If you have that variant, you tend to be, I think, about a kilogram heavier than if you don't. That particular variant was found in this gene called FTO, which is the fat mass and obesity-associated gene, I think, right? So there's clearly a genetic variant in that region that associates with the disease. A lot of work was done on that gene itself to try to understand how it might lead to that increased risk of obesity. But by looking at epigenomic data in that paper, what they observed is this. If you look at the bottom here, you've got two genotypes in that region: the normal genotype and the risk genotype. And you see that there's no difference in expression of FTO between the two genotypes. What you do see, though, is a big difference in genes that are nearby, IRX3 and IRX5. So it turns out that the genetic variant might be affecting the gene in which it's embedded, but it's affecting even more so some genes that are further down. There was a lot of work being devoted to understanding how FTO might be modulated in weight gain, but now there's more work that's also looking at these other genes that clearly look to be affected by the variant.
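The comparison described above, expression of a gene split by genotype group, can be sketched in a few lines of Python. This is only an illustration: the expression values below are invented so that the pattern matches what was described (no change for FTO, a clear change for a nearby gene), and the function names are hypothetical. They are not data from the paper, and a real analysis would use a proper statistical test on many samples.

```python
from statistics import mean

def mean_by_genotype(samples):
    """Group (genotype, expression) pairs and return the mean per genotype."""
    groups = {}
    for genotype, expression in samples:
        groups.setdefault(genotype, []).append(expression)
    return {genotype: mean(values) for genotype, values in groups.items()}

def risk_fold_change(samples):
    """Ratio of mean expression in the risk group over the normal group."""
    means = mean_by_genotype(samples)
    return means["risk"] / means["normal"]

# Invented expression values, for illustration only
expression = {
    "FTO": [("normal", 5.0), ("normal", 5.5), ("normal", 4.5),
            ("risk", 5.0), ("risk", 5.25), ("risk", 4.75)],
    "IRX3": [("normal", 1.0), ("normal", 1.5), ("normal", 0.5),
             ("risk", 2.0), ("risk", 2.5), ("risk", 1.5)],
}

fold_changes = {gene: risk_fold_change(samples)
                for gene, samples in expression.items()}
for gene, fold_change in fold_changes.items():
    print(f"{gene}: risk/normal = {fold_change:.2f}")
```

A ratio near 1 means the genotype makes no visible difference for that gene, while a ratio well away from 1 flags a gene whose expression tracks the variant.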
So, if you're interested in this, I definitely recommend this paper for more details on how epigenetics helps you better understand mechanistically the impact of these non-coding genetic variants. Okay, so that was my intro and background as to why this is relevant. Now, what resources are available to look at epigenetic data? I already talked about the Roadmap. One challenge in using epigenetics is that you only have one genome, but every cell type has its own set of enhancers and its own set of genes that are expressed. So just like we needed a human reference genome for the variant analysis and variant calling that we talked about yesterday, we need human reference epigenomes to know the normal state of all of these tissues, and then be able to interpret differences from that basal state. This work really started with the ENCODE consortium, which took on the task of trying to profile systematically the default state of various cell types. This continued in the NIH Roadmap that I talked about. And now there's an international consortium called IHEC that continues this work. IHEC, which involves a number of countries and includes these two U.S. consortiums, continues the effort of trying to map the normal epigenetic state of many different tissues. One challenge with these experiments is that, in contrast to regular DNA, you need to have access to the tissues themselves. Profiling the normal human brain is not easy, and so on. So part of the challenge for this consortium is really having access to quality, normal tissues to be able to do this profiling. The objective of the consortium is to gather a thousand of these reference epigenomes.
And by reference epigenome, what is cataloged is the following. I mentioned that you can do ChIP-seq on transcription factors, but you can also do ChIP-seq on various histone marks, which is a very good way of getting a sense of where the regulatory elements of the genome are in a given cell type. So you can profile these too. As part of the reference epigenome effort, there's systematic profiling of these two histone marks, which allow you to determine the regions that are transcribed. These two histone marks are associated with enhancer regions, while these two are associated with repressed regions in the genome. On top of that, there's RNA-seq data and whole-genome bisulfite sequencing data. Together, that really gives a good sense of the epigenetic state of a given cell type. Quickly, we're part of IHEC on the data integration and sharing side. One thing that I didn't talk about is that epigenetic data, just like whole-genome sequencing data, is identifiable: you can extract variants from these data sets and potentially re-identify the people who contributed the samples. So the raw data has to be submitted to secure archives that are put in place to protect this type of privacy, and you need to request access to the raw data. But the raw data gets processed into processed data, which is not identifiable and can then be shared more openly. The data that we'll work with today is part of that processed data that's not identifiable. The portal that was set up for this aggregates data from all of these different consortiums. It's got over 10,000 of these epigenomic data sets, and it's got data sets on a number of other species as well. So that's the main portal that we'll explore.
And again, this is meant to be a resource such that, if you have a gene of interest and a tissue of interest, you're able to go in and look at what's happening at the chromatin level and at the expression level around that region. So that's really what we're going to be doing. ENCODE has now just started its third phase, I believe, and they now also have an updated portal that's got some additional features. So we'll explore that a little bit as well. And finally, another project that collects quite a number of epigenetic data sets, which we're also going to look at a little bit, is the GTEx project. This one is interesting because, while these efforts here were focused on trying to get as deep an understanding of as many cell types as possible, here the goal is really to link genetic variation with, in particular, expression. So a bit like in the example that I showed you on the FTO gene, it's about finding associations between non-coding genetic variants and changes in expression. Here the individuals are actually cadavers, because they profile many, many tissues in the same individual, for which they also have the genotype. And they do this systematically in hundreds of individuals, so they're then able to associate genetic variants with expression: if you have this genetic variant, then in this tissue you tend to express this gene highly. Again, we'll explore and play with that a bit. But the key here is really linking genetics with epigenetics to find these expression QTLs, that is, variants associated with changes in expression. So this is, again, a portal that we'll play with a little bit in the practical. There's also one that we won't be exploring too much, but that contains a very large number of array- and sequence-based epigenetic data sets.
It's GEO; we won't be exploring it, but obviously there's a lot of interesting and useful data there as well. Okay, before we move on to the actual practical, just a few more slides on the features of the different portals that we're going to be playing with. For the IHEC data portal that I mentioned, this slide is not the latest version, so it shows less data than what we'll be looking at. But you see that it aggregates data coming from different consortiums, and that you can navigate these data sets by the tissue they come from, if you have a specific tissue that you're interested in, where you think that maybe a variant is important. So you can navigate by tissue, and then you have a whole collection of different assays, and you can also navigate based on the assay. We'll be able to explore that. Here, for the different tissues, based on what you've selected, you'll have the various assays that have been done. The number in each box of this grid is actually the number of replicates, that is, the number of data sets of that type that are available in that tissue, because, a bit like GTEx, the profiling is typically not done in a single individual but in a number of different samples. From there, you can visualize the data in the genome browser. This is an example with the tracks, and again, that's what we're going to be looking at. These are ChIP-seq marks, if I can read properly. What each track represents is the number of reads in a given region. What you see is that it's mostly flat, but then you have an enrichment of reads that, in this case, would correspond to a particular histone mark at the five prime end of that gene, probably indicating that the gene is expressed in that cell line.
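To make the idea of a track as "the number of reads in a given region" concrete, here is a minimal Python sketch that bins read start positions into a count track and flags bins that stand out from a mostly flat background. Everything here is an assumption for illustration: the function names, the bin size, the made-up read positions, and the simple rule of comparing each bin to a multiple of the median. Real peak callers use proper statistical models and control samples.

```python
from statistics import median

def read_count_track(read_starts, region_length, bin_size):
    """Count reads per fixed-size bin across a region (a toy signal track)."""
    n_bins = -(-region_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // bin_size] += 1
    return counts

def enriched_bins(counts, fold=3.0):
    """Return indices of bins with at least `fold` times the median count.

    The median stands in for the flat background; it is floored at 1 so
    that an empty background does not make every bin look enriched.
    """
    background = max(median(counts), 1)
    return [i for i, count in enumerate(counts) if count >= fold * background]

# Made-up read start positions in a 1,000 bp region: roughly one read per
# 100 bp bin, plus a pile-up between positions 200 and 299
read_starts = [10, 150, 205, 210, 220, 230, 240, 250, 260, 270,
               310, 450, 520, 640, 760, 880, 930]
track = read_count_track(read_starts, region_length=1000, bin_size=100)
peaks = enriched_bins(track)
print(track)
print(peaks)
```

The track is flat except for one bin, which is the same picture as the browser view: a mostly flat signal with a local enrichment of reads.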
In the practical, we're going to be using the UCSC Genome Browser, but there are other very nice browsers for epigenomic data. The WashU Epigenome Browser is one of them; it builds on UCSC but has a lot of components for looking at epigenomic data. Again, this is not something that we'll look at too much. Back to the IHEC data portal for a second. I mentioned that you need to request access to the raw data because of identifiability concerns. The processed data, though, is available, so you can directly download it and load it yourself, or do additional analysis if you'd like. The other portal that we're going to be looking at is the ENCODE data grid. You'll see that it has a similar way of navigating through all of these data sets. For most of the ENCODE data, even the raw data is directly accessible: the patients have consented to make that data available, or if it's a cell line, I guess that data is also available directly, both the processed and the raw data. GTEx, again, is another portal that we're going to play with a little bit in the practical. Before moving to the practical, the one thing to note is that all of these resources are great, but some of the quality control that we were doing yesterday is also good to do in some cases. Just like you shouldn't trust everything you read on the web, you shouldn't trust every data set that you find on the web. So if there are ways to check the quality of the data, that's typically a good thing. In some cases, depending on how you're using it, that means retrieving the raw data itself and potentially validating it; we definitely won't be doing that here. But there are other ways too, for instance looking at correlation: if you have replicates, how good is the correlation between replicates, and so on.
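The replicate check mentioned above can be illustrated with a plain Pearson correlation between two binned signal profiles. This is a minimal sketch with made-up numbers and a hypothetical function name; it is not how any particular portal computes its correlations, and in practice the profiles would cover the whole genome rather than six bins.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length signal profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / sqrt(var_x * var_y)

# Made-up binned read counts over the same six bins
replicate_1 = [1, 1, 8, 1, 2, 1]
replicate_2 = [2, 2, 16, 2, 4, 2]   # same shape, sequenced twice as deep
other_mark = [6, 5, 1, 6, 4, 6]     # a profile with the opposite shape

r_replicates = pearson(replicate_1, replicate_2)
r_other = pearson(replicate_1, other_mark)
print(f"replicate vs replicate: {r_replicates:.2f}")
print(f"replicate vs other mark: {r_other:.2f}")
```

Because Pearson correlation ignores overall scale, two good replicates correlate highly even at different sequencing depths, while a data set with a different profile does not.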
We'll do a bit of that in the practical to assess that the quality is good. For instance, in the IHEC data portal, there are tools to do this type of correlation between data sets automatically, to make sure that you basically get the types of profiles that you would expect. So we'll look at that. It's all very small and tiny here, but if you do this type of correlation with different marks, for instance, you obviously get the groupings that you would expect, and it's a way to check that, at some level, the data is good. Again, this will be part of the practical. My last slide is just to talk a bit about another portal that's part of IHEC, which is called DeepBlue. The advantage of that portal is that it allows you to interactively do various analyses using these data sets. So that's maybe something else to explore later on if you're interested. But that's it for my introduction. I'm happy to take a few questions, or we can jump straight into the lab. Yeah? So, how can the data be used to identify which patient it corresponds to? Yes, it's because of the variants, right? All of this is sequence data. What we produce is the profile, and from that profile, you can't really tell who the individual is. But if you look at the individual reads, you can actually extract the variants, and that's a footprint of that individual. That's why the raw data is sort of masked, and the processed data is available. Yeah, I'm just curious. At this moment, we don't have a whole normal reference. So how do I do similar research, for epilepsy? What kind of reference do I use? Well, the question is what tissue you want to profile for comparison. In the context of epilepsy, you probably want to have access to brain tissues, right?
So, there is some profiling that's been done in brain tissues, for instance, both embryonic and fetal, from aborted fetuses. The problem is also that if you take tissue after the person is deceased, there are going to be some changes associated with that. So it's very hard, almost impossible, to get exactly what you want. But typically, the epigenetic landscape changes, but not dramatically, right? So if the profiling of that tissue is done very shortly after death, for instance, and that's why these resources are being set up, then there's definitely some profiling that's done in brain tissue. That's right. It's a lot of work to collect all of that, and that's why these consortiums are set up. But part of the resources we're going to explore are meant to be these types of references for what the normal state is. Then, if you do your own epigenetic analysis in disease patients, you can really compare. And that's really what you're going to be doing in Andrea's lab as well. So this session is really just to get a sense of where you can find these data sets, while afterwards you're going to be using them and looking at differences between a group of disease patients and normal. Thank you. Yeah? Yes. So, I think ENCODE includes quite a bit of Hi-C and ChIA-PET data. So, yeah, that's a good point. One of the ways they found, in that FTO example, that the region was involved in regulating these distant genes is that they had this 3D capture information. So as part of some of these databases, they also have these types of additional data that give you some information about which part of the genome is close in 3D to which other part, which helps you interpret the variants as well.