All right, so good morning, or ohayou gozaimasu, friends from Japan. So yes, hopefully you had a nice rest. So again, my name is Guillaume Bourque, I work at the Genome Centre, I'm responsible for bioinformatics there. So I thought actually I would start with a quick summary slide of the workshop as a whole. This is actually not in your slides, so you have to look up; it's not in your folder. I added this slide just to give sort of an overview of what you did yesterday and also of what we're going to do today. So there's really two components to disease. There's a genetic component, which is really what we saw yesterday, first with Matthew, who presented some basic processing, and with Sergei, who did more on the variant calling and the annotation. So this is really focusing on the genetic aspects, and we covered quite a lot of ground there. Again, if you're a bioinformatician, hopefully that's helpful. If you're more of a clinician, it gives you a sense of the kind of data processing that goes in to really extract these genetic variants and generate these VCF files that you can then look at to try to understand disease. So today we're going to focus more on the epigenetic component of disease. In my module, module five, I'll do an overview of some resources, and then Andre will do something similar to what we did yesterday, which is to look at some examples of real data analysis and data processing for epigenetic data sets. Yesterday you also had, I think, a very nice overview of how we really use these data sets. So first, module one with Mike, who talked about phenotyping and the importance of really using ontologies and controlled vocabularies to describe disease, which is what then enables a lot of downstream applications and enables computers to really explore these phenotypes. 
You also had at the end of the day, if you were still alive, a module on clinical implementation, looking more at how we take some of these processes that we do in the research context and really develop them in the lab. And today you're going to have another module, also at the end of the day, with Anna, that's looking at, again, how you use these genetic and epigenetic signatures to build classifiers that say something about a particular individual. So again, we're covering a lot of things, and hopefully you see the connections, but I thought I'd lay them out like this just to help you. In my module this morning, we'll focus on just the available epigenetic resources. So what are the learning objectives of this module this morning? Understand why epigenomics is important for genomic medicine. Be familiar with epigenomic profiling technologies; I won't go into detail, but just so that you're familiar with them. Know relevant data and web resources. And then be able to find them. So the practical is going to be more of a web-based practical, unlike yesterday, really just exploring available resources. Maybe before I start: yesterday was genetics. In terms of epigenetics and RNA-seq, maybe raise your hand if you have experience with epigenetic data sets, or not a lot. Okay. So, yeah. My section in particular will really be an introduction to some of these concepts. So first of all, why are we including epigenomics and epigenetics as part of this workshop on genomic medicine? I'm sure everybody's familiar with this, but of course genes and coding sequences only account for roughly 2% of the genome. The rest, the majority of the human genome, is really about regulatory elements, things that actually control expression in different cells. And just to give you a more concrete example of what that means, this is an example of a study that we did. 
So this is cancer, unlike yesterday where we were looking at rare disease, rare genetic disease, but in cancer we now have the ability, of course, to sequence tumors. So this is a study where we sequenced 100 kidney tumors, hoping to look for patterns of mutations and recurrent mutations and to identify pathways that are relevant. In this representation, every square represents 1,000 mutations that we discovered across these 100 tumors. This is related a little bit to what Mike was saying yesterday, but there are lots and lots of mutations, in cancer in particular, and so you see that we identified over half a million mutations in these 100 tumors. But if you look at the fraction of these mutations that are hitting genes, that are in coding sequences, those are the ones in orange. So you see that, as you would expect, only a small fraction of all the mutations that we detect in these tumors are coding, and those can be easily annotated in terms of which genes they hit, what the impact is, and so on. But what about all of the other mutations? Similar to yesterday, it's expected that the majority of these are passenger mutations, or not important mutations. That's true for both the non-coding mutations and the coding mutations. But it's also true that, probably buried somewhere in there, both in the coding and the non-coding, some of these mutations are important. So being able to annotate, using epigenetics and profiling of chromatin, not just expressed sequences but also regulatory sequences is going to be important to annotate these mutations. Another important reason why epigenomics and epigenetics matter, and this is more similar to what Andrew is going to talk about after me, is that epigenetics actually captures a lot of other things beyond genetics. 
If you're profiling gene expression, for instance, independent of the genetic mutations, you can identify, and this is a famous paper by Perou et al., subtypes of tumors. Again, this is in cancer, but this is also true in various complex diseases. You can identify subtypes at the epigenetic level that, again, might be informative in terms of treatment and classification and so on. So the epigenetic signature can capture the combination of the environment and a combination of genetic variants. There are some genetic variants that lead to very extreme deregulation at the epigenetic level, and again, you can capture that. So it's useful for annotating non-coding variants, and it's useful also, in some cases, to actually classify patients into different groups. And that's a bit, again, what you'll see later today. So moving on a little bit to the technologies, and again, I'm not going into that in detail. There's one thing I didn't mention, but I think it was mentioned yesterday: there are other CBW workshops that specifically target epigenetic analysis as well. We're doing one next week, and there's one on RNA-seq as well. So if you're interested in more details on both the technologies and the analysis, that can be an option. But just to give you a rough idea of the technologies nevertheless, if you're not familiar with them. Yesterday, we were just shearing the DNA and sequencing the DNA directly. Here, we're enriching for specific DNA fragments. So in what this plot shows, we're interested in p53, which is our protein of interest for this particular application. We have an antibody that pulls down DNA fragments that are attached to p53, and we basically enrich for those DNA fragments. 
So there are different steps. First, you cross-link the protein to the DNA, then you remove the protein, and then you're left with DNA fragments that were enriched for regions bound by a particular protein of interest. Same steps after that: you sequence the DNA, but what you're left with are these clusters of reads that correspond to the regions of the genome that were bound by that protein of interest. So this is one way of starting to map what's happening in the non-coding regions of the genome. Another example that maybe is even more obvious is RNA-seq. This is really just a conversion of the RNA into cDNA libraries, which are sequenced and mapped. There are some differences in how the mapping and analysis need to be done because of splicing and all sorts of things. Again, we're not going into the details here, but there are also lots of different ways of preparing the RNA that you're sequencing, to look at poly-A transcripts, or at all RNAs, or at small RNAs, and so on. So these are different ways of profiling regions of the genome that are expressed. The last technology, again as a quick overview, that's quite common for profiling the epigenome is bisulfite sequencing, to look at methylation. Here, through bisulfite treatment, you have a conversion of the unmethylated Cs, while the methylated Cs are protected, such that you're able to identify after sequencing, through informatic analysis, which cytosines were methylated. Again, I'm not going into that in detail, but those are the main technologies that are used. What I wanted to do, instead of going too much into the technologies, is really to show you, in the context of genomic medicine, how these chromatin maps, or how looking at the epigenome, are important. And I'm pulling out some examples from the NIH Roadmap Consortium, which was an effort mainly in the US to do systematic profiling of different cell types. So I'll take a bit of time. 
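To make the bisulfite step described above a bit more concrete, here is a minimal sketch of how methylation can be read out after sequencing. It assumes reads are already aligned to the forward strand without gaps, and that an unmethylated C shows up as a T in the read while a methylated C stays a C. The function name, the toy reference, and the toy reads are hypothetical and only for illustration; real pipelines also handle both strands, base quality, and incomplete conversion.

```python
def methylation_levels(reference, reads):
    """Estimate the methylation level at each reference cytosine.

    reference: forward-strand sequence (string).
    reads: list of (start, sequence) pairs, aligned without gaps (assumption).
    Returns {position: fraction of reads supporting methylation}.
    """
    counts = {}  # position -> [methylated (read C), unmethylated (read T)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            # Only reference cytosines are informative in this simplified model.
            if pos >= len(reference) or reference[pos] != "C":
                continue
            tally = counts.setdefault(pos, [0, 0])
            if base == "C":      # protected from conversion: methylated
                tally[0] += 1
            elif base == "T":    # converted by bisulfite: unmethylated
                tally[1] += 1
    return {
        pos: meth / (meth + unmeth)
        for pos, (meth, unmeth) in sorted(counts.items())
        if meth + unmeth > 0
    }


# Toy data (hypothetical): cytosines sit at positions 1, 4 and 7.
reference = "ACGTCGAC"
reads = [
    (0, "ACGTTGAC"),
    (0, "ACGTCGAT"),
    (1, "TGTTGA"),
]
levels = methylation_levels(reference, reads)
for pos, level in levels.items():
    print(pos, round(level, 2))
```

Each value is the fraction of covering reads in which the cytosine survived conversion, which is the per-site methylation estimate that downstream analyses typically start from.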
I mean, this is a complicated slide, and I'm struggling to read it myself. But here on top, you have the different cell types that were profiled. What they did was different types of ChIP-seq experiments, different types of transcriptome experiments, and so on. And they used that to define which regions, again non-coding regions, correspond to enhancers in these different cell types. So on this axis, on top, you have the different cell types. And on this axis, what you have are different traits that have been mapped through GWAS. Yesterday, Mike mentioned GWAS a little bit. GWAS identifies regions that associate with a disease. Many of the GWAS hits are not in genes and are thus non-coding. So what the GWAS hits for these different traits highlight are regions, typically non-coding regions, that are associated with a particular trait or a particular disease. So if you look, for instance, one that has all of these red dots is inflammatory bowel disease. And what the heat map shows is where the hits from the GWAS are enriched, and in which enhancers they are enriched. Well, it's a bit hard to see because it's very small. Maybe a better example is this one here. So here, for instance, these are liver cells, if I'm reading this correctly. And you see that GWAS hits for cholesterol, LDL cholesterol, lipids, all of these GWAS hits are actually enriched in liver enhancers. So this is a way of refining and giving more information about these GWAS hits that are non-coding. There's a good association, and it can help pinpoint, in some cases, the cell type that's relevant to that disease, but also more specifically the DNA region that's associated. The next example is hopefully even more concrete. This is also an outcome of doing this deep epigenomic profiling. So this is a neat example, if I remember it correctly. 
So as you see, this is a GWAS hit for obesity. The GWAS hit for obesity that was identified was in this region here. And you see that all the SNPs that are in very close LD with that SNP are shown in color. So there's a genetic association with obesity in this region. Following this finding, and I forget now what FTO stands for, a number of pharmaceutical programs were actually designed and built around trying to see what this gene had to do with obesity. Because again, there was this very strong association in one of the introns of this gene. What the epigenetic maps did, though, was to show, and this is now looking at Hi-C data, which looks at chromatin conformation. If you're not familiar with that, it's not the end of the world. But if you see the shading here, this is showing that this particular variant is actually associated with a very big chromatin compartment. So all of this region, basically, is in close proximity to that particular variant. Because of course, here we're looking at the genome in two dimensions. So this is saying that this whole region is potentially in proximity to this particular variant. And the key result here: a GWAS hit means that people with a particular genetic variant tend to be more obese or less obese. I don't know which one it is in this case. But there's an association between having a particular genetic variant and the disease. And now, if they're looking in the right cell type, at the mRNA level, the expression level, stratified by genotype, so whether you have the risk genotype or not, you see that the level of expression of genes in this region, FTO in particular, doesn't change. So the level of expression of FTO doesn't change, no matter if you have the risk allele or not. 
But what you do have is some other genes, namely IRX3 and IRX5, where, depending on which allele you have, there's clearly a very big difference in expression. And so this was a big result in a way, because it basically said that all the work that you guys are doing on the FTO gene to try to understand its potential role in obesity is not relevant. What's really happening in this region, and the reason there's an association, has to do with the control of those other genes and those other pathways. So it was a big deal in terms of a result, showing that when annotating variants, it's really helpful to have this type of epigenetic data in the right cell type to annotate correctly. Because otherwise you're left saying, well, this variant is probably affecting that gene, but you don't really know; it's sort of a guess. So the epigenetic information helps you to identify that. All right, so hopefully I've convinced you with that example and that intro that looking at epigenetics is relevant in the context of disease. One way is to generate relevant data yourself, but the goal of this module is really to also talk about the existing resources that you might be able to integrate and use already. One of the challenges with the epigenome is that there are many epigenomes, right? It's enough to sequence the genome of an individual once, but each of their cell types will express different genes, will have a different chromatin configuration, and so on. So just like having a reference human genome is helpful, having a reference of what's expressed in what cell is also going to be helpful, because then you can say that, in disease, you have a deregulation of that gene, or something like that. But how do we do that type of profiling? This work really started with ENCODE, a consortium that you might be familiar with. This was really at the level of developing these ChIP-seq and RNA-seq technologies. 
They were developing the technologies and then applying them mostly in cell lines. Since then, there have been other consortia, such as the NIH Roadmap that I was talking about, where they started to do profiling in stem cells and various ex vivo tissues. And this work continues now in what's called the International Human Epigenome Consortium. The challenge, especially if you want to profile human cells, is to get human cells. I don't know if there are volunteers for giving out brain cells, but it's not easy to have cells from all the different tissues. So typically, this is linked to different operations or something like that, where they have access to tissues. And then at the same time, they try to do the profiling of these various cell types. This is a large effort, an international effort that involves people in the US. We're one of the mapping centers in Canada. There's another one at the Genome Sciences Centre in BC. But you see that there are partners in Europe and in Asia. And again, we try to basically coordinate that. We generate epigenetic data in similar ways in as many tissues as possible and then pool the data together. And so that's what I'll talk about a little bit. We can't do everything. So typically, what we call a reference epigenome in a given cell type includes two histone marks that are associated with transcribed genes, two histone marks, which you see here, that are associated with enhancer regions, and two marks that are associated with repressed regions. I didn't go into detail, but these are all ChIP-seq data sets that pull down histones that have these particular marks. Then there's RNA-seq data and whole-genome bisulfite sequencing, as I mentioned. So the idea is to have an overview of what's happening in the chromatin and in the transcriptome of a given cell type. As part of IHEC, we are doing some data generation ourselves. 
And we're also responsible for collecting data that's been generated by different groups. So this is one of the things that we do. One challenge with epigenomic data, and also with genomic data, that we didn't talk about much yesterday is that these data sets are identifiable. So if they're coming from patients, we have to be careful about the raw data. The raw data ends up being put in controlled-access repositories, whether it's dbGaP or EGA. So to access raw data, you have to go through a data access committee. But then we also make available processed files that are not identifiable, which you can then use and visualize. So we've developed this portal that contains many, many data sets, and we'll go into that in more detail in the practical. So this is the portal for the international consortium. There's also a portal for ENCODE, because ENCODE, which again was the initial project that started a lot of this work, continues to generate data. This is another very useful resource that I recommend. Finally, one that's, I think, particularly relevant for this group is called GTEx. With GTEx, they took a different approach. I mentioned that it's difficult to get access to tissues to do profiling. So the GTEx project actually uses cadavers. This way, in the same individual, they can profile lots and lots of tissues in a systematic way. And so that's a convenient way to profile many tissues. The other nice component of GTEx is that it's linked to genotype. So they sequence these individuals, and then they look for associations. Well, again, we'll see that more in the practical. 
So this is an example with three donors, but if they have enough donors and they're looking in the same tissue, and this is a gene in brain, I believe, they can see that some genes will have the same level of expression across all individuals, but others, a bit like in GWAS, will have different levels that can be associated with genetic variants. So GTEx combines genetic information with expression, hence the name, the Genotype-Tissue Expression project. And that actually allows you to identify genetic variants that, in a given tissue, are associated with changes in expression. So that, again, I think can be a very useful resource and annotation. So that's what we're going to look at. So this is GTEx. Again, I think we'll see this more in the practical. And the last one that I have here, this is just one, and there are many others, but this is one of the repositories that has a lot of the underlying raw data. So again, if you're interested in expression in your particular system of interest, it might be good to look to see if there are suitable data sets already that you can use or that you can compare your data with. That's one of the resources. OK, so I'm almost done with the overview, but just a little idea of what we're going to be doing in the practical. We're going to be looking at the data portal and how to find data sets there. As I mentioned, there are different consortia that contribute data sets, whether it's the Roadmap or even ENCODE; some of the ENCODE data is also included here. What's more relevant is that these come from different tissues and are also associated with different types of assays. So we'll navigate a little bit. This is a different view of the data, and this is really what we're going to be looking at. From here, you can directly visualize the data. So this is typical epigenomic data: a histone mark, a typical ChIP-seq data set looking at H3K27 acetylation. 
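As a rough illustration of the genotype-to-expression association described above for GTEx, here is a minimal sketch that stratifies expression by genotype and fits a simple slope of expression against the number of alternate alleles. All donor names and values are made up, and the function names are hypothetical. An actual eQTL analysis would also normalize expression, adjust for covariates, and correct for multiple testing; this only shows the basic idea.

```python
from collections import defaultdict
from statistics import mean


def expression_by_genotype(genotypes, expression):
    """Mean expression per genotype group (0, 1 or 2 alternate alleles)."""
    groups = defaultdict(list)
    for donor, dosage in genotypes.items():
        if donor in expression:
            groups[dosage].append(expression[donor])
    return {dosage: mean(values) for dosage, values in sorted(groups.items())}


def dosage_slope(genotypes, expression):
    """Least-squares slope of expression on allele dosage."""
    donors = [d for d in genotypes if d in expression]
    xs = [genotypes[d] for d in donors]
    ys = [expression[d] for d in donors]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all donors share the same genotype")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical donors: number of alternate alleles at one variant.
genotypes = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 2, "d6": 2}
# Hypothetical expression values in one tissue for two genes.
gene_flat = {"d1": 5.0, "d2": 5.2, "d3": 5.1, "d4": 4.9, "d5": 5.0, "d6": 5.1}
gene_assoc = {"d1": 2.0, "d2": 2.2, "d3": 4.0, "d4": 4.2, "d5": 6.0, "d6": 6.2}

slope_flat = dosage_slope(genotypes, gene_flat)
slope_assoc = dosage_slope(genotypes, gene_assoc)
print(expression_by_genotype(genotypes, gene_assoc))
print(round(slope_flat, 3), round(slope_assoc, 3))
```

In this toy data, the first gene has essentially the same expression in every genotype group, while the second gene goes up with each additional copy of the allele, which is the pattern an eQTL resource reports for a variant and gene pair in a given tissue.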
So this is the profile, if you remember the slide I showed you on these clusters. Again, this is an example of just visualizing some of the data, here in the UCSC Genome Browser. There are other data browsers specifically for epigenomic data, WashU being one of them. That one actually has some nice features, specifically for the epigenome, that are missing from the UCSC browser. In the IHEC data portal, you can not just visualize; you can also download the data. We're going to do a bit of that. This is an image of the ENCODE data portal. Similar things. I mean, there's just so much data that a big challenge is to build interfaces that actually allow you to navigate and find what you're looking for easily. The ENCODE portal is quite nice for that. So again, that's one of the things that we're going to do. GTEx is a slightly different view. I think this gives you an idea of how much data there is. So they provide 50 tissues. They have 500 donors. This was a short while back, and I think there's been a new release since then, so we'll see that on the portal. I think they have more data now. But still, they have 500 individuals, and they've profiled 50 tissues whenever possible. Sometimes there are some failures. But you see the number of samples in the different tissues. Some tissues end up being easier to process than others. My last note of caution before we go into the practical is to say that even though these international consortia do the best they can to quality control and to only release good data sets, there are still variable levels of quality in the data sets that you might find out there. So I think it's important to keep that in mind. Sometimes there are some very old data sets and some new ones. Of course, it makes a difference if it's a data set that's five years old, from when the sequencing technologies weren't as good and the reads were shorter. So you do have to take that into account if you're comparing older data sets to your data set. 
So in some cases, it might still be a good idea, a bit like what was done yesterday, to go back not just to the processed data but to the raw data, if you can, and really redo the analysis and recheck that everything is good. And this is one of the things that we'll do in the practical. One thing that we've tried to do in the data portal is to add a few QC metrics. There are similar metrics in the ENCODE portal as well. So there are sometimes some QC metrics that are associated with the data set. One that we've implemented in the portal is a simple correlation analysis, because when you have replicates in the same cell type, you might want to see which data sets are most similar to which data sets, and that might be one way of identifying outlier data sets that are not so good, if there are replicates. So again, this is one of the things that we'll try to do. So with that, I think I'll stop here and take any questions if you have some.
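The correlation-based QC idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the data portal implements it: each data set is assumed to be summarized as signal values over the same genomic bins, the replicate names and numbers are made up, and the 0.8 threshold is an arbitrary choice. The median is used rather than the mean so that one bad replicate does not drag down the score of the good ones.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length signal vectors.

    Assumes neither vector is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def median_correlation_to_others(tracks):
    """For each track, the median correlation with every other track."""
    scores = {}
    for name, signal in tracks.items():
        others = [
            pearson(signal, other_signal)
            for other_name, other_signal in tracks.items()
            if other_name != name
        ]
        scores[name] = median(others)
    return scores


def flag_outliers(tracks, threshold=0.8):
    """Names of tracks whose median correlation falls below the threshold."""
    scores = median_correlation_to_others(tracks)
    return [name for name, score in scores.items() if score < threshold]


# Hypothetical binned signal for four replicates of the same mark and cell type.
tracks = {
    "rep1": [1, 8, 2, 9, 1, 7],
    "rep2": [2, 9, 1, 8, 2, 8],
    "rep3": [1, 7, 2, 10, 1, 6],
    "rep4": [5, 4, 6, 5, 5, 4],  # flat signal that does not follow the others
}
outliers = flag_outliers(tracks)
print(outliers)
```

A replicate flagged this way is a candidate for a closer look, for example by going back to the raw data, rather than something to discard automatically.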