So, my name is Guillaume Bourque. I'm an associate professor in human genetics here at McGill, and I'm also the director of bioinformatics at the genome center, which is not too far from here. If you're not from Montreal, welcome to Montreal. As you see, we're very clean people, so we get rid of anything that's left behind. Okay. So yesterday you had an overview of ChIP-seq in particular, with Martin and Misha. Today I'm going to start by covering methylation and, in particular, whole genome bisulfite sequencing. We'll see this figure again, and at that point I'll explain more about what it represents. We have plenty of time in this morning session, so feel free to ask questions along the way.
The learning objectives of this module are, first, to understand a bit about the different technologies used to measure DNA methylation. Even though we're going to focus on next-generation sequencing and whole genome bisulfite sequencing, we'll start by reviewing the strengths and weaknesses of the different approaches, including microarray approaches. After that, we'll go over the bisulfite sequencing workflow, and you'll see that there are some interesting twists in the bioinformatic analysis of these methylation datasets. It's one of the most unusual next-generation sequencing analyses, I would say, because of the way the DNA is actually modified in the process, as we'll see. So: understand the principles and the challenges of methylation analysis, be able to extract methylation levels from bisulfite sequencing data, and know how to visualize bisulfite sequencing data. In this module, especially in the practical, we'll be using IGV, the Integrative Genomics Viewer, to really look at some of the files, which is slightly different from what you saw yesterday, which were mostly UCSC Genome Browser tracks.
So here we'll actually retrieve the data locally and visualize and explore it. And then, be able to identify differentially methylated regions. This we'll cover in the lecture, not as much in the practical, but at least we'll cover it a bit in the lecture so that you have a sense of how you identify differentially methylated regions.
Okay, so starting with the very basics: what is DNA methylation? We'll focus today on the most common form of DNA methylation, which is 5-methylcytosine. It affects 70 to 80% of CpGs, so Cs that are in a CG dinucleotide, in the human genome. The basic principle is that a high level of 5-methylcytosine in CpG-rich promoter regions is associated with repression: if you have high methylation of promoters, especially in CpG-rich regions or islands, that's associated with repression. When there's no CpG-dense region, the relationship with transcription is a little more tricky, but the types of examples we'll see are really looking at hypo- and hypermethylated promoters. We'll get back to that and look at it in more detail.
What type of methylation are we talking about here? In the ChIP-seq lecture yesterday, we were talking more about histone marks, so the histone tails, which can also be methylated and carry different marks. Here we're really talking about DNA methylation. Some cytosines in the genome have this methyl group added by DNA methyltransferases, and that's what we're going to be analyzing and looking at today. But you'll see that in some cases there are similarities between looking at histone methylation marks by ChIP and looking at DNA methylation marks.
Again, we won't go into much detail, and honestly I'm not an expert in this, but you have de novo methylation, which is the addition of one of these methyl groups. One other key feature of this type of methylation is that there's maintenance of the methylation: if it's found on one strand, it gets added to the other strand. This is one way you get the traditional epigenetics, this memory from one cell to the next, where the mark stays on as cells differentiate or replicate.
So why study methylation? It's really the main mark that has demonstrated mitotic inheritance, through what I've just described: this maintenance of methylation where, when you have it on one strand, it gets added to the other strand as the DNA gets replicated. The reason why it's also very interesting is that it's associated with a number of key processes in regulation. There's genomic imprinting, where either the paternal allele or the maternal allele gets shut down through methylation. It's quite important in the context of transposable element silencing: these elements are typically recognized in the genome as something you want to repress, so they don't make copies in the genome, and methylation plays a big role in silencing them. There's a lot of literature around that. Methylation also matters in stem cell differentiation, because these marks get passed on to cells as they differentiate. So embryonic development, stem cell differentiation, inflammation, and also cancer.
Cancer in particular. This is the mini slide that I had in the beginning, and this is the type of thing that hopefully, by the end of the morning session, we'll be able to identify, more or less.
You have the normal state up here, where you see that genes that are turned on tend to be associated with regions that are less compacted, where the DNA doesn't have methylation, while genes that are repressed have a high level of methylation, in particular in the promoter, and tend to be in regions where the DNA is more compact. In cancer, what you have is a disruption of that normal state, and it can go both ways: methylation can get added on the promoter of a gene that should have been turned on, a tumor suppressor in particular, which then gets repressed, while in reverse an oncogene can be activated because of the loss of DNA methylation. So you really have both sides that can happen.
That's relative to just the regular expression of genes. You have other implications of abnormal methylation in the genome, so down here you have additional examples. I talked about transposable elements, for instance. In the normal state they get shut down, such that they're not expressed and don't make copies in the genome, so loss of methylation there in cancer can be a problem. And for some regions that open up in cancer, it's not just expression: you might actually have genomic accessibility that's associated with that. This is also on repetitive sequences in this case. But the bottom line is that being able to compare the normal state of methylation with the abnormal state that in some cases happens in cancer is one of the things we want to be able to do, and why we're doing these assays.
So now, off to the different types of assays that are available to study methylation. I'll explain bisulfite treatment in a second, but just so that you have a broad overview: you have microarrays for detecting methylation.
Again, we'll go into that in more detail. Then you have various enrichment-based methods, where you're enriching for the DNA fragments that are associated with regions that are methylated. This includes things that are very similar to what you saw yesterday in ChIP-seq, and we'll get back to that. And then finally, the whole genome bisulfite sequencing approach.
But in case you're not familiar, let's start with the basic tool, the nice molecular trick that gets used to do this type of methylation profiling, which is bisulfite treatment. So you've got two alleles. It's the same region of the genome, but this allele has a CG that is methylated, and you have the exact same allele but unmethylated. Bisulfite treatment doesn't directly affect the methylated Cs. What it does is convert all the unmethylated cytosines into uracils. So any unmethylated cytosine gets converted by the bisulfite treatment into uracil, while the methylated Cs are protected from that treatment.
One thing that we'll go over later is that bisulfite-treated DNA is a bit tricky because it's not symmetric. What's happening on the reverse strand needs to be taken care of separately because of these changes, but we'll get back to that. The key thing to remember is that in the bisulfite treatment, the unmethylated Cs get converted to uracils. Then, once we map the reads back to the genome, as you'll see, we'll be able to distinguish the Cs that were methylated from the Cs that have been converted. But again, I'll get back to that later.
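The conversion just described, and the strand asymmetry it creates, can be sketched in a few lines of Python. This is a toy illustration rather than part of any pipeline: the sequence, the methylated positions and the function names `bisulfite_convert` and `reverse_complement` are made up for the example, and it assumes complete conversion, with the uracils read as T after amplification and sequencing.

```python
def bisulfite_convert(strand, methylated_positions):
    """Return the strand (5'->3') after idealized bisulfite conversion.

    Unmethylated C becomes T (C -> U, then read as T); a C whose 0-based
    index is in methylated_positions is protected and stays C.
    """
    converted = []
    for i, base in enumerate(strand):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Hypothetical 8-base region with two CpGs (Cs at index 1 and index 5).
top = "ACGTTCGA"
bottom = reverse_complement(top)                     # "TCGAACGT"

# Assume only the first CpG is methylated, symmetrically on both strands:
# index 1 on the top strand corresponds to index 5 on the bottom strand.
converted_top = bisulfite_convert(top, {1})          # "ACGTTTGA"
converted_bottom = bisulfite_convert(bottom, {5})    # "TTGAACGT"

print(converted_top)
# Bottom strand shown in top-strand coordinates: "ACGTTCAA"
print(reverse_complement(converted_bottom))
```

In reference coordinates the top strand shows C-to-T changes while the bottom strand shows G-to-A changes, so the two converted strands are no longer complementary. That is the asymmetry that has to be handled separately for each strand at the mapping step.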
The key from the bisulfite treatment is again what happens to the unmethylated Cs, which get converted.
So now off to the technologies. I'm highlighting in red in each case what's specific to that particular technology; a lot of the steps are actually quite common from one method to the next. With the bisulfite-treated microarrays, you prepare the DNA, you do this conversion, and then you hybridize on the microarray, and the microarray has probes that specifically distinguish the Cs that have been converted from the Cs that haven't. So at the location of various CpGs that could be methylated or unmethylated, you have probes, and then it's very similar to any standard microarray analysis: you're just looking at the differential signal between the two probes. You have data normalization and analysis that's not unlike regular microarrays; the only difference is that you're doing this bisulfite treatment and then you hybridize.
Here's an example of a relatively old methylation microarray. This is one of the early versions, with 27,000 probes. These are Illumina arrays. Since then they have arrays that are 450K, which is a quite common array, and they have the EPIC array, which is over 800,000 probes. So the density is increasing, but I think this still illustrates the point quite well. If you think about gene expression microarrays, there are only perhaps 20,000 or 30,000 genes, so you're able to cover the genes quite well. With methylation, the challenge is that there are lots of Cs that are potentially methylated in the genome. In black, you have all the Cs that are in this CG dinucleotide context, and you see that at this scale, which is probably a multi-megabase region, you really have quite a lot of potentially methylated Cs.
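Going back to the two-probe readout on the arrays: one common way to summarize the differential signal at a CpG is a beta value, the methylated-probe intensity divided by the total intensity. The lecture does not spell out a formula, so the sketch below is an assumption about the summary used; the function name, the example intensities and the stabilizing offset of 100 (a conventional default for Illumina arrays) are illustrative.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Fraction of the signal coming from the methylated probe (0 to 1).

    offset is an assumed stabilizing constant that keeps the ratio well
    behaved when both intensities are close to zero.
    """
    # Negative background-corrected intensities are clipped to zero.
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return m / (m + u + offset)


# Hypothetical probe intensities for two CpGs.
print(round(beta_value(9000, 900), 3))    # mostly methylated: 0.9
print(round(beta_value(500, 9400), 3))    # mostly unmethylated: 0.05
```

Values near 1 indicate a CpG that is methylated in most of the molecules in the sample, and values near 0 a CpG that is mostly unmethylated.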
And then in red you have the CpG-rich regions, which tend to be associated with promoters. If you look at the genes down below, you'll see that in many cases the promoters of genes are associated with regions that are dense with CpGs. On top you see where the probes from the 27K array were designed, and you see that they're really quite sparse. So even though it gives you very accurate and pretty robust results, because microarray technologies are very mature, it still gives you very sparse data throughout the genome. Now that it's 800,000 probes, of course, the density is much, much higher, but it still ends up being definitely not a base-pair resolution profile of what the methylation looks like. One thing that saves the arrays and still makes them very useful is that methylation typically doesn't vary that much at the base-pair level, or at least it's thought that it's usually regions of high or low methylation that are important. But that's an assumption you make with the arrays, that the sparse coverage will give you all the information you need. So that's one of the challenges.
A quick note: yesterday there were some questions on tools, and it's always a challenge to figure out what tools to use to do the analysis. We won't go over these tools in much detail, because that's not what we'll cover here. But there are a lot of tools, and for the microarrays I would say they're quite mature, whether it's the Illumina GenomeStudio package, which comes with these chips and is frequently used, or RnBeads. So there are a lot of pretty mature packages to do the bisulfite microarray data analysis, but this is not what we'll cover much.
So moving on now from the microarray-based approaches to some of the enrichment-based approaches. Again, many of the steps are similar. Here there's no bisulfite treatment.
The main difference, after the sonication of the DNA and the library preparation, is an enrichment with an antibody that recognizes 5-methylcytosine. This is what Misha was talking about yesterday: it's the equivalent, in a way, of ChIP-seq on the histone marks, but here you're pulling down, using an antibody, DNA that is methylated in this way, and then the downstream analyses are quite similar to what you saw yesterday. Another method that's quite similar, conceptually at least, is MethylCap, which is another enrichment-based approach where the enrichment this time is not with an antibody but with a methyl-binding domain protein. Again you're just enriching DNA that is methylated, and then it'll be very similar to a ChIP-seq analysis.
The last enrichment-based approach that I'll present this morning is the RRBS method, where the initial step is a digestion with MspI, which is a restriction enzyme that basically cuts at CGs in the genome. One of the challenges, and we'll get to that when we do whole genome sequencing, is that to cover the whole genome you end up having to sequence quite a bit. Here the advantage is that the enzyme specifically cuts regions with CGs, such that it enriches your DNA fragments for reads in CpG islands and in CG-rich regions. So this is just a strategy to enrich for the regions of the genome you're looking at, and then you follow with the bisulfite treatment, which we talked about, you do amplification, and you do high-throughput sequencing. It's really just a way of enriching. It's associated with some challenging biases in terms of what regions are covered, so that's one of the limitations, which we'll get back to.
Continuing on the example that I showed before, the bottom of this slide is the same thing I've shown you before. So again, there are tons of CGs in the genome.
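The RRBS digestion step can be illustrated with an in-silico digest. MspI recognizes CCGG and cuts after the first C, so every internal fragment starts with CGG, which is how the protocol concentrates reads on CG-rich sequence. The sketch below is a toy under stated assumptions: the sequence and the function name `mspi_digest` are invented for the example, only one strand is considered, and any downstream fragment selection is left out.

```python
import re

MSPI_SITE = "CCGG"   # MspI recognition site
CUT_OFFSET = 1       # the cut falls after the first C: C^CGG


def mspi_digest(seq):
    """Split an uppercase DNA string at every MspI site (single strand)."""
    cuts = [m.start() + CUT_OFFSET for m in re.finditer(MSPI_SITE, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical 22-base sequence containing three CCGG sites.
seq = "AAACCGGTTTTCCGGAACCGGA"
fragments = mspi_digest(seq)
print(fragments)   # ['AAAC', 'CGGTTTTC', 'CGGAAC', 'CGGA']
```

Because each cut lands inside a CCGG, the fragment ends always carry a CpG, so regions with many sites (such as CpG islands) yield many short fragments while CG-poor regions yield few.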
The arrays will only cover some of these CGs. What you see here is this last method that I talked about, RRBS, and you see the reads, a bit like you would see ChIP-seq reads. You see the reads that have been bisulfite treated and their location, and you see that most of the reads are not uniformly distributed across the genome: they tend to be associated with CpG islands. So that's a way of efficiently covering these regions with less sequencing. That's the RRBS data. MeDIP and MethylCap, on the other hand, are enrichment-based exactly like what you saw yesterday, where you have profiles that are enriched for methylation. It's different in the sense that you don't have base-pair resolution, but you definitely get a sense from this data of which regions have high or low methylation, and so you can do analyses very similar to what you did yesterday, except that we're looking at DNA methylation.
Again, we won't focus on this, but this is just to give you some ideas of how to process this enrichment-based data. As opposed to the microarrays, where it was really software for microarray data analysis and normalization, this is in some cases similar to what you've done with ChIP-seq: you're getting reads, so you map them using mapping algorithms, Bowtie or BWA for instance, but then you'll have tools that are specific to methylation data and look for enrichment in these datasets.
Before moving on to whole genome bisulfite sequencing, and then maybe I'll stop as well to let you ask some questions: this is from an older review, but it still gives an idea of some of the differences in coverage, more quantitatively than what we saw in the figure.
Especially in different regions of the genome, the coverage tends to be very, very different. The way to look at this: you've got different regions of the genome, whether you're looking at CpG islands, or at promoter regions, and again there's a good correlation between these two, or whether you're looking across the whole genome, and you ask what's the typical coverage, the number of reads you have to be able to call different regions. If you look at the two enrichment-based approaches, you see that they're very good in CpG islands and in promoters. This is also true at some level with RRBS. But once you move to the whole genome, you obviously have much less coverage, and in regions where you have just a few reads you won't really be able to say much about whether it's methylated or unmethylated. The arrays, and this again is an older version of the array that didn't have as many probes, so this has improved, still leave you covering the promoter regions and the CpG islands relatively well overall, but not necessarily knowing what is happening in the whole genome, as you would expect. One can argue, and this is the same argument as with variants and mutations, that the important thing is to know what's happening in genes and in promoters, because anyway we don't know much about how to interpret what's happening in the rest of the genome. But obviously, if we don't have tools to look in those regions, we also won't be able to say much about them.
Okay, so finally, the technology that we'll spend the most time looking at is whole genome bisulfite sequencing. This has the advantage of being simpler in many ways, because you don't select any regions: it's really only the bisulfite treatment followed by sequencing, so you're not selecting in any way what you're looking at, you're looking at the whole genome. The big limitation here is really the cost, because you need to do a full genome sequencing, which will likely cost you a few thousand dollars per sample.
One twist, and maybe this should have been in the enrichment-based approaches as well: one strategy to reduce the cost of whole genome bisulfite sequencing mimics what you can do with whole genome sequencing versus exome sequencing. You do the same steps as whole genome sequencing, except that you have a step where you capture the DNA fragments of interest. You can have probes, exactly like with exome sequencing, where we would design probes across genes and then capture the DNA fragments that are associated with genes. You can do the same with methylation, except that you can design the probes not just on the genes and the promoters but potentially on various regulatory regions that have been pre-identified using something else. So you can basically decide where you look in the genome using these types of capture bisulfite sequencing approaches.
Okay, so tools, and this is now closer to some of the tools that we're going to be covering in the practical: some of the tools for processing whole genome bisulfite sequencing. I've listed them here, and in the next few slides, once we go into the pipeline, I'll explain in more detail some of the differences, because there are really different philosophies of how you process these
bisulfite sequencing reads. Really the challenge is that, if you remember, we've changed some of the DNA bases in the reads, such that the mapping step is going to be a bit of a mess, because they no longer necessarily map to the reference directly.
Okay, so last slide, and then I'll open up for some questions, because we'll have covered all of the intro in terms of the different technologies. So: microarrays versus enrichment-based versus whole genome bisulfite. At some level all of them provide pretty accurate DNA methylation measurements. Microarrays typically have lower costs and provide accurate measurements across a large number of CpGs that typically focus on the important regions that we know about, whether it's promoters or CpG islands, and with the arrays that cover close to a million features you really get a pretty dense and robust methylation profile. Enrichment-based methods have relatively low resolution, because all you're looking at is regions that are enriched or depleted for methylation, but they still have a low cost and are not restricted to the places in the genome for which you have probes. The bisulfite-based methods provide absolute DNA methylation measurements, and that's what we're going to see. One of the neat things, and I didn't discuss that yet, is that with the reads you really get a quantitative measure of how many reads correspond to a fragment that was methylated or unmethylated. And there are also additional applications if there are variants, and we'll get to that. As for the challenges, I didn't mention it, but these enrichment-based methods do have biases that in some cases are challenging to control and to remove. And finally, with whole genome bisulfite sequencing, the main
limitation is really the cost. But as the costs go down, there's a good chance that this will become the method of choice for this type of profiling, especially given that you're not restricting yourself to specific regions of the genome, so you're really doing an unbiased survey of what's happening. So I'll stop here for a minute. I don't know if you guys have questions.
You were already mentioning it, but I'm missing the big picture a little. If you look at what we learned yesterday, where the histones are methylated and acetylated, and here the DNA is methylated: are these completely independent mechanisms, or are they working together?
I guess it's both, if I can say that. There are differences and there are definitely some correlations as well, but these are independent mechanisms, and in some cases it's really the DNA methylation that is used for the repression. It's really to unravel these relationships that we need to be able to do this type of profiling. They are independent mechanisms; it's just that sometimes they do work hand in hand. Yeah, that's right. Sometimes you really have one mark that precedes the other, and then the other one gets put on to reinforce the same state.
Yes, that's a good point. When you say limitation, what you mean is the fact that... that's right, yes. As I said, we focus here on just one type of DNA methylation, which is 5-methylcytosine. There are other types of methylation. One of the advantages of whole genome bisulfite sequencing, for instance, is that it's not restricted to detecting that mark only at CpGs, so you're able to detect other C methylation outside of that context. But to detect other types of methylation, you really have to
have additional or different treatments that are not bisulfite, or that combine with it. There are different assays that allow you to isolate or convert only certain types of methylation marks. So there are lots of parallels to what we're going to see here, but in terms of the treatment you definitely need to make some adjustments and use other treatments that allow you to isolate other types of methylation.
WGBS is very expensive, because you're already...
Yeah, so you can pool samples, but typically that's not going to help, because to be sensitive enough to detect the methylation level, you need to have coverage on every base that you want to be able to estimate. So it depends what resolution you want. At the end I'll talk about a paper that really does this type of analysis of what coverage you want. It depends whether you just want to be able to detect methylation broadly in a region, or you really want base-pair resolution. If you really want base-pair resolution, to know if that base is methylated, then you need lots of reads to cover it, and you're stuck having to do 30x, where you can't pool and mix samples so much. If you're happy with a less refined resolution, and you're just looking for large regions that are differentially methylated, then you can go down to a coverage of 5x or so, and then it makes sense to pool samples, and that does reduce the cost quite a bit.
You mentioned briefly that you can have polymorphisms. With these arrays, I guess they choose specific Cs that are not polymorphic, but my question is what happens if you do have a polymorphism.
Yeah, absolutely. So I'll touch on
that a little bit. It's much better, but also challenging, to use sequence-based data, because at least there you have the information directly on the reads about these potential alternative alleles, so you can do a lot more, but the bioinformatics ends up being quite challenging as well. I'd say it's definitely much more powerful: if you're interested in variants and methylation combined, then the sequence-based assays are typically better. But for that you do need base-pair resolution, so you would need either an enrichment-based approach, such that you have sufficient coverage, or really go for 30x whole genome bisulfite sequencing, because if you have shallow read coverage you won't be able to look at these.
Yes? I'm just going to build on Marcus's question from before regarding DNA methylation versus histone methylation. Let's assume we have histone methylation at one specific site or one specific gene, and then we have DNA methylation at the same site, and we want to see how much this specific gene or region is actually regulated by these methylations. Do they have some sort of aggregated regulatory effect on that gene, or how do they interact?
I don't know, maybe Martin, you're better placed than me to answer that question.
Well, we're getting into a little bit of biology. In some cases the relationship between the histone modifications and DNA methylation is antagonistic, and mechanistically that's well understood: the enzymes that control the modifications are actually mutually recruited by one modification or the other. In other cases they reinforce one another. This is the case, for example, at retroviral elements, where H3K9 trimethylation is present along with DNA methylation, and if you lose one, and people have done these kinds of experiments in embryonic models or in early development,
you still have repression of retroviral elements. So it's a redundant mechanism ensuring that those things are repressed. They're not additive. I can't think of any cases where the two marks work together in a way that's additive, in that sense, if I'm understanding your question.
Yeah. So there is a finite number of scenarios for how they regulate, but with this one and this one, how do they regulate the expression level?
Well, yeah. Taking the example that we talked about yesterday: they're marked by K27 methylation, pioneering transcription factors are recruited, which then recruit the TET family of enzymes, which then demethylate that region, and that allows other transcription factors to be recruited. So they're together making a code, for sure, but I don't think they're working together in the sense that I think you're suggesting. Mechanistically, a lot is known about this.
At the end of the day, what I mean is whether function can be inferred from the DNA methylation or the histone methylation, whether we can infer which one is actually regulating the expression.
This is still an active area of research, right? Gain of CpG island methylation has dogmatically been associated with repression, and in the context of malignancy has been thought to mediate repression of tumor suppressor genes. But recent evidence suggests that's not the case: in fact we see heavy methylation of the CpG islands, yet the gene remains transcribed. I think we're still learning, so I don't think these rules are fully worked out in all scenarios.
But I think it's fair to say that these are independent mechanisms that work, in context-dependent ways, to reinforce each other. And there are even marks that we are still discovering, so we're far
from understanding the full range of combinations and how they act together.
All right, so now off to the actual data analysis workflow, as I said mainly for bisulfite sequencing data, whether it's enrichment-based bisulfite-treated sequencing data or whole genome bisulfite sequencing data. The main steps, and you'll see this is a typical NGS workflow: there's the initial processing of the sequencing data, with some quality control and pre-processing; then the bisulfite sequence alignment, and that's where, as I mentioned already, there are a few interesting twists; then the quantification of DNA methylation itself; then the data visualization and some statistical analysis, which is also what we're going to be doing in the lab: visual inspection of selected regions, looking at the global distribution of values, which is quite similar to some of the things you did yesterday except that it's now done in the context of methylation, and clustering of samples based on similarity; and after that some downstream analysis, where you're looking at differentially methylated regions.
Okay, but first things first: quality control and pre-processing. This was already highlighted yesterday, but it's really important, before you start the analysis, to look at your raw data. I say it's really important, and that's actually not what we'll do in the lab, but it's highly recommended that you look at the properties, as you did yesterday with the ChIP-seq data, using tools like FastQC. Were all your samples sequenced using the same protocol and instruments? This is going to be especially important once you start combining multiple datasets. These are standard things to watch for: are there any technical issues affecting some of the samples? This will be important once you start comparing samples
or different conditions, obviously. So you can also run tools like FastQC that you saw yesterday, and actually, if we have enough time, maybe that's something that we'll be able to do at the end of the, I mean, we should be doing it at the beginning, but I wanted to make sure we have enough time to cover other things that you hadn't seen. But if we have extra time, that's one of the things that we could add as well in the practical. So running FastQC on your data sets initially, you'll see that the profile and the base composition with these different assays are slightly different. That's in part because, of course, of the conversion that took place with the bisulfite treatment in the context of whole genome bisulfite sequencing. So that's why the profiles are different. This shows the percentage of the T's, G's, A's and C's, and again, the C's have been converted, so you really have a different profile than with regular whole genome sequence data. In the context of RRBS, the profile also looks quite different because of the properties of the initial enrichment and cutting. But again, you'll have to look at a few of these to see what it looks like. The key is really to also make sure that all of your samples have a similar profile. So that's one important thing to look at even before doing the processing itself. So, very similar to what you saw yesterday, you've got quality scores associated with the reads, and so depending on quality, you might want to do some trimming of the reads. Again, I'm not going into as much detail here because you spent quite a bit of time on this yesterday and there's really quite a lot of similarities here. So you can trim based on, yeah, so simple trimming, dynamic trimming, which is window-based, or, you know, trimming based on other criteria. So again, hopefully you don't have a distribution
that looks like what you see here on the left, but even if you do, you do have the ability to clean up and sort of focus on reads that are of a better quality. But overall, hopefully, as the instruments have been getting better, this is less of an issue. The other important thing, and I'm sorry, I missed some of yesterday's, so I don't know how much you went over this, but this is a very important point, especially for libraries that have PCR amplification, because here we're going to be interested in the number of reads that have a methylated C versus reads that don't. So the quantification of reads is going to be very important. So if you have reads that are really PCR amplified, this will really throw off your estimates of the actual number of reads that have a particular methylation state, right? So when you have a library that's of low complexity, where you have many reads that basically have been amplified, for accurate quantification of methylation it's important that you actually deduplicate and remove reads that are identical and really represent the same read from the initial library. In an ideal case, you have a complex and diverse library that doesn't have so many of these reads that are identical. I guess I didn't explain, so here you have this complex library, but once you have just this one read that's been duplicated and amplified, which obviously all has the exact same methylation status, it's going to change and offset your estimates of methylation. You'll say this is methylated at 71 percent, while actually you only had one read that was unmethylated and one read that was methylated, so the actual percentage in this case, based on the information that you've received, is really 50. So that deduplication, removing duplicates and monitoring the
rate of duplicates is quite important in the context of these analyses, because again, it'll offset your rates. So things to look at in these initial steps: looking at the read quality using things like FastQC, for instance, the presence of adapter sequences, and we've talked quite a bit about duplication rate. Another thing that I guess I didn't have slides on, but that's also critical, is really the conversion rate. In the bisulfite treatment, if you don't treat sufficiently, you'll have a low conversion rate in your library, and that's also going to offset your estimate of methylation. So typically, as part of the experiment, you have additional controls, sequences that are fully methylated or unmethylated, that allow you to estimate the conversion rate. You need to have a high conversion rate in your experiment to ensure that your estimates of methylation are correct. That usually is really done in the initial preparation stage of the library, but it's good to also look at these estimates of conversion rate from the analysis. So this is all part of the initial quality control and pre-processing. So now we're going to get to the fun part, which is really the bisulfite sequence alignment. The really fun part is once we get to the variants, but that's after; you'll see that it's already fun when we talk about the alignment of these reads. So this is a similar slide to what I showed you when I explained the bisulfite treatment in the beginning. Somehow, usually, if I haven't had my coffee, I get confused when I start thinking about this alignment, so take it step by step, right? So you really have methylated C's and unmethylated C's in the genome. One thing again that you'll have to be careful about here is that this process is not symmetric: the conversion that happens with the bisulfite treatment is different on the two strands. So you really, so
denaturation: the two strands are split. The bisulfite treatment, as we've talked about, converts, so the methylated C's are protected, but the unmethylated C's get converted to uracil. Again, what's happening on the two strands is slightly different in this case, because on the two strands you have different C's that will get converted. So that's why you have to really think about what's happening on the two strands at the same time. And then through the PCR amplification process, these uracils get converted to T. So this is really where this new sequence, whether it's U or T, once you map it back on the reference genome, this sequence will no longer match, assuming that the reference is a C at that location. So the read will no longer map perfectly at that location: the methylated C will be fine, but those unmethylated C's basically will become mismatches. In the amplification process you also get the complementary strands. Again, you need to keep track of this because it's not the same thing that's happening with both strands. So on the second strand, on the right, again we follow: methylated C's were protected, and then the unmethylated C's convert to uracil and, following PCR amplification, become T's. And all of these T's are not part of the genome, so once we map them they'll be viewed as variants, basically, but again, we'll need to treat that carefully. Yeah, that's right. So again, assuming that the conversion rate is very high, every C that you observe later on, you can assume that it was methylated. That's right. Every C that we'll see in these PCR reads was methylated initially, and every mismatch of a T on a C will correspond to an unmethylated C. So in terms of the data processing, what's the approach that can be used now, given that we have these reads that have these artificial, so lots of artificial T's where you would
have C's, and A's where you would have G's. So there's really three main approaches that can be used bioinformatically to deal with these reads and do this alignment. Well, actually, two of these approaches are alignment based, and the third one, which we'll talk a little bit about, is this reference-free processing. But the two main approaches are really wildcard alignment and three-letter alignment. So, wildcard aligners: the trick that's used with these alignment tools, to avoid these mismatches, is that you actually replace all the C's in the genomic DNA sequence by a wildcard character, Y. So you basically have a wildcard associated with all the C's, which is going to match both the C's and the T's in the read sequences. Depending again on the mapping tool that you use, either you convert the genomic sequence to have these wildcards, or you modify the alignment scoring matrix to not count these as real mismatches, basically. So there's a number of tools, including BSMAP and Pash, that use this basic idea of having a wildcard. So all C's basically will be viewed as wildcards, and you're going to allow, and not count, mismatches from T's onto these. So that's approach number one. The other approach is this three-base aligner, and again, for both of these things I have a figure, and then we'll see some examples. So the second approach is really the three-base aligner. A bit like you do with the bisulfite treatment itself, you convert all the C's into T's in the reads and in the genomic DNA, doing that on both strands. So this is a strategy where, instead of the four-letter alphabet, you sort of implicitly convert everything
into a three-base alphabet for the mapping step, and after that we'll extract back what was really happening. So again, we'll show an example, because for sure, I mean, you hear those two things, and in my mind, the first few times I heard this, it's really not clear what the impact is, or how differently these two approaches actually work, and how things change from using these two approaches. So approach number one, you have this wildcard where you don't count mismatches at that location, and approach number two is that you convert everything into this three-base alphabet. The main tool that uses this approach is Bismark, and that's the tool that we're actually going to be running. So how does that work, and what's the impact of these two general strategies? And again, if you're a bit lost here, that's okay, I think, because initially this is not trivial, what's the impact of using one approach versus the other. So let's try to go through this. So suppose that you have this DNA, and you've got some CpGs: this one is a hundred percent methylated, this one is fifty percent methylated, fifty percent, zero percent, right? So now we extract from this these tiny reads that are bisulfite sequence reads, but they're just very small. In the wildcard strategy, again, the C's in this context are converted into this wildcard, such that whether it's a C or a T in your read, like in this case, both of these will map equally well to the wildcard. The problem though is that once you allow that, you end up with cases like this read. So again, here a hundred percent, meaning both reads were protected, so both reads are methylated, so you have a protected C, such that in the bisulfite-treated reads you have a C. So that's this scenario. Here you've got fifty percent
methylation, such that you have some reads that have a C and some reads that have a T. In both cases these map equally well to that location because it's a wildcard here. So this is where it gets interesting and a bit tricky. Here you see that the methylated C, which remains a C, actually maps back to this location, but the unmethylated one, because of this wildcard, does map here, but it also maps somewhere else in that short sequence, such that actually this would be a multi-mapped read, and typically you might actually not count that read. So this is a bit sneaky, because at that same location the methylated C maps and the unmethylated C ends up not mapping because it becomes ambiguous. So that's also going to create some challenges in the analysis, because the mapping of the two is basically not symmetric. So that's the tricky part with the wildcard: it does work, but it has in some cases this limitation where it's not necessarily behaving the same for both types of reads, methylated or unmethylated. That would help, but this does still happen. So in some regions that are not very complex, you still have, so this is a cartoon example, obviously, and if you have longer reads, and we do have longer reads, this is going to be less frequent. But in regions that are low complexity and that have high methylation around, you're going to have this effect happening in some regions of the genome, and it's still going to lead to some funny things. So that's one of the challenges with these approaches. So I explained the wildcard alignment strategy. The three-letter alignment strategy is what we're going to be using, but this one has other limitations. With this one, you see that in everything in the genome and in the reads there's no more C, so we've converted all of this to a three-letter alphabet genome. And here you've got another
problem, which is now that it's very conservative, because you've reduced the complexity of the genome quite a bit, so suddenly many reads become ambiguous in terms of where they map. The advantage here, though, is that this is the same for both the methylated and unmethylated reads, so at least you don't suffer from a bias affecting just one type of read, which would offset your methylation estimates. The problem though here is that you end up, and again this is a cartoon example, obviously if you have longer reads this is less of an issue, but it still illustrates the problem, you end up with more ambiguity. So, and that's what you see at the bottom, there are more regions where you're not sure where the reads are mapping, because with only a three-letter alphabet they can map in many places. So there's more regions that end up not being covered because of that ambiguity. So those are the two main types of aligners, and hopefully that makes a bit more sense, or it makes a bit of sense, in terms of what it is. So as I was saying, the three-letter alignments have lower coverage, especially in these highly methylated regions. Highly methylated regions have lots of C's, and so these end up being very low complexity regions because of the three-letter alphabet, so you end up decreasing coverage a little bit in these regions of low complexity. Again, the longer the reads, the less this is an issue, but it's still an issue in some regions of the genome. Wildcard aligners typically have higher coverage overall, so they keep more of the reads, but they do have this bias in some cases that was illustrated in the previous slides. So these problems, if you look at normal regions of the genome, are actually not a big deal, such that both approaches end up being quite
comparable. But if you look in repetitive regions, that's a bit more of a challenge, and in some of the applications those are the regions that are of interest, because repression of these regions is what you're looking at. So again, it's not trivial, and there's really no perfect way of solving this, except with longer reads, potentially. So the tool that we're going to be running in the lab is Bismark, which is one of these three-base aligners. So one thing, and we'll have to do that as well in the practical, is that you need to convert the reads, the C's to T's on the plus strand and the G's to A's on the reverse strand, and do the same in the reference genome. And then what the tool does is really collect the results of these four alignments simultaneously to decide whether this is a usable read and then, ultimately, what is the methylation state at that location. So the Bismark tool really, I guess, takes that complexity out of your hands. It does all of that work of actually converting the reads and mapping them, but I think having an understanding of what's happening under the hood is useful. There's really some real limitations here, because we're basically working with modified reads: they no longer just map to the reference genome, and you need to take that into account. So the last little thing I wanted to touch on, on these ways of aligning or analyzing the whole genome bisulfite sequence reads, is sort of a new and, I guess, promising approach, but that's still under development. There was a question yesterday about reference-free analysis, or what happens if you don't have a reference genome. So the idea is, because these reads don't typically map directly on the genome because of this conversion, could you actually forget about the reference genome and work off
directly the reads themselves? So there's a parallel again to what actually gets done with variant detection. So variant detection, typically, again, I think, I mean, you guys are doing the advanced bioinformatics.ca course, exome variant calling is much easier than that, because you just map the reads on the genome and you look for mismatches, right? So this is a screenshot of IGV, which is exactly what we're going to be using, but with methylation data. This is exome variant calling, where you have a tumor and a normal sample, so sort of two sections. In gray you have all the reads. You map the reads onto the genome, and again, in the context of exome sequencing or genome sequencing it's much easier, because you don't need to do anything special. And then what this shows: it highlights the positions that are mismatches to the reference, and the mismatches correspond exactly to these variants. In this case, when you have lots of evidence that there's an A here, there's a line, so there's a mismatch. So in the same way, there's also an approach for exomes that basically skips the alignment step, where all you do is actually compare reads. So you have reads that are tumor reads and you have reads from your normal sample, and without doing this alignment step you just basically compare, not the k-mers, but something similar to that. So you organize all the reads into these sequence trees, and then you compare and you look for bunches of reads that are basically different in the two samples. So again, this is not really what we're going to cover, but I just wanted to give you some idea that there's alternative ways of dealing with this. So sorry, can you know what the biological meaning of this is if you can't point it to a specific genome? Well, so you could, that's right. So the idea is that you could basically, from this, identify, you could still after that map all of
these reads together to identify where that difference comes from, for instance, but you don't need to map them individually to the genome. And so you still need to know where it comes from for it to be interesting, but you could identify the differences first. So for instance, here this is relative to a variant, so you're detecting a whole bunch of reads that have two versions, basically, one with a variant and one without, and then you would do that step of knowing where it goes in the genome as a secondary step. But you would also, in a way, map it to the genome to know what it's doing? Yeah, probably create something. So that's right, that's right, because you're not doing it read by read, maybe with multiple reads. So one challenge, for instance, I mean, this was also for indels, which is the same type of problem, where mapping the individual reads on the genome is difficult. So same with methylation or indels: the individual reads are difficult, but once you've pooled all of them together, you can build a sort of larger contig of what's really happening there, and then you can map that on the genome. So this is all on the variant detection side; it's just that there are tools that are now starting to come up that are trying to do that with methylation as well. And that might be a way, you know, we talked about the fact that there is ambiguity in some regions of the genome, right? So again, this is getting a bit technical, but most of the genome is well covered with the alignment-based methods for bisulfite sequencing, but there might be some regions where the alignment-based approach is not working, and so using these types of reference-free approaches, where you're just trying to group together reads that have an interesting profile and then map them, you might be able to salvage something. But again, this is definitely not a standard approach right now, just an alternative. Okay, so I'm getting gradually into the fun
side of things, which is the actual quantification. So again, as I said, the alignment was one of the steps that really is quite particular for bisulfite sequencing compared to ChIP-seq or variant calling, where your reads match the reference, so you're just mapping them to the reference. Because of the bisulfite treatment, here the alignment was a bit weird. So after that, we want to quantify the methylation based on the alignment. I mean, we've been already sort of circling around this. So the basic idea, of course, and we've talked about this, especially after you've removed duplicates, the objective is to take all your reads, and again, the alignment gives you the location, but you don't lose, of course, the information of whether it was a T or a C in your sequence. So then you're able to actually just count how many C's you had and how many T's you had, and that does give you a rate of methylation from zero to 100 percent. So again, as part of the Bismark pipeline, for instance, you can extract that as well after the alignment. So that's the key thing that we're going to be doing in the lab as well. So where it gets really exciting, confusing, interesting, slightly almost impossible, is when you take into account the impact of SNPs. Again, this is not what we're going to be doing in the lab per se, but I think it's important to keep that in mind, and also it's one of the interesting applications, as we were saying, of the sequencing-based assays versus the arrays, that you can pull this information out. So far, every time I was talking about this, you have the reference genome, and you've got the C's that might be T's, and you're converting the sequences and all of that. I mean, arguably I thought it was complicated enough, unless I explained it so well that it's all very clear to you guys. At least it's in the morning, it's not late
afternoon. But anyway, all of that sort of assumed that the reference was what you actually had in your genome. But there's lots of sites, as you know, that are polymorphic, and also the genome is diploid, so you actually might have two different alleles at some of these positions. So the story that I've been telling so far is complicated by the fact that you also have SNPs at the same time. So let's try to walk through this figure a little bit, and again, this can be a bit of fun. So you have the reference genome right up here, which again has C's and T's in this case, but you might have some SNPs in the sample that you're actually measuring. So in this case you have a C-to-T SNP, so this C is a T, and this T, sorry, here is a C in the genome that you're looking at. So again, we're not observing that, unless, well, in some cases you'll have a separate genotyping assay that will tell you these locations, but in many cases, you know, it's interesting to think about what's happening to the reads given this context when you're mapping them on the reference genome. So here the C gets converted through the bisulfite treatment to a T. So was that a methylated C or an unmethylated C, if the C becomes a T? Unmethylated, correct. Right. So here, well, the T here is an actual T, right, but relative to the reference, if you compare these T's and you see it should be a C in the reference, you're going to think this is also an unmethylated C. So you're going to make a mistake here, thinking that it's an unmethylated C, if you don't take into account the fact that there are SNPs. Here, this T is actually a C, which in this case, so in this genome it's
actually also an unmethylated C in reality, but what you're going to observe is a T. So again, I guess all of these T's that you're observing are very different T's, basically. Yeah, so you could do that, but that would be quite expensive. If you were to do that, you might as well do a genotyping array potentially, and then know these locations already, right, or sequencing, which is even cheaper. So you could know these locations using some other method and then annotate these as SNPs separately. Or the reference-free approach potentially would be another advantage. But there's a neat trick, actually, that can be used here directly, just using this data, which is the second strand. So what actually saves this, or allows us to recover some of that information, is the fact that on the reverse strand things are happening differently because of that methylation. So if you follow what's actually happening on the reverse strand with these conversions, when you actually had a SNP, you'll see that there's a mismatch. So if we go back to the regular case, for the regular unmethylated C, on the reverse strand you expect to have a G, which will have remained a G. So this is what you should be looking for here. Again, this is not what you see here: even though on the C strand you observe the T, you see that you have a mismatch. So there's a way, by combining the information on both strands, to actually recover the SNP versus the actual unmethylated C. But again, I'm glad I'm not the programmer who coded how to extract that information, because there's lots of cases, and to be able to do this accurately you need sufficient reads on both strands. If you
don't have sufficient reads on both strands, you'll be blind to this, or you won't be able to make these calls very accurately. I mean, think about the fact that there's also errors in the mix of this; it's not always perfect. So it's already challenging to do variant calling, and to do variant calling combined with this is really quite challenging. One thing that, yeah, that's right, so you don't have information on the strand anymore, but you do have the two types. So you're looking at a particular location to make sure that you have two types of reads, but you don't know originally which one came from which strand, which is why you have to do everything in parallel, assuming that you don't know that information. So one more thing that I'll say on this is that typically the approach, unless that's what you're interested in, typically the approach is actually to take the common SNPs and, okay, I think you, blind spot, so, yeah. So the last thing I was going to say on this is that, unless you're interested in extracting this information, a safe thing to do is just to mask these regions that commonly have SNPs, to just get rid of that, because again, if you don't have sufficient coverage, you won't necessarily be able to know. There's also the fact that there's again two chromosomes, and so you have potentially two, you know, two versions, so it's pretty tricky. Yes? To do the variant calling or to do the methylation? Yes. So the simple way around this is to just ignore the locations of common SNPs, because those regions are indeed problematic in terms of the output. So you can just mask those regions of common SNPs, and then anyway, if the region is methylated, removing one of these bases will hopefully not affect your analysis. That's right. Okay, so that's sort of a separate and also important
point, which is that if you have a complex tissue, I mean, that applies whether it's a variant site or not, definitely. If you're looking at a complex tissue, the estimate of methylation might be confounded by that mixed tissue, for sure. If you can do that, that's good. There's also tools, because typically separate tissues have very distinct methylation profiles, there are ways to extract the tissue composition, actually, by looking at different populations of reads. So there's advanced tools that are trying to do that as well, but yeah, usually what you're getting is an aggregate of the methylation if you have a mixed tissue. So again, and I have that on the next slide, there are tools that really go in and extract the information of both the variants and the methylation status on these variants. Otherwise, again, you can input common variants and mask these regions before you do your DMR analysis, for instance, or something like that. Typically the output is going to be a straight output of the methylation status at these very locations; it's just that the levels maybe will vary more at the location of these SNPs, but you have to handle that as a post-processing step, in how you use the values there. Yes, are there questions? Okay, so we're coming towards the end, and this was, I promised, sort of the most challenging part and maybe the scariest part. Again, there's ways around it, and there's information that can be extracted, but if you thought Bismark was a challenge, this is even more so, when you're trying to extract the information on both strands in that way to detect potential SNPs and allele effects. So again, there are some tools that systematically pull out that information and use the trick that I talked about of using both strands, and there's some benchmarking on the accuracy, but again, for this to work you need higher coverage, so
this is what this is showing, that you need higher coverage to be able to really extract that information efficiently. Okay, but unless there's other questions, I'll move on now to the next parts, and maybe we'll finish a bit earlier for the coffee break so then we have more time for the lab. So we've done, as best as we could, the alignment step and the quantification of the DNA methylation, so now we can really move into the data visualization and some of the statistical analysis. Well, okay, so I sort of lied: this is another part that I find a bit confusing and challenging, the way these reads actually get displayed. So the browser that we're going to use, the IGV browser, is quite nice because it has a specific function to display bisulfite-treated reads. Part of this is because, and in the practical I think we'll see that, the reads have been mapped using Bismark, using the neat trick, but if you look at the raw reads on the reference, you'll see mismatches everywhere. So if you didn't know that this was bisulfite treated, you'd say, you know, what are these reads? So there's actually a function in IGV that allows you to display the reads knowing that these are bisulfite-treated reads, and knowing how to represent, you know, on the forward strand, if you have a C, that corresponds to a non-converted cytosine. So again, it uses the underlying information about the bisulfite treatment to display reads in a way that makes sense. So we'll go through that in the lab. So here again, a typical view of IGV, which is what we'll do, but we won't have normal and tumor, we'll actually have two replicates of a stem cell data set. What we have here, we're looking at a particular region of the genome. Here you have a profile; the methylation profile we'll have is a bit different, but down here
is really what we'll have, which are the reads. And then, instead of showing mismatches of the reads relative to the reference genome, what this is actually showing are methylated and unmethylated Cs. So what you see are the unmethylated or methylated Cs. I may have it upside down, so red is either unmethylated or methylated, or vice versa, I don't know; we'll go through it and see which one is which. And this is really, I guess, an example of what Martin says is no longer true in terms of what happens in tumors, potentially, but this is an example of a promoter that clearly has a different methylation state between normal and tumor. So again, we'll generate these tracks and navigate in the browser to look for this as well.

Another important thing, beyond looking at the overall data in the browser like this, which we won't do much in the lab but that's also useful, is, a bit like you did with Nisha yesterday, to visualize the global distribution of the parameters that you have. You can do additional analyses like clustering, but there are some basic steps first. I guess we will look at this a little bit in the context of looking at the output files of Bismark. You can generate graphs and look, of course, at the methylation values across the various CpGs: how many of the CpGs are highly methylated or unmethylated? What's the actual coverage across the CpGs, to give a sense of the robustness of the estimates of the methylation status? So getting these general statistics tells us whether the estimates and the experiment worked, and so on.

There are a number of packages and tools that allow you to do a lot of this post-processing. methylKit is one such example, where you can do these pairwise comparisons if you have multiple samples. This is now done across different cell types, so really looking at
again, the methylation patterns and differences in methylation patterns. So the key thing, again to ensure that your experiment worked, is that if you have this array of methylation values across all your CpGs, you can ask what the similarities are. You can do a lot of comparative analysis between your samples, whether it's clustering or PCA, and this is again across all of the CpG measurements. We'll have the output from the pipeline, where we would be able to look at these types of things.

So finally, and again I think we'll end the session a little bit early so that we have more time for the lab: all of this so far was done at the level of individual CpGs, and typically a lot of the analysis on methylation is not done on individual CpGs but is trying to look for regions that are differentially methylated. Here's a slide that shows this a little bit. Again, the data that we've generated at this point from the whole genome bisulfite sequencing pipeline is that, after the alignment, we extract these methylation profiles that tell us, for every sample, the rate of methylation at each of these CpGs. So it's really base-pair resolution: you'll have Cs that are unmethylated and Cs that are much more highly methylated. It's useful, as we have in this example, to have replicates to be able to see the level of variability.

From this you can definitely do single-CpG analysis to identify positions that are differentially methylated. You can do standard statistical tests to estimate which ones are higher in the cases, and get individual p-values associated with individual CpGs in terms of methylation. What happens, typically, what actually
gets done, because there is sometimes some variability at individual CpGs, is to also look at regions. You can use a fixed-window, tiling-style approach, where you identify regions that are differentially methylated between cases and controls in this case, but it could be between two different tissues. Or you could do the same thing but input specific regions of the genome: if you have annotation, maybe from an orthogonal technology, about the locations of enhancers and promoters, you can build this type of map and then ask, from the individual CpGs in those regions, whether there are regions that are differentially methylated.

And it's just that, and I'll get to that, depending on your coverage you might not have very precise estimates at individual CpGs, because you need to have multiple reads covering the CpG to be able to have an accurate methylation state. So you might not have great resolution, you might not have very precise estimates of methylation status, when you're looking at individual CpGs, but once you start combining them in windows or in regions you might have enough statistics to be able to make calls and do the differential methylation analysis.

Yeah, so I have one here, like BSmooth. There are a number of tools, and I think I had that in the initial slides as well, that can do this differential methylation analysis. Again, one challenge, and this highlights it, is that there's variability between different samples. So the amount of smoothing you need to do, your resolution, might also depend on the level of variability of the region. You'll have some regions, like this is highlighting a region in red that's actually quite variable between
the individuals, versus the other region, which is a little bit more consistent. But again, having replicates is needed here.

This idea of not necessarily relying on the individual CpGs to make the call is partly because, as we talked about, there might be a rare variant at a particular location that's throwing off the methylation estimates, or something like that. So in some cases the estimates coming from a single base pair might not be very reliable, but if you're looking at a smoother profile from that data, you'll probably have more robust detection of regions that are differentially methylated, as opposed to looking at any individual point, which, for a number of reasons including technical reasons associated with the mapping and various artifacts, might be thrown off. So that's another reason to do some smoothing.

So the hope is that, starting from the raw methylation profile that you get from the whole genome bisulfite sequencing, you can process this data to identify, and that's one of the typical goals, regions that are differentially methylated, for instance between samples. That's what you see here. This is an example in development, so different cell stages, and you see that in the iPS, in the embryonic stem cells, you actually have lower methylation in these regions. So these define some of these differentially methylated regions. Whether it's between different cell stages, between different tumor stages, or different tissues as we were talking about, those are the types of comparisons that you would want to be doing using this data.

I talked about this at the very beginning and I said I would get to it: one important question is what is the
coverage that's required. This is an interesting paper that looked at this question in detail and that I think is quite helpful. On the left side of this figure, you've got different replicates of the same cell type. What they did is they did high-coverage whole genome bisulfite sequencing and then they downsampled to see the effect of having less coverage. You see that the overall correlation is good between the tissues and the replicates of the same cell type, but what happens if you start downsampling a little bit?

One nice thing to look at, for instance, is this one here. This is where you increase coverage, and it determines what type of differentially methylated regions you're able to pick up. If you only have 1x, or 1x to 5x, coverage, what you see is that the types of DMRs, differentially methylated regions, that you're able to detect are these bigger DMRs, the DMRs that are in the range of 1 kb. So even if you have 1x coverage, you are able to detect some DMRs that are large and that tend to be highly different in terms of methylation. As you increase coverage, what you gain are more subtle regions that are differentially methylated, more subtle both in terms of the size, so you're able to go down to much tinier regions, and in that the difference in methylation is also smaller. So for some applications you could argue that you pay a high cost to get just a little bit more information, at some level, with the 30x coverage.

This is another analysis, looking at the false discovery rate, which decreases, as you would expect, as you increase coverage. So obviously the more coverage you have, the more accurate you are and the smaller the regions you're
able to detect, but it's a balance between cost and what you can afford to do.

So I think I'm almost done. Conclusions: bisulfite sequencing analysis is not easy. Again, I'd call this sort of an advanced bioinformatics module. You need to choose the appropriate DNA methylation technology, so hopefully I gave you some strategies to pick the appropriate one based on your needs. You need to check for quality and watch for biases; we touched on that a little bit, but that's really quite critical. And, really important, there's a multi-step analysis workflow, which is what we're going to be doing in the lab.

And just before we break, I guess a preview of what you'll see in David's presentation. Whether you have some of your own data, or there are also lots of available data sets out there, what we're going to be doing in the lab is really a dumbed-down version of one of these data sets that's much smaller, so that the processing will be reasonable. Another challenge, which I didn't mention at all, is that, especially with whole genome bisulfite sequencing, the processing is quite intensive, because you have all these extra steps of converting all your reads into these other reads, mapping multiple times, getting four alignments, and then pulling them all together. So for whole genome bisulfite sequencing we're going to do a mini version that's restricted to some regions of the genome. Doing it with the full 30x data set is computationally very intensive because you're mapping multiple times.

But with that I'll take, I guess, a few questions if you have any. Yeah? So what kind of... I didn't quite get... oh, smoothing. So I guess there's lots of different
approaches for smoothing. There are basic approaches that really don't do much except take a sliding average as you go through, combining multiple CpGs. So I guess a very standard approach is to just use sliding windows, either of fixed size or, because the density of CpGs varies quite a bit, sliding windows on the number of CpGs, which is also a popular smoothing approach. So you take five CpGs and you have this sliding average as a smoothing strategy. But there are more advanced smoothing strategies as well. But yeah.
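
As a rough illustration of the two sliding-window strategies mentioned in this answer (a window over a fixed number of CpGs, and a window of fixed genomic size), here is a minimal Python sketch. The function names, the example positions, and the methylation values are all made up for illustration, and this is not how BSmooth or any other specific tool implements smoothing.

```python
from bisect import bisect_left, bisect_right


def smooth_by_cpg_count(levels, k=5):
    """Centered sliding average over k consecutive CpGs.

    k is assumed to be odd; the window simply shrinks at both ends
    of the list, where fewer neighbours are available.
    """
    half = k // 2
    smoothed = []
    for i in range(len(levels)):
        lo = max(0, i - half)
        hi = min(len(levels), i + half + 1)
        window = levels[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def smooth_by_distance(positions, levels, half_width=500):
    """Sliding average over all CpGs within half_width bp of each CpG.

    positions must be sorted in increasing order and be parallel to levels.
    The number of CpGs per window varies with the local CpG density.
    """
    smoothed = []
    for pos in positions:
        lo = bisect_left(positions, pos - half_width)
        hi = bisect_right(positions, pos + half_width)
        window = levels[lo:hi]  # never empty: it always contains the CpG itself
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    # Hypothetical CpG coordinates (bp) and per-CpG methylation fractions.
    positions = [100, 120, 150, 900, 950, 2000]
    levels = [0.9, 0.8, 1.0, 0.1, 0.2, 0.5]

    print(smooth_by_cpg_count(levels, k=3))
    print(smooth_by_distance(positions, levels, half_width=100))
```

The count-based window always averages the same number of measurements, so it mixes the dense cluster with the distant CpGs at its edge, while the distance-based window keeps the isolated CpG at 2000 on its own. A real analysis would typically also weight each CpG by its read coverage, which this sketch leaves out.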