So I'll be going over whole-genome bisulfite sequencing and analysis, and more generally bisulfite sequencing and analysis. As part of the learning objectives of the module, I'm not just going to cover bisulfite sequencing analysis: I'll first give some background on DNA methylation, and then go over the different technologies that can be used to measure it, along with some of their strengths and weaknesses. After that, I'll jump into the main topic, which is really bisulfite sequencing analysis, whether whole-genome bisulfite or capture bisulfite sequencing. This being a workshop on analysis, most of my presentation will be about understanding the different steps of the analysis, what the challenges are, and so on. As a heads-up, I do a lot of different types of analysis, and methylation analysis is one of the hardest. We'll go through it step by step, but you'll see that because of this conversion, it's quite a challenge initially to wrap your head around. We have plenty of time to go over that. But before I start: who's familiar with methylation to begin with? Easy enough. How about bisulfite conversion? OK, not bad. And using arrays or sequencing? So arrays, OK. And sequencing, OK, not so bad. So maybe you can explain it to me, because I have a hard time understanding sometimes. I'm kidding. We'll also cover how to visualize this type of data and how to identify regions that are differentially methylated between different samples. Before I start: all of the slides are the same slides you have in your folder, but I've added a few slides to follow up on some questions we had yesterday.
For this workshop, we're using the Compute Canada cluster, which after a little hiccup was, I think, working fine. It was exciting yesterday morning when it wasn't, so hopefully it'll be smooth today. This is really the resource we're using. One of the reasons why: not only is it accessible and free if you're a Canadian academic, but many of the tools, pipelines, and genomes you need for most bioinformatics analyses are already installed. As was mentioned yesterday, if there are genomes or tools that are missing and you'd like us to install them, feel free to just send us an email. And David, in his presentation this afternoon, is going to talk a bit more about some of the other tools associated with the Compute Canada resources. Again, if you have any questions or want to know what's on Compute Canada, feel free to visit this website. There was a question yesterday about who can actually get an account. I don't have a full answer, they didn't quite get back to me, but I know that when you request an account you can select this: definitely, if you're a faculty member at a university, you can request an account, and then students can get an account as part of that faculty member's group. But I think these other categories also apply, not-for-profit for instance, and in this case the NRC, which was asked about, is even listed here. So I don't know the specific rules, but it's definitely worth exploring, and it seems like there is an option even if you're in a governmental lab. Okay, but that was an aside on Compute Canada and the resources we're going to be using. So, off to the main topic: DNA methylation. First, a brief recap of what DNA methylation is. The most common form is 5-methylcytosine methylation, which affects 70 to 80% of the CpGs in the genome.
Even though there are other types of methylation, I'll focus my presentation on this main type. You'll see in the practical that we'll get all sorts of other outputs; I won't go into those in too much detail. I'll really focus on the common form of DNA methylation, but many of the principles obviously also apply to the other types. Some of the basic things known about DNA methylation at a high level: 5-methyl-C in CpG-rich promoters is strongly associated with repression, so promoters with high methylation tend to be repressed. In CpG-poor regions, that relationship is a bit more complex, and we'll see some examples of that. In terms of the basic principle, you have the DNA wrapped around the histones, and DNA methylation is the modification where a cytosine carries this particular mark, 5-methylcytosine. It's a reversible process, in some cases the methylation is lost again, but typically it's a pretty stable mark. So why study methylation? Why is it interesting? Well, as I just mentioned, it's perhaps the only epigenetic mark that's really been demonstrated to be retained through mitosis, and it's been shown to be very important in a number of processes: genomic imprinting, transposon silencing (methylation is one of the main mechanisms used for that), stem cell differentiation, development, inflammation, and cancer as well. So really a very important chromatin mark to be able to assess and understand. Finishing up my very brief introduction on methylation, here's an example of the canonical way of thinking about what's happening with methylation.
So in the context of a disease like cancer: you have promoters that are hypomethylated, no methylation in the promoter, so these are active genes, and there's some methylation within the body of the gene being transcribed, which prevents abnormal initiation. In the cancer disease state, you might get abnormal methylation in the promoter, which turns the gene off, and the gene-body methylation is completely disrupted: you lose the methylation within the gene, and this leads to all sorts of abnormal transcripts initiating within the gene. This is a simplified view, but it shows the kinds of patterns we expect to see in disease in some cases, relative to gene expression. At the bottom you have some of the other examples I mentioned, where methylation is typically used to shut down or control transposable elements. Again, in a disease state you might lose that methylation, which leads to a reactivation of the repetitive sequences and the transposons. So that's why we want to be able to profile the chromatin and the methylation state, and how it potentially changes between normal and disease states. Okay, so that's my short recap on methylation, in particular 5-methylcytosine methylation. Now, how do we assess methylation? There are three broad categories of methods to measure DNA methylation: bisulfite-converted microarrays, which rely on the conversion I'll get into; the enrichment-based methods, where you select DNA fragments enriched for the methylated (or alternatively the unmethylated) state and then measure using next-generation sequencing; and direct bisulfite sequencing, including whole-genome bisulfite sequencing. A key step for most of these approaches is the bisulfite treatment, and this is the key thing that's also going to affect the downstream analysis quite a bit.
If you're not familiar with this, this is really the key principle that's going to affect the data analysis and lead to all sorts of funny things on the analysis side, because we're actually modifying the DNA of the reads we're going to be sequencing. So this is going to have a big impact on the analysis downstream. Here's a toy example that shows the effect of this chemical treatment on DNA, and this is also what allows us to identify sites that are methylated. If we look at the unmethylated state first, the bisulfite treatment will convert all the Cs to Us, which, once we sequence, will be read as Ts. So in the unmethylated state, if the bisulfite treatment converts fully, all the Cs will be read as Ts, and we'll need to take that into account later. The difference, and the reason we're able to use this to identify methylated states, is that methylated Cs are protected from this conversion and remain as Cs. So by analyzing the reads that come out of this, we'll be able to distinguish bases that were methylated from those that were not, just because the methylated Cs remain as Cs and the unmethylated Cs are converted to Ts. That's the basic principle that will follow us through the whole morning. So how is that going to affect the analysis? The steps are very similar to what we heard yesterday, but now we have this funny twist where some of the bases are converted and others are not. So how do bisulfite microarrays work? Again, I'm not going into a whole lot of detail; I just want to give you a sense of these alternative technologies. We'll be focusing on bisulfite sequencing analysis, but I want to show you the differences.
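The conversion principle described above can be sketched in a few lines of Python. This is a toy model for intuition only, not any real tool's implementation: unmethylated Cs read out as T after conversion and sequencing, while methylated Cs stay C.

```python
def bisulfite_convert(seq, methylated_positions):
    """Toy model of bisulfite conversion followed by sequencing.

    Unmethylated cytosines are converted C -> U and read out as T;
    methylated cytosines are protected and remain C.
    """
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            out.append("T")   # unmethylated C reads as T
        else:
            out.append(base)  # methylated C (and all other bases) unchanged
    return "".join(out)

# The C at position 1 is methylated, the one at position 5 is not:
print(bisulfite_convert("ACGTACGA", methylated_positions={1}))  # -> ACGTATGA
```

Comparing the converted read back to the reference is then what lets you call each C's methylation state.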
So with bisulfite microarrays, you prepare the DNA, you do the bisulfite conversion step, and then you hybridize onto a microarray that specifically interrogates some of these CpGs that are potentially methylated; that's how we measure the methylation state. It's quite similar to genotyping microarrays in many ways, except that we're specifically targeting the potentially methylated Cs, and in many cases we're looking at differential methylation using these arrays. To give you a sense of what you get out of these microarrays: there are now microarrays that are denser than this one, but it will still give you an idea of what these microarrays are profiling and what they give you. Down here you have the HOX locus, lots of genes, and you see all of the CpGs in the genome, which are found pretty much everywhere. On top of that, another thing that's important in the context of methylation are the CpG islands: regions with particularly dense CpGs. They can be defined in slightly different ways, but they are regions with lots and lots of CpGs that can potentially be methylated. CpG islands are important because they are typically, or at least frequently, found in the promoters of genes, and help define whether a gene can or cannot be expressed. The microarray probes, by contrast, are pretty sparse. This is an older array with just 27,000 probes; you now have arrays with half a million or sometimes more probes, but it's still a very sparse map of the genome. For each of these probes, after data analysis and normalization, you get these methylation ratios. So these would be CpGs that are not methylated, and others that are methylated at much higher levels. It's a technology that works great and has been around for a number of years, but as we'll see compared to some of the other approaches, the methylation values you get are quite sparse across the genome.
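The methylation ratio for an array probe is usually reported as a "beta value" computed from the methylated and unmethylated signal intensities. A minimal sketch, assuming the conventional form with a small stabilising offset (100 is the value commonly used for Illumina arrays; real pipelines apply normalization steps before this):

```python
def beta_value(meth_intensity, unmeth_intensity, offset=100):
    """Methylation ratio (beta value) for one array probe.

    The offset keeps the ratio stable for low-intensity probes.
    Values near 0 indicate an unmethylated CpG, near 1 a methylated one.
    """
    return meth_intensity / (meth_intensity + unmeth_intensity + offset)

print(round(beta_value(200, 5000), 2))   # low beta  -> mostly unmethylated
print(round(beta_value(5000, 200), 2))   # high beta -> mostly methylated
```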
Again, without going into all the details, there are tons of tools that can be used to analyze and process bisulfite microarray data. Illumina GenomeStudio, for instance, is proprietary software, but given that many of these arrays come from Illumina, it's quite popular. There are also open-access tools like RnBeads that are quite popular as well, where you can basically reprocess and re-analyze the data. So that's some basics on microarray data and what you get out of it. Any questions so far? All good? Caffeine slowly flowing in. Yeah. Some of these tools can be used with some of the other technologies too; the normalization steps can be a little different, but they often have features to compare samples and things like that, so many of them are useful even with sequencing data. Yep? Well, a preferred one? Typically we use the normalization that's part of GenomeStudio as one of the options, but I think the technology is quite established, and most of it is quite similar to genotyping arrays as well. Okay, so moving on to the enrichment-based approaches. Here you'll see that there are really different ways you can do this enrichment, and there are actually lots and lots of different enrichment-based methods; I'm only covering a few of the main ones to give you a sense of how this can be done, but there's much more than this. So MeDIP-seq is one of these strategies, where you sonicate the DNA and prepare the library, and the key is that the enrichment uses an antibody against 5-methyl-C. It's very similar to some of the things you did yesterday with ChIP-seq, except that this time the pull-down really targets 5-methyl-C.
After that, you do library amplification and high-throughput sequencing. So here there's no bisulfite conversion; you really have an antibody that recognizes that particular state. That's one strategy, called MeDIP-seq. Another strategy that's quite similar, in the sense that there's also no bisulfite conversion, is to enrich for DNA that's bound by methyl-binding-domain proteins. Here you're enriching your DNA pool by targeting these methyl-binding proteins, then washing, preparing the library, and sequencing. So these are just slightly different ways of preparing the DNA such that you're enriching for regions of the genome that are methylated. Yes? Well, I'll get to that just a little bit later. Another approach, again to enrich, because the problem is, as we'll see when we get to the whole genome, if you don't do enrichment you have to sequence the whole genome, basically, which ends up being quite expensive. So another strategy to target regions that are methylated is called RRBS. This is a strategy where you prepare the DNA by digestion using an enzyme, which is what does the enrichment; so here there is a bisulfite treatment step, but the enrichment comes from the enzyme digestion. So how do these methods compare? This is an analysis that was done a little while ago, but here you have profiles from these three different enrichment-based technologies, which I have a hard time reading myself: MeDIP on top, MethylCap here, a different version, and RRBS, which I went over quickly, compared to the other two. The other two are really enrichment approaches, so it's really like ChIP-seq.
You're enriching using an antibody, and you're getting DNA fragments from regions that are methylated. RRBS is a little different because you're enriching for particular DNA fragments, but then you do the bisulfite treatment and look at the bases that are converted from C to T. That's why the RRBS profile down here is a little different; we'll get back to that when we look at bisulfite sequencing. But the other profiles really look like ChIP-seq, like what you were looking at yesterday, except that we're getting DNA fragments that are methylated. So at this level, at least, the profiles are quite consistent across the different technologies. With RRBS, you're getting reads, that's what you see in pale blue, and for each of them you're able to see (I mean, it's hard to see here) whether each base was protected or not, and so assign it a methylation state. So at a high level, all of these methods are quite consistent, and they're also consistent with the array-based approach. But as we'll see in the next slides, even though they're quite consistent overall, the different approaches do have strengths and weaknesses and differ in what they cover. Look, for instance: this is a very strong, clear peak, but it wasn't covered on that particular array, so you'd miss it. So there are differences between the different technologies, and I'll get back to that. Similar to the microarray-based approach, you have a lot of tools, and again, this is not all of them. You have some of the standard tools for mapping, especially for bisulfite sequencing approaches, and then different tools specialized for the different technologies that handle data normalization.
Take this one, for instance, for MeDIP: these tools are fine-tuned to normalize the data taking into account the specifics of the different technologies. A whole workshop could have been about the enrichment-based datasets or the microarray-based normalization, for instance, but that's not what we're going to cover directly. This is really one of the key slides, and again it's a bit dated, but it still gives you, I think, one of the main take-homes about these technologies: they're actually quite consistent in what they provide, but the big difference is what they cover in the genome. If we start with the microarray, down here: this is again an older version, and the microarrays now cover much more, but you can see they cover only some of the CpG islands, mostly promoter CpG islands, and across the whole genome you have very few probes. So the microarrays are quite stable and cover gene promoters well, especially the more recent ones, but genome-wide they really don't cover much. RRBS, which, if you remember, uses an enzyme that cuts the DNA, so you get reads from regions of the genome that are methylated: you get very, very good coverage of CpG islands, which is why you're getting data from these CpG islands, and very good coverage of promoter regions. Overall you're not covering much of the genome, but you could argue that you're getting much of the genome that is methylated, so RRBS is interesting in really concentrating the reads that you're converting and then analyzing on those regions of the genome, though it's still not complete, perhaps.
[Audience:] You're describing RRBS as a means of enriching for methylated regions; I thought it was a means of enriching for CpG-rich regions, whether methylated or not. — Correct, correct, sorry, yes. But those are the regions that are interesting, especially for this type of methylation, since we're focusing on 5-methyl-C. Correct, I didn't say that well. If you look now at the other two approaches, which actually do target regions that are methylated through a pull-down, either with the 5-methyl-C antibody or the methyl-binding proteins, you see that you get very good coverage overall. One of the challenges with these is that you lose base-pair resolution: you don't know which Cs are methylated, you just get an aggregate profile. I didn't go over that in much detail, but here what you're getting are DNA fragments enriched in your pull-down; you don't necessarily know which Cs are actually methylated within them, you're just getting fragments. So these methods have the advantage of covering quite a bit of the genome, you're getting reads everywhere, but you don't have the same level of resolution for knowing exactly what the methylation is within the regions being pulled down. They also have other challenges, and we'll get back to them, but one of the big differences between these technologies is really the fraction of the genome they cover. So, the last technology, the one that has many advantages but one big disadvantage, which is the price, which I have here.
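To make the RRBS point concrete: RRBS typically digests with MspI, which cuts at CCGG regardless of methylation, so size-selected fragments concentrate in CpG-rich regions. Here is a toy sketch of that reduced-representation idea, with made-up fragment size bounds; real protocols do this with the actual enzyme and gel-based size selection.

```python
import re

def rrbs_fragments(genome, min_len=40, max_len=220):
    """Toy reduced-representation digest: cut at every MspI site (C^CGG)
    and keep fragments in a sequenceable size range. Because MspI cuts
    CCGG independent of methylation, the kept fragments concentrate
    in CpG-rich regions of the genome."""
    cut_points = [m.start() + 1 for m in re.finditer("CCGG", genome)]
    bounds = [0] + cut_points + [len(genome)]
    fragments = [genome[a:b] for a, b in zip(bounds, bounds[1:])]
    return [f for f in fragments if min_len <= len(f) <= max_len]

# Two close-together CCGG sites yield short, kept fragments;
# the long CpG-poor stretch at the end is discarded by size selection:
toy_genome = "A" * 50 + "CCGG" + "A" * 100 + "CCGG" + "A" * 300
print([len(f) for f in rrbs_fragments(toy_genome)])  # -> [51, 104]
```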
You see here that one of its advantages is a very simple protocol. That doesn't only mean it's easy to do; it also means it has fewer biases and fewer constraints compared to the other methods, which depend on the antibody or on the performance of the RRBS protocol, for instance. So this is a very straightforward approach: you shear the DNA, you take all of it, you do the bisulfite treatment, and then you do the high-throughput sequencing. Here the drawback is really the cost. Another technology, which I actually added just recently (I'm going back and forth between enrichment-based and whole-genome bisulfite sequencing), is quite similar to whole-genome bisulfite sequencing and very similar to exome capture sequencing. Exome capture, if you know how that works, is like whole-genome sequencing with an extra step where you capture DNA fragments coming from specific regions of the genome, the exome. The approach here, MCC-seq, which was developed at our center, is quite similar: you prepare the DNA, you do the bisulfite treatment, but then, using a capture array, you capture DNA fragments from regions of interest. It's a way of reducing the cost of sequencing by targeting only the regions you're interested in, and you get to design and choose which regions to consider. We use this, for instance, to capture not just promoters but also known enhancers and so on. So it's just a twist on whole-genome bisulfite sequencing. Yes? So, actually, there are two variants of this protocol: you do the capture either before or after the conversion. There are two different technologies, and I'm forgetting which one this is.
So you can do it either way; there are two different approaches. Okay, so that's the end of my introduction to these different technologies. These are now the tools that can be used for whole-genome bisulfite sequencing, and for bisulfite sequencing technologies in general. I mentioned a few approaches: RRBS, MCC-seq, whole-genome bisulfite sequencing, and there are others. Once you're getting these bisulfite-converted reads, how do you actually do the analysis? You see here some of the tools we're going to be using and describing in more detail, Bismark and Bis-SNP, but there are many others, LAST, Pash, lots and lots of different approaches. Again, we'll go into a subset of these tools in more detail. These are really tools to take the bisulfite-converted reads, map them onto the genome, and assess the methylation status. Just to summarize some of what I've said about the strengths and weaknesses of these different approaches: they all provide reasonably accurate DNA methylation measurements. Microarrays are lower cost and can easily provide accurate measurements across lots of CpGs in the important regions, so they're quite valuable in that sense, and most of the datasets available at this point are of this type, I would say, because they really target regions where we know methylation is important. The enrichment-based methods, well, the first two, the pull-down-based ones, have relatively low resolution, as I said, and they can really be a challenge to analyze and normalize, because all sorts of factors play into the measurements you're making. So they can actually be quite challenging. The bisulfite-based methods have the advantage of providing true base-pair resolution. They also have some drawbacks, and we'll get into those when we look at the analysis tools.
Again, the fact that these reads have converted bases also leads to funny biases in the analysis. And then the main factor: the enrichment-based methods would not exist if sequencing were so cheap that whole-genome bisulfite sequencing was affordable and clearly the way to go. So there's currently a slow shift from these approaches, which had some advantages, toward whole-genome bisulfite sequencing, which has fewer biases but really involves sequencing the whole genome. In a whole-genome bisulfite sequencing experiment, you have to sequence the whole genome at significant depth to be able to call the methylation state accurately, so it ends up being even more expensive than sequencing a regular whole genome. Any questions on this? Yes? So, ideally paired-end, especially because, as we'll see in the analysis, the challenge is that with effectively only three bases you're losing information. Longer reads and paired-end reads help you quite a bit, because otherwise the genome is even more ambiguous than it already is: it's already ambiguous in certain regions, and now that you're losing information on one of the bases, it becomes quite a challenge. So read length is a big factor. Okay, you're all warmed up and ready to jump into the challenging analysis of bisulfite data. Here's the outline of a typical analysis workflow for bisulfite sequencing data. We'll go over each of these steps in some detail, and in the practical we'll also do some of these steps, not all of them in much detail, because that would take even more time than we have. But at least in the lecture part we'll really cover all of the relevant steps. You'll see that there's some initial processing of the data, some data visualization, and statistical analysis.
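The three-base ambiguity mentioned above can be shown in two lines. Bisulfite aligners typically convert reads (and the reference) to a reduced alphabet in silico before mapping; this toy illustration shows why that costs mappability, since distinct genomic sequences can become identical after conversion:

```python
def to_three_letter(seq):
    """In-silico C -> T conversion of a sequence, the kind of reduced
    alphabet bisulfite aligners map in (a simplified illustration)."""
    return seq.replace("C", "T")

# Two distinct genomic sequences become indistinguishable after conversion,
# which is why bisulfite reads map more ambiguously than regular reads:
a, b = "ACGTCCGT", "ATGTCTGT"
print(to_three_letter(a) == to_three_letter(b))  # -> True
```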
And then we'll end with some downstream analysis to identify regions that are differentially methylated between different samples. Okay, so let's get started with the initial quality control and pre-processing. This was true yesterday and is also true with bisulfite data: before you start an analysis, it's really important to look at your raw data to get a sense of whether your samples were sequenced using the same protocol and, ideally, the same instruments, and whether there are any technical issues affecting some of the samples. This is going to be especially important if you're doing sample-to-sample comparisons, which is usually the case. Very similar to what you did yesterday, it's a very good idea to run quality control. Even though some of the bases have been converted, there's not much difference in sequencing bisulfite-treated libraries compared to regular ones: you can run them on the same instruments, and you can use some of the same tools as yesterday to look at the overall properties of your reads, the quality, and so on. We won't do that in the practical, but it's something you should know how to do if you followed along well yesterday. The base-composition profiles of the converted reads can be a bit different from regular DNA reads because of the conversion: you get many fewer Cs, for instance, since this is after the reads have been converted. It's good to look at these profiles and make sure, in particular, that different samples give similar profiles. I'll get back to specific things to look for in terms of quality metrics. Going back to this: this is a good profile, most of the read is of good quality, and then, as you'd expect, the quality degrades toward the end. Maybe you'd want to trim the reads a little to improve the read quality. Yeah, yes?
So this is an image of the flow cell itself, displaying the quality scores of reads at different positions on it. That's right, more yellow, or more red in some cases. Sometimes you see that all the reads coming from one particular position, or a whole lane, are of bad quality. I had another example where, well, you see here that there's one particular region where some of the reads seem to be of bad quality. This is just for troubleshooting why you have some reads of bad quality: most of the reads might be good quality, but then you have a subset that's horrible, and they're all coming from one side or something. Then you can go back to the sequencing facility and tell them maybe they need to tune the sequencer, because you're losing, say, 10% of your reads coming from some regions. So it's more for troubleshooting. If you do identify that some reads are problematic, with lower quality, there are different trimming strategies you can use. There's simple trimming, where you scan from the 5' end or the 3' end and, as soon as the quality falls below a particular threshold, you trim the read; and there are some slightly more advanced trimming strategies. There are papers that benchmark the different trimming strategies; in the end, if you have reasonable-quality data, it's not going to make a huge difference which specific strategy you use, but if you have lower-quality data, maybe that's something worth exploring and checking in more detail. I borrowed this slide from somebody else, but it's an example of a pretty horrible dataset, where you see (if you remember, this is base position versus the average quality across all the reads) what follows.
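The simple trimming strategy just described, scanning in from the 3' end and cutting once the quality falls below a threshold, can be sketched like this. It's a minimal illustration; real trimmers such as Trim Galore use more elaborate rules (adapter matching, sliding windows, and so on):

```python
def trim_3prime(seq, quals, min_q=20):
    """Simple 3' quality trimming: drop bases from the 3' end for as
    long as the Phred quality stays below the threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

# Quality degrades toward the 3' end, as in a typical Illumina read:
seq, quals = trim_3prime("ACGTACGT", [35, 34, 33, 30, 28, 15, 10, 2])
print(seq)  # -> ACGTA  (the low-quality 3' tail is removed)
```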
So you see that these reads are horrible, all over the place, lots of reads of very bad quality. You can try to salvage your reads a little, but typically, if you have to do this amount of trimming, it's not a very good sign; hopefully you don't have to do this and your sequencing quality is better. So if you start like this, what do you see after the reads were trimmed? Well, here, this is the overall distribution of quality, and you see there was a big, big bunch of reads with basically zero quality. So they probably removed half the reads or so, maybe even more than 50%, and you're just left with a much smaller subset. The worst case would be if you have different samples, some with a very bad profile and some with a very good profile, because then there might be other effects from trimming a lot of the reads in one experiment and not so much in the other. What you're hoping is that it's quite homogeneous across your different samples; otherwise it's probably going to affect the differential methylation analysis you do downstream. You had a question? Oh. Okay, so that's read quality and trimming; I'll get back to some of the key things to look for in terms of quality metrics, but older datasets are generally more problematic than some of the newer datasets. So, another important point that we also touched on a little yesterday, but that's important for DNA methylation. Here's, I think, a good example of it: if you have a complex library, meaning you have plenty of distinct DNA fragments that you've sequenced, you'll get a distribution a bit like this where the reads are all spread out, and then you don't have much of a problem. The problem comes if you have a library of low complexity that was then amplified.
Then you have lots and lots of fragments that are actually the same fragment sequenced multiple times. The problem with that, as we'll see later on, is that we're going to use whether the C is converted to T to estimate the methylation state: we're going to count how many times we observe a particular position as being methylated or unmethylated. But this counting goes wrong if we're counting the same molecule multiple times. For instance here, look at this one: we have two copies of this read that's unmethylated, and then lots of copies of this read that shows up as methylated. So we would estimate that that C is methylated, say, 70% of the time, but that estimate is being thrown off by the fact that we have lots and lots of duplicates of that read. It's even worse here, where we're saying this is methylated only 17% of the time, but again, it's because we're more than double-counting. The idea of removing duplicates before making these estimates is that this is just one molecule that was sequenced multiple times, so you should only keep one representation of it; that's going to improve our methylation estimate. So, remove duplicated reads, especially if you have a library that's not very complex. Typically, whole genome bisulfite sequencing should be more complex; some of the enrichment-based strategies are sometimes less complex and you have more duplicates. So definitely, the duplicate rate is one of the quality metrics to look for as well. And if you do have a lot of duplicate reads, you need to remove them, and that's going to affect your estimates, yes. It's really at the level of counting that you detect duplicates: how many times do you have reads starting and ending at exactly the same position?
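To make the double-counting concrete, here's a toy sketch (not any real deduplication tool's logic) where duplicates are collapsed by identical start/end coordinates before estimating the methylation level at one CpG:

```python
# Toy sketch: how PCR duplicates skew a methylation estimate, and how
# collapsing reads with identical coordinates fixes it. Reads are
# (chrom, start, end, state) tuples, where state is the observed base
# at one CpG: 'C' (protected, methylated) or 'T' (converted, unmethylated).

def methylation_level(reads, dedup=False):
    if dedup:
        # keep one representative per (chrom, start, end) — one molecule
        seen = {}
        for r in reads:
            seen.setdefault(r[:3], r)
        reads = list(seen.values())
    meth = sum(1 for r in reads if r[3] == "C")
    return meth / len(reads)

reads = [("chr1", 100, 150, "T"),
         ("chr1", 100, 150, "T"),          # duplicate of the read above
         ("chr1", 120, 170, "C")] + \
        [("chr1", 130, 180, "C")] * 5      # five copies of one molecule

naive = methylation_level(reads)              # inflated by duplicates
dedup = methylation_level(reads, dedup=True)  # one vote per molecule
```

The naive estimate counts eight "reads" when only three distinct molecules were observed, which is exactly the distortion described above.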
So it's always good to remove duplicates. You might have, say, 20% duplicates or something like that, which is fine, but it's still recommended to remove them. You're losing 20% of your reads, but you're going to get more accurate measurements of methylation. And depending on the capture strategy, it's not uncommon to have duplicate rates in that range. So, some of the things to look for when you're checking the quality of your libraries: the overall read quality; the presence of adapter sequences. I didn't cover that in the slides beyond this, but it's similar to what you saw yesterday: FastQC or other such tools will detect the presence of adapters. Same thing here. We talked quite a bit about checking the duplicate rates in your libraries and whether they're more or less homogeneous across your samples. If you have some samples with very high duplicate rates, then once you remove the duplicates you basically have fewer reads in those libraries, so maybe it's worth trying to redo them, if you still have material, so that the libraries are more homogeneous. Another thing I didn't discuss much, but which is actually quite important, is the conversion rate, because the bisulfite step is hopefully more or less complete, meaning that all of the unmethylated Cs should be converted. As part of these experiments there are usually spike-ins: DNA fragments that you know are fully methylated or fully unmethylated, which are used to estimate the conversion rate. You want a high conversion rate. Is that clear? I mean, I maybe should have had a slide on this.
So there are DNA fragments that are incorporated into your library, and they're unmethylated, and you can check their level of conversion to know whether the experiment actually worked in the end. Like a standard, right? That's right. And there you expect to be in the above-95% range of conversion, because otherwise, if the bisulfite step didn't fully convert the unmethylated Cs, it's going to limit your analysis. Yeah. Is there a widely accepted threshold for conversion rate? I think it's quite high; it's supposed to be above 95%. I don't know if there's a specific cutoff of 98% or 99% that's considered acceptable. Okay, so that was quality control and pre-processing. The next bit is the best bit. This is the part where I need to have had coffee to be able to understand it, otherwise it gets confusing. So here we go again: the bisulfite treatment step. Similar to the slide I showed before, you have methylated Cs and you have unmethylated Cs. After denaturation, you get the two strands separately. This is the same representation, with the methylated Cs and the unmethylated Cs. The bisulfite treatment, as we've been saying, will convert the Cs to uracil, and the methylated Cs will be protected. The part that gets confusing is that you really have to look separately at what's happening on the Watson and the Crick strands, because it's slightly different. The conversion will have different effects on the two strands, because on the negative strand the complement of the C is a G, which will not be changed. So you really have to consider the two strands separately. So that's the conversion itself. Now, how do we actually analyze the reads we get after this? There are three different strategies for aligning, or processing, these reads.
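As a toy illustration of how a spike-in conversion rate is computed — assuming an unmethylated control such as lambda phage DNA, where every C should read as T after treatment:

```python
# Sketch of estimating the bisulfite conversion rate from an unmethylated
# spike-in. Every C in the spike-in should be converted to T, so the rate
# is simply converted / (converted + unconverted) over all known C positions.

def conversion_rate(observed_bases):
    """observed_bases: bases read at known spike-in cytosine positions."""
    converted = observed_bases.count("T")
    unconverted = observed_bases.count("C")
    return converted / (converted + unconverted)

rate = conversion_rate(["T"] * 990 + ["C"] * 10)  # 99% of Cs converted
```

A fully methylated spike-in works the same way in reverse: it estimates over-conversion, i.e. how often protected Cs were wrongly converted.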
The first two are really the main ones; I added the third as a sort of upcoming bonus. The third one, which we'll see at the end, is not nearly as standard or developed. So the two main ones are really these two approaches. Whoa, let me just — there's something weird. Let's just look at the slide like this for a second. So, the wildcard aligners: because these Cs will sometimes be converted and sometimes not, you replace the Cs in the reference genome with a wildcard letter, Y, which matches both C and T. So you sort of mask the Cs in the reference, you put in the wildcard, and the aligner will align to it no matter what. You can also modify the alignment scoring such that a mismatch against a C in the reference doesn't count. Again, here you have reads that sometimes have a C and sometimes have a T at that position, so you put wildcards everywhere the reference has a C, either by physically changing the reference or by changing the alignment score, so that you're able to align the reads even if there's a mismatch. Otherwise, all the Ts will look like SNPs or variants, and you're going to lose a lot of reads. Software that uses this strategy includes BSMAP, GSNAP, Pash, and so on. So that's one broad category of strategy. Let's try to go back to this. The other strategy is the three-base aligners, where you convert all the Cs into Ts, both in the reads and on both strands of the genomic DNA sequence. The software tools that use that are Bismark and BRAT. Again, I think going through an example will hopefully help a little bit. It's sort of hard to wrap your head around this, but looking at an example might help. So here, at the top, we have the overall example of what's happening.
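A minimal sketch of the in-silico conversions a three-base aligner like Bismark performs — this is an illustration of the idea, not Bismark's actual code. Reads are C-to-T converted, and the reference is prepared in both a C-to-T and a G-to-A version so reads from either strand can be matched:

```python
# In-silico conversions behind a three-base aligner: the reads are fully
# C->T converted, and the reference is prepared both as a C->T version
# (for Watson-strand reads) and a G->A version (for Crick-strand reads),
# so a converted read matches regardless of its original methylation state.

def ct_convert(seq):
    return seq.replace("C", "T")

def ga_convert(seq):
    return seq.replace("G", "A")

ref = "TACGGATCCG"
read = "ACGGATC"            # read carrying unconverted (methylated) Cs

ref_ct = ct_convert(ref)    # Watson-strand reference, three-letter alphabet
ref_ga = ga_convert(ref)    # Crick-strand reference, three-letter alphabet
read_ct = ct_convert(read)  # converted read, aligned against ref_ct
```

Because both the read and the reference are collapsed to three letters, a read aligns the same way whether its Cs were protected or converted, which is exactly why this strategy avoids the methylation-dependent mapping bias, at the cost of more ambiguous (discarded) reads.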
So you've got an example of a CpG that's 100% methylated, a CpG that's 50% methylated, and a CpG that's unmethylated. Each of these will be converted, and these are example reads: if the read comes from this part, because it's 100% methylated, the C will be protected; but if the read comes from this part, the C will be converted into a T. So that's just the motivating example. The wildcard alignment strategy is really, as we said, that all the Cs in the reference genome are converted into a Y, and then you're mapping the reads that come from these different regions. So here the CpG was methylated, so that C was protected; but this one, the C was not protected, so it was converted into a T. You're aligning all the reads to this converted reference with the Y wildcards. The problem is that by doing this, reads that used to map uniquely start mapping in multiple places. And that's where it gets a little bit tricky: reads are going to start mapping ambiguously, depending on their methylation state as well. So if we take this example: here, again, the CpG was protected, the C was converted, these reads map here, so that's all good, and we get an accurate methylation assessment of that position — it's 100% methylated and protected. If you look at this next example, this one is 50%: you get one C and one T, one protected and one not, and you get unique mapping. The other C here was unmethylated and is matched by a Y. So those two reads map; again, here it works.
Here it doesn't work well. This is a bit intense, but what happens here is that the read that is unmethylated, the one where the base was not protected, suddenly maps not only here but also somewhere else in the genome. This tends to happen quite a bit, because now that we have these wildcards in the genome, there's more ambiguity in terms of where reads map. So in this case we lose one of the two reads; we lose the read that was unmethylated. Suddenly we only get unique mapping of the read that was methylated and protected, and this leads to a false estimate of methylation at that particular site. So the wildcard strategy works roughly, but you still have some reads that become ambiguous, and you're losing some reads in certain places. That problem is even worse, in a way, with the three-letter conversion. With the three-letter conversion, we're now looking at the genome, on both strands, in only a three-letter alphabet. And with that, you have even more ambiguous reads in some cases. The advantage, and I'll get to that in terms of strengths and weaknesses, is that this is a very conservative strategy: every read that's ambiguous gets thrown out, and you're only left with the reads that, even in this three-base alphabet, unambiguously come from that particular position. Yeah? I assume that the discordance between these two methods goes down as you have longer reads? Absolutely, absolutely. And is there some read length at which they're mostly concordant? So I think it goes down with the read length, and there are some regions of the genome that are not problematic at all.
There are some regions that are repetitive, and in those regions, no matter the read length, it's a mess. But you're absolutely right that these problems disappear with longer reads in most of the genome. It's still going to affect certain regions, though. It affects regions that are CG-rich, for instance, which are exactly what we're interested in. So there are definitely regions that are CG-rich and low complexity where this kind of ambiguity, because we're not even using the four-letter alphabet, hurts us a little bit, because we're losing reads in those regions. So have there been any systematic comparisons, even perhaps in silico? Yes, absolutely. I don't have that here, but people have looked at the effect. The bottom line is that for most of the genome, if you have long reads, it's not the end of the world. But you'll see, it gets worse with some other things that are coming; this is just one level where it gets complicated. I mean, here we have reads that are four letters long, right? With longer reads, the ambiguities are lower. But the problem is that there are definitely regions of the genome that are outliers, where it's always going to be a problem, and these issues show up there. So it's still going to affect a subset of the genome. Okay, so that was my toughest slide to go through, but hopefully — again, the details are not super important. You get different strategies for handling this. The take-home message is that mapping in this three-letter alphabet is a bit of a challenge: problems that creep up even in regular alignment are worse in this context. And the thing that's a little bit nasty here, I would say, are cases like this where it's the methylation state itself that affects the alignment, because then you're biasing your estimates. That's where it gets really nasty.
So that's what I'll get to after, yes. So, overall strengths and weaknesses of these different approaches. Three-letter aligners have lower coverage in these highly methylated regions, because the conversion decreases sequence complexity, so reads become more ambiguous. That's what we saw in that example: we tend to lose reads in regions that are CG-rich, but at least we don't make bad calls. Wildcard aligners typically have higher coverage overall, but they do introduce some bias towards increased DNA methylation, because when the Cs are retained they increase the complexity, and those reads map uniquely. Again, that's the example we had over there. But these problems, as we were saying, are more prevalent in repetitive regions and go away with longer reads. So these are subtle differences, but they still affect the analysis in some regions of the genome. The program we're going to be using together in the practical is Bismark, which uses this three-base encoding strategy. It converts all the reads on the two strands and maps them to the reference genome in the four different configurations to make the assessment. But this is what we'll cover in more detail in the practical. So it's one of the three-base encoders. The last thing on this topic is really an aside, but I thought it would be interesting to mention. It's called reference-free processing. This is a strategy that's been used for the much easier variant calling problem, which is also challenging, but easier than what we're doing today. For variant calling, typically, you map the regular reads and you look for bases like this. By the way, this is an IGV view, which was in the readings, so hopefully you're familiar with it. All of these gray bars are reads.
And when you have a mismatch like we have here — in this case, this is not methylation data, this is variant data — it's just saying there's a mutation, a change at this position. You have a normal sample, and you can easily detect that there's a somatic mutation here, because most of the reads in this sample have a base change while none of the reads here do. But this is reference-based: you're mapping the reads at that location and you're looking at the variants. For variant detection, there's an alternative strategy where you can do this without even having the reference. You just compare all the reads from the tumor with all the reads from the normal, and without using the reference, you look for reads that are saying different things — reads that look quite similar but have a difference. Again, I won't go into the details; it's not so important. The point is that if you're looking for variants, you can just look for clusters of reads that look very similar but have one difference. And you can do something similar with methylation. This is too much information and not so important here, but if you're interested, I'd recommend you take a look at this. I think this might be an interesting way to detect methylation changes in regions where the mapping step is a challenge, because you're basically just looking for reads that differ between your two states. This would also work if you were doing DNA methylation analysis and you didn't have a reference genome, or you were working on strains that are very different from the reference genome: you could just compare all your reads and look for reads that differ in methylation. Again, this was an aside. So, we're going to get to your question soon now. Quantification of the DNA methylation level.
So we've mapped the reads on the genome — again, that's the main strategy — and now we can look at the methylation state. Our objective is this: we've mapped all the reads on the genome, we have the reference, so by looking at how many of the Cs have been converted or not converted, we should be able to say — so here, blue is methylated — that the majority, but not all, of the Cs here are methylated; here, they're unmethylated; and so on. So how do we actually get this methylation profile that we're looking for? That's the next target. Now, I said the other slide was the hardest slide. Actually, maybe this is the hardest slide for me to explain, to the point where I even left the caption on it to make sure I could find my way back if I got lost. So far, I've assumed in my description that there were no SNPs. But SNPs add a little twist to this whole thing. Let's look at this example. Here we have a C that's not a SNP, just a regular C that is unmethylated. The true genotype, which is unobserved, in this case matches the reference: this really is a C. A C on one strand means a G on the other strand, which means that after the bisulfite conversion, all the Cs will read as Ts and the Gs will still be Gs. So this is what you expect from a real C in the genome that's unmethylated. The challenge — slash fun part — is that sometimes you have variants. The genome you're sequencing might not actually have a C at that position. If there's a SNP, that individual might have a T, and if that's the case, you're going to be reading all Ts. If you're comparing only to the reference, where it should be a C, these are all Ts — so if you're not careful, you're going to call this an unmethylated C, because all of the Cs appear to have become Ts.
The trick is that if you have directional sequencing, and you know that some of the reads are on this strand and some are on the other strand, you can look to see whether you have a G or an A on the other strand. These Ts, when you have an A on the other strand, are not from an unmethylated C; it's because the base actually is a T. So it's a little bit of a trap, but the information is there if you have directional reads — you just have to be careful about it. Similarly, you might have something that's a T in the reference genome, and if you're not careful, you read Ts and say it's fine. But here, in this case, this was a C in the actual person we sequenced, and you should be able to see that by looking at the reverse strand. So there's a way to reconcile what the real state is, including when it's at a SNP, but the software that calls the methylation state needs to be aware of this and to look at the data in this way. I don't know who wrote that software, but it's a pretty smart person, because it's pretty challenging to take all of that into account when you're calling the base. Yes? Does that also take advantage of the expectation that just 3' of the C there should be a G, if it were a methylation site? I think it could, but the problem is that if you build that in as an assumption, then it's going to work for certain things and not others. What I've shown here is true no matter what. I'd say the challenge is that this is an easy case, where I say it's 100% unmethylated. The problem is that on top of this, there are these graded levels of methylation. This is a toy example — what if you only see one T? Are you calling a SNP there? And then you've got errors; you've got different levels of methylation. So this is an easy case; it gets very tricky around SNPs, what you're calling. And also, what's the impact on mapping?
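The strand-reconciliation logic just described can be sketched as a toy decision table (a hypothetical function, not any real caller's code). The key case is a T paired with an A on the opposite strand, which is a genomic T rather than a converted C:

```python
# Toy sketch of strand reconciliation for directional bisulfite data.
# A T on the top strand is a converted (unmethylated) C only if the
# bottom strand still shows a G at that position; if the bottom strand
# shows an A, the position is a genomic T (a C->T SNP), not methylation
# signal at all.

def classify_position(top_base, bottom_base):
    if top_base == "C" and bottom_base == "G":
        return "methylated C"        # protected from conversion
    if top_base == "T" and bottom_base == "G":
        return "unmethylated C"      # converted by bisulfite
    if top_base == "T" and bottom_base == "A":
        return "genomic T (SNP)"     # would be miscalled as unmethylated
    return "ambiguous"

call = classify_position("T", "A")   # needs strand info to get right
```

Without directional reads there is no bottom-strand observation to consult, which is why the fallback is to mask known variable positions instead.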
The mapping step assumes you've got the correct reference, so if you've got SNPs, the mapping is maybe less accurate than in regions without them — this is already hard with just regular variants. I mean, this is hardcore bioinformatics. Yes — well, it's fine if you have the reference, but if you've got lots and lots of variants, this problem becomes even more of a challenge. A lot of the time, because accurate calling of methylation states at SNPs is so hard, one strategy is actually to ignore those calls. Unless you're really confident it's working, you can just mask bases that are known to be variable between individuals and say, I don't trust the methylation calls there as much. I'm not quite answering your question, but it's a different issue: these sites are difficult, and one strategy you can use is simply to ignore them. To answer your question about what happens if there's no reference, or on different genomes: if there's a reference, that's fine; otherwise, you might try these reference-free approaches, where you're not mapping the reads to a reference and you're just comparing the reads you're getting in the two conditions. I don't think there's a lot of work on that, but you can. Okay, so I lied before — this was the hardest slide. I'm not even going to go through this side of it. I think you get the point: the methylation values you get at SNPs are even more challenging to assess accurately, and for this, you ideally need to take the two strands into account. If you have non-directional reads, then you have almost no choice but to mask positions known to be variable. Okay, so some of the tools that do this type of SNP-aware methylation calling: Bis-SNP, MethylExtract — there are different ones. Bis-SNP, which I'll cover now, at least has a lot of steps that make sense and that are quite powerful.
This particular tool is inspired by the GATK framework, which is quite popular for variant calling, and it has many steps, which I won't fully get into, that are also quite useful. One step we didn't discuss in the context of methylation, but that's useful in the context of variant calling, is local realignment. Mapping each of the reads individually is one thing, but in regions that are difficult, it can be good to realign those reads using the information from all the reads together to do a better job. All of that is a bit technical and, I think, beyond what we can do in an hour. But it includes a lot of these additional steps, strategies we use to improve variant calling: local realignment, base quality recalibration. Then it can use the two-strand information to do SNP calling, and based on all of that, it assesses the methylation state. Again, it's quite a tricky business. Lots of people have worked on this and have made different assessments of quality and so on. But variant calling itself is challenging, and this, I would say, is even more challenging. In easy regions of the genome, this is going to work great; in certain regions, it's going to be more challenging. Okay, so if you're still with me, we've gotten through the most challenging part, I think, of the analysis and of the slides. So now we've mapped the reads on the genome, and we've been able to call the methylation level at all of the bases. The next step is being able to look at the data by visual inspection. The tool we're going to be using is IGV, and IGV has a mode where it can easily color the reads, taking into account that these reads were sequenced after bisulfite treatment.
We'll do this in the practical. Because, as you've seen, the direction of the read makes a difference in terms of which strand you're reading, IGV is going to color in red the non-converted Cs, whether they're on the plus or the minus strand. These are positions that are methylated, that were protected. And it's going to color in blue the Ts that were unprotected Cs — the unmethylated Cs — whether they're on the plus or the minus strand. So red will be methylated, and blue will be unmethylated. When we look at data in IGV, hoping the practical works, this is what we should see. We should see things like this, where within the reads there's a lot going on, but red means methylated C and blue means unmethylated C. So this would be a promoter, and it matches very well: you've got a CpG island at the promoter of this gene. It's hypomethylated in the normal sample, but it's more mixed, with some methylation — no longer hypomethylated — in the tumor. This is like the example I showed you at the very beginning: a gene that should have been on, with an accessible promoter, but seems to have been shut off in the tumor. We'll look at examples similar to this in the practical. It's good to look a bit at the data to make sure that everything has worked in this way. You can also then do some additional analyses of your data: looking at the global distribution of methylation values, looking at sample similarity, and so on. So these are global values, which are more or less what you would expect: you've got Cs that are completely unmethylated, and then some Cs that are fully methylated or partially methylated. This is the kind of distribution you expect to see.
This is information on the coverage of the different Cs, because of course these values are only going to be precise if you have a certain number of reads covering the C. Otherwise, you might be at 100% just because you have only one read, and so on. So you have to be a bit careful about that. There are different tools — I'm putting methylKit as one example — that allow you to generate these types of plots and so on. Again, there's quite a bit we could do with downstream analysis of these datasets. We won't have much time for that in the practical this time, but these are some of the tools you might be able to use. One kind of analysis that's also interesting is simple pairwise comparisons. You've got methylation values for all the Cs in the genome, or a subset of Cs if you used an enrichment-based strategy, and you can compare the values between pairs of samples. This is just showing the pairwise correlation between different pairs of samples, and whether you're getting what you expect. A lot of this is quite useful, if you have multiple samples, to verify that the data looks as you would expect. And this goes back a little bit to your question: many of these tools were developed for microarray methylation values, but once you've converted the bisulfite reads into these methylation profiles, it's the same thing — you just have scores over more Cs, and you can use the same strategies as before to compare the samples. Another relevant analysis would be some kind of clustering looking at similarities between the methylation profiles. The question there is really: how do you normalize the data? Do you look at all the Cs? Because most of the Cs are unmethylated, maybe you focus your analysis on the Cs in promoters and CpG islands to get better clustering. But this really depends on your analysis and what you're doing.
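As an illustration of these two basic checks — coverage filtering before trusting methylation values, and pairwise sample correlation — here's a small self-contained sketch (real analyses would use a package like methylKit, with many more options):

```python
# Toy sketch: filter CpGs by read coverage, then correlate two samples
# over the CpGs they share.

def filter_by_coverage(cpgs, min_cov=10):
    """cpgs: {position: (methylated_count, total_count)} -> beta values."""
    return {pos: m / n for pos, (m, n) in cpgs.items() if n >= min_cov}

def pearson(xs, ys):
    """Plain Pearson correlation over two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Sample A has one CpG (position 3) with too little coverage to trust.
a = filter_by_coverage({1: (9, 10), 2: (1, 10), 3: (5, 5)})
b = filter_by_coverage({1: (8, 10), 2: (2, 10)})
shared = sorted(set(a) & set(b))  # compare only CpGs covered in both
r = pearson([a[p] for p in shared], [b[p] for p in shared])
```

The low-coverage CpG is dropped before the comparison, which is exactly the point made above: a 100% methylation value backed by one read should not drive your sample-similarity assessment.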
Some of these still fairly basic analyses let you know whether your samples show the profiles you expected, and so on. So I'm coming to the end of the workflow, with some more advanced downstream analysis, assuming all the basic processing steps worked. Downstream, we might be interested in identifying regions that differ between two groups of samples — these are called DMRs, for differentially methylated regions. A lot of the time, this is the whole point of these studies: to compare two conditions, or two sets of samples, and identify the regions that differ between them. Here's — I forgot to put the reference, but it's from the same review as before — an example of that. If you look at the top, you've got the different CpGs. Assuming you have cases and controls, you're going to get these methylation values at all of these CpGs. And what we're interested in are regions like this, where it's hard to trust a single CpG, even though it might be relevant; most of the time, we expect differential methylation to span a few CpGs. So we're interested in regions like this, where the cases have high methylation and the controls have low methylation. This is too small for me to read, but here you can do a single-CpG analysis to test whether it's higher in the cases than in the controls. Single-CpG analysis is maybe not as robust, because you might have issues at individual sites; a lot of the time, we're more interested in the region as a whole being higher in the cases versus the controls. So we're looking for places like this with higher or lower methylation in the cases versus the controls. To be able to identify regions like this, one important thing to realize is that, just like in almost any other experiment, it's good to have replicates.
Because if you don't have replicates, or enough replicates, you cannot distinguish CpGs that are just highly variable — and they might be highly variable in both the cases and the controls. By having replicates, you can account for that variability and really identify regions that are different. Here's an example of that. Let's call the blue the controls and the red the cases. You see that there's actually quite a lot of variability in that region among the cases. If you had only had this sample and this sample — the top red compared to the blue — you might have said that the blue is hypomethylated relative to the red. But clearly, it's just a very variable site in the red group, and maybe it's not really what you're looking for. So biological replicates, like in anything, help you identify the sites that are truly different between the two populations, and not just the sites that are variable. Another thing about identifying these regions: individual CpGs have values that can fluctuate quite a bit, but methylation patterns generally span a number of bases. So it's actually good to smooth the data when identifying differentially methylated regions, because you might not want to trust a single methylated site; smoothing the data helps. So, I'm almost done. Just to finish up on identifying differentially methylated regions, this is one example I pulled out that I thought was quite neat. It links back to the very beginning, where I mentioned that methylation is important in development. This is through time — I believe these are different days of development — and these are ES cells, and this is showing the methylation profiles.
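Here's a toy sketch of the smoothing idea (published methods such as BSmooth do this with proper statistics): smooth each group's per-CpG values with a rolling mean, then flag stretches where the smoothed case/control difference exceeds a cutoff. Note the noisy CpG at index 3, which would break the region without smoothing:

```python
# Toy DMR detection: rolling-mean smoothing of per-CpG methylation
# values, then thresholding the smoothed case/control difference.

def smooth(values, window=3):
    """Rolling mean over a centered window (shrunk at the edges)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

cases = [0.9, 0.8, 0.9, 0.2, 0.9, 0.8]      # one noisy CpG at index 3
controls = [0.1, 0.2, 0.1, 0.2, 0.1, 0.2]

diff = [c - k for c, k in zip(smooth(cases), smooth(controls))]
dmr = [i for i, d in enumerate(diff) if d > 0.4]  # CpGs in a candidate DMR
```

At index 3 the raw difference is zero, so a per-CpG test would split the region in two; after smoothing, the whole six-CpG stretch is recovered as one candidate DMR, which reflects the biology that methylation differences span neighbouring CpGs.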
And you see that there are differences through time, in terms of specific regions that are hypomethylated at some stages and have a higher methylation state at others. So what you should come out of a DMR analysis with is exactly what you see here: blocks that are differentially methylated in some states versus other states. And, to link back to what was said yesterday, this should also match what we see with the histone marks that tend to be correlated with methylation. OK, so almost done, but this is an important slide as well, in terms of coverage. So I mentioned that whole genome bisulfite sequencing is quite expensive, so you really need to decide how much you need to sequence if you go down that route. And this is an interesting recent paper that did this comparison and this analysis. So here you basically have multiple replicates, and you're looking at the similarity between different replicates. But the key is really the coverage requirement. These plots are a little bit different, so let me focus on this one first, which is really what I want to talk about with this slide. On this axis is the average size of the regions (again, usually what you want to get to are these ranges that are differentially methylated, so what's the size of the regions that are differentially methylated?), and then what's the average methylation difference that you can detect if all you have is 1x worth of sequencing? So even with 1x worth of sequencing, you can actually detect very broad regions, typically 1 kb, that have a very big methylation difference. So if you have samples in two conditions, even with shallow sequencing, if you're fine with just looking for very big regions that have a very large methylation change, you're able to detect them.
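One intuition for why 1x coverage can still reveal broad regions with big differences: pooling reads across all the CpGs in a wide window gives many observations even when each CpG alone has almost none. A rough sketch with made-up counts and a simple two-proportion z-statistic (not the method from the paper mentioned):

```python
import math

def pooled_z(meth_a, total_a, meth_b, total_b):
    """Two-proportion z-statistic on pooled methylated/total read counts."""
    p_a, p_b = meth_a / total_a, meth_b / total_b
    p = (meth_a + meth_b) / (total_a + total_b)
    se = math.sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# At ~1x over a 1 kb window with ~20 CpGs, each group pools roughly 20 reads.
z_big = pooled_z(18, 20, 4, 20)    # 90% vs 20% methylation: large difference
z_small = pooled_z(11, 20, 9, 20)  # 55% vs 45%: subtle difference
```

With the same 20 pooled reads per group, the large difference clears the usual 1.96 significance cutoff comfortably while the subtle one does not, matching the slide's message that shallow sequencing only buys you big, broad changes.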
As you add more and more sequencing, 5x, 10x, up to 30x, what you're able to get are finer and finer regions: smaller regions that have more and more subtle differences in methylation state. So the sequencing depth really affects your resolution in identifying these differentially methylated regions. In no case do you have very good resolution on single CpGs, because the estimate is too dependent on the particular reads you observe. But if you're looking at regions of a few hundred base pairs with 10x to 30x, you're usually able to detect these differences with confidence. So if you're wondering about how much sequencing is necessary, I think this is a very good reference to study in more detail. OK, so conclusions. I think it's fair to say it's not an easy analysis. I mean, even plain variant calling is not trivial, and this has a few interesting twists of its own. So I hope I gave you some ideas of how to choose the appropriate DNA methylation technology: if you have a very targeted, specific question and many samples, a microarray is probably the right way to go; on the other hand, if you're interested in methylation in a non-CpG context, for instance, that's one of the reasons to use one of the sequencing-based approaches. So hopefully I gave you some hints as to how to choose the appropriate technology to profile methylation. It's important in any of these analyses to check the quality and watch for biases, especially if you've got many samples, to make sure that it's more or less uniform. I've shown you a multi-step analysis workflow. In the lab that's coming up, we won't be doing all of these steps, because we would run out of time, but we'll do some of the key steps. And again, hopefully I gave you a sense of what all the steps should be. The last thing I'll end with before the break is that generating your own data is one thing, but there are also data sets that are already available.
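Why depth sets your resolution at a single CpG can be seen directly from the confidence interval on a binomial fraction. A small sketch (the depths are illustrative, and this uses the simple normal approximation):

```python
import math

def ci_halfwidth(p, depth, z=1.96):
    """Half-width of the normal-approximation 95% CI for a binomial fraction."""
    return z * math.sqrt(p * (1 - p) / depth)

# Worst case (p = 0.5) at a single CpG, for a few illustrative depths.
halfwidths = {d: ci_halfwidth(0.5, d) for d in (1, 5, 10, 30)}
# Even at 30x the interval is about +/- 0.18, so a 10% difference at one
# CpG is within the noise; pooling CpGs or replicates is what buys power.
```

This is the quantitative version of the statement above: no realistic depth gives good single-CpG resolution, but the interval shrinks fast enough that region-level estimates at 10x to 30x become reliable.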
There are whole genome bisulfite sequencing data sets that have been generated in the context of the IHEC consortium that you heard about yesterday and that you'll hear more about in the module this afternoon with David. On methylation in particular, there are not so many whole genome bisulfite sequencing data sets, but if you're planning to generate some, you might also want to take a look and see whether there are reference data sets on other relevant tissues that you could use for comparison. So with that, I'll end and take some questions. And again, I purposely wanted to keep this a bit shorter so that we have more time for the coffee break. Woo-hoo. OK, go ahead. What's the recommended depth? What's usually recommended? Usually it's 30x, a bit like variant calling. That's, I would say, the standard. But for some applications, depending on the resolution that you want to have or the money that you have, you might do less or more. But I think the standard is still 30x. Or is it 60x? So what's the IHEC standard, officially? It's 30x. 30x. I mean, and that's based on the observation that, stepping back, as you said, something like 70% of methylation sites share their state with their neighbors, so if you know the methylation state at one site and the neighboring state is the same, you can share that information across sites. But then you're losing resolution, right? And we don't really know whether single CpGs are functional or not. So it's a trade-off, it's a balance. Yes? It depends on how complex the populations of cells are. This is a bit like what Martin was saying. I mean, here, for the most part, we're really just measuring the average methylation in that population. But for sure, there are probably other interesting examples. We're just getting into that with whole genome sequencing of the methylation state; we're not quite there yet in terms of signal.
I mean, in some ways, the microarrays, for instance, are very robust in terms of giving pretty accurate measurements, and the biases are a bit better understood and better known there. Some of the biases associated with the other capture strategies, including RRBS, are only partially worked out: the cutting efficiency, and how that varies depending on whether the sites are methylated or unmethylated. That leads to all sorts of things that make it actually quite challenging. And I'm not sure that, just because whole genome bisulfite sequencing is more expensive, that means you don't need the replicates, because you still will. There's maybe a bit less variability between individuals, such that you might not need as many, but you still have sites that are biologically variable, where it's not just a technical artifact, and you won't be able to distinguish them from the ones that are really different between the two populations. Maybe there. So there are, I don't know the names, but there are definitely tools that try to use a bit of this information, that you expect stretches of more or less constant methylation state, to assess whether you have subpopulations of cells within your sample. And actually, some of the tools that have been developed for microarrays will also apply to this data. Once you estimate the methylation state, you actually have even more data points, and you can feed them into the same algorithms that are used for microarrays to try to identify these different populations of cells, and to see whether you have a mixture of two or three cell populations based on the methylation state. So yeah, you had a question? What percentage change in methylation, 10%? Well, the paper I mentioned does quite a good job of trying to estimate that. But here, it doesn't go down to 10%, right?
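The idea behind those cell-mixture tools can be sketched as reference-based deconvolution: solve for the mixing proportions that best explain a bulk methylation profile as a combination of known cell-type reference profiles. This is a toy illustration with synthetic profiles, not any specific published tool:

```python
import numpy as np

# Reference methylation profiles at 5 informative CpGs (rows) for two
# hypothetical cell types (columns); all values are synthetic.
refs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.7],
    [0.5, 0.5],
])

true_props = np.array([0.7, 0.3])
bulk = refs @ true_props  # what a bulk sample of this mixture would show

# Ordinary least squares, then clip to non-negative and renormalize to sum
# to 1 (real tools use properly constrained optimization; this is the idea).
est, *_ = np.linalg.lstsq(refs, bulk, rcond=None)
est = np.clip(est, 0, None)
est = est / est.sum()
```

Because methylation is near-binary within a pure cell type, the observed bulk fraction at an informative CpG is essentially a weighted average of the cell types present, which is what makes this linear model work.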
So these are robust changes at 30x, meaning you're able to robustly detect changes of about 30% methylation. I mean, it depends on the number of replicates. But for a typical number of replicates, three or something like that, with a change of 10%, unless you have a lot of reads or many replicates, you won't have the statistical power to say that it's real. It might work if you have lots and lots of samples, as in a microarray experiment. Otherwise you need replicates and very high depth to be able to call these more subtle methylation changes. What was the question? Yeah? So I believe you also need an input control for those to be able to [inaudible]. But with respect to how much fractional change: when you look at a fractional methylation change, what you're really talking about is a change in the cellular population, right? Methylation is binary. And when you get thousands of microarrays in these large studies that are done, and people are pulling out methylation changes of 0.05 with some statistical significance, what does that actually mean? Especially when you're talking about blood. What does that mean? Does it have anything to do with biology at all? I don't think so. So be careful: what is the biology you're trying to figure out, first of all, when you're designing these studies? Sure. Sure. Well, the reason I ask is because we're looking at a bunch of promoter regions. And if you're looking at a difference in a region across multiple samples, and you see a change of a certain percentage in, say, an enhancer region, that's an average of the difference; each individual CpG can move the average up or down. So it could be a per-cytosine change, or it could be the fact that not all CpGs are doing the same thing. So I think it's just about... Yeah, so I think those are two different questions, for sure.
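The power point above, that a 10% difference with three replicates is rarely callable while a 30% difference usually is, can be illustrated with a quick simulation. All the parameters here (depth, replicate counts, the rough |t| > 2 cutoff) are made-up choices for the sketch, not recommendations:

```python
import random
import statistics
import math

random.seed(0)  # fixed seed so the sketch is reproducible

def simulate_power(p_control, p_case, n_rep=3, depth=30, trials=500):
    """Fraction of simulated experiments where a two-sample comparison of
    per-replicate methylation fractions exceeds a rough |t| > 2 cutoff."""
    hits = 0
    for _ in range(trials):
        # Each replicate: fraction of `depth` reads that are methylated.
        a = [sum(random.random() < p_control for _ in range(depth)) / depth
             for _ in range(n_rep)]
        b = [sum(random.random() < p_case for _ in range(depth)) / depth
             for _ in range(n_rep)]
        va, vb = statistics.variance(a), statistics.variance(b)
        se = math.sqrt(va / n_rep + vb / n_rep)
        if se > 0 and abs(statistics.mean(a) - statistics.mean(b)) / se > 2:
            hits += 1
    return hits / trials

power_10 = simulate_power(0.5, 0.6)  # 10% methylation difference
power_30 = simulate_power(0.5, 0.8)  # 30% methylation difference
```

The 30% difference is detected in most simulated experiments, while the 10% difference mostly drowns in replicate-to-replicate noise, which is the speaker's point about needing many reads or many replicates for subtle changes.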
I mean, I think you're going to get into DMR calling and how you build DMRs: whether you require the CpGs and the adjacent CpGs to back each other up, as opposed to windowing and then having a heterogeneous mix of calls up and down. I think those are separate questions. But remember, for the fractional change, think about the biology too. You can get lost in the statistics and come up with some very small fractional changes in methylation; is that telling you anything about the biology of whatever you're looking at? Well, this is one of the places where it also helps to do the smoothing, because then you're doing the testing on fewer sites than all the sites in the genome. So this is one of the places where it helps you quite a bit. OK, so I think we'll stop here. Take a 30-minute break as planned, but start again at quarter to 11 instead of 11. So 10:45 we'll start again and get on to the practical.