All right. Good morning, everyone. Before I start, I have a question for the local people here. Who's actually local? How often is the weather like this? All summer. All summer, but is that just two months, or more like half a year? Okay. And the rest of the time it's raining, I hope, or something like that. Otherwise it's not fair. The rain is a lot. Is that true? Yeah. It's not as rainy as you think. This morning was just beautiful. Well, I moved here from Suffolk out. Keeps the expectation low. Yeah. Makes sense.
All right. So as Anne mentioned, yesterday was really about ChIP-seq analysis, and we went into the ins and outs of how to do those types of analysis. This morning, both my module and my lab are about how to analyze methylation data. The practical is going to be on whole-genome bisulfite sequencing, but in the introduction I'll talk about other assays that measure methylation. So let's go right into it.
The objectives of the module are, as I just said, to go over the different technologies that are used to measure DNA methylation and get a sense of their strengths and weaknesses. The workflow that we'll go into in more detail, both in the presentation and in the lab, is bisulfite sequencing data analysis. You will see some of the challenges, we'll extract methylation levels from this type of data, and then we'll learn how to visualize them. On DMRs, the differentially methylated regions, this is really just going to be an overview.
Okay, so starting from the beginning again. Many of you know this, and I know some of you know this better than I do, but what is DNA methylation? Yesterday we were talking about histone methylation. Here we're going to be talking about DNA methylation, in particular 5-methylcytosine. It affects the majority of CpGs in the human genome.
This was already discussed a little bit. This high level of methylation on CpGs is associated with repression, and we'll look at this with some figures as well, but that is mainly in CpG-rich regions. The relationship between DNA methylation in other regions and transcription is a little more complex. And again, that's why it's interesting to have assays and to be able to look at that more carefully.
On the left side here you have histone methylation, histone tail methylation. Yesterday you were looking at different ChIP-seq pulldowns associated with these types of marks, so antibodies to H3 modifications, whether it's methylation or acetylation; it's modification on histone tails. So yesterday you were looking at ChIP-seq data sets of histone modifications. What we're going to be focusing on now is DNA methylation, on the right side. This is the addition of a methyl group on the cytosine to get methylcytosine. This can happen de novo or, and this is part of why methylation is so interesting, it can happen on the complementary strand during replication. That is the context of maintenance: if you already have methylation, you can maintain that methylation during replication.
So why is methylation interesting and important to study? Again, Martin talked about this a little bit yesterday. There's a question of cause and effect. But one of the things that's interesting about methylation is that it does have this mitotic inheritance, because of this process of adding methylation on the second strand when methylation is already there. That enables dividing cells to retain methylation. So that's one feature of methylation that's important. And then methylation is known to play an important role in genomic imprinting and transposable element silencing.
One thing Martin didn't say yesterday when he talked about that, and talked about Barbara McClintock, is that two or three days ago, I think, was her birthday. And in honor of her birthday, they called it Transposable Element Day. I don't know if you guys had a big party for that, but that was three days ago. Clearly methylation is very important in controlling transposable elements and preventing them from hopping around, especially some active elements. In the human genome there aren't so many, but that's still a main use, I guess, of methylation. It's also important in stem cell differentiation, embryonic development, and at some level inflammation and potentially infection.
I like this figure because, as an introduction, it is a simplistic view of methylation. The real story is more complicated than this, but it gives some of the basic principles of methylation as we understand them. At the top you have what's happening around genes. In the normal state, you expect low methylation in the promoter, expression, and then methylation in the gene body that prevents abnormal start sites. This is what you expect in the normal state around genes that are meant to be expressed. In cancer, sometimes you have deregulation both ways. You get high levels of methylation in the CpG islands that lead to repression, so you no longer have activation of the regular transcript. And sometimes you have the reverse, a loss of methylation in the gene bodies leading to aberrant transcripts initiated within them.
So this is what's potentially happening within genes. Then, going back to repetitive sequences and transposons, the normal state is to have a high level of methylation in those regions to control and prevent expression of these elements, which can lead to new insertions and damage.
So again, in cancer you have, in some cases, hypomethylation: you lose methylation in these regions, which leads to aberrant transcripts or, in some cases, transposition events. There's a nice twist to the story in cancer where, as it turns out, this might be a feature of cancer cells that is recognizable by immunotherapies. Because these transposon viral transcripts get expressed, this is a marker that immunotherapies can target. So even though the default is that this shouldn't be good, because it might lead to transposition events, it might be a tool that we can use. That is one quite interesting area of research.
But we're here to learn not so much about methylation itself, thankfully, because I wouldn't have been able to go much further than this. We're here to learn how to measure methylation and how to analyze the data coming out of that. There are really three broad categories of approaches. You have microarrays, which we're going to cover briefly; enrichment-based methods, so different methods to enrich DNA fragments and then sequence them; and then the whole-genome, untargeted approaches with whole-genome bisulfite sequencing. We'll go through that in detail.
One key principle and technology that's used in many of these assays is bisulfite treatment. It matters for many of these technologies and leads to many of the analytical challenges downstream, so it's important that this concept is clear. What bisulfite treatment does is convert unprotected Cs. If you have a methylated C, that C is protected from the bisulfite treatment and remains a C, but unprotected, unmethylated Cs get converted to uracil through the bisulfite treatment protocol.
One twist is that this happens in a strand-specific manner, so you need to take that into account, but the main thing is that protected, methylated Cs will remain Cs while unmethylated Cs will be converted to Us. Another way of looking at it: you have the protected methyl-Cs, which are not converted to uracil, and then following PCR amplification the unmethylated Cs will basically become Ts and the methylated Cs will remain Cs.
We'll get to that, but this leads to quite a number of challenges in the analysis. Looking at this here it looks simple, but think back to how we do the analysis. We start from the reads that we get here, we map them on the genome, and from that, like yesterday, we would quantify how many we have in a particular region. Here the analysis, as we'll see, is going to be a bit challenging, and even the mapping step is going to be more challenging because the reads no longer directly fit the human genome. We have to have an alignment that's aware of the fact that some Cs might be Ts and so on, and we have to take care of that on both strands. But once we're able to do that, we will be able to extract back the proportion of Cs that are methylated.
Another thing to keep in mind before we go into this is that we're doing these experiments on pooled cells. So we're talking about different levels of methylation at a given place. It's not like all of the Cs at a particular position are necessarily methylated. Most of the time we're going to have a mixture, and what we're going to be measuring is what fraction of Cs is methylated at a particular position. But this bisulfite treatment technique is really the basis of many of the assays that we're going to be talking about. So let's jump right into it, thinking about bisulfite microarrays as a first way.
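To make the conversion logic concrete, here is a minimal sketch in Python of what bisulfite treatment followed by PCR does to a top-strand sequence, and of how the fraction of methylated Cs at one position could be estimated from a pool of reads. The function names, the toy reference, and the assumption that every read is already aligned full-length to the top strand are illustrative only; real data needs a bisulfite-aware aligner and handling of both strands.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on a top-strand sequence.

    Unmethylated C is read as T (C -> U -> T); methylated C stays C.
    `methylated_positions` is a set of 0-based indices of protected Cs.
    """
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def methylation_level(reference, reads, pos):
    """Fraction of reads that still show C at a reference C position."""
    if reference[pos] != "C":
        raise ValueError("position is not a C in the reference")
    methylated = sum(1 for read in reads if read[pos] == "C")
    unmethylated = sum(1 for read in reads if read[pos] == "T")
    total = methylated + unmethylated
    return methylated / total if total else None


# Toy pool of cells: the C at index 1 is methylated in 3 of 4 cells,
# the C at index 5 is never methylated.
reference = "ACGTTCGA"
reads = [bisulfite_convert(reference, {1}) for _ in range(3)]
reads.append(bisulfite_convert(reference, set()))

print(methylation_level(reference, reads, 1))  # 0.75
print(methylation_level(reference, reads, 5))  # 0.0
```

The point of the toy pool is that the measurement at a position is a fraction across cells, not a yes/no call.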
This involves preparing the DNA and doing the bisulfite conversion. I don't have more slides on how this works, but it's a hybridization on the microarray. The microarray has probes for the situation of a methylated C versus an unmethylated C, and it measures the ratio between these different probes. You have an example down here where these Cs are methylated, so it leads to this sequence, and you could have a probe associated with that sequence. You're basically measuring that probe versus an unmethylated probe that would have these Cs converted to Us and then to Ts. You can see that if you have just one C, that's relatively simple. If you have multiple potentially methylated Cs in the neighborhood, then you have to design your probes carefully to be able to detect the different scenarios.
I'm starting with the Illumina 450K. There were microarrays even before that: 27K, then 450K. Now the MethylationEPIC array, also from Illumina, looks at 850,000 CpGs and has probes for the unmethylated version and the methylated version. It assesses the state based on which probe hybridizes better, basically, or which lights up better.
Here's a figure from a relatively old paper, but it's nice because, as you'll see, it shows the different technologies side by side. This is an older array that's looking at only 27,000 CpGs, so that one is not very dense. You saw the UCSC browser a little bit yesterday, and this is a plot from the UCSC browser. It's looking at a particular region, the HOX cluster. You see that here you've got quite a number of CpGs, and they tend to cluster as islands, regions that are even more dense with CpGs. These are labeled in red. So there are lots and lots of CpGs in the genome.
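The ratio between the methylated and unmethylated probes is commonly summarized as a beta value between 0 and 1. The sketch below shows one common form of that calculation. The function name and the intensity numbers are made up for illustration, and the offset of 100 is a frequently used stabilizing constant that is treated here as an assumption, not something stated in the talk.

```python
def beta_value(meth_intensity, unmeth_intensity, offset=100.0):
    """Methylation ratio for one CpG from its two probe intensities.

    beta = M / (M + U + offset). The offset keeps the ratio stable when
    both intensities are close to zero; offset=100 is an assumed default.
    """
    m = max(meth_intensity, 0.0)    # negative values can appear after
    u = max(unmeth_intensity, 0.0)  # background correction, so clip at 0
    denominator = m + u + offset
    return m / denominator if denominator > 0 else 0.0


# Hypothetical intensities for three CpGs: mostly methylated,
# mostly unmethylated, and a probe with almost no signal.
for m, u in [(4900, 0), (200, 4700), (10, 5)]:
    print(m, u, round(beta_value(m, u), 3))
```

With almost no signal on either probe, the offset pulls the value toward 0 instead of reporting a noisy ratio, which is one reason low-intensity probes are usually filtered out before analysis.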
Some of them end up being clustered, as was discussed yesterday. You have a track here that highlights where the CpG islands are. You could argue that those are the most important, but again, there are lots and lots of CpGs that are not in CpG islands. The CpG islands, if you look, tend to be found at the promoters of genes, but it's not a one-to-one correspondence. What the microarrays did, initially using 27,000 probes and later 850,000, is profile as many of them as possible and get measurements. But you see that the measurements are quite sparse, especially with this first generation of arrays. And again, we'll get to look at how sparse or not sparse.
Yeah. Yes, that's right. So then it's a proportion that's going to be converted into that ratio. It's very similar to genotyping arrays, where you have alternative alleles and you're measuring one versus the other. The probes themselves, I don't remember the length, but they're relatively short. Yeah, 50-mers. But I think this example shows some of the complexities. Typically methylation comes in clusters, and if you have a region that's dense with CpGs, I don't know the details, but the probe design must be a bit challenging to be able to assess what's going on. I don't know for sure, but I suspect that they just assume fully methylated or fully unmethylated in this case, and it's probably a ratio between those two.
Yes? Yes. So I don't know, if you are looking at a particular CpG, whether you can see that it's fully methylated or unmethylated, or, because it's 50-mers, I don't know. Is there any technology where I can see which part of it is methylated? I think there are. There are. So that's right.
I forget the name, but I know that there's a variant of the bisulfite treatment that allows you to distinguish the hydroxymethylated Cs from the methylated Cs. But then you have to do the two experiments if you want to be able to really distinguish the two. So this is a crude approximation. And since we're on that topic: here I'm focusing on just the regular 5-methyl-C. There's also another limitation of the arrays. It's not only CpGs that are methylated. It's mostly the CpGs, but other Cs also end up being methylated. Those will be missed by these microarrays, because the microarrays directly target known CpGs that are interesting. This is something that we'll get out of whole-genome bisulfite sequencing that we wouldn't get out of microarrays.
Yeah. So that's a good question, and we'll get to that a little bit. Depending on how much you do the treatment, you can either damage the DNA too much or not convert enough. So one thing to watch for in the protocol is the conversion rate. For that, and we were talking about this a little bit yesterday, it's helpful to have sequences that we know are fully methylated or fully unmethylated, and then check whether you're converting those fully or not. So the conversion rate at known sites is typically an important quality metric to look at. Often you spike in sequences that you know are not methylated, or that you know are methylated, and you can use those to make those measurements.
All right. Above, I'm not 100% sure, but I think even in calling CpG islands you've got some parameters for what you call a CpG island.
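The conversion-rate check on an unmethylated spike-in can be sketched as a simple count: every C in the spike-in reference should be read as T, so any C that survives is a conversion failure. The function name and the toy sequences below are hypothetical, and the reads are assumed to be already aligned, ungapped, to the top strand of the spike-in.

```python
def conversion_rate(spike_reference, aligned_reads):
    """Estimate bisulfite conversion from a fully unmethylated spike-in.

    For every reference C covered by a read, count whether the read shows
    T (converted) or C (not converted). Other bases are ignored as errors.
    """
    converted = 0
    unconverted = 0
    for read in aligned_reads:
        for ref_base, read_base in zip(spike_reference, read):
            if ref_base != "C":
                continue
            if read_base == "T":
                converted += 1
            elif read_base == "C":
                unconverted += 1
    total = converted + unconverted
    return converted / total if total else None


# Toy spike-in with three Cs; the second read kept one C unconverted.
spike = "ACCGTC"
reads = ["ATTGTT", "ATCGTT"]
print(conversion_rate(spike, reads))  # 5 of 6 Cs converted
```

A fully methylated control works the same way in reverse: there, Cs that appear as T indicate over-conversion.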
So this must be just in terms of density, some parameter around that, but I'm not exactly sure. Predict it, exhibit it to combine everything like so. Yeah, I'm not exactly sure. There, I didn't understand the beginning.
Well, I'll get on to the pros and cons. I think if you have a lot of samples, there are still advantages, whether it's for profiling expression, instead of RNA-seq, or similarly for methylation. If you know exactly what you're interested in, say you're looking at methylation on genes that are well covered on your array, and you have 100 samples, you might be better off doing a microarray experiment. Number one, and that's what I was going to get to, the methods to analyze microarray signal are well established. Whether it's minfi or others, there are lots of tools that have been tested, and the normalization and all of that has been worked out quite well. So the analysis of microarray data is, at some level, slightly easier. The cost is also much better. We'll get to that, but there's a big difference in cost between microarrays and unbiased whole-genome sequencing, with the enrichment-based methods in the middle. So if you have a lot of samples and you know your question, I would argue that microarrays, both for methylation and expression, are in some cases still useful.
Yeah, I think you guys are stealing all of my punchlines for later. Yeah. No, no, absolutely. So another thing, whether it's mouse or human: to design a microarray, you need to know the sequence, and you don't have that in all organisms. For mouse, I would think that there is a methylation array. But yeah, that's another limiting factor. Okay.
So this is for microarrays. Now we move on to some of the enrichment-based approaches. MeDIP-seq is one. It is similar to ChIP-seq from yesterday in many ways. You sonicate DNA and you have library preparation. This one is interesting because it doesn't have the bisulfite treatment. What it does is use an antibody that recognizes that particular modification. A bit like yesterday, where you can have an antibody to different proteins or to different histone marks, here you have an antibody that recognizes this particular mark. Then you amplify and you do high-throughput sequencing. So it's very similar to a ChIP-seq experiment.
Similarly, another enrichment-based approach where you sonicate DNA is MethylCap. Now the enrichment of DNA is done through targeting this methyl-binding domain protein. So you're targeting a particular protein that is associated with these methylated Cs. So again, yep.
That's a good question. I don't know. So here, I mean, within that the library. So why you don't lose, but you don't have, you don't have conversion. Yeah, you don't have, because this is library gravity nature enriched and then like, oh, it's a good question. It's just an antibody sensitive to methylated CpGs. So when you do the library, probably don't know. You don't reach any amplification when you sequence, but you don't have, unfortunately you don't have martens. But I think what to mention here, if I may, is that this is a very good technique. However, it is not very sensitive at low CpG density. So if you do it genome-wide, although it's a great technique, you will be more sensitive to certain CpGs, so it has some bias toward CpG density.
So basically, this looks like when you fragment and then it calculates whether there are two or three Cs that are methylated or not, because it's going to capture, the antibody is going to find it, the more methylated Cs. That's right. I'm more wondering about the library information. It's just a lot of information on the factors. Yes. Right. Well, not if you haven't done the enrichment. Not if you haven't done the enrichment. Some people do it before. No, but now I think it's something to do with the, I forgot, but this is very recent. Yeah. Yeah. People are doing it now. I don't remember why they do it, but it used to be the opposite.
But one thing that was said that's, I think, very true about the limitation of this is that it's an enrichment. If the region is highly methylated, you get higher signal, but depending on the CpG density, as you were saying, the result is still a bit hard to interpret, because all you know is that it's enriched. Even if we think about ChIP-seq, it's not really quantifying. It's not a digital measurement of how many of the Cs were methylated at all. It's really just, roughly, this region is highly methylated and comes down in your pull-down. That's one of the things that makes this type of data harder to analyze. With ChIP-seq, in terms of quantification of peaks, it's in some cases really hard to interpret the height of those peaks. As we'll see, when we're actually doing the conversion and counting, it's much more quantitative.
Okay, a similar enrichment-based approach, but here, at least, the library prep comes later. So maybe it was a mistake in my other slide.
So here, the difference is that we're targeting this methyl-binding domain protein, and similarly you measure the enrichment that you're getting in the different regions. I'm coming to the end of the enrichment-based approaches. Another one that has been quite popular for a number of years is RRBS. This starts with a digestion by the MspI restriction enzyme, which cuts in CpG-rich regions. So this also enriches, in a semi-unbiased manner, for CpG-rich regions. Then you follow with the bisulfite treatment, the library amplification, and the high-throughput sequencing. The fact that you're using not just random sonication but a restriction enzyme that targets these CpG-rich regions is a way of enriching for the sequences that you're interested in.
Moving on to what the data look like, which is going to help visualize this and put it together. At the bottom, you have what we were looking at before: the CpGs, the CpG islands, and the sparse microarray probes. Again, that's better now, but microarray probes are relatively sparse. You have data here from RRBS, and you see that you are getting a nice distribution of reads that follows quite nicely the distribution of CpGs in the genome. So you do get some representation there. Then you have the MethylCap at different levels and the MeDIP data here. With these two approaches you're only getting reads from regions that are methylated. You have an advantage compared to the microarrays: where you had a gap here in the microarrays, you're getting a nice representation saying that you have methylation in these regions that you were missing, and so on. In many ways you really get a nice profile of the methylation. But I'll get back to that. Figuring out the quantitative aspect is still a little bit tricky.
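The enrichment step in RRBS can be illustrated with an in-silico digest. MspI recognizes CCGG and cuts after the first C, so every fragment end sits on a CpG, and keeping only short fragments favors regions where sites are close together. The sketch below uses hypothetical function names and a toy sequence, and the size window passed to the filter is an arbitrary illustrative choice, not a protocol recommendation.

```python
def mspi_fragments(seq, site="CCGG", cut_offset=1):
    """Cut a sequence at every MspI site (C^CGG) and return the fragments."""
    cut_points = []
    start = seq.find(site)
    while start != -1:
        cut_points.append(start + cut_offset)  # cut between C and CGG
        start = seq.find(site, start + 1)
    bounds = [0] + cut_points + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len, max_len):
    """Keep fragments in a length window, mimicking gel size selection."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


toy = "AACCGGTTTTCCGGAA"
fragments = mspi_fragments(toy)
print(fragments)                     # ['AAC', 'CGGTTTTC', 'CGGAA']
print(size_select(fragments, 5, 8))  # ['CGGTTTTC', 'CGGAA']
```

Run over a real genome, a digest like this would show why RRBS reads pile up where CCGG sites are dense and become sparse elsewhere.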
We'll see as we go through the slides what I mean by that. I'm not putting too much emphasis on how to analyze these data, and each of those would be a topic on its own. There are tools specifically adapted to analyze these different types of data. For both the MeDIP and the MethylCap data, some modification of the tools and principles that were used for ChIP-seq is used to pull out the regions that are of interest.
What I think is really helpful to think about is this slide, which shows where you get data. It looks at the amount of coverage that you're getting in different regions of the genome. If we start at the top with the CpG islands, how much coverage do we get? You see that MeDIP-seq and MethylCap have very good coverage of these CpG islands, at some level even more than RRBS, even though RRBS is quite good. This is a bit different for the microarray, and this is about the number of probes. Again, these are somewhat older data, but it is true that most of the coverage on the microarray is also in CpG islands and then promoters. So you see promoters are more or less the same. The big difference is when you look at the whole genome and not only at the CpG islands. If you're interested in knowing what's happening outside of CpG islands, RRBS data will be quite sparse. The array will be quite sparse. And at some level so will the enrichment-based methods, although they do have slightly better coverage outside of the CpG islands.
Again, I'll get to the more detailed pros and cons after I talk about whole-genome bisulfite and the last set of technologies, which are sequencing-based. So I'm almost done with the technology intro.
The next two technologies are more unbiased, in the sense that with whole-genome bisulfite sequencing, all you do is isolate the DNA, do the bisulfite treatment, and sequence. So it has the advantages and disadvantages that are written there. There's a version of whole-genome bisulfite sequencing that's very similar to what you would do with exome sequencing. When whole-genome sequencing was expensive, and it still is, exome sequencing became popular because you would prepare DNA and then have capture probes, different types of capture probes, that restrict the DNA fragments that you sequence. For exome sequencing, the capture probes would be designed across all the exons, so the only DNA you would sequence would be the DNA associated with coding exons. Similarly, an MCC approach, and there are different variants of that, can capture DNA fragments in regions of interest after the bisulfite treatment. That's one way of being a little more cost-efficient if you know where you want to look. In this case, this was targeting CpG islands, but also different enhancers defined in other ways. So that's one way of reducing the cost.
There are lots of tools for the analysis of these data, and this is what we're going to go into in detail. I've added one of the latest tools for bisulfite sequencing data. There are some bonus slides in the presentation that are not in your book, and that includes a comparison with this latest tool that's used for whole-genome bisulfite sequencing. So again, this is what we're going to go over.
I'm almost done with the broad introduction. This is the slide I had talking about some of the advantages and differences.
At some level, all of these methods do provide methylation measurements, although, as I mentioned, the enrichment-based approaches are a little bit harder. That's what you have here. If you think about the fact that these are mixtures of cells and that you're doing an enrichment, and we'll see later that SNPs and variants lead to all sorts of differences in these regions, all of that is combined into just how many reads you got from a region, which makes the data a bit more challenging to analyze and normalize. Then, a bit like what we talked about, microarrays are lower cost but provide accurate measurements, especially if you know what you're looking at. There's also the point about whether you do have a microarray or not, and I'll have that on the slide, because I think that was a good point as well. And then the last two points: the sequencing-based approaches provide absolute DNA methylation measurements at base-pair resolution, which is great, but the cost can still be a bit high.
So maybe at this point, before we go into the workflow for analysis, I don't know if you guys have more questions on the technologies or the advantages and disadvantages, or if you're all warmed up and ready to jump into the analysis. Okay, so let's jump into this.
Starting with a bit on quality control and pre-processing. This is going to be similar to what we did yesterday, and I won't go into as much detail as was done then. But as we were saying yesterday, it's very important that you look at your data. Were the samples all sequenced using the same protocol and the same instruments? If there are big differences in read length and you're comparing samples that were sequenced with different read lengths, that's going to be a bit problematic, as there are technical issues.
So it's really important, as we were saying yesterday, to look at the data. This is, again, what you did yesterday, and we'll do that in the lab a little bit: looking at the overall quality of the reads to see if there's anything abnormal or any issues. With whole-genome bisulfite sequencing, you would expect a relatively smooth distribution overall, and no Cs in this case, because those would have been converted in the reads, but we'll get to that. With RRBS, because you're selecting the fragments that you're getting, you might have a more variable distribution.
We talked about this yesterday as well. If you observe some abnormal stuff in the distribution of your reads, you can trim the reads in different ways. We didn't go over that in too much detail, but there are specific tools that can facilitate it. You don't have to manually go through your FASTQ files and start cutting. There's Trimmomatic and other trimming tools that can be used, using different approaches. It's not so common to do trimming, except at the end of the reads if you have a loss of quality. In the context of methylation, and now with even better data coming off the sequencers, you can skip this trimming step unless you really see a drop in quality.
Well, hopefully you don't have to deal with data sets like this. This is an example showing read quality from the beginning of the read to the end of the read, and then the distribution of read quality, as you saw yesterday.
You can have criteria to retain only reads that are above a particular value. But to be honest, if your data set really looks like this, you should ask yourself some questions, I think, because even if you retain only the good reads, there was clearly some problem with either the sample preparation or the sequencing. It still might be better than not using the data at all, but it's good to take a look at the underlying data.
We talked a bit yesterday about duplication. Duplication in the context of methylation is, at some level, even more important, because we're going to be trying to quantify how many times we see a methylated C. So this is an important slide. If you have a diverse library, so lots of starting material such that all of your reads really come from different fragments, you're in the situation at the top. That's the best case; that's what you want. If you have a low-diversity library and lots of amplification steps, you end up with lots of identical fragments. And here I should say that the amplification takes place after the conversion, so the copies still carry these bases that suggest there's methylation. It's going to completely throw off your estimates of methylation if some fragments are represented many, many times, if that makes sense. So you want to make sure that you remove duplicates, especially in whole-genome bisulfite sequencing data, because otherwise it's really going to throw off all of your estimates of methylation. A fragment that was present one time will, in some cases, be counted multiple times, and that's going to affect your score.
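A small numerical sketch shows how much PCR duplicates can distort the estimate. Reads are represented here as dictionaries with alignment coordinates and the base observed at one CpG; that representation, the field names, and the rule of calling two reads duplicates when chromosome, start, end, and strand all match are simplifying assumptions for illustration.

```python
def methylation_fraction(reads):
    """Fraction of reads showing C (methylated) at the CpG of interest."""
    calls = [r["call"] for r in reads if r["call"] in ("C", "T")]
    return calls.count("C") / len(calls) if calls else None


def remove_duplicates(reads):
    """Keep one read per (chrom, start, end, strand) alignment."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Four distinct fragments cover the CpG: one methylated, three not.
unique = [
    {"chrom": "chr1", "start": 100, "end": 250, "strand": "+", "call": "C"},
    {"chrom": "chr1", "start": 120, "end": 270, "strand": "+", "call": "T"},
    {"chrom": "chr1", "start": 90, "end": 240, "strand": "+", "call": "T"},
    {"chrom": "chr1", "start": 130, "end": 280, "strand": "+", "call": "T"},
]
# The methylated fragment is over-amplified: five extra identical copies.
amplified = unique + [dict(unique[0]) for _ in range(5)]

print(methylation_fraction(amplified))                     # 6 of 9 reads
print(methylation_fraction(remove_duplicates(amplified)))  # 0.25
```

The true level in this toy pool is 25%, but the duplicated fragment pushes the naive estimate to about 67%, which is the effect the slide warns about.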
I didn't have much on the conversion rate that we talked about, but that's another important factor, which I guess I have here. So we talked about read quality and the presence of adapters. Again, that's something that FastQC will report. If you have lots of adapter sequence, it's going to lead to misalignments and miscalls. If duplicate rates are very high, you're basically losing a lot of sequence: your coverage, which was 20x, might go down to 10x, and that's also going to be problematic in terms of giving you good estimates of methylation. And then the conversion rate: it's important to report these conversion rates. Usually, in the sample preparation for bisulfite conversion, you include sequences that are not methylated, or that are fully methylated, and that allows you to estimate the conversion rate. We won't do that in the lab itself, but I'll show you where that information is for some other data sets. I should have some bonus slides, and this is one of them, so I need everybody to look up, otherwise you'll miss it. This is from the ENCODE data portal, where they specify their expectations for a good whole genome bisulfite sequencing data set that they want to release.
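The conversion-rate estimate from an unmethylated control sequence can be sketched in a few lines of Python. This is a toy illustration, not what any particular pipeline does: the control sequence and the reads are invented, and the reads are assumed to be forward-strand, gapless alignments given as (start, sequence) pairs.

```python
def conversion_rate(control, aligned_reads):
    """Fraction of control Cs that were read as T.

    control: sequence of a control known to carry no methylation.
    aligned_reads: (start, read_sequence) pairs, forward strand, no gaps.
    """
    converted = unconverted = 0
    for start, read in aligned_reads:
        for i, base in enumerate(read):
            if control[start + i] != "C":
                continue                  # only reference Cs are informative
            if base == "T":
                converted += 1            # bisulfite converted this C
            elif base == "C":
                unconverted += 1          # conversion failed on this C
    total = converted + unconverted
    return converted / total if total else float("nan")


control = "ACGTCCGATCGA"                  # made-up unmethylated control
reads = [(0, "ATGTTTGA"), (4, "TTGATTGA"), (2, "GTTCGATT")]

rate = conversion_rate(control, reads)
print(f"estimated conversion rate: {rate:.3f}")
```

With a fully methylated control the same counting would be read the other way around: any C that shows up as T would be an over-conversion.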
So you see that ideally you have replicates in the same tissue. You want a conversion rate above 98%, so you want nearly all of the unmethylated Cs to be converted to T. Then, if you do have replicates, you want a good correlation between them at CpGs that are well covered. It's fine to have either paired-end or single-end reads, as long as you say which one it is. And then there's metadata that needs to be associated, which you'll cover quite a bit with David this afternoon. This is more on the technical side, but I'm hoping that at the end of the lab we'll take a quick look at some of the ENCODE data sets that they've made available. Okay, so now off to the fun part; so far this was the easy background. Before we look at the slide: why is bisulfite sequence alignment difficult? I already kind of said it. I'm sorry? Right, the reads don't quite match the reference. That's right, yes. I mean, at some level that's true with regular DNA sequencing reads as well, right? The reads don't perfectly match the reference all the time: there are SNPs, there are different things, sometimes there are repeats. But now, with the bisulfite treatment, none of the reads match the reference, right? It's like a free-for-all in terms of what you're getting in the reads. So it's really not such a trivial problem, because you also need a method that finds where the reads go very fast. You've got millions of reads, the reference is gigantic, and then you've got all of these conversions. I'm not the one who wrote that program, but as you said, it's not easy, yeah. I think the bisulfite treatment won't change those, right?
So I think the bisulfite treatment is restricted in that way, but even without that it's complicated enough, right? So here's a toy example that shows again what's happening. You've got a methylated C, and typically these come in pairs, on both strands, and an unmethylated C. If you denature the DNA, you've got the two single strands. You do the bisulfite treatment, and you see that the methylated Cs are protected and the other ones are converted to uracil, with a slightly different pattern on the reverse strand. Then you have PCR amplification, which turns those uracils into Ts. So assuming a 100% success rate, and usually we're close to that with the treatment, these are the sequences that we're getting. We expect to get these types of sequences, and we need to be able to map them back to the reference genome and reassign them. That means allowing these mismatches, having Ts here, and understanding that a T means this was an unmethylated C. Sorry? Yes, the treatment needs single-stranded DNA, so you actually denature the DNA, do the treatment, and then make it double-stranded again afterwards. Yeah, right, right. Good. Okay, so let's get back to the challenge of mapping these reads to the genome and identifying the methylation. There are three main approaches to do this: wildcard alignment, three-letter alignment, and the less common reference-free processing that I'm putting here. People have tried different approaches to resolve this. The idea of wildcard aligners is basically to say, you know, Cs can be Ts, we don't know exactly.
So you convert the Cs in the reference to a wildcard, which you allow to match both Cs and Ts in the alignment step. Or you can modify the penalty matrix, depending on how the aligner works, such that mismatches on Cs don't really count as mismatches. There are a number of tools that implement this approach of basically saying, we don't really know what's going on with those Cs, sometimes they're Ts, so let's just treat them as wildcards and allow either. Yeah, that's right, and we'll get to that. The tricky part, as we'll see, is that what happens in the end is also sensitive to the density of Cs and CpGs, at some level. This is going to work well if all you have is one of these Cs in the middle of a very information-rich sequence. But again, it's a wildcard aligner, so you allow Cs to match either way. The alternative is to convert all the Cs into Ts, and do the equivalent on the reverse strand, and basically reduce your alphabet to three letters. In a way, you sort of give up on knowing what's going on at the Cs during alignment: you convert all the Cs into Ts, and then you pretend that the genome was only a three-base genome. Again, this is a bit technical and, just to be clear, it's going to get even worse. But I think it's important to understand those general concepts, and then you can forget about them. Let's try to understand them first, and then forget about them. Ultimately it's not going to be so important, but it's going to give you a sense of why there are some of these challenges.
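The three-letter idea can be sketched in Python on a toy forward-strand example. Everything here is an assumption made for illustration: the 17-base genome, the read, exact string matching instead of a real aligner, and the helper names `bisulfite_convert`, `three_letter` and `find_all`, which are hypothetical.

```python
def bisulfite_convert(fragment, methylated_positions):
    """Forward-strand read after treatment and PCR: unmethylated C reads as T."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(fragment)
    )


def three_letter(seq):
    """In-silico C-to-T conversion, used only to find the location."""
    return seq.replace("C", "T")


def find_all(text, pattern):
    """All start positions of exact matches of pattern in text."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


genome = "GATTACGATCCGTAGCA"
fragment = genome[4:13]                    # "ACGATCCGT"
read = bisulfite_convert(fragment, {1})    # only the first C is methylated

direct_hit = genome.find(read)             # -1: the read no longer matches
hits = find_all(three_letter(genome), three_letter(read))

# Once the location is known, compare the ORIGINAL read with the ORIGINAL
# genome: a C that stayed C was methylated, a C that became T was not.
position = hits[0]
calls = {
    position + i: read[i] == "C"
    for i, ref_base in enumerate(genome[position:position + len(read)])
    if ref_base == "C"
}
print(read, direct_hit, hits, calls)
```

The converted sequences are only used to decide where the read goes; the methylation calls come from going back to the unconverted read and genome at that location.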
The additional complexity that I mentioned is the fact that not only is it a population of cells, sometimes you also have two alleles, as well as the two strands. So here you have a methylation level that's either 100%, 50%, or 0%. And from that you're getting these reads; this is a toy example. These are miniature reads, typically reads are much longer, but this is just to give you an idea of what's going on. So this is your starting point, this is the truth, and these are the reads that you're getting. These cartoons try to give you a visual representation of what is happening. Maybe the three-letter case is easier to begin with. In the three-letter alignment, you've converted both the reads and the genome into this three-letter alphabet, so there are no more Cs. Without Cs, what you end up with is losing reads: the ones shaded here are reads that came from someplace, but they're now ambiguous, because in the three-letter world you have less information, and so you have more ambiguity in terms of where things go. So in this mode you end up losing quite a number of reads because you don't know exactly where they go anymore; they could go in lots of places. If you look at what you're able to call, you're losing some regions where you did have reads but where you're no longer able to estimate methylation because of the ambiguity of the three-letter genome.
The flip side is the wildcard alignment where, if you remember, you allow a match at a C no matter what. This actually allows you to retain more reads, but it has this perverse effect of biasing some reads in some places, and I have that explained better on the next slide. Basically, the three-letter approach ends up being more conservative, where you only retain reads that you're sure about, and the wildcard approach ends up sort of working, but leading to a few weird things too. So the next slide explains this better than my blah, blah. Three-letter aligners have lower coverage in highly methylated regions because they end up purging reads: in these regions all of those Cs become Ts in the alignment, and so you end up, in some cases, losing reads. But again, as the reads get longer, that problem goes away. Wildcard aligners typically have higher genomic coverage, but at the cost of introducing some bias towards increased DNA methylation. So again, it's not super important, but at least hopefully it gives you a bit of a sense of what's happening. Typically, I think there's a slight preference for these three-letter aligners because they're a bit more conservative, and if you have longer reads, they tend to work well. Yeah. Yes. Well, there are different approaches, but in the simplest approach you discard the ambiguous reads, because you don't know where to count them. I don't know for methylation, but for variant calling those reads are sometimes used and weighted in different ways. Typically, though, they're discarded just because it's simpler. Yeah. Well, which is also true here, right? Maybe it's just an example to show the idea; in practice the reads are a bit longer and it would be, yeah. Yeah, it would be.
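The loss of mappability in the three-letter world, and the way longer reads help, can be sketched in Python. This is an assumption-laden toy: the genome is a random 5,000-base string with a fixed seed, and a "read" is just an exact window of length k, counted as mappable only if its sequence occurs once.

```python
import random
from collections import Counter


def unique_fraction(seq, k):
    """Fraction of length-k windows whose sequence occurs exactly once."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    return sum(1 for w in windows if counts[w] == 1) / len(windows)


rng = random.Random(0)                      # fixed seed, toy genome
genome = "".join(rng.choice("ACGT") for _ in range(5000))
converted = genome.replace("C", "T")        # three-letter version

results = {}
for k in (6, 8, 12, 20):
    # (unique fraction with four letters, unique fraction with three letters)
    results[k] = (unique_fraction(genome, k), unique_fraction(converted, k))
    print(k, results[k])
```

Two windows that are identical in the original genome stay identical after conversion, so the three-letter fraction can never be higher than the four-letter one; as k grows, both fractions approach one, which is the "longer reads" effect.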
Okay. So which tools actually implement this type of strategy? If we think about these three-letter alignments, that means that you have to convert the reference to replace the Cs by Ts, and you also need to convert all of your reads into that alphabet. Oops. So Bismark does that, where the genome itself is converted. You have two different versions of the genome that you have to map to, and you have two different versions of the reads that you also have to map. So not only do you map one read to one genome: you have to convert the genome for the positive and the negative strand, you have to do the same with the reads, you have to do all of these different mappings, and then resolve where the read goes. You have to see whether it's ambiguous in one version or whether in the end there's a clear mapping location. Bismark implements that type of approach, and this is actually what we're going to do in the practical. Yes, but a lot of times it's actually the converted reads that are interesting to look at when you want to see exactly what the data looks like. But again, this is exactly what we're going to be doing in the practical. gemBS, which is a more recent tool that we're currently testing, is also a three-base aligner. The big thing with this tool is, if you think about all of the steps here, there are a lot of intermediate files, and it's actually quite a slow process. So even though it's doing more or less the same thing as Bismark, gemBS does it more efficiently for the most part. At least that's what they claim in their paper. These are also some of the bonus slides. I think they're going to be on the student website as well; it's just that they're not printed out.
In a paper about a new tool, you know, the authors always say that they're better than other tools, and that's what they do here as well. But we've been testing this, and it seems to hold up that it does perform well. Here you see the amount of CPU it takes and the amount of time. gemBS is very efficient. Compare that to BSMAP, one of the first ones, or to Bismark, which takes, I guess, more time in terms of hours. And this is for two different coverages of whole genome bisulfite data. It's faster, but it also maps more reads; this is the number of bases that are aligned. So it seems to perform well, and it's been implemented in a faster way. gemBS, which is this one, is comparable to bwa-meth. Yesterday you were talking about BWA, which is the main mapping tool, or probably the most used mapping tool, for genomic reads, and there's a version for methylation. So gemBS performs comparably to that tool in terms of the number of reads that it maps, but does it slightly faster. And I have a bit more on gemBS later in terms of what it does too. The last type of strategy is not alignment but reference-free processing of these reads, and it's just that: it tries to bypass the alignment. This is more of an aside. It's not so common, but I thought it'd be interesting to talk about it for a second. This idea of not using a reference comes from variant calling. So switch your brain from methylation back to just regular DNA and regular variant calling. In variant calling, you map the reads. This is IGV, which we're going to be using.
All of these are reads that were sequenced and mapped to the genome, and the mismatches are highlighted. So you map the reads to the genome, and it's quite easy to see that there's a variant here because all of the reads have it. But what happens if there's a region that has lots and lots of rearrangements or lots of mismatches? You might not be able to map the reads and detect the variant. So an alternative approach is reference-free variant calling, where you don't map to the reference genome; you just compare reads in two different conditions. You've got a tumor, you've got a normal. You look at your population of reads, forgetting about the human genome, and then you say, well, in this set I have many reads that are different from that set. The details are not so important here; it's just the concept that if mapping is difficult, why even map? If the ultimate goal is to compare two different conditions, you could compare your reads to your reads and forget about the genome. So again, this is just one tool. Personally, I haven't used it, and I don't know if you have, Martin. Yeah. Yes. Right. So suppose you're interested in normal versus tumor methylation. If you have a normal sample and a tumor sample, you could just look at your methylation reads. Most of your reads are going to look the same in the two samples, but you might have a population of reads in your tumor that is very different from what you have in the normal: a whole bunch of methylated tumor reads that you don't have in your other sample. So you identify that group of reads, and then you ask, where do they come from? You would still use the reference, but sort of downstream.
And I would think that, especially in some of these regions where mapping was a problem because you have low complexity and things like that, you might see a pattern that you would miss with the other approach, because those reads fail the mapping step, right? If we go back to this, we know that there are some regions that become ambiguous and where you can't map. Maybe that's where you have a hundred times more of those reads in the tumor than in the normal, so there might be something there. That's the idea of this approach, which was developed initially in the context of tumors and genomic sequencing, where it's also not that commonly used. But the same idea should in principle also apply to methylation. So I'm putting that out there, but this is really an aside. Okay, so moving on. We've talked about alignment, which I mentioned was one of the big challenges. Once we've aligned the reads, we're going to be interested in quantifying the levels of methylation. So this is a different cartoon view of what we have. At the top, you have the actual methylation level at different CpGs; then, after mapping, you have your different reads. Here we're not showing the Cs and the Ts, we're just color coding them. If you had no methylation, as we've seen, all of these will have been converted. So you end up with a mismatch in this case, but we know that we should read this as an unmethylated C, so no methylation. This is the opposite, where all of these Cs were unconverted, so this corresponds to 100% methylation. So once we've mapped the reads, we're able to read out these patterns and convert them into a methylation profile. But maybe this is a good example to go back to one of the things we were talking about.
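Reading the mapped patterns out into a methylation profile amounts to counting Cs and Ts at each CpG. A minimal Python sketch, assuming made-up forward-strand reads that are already placed on a toy reference without gaps (the function name `cpg_methylation` is hypothetical):

```python
def cpg_methylation(reference, aligned_reads):
    """Count methylated and total calls at each forward-strand CpG.

    aligned_reads: (start, read_sequence) pairs, forward strand, no gaps.
    Returns {cpg_position: (methylated_count, total_count)}.
    """
    counts = {}
    for start, read in aligned_reads:
        for i, base in enumerate(read):
            pos = start + i
            if reference[pos:pos + 2] != "CG" or base not in "CT":
                continue                   # not a CpG, or not informative
            methylated, total = counts.get(pos, (0, 0))
            counts[pos] = (methylated + (base == "C"), total + 1)
    return counts


reference = "TTCGAACGTT"                   # CpGs at positions 2 and 6
reads = [
    (0, "TTCGAACGTT"),                     # both CpGs unconverted
    (1, "TCGAATGT"),                       # first unconverted, second converted
    (2, "CGAATGTT"),
    (0, "TTTGAATG"),                       # both converted
]

levels = cpg_methylation(reference, reads)
for pos, (methylated, total) in sorted(levels.items()):
    print(pos, f"{methylated}/{total} = {methylated / total:.2f}")
```

The ratio is only as good as the denominator: with four reads per site, each read moves the estimate by 25 points, which is why coverage matters so much later on.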
I mean, this is much more quantitative than the pull-down experiments, which just give enrichment levels. Here, we're able to really count the number of observations and convert that into a ratio of methylated Cs. Well, again, if the reference genome is a C at that position and all you see are Ts, this means that they were all unmethylated at that position. But maybe you're getting into what I have on the next slide, which is: what if the reference genome is wrong and that individual has a T at that position? So this was the nice case. I don't know if I've already lost you or if I'm going to lose you now, but this is the right place to get lost, if anywhere. So far we've sort of assumed that the reference was correct, when actually there are a million SNPs or more, 10 million I guess, between individuals usually, right? One million. Depends on the individual. Okay, well, I'm weird, I have more than that. So there are lots of SNPs in the genome, right? There are lots of places where, on top of the fact that you've sometimes converted a C or not, you've got real differences. This slide shows that a little bit, and the impact of it. On the left side, we have the same type of example that we had before. We have a real C, which has a matching G. In the context of the bisulfite treatment, we expect to get the Cs converted to Ts, and on the reverse strand to only have the G. So this is for a real C in the genome that is converted. But suppose you have a SNP such that this is our reference, but in that particular individual what we have is a T at that position. Then the T that we're going to see here is misleading; it's not the same as this T, right? Looking at the reference, we're going to say, we see a T here.
It's supposed to be a C. There you go: you've got an unmethylated C. If we only look there, that's what we're going to think. If we have sufficient reads such that we have reads on the reverse strand, though, we should see an A in this case because there's a SNP. And that would change our interpretation to say this is not an unmethylated C; this is actually a SNP. Similarly, you can have the reverse type of thing happening, where the reference has a T but in that individual it's actually a C. The C gets converted into a T. So in this case, we think nothing is happening here, because it's a T and it's supposed to be a T in the reference. But actually, if we look again on the, well, this one is total confusion, isn't it? At that point, I'm confused too. So if we see a G, so this one looks, oh, no, no, no. So we can still say that this is an unmethylated C, because compared to the reference this pattern is not expected, right? We would have expected an A on the reverse strand. So again, we need an algorithm that's aware of these possibilities. Right. But I don't think you have to do that, because if you have enough coverage and you have the reverse strand, then you can resolve all of these different cases. It's really based on having sufficient coverage such that you see the reverse strand, and that allows you to resolve this case. It's a bit of a mind exercise, I would say. And it's even worse than that if you think about a heterozygous SNP, because then you're going to get a mixture, and you really have to have enough read depth to be able to resolve all these cases.
So again, I'm very happy I'm not the one who had to write the software to extract that from the data. Yes. The SNP goes in this way, right? The reference is a C, but it's a T in the individual. So this is the reference, and it's going in this direction. You don't see this, right? All you know is the reference, and you get these reads, but you can infer what the true genotype is based on the patterns that you're getting on the two strands. It's really not so easy, but thankfully other people have worked this out. My point here was really just that you're aware of this. You can imagine that, depending on the CpG density and the SNP density, these problems become even more tricky to resolve. And especially for calling SNPs like this, you do need sufficient coverage to have enough observations on both strands. Another thing I didn't even talk about is that you also sometimes have errors, right? You'll have a read that actually has a sequencing error, and that is in there too. So it's not so easy to call SNPs. But as I said, there are some tools out there that do that. Bis-SNP, for instance, which we won't be running in this case. If you're interested in this, I encourage you to look at some of these papers. But Bis-SNP is one tool that we typically run after Bismark to call SNPs from this data as well. Maybe the details of the SNP accuracy and so on are not so critical. You have that in the slides if you're interested; as I mentioned, I have a few bonus slides. This is another of the bonus slides, coming from the gemBS paper. gemBS also implements an efficient SNP caller as part of the package.
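The two-strand reasoning for a position where the reference has a C can be written down as a small decision rule. This Python sketch is only a cartoon of the slide, not how Bis-SNP or gemBS actually call SNPs: the thresholds (at least three reverse-strand reads, 20% and 80% A fractions) are invented for illustration, and the reverse-strand bases are given as they sit on that strand (G or A), as drawn on the slide.

```python
def classify_reference_c(forward_bases, reverse_bases, min_reverse=3):
    """Interpret reads covering a position where the reference has C.

    forward_bases: bases seen on forward-strand reads ("C" or "T").
    reverse_bases: bases seen on the opposite strand itself ("G" or "A").
    Returns (label, methylation_level_or_None).
    """
    if len(reverse_bases) < min_reverse:
        # Without the other strand, a T could be a SNP or an unmethylated C.
        return ("unresolved", None)
    a_fraction = reverse_bases.count("A") / len(reverse_bases)
    if a_fraction >= 0.8:                  # hypothetical threshold
        return ("homozygous C>T SNP", None)
    if a_fraction >= 0.2:                  # hypothetical threshold
        return ("possible heterozygous SNP", None)
    informative = [b for b in forward_bases if b in "CT"]
    if not informative:
        return ("true C, no forward reads", None)
    return ("true C", informative.count("C") / len(informative))


print(classify_reference_c("TTTTT", "GGGG"))   # all T, G opposite
print(classify_reference_c("TTTTT", "AAAA"))   # all T, A opposite
print(classify_reference_c("TTCT", "GGAA"))    # mixture on the other strand
print(classify_reference_c("TTCTC", "GG"))     # too few reverse-strand reads
```

A real caller would also have to weigh base qualities and sequencing errors, which this rule ignores entirely.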
So here you have the three different aligners: Bismark, BSMAP, and GEM3, which is the aligner that's used by gemBS. And again the plots show that it's very efficient in terms of the resources and the time it takes to call SNPs. If we trust their figures, this is what they report in terms of the accuracy of their SNP calls in this data. One thing to point out: these plots start with extremely high coverage, unrealistically high coverage, and go down to, you know, 30x, and this end, for whole genome sequencing, is probably closer to the range that's achievable. And then it plots the number of false negatives, the amount that you're missing at these different cutoffs. So the performance is reasonable, but especially if you only have 18x or 20x, you do have some mistakes, some missed calls. One thing that would be interesting, which we don't have here, is a comparison with whole genome sequencing. Whole genome sequencing is much more robust at calling SNPs, but this also works reasonably well. Again, this is more of a nice-to-have; we won't be doing that in the practical per se. Yes. So typically we do the SNP calling at the end, which is true for the case that we showed here, right? That's a good question. I think the SNP calling takes the methylation calls as input as well, and will change the methylation calls at the same time, but I'm not even sure. Oh, that's right, so the reference is SNP-aware. I guess the bigger question is in terms of personalized genomes. If you could personalize the reference to the same genome that you're sequencing, whether it's an organism or, if you have that, a kind of consortium with many different tissues from the same individual.
So what power do you get by aligning to a personalized reference, which I think is what you're asking? For a model organism, for instance, of course you could do that, if you could generate that reference. Well, usually this is done on an individual-by-individual basis. Actually, you could pool all that data to call the variants, which would probably be a good idea, because this way you would have sufficient coverage to call them well. I guess it would also depend on how distinct that population is. It depends on how much it has drifted; with one or two individuals, you could change those positions in the reference and then, realigning, you might see a difference. Again, it depends on how drifted it is. This is a green line. It looks like a Scandinavian population, but it's an Inuit. But one point that Martin made that's important, and that I forgot to mention, is that annotating these calls with dbSNP is quite important. Most of the variants are already known, including where they are located. So annotating the methylation calls using even just common genotypes is an important step. Okay, so moving on to something that we will do in the lab as well: data visualization and some aspects of statistical analysis. One of the things that's quite helpful once you've aligned your data is to look at it in a browser. For this particular lab we're going to use IGV, which hopefully you've installed on your computer as per the instructions. IGV is very nice because it has a dedicated mode to look at methylation data, which is also useful for making sure that you've run the pipeline correctly. Without going into the details, IGV has a display color mode that's methylation-aware, such that you can visually identify methylated and unmethylated Cs. So this is a case where you have no methylation in the normal and higher methylation in the tumor, which goes back to our starting point: an example of one of these cases where you have aberrant hypermethylation on the promoter of a gene.
So we'll do a little bit of that in the lab. Another thing that's helpful, which we won't do in the lab, is to also look at the global distribution of methylation values and to cluster samples based on similarity, things like that. So this is exploratory analysis using the methylation values. For instance, once you've extracted the methylation values across the different CpGs, what's the distribution of methylation that you're getting? You expect a distribution a bit like this, where many CpGs are unmethylated and then you have CpGs from highly methylated to more intermediate methylation levels. This one is about, I guess, the read coverage, showing how many CpGs you're actually able to call, because obviously, depending on the number of reads you have, you're going to be able to estimate the methylation values better or worse. I talked about this a little when I talked about ENCODE, but you're probably going to be interested in correlation. Once you have methylation values at the various CpGs, you can look at the correlation between different pairs of samples, depending on whether they're replicates of the same thing or not. I don't remember what ENCODE expects for replicates in terms of correlation, above 80% or something like that, I think. Again, the higher the coverage you have, the more accurate your methylation levels will be, and you can sort of play with this. I'm sure that if you have a cutoff where you only take CpGs that have sufficient coverage, you would have a higher correlation between replicates, and so on. This is all done using a toolkit called methylKit. You could do it directly in R, but there are a lot of routines that are implemented in methylKit.
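The effect of a coverage cutoff on replicate correlation can be sketched in plain Python. This is not methylKit, just an illustration of the idea on invented counts; the two replicates are dictionaries mapping a CpG identifier to (methylated reads, total reads), and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def replicate_correlation(rep1, rep2, min_coverage):
    """Correlate methylation levels at CpGs covered well in both replicates."""
    shared = [
        cpg for cpg in rep1
        if cpg in rep2
        and rep1[cpg][1] >= min_coverage
        and rep2[cpg][1] >= min_coverage
    ]
    xs = [rep1[cpg][0] / rep1[cpg][1] for cpg in shared]
    ys = [rep2[cpg][0] / rep2[cpg][1] for cpg in shared]
    return pearson(xs, ys), len(shared)


# Made-up counts: CpGs 1-3 are well covered, CpGs 4-6 have 2-3 reads each.
rep1 = {1: (18, 20), 2: (2, 20), 3: (10, 20), 4: (1, 2), 5: (0, 3), 6: (3, 3)}
rep2 = {1: (17, 20), 2: (3, 20), 3: (11, 20), 4: (0, 2), 5: (3, 3), 6: (1, 3)}

r_all, n_all = replicate_correlation(rep1, rep2, min_coverage=1)
r_cov, n_cov = replicate_correlation(rep1, rep2, min_coverage=10)
print(f"all sites: r={r_all:.2f} (n={n_all}); >=10x: r={r_cov:.2f} (n={n_cov})")
```

The trade-off is visible even in this toy: the cutoff improves the correlation but halves the number of CpGs that are compared.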
Just again, comparing different samples and seeing patterns that you would expect or not expect based on what you know. So I'm coming towards the end of the presentation now. I think I've been talking for long enough, and we'll soon actually do something with our hands. Just to finish up, though, there's another important thing that we typically do. What we just saw was more in exploratory mode. Just like it's very useful to look at your data at the very beginning, it's also very helpful to look at your data after mapping and after calling methylation values, to make sure that it looks more or less as you would expect before you go into more advanced downstream analysis, like identifying differentially methylated regions. So assuming that you were successful in all of the initial steps of mapping and calling methylation values, something that's typically interesting to do is to identify regions that are differentially methylated between two groups or two conditions. You could do that at the level of individual CpGs, especially if you have sequencing data, which is quantitative, as we've been seeing. So you can imagine this case where you have different CpGs across the genome with different levels. In order to robustly identify statistically significant differences, you need replicates. At the level of individual CpGs you can ask, assuming some kind of distribution, what's the probability of observing three samples with a high level and three with a somewhat lower level? And so you can extract, for each individual CpG, a significance score for being differentially methylated. This is in some ways similar to what you would do with gene expression or with other things; you just have to model the distribution, your expectation, slightly differently.
In theory, you can do it at the level of individual CpGs. But depending on your coverage and how much variability you have at individual CpGs, often this is done more at the level of tiles or regions: either you have fixed-size tiles, or you have a fixed number of CpGs in a row, where you want three CpGs or more in a row to be different between the two groups. So again, either fixed-size regions or a fixed number of CpGs, and then a similar type of statistical test between the two groups to identify differences. It's useful and important to have replicates as much as possible, because you can see that sometimes there is actually quite some variability between samples. So here, if you look at the three samples in red, one of those maybe looks a bit like the blue, and you see that there's quite a bit of variability. So it does seem like there's a difference between red and blue in this region, but again, this seems to be quite variable. Again, because the coverage at individual CpGs is typically relatively low, the methylation levels can be a bit noisy, but if you use smoothing approaches, you can get a much cleaner signal. And then again, how meaningful it is to have a single CpG that's differentially methylated is unclear. It's clearer at the level of slightly larger regions. And so in this context smoothing is good. And again, these are the types of things that you're hoping to detect. So this is a paper that's showing methylation levels across different developmental stages and cell lines. And you see that during development, methylation here in these two promoters gets lost, such that probably these genes get activated at that particular stage. But again, for this you typically smooth the data and/or use tiling windows to identify the regions that are differentially methylated based on some statistical criteria. So I'm almost done. No more bonus slides for you guys.
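The fixed-size tiling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sites are (position, methylated reads, total reads) tuples on a single chromosome with 0-based positions, the 1,000 bp tile size is arbitrary, and the minimum of three CpGs per tile mirrors the "three CpGs or more" rule of thumb mentioned above. The function names are hypothetical.

```python
from collections import defaultdict


def tile_counts(sites, tile_size=1000, min_cpgs=3):
    """Sum read counts over fixed-size tiles on one chromosome.

    sites is an iterable of (position, methylated_reads, total_reads).
    Returns {tile_start: (methylated, total, n_cpgs)} keeping only tiles
    that contain at least min_cpgs covered CpGs.
    """
    tiles = defaultdict(lambda: [0, 0, 0])
    for pos, meth, total in sites:
        if total == 0:
            continue  # an uncovered CpG carries no information
        start = (pos // tile_size) * tile_size
        tile = tiles[start]
        tile[0] += meth
        tile[1] += total
        tile[2] += 1
    return {s: tuple(t) for s, t in tiles.items() if t[2] >= min_cpgs}


def tile_differences(sites_a, sites_b, tile_size=1000, min_cpgs=3):
    """Methylation difference (A minus B) for tiles present in both samples."""
    tiles_a = tile_counts(sites_a, tile_size, min_cpgs)
    tiles_b = tile_counts(sites_b, tile_size, min_cpgs)
    diffs = {}
    for start in sorted(tiles_a.keys() & tiles_b.keys()):
        meth_a, total_a, _ = tiles_a[start]
        meth_b, total_b, _ = tiles_b[start]
        diffs[start] = meth_a / total_a - meth_b / total_b
    return diffs
```

Pooling reads within a tile is a crude form of smoothing: it gives each region more effective coverage than any single CpG has, at the cost of resolution. The pooled tile counts can then go into the same kind of statistical test you would use for a single CpG.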
But this one I think is also important because, oftentimes, identifying DMRs, these differentially methylated regions, is really the main reason why you're doing this, right? You have two conditions, or you're looking at a time course. And this is showing sort of the pros and cons of having higher-depth data to identify these regions that are differentially methylated. So just going over this quickly, you've got different cell types, and you see that the methylation patterns correlate well by cell type. That's what you see on the left side. So this is similar to what we saw before in terms of correlation. So there's human ES cells, there's cortex, there's liver and different blood cell types. But the thing that's nice is this plot, for instance, which shows how many DMRs you can identify. So first you have the full data at 30X whole-genome sequencing. But then they sort of downsample that data and ask the question: with less data, are you able to identify differentially methylated regions? And what you see is that if you start with relatively low coverage, 1X to 5X, you are able to identify regions that are bigger, on the order of a kb, and that have a very, very large difference in methylation, 40%. And as you increase coverage, what you're getting is the ability to detect smaller and smaller regions that are differentially methylated, and smaller effect sizes. Yes. Well, so again, here their starting point is really taking the full 30X whole-genome bisulfite data, comparing two different conditions and using that to define DMRs, right? I mean, the more data you have, the more sensitive your windows are and the more subtle the changes you can pick up. So there's no good answer.
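If you want to try this kind of downsampling experiment on your own counts, here is a minimal Python sketch. It thins reads at the level of per-CpG counts, keeping each read independently with a fixed probability. That is an assumption made for simplicity and not necessarily how the paper did it; subsampling the aligned reads themselves would be closer to a real lower-depth run. The function name is hypothetical.

```python
import random


def downsample_counts(sites, keep_fraction, seed=0):
    """Simulate lower coverage by keeping each read with a fixed probability.

    sites is a list of (position, methylated_reads, total_reads).
    keep_fraction is the target depth divided by the original depth,
    for example 5 / 30 to mimic 5X data starting from 30X.
    CpGs that end up with no reads are dropped from the output.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    thinned = []
    for pos, meth, total in sites:
        # Methylated and unmethylated reads are thinned independently.
        kept_meth = sum(rng.random() < keep_fraction for _ in range(meth))
        kept_unmeth = sum(rng.random() < keep_fraction
                          for _ in range(total - meth))
        if kept_meth + kept_unmeth > 0:
            thinned.append((pos, kept_meth, kept_meth + kept_unmeth))
    return thinned
```

You could then rerun your DMR calling on the thinned counts at several fractions and compare the calls against those from the full-depth data.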
I mean, the point of that particular panel is to say that the deeper the sequencing you have, the more subtle the effects and the smaller the regions you're going to be able to detect. Yeah. Well, so that's what this is. Yeah, I mean, that's what they tried. So if you look at the different coverages, you also have some estimates, but the problem is, what's the ground truth, right? So here they're defining the ground truth with replicates and 30X. And that's what they're using. And then they go down to see, with fewer replicates or less coverage, how much of that set they would have found. Yeah. No, no, absolutely. Okay, so any other questions? So hopefully it wasn't so bad the way I described it. As you see, the analysis is not easy. I think some of the concepts are a little bit tricky when you think about these converted reads, and you'll see that in the lab. But again, there are tools that have been written to do that. So you have to choose the appropriate DNA methylation technology, and we talked quite a bit about the pros and cons of the different approaches. You need to check quality and watch for biases. Again, this is the kind of thing we'll do a bit in the lab. Then the multi-step analysis workflow. I guess that's, again, what we're going to do in the lab. And then the available whole-genome bisulfite sequencing data sets. I think if we have time, we'll do a bit of that at the end of the lab. I'll show you where we can find some of these data sets, both in the portal and in ENCODE. But I think we'll start the coffee break 15 minutes early. And then, unless Anne disagrees, we'll start again 15 minutes early such that we have more time.