All right, welcome back everybody. My name is Guillaume Bouch. I'm sorry I couldn't join yesterday, so I'll use this opportunity to introduce myself a little bit. I believe you've already met Hector, Maraike and David from my group. Hector and Maraike in particular will be helping me with this module on whole-genome bisulfite sequencing and analysis. I've seen parts of what Martin and Edmund were showing, and obviously it set the bar very high, so hopefully we'll do as well as they did in introducing you to methylation analysis using whole-genome bisulfite sequencing. Jumping right into it, the objectives of this module are to know the different technologies that are used to measure DNA methylation. As you'll see, we'll do a bit of historical background on how DNA methylation was measured, and then also talk about some of the newer technologies. We'll talk briefly about the strengths and weaknesses of these different approaches. You'll see that it is still, in some cases, a challenge to profile the whole genome and characterize its methylation status. What we'll focus on, though, for both the presentation and the lab, is the bisulfite sequencing data analysis workflow itself, which is the most widely used approach. We'll highlight some of the principles and challenges in those analyses. Within the presentation, we'll provide a high-level overview of the workflows and the different analyses, and then you'll go much deeper into that with Hector and Maraike in the practical. So feel free to stop me if you have questions along the way. I don't see the Slack, so please raise your hand, or if anybody sees questions on Slack, just stop me and I'll be happy to answer them. All right, so here we go.
So what is DNA methylation? I guess in the ChIP-seq practical and lecture you were already talking about different types of methylation. Here we're going to be focusing on DNA methylation, and in particular on 5-methylcytosine (5mC). This type of methylation affects between 70 and 80% of CpGs in the human genome. High levels of 5mC in CpG-rich promoters are associated in particular with repression, and there's still lots to be learned about what's happening outside of these promoters and in CpG-poor regions, and about the relationship between DNA methylation, transcription and lots of other processes, which is why good assays to characterize DNA methylation are really useful and needed. Again, just a quick background to make sure that everybody's on the same page. In the ChIP-seq analysis, what you were mostly looking at were histone tail modifications. Here we're looking at the DNA itself and its methylation, and for most of the analysis we're going to be focusing on the methylation of cytosine into methylcytosine. You can have de novo methylation, but one thing that's quite specific to DNA methylation is maintenance methylation, where methylation is added to the complementary strand. So one of the unique features of DNA methylation is that you have both de novo and maintenance methylation, and the maintenance in particular can be associated with interesting processes, which is what I have on the next slide. Methylation has a demonstrated role in mitotic inheritance through this mechanism of having methylation on one strand that's then reproduced on the complementary strand. DNA methylation has been shown to be important in genomic imprinting, in the silencing of transposable elements, in stem cell differentiation, in lots of developmental processes and in distinguishing different cell types.
Again, there's a certain level of stability associated with the methylation marks once they're put in place, and that means that in terms of cellular development, this is an important mark to characterize. It also has a role in inflammation and in many other processes, in particular in cancer, and this is what I have on this slide. There are some very well-known patterns associated with methylation and cancer. You've got some cartoon examples here at the top. In the normal state, you have limited methylation in the promoter of a gene, which leads to good transcription of that gene, while you have some methylation within the gene that prevents alternative, aberrant transcription start sites from initiating. So this would be the normal state. Then you can have what's shown below (I guess I could put a pointer on this so you know what I'm talking about), where aberrant methylation patterns have occurred in one of a number of different ways, either epigenetic deregulation or specific mutations in methylation pathways. For instance, you now have methylation over the normal promoter of the gene, preventing its normal expression, and a lack of methylation within the gene, leading to alternative transcripts being expressed. So overall, control and correct deposition of methylation is quite important for normal processes, and in some cases it can be deregulated in cancer. At the bottom, you have an example of the other thing I mentioned that is quite important for methylation, which is the control of transposable elements and preventing Alu or L1 transposition in the human genome. Again, in the normal state, you have methylation over some of these repetitive or transposon sequences. This is the expected normal state, but if you have displacement of that methylation, then you might have genomic instability and transposition events that get reactivated.
And so in cancer in particular, there's lots to be characterized in terms of methylation, in particular because there are now some quite powerful pharmaceutical agents that might allow you to try to revert and modify the state back to the more normal state. Okay, but that was just a bit of motivation on why there is interest in methylation. Now, as I mentioned, I'll quickly go over some of the, I guess, traditional assays to profile methylation and then move into some of the latest technologies. Before I go into this in more detail, you'll see that one thing many of these approaches have in common is the bisulfite treatment. Bisulfite is used for microarrays and also for whole-genome bisulfite sequencing assays, so just a couple of slides on this. The bisulfite conversion has a different effect on cytosines that are unmethylated versus methylated. Through the bisulfite conversion, the unmethylated cytosines are converted to uracils, while the methylated cytosines are protected from this treatment. If you then do PCR amplification, the net effect is that the unmethylated cytosines will be read as thymines, and the methylated cytosines will not be converted and will remain as cytosines. So using this trick, if you then sequence these PCR-amplified fragments, you have a way to identify the Cs that were methylated versus unmethylated because of this distinctive effect.
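To make the conversion trick concrete, here is a minimal Python sketch of it. It is not part of the lecture: the function names, the toy sequence and the set of methylated positions are all hypothetical, and it only models the forward strand with perfect conversion and no sequencing errors.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment followed by PCR on a single strand.

    seq: DNA string (forward strand, 5'->3').
    methylated_positions: set of 0-based indices of methylated cytosines
    (hypothetical input, chosen here for illustration).
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            # Unmethylated C -> U after bisulfite, read as T after PCR
            converted.append("T")
        else:
            # Methylated C is protected; all other bases are unchanged
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Compare a converted read to the reference, position by position.

    Returns {position: True/False} for every reference C, where True
    means the read still shows a C (methylated) and False means it
    shows a T (unmethylated). Other bases at a C are skipped.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base == "C":
            if read_base == "C":
                calls[i] = True
            elif read_base == "T":
                calls[i] = False
    return calls


reference = "ACGTTCGACCA"      # toy sequence
methylated = {1}               # assume only the first C is methylated
read = bisulfite_convert(reference, methylated)
calls = call_methylation(reference, read)
print(read)
print(calls)
```

The sketch deliberately ignores the reverse strand, where the same conversion shows up as G-to-A relative to the forward reference; real aligners have to handle both cases.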
It's a little bit more complicated than this, and we'll get back to that when we talk a bit more about the analysis workflow. One of the complications comes from the fact that you have the two strands of DNA, and the effect on the two strands will appear a bit different in the processing, because on the reverse strand that C appears as a G. CpGs typically have methylation on both strands, but the two strands will have to be treated differently informatically to resolve these C-to-T conversions accurately. Again, we'll get back to this when we look at the analysis workflow. All right, so that was just a bit on the bisulfite treatment. One of the first genome-wide approaches to characterize DNA methylation was to use bisulfite microarrays. You had the DNA preparation and the bisulfite conversion, and then, a bit like what you would do using microarrays for genotyping, you would hybridize to probes that represent CpGs with and without the methylation. You would then be able to characterize which sites are methylated, and then do data normalization and analysis. Again, I'm going quickly over this just to give you a bit of a taste of what the alternative approaches are. One of the challenges with microarrays is that even with the 450K array, or some of the larger arrays now with 850K CpGs or features on them, you can cover some of the genome but not all of it. On the next slide, and if you have access to the slides it's not a big surprise, the question is: how many CpGs are in the human genome, for those who haven't looked ahead in the slides to see the answer? Oops, no, I've shown it myself. I just tried to go back and look at the Slack. But the answer is 26 million. So there are really quite a lot of CpGs in the human genome.
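As a quick back-of-the-envelope check on how sparse array coverage is, the small Python sketch below divides the number of array features by the 26 million CpGs quoted in the lecture. The helper name is hypothetical, the 27,000-probe array is the older array mentioned in the lecture, and the calculation assumes one CpG per feature, which is a simplification.

```python
TOTAL_CPGS = 26_000_000  # number of CpGs in the human genome quoted in the lecture


def coverage_fraction(n_features, total_cpgs=TOTAL_CPGS):
    """Fraction of CpGs interrogated, assuming one CpG per array feature."""
    return n_features / total_cpgs


# Array sizes mentioned in the lecture
arrays = {
    "27K array": 27_000,
    "450K array": 450_000,
}

for name, n_features in arrays.items():
    pct = 100 * coverage_fraction(n_features)
    print(f"{name}: {pct:.2f}% of CpGs")
```

Even a hypothetical array with one million features would, under the same assumption, reach only a few percent of all CpGs, which is why the choice of which sites go on the array matters so much.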
So if you're trying to profile the CpGs in the human genome with microarrays, even if you have a million features, your coverage is going to be a bit sparse. This is an old slide, just giving you a sense of that coverage. I'm not sure of the size of this region, this is a HOX cluster, but you see the density of all the CpGs in the genome. You see the locations of CpG islands, many of which are associated with the promoters of some of these genes. And this is an older version of the array that only had 27,000 probes, but you can see that the probes were really distributed throughout. So even with the newer arrays, you still have to make a selection and decide which CpGs, or which Cs, you're interested in. And that's a bit of a limitation of the arrays. Moving on to some of the other approaches, these are closely linked to some of the work that you've been doing over the last day and a half, because they are enrichment approaches. This one doesn't even involve any bisulfite treatment. You just prepare the DNA, and then you have an antibody that enriches directly for 5-methylcytosine, so for the cytosines that are methylated. So you're doing something very similar to a ChIP-seq experiment, except you have an antibody that targets 5-methylcytosine, and then you do the analysis much like you would do a usual ChIP-seq analysis. We'll get to the limitations, but one of them is that you'll get regions that are methylated, but you don't have the same base-pair resolution of methylation as you do with the bisulfite treatment. Next is another enrichment-based approach. These are some of the main approaches; there are of course even more, but this is just to give you some examples of the types of approaches.
So this is another approach, one that enriches using a methyl-binding domain protein. You would then perform a similar library preparation and sequencing, and an analysis of enrichment in regions of the genome where this methyl-binding domain protein is present. The final example I'll give, before moving on to whole-genome bisulfite sequencing, which is the main technology whose analysis we'll be describing in this module, is the RRBS technology. This uses a special restriction enzyme that targets CpGs, so this will enrich for CpG regions in the library preparation. You then perform bisulfite treatment and do high-throughput sequencing. Especially at a time when whole-genome bisulfite sequencing was still quite expensive, this was a way of bypassing sequencing the whole genome and focusing the sequencing on CpGs and CpG islands. I'll get back to some of the challenges with this approach, but this is again a way of doing efficient sequencing around CpGs following bisulfite treatment. This is at this point a bit of an old slide, but it shows that across the different approaches there are, to an extent, quite a lot of similarities in the profiles. This is a good sign, in the sense that globally you're getting a consistent signal, but each of these approaches has advantages and disadvantages. Some of them have base-pair resolution of CpG methylation, while others are more enrichment based and give you these profiles of methylation. Yes, Mary. We have a question in the Slack from Mathieu. Yeah. He asks: does the fact that methylated DNA immunoprecipitation uses an antibody-based approach mean that we should use a ChIP-seq-like pipeline if we come across this kind of data? Yes, that's correct. You would be using a ChIP-seq type of approach and detecting peaks and regions of enrichment, but not the methylation state of specific sites.
Thanks. And this is again just showing some differences in terms of regions and coverage between some of these older technologies, but I really want to spend a bit more time talking about some of the latest technologies. One of the big challenges with these enrichment-based or RRBS technologies was, a bit like the question that was just asked, the types of analysis and the types of normalization that you would do. In the case of RRBS in particular, I think some aspects of the normalization were quite challenging because of the specific targeting of the cutting enzyme. In contrast to all of this, whole-genome bisulfite sequencing has the advantage of really giving you an overview of the whole genome. You do have the bisulfite treatment, but you don't have further enrichment of some regions, and from this you get a base-level estimate of methylation. So you don't need to make an a priori selection of which regions you're going to characterize; you're really profiling the whole genome. There are many advantages in this context. The main drawback at this stage remains the cost, because you do need to cover the whole genome, typically with short reads. But at least you bypass the challenge of selecting the regions you're going to profile. There is a small twist on this technology. Just as you can do whole-genome sequencing or exome sequencing, if you want to scale this technology up to lots of samples, you can add a capture of DNA regions at the end, before you do the sequencing. So this would be another approach: essentially whole-genome bisulfite, but with some selection of the regions to enrich for, which allows you to lower the cost of sequencing and perhaps do more samples. So this is sort of a twist on the whole-genome
bisulfite sequencing approach. There are some very nice packages, which we're not going to be using here, that allow you to analyze data sets from microarrays and some of these other technologies I just talked about. Again, the nice thing is that from these analyses you see that at a high, global level the samples behave similarly no matter which technology is used, whether it's a 450K array, the EPIC array or whole-genome bisulfite. Globally it's quite similar, but of course you might be interested in specific regions that are not covered by the array, and that would be the benefit of whole-genome bisulfite sequencing. I think I've said this already a little bit, but in terms of advantages and disadvantages: they all provide you with overall accurate methylation measurements. Microarrays typically have lower cost but force you to select the regions up front. The enrichment-based methods typically have slightly lower resolution (again, you'll get regions, you won't get base-pair resolution) and can be a bit of a challenge to normalize between different experiments. The advantage of RRBS, MCC-seq and whole-genome bisulfite is that you get base-pair resolution, but whole-genome bisulfite remains expensive. Still, whole-genome bisulfite remains the state of the art for this type of profiling in large genomes, so that's what we're going to be focusing on in the practical. But I also wanted to give you a taste of some of the new technologies and what's coming in terms of methylation profiling, and I think it's quite exciting. I'm sure you've heard about the long-read technologies. The long-read technologies not only allow assembly and variant characterization in difficult regions of the genome, in many cases they also allow you to look at both DNA and RNA modifications. So just very quickly, on the left
side you have PacBio sequencing. If you're familiar with PacBio sequencing, it's based on the polymerase itself incorporating various fluorescent bases, and the PacBio instrument really takes a movie of the incorporation of all of these bases as the sequencing proceeds. The very nice feature of that is that you can look at the intervals, the time it takes to incorporate the various bases. If the strand that is being sequenced, as you see down here, has modifications including 5mC, you expect to observe some shifts in the time some of these incorporations take. So this can be mined to characterize the methylation status of the bases at the same time as you're sequencing. Nanopore does something similar, with a single strand going through the pore. The rate at which the single strand proceeds through the nanopore can be used to extract the actual base, but also whether the base is modified or not, and this works with both native DNA and RNA. There is already some software to extract the bases corresponding to the flow of that fragment through the nanopore, and you can detect the more subtle signal that corresponds to modified versus unmodified bases. This is quite exciting, and it has already started to be used to profile methylation even in quite difficult regions of the genome, like centromeres, that would be impossible to profile using arrays, or short reads for that matter. And 5mC is just the tip of the iceberg in terms of what types of modification can potentially be captured using these approaches. So that's on the plus side, and there's really quite a lot of potential. The challenge is that this remains very much in development and quite difficult, because these modifications barely affect the polymerase. If we think about the PacBio sequencing, the
effect is quite subtle, such that you really need high coverage. You have to observe this many times to be able to have confidence that there is a methylation, so in this paper they say 250x for accurate detection of 5mC. You can imagine that reaching that type of coverage using long reads for a large genome is extremely costly. So there's lots of potential, but still lots of work to do. Similarly for nanopore, they're already trying to resolve and improve the accuracy of the base calling itself, and this just adds another layer. Without very high coverage and good base-calling algorithms that take context into account, this is still quite difficult. Because of this, the current methods have a high false positive rate. They might work well with tiny genomes where you can have extremely high coverage, but otherwise it's not so practical. So that's part of why this is an exciting upcoming technology, but for the practical we're going to be focusing on bisulfite sequencing. Maybe I can pause here for a second just to see if there are any questions. I've basically finished the intro on the background, and in the second part I'll provide a bit of info on the steps of the analysis. So there's a question from Martin Wong in the Slack: are there techniques for detecting non-CpG DNA methylation, or even methylation of RNA molecules? Yes, absolutely. There's a range of techniques. I have it here: on the left side are the DNA modifications, and then the RNA modifications. These are some of the techniques using long-read sequencing that can characterize and profile all of these different marks. It's pretty crazy, I think. There are also other techniques to detect other modifications using short reads. There are modifications of the bisulfite treatment and different types of protocols that do allow you to characterize other types of marks, but each has its own flavor. And again, in this case and for the tutorial, we're
going to be focusing on 5mC as one of the most commonly studied DNA methylation marks. Thanks for the question. All right, so moving on now to the analysis of this data. You'll see that whole-genome bisulfite analysis is not the easiest type of bioinformatics analysis, and a lot of that comes from the bisulfite treatment and what it requires. I'll show that as I move through these slides, but basically we have standard quality control, we do alignment, we quantify the methylation, and then, as you did with ChIP-seq, we'll want to do some visualization and statistical analysis. Again very similar to what you did with ChIP-seq when looking at differentially marked regions, we'll do a differential methylation analysis. To start with, and this applies to any bioinformatics workflow, before you begin the analysis it's really important that you take a bit of time to look at your raw data. We're actually not going to do that in the practical for this module for lack of time, but keep it in mind. It's very similar to what you heard for ChIP-seq analysis, and I'll have a few slides on this. Were all of your samples sequenced using the same protocol and instrument? Were there any technical issues affecting some of your samples? It's really important that you look at the starting data set before you jump into fancier statistical analyses, because this might affect your interpretation, for instance if there were different batches and so on. So, very similarly, you can run FastQC on the read files that you get, just to get a sense of the quality of the sequencing and of the reads. One thing you'll notice that is a bit funny if you're doing this with bisulfite-treated sequences is that the sequences you get back from the sequencer obviously have slightly different properties in terms of the percentage of Cs and so
on because of this bisulfite treatment. Only the methylated Cs will have remained unconverted, and that's why it shifts the proportions quite a bit. RRBS data also has an unusual distribution because of the cut, since it really enriches for CpG regions. But again, to some extent the point is to make sure that if you're analyzing 10 samples, these 10 samples have similar distributions and behave as you would expect. One thing that's also going to be quite important for the methylation quantification is your library diversity. If you have a complex library, all of your reads are going to be covering slightly different intervals. If your library has a lot of PCR-duplicated sequences, you might end up with lots of copies of exactly the same read following the amplification. The tricky part is that whether these are independent observations or really just the same observation is going to affect your estimate of methylation at a given site, so it's quite important. You have it here: you have two observations, but one of them is repeated many times. So you'd be saying that this site has low methylation, when actually you just have two observations and this would be 50%. So removing duplicates is quite important to get accurate methylation estimates, and in general having a complex or diverse library will give you much better results and much more effective coverage. So the duplicate rate in your library is one of the important metrics. You have read quality, the presence of any kind of adapters in the sequences, duplicate rates, and also conversion rate. I didn't have that in detail in the slides, but how efficient the bisulfite conversion was is another metric to look at to make sure that your samples are similar. Just to give you a sense of some of the standards
used by ENCODE: ideally you have two or three biological replicates. The C-to-T conversion rate should be 98%, again to ensure that the bisulfite conversion worked. Between replicates you should have good correlation for sites with significant coverage, and so on. So again, this is just to give you a sense of the kinds of numbers to expect when you look at these QC metrics. Moving on to the alignment, this is where it gets to be quite a bit of fun. Typically we have a reference sequence and we're just mapping reads to the reference sequence. The challenge we face with bisulfite-treated reads is that the effect on the positive strand, the Watson strand, will be different than on the Crick strand. So by the time you do the alignment you have to align to different references. Basically, we take into account these potential C-to-T changes in the reference, and we have to make these changes relative to the genome on both strands and then with the reads on both strands. So you end up basically doing the mapping four times, considering each strand both for your read and for the reference genome with these potential changes. That adds quite a lot to the alignment step, and different algorithms attack this problem in different ways. There are three main strategies, and the first two are the standard ones for aligning to the genome: one is wildcard alignment and the other is three-letter alignment, and then there is reference-free processing. The wildcard approach is to replace all the Cs in the genome with a wildcard character, so that a read will align to that region whether it has a C or a T. So you absorb the mismatch that you would be getting, whether the site is methylated or not, with this wildcard. There are a few tools that use this approach, though not the one we're going to be using today. Another approach, and this is where you end up
doing four times the mapping to some extent, is to convert all the Cs to Ts in both the reads and the genomic sequence that you're trying to align to. By doing this (I'll get to the advantages and disadvantages), whether your read came from the plus strand or the minus strand, and wherever it's mapping to on the genomic sequence, you should be getting a good mapping. The tools that use this approach and the four mappings are Bismark, which we're going to be using, and some other more recent tools like gemBS. So here's a quick cartoon. This is maybe a bit technical, and it always takes me a bit of time to warm up my brain to it, but you've got the wildcard alignment, which converts the Cs, the CGs, to these Y wildcards, versus the three-letter alignment. The big take-home is that by converting all of these to wildcards you're losing a lot of the specificity in the mapping. You end up mapping lots of things, including in places where they don't really belong, and this will have an impact on your estimates of DNA methylation. So typically the three-letter alignment methods are more specific and tend to be preferred, even though they are sometimes less efficient and might lose some sensitivity. So, some strengths and weaknesses of these two main approaches. Three-letter aligners have lower coverage in highly methylated regions, because they convert all the Cs, so in regions that are highly methylated you might be losing some reads. So they have a weakness in these regions. The wildcard aligners map more, but at the cost of losing specificity and having some biases because of all of these changes. A lot of these problems are especially prevalent in repetitive regions of the genome, so this goes back to what I mentioned about the long reads that ultimately will be
needed to resolve some of these difficult regions of the genome. But in general, for most regions of the genome, these approaches perform quite well. I'm not going to spend too much time on this, but this is the approach used by Bismark, which we're going to be using in the practical. It does this four-way mapping and then determines the unique best alignment. I mentioned briefly that there's another tool, called gemBS, that I know we've been using within IHEC and that works well. gemBS has the nice feature of being more efficient and taking less time than Bismark, but Bismark has been used quite a lot and it's the one we're going to be using in the practical. We've simply downsampled the samples so that we can process them in the time that we have in the practical. I'm going quickly over this because we're running out of time and I want to leave a bit of time for questions, but in the practical we'll also have more time to talk about this. I just wanted to mention briefly that there's a third approach, called reference-free processing, that might be something to explore in some of these difficult regions, although I haven't seen a lot of recent results pushing in this direction. There's the concept of reference-based variant detection. This is now thinking about variants in general, not just methylation, but the general idea is that you map to a reference and then you look for mismatches, and this is how we're going to be looking for methylation on the cytosines. But just like with this type of variant detection, you can also think of ways of doing it using the equivalent of k-mers, basically doing the same type of approach but without a reference, just comparing reads from two sets, tumor and normal, and then looking for variants. So similarly, and this was a tool that I haven't seen used that much, there is work exploring whether the same
kind of strategy could be used for methylation. So that is all about the alignment. Moving on now to actually estimating and quantifying DNA methylation itself. We'll do a bit like what was shown in the cartoon with the duplicate reads: after the alignment onto the genome, we'll be able to count the unconverted versus converted cytosines at each position by looking at these Ts instead of Cs and so on, and from this we can estimate a methylation percentage. And if you thought bisulfite alignment wasn't complicated enough yet, there is the fact that SNPs add another layer of complexity to all of this. We assume that the reference is the same for everybody, and obviously it's not. If you have SNPs, this further complicates the conversion of the reference, the mapping of the reads and the estimates that you get out of whole-genome bisulfite sequencing. There are some tools out there that do the SNP and variant calling at the same time as the counting of conversions and mismatches coming from methylation, but unless you have very high read coverage this can be quite challenging. An alternative is sometimes to ignore the SNPs, to some extent, in the alignment step and then mask them from the results and the quantification of methylation, because the quality of the estimates over these variants will be lower. Some of the tools that can be used for doing both the SNP and methylation calling are listed here, but this is also not something we're going to be covering in detail in the practical. Okay, moving on a little bit. We've aligned our reads, we've quantified the methylation status, and we'll want to look at the data and make sure that things behave as we would expect. One of the very nice things with IGV, which you're familiar with, is that it has a methylation-specific mode where these mismatches that
correspond to methylated Cs can be highlighted in different colors. This is one of the things that we're going to be playing with in the practical. Instead of looking at mismatches, you've got the general coverage, and what you're showing is the methylation status, to the point where you can see regions that are differentially methylated in two sets of samples, corresponding to the kind of cartoon examples that I showed you at the very beginning.
So again, that's the kind of stuff we're going to be doing in the practical. You want to look at some of the regions in your genome, look at the global distribution of methylation values, and start doing some clustering of samples and so on. You expect the percentage of CpGs to be either highly methylated in a subset of cases or unmethylated for the majority of the genome, and you also want to look at the coverage to see whether you have good and uniform coverage. Whether your estimates are reasonable or not, you're going to see that, if you have replicates and different conditions, through these correlation analyses. Again, this is the kind of stuff that we're going to be doing in the practical, just comparing the values that we're getting across different cell types and replicates. These are examples of the kind of clustering you can do with methylKit, which is the tool we're going to use in the practical.
So I'm coming to the end, maybe just another five minutes, and then we can take a few questions if there are any. Once we've looked at some of these general properties and the samples behave the way you expect, we might want to start, as we did with ChIP-seq, identifying regions that are differentially methylated and so on, moving into a global analysis of these. So here's a
cartoon, from a review that somehow I'm not citing here, that I like a lot. I think it highlights the advantage that we talked about at the very beginning, of enrichment approaches versus single-base-pair resolution approaches. With whole-genome bisulfite sequencing, we ideally have base-pair resolution of the methylation level of CpGs. If you have a nice study with cases and controls, whether it's tumor versus normal or whatever, you can see the differences in methylation status at the level of individual CpGs. You can do single-CpG analysis to identify the ones that are significantly different between cases and controls, but you can also move to a more regional analysis, either using tiling regions or having a way to identify regions with significant signal, and this in some cases might be even more significant. Then you can associate that with regions of the genome like enhancers and so on.
Obviously, like in any of these types of experiments, it's good to have replicates. You might have one sample that behaves a bit differently, or quite differently, if you look at the two pink samples, and it's not clear that the average here is very representative. So having replicates will help you understand the confidence in your signal. Also, in the context of methylation, the signal can sometimes be a bit noisy at individual CpGs, so in some cases you want to actually do some smoothing prior to identifying the regions of interest. You might have undetected, uncharacterized SNPs that are leading to CpGs misbehaving, or, since the alignment as I mentioned is quite challenging, you might be losing some regions because of that. So some smoothing might help resolve this. One of the tools that you're going to be using in the practical to do this, and to try to identify these regions that act as DMRs, is a tool called DSS, which can be used to identify regions where these two samples here behave differently from
these two.
All right, I think this is my final slide for this. I like this figure as well because, especially if you focus on the right side, it shows that there's a trade-off depending on how much you sequence, whether you have very shallow sequencing, in gray, or very deep sequencing, such that you're covering the genome with your bisulfite-treated reads at between 10 and 30x. If you have very shallow sequencing, you might be able to detect very big regions that have very big methylation differences, but you're probably not going to be able to do a very good job at single-CpG-resolution characterization. As you sequence more, and if you reach 10x or more of the genome, then you're going to start being able to characterize methylation much more precisely, in terms of smaller regions and smaller methylation differences. But it's always a bit of a trade-off depending on how much you sequence.
Okay, so some conclusions. Bisulfite sequencing analysis is not easy, to be honest. Especially these issues about the reference genome and the SNPs: those are not trivial, and they're actually not trivial on the software side of things either. Having four bases in the genome makes the mapping much more unique. If you think about the fact that we're now removing a bit of that by converting some of the bases, we're working with something that looks more like a three-base genome, and that really makes some of these mappability issues bigger. I talked quite a bit about the different choices of methylation technologies, and some of the quality checks and the biases. But maybe I'll stop here, and we have just a few minutes. There's one more slide, I guess, that I didn't show, but that was only to talk about some examples of available bisulfite datasets that you might use, and David will talk about this much more tomorrow. So maybe I can stop here and see if there are any questions on all of this.
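As a rough illustration of the counting step described in the talk, the methylation level at a cytosine can be estimated as the fraction of reads that still show a C (unconverted, so methylated) out of all reads showing either C or T at that position. The sketch below is only that, a sketch: the function name, the `min_coverage` threshold, and the example counts are all hypothetical, and this is not how Bismark or methylKit implement it internally.

```python
from typing import Optional


def methylation_percentage(
    unconverted_c: int, converted_t: int, min_coverage: int = 1
) -> Optional[float]:
    """Estimate percent methylation at one cytosine from read counts.

    unconverted_c: reads showing C (protected from conversion, so methylated).
    converted_t:   reads showing T (converted by bisulfite, so unmethylated).
    min_coverage:  hypothetical threshold; positions below it return None
                   because the estimate would be too noisy to trust.
    """
    coverage = unconverted_c + converted_t
    if coverage == 0 or coverage < min_coverage:
        return None
    return 100.0 * unconverted_c / coverage


# Hypothetical per-position counts: (unconverted C reads, converted T reads).
counts = {
    ("chr1", 1000): (9, 1),   # mostly unconverted
    ("chr1", 1010): (0, 12),  # fully converted
    ("chr1", 1020): (1, 1),   # too shallow to call at a 10x threshold
}

levels = {
    position: methylation_percentage(c, t, min_coverage=10)
    for position, (c, t) in counts.items()
}

for position, level in levels.items():
    print(position, "not enough coverage" if level is None else f"{level:.1f}%")
```

The coverage threshold is where the sequencing-depth trade-off from the last slide shows up: with shallow data most positions fall below it, so single-CpG estimates are unavailable and only larger regions, pooling many CpGs, can be assessed.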