Okay, welcome everyone. Today I'm going to be talking about differential analysis and how to handle it for your newly called ChIP-seq peaks. Some of the learning objectives I'll try to cover today: what kinds of questions you can ask to better inform your analysis as you're looking at your ChIP-seq data. There are a variety of ways to perform your differential analysis. One we've touched on, but that I'll go into in more depth, is comparing via genomic coordinates, that is, identifying overlaps and elements that are not overlapping between your files. To do so, we're going to use bedtools. We'll also bring in the context of coverage and amplitude; we'll need a couple of tools for that, but the main statistical driver will be edgeR. Then we'll cover some ways we can handle replicates; the main one I'll be going over is DiffBind, an R package. I also want to take some time to highlight other considerations when analyzing your ChIP-seq, because sometimes it isn't as straightforward as we would like, and to point out some additional resources for you: a tool called deepTools for QC and generating plots, and some existing ChIP-seq pipelines out there that you can leverage. Okay, so on day one we aligned our reads and called our peaks, then we did some QC to remove the blacklist regions, and now you have your peaks. So what can we do with that? In terms of analysis, there are two analysis pathways we can take. Recall that your enriched regions are really just sets of coordinates. You can find whether your sets of coordinates overlap.
And you can also find whether your sets of coordinates do not overlap, in which case they would be considered unique to each condition. Now, how to interpret overlapping and unique peaks differs vastly according to your model or experiment type. What I mean by that: let's go into the first example. Say in example A your first set of files is H3K27ac for condition A, and your second set of files is H3K4me1 (H3K4 mono-methylation) for condition B. When we perform an overlap, what we are essentially getting is a set of regions enriched for both H3K27ac and H3K4me1. So what does that mean? Well, recall the study that Martin highlighted, where they found regulatory elements carrying certain histone marks: H3K27ac and H3K4me1 often co-occur at distal regulatory elements that we call active enhancers. Active enhancers then go on to affect promoters, increasing transcription, and so on. For unique peaks: what does it mean to have H3K27ac on its own, not coinciding with H3K4me1? Going back to that same paper, H3K27ac can co-occur with other marks at different regulatory elements. For example, H3K27ac can occur at active promoters, where those promoters would instead share it with H3K4me3. There can also be bivalent promoters, locations where the promoter shares it with H3K27me3. And what does it mean for H3K4me1 to be on its own? We've seen examples where H3K4me1 marks a primed enhancer which, following some change, becomes active. Let's take a look at another example. What if we are comparing, say, H3K27ac across two cancer types? If we find shared peaks, that would indicate that they perhaps share common regulatory elements, such as shared enhancers or shared activity at promoters.
So that suggests they have common mechanisms or regulatory pathways. Regarding unique peaks, those could mean, for example, disease-specific H3K27ac. And to cover one more model: say you had a cell line that you then induced or left uninduced for, I don't know, say a transcription factor, and you were looking at H3K27ac. Shared peaks between your induced and uninduced states would mean that your H3K27ac dynamics are independent of the treatment. But if you find unique H3K27ac peaks, that would mean one set of H3K27ac was gained following the induction of your transcription factor, and another set of H3K27ac peaks was lost following the induction. So, just to summarize all of that: the types of analysis we can do are identifying overlapping and unique peaks, but the interpretation of your overlapping and unique peaks will depend on your type of experiment. Keep that in mind as you are asking questions and trying to formulate an analysis around your experiments. So, how do we identify genomic overlaps? You will find in a lot of papers a figure such as this, a Venn diagram showing how many regions are exclusive to one condition, how many regions are exclusive to another condition, and how many regions are shared. How do we get to this point? Recall that your enriched regions are in BED format, which is really just a set of genomic coordinates for each feature in your two files. So we can find the commonality, the overlap, between your two files using a tool called bedtools; I'll go into more detail about bedtools later. To identify the commonalities, we can use the command bedtools intersect, listing our first file with -a fileA and our second file with -b fileB; that reports the elements of file A that overlap file B.
So that would be the 61,000 and the 147,000 here. If we do the same but switch file A with file B, the command becomes bedtools intersect -a fileB -b fileA, and that returns the elements of file B that overlap file A, so the 147,000 side of the intersect. Adding -v inverts the command: instead of returning the intersect of file A and file B, it returns the elements unique to file A that are not found in file B. So in this particular example, bedtools intersect -v -a fileA -b fileB gives us the 61,000. Now, the neat thing is that file B can be a variety of file types. It can be a BED file containing a set of coordinates, it can be a BAM file, and it can also be a VCF. If file B is a BAM, bedtools intersect will give you all the regions of file A that contain reads from your BAM; if file B is a VCF, it will give you all the regions that intersect a variant. One problem I would like to highlight when generating this kind of analysis is that your intersects are not always one to one. Take this example down here, where this is your file A, this is your file B, and each of these blocks is an element from one of the files.
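To make the intersect logic concrete, here is a toy Python sketch of what these bedtools intersect calls compute. The intervals are invented for illustration; in practice you would of course run bedtools itself on your BED and BAM files.

```python
# Intervals are (chrom, start, end), half-open, following the BED convention.

def overlaps(a, b):
    """True if two (chrom, start, end) intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def intersect(file_a, file_b, v=False):
    """Mimic `bedtools intersect -a file_a -b file_b`:
    report elements of A that overlap some element of B,
    or, with v=True (like the -v flag), the elements of A with no overlap in B."""
    hits = [a for a in file_a if any(overlaps(a, b) for b in file_b)]
    if v:
        return [a for a in file_a if a not in hits]
    return hits

file_a = [("chr1", 100, 200), ("chr1", 500, 600)]
file_b = [("chr1", 150, 180)]

print(intersect(file_a, file_b))          # [('chr1', 100, 200)]
print(intersect(file_a, file_b, v=True))  # [('chr1', 500, 600)]
```

Swapping the roles of file_a and file_b mirrors swapping -a and -b on the command line, which is why the counts can differ in the two directions.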
So we can see here that this particular block in file A intersects multiple elements in file B. This will affect how you count, because do you report this number relative to A, or do you report this number relative to B? One method to solve this issue is to merge your intersect together, and by merging, what that means is you take the left-most coordinate and the right-most coordinate and report this whole section as a single element. And how do we do that? Through the additional features of bedtools. bedtools is actually a really great tool that I hope everyone will become more accustomed to using as you explore and learn how to perform ChIP-seq analysis, just because it is so rich in features. Since a BED file can really be a set of genomic coordinates for anything, bedtools offers a lot of functionality. The one feature I mentioned was merge: given a BED file with various elements, it will merge elements that are overlapping into a single one. If you provide a certain distance, it can also merge elements within that distance into a single one, and it will merge book-ended elements into a single one as well. Very practical and useful. Another function I would like to highlight is called map. Say file A is your enriched regions, and file B is, I don't know, your regions of methylation, for example. What map does is take the elements of B that intersect A and summarize the values that accompany each of those elements; for example, you can use it to calculate the mean methylation co-occurring with each region in A. So this allows you to combine your analysis with other epigenetic assays, such as bisulfite sequencing.
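As a rough illustration of the merge behavior just described (overlapping, book-ended, and within-distance merging), here is a toy single-chromosome sketch with invented intervals; real data should go through bedtools merge itself, which also handles chromosomes and sorting.

```python
def merge(intervals, d=0):
    """Toy `bedtools merge` on one chromosome: sorted intervals within `d`
    bases of each other collapse into one element. With d=0 this still joins
    book-ended intervals, where one ends exactly where the next begins."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + d:  # overlapping, book-ended, or within d
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

peaks = [(100, 200), (150, 300), (300, 400), (450, 500)]
print(merge(peaks))        # [(100, 400), (450, 500)]  overlapping + book-ended joined
print(merge(peaks, d=50))  # [(100, 500)]              a 50 bp gap is also bridged
```

The d parameter plays the role of the -d distance option mentioned above.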
And one of the last features I would like to highlight for bedtools is the closest function. Given file A and file B, it reports the element of B closest in proximity to each element of A. It has been extremely useful for me. Say, for example, you find an enhancer, but the problem is, it's hard to contextualize what your enhancer is doing. One common assumption is that your enhancer is operating on a nearby TSS, so using something like closest can help you identify the closest TSS. And just to highlight the interoperability of bedtools: it's available on the command line, which we will primarily be using; you can download it through conda; it has wrappers in R and Python; and it's also available in a Docker image. As I noted before, it's not limited to your peaks, since BED files can describe any feature, including plus or minus 2 kb around your TSSs, exons, methylated regions, or TADs. So you can do a lot to manipulate your files, and it's also good for speedy analysis. One issue I would like to highlight when you're just doing genomic overlaps is that they make very simplistic assumptions. For example, here I have H3K4me3 for two conditions. In one condition I have two peaks here, and in the opposing condition I have only one peak. The highlight indicates a peak that is supposedly unique to the top condition, but if we check the amplitude in the tracks, that is not the case; this would be a false positive from a purely binary comparison. Similarly, in this scenario with H3K27ac for two conditions, this peak and this peak are technically both present, but if we compare the height, the amplitude of the tracks, one is vastly higher than the other. I think this is another very obvious example.
So the question then becomes: how do we account for differences in coverage and amplitude? To do so, you'll often find analyses such as this, where the X and Y axes are the signal (coverage, amplitude) from each of the opposing conditions, and each of these dots is a peak from a unified peak call set. That means we take the peaks from, in this example, the wild type and the peaks from the mutant, read them in together, sort and merge them, and get a unified peak set. Because that is just a BED file, we then want to derive something like a bedGraph, where a fourth column carries the coverage (pileup) information to inform our X and Y values. To do that, we use bedtools, utilizing the intersect feature and counting the number of reads that intersect each element in our unified peak call set. That basically gives us a data frame we can read into R. Once it's in R, we can find differential peaks through fold change, or we can supply it to statistical software that applies normalization to account for library depth, commonly used ones being edgeR and DESeq2, which will give us a combination of fold change and a p-value. We then apply certain cutoffs, and we have the differential regions in red here and in blue here. To highlight the potential of this particular analysis: if we had just used bedtools and only annotated regions as exclusive based on genomic coordinates, what we would essentially get are only the regions along the Y and X axes. We would miss the context of everything in between, roughly along the diagonal. So, to summarize, here's what this workflow looks like.
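Before the summary: the counting step that produces the signal data frame can be sketched in miniature. Here reads are reduced to their start positions and all the values are invented; the naive fold change is only to show the idea, since in practice edgeR or DESeq2 would add library-size normalization and proper statistics.

```python
import math

def count_reads(peaks, read_starts):
    """Number of read starts falling inside each (start, end) peak."""
    return [sum(p0 <= s < p1 for s in read_starts) for p0, p1 in peaks]

unified_peaks = [(100, 200), (500, 600)]
wildtype_reads = [110, 120, 150, 550]     # read start positions, condition 1
mutant_reads = [130, 510, 520, 560, 580]  # read start positions, condition 2

wt = count_reads(unified_peaks, wildtype_reads)   # [3, 1]
mut = count_reads(unified_peaks, mutant_reads)    # [1, 4]

# naive per-peak log2 fold change with a pseudocount:
lfc = [math.log2((m + 1) / (w + 1)) for w, m in zip(wt, mut)]
print(wt, mut, [round(x, 2) for x in lfc])  # [3, 1] [1, 4] [-1.0, 1.32]
```

The two count columns are exactly the X and Y of the scatter plot described above.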
We start with our peaks and our BAMs. We read our peaks into bedtools and merge them to get a unified peak set. We then supply this peak set, along with our BAMs, back into bedtools, where we annotate each peak with the number of reads found there. This produces a data frame; we can read that data frame into R and use a statistical analysis package such as edgeR to do the testing. From there we can apply a threshold on Q value and/or fold change, and we get our differential peaks versus the rest. Moving on, I would like to talk about replicates. One strategy we can use is called DiffBind. DiffBind is essentially a wrapper: assume each of these is an experiment, each of these lines is a track, and each of these bars is the pileup for one of your replicates. We perform a peak call with MACS2 per replicate and get the enriched regions, in red. We feed all of that into DiffBind, and what DiffBind does is similar to before: it merges all the peaks across all three replicates, so you get one set of unified peaks. It will also do the counting for us, counting per replicate the number of reads in each peak. It then applies normalization and performs a statistical test to give us an appropriate fold change, and it will calculate a Q value for us, so we can filter by fold change and Q value as we did before. Some cons of using DiffBind: you need replicates, ideally three or more. Another problem is that your merging can go poorly if your peak calls weren't great to begin with; particularly with broad marks, the analysis doesn't always go according to plan, so it might not perform that well. But if you have three replicates, this is a great package to use.
If you don't like the default overlap function, DiffBind has other overlap functions you can use to make it a bit more specific. And, being a wrapper, it has access to both edgeR and DESeq2, so you can use either to find your significantly enriched regions, according to your choosing. Regarding edgeR and DESeq2: I've mentioned them quite a bit, so I just want to give a refresher for those who are not as familiar with them. I'm not a statistician, so I might butcher some of these explanations, but bear with me. Both are software packages very commonly used in RNA-seq. They make the assumption that differential events are rare, and they model counts with a negative binomial distribution. Where they can differ is in their normalization methods. edgeR uses something called the trimmed mean of M-values (TMM): it looks at each sample and, assuming that most features are not different between samples (so the typical log-ratio should be close to zero), calculates a trimmed, weighted mean of the log-ratios per pairwise comparison and scales each library accordingly. DESeq2 does something similar with its median-of-ratios method: within each sample it looks at each gene and asks where that gene sits relative to all the other samples in terms of a geometric mean, takes the ratio of each gene's count to that geometric mean, finds the median of those ratios, and scales each library according to that median ratio. In terms of the statistical tests: edgeR has two modes. If you are just comparing two groups, edgeR performs an exact test; if you are performing an analysis on multiple groups, edgeR can use a generalized linear model. DESeq2 uses a Wald test. So, again, just summarizing things: for your experiment, you can ask yourself, do you have replicates?
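As a sanity check on that description, here is a small sketch of the DESeq2-style median-of-ratios idea with invented counts. It assumes all counts are nonzero; the real implementation in DESeq2 is more careful.

```python
import math
import statistics

def size_factors(counts):
    """counts: one list of per-feature counts per sample (all nonzero here).
    Returns one scaling (size) factor per sample."""
    n = len(counts[0])
    # geometric mean of each feature across samples
    geo = [math.exp(sum(math.log(s[i]) for s in counts) / len(counts))
           for i in range(n)]
    # each sample's size factor = median ratio of its counts to the geometric means
    return [statistics.median(s[i] / geo[i] for i in range(n)) for s in counts]

# two samples over three features; sample B is sequenced twice as deep
print(size_factors([[10, 20, 30], [20, 40, 60]]))  # ~[0.71, 1.41], a 1:2 ratio
```

Dividing each sample's counts by its size factor puts the libraries on a comparable scale before testing.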
If not, we can then go into the previously described pipeline, where we take our peaks, merge them into a unified peak set, then use a tool to get the coverage from each of our BAM files over that unified peak set. We then create a data frame, supply that to edgeR, and filter for significant peaks above a certain fold change. DiffBind is nice because it basically does all of that for you, and you can call it from R.

Edmund, sorry to interrupt, Monica has her hand up. Just before you continue.

I use edgeR for transcriptomics, it's my preference. Is there any reason why you prefer edgeR here for ChIP-seq as well, or would DESeq2 work similarly?

In this particular scenario I'm just using edgeR because it works better for the tutorial we're about to go through. But really, you can switch between DESeq2 and edgeR. I do believe DESeq2 only plays nicely if you have replicates, at least two or more per group in your comparison. So if you only have two samples that you're comparing, you might want to use edgeR.

Okay, that makes sense. Yeah. Before we move on, I think there's a question in the Slack that Jose has answered, but he also says he wants your opinion. From Desmond, a bit of a theoretical question: I recall MACS2 assumes the ChIP-seq data follows a Poisson distribution, while both edgeR and DESeq2 assume the data adheres to a negative binomial distribution. Wouldn't this lead to a bias in the downstream analysis, or does the negative binomial model each package uses generalize to the Poisson when working with ChIP-seq, or is this not even an issue?

I think my take on that is that the two assumptions are being made about different things: the assumption MACS2 makes for ChIP-seq peak calling is a different assumption from the one edgeR and DESeq2 make, because, again...
Sorry, go ahead; I just want to interject with what I answered, Edmund. This is how I see it: the assumptions are being applied to different targets. What MACS2 is analyzing is reads; what edgeR and DESeq2 are analyzing are genomic features, whether those are peaks or something else. So they can be different distributions and still not be contradictory, right? The peaks can have one distribution, but the underlying reads can have a different distribution. I will also say that the negative binomial distribution is used because of the nature of next-generation sequencing count data per se; that's why it's used there.

Well, maybe just to interject. The negative binomial distribution works for expression, for gene and transcript abundances. But does that model apply in the context of ChIP-seq distributions? I haven't seen any evidence that that's the case. DESeq2 is designed for doing differential analysis across large groups of samples, where you have many, many measurements, tens or hundreds of measurements; it assumes a mean and uses a negative binomial to assign significance to the differential expression. ChIP-seq is a different beast, because you almost never have that number of samples, although the new version of DESeq2 of course allows you to work with a lower number of replicates. I'm always very cautious about using DESeq2 for differential analysis, and I prefer to use DiffBind or deepTools for any differential analysis that depends on amplitude. I mean, this is a big discussion and maybe we'll want to work through it, but think about what amplitude even means in the context of ChIP-seq data: amplitude is a reflection of occupancy of that mark within the population of cells that you're measuring.
It's not a measure of strength in the sense that we think about it in gene expression. Amplitude in gene expression means that the gene is expressed more; amplitude in ChIP-seq means that more of the cells within your population carry that mark at that particular position, and we have no understanding of what the minimum and maximum are. Taking amplitude into account, I think you have to be very, very cognizant of the biases in the data and the challenges there when undertaking such an analysis. I think the first step is to do a binary analysis with an FDR cutoff on peaks that were generated using a Poisson distribution; that would be my first approach. I would be very cautious about using amplitude, but that's just my opinion, and obviously this is an area of ongoing research and development. You really need to think about what amplitude means in ChIP-seq, which is different from transcripts. Anyway, I'll be quiet, please continue.

No, I definitely appreciate the input from everyone. Thank you. Okay, moving on. So another method to utilize your replicates, if you don't want to call peaks on them individually and analyze them individually: again, each of these tracks would indicate a separate replicate, and each colored bar the corresponding reads. You would simply combine your three BAMs into a single one, and then perform the same binary analysis we highlighted previously. Some pros and cons for this particular approach: it allows you to use the same strategies we highlighted in the binary comparison, but one issue is that it assumes there is tight correlation between your replicates. For example, in this crude drawing you can see that one replicate is vastly different from the other two, so the pooled track may not represent what each replicate is actually showing.
Or it may distort some of the results. So, coming back to the problem with broad marks that I briefly talked about: when you're working with broad marks, the merging result can get messy. For example, looking at these three replicates, if we were to merge them, you essentially get this extremely large feature, and honestly it's hard to say whether that's wrong, because maybe you do want something that big; H3K27me3 domains can look that large occasionally. But because of how MACS2 performs broad peak calling, it doesn't always have the best resolution for boundaries. So some additional ideas I would like to highlight for wrangling your broad marks: like I mentioned, broad mode works, but it can have poor boundary resolution for your broad peaks. If you are just trying to identify a differential broad mark, what you can do instead of calling peaks is generate a set of overlapping windows using, for example, bedtools. This will essentially bin the genome, and then you can apply the same approach: instead of a read count per enriched region, you do a read count per window, and then apply a statistical test to find enriched windows and use those. There are other tools that also try to address this broad peak calling problem, for example SICER2, and there are R packages as well; this one, called csaw, uses the overlapping genomic window approach to help find differential broad peaks. Another item I would like to highlight as you are trying to perform your ChIP-seq analysis is to keep the big picture in mind: what are you after? What we really want is to identify a set of regions that gives insight into a cell population's epigenetic state. Even if you find a differential region for, say, H3K27ac, that's not the whole story.
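Circling back to the overlapping-window idea for broad marks for a second: the tiling itself is simple, along the lines of what bedtools makewindows produces with a window size and a step. A toy sketch with an invented chromosome length:

```python
def make_windows(chrom_length, size, step):
    """Tile one chromosome into windows of `size` bases every `step` bases;
    with step < size the windows overlap. The last windows are clipped to
    the chromosome end."""
    return [(start, min(start + size, chrom_length))
            for start in range(0, chrom_length, step)]

windows = make_windows(chrom_length=1000, size=400, step=200)
print(windows)  # [(0, 400), (200, 600), (400, 800), (600, 1000), (800, 1000)]
```

Each window then gets a read count per sample, exactly as the unified peaks did, sidestepping the broad-peak boundary problem.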
Based on what we learned from the ENCODE paper, H3K27ac usually co-occurs with other marks. So if we found a differential H3K27ac region, we can learn a lot more by also looking at, for example, H3K4me1 to identify whether it is an active enhancer. If it is an active enhancer, you would usually see chromatin accessibility also increasing at that location, to allow recruitment of, say, a transcription factor; and speaking of transcription factors, you might see increases in their binding as well. Additionally, if it is a newly formed enhancer, does it have any consequence downstream on the distally regulated gene it is acting upon? Do you see an increase in transcription, for example? Another idea to keep in mind is that ultimately we want to link our epigenetic state change to some sort of functional genomic feature. So say, for example, you called your H3K4me3 peaks, found your differential H3K4me3 peaks, and then intersected them with your promoters; essentially what you have is a set of promoters with differential H3K4me3 events. That is one method, but perhaps a quicker way to get that result is to look at our genomic features directly: do the count for H3K4me3 over the promoters themselves, apply the workflow of counting how many H3K4me3 reads are there, and apply a fold change and a significance test, to directly get all differential H3K4me3 promoters. That will give us essentially the same result. Also, consider the computational cost. As Martin highlighted, bedtools and genomic overlap is the first way to start exploring your data, and it's also one of the fastest, though as we've highlighted, there are some caveats. Then there are edgeR and DiffBind.
All of those are packages in R. R isn't known for speed, and it can have trouble wrangling large data sets if you have a lot of them, so you may have to come up with workarounds, such as doing the overlap counts yourself and then supplying those into R to perform the analysis. Another item I would like to highlight, used when you're doing analysis for ATAC-seq and also surprisingly applicable to ChIP-seq, is the idea of the irreproducible discovery rate (IDR). The idea here is asking the question: if I have two technical replicates, two biological replicates, or even pseudo-replicates (say I subset my library down to 90%), how often am I able to call that same peak? In diagrams A and B, the idea is that, regardless of depth, across your two technical replicates, biological replicates, or pseudo-replicates, your relative signal should always be similar, falling along this diagonal. So say you subsampled your library and the signal for a peak jumped from here to here, deviating off the diagonal; that suggests the peak is not highly reproducible. Similarly, if we look at it based on ranking: you would expect one of your top peaks to be consistently ranked at the top, regardless of whether your library is at the normal depth or at 90%. If its rank vastly changes, that might suggest that peak isn't particularly reproducible.
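The rank-consistency intuition can be sketched like this, with invented signal values; note this is only the intuition, as the actual IDR method fits a model to the two replicates' ranks rather than comparing them directly.

```python
def ranks(signal):
    """Rank of each peak by signal (0 = strongest)."""
    order = sorted(range(len(signal)), key=lambda i: -signal[i])
    r = [0] * len(signal)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

rep1 = [95, 80, 40, 10]   # per-peak signal in replicate 1
rep2 = [90, 12, 35, 30]   # same peaks in replicate 2; the peak at index 1 collapsed

# how far each peak's rank moves between replicates
shifts = [abs(a - b) for a, b in zip(ranks(rep1), ranks(rep2))]
print(shifts)  # [0, 2, 1, 1]: the second peak's rank moved most, least reproducible
```

Peaks whose ranks stay stable across replicates or subsamples are the ones you keep as high confidence.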
So this is an example of a good library that has high reproducibility, and an example with low reproducibility: as you can see, after you subsample, the peak signal vastly changes and doesn't scale accordingly. Even in terms of rank, if your peak, for example, was ranked high up, but after you subsample it falls to the wayside, all the way to the bottom, that might suggest the peak isn't particularly reproducible. The reason I highlight this particular software for ATAC-seq is that, again, ATAC-seq doesn't have an input background, so applying an IDR threshold can be another way to get a more high-confidence set of peaks. So, a few more items I want to highlight. deepTools is an awesome tool with a lot of rich features. Like MACS2, it can generate bigWigs, and you can then use those bigWigs to produce figures. One figure I would like to highlight, which you might see very often, is this heat map with its meta plot. The idea of the heat map is that each row is one of your peaks, and the heat would just indicate the normalized coverage at that peak. TSS and TES here would just be the start and end of your peak, and you can see that it extends a little outwards from those regions. Using this, you can demonstrate the peaks that have low signal and the peaks that have high signal, and you can use it to compare across various marks. For example, if you were to look at this particular replicate, it would be showing no signal at these peaks, while the red replicate would be showing a lot of signal at these high peaks. Oh, sorry, I also forgot to mention: this up here is the meta plot.
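To make the relationship between the two panels explicit: the heat map is a matrix of normalized coverage (one row per peak, one column per position), and the meta plot is just its column-wise average. A toy sketch with invented values:

```python
# one row per peak, one column per position around the peak
coverage = [
    [0, 2, 8, 2, 0],   # peak 1
    [0, 4, 6, 4, 0],   # peak 2
    [0, 0, 4, 0, 0],   # peak 3
]

# the meta plot: mean coverage at each position across all peaks
meta = [sum(col) / len(col) for col in zip(*coverage)]
print(meta)  # [0.0, 2.0, 6.0, 2.0, 0.0]
```

deepTools computes this same aggregation for you from a bigWig and a set of regions.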
This is just a summary of all the information in the heat map, to denote the trend. This particular analysis is also useful for doing QC. For example, if instead of peaks you supplied promoter regions at the TSS, we typically expect to see enrichment at the TSS for H3K4me3. If that deviates, it might be a reflection of biology, or something went wrong, such as a sample mix-up, for example, etc. deepTools is also good because you can use it to do a correlation plot. This will give you a dendrogram and a heat map where you can see the correlation across all your marks. This is useful when comparing marks: a set of narrow marks tends to be more closely associated, while broad marks tend to cluster off by themselves. So this is good, again, for QC, to see expected relationships. I do want to warn that if you are incorporating multiple sample types, the biological differences can tend to overwhelm the histone mark differences in certain cases. Additionally: assuming you understand ChIP-seq, you know the steps and you know what analysis to do, but let's just say you don't have the time to do all the scripting and coding yourself. Various consortia have already designed ChIP-seq pipelines, so you can analyze your data in the same way those consortia have. The added benefit of doing that is you can then take your outputs and compare them to the outputs the consortia have generated, since both will have been analyzed in a uniform manner. I'm highlighting the ChIP-seq pipeline developed by ENCODE, which Martin, and I believe Guillaume, are part of; there's a matching ATAC-seq pipeline as well. You should see a lot of familiar elements: you have the reads, you align your reads,
you mark duplicates, you sort, you take your alignments, you send them into MACS2, and then you derive your peaks. Another set of pipelines I would like to highlight, if ENCODE isn't your flavor: nf-core also has a set of pipelines, one for ChIP-seq and another for ATAC-seq. Again, you will see a lot of familiar elements: FastQC to take a look at your files, BWA to perform alignments, Picard to mark duplicates, filtering, and somewhere along here MACS2 to call your enrichment. Okay, so in conclusion: consider the kinds of questions you can ask when you are comparing your peaks across different kinds of experiments. One of the first analyses you can do is simply use bedtools to find overlaps and unique peaks. The general methodology for utilizing amplitude when comparing your peaks is to build a unified peak set, contextualize it with read counts to get a signal data frame, and then perform some sort of statistical analysis to get your differential peaks. We also briefly talked about ways we can handle replicates; for example, to find your differential peaks, you can use bedtools for the genomic intersects, and you can feed your signal data frame to edgeR or DiffBind. And I hope I was also able to highlight some of the considerations and complexities that you may encounter when you are working with peaks. So I think this is a good point where I can take any questions.