In this part of the module, we will learn how to place your short reads on the reference genome. First of all, what is a reference genome, and how do we place reads on it? We will also learn the standard file formats for storing short reads together with their alignments, so information about each read and information about where that read is aligned. And we will discuss typical goals of a ChIP-seq experiment, as well as the first steps of visualization and analysis. But please see this not as analysis, but as QC plus analysis. That's very important, because the fun part probably comes after what I'm talking about here, but what I will be talking about is absolutely necessary in every experiment: you want to make interesting discoveries with data you trust and understand. That's the first point. And I have a couple of disclaimers. First, in order to do all this, you will probably need months, maybe many months, to learn the whole process. That's from experience with Martin's epigenetics club and the students working there. It takes time because there are many, many steps, and they are somewhat task dependent. It's not a black box or a fixed set of rules that you always follow and get a good result; no, you have to vary things. That's the first. And second, we will be using some of the tools which are available, most of them developed by our fellow bioinformaticians. They are free and they have good qualities, but they are changing; they evolve together with the technology. If you used some tool six months ago, then stopped, then came back and tried the new version, something may have broken. So always make sure you understand the current version of the tool you are using and check that you are able to use it. All right. The first thing we are going to talk about, and we see it on both slides, is short read alignment.
And I specifically drew it this way. We have a gray box here, which is called an aligner, and there are two input components. One is your FASTQ file with short reads; in FASTQ files, every read takes four lines: the first line is the header, then the sequence, a spacer line, and the quality. And then we have the reference genome at the bottom, and this is just a sequence. At the end we have a so-called alignment file. So what is the reference genome? The reference genome, as I just said, is a sequence of A, C, G, T. If you take the human reference genome hg19, it's not the latest version, but it's the most popular version for analysis, and you search for any letters other than A, C, G, T, you will find some. You will actually find a few letters which are not A, C, G, T; for some reason the authors of this build put them there. For example, there is the letter Y, an ambiguity code which represents several possible nucleotides, and in newer builds we will have more of those. So the genome will not always look traditionally like a sequence of four letters. Just some stats: the current human genome, as you know, is about three gigabases. The mouse genome is a little shorter. As an example, some plant genomes are short, like Arabidopsis, but some plant genomes are very, very long. And of course one has to keep all this in mind when doing alignment, because the larger the genome, the longer your alignment will take. If you use a very short genome, like C. elegans or yeast, your alignment will be blazingly fast using exactly the same tools and exactly the same computers. When you need a reference genome, it's better not to use a genome from a friend or from somebody, unless you're doing some custom work, but to download it from reliable sources. One of them is NCBI, of course, or you can download from the UCSC browser; it has a download button, and all species are there.
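To make the four-line FASTQ record structure concrete, here is a minimal sketch in Python; the read names and sequences are made up for illustration, and a real parser would of course stream from a file rather than a list:

```python
# Minimal sketch of reading four-line FASTQ records (toy data, not a real file).
def parse_fastq(lines):
    """Yield (header, sequence, quality) for each 4-line FASTQ record."""
    for i in range(0, len(lines), 4):
        header, seq, spacer, qual = lines[i:i + 4]
        # By convention the header starts with '@' and the spacer with '+'.
        assert header.startswith("@") and spacer.startswith("+")
        yield header[1:], seq, qual

records = list(parse_fastq([
    "@read1", "ACGTACGT", "+", "IIIIFFFF",
    "@read2", "TTGGCCAA", "+", "IIIIIIII",
]))
```

Each record keeps the quality string aligned one-to-one with the sequence, which is exactly the pairing the aligner consumes.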
We will try to check this in the tutorial. So now we have a genome and we have our FASTQ file, and we would like to do alignment. Of course, alignment algorithms were known before next-generation sequencing data appeared; they were developed earlier. But the whole area was opened up for computer scientists who wanted to develop fast aligners for short reads. Typically they use all kinds of heuristics, so it's not exact matching. And actually, we don't want our alignment to be an exact match: we want to allow the reads some wobble, because, as we learned in the morning, our sequenced reads have certain sequencing errors, and we don't want to be very rigid. There are two major kinds of aligners. One kind is based on hashing, and the second class, which we will actually focus on, is based on the Burrows-Wheeler transform, a compression technique. Here I listed the names of those aligners. I just want to mention that a goal of this module is also to introduce you to certain lingo and names, so that when you meet them later in your work, you know what they mean. So we're not going into many, many details. Hash-based aligners basically hash the genome. If, for example, your short reads are 30 bases, the aligner chops the whole genome into 30-base pieces, stepping by one base, and creates a huge table. The key of this table is the sequence and the value is a location. Then it's very easy: you can just query this table. You take a sequence from your FASTQ file, look it up in the huge table, and find where its position is.
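The hashing idea just described can be sketched in a few lines: index every k-mer of the reference by its positions, then look a read's sequence up in the table. This is a toy version with a made-up genome string; real hash-based aligners use much more elaborate tables that also tolerate mismatches:

```python
# Toy hash-based index: map every k-mer of the reference to its positions.
def build_index(genome, k):
    index = {}
    for pos in range(len(genome) - k + 1):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index

genome = "ACGTACGTTTACGGA"          # made-up reference sequence
index = build_index(genome, k=4)
hits = index.get("TACG", [])        # all positions where this 4-mer occurs
```

Note that a repeated k-mer comes back with several positions, which is exactly the multi-mapping situation we will return to later.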
That's, of course, a very simplified way of showing it, because, as I mentioned, we allow mismatches, and those aligners take care of that with a bigger table, based not only on the genome itself but allowing some mismatches at every genomic location. The second kind of aligners are faster and more efficient because they use compression, and that's the trick: your search space is smaller, and that's why they're faster, while the first type of aligners are more sensitive. So that's a trade-off, speed versus sensitivity, because we don't want a very, very precise aligner which finds the absolutely optimal placement of our reads on the genome but takes months and months. It's always a trade-off. Also note that these aligners, and the way we align reads here, work for genomic DNA, like whole-genome shotgun data or ChIP-seq. If you would like to align RNA-seq or whole-genome bisulfite data, which will be discussed tomorrow, you need to use different aligners, or modifications built on these aligners. Here, from relatively recent literature, is a table which compares different aligners; in red are aligners based on the Burrows-Wheeler transform and in green are hash-based aligners, and you can appreciate that the speed-up from using the Burrows-Wheeler transform is about 10 times. So that's one thing, speed. On the next slide, from the same publication, we have sensitivity and specificity of different aligners. What it shows is the percent of reads aligned and the percent of reads aligned properly. You can see that BWA in a sense hits a sweet spot: it has a relatively high percentage of all reads aligned and a relatively high percentage of reads aligned properly. There are some other aligners which can have a higher percentage aligned, for example MAQ, a previous aligner from the same author. It aligns more reads, but the fraction aligned properly is much poorer.
All this is to show you that we are going to use BWA; it's a very popular aligner, and not a bad aligner if you consider these characteristics. When we choose an aligner, of course we can use default parameters, but it's good at least to look at them, and in the lab part we will see what parameters are available to us and which ones we choose. As Martin already mentioned, when we have a long read, say 100 bases, it would be very, very expensive, even with transform-based aligners, to use the whole 100 bases to try to place it on the genome. As a rule of thumb, if you have a stretch of DNA of 20 bases and use a simple alignment method like BLAT, you can place an arbitrary 20-mer somewhere on the human genome, perhaps with some mismatches and some gaps and so on, but you can place it. But once you get to 25-mers or 30-mers, if it places on the human genome, that means it originates from the human genome; it's very hard to place a 32-mer on the human genome at random. It's not for nothing that the default seed length for BWA is 32 bases. So basically, when the read is 50, 75, or 100 bases, we take the 32 bases from the beginning of the read and try to place them on the genome. The rest of the sequence is of course used, but as a local improvement: we align the seed, and sometimes the seed aligns in multiple locations, and then we use the rest of the read to place it better. So that's what happens. Here is the BWA publication, a reference for you if you would like to check it out; as I mentioned, it has a good balance between accuracy and speed. So now imagine we run our aligner, and we are going to do it.
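The seed-and-extend idea can be sketched as follows. This is a deliberately tiny toy: the seed here is 4 bases instead of BWA's 32, the genome is made up, and the "extension" is just a mismatch count over the tail of the read, whereas BWA's real extension also handles gaps:

```python
SEED = 4  # toy seed length; BWA's default seed is 32 bases

def build_index(genome, k):
    """Map every k-mer of the reference to its list of positions."""
    index = {}
    for pos in range(len(genome) - k + 1):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index

def seed_and_extend(genome, read, index):
    """Place the seed exactly, then pick the candidate with the fewest tail mismatches."""
    best = None
    for pos in index.get(read[:SEED], []):
        tail = genome[pos + SEED:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read[SEED:], tail))
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best  # (position, tail mismatches) or None if the seed never matches

genome = "ACGTACGTTTACGTAGG"
index = build_index(genome, SEED)
best = seed_and_extend(genome, "TACGTAGC", index)   # seed "TACG" hits twice
```

Here the seed matches at two positions, and the tail of the read disambiguates them, which is exactly the "seed aligns in multiple locations, the rest places it better" behavior described above.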
We're kind of going backwards, because in the lab we will be following exactly the steps I'm describing now. So we look at the result of our alignment, and we can have reads which are uniquely aligned. That's gold, that's what we want. That's the first class. What else can we have? Obviously, reads which are not aligned. We tried, and no way, they don't align. Those are unfortunately wasted, and there are many possible sources for those reads. Do you have an idea what they could be? Contamination is one. Second, of course, the read may be scrambled so much by sequencing errors that the aligner can't find a place for it; that's rarer, but contamination is common. Contamination means the read comes from a different species, like bacteria, or, for example, it contains adapter sequence: at the very beginning of the read we have, say, a 10-mer which doesn't belong to the human genome and was not properly trimmed, so we can't align the read. What's the next class of reads? Properly paired reads. And maybe here I'd like to spend a few minutes. We already heard from Martin that there are two flavors of experiment: single-end and paired-end, and I will be switching between the two cases. Of course, when we have paired-end data, we have more information, because we have two reads; they are aligned independently, but they also have a relationship between them. So in a paired-end experiment we can have a pair which is properly paired: both reads are uniquely aligned, and they are aligned within the expected distance from each other, let's say 500 bases. If the reads align far away from each other, you will be suspicious about whether they are properly paired or whether something happened. Sometimes that's actually interesting. So for a paired-end experiment, the second class is uniquely aligned and properly paired.
Here I show you an example of real results. When we align paired-end data, we can reconstruct the fragment length distribution. What is shown here on the x-axis is the DNA fragment length, reconstructed as the distance between the start of the first read and the end of the second read, the whole fragment, and the height is of course just frequency. You can see a typical shape; it's not Gaussian. We have a maximum somewhere around 150 to 200 bases; those are typical values. Then we have a long tail: some fragments are long, but long fragments are less probable. The first plot comes from data which was sonicated. And Martin mentioned that there is another kind of library construction, using a native protocol, which gives MNase-digested data. At the bottom we have a small inset which shows the fragment distribution for the MNase-digested data, and here it's very nice, because we immediately see biology: the peak is very sharp, and its maximum is located exactly at 146 bases. That's the nucleosome periodicity. So those are typical distributions, and we would like properly paired reads to lie within this range. So we have properly paired reads; what else? We can have non-properly-paired reads. That's another class, and they can be of different kinds: when one read in the pair is aligned, that's all right, but the other is just not aligned, probably because of sequencing errors or some other reason. However, we can also have both reads perfectly and uniquely aligned. What can happen to those reads? That may be a question for you. Imagine one read aligns on chromosome 1 and the second read aligns on chromosome 17, and we see this: they both align, everything is OK. This can be an alignment error, because there is similarity between different genomic regions, and with a couple of mismatches the read could be placed better elsewhere. Remember that aligners are heuristic, right?
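The fragment-length reconstruction described above amounts to a simple subtraction once both mates are placed. A minimal sketch, with made-up coordinates (real tools read this off the SAM TLEN field, and conventions differ on 0- vs 1-based coordinates):

```python
# Sketch: fragment length of a properly paired alignment, as the distance from
# the start of the leftmost read to the end of the rightmost read (0-based, toy).
def fragment_length(r1_start, r1_len, r2_start, r2_len):
    left = min(r1_start, r2_start)
    right = max(r1_start + r1_len, r2_start + r2_len)
    return right - left

# A 50-base read starting at position 1000, paired with a 50-base mate at 1096:
flen = fragment_length(1000, 50, 1096, 50)
```

With these toy numbers the fragment comes out at 146 bases, the nucleosome-sized length mentioned for the MNase-digested data.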
The aligner tries to place each read as well as it can. If there is one mismatch in one location and two mismatches in another, it will report the one-mismatch placement, the absolutely best alignment. But maybe the true location is the one with two mismatches; we don't know. Especially if we work with data like mouse, which is very polymorphic, it can be all kinds of things. However, placement of reads on different chromosomes can also be a reflection of translocations. If the data comes from a transformed genome, it can be a true situation that we have paired-end reads spanning a region of the genome where a translocation happened. Our reference doesn't have this, but the real sample has it, and when we place our reads onto the reference, the reads end up in completely different places of the genome. So this is interesting; for some of you it can be very relevant to your research project. Or the two reads can have an improper orientation. Typically, in paired-end data, they should face each other: we sequence one from 5 prime to 3 prime and the other on the opposite strand from 5 prime to 3 prime. If they have the opposite orientation, it can be a sign of an inversion. Just to mention this. Then, as we already discussed a lot in the morning, reads can be duplicated, or multiplicated: by the way, when we say duplicated reads, we can have two copies of the same read or we can have multiple copies. Now, if we work with single-end data, even looking at the FASTQ file we can say which reads are duplicated; just look for identical sequences, because if a read is positioned at exactly the same place in the genome, it should have the same sequence. And for a duplicated read, we actually require that the sequence is exactly the same, because if the sequence is different, it could be an artifact of amplification.
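The duplicate-marking rule just described (same position, same strand, same sequence) can be sketched like this. It is a toy version on made-up tuples; real tools such as Picard MarkDuplicates work on BAM records and choose which copy to keep by quality:

```python
# Sketch of duplicate marking for single-end reads: group by alignment position,
# strand, and sequence; the first occurrence is kept, later copies are flagged.
def mark_duplicates(reads):
    seen = set()
    flags = []
    for chrom, pos, strand, seq in reads:
        key = (chrom, pos, strand, seq)
        flags.append(key in seen)   # True means: this read is a duplicate
        seen.add(key)
    return flags

reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),   # exact copy of the first -> duplicate
    ("chr1", 100, "-", "ACGT"),   # same position, other strand -> not a duplicate
]
dup_flags = mark_duplicates(reads)
```

For paired-end data the key would be built from both mates' positions, matching the "duplicated fragments" definition that comes next.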
But of course, we run special tools after alignment to detect duplicated reads. Now, if we have a paired-end experiment, when we call reads duplicated, the language is still "duplicated reads", but we're actually talking about duplicated fragments. We call a fragment duplicated if both reads of the fragment, read one and read two, are aligned to exactly the same positions of the genome and have exactly the same sequences. Now let's come back to a question we had before: should we keep those duplicated reads or not? I think I have a slide somewhere; yes, let's look at this slide. So, what is our expectation, and I know that some people have actually looked at this question in depth, for a read to be duplicated? Say we have a single-end experiment with 50-base reads. What coverage at some position of the genome do we need in order to be forced to have at least one duplicated read? It's 50, because a 50-base read covering a given position can start at only 50 distinct positions. If coverage is higher than 50 and our read length is 50, there is no way to get there without two reads aligned to exactly the same location. It's clear, isn't it? We can't build a tower higher than 50, say 51, without duplicates: if coverage is 51, then in this particular region at least one read must be an exact copy of another. That's how you can estimate what to do with duplicated reads. In a typical ChIP-seq experiment, with the data we normally work with at the moment, the true coverage shouldn't be that high, so for single-end reads you would probably like to remove your duplicated reads. But do this with some care, because of course, if you accept duplicated reads, you will get some spurious locations.
You get them just because those duplicated reads stack on top of each other. Now, if you want to study this, what would be the way? Select just the duplicated reads and see where they align. Maybe there are some genomic locations where they concentrate; maybe the duplication is due to low sequence complexity or something. Yes, a question: where does the duplication actually come from in the experiment? So, your question is, what is the origin, why do we get this? For the Illumina technology we are talking about at the moment, duplicates are due to PCR amplification: you simply have the same molecule sequenced twice. And how can you distinguish the ones coming from PCR from the ones that are actually increased copy number? You cannot, of course; you don't know how to distinguish them. The question is whether it's actually PCR or what you expect. As on the previous slide: if you're looking not at ChIP-seq but at exome sequencing, with coverage in the tens of thousands, you have a small region of the genome which is covered deeply, deeply, deeply, and there you expect a lot of duplicated reads and fragments just by nature, because there is no room to place them differently and still get that coverage. But here, what you typically do is say: I can't distinguish the two; the chance that they are truly biologically different, that we actually sequenced two different copies of DNA, is low, and the chance that they come from the same molecule is high, so I collapse them and only consider them once. And remember, as I just said, it's not necessarily duplicated; some reads are multiplicated. If you look carefully in your data, you can find reads multiplicated a thousand times for some reason. Maybe there is a very simple sequence, and PCR picked it up and created lots of copies of it. So then the last class, but not least.
So we've talked about all these possibilities, all the kinds of reads we can have after alignment. The last is multi-mapped reads. Those reads are interesting because we don't know, the aligner doesn't know, where to place them. The aligner tells us a read can be placed on chromosome 1, and with exactly the same quality on chromosome 2 or chromosome 15, and we don't know what to do. What is the origin of those reads? Say it again. Yes, exactly: we know that mammalian genomes are highly, highly repetitive; something like 40% of the human and mouse genomes are actually repeats. These repeat sequences are not identical, but they are close enough for short reads to be placed in multiple locations. Those reads are also problematic, because if we don't know where to place them, we don't know how to work with them. The strategy BWA has chosen, maybe not the best but a strategy, is to assign arbitrarily: out of, say, 10 locations where a read can be aligned, BWA arbitrarily assigns one. However, not by default, we can ask BWA to report all locations where the read aligns. So if somebody is interested, for example in studying repetitive elements, that's a possibility. Now, this whole story about aligning to multiple locations is a bit of a headache for the analysis of all next-generation sequencing experiments. We talk about the notion of mappability: what fraction of the genome is uniquely mappable, uniquely accessible to us through next-generation sequencing, and what fraction is not. Obviously, the shorter the reads, the smaller the genome's mappability; with 36-mer reads, and we actually started with 27-mers, only approximately 80% of the genome is uniquely mappable. Different chromosomes have different mappability, and by the way, that's also important: there is a global value for it, but it's also location specific.
There are some locations which are very unmappable and some which are very mappable. The longer the reads, the higher the mappability. Here is an example: short reads, a reference genome, and in orange we have repeats. These two red reads can be aligned in this location or in these locations, while the green reads align uniquely. Now suppose we have single-end data in this situation; we have quite a few reads reported as multi-mapped. But if we have paired-end information, and these pairs are related, the situation is different. Let me come back. These two reads are not aligned uniquely, they are multi-mapped, and these two reads are not aligned uniquely. Now we use the paired-end information. We can't help this fragment: the entire fragment maps into the repetitive element, and we don't know what to do. However, these two reads are rescued, because in each pair with a red read, the green mate is aligned uniquely, and we can infer the true location using the fragment length: we can say the two reads of a pair have to be within, say, 500 bases of each other, and so we find the proper placement of those red reads, which otherwise have multiple possible locations on the genome. Is it a little bit clear? Maybe it's time to stop for a second if you have questions. Yes? In the table that you showed us for the different aligners, it seemed like Bowtie had a better running time and higher sensitivity, so why didn't we use that? No, for us it's more historical. BWA was developed earlier; when the other aligner, Bowtie, came, I think its running time was much higher at first, and it was also more error prone, while BWA was more stable.
Once you learn some aligner, you kind of don't want to jump from aligner to aligner, and in particular, if your analysis uses multiple data sets, you would like to use the same aligner for all of them. For example, when we use our data and would like to compare with some published data, we realign the published data with BWA, using exactly the same parameters. Yes, Bowtie2; I have used Bowtie2, it's a good aligner now and many people use it, it's about 50-50. It's actually a little bit faster, isn't it? Yes. Question: you mentioned that we align with an aligner and get all of those metrics; is there a possibility to get all of these statistics afterwards? We're coming to this, yes. There is a very good tool which gives us all the statistics about our alignments. Question: you mentioned the practice of using the same aligner for all data sets; I'm wondering if you ever take the reads left unaligned after BWA and run them through the hash-based aligners, to get the most out of your data? Yes, not hash-based, but, for example, I have sometimes taken an unaligned read and used BLAT, the heuristic developed by Jim Kent, and just tried to place it. Sometimes you have luck and you can find, for example, an indel. All these aligners start having more difficulties as the complexity of the modification grows, for example a five-base insertion, and if this happens in your seed, in the first 32 bases, it's even harder. We are coming to this, but with the default parameters you allow only two mismatches in your seed, and no insertions or deletions; all insertions and deletions have to be in the remaining part of the read. For sure you're wasting some reads, but basically that's your QC, right? You run your alignment, and if you see that 92% of your reads are uniquely aligned, then you're happy.
You would probably sacrifice those 8% of your reads. Some of them legitimately don't belong to your experiment, and some of them may just be aligner errors, but it's a trade-off of speed versus completeness. Still, it's a good strategy: I don't think we use this routinely in a pipeline, but if your data is very valuable, you can take the unaligned reads and just try some other aligner on them. From my experience, you don't risk too much, but one can try. Okay, so we were talking about these multi-mapped reads: they come from repeats, we don't really know how to work with them, and unfortunately those reads are perfectly aligned, but most of the time we just put them aside, unless you have a special question in hand. So here, Martin already showed you some screenshots from the UCSC browser, and here you can see custom tracks, available to all of us, where people have precomputed mappability. What does it show? Where the value of this track is one, it means that, and here for example it's for 50-base reads, if we take all the 50-mers which cover this particular position and all of them align uniquely, then the mappability value there is one. When 40 out of 50 align uniquely, then we have 80% mappability, et cetera. So when you see those huge gaps, that region is completely unmappable, and as you can see, it's unmappable for 50-base-pair reads, for 75, and for 100. Even with 100-base-pair reads we can't map this part uniquely, and fair enough, down at the bottom we have the RepeatMasker track of UCSC, which shows us repeats, and there is a huge repeat; I think it's a LINE repeat here.
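The per-position mappability idea can be sketched with a toy computation: count how many k-mers of a reference occur exactly once. This is a simplified version of the idea, not the exact definition the precomputed UCSC tracks use, and the genome string is made up:

```python
# Toy mappability: fraction of k-mer start positions whose k-mer occurs exactly
# once in the reference. Longer k-mers resolve repeats, so mappability grows.
from collections import Counter

def mappability(genome, k):
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for km in kmers if counts[km] == 1)
    return unique / len(kmers)

genome = "ACGTACGTGG"              # contains the 4-mer repeat "ACGT" twice
short = mappability(genome, 4)     # repeat k-mers are not uniquely mappable
longer = mappability(genome, 6)    # 6-mers are all distinct here
```

This mirrors the point made above: the same reference is less mappable at short read lengths and fully mappable once the reads are long enough to span the repeat.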
Yeah, so that gives you an idea that you can actually assess this. For example, if you've done your experiment and you have your reads aligned, and in some region of interest near your beloved genes you don't have any reads, you would probably also like to check whether there are repeats around. Maybe it's not that there is no data around your genes, but that you can't align reads there uniquely. This is available for you from the UCSC browser; yes, it's a track that's already precomputed. Question: what do you do if your genome is not on the browser? You can load it, actually; maybe we can talk later about this. You can load anything; you can even fake it, for example use the human assembly to display C. elegans data, or do whatever you like with it. So, I more or less already mentioned that mappability affects our analysis. Something to keep in mind: when you analyze data sets with different read lengths, say one experiment with 36 bases and another with 100 bases, the only really unbiased way to analyze them together is to trim the longer reads to the shorter length. Which sounds insane, crazy, you lose data, but that's the only way to compare apples to apples. Because otherwise, you will see differences simply due to different mappability at different locations, which you can't decipher. You can do all kinds of corrections, but it's very, very difficult, and people can criticize you for it. Question: if you want to trim your data, how do you decide how to trim it? That's a good question. I guess, from Martin's talk, you just look at your QC and you see, well, my read has the best quality from, say, position five to position something; that would be your best choice. But you will be a little bit biased; yes, you need to decide.
You know, people often ask us this same question. Typically, if you want to be really, really rigorous, you try several trimmings: you try starting at the beginning, at position 10, at position 20, you repeat your analysis and you see where the data looks better. But you typically don't have time for that, and you might not learn much; still, that would be the really honest answer to your question: try different options and see what happens. So now we're moving on. We've done the alignment, we know the classes of reads we got, and now let's talk about how we store those alignments. What is the file format for storing alignment results? The current format is SAM, which stands for Sequence Alignment/Map. It's an ASCII file, and every record contains information about one read and its alignment. It has 11 standard fields, plus optional attributes, each with a key and a value; after the 11 standard fields, different aligners add different tags. At the beginning of the SAM file we have a header, which tells us a lot of information: about the reference genome and many other things. In the lab part we can look into this a little more. The BAM file is the binary form: BAM is just a binary, compressed format of the SAM file, and there are other compressed flavors as well. The compression ratio is roughly about five, a little more. The advantage of the BAM file is that it can be indexed. Typically these files are huge, right? You have 50 million or 100 million reads, arranged per chromosome. Say you would like to access chromosome 21: with a SAM file, you go from the beginning and scroll through the file, but a BAM file allows indexing, and you can access the region of interest directly. Something to mention: good aligners, and BWA is good in this way, don't manipulate the reads.
They store in the BAM file the read sequence and the read quality in their original form, except for the strand: reads aligned to the negative strand are stored reverse-complemented. So in principle, from a BAM file we can reconstruct our original FASTQ file. However, the order will not be preserved: the FASTQ comes from the sequencer in a certain order, and in the BAM file we lose this order, but a good BAM file contains all the information of the original sequence. But that's not all aligners; there are aligners which chop sequences, drop unaligned reads, et cetera. So beware: sometimes you can't really go back. If the only data you have comes from somebody as a BAM file, it's not necessarily the whole experiment. Now let's talk a little about the header of this file. A typical header will contain information about the genome we used, in a format like this: it will have a record of type SQ with a sequence name, in this case 1, and sometimes it will be chr1, with chr standing for chromosome. It all depends on what the chromosome was named in your genome, and we will see this in the next part. The header can also contain all kinds of information about the aligner and the parameters we used, so that the alignment can be reconstructed: if somebody wants to repeat our alignment, they can do it by looking at the header of the file. Now, these are the 11 standard fields in every record. There is lots of information there; you can learn a lot about every read, and those highlighted we will discuss in detail. No questions so far? Can I go forward? So the second field is the flag, and that's a bit flag. When we look at the file, we see it as an integer: it can be 161, 83, something like that.
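The 11 mandatory fields just mentioned can be seen by splitting one SAM record on tabs; everything after field 11 is the aligner-specific optional tags. The record below is made up for illustration:

```python
# Sketch: split one SAM record into its 11 mandatory fields plus optional tags.
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_record(line):
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, cols[:11]))
    for field in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[field] = int(record[field])      # numeric mandatory fields
    record["TAGS"] = cols[11:]                  # aligner-specific key:type:value tags
    return record

rec = parse_sam_record(
    "read1\t83\tchr1\t1000\t37\t50M\t=\t850\t-200\tACGT\tIIII\tNM:i:1")
```

FLAG, POS, and MAPQ are exactly the fields discussed next: the bit flag, the read start, and the mapping quality.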
But if you look at the bits — in binary format we have 11 different bits, and each of them has some significance. It tells us whether this is a paired-end alignment or single-end; whether the read is aligned and whether its mate is aligned or not; whether this is read number one or read number two; whether the read failed the vendor quality control, the chastity filtering; and whether the read is marked as a duplicate or not. So this is a little bit convoluted. However, samtools — which we are going to discuss in a moment — will allow us to filter reads based on the value of this flag. For example, you would like only reads aligned on the negative strand: you can set the filter on this flag and pick up only those reads. You can do many, many different things with it, and it's very helpful. And here is an example, using samtools — later we'll talk a little bit more about samtools. With capital -F we filter away reads with a certain flag, so -F 516: 516 is 512 plus 4, that is two to the power nine plus two to the power two. Two to the power two, 4, marks unaligned reads — we are not interested in unaligned reads — and 512 filters out those reads which failed the quality check. So when we add this option and scroll through the file, we get a subset of reads, only those which satisfy these criteria. And lowercase -f will instead select reads with a certain flag set — for example, only reads aligned on a particular strand, or only read one among all reads. So, the fourth column — coming back: there is field one, two, three, then there's the chromosome, then the read start.
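The bit arithmetic behind -f and -F is easy to sketch in Python. The bit names below follow the SAM specification, and the filtering logic mirrors what samtools view does with these two options.

```python
# Sketch: decode the SAM flag bits and emulate samtools view -f / -F filtering.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper pair", 0x4: "unmapped", 0x8: "mate unmapped",
    0x10: "reverse strand", 0x20: "mate reverse strand",
    0x40: "read 1", 0x80: "read 2",
    0x100: "secondary", 0x200: "QC fail", 0x400: "duplicate",
}

def decode_flag(flag):
    """Names of all bits set in an integer flag value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def keep(flag, require=0, exclude=0):
    """samtools-style filter: -f require (bits that must all be set),
    -F exclude (bits none of which may be set)."""
    return (flag & require) == require and (flag & exclude) == 0

# -F 516 = 512 + 4: drop QC-fail reads and unmapped reads.
print(decode_flag(83))
print(keep(83, exclude=516))   # mapped, passed QC -> kept
print(keep(4, exclude=516))    # unmapped -> filtered away
```

Flag 83, for instance, decodes to paired + proper pair + reverse strand + read 1 — exactly the integers you see in the second column.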
So the fourth column is the read start, and this is quite an important thing — it was a source of many errors in the early days. When the read is aligned on the positive strand, there's no problem: this position — I'm circling it — is reported as the start of the read, and if it's a single-end experiment and we would like to extend by a certain average fragment length, 150 or 175 bases, it's clear how to do it — we just extend to the right. If the read is aligned on the negative strand, the reported position is still the leftmost coordinate, which is actually the 3' end of the read, not its 5' end. So people were taking this position and extending from it directly — but if the read is aligned on the negative strand, you have to extend backward, in the read's 3' direction, starting from its 5' end. In other words, you have to add the read length to the reported start of the read. This was an error which was repeated many times. One always has to remember that the strand on which the read is aligned matters. Now, the fifth column is the mapping quality, and you already heard about the Phred score — the mapping quality is a Phred score as well. Again, samtools has a parameter, lowercase -q, where you can indicate a mapping quality, and only reads at or above this mapping quality will be accepted. The mapping quality is minus 10 log base 10 of the probability of false alignment. When that probability is one — the read aligns equally well in multiple locations, so we actually don't know where to place it — the mapping quality is zero. The higher the mapping quality, the smaller the probability of false alignment, and the further away from multi-mapping we are. So, how are we doing — everything okay? Yes?
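The mapping-quality formula can be written out directly. This small sketch just applies MAPQ = -10·log10(P(false alignment)), rounded to an integer as in the SAM format.

```python
import math

# Sketch: mapping quality as a Phred-scaled probability of false alignment.
def mapq(p_wrong):
    """MAPQ = -10 * log10(P(wrong location)); 0 for multi-mappers."""
    if p_wrong >= 1.0:       # aligns equally well in several places
        return 0
    return round(-10 * math.log10(p_wrong))

print(mapq(1.0))      # multi-mapper -> 0
print(mapq(0.001))    # 1-in-1000 chance of a wrong placement -> 30
```

So a samtools -q 30 filter keeps only reads whose placement is wrong with probability at most one in a thousand.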
How do you decide the extension — is there some kind of option in BWA for that? So, the read extension is post-processing. BWA doesn't know about it; BWA just aligns the read as it is and reports the start of the read alignment. Now, as you will see in a few minutes, when we would like to create coverage for a ChIP-seq experiment, we're not only interested in the read itself, because the fragment which originates from the immunoprecipitation is much longer than the read — or often much longer. So when you create coverage, you would like to use the whole fragment. People do different things, but one way is to extend your read and look at the coverage of the extended reads. This happens naturally when you have paired-end data, right? With paired-end, the fragment is already given to you, and you profile the fragments as they are. But with single-end reads — and a lot of the publicly available data is single-end — you have to infer it. So, we already know what the advantages of paired-end are; maybe let me take a moment here. There are two obvious advantages. One is that we know precisely where the interaction between the DNA and the protein, or the modification, happens — which fragment we pulled out. When we know both reads, we know that the antibody was interacting somewhere between the two reads. With single-end, we have to infer it: we take an average fragment length, extend, and somewhere between the read start and the end of this extension the interaction happened. But it's less precise. And of course, the second advantage is repetitive elements, where the placement is much better. So now, how to decide?
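The strand-aware extension described above can be sketched like this. The fragment length is the assumed average from your library, and coordinates are written 0-based half-open for simplicity — the key point is that a minus-strand read's 5' end is its reported position plus the read length.

```python
# Sketch: turn single-end alignments into estimated fragments for coverage.
# For a plus-strand read the fragment extends rightward from the read start;
# for a minus-strand read the reported position is the leftmost base (its 3'
# end), so the fragment runs leftward from pos + read_length.
def fragment_interval(pos, read_length, strand, fragment_length=150):
    """Return (start, end), 0-based half-open, of the inferred fragment."""
    if strand == "+":
        return pos, pos + fragment_length
    five_prime = pos + read_length          # 5' end of a minus-strand read
    return five_prime - fragment_length, five_prime

print(fragment_interval(1000, 50, "+"))   # (1000, 1150)
print(fragment_interval(1000, 50, "-"))   # (900, 1050)
```

Forgetting the `pos + read_length` step for minus-strand reads is exactly the classic error mentioned above.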
So typically, some tools can analyze your data — if it's a single-end experiment — and, looking at reads on opposite strands, they can infer the average fragment length. Then you know. Even FindER can infer it; it's never perfectly precise — it's a distribution. Alternatively, you have your wet-lab traces — Martin showed the Agilent traces. You can look at them and see the mode of the distribution: well, it's roughly 160. I should say that if you choose your extension reasonably, your final results will not depend very strongly on it. But of course, if instead of 150 you choose your extension to be 1,500 bases, your results will be very, very different. So, yeah. Now, this is a very confusing part of the SAM format, but a very useful one: field number six, the CIGAR string. The CIGAR string essentially tells us what was done to the read during alignment in order to place it in this location. It tells us how many bases were mapped — and those can be with or without mismatches; as we said, we allowed two mismatches. It tells us whether we have an insertion or a deletion. It tells us whether part of the read was soft clipped. Soft clipped means the sequence is still there, but the aligner for some reason decided not to use it in the alignment — the alignment actually starts somewhere in the middle of the read. And some aligners do a brutal thing to the reads: they hard clip them. They take the read from the FASTQ and say, well, I don't like these first 10 bases, chop them away and just forget about them. But then in the CIGAR, and only in the CIGAR, we can learn that this was done to our read.
All of this is rather important when you do variant calling, because the details in the CIGAR — how the read was placed — matter a lot for variant calling. And I'm thinking maybe let's skip the details for now; if we have time in the lab, we can come back to the CIGAR, because it's very technical, very dry. I don't think I can teach you within 10 minutes how to decipher a CIGAR string. Just remember there is a CIGAR string — a magic CIGAR string — and samtools helps to decipher it, but you still need to understand the details. What does it mean? A CIGAR can look like this — really ugly. The simplest case: if we have 100 base pair reads, the nicest CIGAR is 100M — one hundred matched bases, nothing else. However, in many cases we have situations like this: three soft-clipped bases, 65 matches, then a 6-base deletion — a deletion from the reference, meaning those bases are present in the reference but missing from our read. Then another three matches, then a 2-base insertion — bases present in our read but not in the reference — and then 53 matches. If we add up all the pieces that consume read bases — the soft clips, matches, and insertions — we recover the read length. That's how it looks, and it gives us a lot of information about the alignment. But let's skip the details; I feel it's fine to skip it. So now we come to samtools. samtools is a C program used to manipulate BAM and SAM files. It has lots of options — we will use it in the lab. If you just type samtools, you will see that it gives you help. Most bioinformatics tools, by the way — if you're completely stuck and don't know what to do, just run them with no parameters; they give you some help. There are different categories of commands: indexing, editing, file operations, viewing, and statistics.
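A tiny parser makes the CIGAR bookkeeping concrete: operations S, M, I (and =, X) consume read bases, while D consumes only reference bases, so summing the read-consuming operations recovers the read length.

```python
import re

# Sketch: split a CIGAR string into (length, op) pairs and recover read length.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
READ_CONSUMING = set("MIS=X")   # ops that use up bases of the read itself

def parse_cigar(cigar):
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def read_length(cigar):
    """Length of the (soft-clipped) read implied by the CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in READ_CONSUMING)

print(parse_cigar("100M"))              # [(100, 'M')]
print(read_length("3S65M6D3M2I53M"))    # soft clips + matches + insertions
```

Note that hard-clipped (H) bases are not in the stored sequence at all, which is why a hard-clipping aligner makes the original FASTQ unrecoverable.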
And I would like to talk about flagstat. We already had the question of how to get statistics: if we just type samtools flagstat and give it our BAM file, it gives us interesting statistics — and we will do it during the lab. Now, viewing: it's samtools view, and we give it the BAM file we would like to view. If you just run less or any other Unix command on a BAM file, you will see gibberish — you can't really read it — but with samtools view, you can see whatever you like. And here there is some magic. When our file is indexed — when we run samtools index on the BAM file — an index file is created in the same folder where our BAM file is, and we will do this. Then we can view a particular region of the BAM file: the command is samtools view, the BAM file, then chromosome, colon, start, dash, end — and we pick up only the reads belonging to this region. So if you've run your experiment and you know a certain target and would like to quickly assess whether you have enrichment, in five minutes you can answer your question. You run this command and see that in this particular region you get, say, 1,000 reads. Then you take regions of the same size, randomly on the genome, run it five times — and randomly on the genome you get five reads, while in this particular region you get 1,000, so you're absolutely sure that at least this part is showing enrichment. It's a good thing to know. The next set of tools we are going to use, also good to know, is Picard. Those are Java tools written at the Broad. When you look at Picard, there are many, many different tools you can use for manipulating SAM and BAM files. Two are very important for us. One is MarkDuplicates: with paired-end reads it's difficult to say which pair is duplicated, because how would we know, right? Both reads of the pair have to be identical.
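The quick enrichment sanity check described above — count reads in a suspected target region, compare with same-sized random regions — can be sketched with simulated positions. Everything below is invented data; in practice the counts would come from samtools view region queries.

```python
import bisect
import random

# Sketch: compare read counts in a target region against random windows.
random.seed(0)
chrom_len = 1_000_000
reads = [random.randrange(chrom_len) for _ in range(50_000)]       # background
reads += [random.randrange(500_000, 510_000) for _ in range(800)]  # enriched site
reads.sort()
target = (500_000, 510_000)

def count_in(reads, start, end):
    """Reads starting in [start, end) -- what a region query returns."""
    return bisect.bisect_left(reads, end) - bisect.bisect_left(reads, start)

def random_window(size=10_000):
    """A random same-sized window that avoids the target region."""
    while True:
        s = random.randrange(chrom_len - size)
        if s + size <= target[0] or s >= target[1]:
            return s

target_count = count_in(reads, *target)
background = [count_in(reads, s, s + 10_000)
              for s in (random_window() for _ in range(5))]
print(target_count, background)   # the target stands far above the background
```

Five random windows are a crude background, exactly as in the five-minute check described above — enough to convince yourself before running a proper peak caller.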
Sometimes read one is identical but read two has a different location — so Picard will help us with this; that's one. And the second tool in Picard is MergeSamFiles. Very often our experimental data is spread across multiple lanes or multiple chunks — it's even advisable sometimes to spread your data to avoid certain biases — and then you want to merge them, and MergeSamFiles will do it for us. These are very simple tools, but it's good to know that they exist and that they're sometimes useful. So I've come to the visualization part — maybe you have questions at this moment. Yes: when your FASTQ files come from different lanes, is it better to merge the FASTQ files or to merge the BAM files? I think it's better to use the FASTQ files as they are. Why? Your FASTQ files come from slightly different experimental conditions, maybe from different sequencing machines. So you would like to align each of them, have your BAM files, run stats on them, and see that this FASTQ has 90% alignment while that one has only 30%. Then you would know that maybe that whole flow cell had problems. You can always merge afterwards — that's one thing. And also, when we do alignment, we artificially chop our FASTQ into smaller chunks and put them on a cluster to align with many CPUs, just to do it faster — and then we merge the alignment results. So now let's talk about visualization. There are lots of tools created to visualize the data: we can visualize BAM files, and we can visualize files which are post-processed products of BAM files. Why is visualization important?
What we are doing — and I'm using this term very loosely — when we look at ChIP-seq data, and a lot of other next-generation sequencing data, is exploratory: human inspection is very, very important. We know very little about what to expect; we don't have a theory behind our data. That's why, as soon as the data is out, everybody is eager to look and see how it looks. You can visualize your data in IGV — it's a Java tool — or in a genome browser, the WashU Genome Browser or UCSC, which are web-based and very popular. IGV is Java, as I said; you can launch it via Java Web Start. I will be using UCSC later because it's more accessible for us: you just type genome.ucsc.edu, and you can load your data and do whatever you like. Now, you have already seen this type of picture. These are coverage tracks — from the data we are going to work with. This is H1; the histone modification tracks at the top are the six marks we discussed in the morning in Martin's presentation, and the black one is input DNA. Just to repeat what input DNA is: in modern experiments, people try to do the following. They do all the steps of the library construction up until the immunoprecipitation. Then they split the material into two tubes: one goes through the IP and the other does not. So they're identical samples, but one is not IP'd. Then you denature, remove the proteins, and sequence this input. The belief was that this is the best background to compare the data against — but of course, as Martin already said, there's a lot of biology just in the input; that's one thing. And second, when we do the precipitation, there are a lot of non-specific reads, and they're not necessarily the same reads as they would be in the input.
So even if our experiment completely failed — say we used an antibody which is non-specific — the results will be different from plain input, especially in some locations. That's from experience. But nevertheless, input is good to have. You would like to see fairly uniform coverage, and if your data is deep enough, you can see details of nucleosome positioning there, and many interesting things. The two lower tracks are actually from ENCODE; we prepared them for you — a transcription factor, so a protein–DNA ChIP-seq experiment. You can see nice peaks there. The green one is RNA-seq, but that's beyond what we are talking about at the moment. So now the question is: how do we arrive at those views? This is nice, right? We started with our FASTQ files, we ran them through the steps we just discussed, and then we have this view. Here in the UCSC browser there are genes; we can zoom in on different genes — for example, the red track is H3K4me3, a mark which tells us about transcriptional activation of a gene, and when we see a peak in the promoter, near the TSS of a certain gene, this gene is probably expressed. It's very interesting. So how do we arrive at this? There are different file formats for visualizing in UCSC. The simplest one is called the BED format. And here there is an interesting decision which was made by the UCSC folks. This file contains the coordinates of regions: every line looks like this — chromosome, one location, and another location. The first location must be strictly smaller than the second. The first is the start of the region, the second is the end — but the start is zero-based. So if our region actually starts at position 400,601, in our BED file we put 400,600, not 400,601. This is something that always comes up, and people make a lot of mistakes here.
One base — you might think it doesn't matter. It does; you will quickly learn that it matters. So remember that the BED file is zero-based: if our region starts at coordinate 10, in the BED file we enter it as 9. It's hard to explain why, but there are probably reasons. Not all the formats are zero-based, but this one is. The next format is the wig file, and it's a tricky format with multiple flavors; I will talk about the simplest flavor. It is useful when the coverage profile varies very rapidly — when we don't have continuous stretches of coverage with the same value. It has a fixedStep line, a line which says where the coverage starts — here, chromosome 3 and a start position — just to give you an idea of how it looks; we have a tool which will generate this file for us. The last format I would like to mention is the bedGraph format. Technically speaking, it's equivalent to the wig format, and there is always a trade-off: sometimes it's more beneficial to use wig, sometimes bedGraph. bedGraph is exactly like BED — it has chromosome, start, and end — but it also has a value: it actually carries the coverage. The whiteboard is too complicated; maybe just with my hands. Essentially you have a start, an end, and a value — the height of the tower you would like to display. If you had only the BED fields — only start and end — everything would be just a box of the same height. The score is what distinguishes different locations. Yes — is that score just the count of the reads that overlap it? It can be anything you like — yes, that, but we can also normalize it, and it can be a floating-point number as well. But of course the most parsimonious choice is just the coverage, as you said: we take every fragment as one, stack them on top of each other, and in every position we have an integer. But it can be normalized.
And sometimes we would like to normalize, because, for example, we compare two libraries of very different depths. A rather naive but reasonably well-working solution is to divide by the total number of reads in the library; then we can more or less compare two libraries to each other. To generate the big file we will be using a BAM-to-wig tool which we created — but that's in the lab part. So now, what else can we say? I wanted to give you a few examples of what we can do with the data prior to calling enrichments. Of course, the next step is: we have our data, we look at it in the browser, and we would like to call enrichment — to get a set of locations, a set of genes, which carry some significant mark. But we can do certain things even before calling enrichment, just looking at the raw alignment data, and it is a form of QC. Martin showed you plots with bars which tell us how many reads fall in the promoters of genes. After alignment, we can take the roughly 20,000 coding genes, take the transcription start site of every gene, step maybe 1,500 bases to either side to get a 3 kb region, and check how many reads fall into all these 20,000 3 kb regions. And we can compare in different ways: we can ask what fraction of the total number of reads this is, or compare to the number of reads in random genomic regions of the same size. If we see that, for a mark we expect to be located at the TSS, we have enrichment — we are happy, without doing anything very sophisticated. Here I'm showing you an example — a cartoon example. What this plot shows: I calculated the mean coverage near the TSS, in the promoters of 20,000 coding genes, and looked at the distribution — that's the red curve. The blue curve is the same calculation when I take 100,000 random genomic locations of the same size and compute the mean coverage. Like metagene plots?
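Stacking fragments into a coverage profile and writing it out as bedGraph lines can be sketched as below. The intervals are invented examples; the output uses the 0-based, half-open convention just described, and the optional per-million scaling is the naive depth normalization mentioned above.

```python
from collections import defaultdict

# Sketch: stack fragments into per-base coverage and emit bedGraph lines
# (0-based half-open, like BED), optionally normalized per million fragments.
def to_bedgraph(chrom, fragments, per_million=False):
    delta = defaultdict(int)          # difference array: +1 at start, -1 at end
    for start, end in fragments:
        delta[start] += 1
        delta[end] -= 1
    scale = 1e6 / len(fragments) if per_million else 1
    lines, depth, prev = [], 0, None
    for pos in sorted(delta):
        if prev is not None and depth > 0:
            lines.append(f"{chrom}\t{prev}\t{pos}\t{depth * scale:g}")
        depth += delta[pos]
        prev = pos
    return lines

frags = [(100, 250), (150, 300), (150, 320)]
for line in to_bedgraph("chr1", frags):
    print(line)
```

Each emitted line covers a run of constant depth, which is exactly why bedGraph beats fixed-step wig when the coverage changes rapidly.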
Yeah, you can call them that if you like — yes, why not? So let me explain; I apologize, we should have put it on a slide. This is our picture: we have the genome, and we have our coverage profile going like this, and here we have genes. This gene is transcribed this way; there is another gene transcribed that way. So what I was just suggesting is the following. We take a gene annotation — going to Ensembl or to UCSC — and take the locations of the TSS of every gene: this point, this point, and 20,000 other points in the genome, for the coding genes. We take regions of 3 kb around each, and compute the mean coverage, which is the integral over the region divided by the size of the region — and the size is the same for all. So for every gene we have a number, c_i, where i goes from 1 to 20,000 — 20,000 numbers, one per gene. Now we take regions of the same size somewhere away from the TSSs, randomly on the genome, and do exactly the same — sometimes it's recommended to draw more than 20,000, maybe 100,000 — and calculate the signal outside of the TSSs. That's what I plotted there. The blue curve is the random curve, and the red curve is at the TSS. In this particular case — and this is H3K4me3 — our experiment definitely worked: the signal at the TSS, the red curve, is shifted to the right. On average, we have many more reads at the promoters of genes than anywhere else in the genome. So that's the kind of thing you can do right away. Now, another set of plots you can make without calling any enrichments: instead of looking at the mean signal over the whole promoter, let's look at the coverage along a different axis, let's put it this way — let's look at what the profile looks like across all genes. If I come back to the same drawing: we take those profiles, at every position, for all 20,000 genes, and stack them together — sum them together, right?
We take one, then another one, and sum them on top of each other, because each of them is just a row of 3,000 numbers. We have 3,000 numbers, 3,000 numbers, 3,000 numbers — we stack them all together. I hope I'm not losing you; if I am, please let me know. So if we do this and then look at the four marks Martin mentioned to you — the first one is H3K4me3; let's look at this plot. The line here in the middle is the transcription start site, exactly where the gene starts. On the left is the 5' side and on the right the 3' side. By the way, when I said we stack those profiles on top of each other, we use the strand: if the gene is on the negative strand, we flip it, reverse it, right? We always want 5' on the left and 3' on the right. And then we see this very interesting, characteristic profile — I'm pretty sure many of you are familiar with it. Anecdotally, this type of plot, which I calculated just for the sake of this presentation — say five years ago it could be published in a journal; it was news. Now it's routine for us. We look at these as a type of QC: if the profiles look the way they do on the screen, the data is okay. You can see that there is a gap at the TSS, which represents the missing nucleosome; then an increase of coverage inside the gene, at the 5' UTR or first exon; and then it winds down in the body of the gene. On the right there is H3K36me3, another mark — an elongation mark. This mark covers the exons, the gene body, of actively transcribed genes. So at the transcription start site it's still missing, but then it starts growing, et cetera. We can learn something from this type of profile. Now, the black curve is input — this famous input we were talking about. You can see some interesting features.
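The stacking just described — a window around every TSS, flipped for minus-strand genes so 5' is always on the left, then averaged — can be sketched on a toy chromosome. The coverage and gene positions below are invented to show the mechanics.

```python
# Sketch: aggregate coverage around the TSS of every gene, flipping genes on
# the minus strand so transcription always runs left to right, then average.
def tss_profile(coverage, genes, flank=1500):
    """coverage: per-base list for one chromosome; genes: (tss, strand) pairs."""
    width = 2 * flank
    total = [0.0] * width
    for tss, strand in genes:
        window = coverage[tss - flank: tss + flank]
        if strand == "-":
            window = window[::-1]        # flip so 5' is on the left
        for i, v in enumerate(window):
            total[i] += v
    return [t / len(genes) for t in total]

# Toy chromosome: a bump just downstream of two TSSs, one on each strand.
cov = [0] * 10_000
for p in range(5_000, 5_200):
    cov[p] += 1                          # downstream of a + gene at 5,000
for p in range(7_800, 8_000):
    cov[p] += 1                          # downstream of a - gene at 8,000
profile = tss_profile(cov, [(5_000, "+"), (8_000, "-")])
print(sum(profile[:1500]), sum(profile[1500:]))
```

Thanks to the strand flip, both genes contribute their signal to the downstream half of the averaged profile, which is what makes the characteristic H3K4me3 and H3K36me3 shapes visible at all.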
When the mark is highly enriched, like the red one, H3K4me3 — note that all the plots have different scales; here the scale goes to 25 — we almost don't see the input: it's a flat line. When we have small enrichment — for example, look at the very last plot, H3K9me3: we know this mark is not present at the TSS of most genes, only at the TSS of a handful of development-implicated genes — the size of the coverage is very similar between the input and the mark itself. And you can see that the input has a certain peak at the TSS. This type of behavior is very experiment-specific — I can't say you will always see a peak; sometimes you see a trough. In fact, from my experience, you see different features depending on whether it's single-end or paired-end data. But it is related to the structure of the chromatin: if the chromatin is more open at the TSS of a gene, it's easier to shear, and fragments will start more readily in this location. That's what we might expect. And then H3K27ac — which is not repressive, it's an enhancer mark — also has a double-hump picture very similar to H3K4me3, and most of the marks will have a trough right at the TSS simply because it's nucleosome-free. All right, that's just a representation. And that's just — oh, a natural break. I was at some symposium at UBC a few weeks, a few months ago, and somebody showed this picture, and I found it kind of interesting. It was an epigenetics symposium. And I Googled it: I typed "epigenetics" and went straight to the images, and sure enough, image number six was this one. Basically, what we will be talking about now — finding enrichment, et cetera — is more like an art. And yes, the advice here is that if somebody asks you something you don't understand, say it's due to epigenetics. So now we come to the very last part of all of this.
I have, I think, 10 minutes to talk about enrichment, and we will be doing this in the lab as well. So what is the main challenge — how do we find enrichment in a ChIP-seq experiment? For ChIP-seq, our expectations are the least obvious. For other types of data we have more expectations; for ChIP-seq we have no theory, we don't know what to expect. That's one thing — it is data driven: the data tells us what we observe. Also, in many cases we don't have true analytical knowledge about the distributions behind the data we see. We don't know the distribution of the background, and we don't know the distribution of our ChIP-seq signal. Yes, it's count data, so probably Poisson — but when we look at the data and try to fit a Poisson to it, we see that sometimes it is Poisson and sometimes it's not nearly Poisson, even though it's the same type of experiment, maybe done in slightly different cell types, at a different time, or on a different sequencer. So that's the major challenge: we don't know how to discriminate between signal and background, because we know the distribution of neither the signal nor the background. Furthermore, all experiments have biases and experimental artifacts. We often don't know about those either, and sometimes we can't avoid them — we may have rare cells, or data which is full of artifacts. So here is the first thing you can stumble across when you look at alignments. If you browse your data long enough, you come across something like this — and you can tell me what is wrong with this picture. It's exactly what we already saw: six marks, the black one is input, and the pink and the blue are actually transcription factor binding experiments. What is special about what we see here? It's kind of bothering. Yes — they all look very similar. You should be suspicious.
Why would an activating mark and a repressive mark look so similar? This is the typical view of an alignment artifact. You can see thousands of these across the entire human genome — thousands of locations where we see something like this. It's a little bit experiment-specific — it depends on the read length and fragment size — but we will see them, and we would like to get rid of them. We don't always have the luxury of having all six marks; sometimes we have one ChIP-seq experiment and one input. Then what we do is look for places where our ChIP-seq experiment — say, the blue one — and the input look very similar; we get suspicious and probably blacklist those locations. So, alignment artifacts: what is their origin? Very likely it is multiple copies of a certain sequence in the sample genome which are not present in the reference. Essentially, in the sample we have multiple copies of some sequence; as a result, we get many reads coming from those copies, and when we align them back to the reference they all land in exactly the same location. So the coverage grows artificially: they were all just non-specific reads, but because there are so many copies, they pile up at the same genomic position and present an alignment artifact. Sometimes, if people have a thousand such locations among hundreds of thousands, they say: well, a thousand — maybe I can live with it. But the problem is that when you compare several data sets, these locations will always end up in the intersection of your Venn diagram. You take set one and set two, intersect them, and you have maybe 5,000 common locations — and out of those 5,000, 1,000 will be these artifacts. You don't want that; you would like to get rid of them. So here is a result from FindER, our tool. The black is the profile of your input experiment — this is chromosome 1, and here is the centromere.
The pericentromeric regions have lots of these artifactual alignments. This is a log scale — we took the log of the coverage — so you can imagine these towers reach tens of thousands in coverage, while the average coverage is 10. You have locations with huge towers which would appear to you as enriched, but they have to be discarded. Now, something I'd like to mention just for your knowledge, without going into much detail: surprisingly, these artifacts are a little bit cell-type specific. Most of them come up again and again in different experiments — they are recurrent — but there are certain locations which are specific. You can see it here in this colon cancer example: we see the artifact here, but there is nothing in the mammary gland and enteroid samples. This is almost the same data, but the blue marks those locations which we took from ENCODE. ENCODE published blacklisted regions — regions with these recurrent alignment artifacts — for both human and mouse. And you can see that, sure enough, most of the red regions, which we identified de novo for this particular data set, coincide with the blue blacklisted regions, but there is a subset which does not. So you have the alternatives of using the published blacklisted regions or finding them de novo; here is an example for both mouse and human. So now, FindER is a tool which we are actively developing and hoping to finalize. What is the problem with these analyses? As Martin mentioned, this is a hot area of research. There is a big list — there are lots of papers, even papers comparing multiple tools; there are maybe 50 different tools, all dealing with calling enriched regions for ChIP-seq experiments. They all have different flavors and different focuses. Some of them are funny; some are specifically designed for single-end experiments.
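Applying a blacklist is just an interval-overlap filter: drop every candidate region that intersects a blacklisted region, the way one would apply the ENCODE blacklist to a peak set. The coordinates below are invented examples.

```python
# Sketch: drop candidate enriched regions that overlap a blacklist.
# Regions are (chrom, start, end) tuples, 0-based half-open as in BED.
def overlaps(a, b):
    """True if two regions are on the same chromosome and intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def remove_blacklisted(regions, blacklist):
    return [r for r in regions
            if not any(overlaps(r, b) for b in blacklist)]

peaks = [("chr1", 1_000, 2_000), ("chr1", 5_000, 6_000), ("chr2", 100, 900)]
blacklist = [("chr1", 5_500, 7_000)]
print(remove_blacklisted(peaks, blacklist))
```

For genome-scale sets one would use sorted intervals or a tool like bedtools intersect instead of this quadratic scan, but the logic is the same.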
For single-end experiments they exploit the advantages and quirks of single-end data. But this is just because people try to do their best for the analysis. Now, most of these tools use a limited set of data to validate themselves, and as a result you sometimes fall into the trap of overfitting. You basically have a number of free parameters, and you adjust those parameters so that your data is described best. Of course, when new data comes, you may have a problem. And I am having this problem all the time with Finder: it is very easy to go down this rabbit hole of trying to fit your few parameters to the data. So the idea we have now is to have as few parameters as possible, almost no parameters, and to use a data classification approach rather than fitting a Poisson or negative binomial background model, which was used in many other tools. I will be short here; you have this in your slides. The tool which is probably the most popular, the tool which, it is fair to say, everyone forks, is MACS, first published in 2008. There has been a big evolution of this tool, but we still think it was designed for data which is not of the kind we are looking at now, not as deep as the data we have now. I will probably leave it there; if you have questions, we can talk about the details. There are references, and maybe I will finish by talking about narrow versus broad regions. You have heard already this slogan that some of the peaks, or some of the regions, are punctate, and some are broad regions, broad domains. That is because of the nature of what we actually see in the browser, what we see as enriched. Of course, when we talk about the basic enrichment, it happens at the nucleosomal level: one nucleosome, when it's a histone modification, is either marked or not.
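The "fit a background model" idea mentioned above can be stated in one function: under a Poisson background with rate lam reads per window, how surprising is observing k reads? This is only the textbook version of the idea; MACS itself uses a local, dynamic lambda and many refinements beyond this sketch:

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): the p-value of seeing k or more
    reads in a window when the background rate is lam reads per window."""
    # Sum the PMF from 0 to k-1 and take the complement.
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# With an average of 10 reads per window, 30 reads is highly significant,
# while 12 reads is unremarkable.
print(poisson_tail(30, 10.0) < 1e-6)  # True
print(poisson_tail(12, 10.0) > 0.1)   # True
```

A classification approach, as the speaker describes, avoids committing to this distributional assumption in the first place.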
Now, some of the marks, like H3K4me3, occur on several neighboring nucleosomes: maybe five or ten of them are marked, and then we have nothing, so as a result they appear as a relatively compact cluster. Some of the marks occur as regions where multiple nucleosomes juxtaposed to each other are all marked, and those regions can be up to 100 kb or even more; it's a kind of effective spreading. If we look back at this plot, oops, these are human embryonic stem cells. For example, if you look at this track, the purple one, it is H3K36me3. As we mentioned already, this is an elongation mark; it sits on top of actively transcribed genes, and sometimes you have genes which are long and will be covered, carpeted, by this mark rather significantly, so kilobases of this mark. Now, the red one is very punctate; this is H3K4me3. The brown one is H3K27me3. You have probably heard that this is an extended, broad mark, and in most differentiated cells that is what you will see: a rather broad mark domain. Often it has a shape like this, a peak with a neighboring broad domain; it's a very, very frequent situation. But in embryonic stem cells, H3K27me3 actually looks like a very compact mark. So maybe it's specific to the data we have chosen to go with, but the look of the H3K27me3 repressive mark here is different from what you would normally see in other data, so I just wanted to mention this. Now, one more thing to mention, before my laptop dies and before we go to the break: let's roll back and actually ask what we are seeing here. We see different heights. Our experiments were done on multiple cells, in this particular case probably tens of millions of cells. And we see in some locations coverage of 50, in some 100, so 100 DNA molecules stacked on top of each other, in another 200, in another 10.
So what does it mean? It probably means the frequency of this mark across the cells. Just to mention: this is not a single-cell experiment. We are looking at the average over the bulk, and when the mark is a little bit lower, it means that it is less probable to have this mark here: it is present in some cells and not present in others. When we have more, the mark is present in more cells, modulo alignment artifacts and modulo experimental artifacts, right? Sometimes you have lower coverage just due to mappability, which we discussed, or other things. And with this I would like to, more or less, finish.
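The bulk interpretation above can be put as a back-of-the-envelope calculation: coverage at a position, relative to the coverage you would expect if every cell carried the mark, estimates the fraction of cells in which the mark is present. Note that the calibration value here is a hypothetical assumption (e.g. coverage over a region believed to be marked in essentially all cells), not something the experiment hands you directly:

```python
def mark_frequency(coverage, expected_full):
    """Estimate the per-cell frequency of a mark from bulk coverage,
    given the coverage expected when every cell carries the mark."""
    return min(coverage / expected_full, 1.0)

# If ~200 reads correspond to 'marked in every cell' (an assumed calibration):
print(mark_frequency(100, 200))  # 0.5 -> mark present in roughly half the cells
print(mark_frequency(10, 200))   # 0.05 -> rare in the population
```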