So, we will be talking about ChIP-seq analysis. Many of the things I will cover are rather general, so they are applicable to other types of enrichment-based sequencing experiments. The learning objectives are written here, maybe you saw them already: we will discuss what we do with the sequences in order to arrive at some biology — not the final biological analysis, but some biology — and how to identify the regions which are enriched in ChIP-seq experiments. We will also try to discuss what enrichment actually is and how we define it, because that is goal-specific, question-specific. It sounds very natural and easy to say what is enriched, but stop for a second and think about what it really is. So, that's the outline: we will talk about alignment, general properties of the reads, methods of quality control, how we generate a coverage profile from the aligned reads, and the principles of the approaches to data analysis. I have way too many slides, so I will be skipping some; I wanted you to have them, maybe as written material. If you have questions, please ask, and I also reserve the right to ask you questions, just to get a little bit of feedback. This first part may be almost redundant, because Martin already showed a lot of the biology — what we are studying with ChIP-seq data — so here are just a few comments. There are several flavors of ChIP-seq. One is what it all started with, transcription factor ChIP, where you have an antibody against a certain protein and you do the experiment with this antibody. A little later came antibodies to histone modifications, so we are studying chromatin structure, which we believe is a determinant of the compaction of chromatin. I really like those electron microscope images where we actually see nucleosomes, and the 30-nanometre fibre, which is the next level of chromatin compaction and is also determined by those histone tails which Martin already mentioned, and by the acetylation of those tails. So that's a summary of the process, and again, there is a great deal of overlap with Martin's talk; you already saw certain things. What I would like to mention here is the immunoprecipitation step itself: almost everything around it — the DNA manipulation, dividing the sample, and so on — can typically be controlled and measured, which is actually very interesting. So here you have all the steps, and then typically, at least in our centre, in Martin's lab, we work at several levels of QC. First is QC0, which you already discussed: QC based essentially on the sequences themselves, before alignment. Then we have QC1, certain QCs which are done after alignment; we are going to talk about those. And here, as I said, is our almost final result, where we have reads piled up on top of, in this case, a human chromosome. So now, short-read alignment: what do we need in order to perform it? We need a reference genome.
We need a FASTQ file of interest, of course, or several FASTQ files — read one, read two — as we already discussed. We have the aligner itself, which is a black box, and typically it is a black box for all of us; it is very hard to understand all the details of the aligners we use, but we benchmark aligners, and I will talk a little bit about benchmarking. So the input to this process is three things — FASTQ, reference, and the aligner — plus, of course, alignment parameters. Even if some facility does the alignment for you, it is better to understand those parameters, because they are critical, they depend on the data, and if you don't have mainstream data they can affect the quality of your results. And then there is the alignment file; I will be talking a lot about that file and its format. There is also something else one can do with short sequence reads, and some people do: short-read assembly. We are not going to talk about it, almost at all, in this workshop, but people try it even for ChIP-seq data. It is basically stitching reads into continuous, longer DNA stretches, contigs, and then aligning those to the reference, or just analysing them directly. There is a review, referenced here; you can look at it and think about the applicability of this approach. The reference genome we usually download from NCBI — that is the main source; UCSC has its own sequence, one can download from there. It is a big file. It consists mostly of A, C, G, and T, as it should — it is DNA. It also has Ns. What do Ns mean? Ns mean that from some other methodology, for example cytogenetics, we know that in this location there should be some sequence, but we cannot sequence through it, or this part is absent from the reference. The reference was assembled with some gaps, and in order to fill a gap where we know there are, say, approximately 5 megabases of sequence, the people responsible for the assembly put in 5 megabases of Ns. Sometimes the Ns are short. Also, if you look at the genome, even in the previous version of the human assembly, you can find letters different from A, C, G, T, and N. There are the ambiguity codes — IUPAC nomenclature — letters which stand for multiple possible bases. In hg19 there are three locations like that. I actually learned this the hard way: I wrote a tool which only knew an alphabet of five letters, my program broke, and then I figured out that there are three locations in the genome where the letter is not one of those five. GRCh38 is even more complicated in this respect. So just be aware that there can be letters beyond the four DNA letters. Also, if you look at the genome — and the genome we download contains this — there are letters in lower case and in upper case. It is not necessarily always the case, and usually aligners are agnostic to upper versus lower case, but this is how repeats are masked: the genomic sequences which are masked as repeats are in lower case. Now, one important thing about the genome when we do alignment: the genome usually has to be indexed. I don't know — are you familiar with indexing of a genome, and why we need it, or indexing of any file? Indexing is a very simple concept, in simple terms.
When we have a long file, we can access it sequentially from top to bottom; but if we want to access the middle of this long file, we need an index. We need to impose some kind of coordinate system on the file, and that is what is done. The genome is very long — 3 billion nucleotides — so in order to access certain locations it has to be indexed: then we know that starting from this position there are this many nucleotides before it. I don't know if I'm 100% clear, but it gives you an idea of what indexing is. Now, just a few facts. We have to deal with a range of genome lengths. The shortest genomes are really very short — on the order of kilobases — and the human genome is 3 gigabases. As you know, there are some plant genomes which are very, very long, tens of gigabases. Of course, alignment time depends on the length of the genome, and sometimes not linearly: the longer the genome, the longer, unavoidably, we need to align. I'm not going to go into many details of the short-read aligners, but just to give you an idea, there are two major classes. One class is based on hash tables, and one is based on building suffix trees. For hash tables the principle is very simple. Take the genome, and say we have 30-base reads. The very natural approach: I take the genome and tabulate every 30-mer of the genome. This gives us the entire complexity of possible genomic sequences at that length. Then I have my file coming from the Illumina machine, also with 30-base-pair reads, and I just compare: I have this big lookup table with each sequence and the coordinates where it comes from, I take the first read from my file and check whether it is present in my table. If it is there, I find a coordinate; if it is not, probably it doesn't align — or maybe there is a mismatch. That is where the complications come in: we know there are sequencing errors and there are natural polymorphisms in the data, which means I am not looking for an exact match but for an alignment with mismatches. Handling that is where real aligners do their work. There is a review referenced here which is a little bit mathematical but contains all the details. Now, I already mentioned benchmarking. When we would like to align our data, we have to decide on an aligner — and Wikipedia is actually a pretty good source of reasonably reliable information for biology; there is a Wikipedia page on short-read aligners listing many, many of them. To decide which one to take, we would like to know their properties, and there are several publications specifically comparing aligners. How do we compare aligners? First, what matters to us is execution time: we have one genome, we have one file of sequence reads — how long will it take to align? You can see there is a huge range of execution times, orders of magnitude, and it matters whether I wait one hour or a day for the job to complete. The second criterion, of course, is how accurate the aligners are, and here we are talking about sensitivity and specificity: the percentage of reads aligned, and how many of them are aligned properly. And what would be a good approach to study this? What would be your suggestion?
How would I do this type of study, sensitivity and specificity? I have my human genome, for example. A file coming from the sequencer is probably not the best thing to use, right? Because I don't know the result; I don't know the ground truth. So usually people generate synthetic data sets. How? I take the genome and, either randomly or base by base, shred the genome into 30-base pieces and create an artificial FASTQ file. That is very easy to do, and for every read I would actually know the location it comes from. Then I feed this into the aligner with the same human genome. I can also introduce random SNPs into those reads, and many other things, but the easiest is just to feed it in and see how many reads out of my synthetic data set are aligned properly and how many are not. That is a good exercise, and that's what people did. As you can see, there is a reasonably large range of values for the percentage aligned and for the percentage aligned properly. Some of the aligners — Bowtie, for example, the first Bowtie — aligned only about 50% of reads, but those which were aligned were aligned 100% correctly. Another one, BWA, aligned 96% of reads, which is a good enough number, and aligned 96% of them correctly. And then there are the other numbers. So you always have a balance between the percentage aligned and the percentage properly aligned, and you would like to maximise both. This comparison is more recent, 2017, and the reason I added it is that besides the academic, freely available aligners, they also benchmarked Novoalign, which is not free — one has to buy a license. The scoring here is in pluses: more pluses means better, few pluses not so good. You can see that, for example, this non-free commercial aligner has lots of pluses, but only one plus for computational time: it really is a very nice, very sensitive aligner, it aligns lots of reads, however it takes a lot of time. We will be talking about the BWA aligner, because it has a good compromise between precision and computational time and all the parameters we just discussed. This is the reference to BWA. Any questions so far about this part? Now, what parameters are important? This applies more or less to every aligner, because these are very logical, common parameters. First of all, as Martin already explained to you, even when we sequence 100 bases, or 125, or 75, we never use the whole read — we never start with aligning the whole read. And the question to you: why? If I have a 75-mer, why don't I want to align the whole 75-mer right away? Well, yes: the chance of hitting a polymorphism is higher across 75 bases — that's one — and then we would be in trouble. The second reason: remember the plot you saw, the FASTQ qualities — quality winds down, the error rate grows along the length of the read, so presumably at the tail of the read we will have a higher error rate. Yes? [Audience] I know shorter reads are cheaper to sequence; isn't the cost now just set by the Illumina kit, 75 or even 100? Why sequence 125 if we're not going to align all of it? [Instructor] Right — the seed region is different from the read itself. We align in steps: seed first, then we extend.
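To make the shredding-and-tabulating idea concrete, here is a minimal Python sketch of the synthetic benchmark just described, using the hash-table style of aligner from the previous slide. It is only an illustration of the principle — the toy genome is invented, and real benchmarks add mismatches, quality models, and genuine repeat structure:

    # Toy benchmark of the "shred the genome" idea: tabulate every k-mer of a
    # reference in a hash table, shred the same reference into synthetic reads,
    # and check how many look up at their true position. Real aligners handle
    # mismatches, repeats and memory far better; toy_genome is made up.
    import random

    k = 30
    random.seed(1)
    toy_genome = "".join(random.choice("ACGT") for _ in range(10_000))

    # Hash table: k-mer -> list of genomic start positions.
    index = {}
    for pos in range(len(toy_genome) - k + 1):
        index.setdefault(toy_genome[pos:pos + k], []).append(pos)

    # Synthetic "FASTQ": every k-mer of the genome, so ground truth is known.
    correct = total = 0
    for true_pos in range(len(toy_genome) - k + 1):
        read = toy_genome[true_pos:true_pos + k]
        hits = index.get(read, [])
        total += 1
        if true_pos in hits:          # aligned, and to the right place
            correct += 1

    print(f"{correct}/{total} synthetic reads placed at their true position")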
[Instructor] And about read length: it's true that the incremental gain from the last bases is small — going from 75 to 100 you gain only a little — and I will talk a bit about this. But I think the cost of sequencing now is mostly determined by the Illumina kit, so we are not sequencing shorter just to save money. Still, we use a seed of around 32. Why 32? Because going much shorter is dangerous: a random 20-mer will very likely align to the human genome somewhere. If you take a 20-base DNA sequence and use BLAT in the UCSC browser, for example, you can probably place it somewhere on the human genome. So you don't want your seed to be very short. On the other side there is everything we said — error rate growing at the end of the read, more chance of polymorphism, et cetera — but there is also a computational issue. Why would we like our seed shorter? Because, even for a suffix tree, we want the size of the structure — the tree or the table — smaller. For 75 bases, the space of possible sequences is 4 to the power of 75; the size of that table is just enormous, ginormous. So we would like the seed as small as possible: our search will be faster, our aligner will be faster, if the seed is shorter. It's a compromise. So seed length is the first parameter. The second parameter is the number of mismatches allowed in the seed, and the typical default is two. Well, how often do we have a polymorphism in a human genome — if, say, any of our genomes were aligned to the NCBI reference? Do we remember? Roughly every thousand bases we have a mismatch; each of us carries about a million of them. So two allowed mismatches is probably not a problem for us. I think those parameters come from the early days; most reads nowadays align without any mismatch in the seed. But if we have a mouse which is not from the reference strain, as we already discussed, then there is a difference every 200 bases or so, and there are hotspots of polymorphism, so we would like to have this freedom. After we align the seed — after we anchor the read somewhere on the genome — we finish the job with local alignment: the rest of the read is laid onto this location and double-checked against the sequence there by some local alignment. [Audience] For hash-based aligners, you need every possible seed in the table? Yes, yes — that's the only way. So now, say we have aligned our reads. Imagine — and it has happened to us; it happened to me recently — I take the file after alignment and I have zero reads aligned. What can it be? Wrong species — that's a good one. For example, the wrong genome was used. Mouse versus human would probably still give some reads aligned, but something went really wrong. Or — this actually happened — the data were SOLiD reads, and I assumed they were Illumina reads. It can happen; just check. And another thing: if every single read contains an adapter — a stretch that doesn't belong to the genome — that's a problem too. [Audience] Why would a SOLiD read not align?
Because it doesn't use the same alphabet — it's in color space; it has numbers, not bases. Yes, yesterday, or the day before, I had exactly this accident: I just didn't look. So, unaligned reads — that's another part of the file after alignment. Luckily for us, nowadays it's a very small fraction; usually 98-99% of reads are aligned, which is great. When they are not aligned we basically throw them away — or maybe they are of some use: maybe you had a spike-in control in your data, and then you take those reads and align them to the spike-in genome. Everything depends on the problem. Then, as we already discussed, most of our libraries, even for ChIP-seq, are paired-end, and then we have another characteristic: reads in pairs can be properly paired or not properly paired. [Audience discussion of acceptable alignment rates] Well, there is a range. When we started, 30% mapping was okay, but quality improved quickly, so 50-60% for ChIP-seq became a more or less okay number. It all depends on your problem — maybe Martin can comment more — on the amount of starting material, for example, and on what went into your library, what kind of DNA. The library can be naturally contaminated, and then you are not expecting everything to align. [Martin] But there are many high-quality ChIP-seq libraries with low mappability, because you've got salmon sperm, or other carrier material — not spike-ins — that is brought down in your IP; we've seen such libraries. [Audience] So a good quality library should be 75-80% or above? Well, nowadays we often have much higher than that, in the 90s. And of course, if the cutoff is 50% and you have 49.9%, you take this library, you go to your PI, and yes, you still use it — it's at your discretion. Even a library with 30% alignment can be useful, if you really understand what happened and those alignments are not accidental; of course, when you try to publish it, you will need to explain why. So, here is an example, which you already saw, of two flavors of ChIP-seq. One is sonicated, and this is the fragment-size distribution after we align our reads and computationally recover the distribution. It should resemble the Agilent traces we observed going in — resemble, but never coincide, because of the PCR steps involved, so don't be surprised. This is a typical curve: normally you would like a mode of the distribution somewhere around 150-200 bases, with everything dying down by around 300. What is critical for us when we look at a distribution like this? When we have several samples which we want to compare at the end, we would like those distributions to be as similar as possible. If they are different, we will pay a price, because this is essentially what gives us the resolution of our experiment: at every given location we know the binding occurred somewhere within this fragment range, but we don't know exactly where. Ideally all our fragments would be, say, exactly 150 bases long — but that's not the case for sonicated data.
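This fragment-size distribution is easy to recover yourself from a paired-end alignment. A minimal sketch, assuming a text SAM file called example.sam (a placeholder name); it reads column 9, the inferred fragment length field we will meet again when we go through the format:

    # Sketch: recover the fragment-length distribution of a paired-end library
    # from column 9 (TLEN) of a SAM file. "example.sam" is a placeholder; with
    # a BAM you would go through samtools or the pysam API instead.
    from collections import Counter

    lengths = Counter()
    with open("example.sam") as sam:
        for line in sam:
            if line.startswith("@"):          # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            flag, tlen = int(fields[1]), int(fields[8])
            if flag & 0x2 and tlen > 0:       # properly paired; count each pair once
                lengths[tlen] += 1

    for size in sorted(lengths):
        print(size, lengths[size])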
For native ChIP, with MNase digestion, it's a little bit better. Here we have a very strong peak at exactly — when I first saw it, I thought, whoa — 147 bases. It's exactly the nucleosome: MNase chews the spacer between two nucleosomes, so we have the maximum at 147 bases, with a little bit more and a little bit less, and here we have the dinucleosome peak. That's a typical picture. So, a pair is not properly paired when the distance between the two reads after alignment is outside the expected range, or they align to different chromosomes, or sometimes the distance is even negative — you can imagine it: read one aligns after read two. Something went wrong, but it's possible. All of those are not properly paired, and this is part of the standard statistics of the alignment file; this information is always there. Now, briefly: non-properly-paired reads can actually be an indication of something interesting — for example, a structural variation in our data. If we have a cluster where read one of each pair aligns on chromosome 1 and read two aligns on chromosome X, and they are all clustered — a bunch of read-ones here on chromosome 1 and the linked read-twos there on chromosome X, on either strand — then maybe we had a translocation, a structural variation where this end of chromosome 1 was joined to chromosome X, and our pairs were actually spanning the junction. After we sequence and align them, they end up on different chromosomes. So sometimes it's interesting; but if you don't expect those events, we just ignore them. Then we have duplicated — people call them duplicated, but sometimes they are multiplicated — reads, or fragments; I'll use the terminology for reads, but it applies to fragments. For single-end data, this is another argument for paired-end sequencing: with single-end reads we only have one read, and if it's a 70-mer and our coverage is higher than 70, then for sure at least two reads will be duplicated — there is simply not enough room for every read to start at a distinct location. I have it on this slide: if we have M-base-pair reads and a stack of reads higher than M, then at least two reads will start at the same position and will show up as duplicated. That's the single-end situation. With paired-end we have much more room: although one read can accidentally have the same start, if we are actually sequencing a different molecule — not a PCR artifact — the second read will likely start at a different location. [Audience] Do you remove these duplicates in single-end data? We remove them, but not all of them — we collapse: if ten reads start at the same location, we keep only one. We don't completely remove them, because there are things people can do with them in some experiments: if you really have a lot of them, you can study them, you can see where they come from.
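To make the collapsing logic concrete, here is a minimal Python sketch. It is not any particular tool's algorithm — real duplicate markers such as Picard also consider clipping and other criteria — and the read tuples are invented purely for illustration:

    # Sketch of the logic behind duplicate collapsing: single-end reads keyed
    # by (chromosome, start, strand) collapse a whole stack to one copy;
    # adding the mate's start for paired-end data shows why pairs give much
    # more room. Reads are assumed to be tuples, purely for illustration.
    def collapse_duplicates(reads, paired=False):
        seen = set()
        kept = []
        for read in reads:        # read = (chrom, start, strand, mate_start)
            key = read[:3] + ((read[3],) if paired else ())
            if key not in seen:   # keep the first copy, drop the rest
                seen.add(key)
                kept.append(read)
        return kept

    stack = [("chr1", 100, "+", 250), ("chr1", 100, "+", 310), ("chr1", 100, "+", 250)]
    print(len(collapse_duplicates(stack)))               # 1: single-end collapses all
    print(len(collapse_duplicates(stack, paired=True)))  # 2: mate start rescues one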
Coming back to what you can do with them: you can, for example, generate a profile only from the duplicated reads and ask — are they coming from particular locations of the genome? Do they have some specific sequence? Maybe they are due to some specific sequence, et cetera. But typically you just collapse them and use one copy. So we mark them, as I said, and we collapse them. [Audience] Is this after alignment? Yes — it's done with some tools after alignment. We need to use something to mark them, and it is actually a tricky business how to mark duplicated reads, because it is non-trivial: you may also have to look at the sequence. If the start position is exactly the same but the sequence is different, there are different criteria for what to call a duplicate. We usually use Picard, or the newer tool called sambamba — I will talk a little about it and we will use it in the tutorial; there is a mark-duplicates tool in both packages. So we have our alignment file, we run it through, and it marks the duplicates. Then, to your question: in some tools — and we are coming to this — we can say "filter duplicated reads", and when you say filter, it collapses: it leaves one copy. If you really want to get rid of all copies, you have to do something extra manually — we can talk about this if you need it, but it is not readily available. So removing duplicates means collapsing them — in some tools, for sure, it is just collapsing, keeping one copy; we can double-check. Now, there are also multi-mapped reads; we already touched on this a little. Our genome is highly repetitive — as you know, in human or mouse it is about 40% repeats. Although we call them repeats, the sequences are not 100% identical: these repeats mutate, they are ancient. Still, if we take 50-base-pair reads, then about 15% of the genome will be unmappable with single-end reads: in 15% of the genome we have similarities, so a read originating there could equally originate from somewhere else. Here is a little cartoon, like the one Martin drew for you on the whiteboard: we have repeats; the red reads come from the orange repeat, but they could also come from this other copy; so while the green reads are uniquely mapped, the red ones have multiple mapping positions. When we use the BWA aligner, it tells us that a read is multi-mapped — there is a flag, or rather let's call it a property, in the BAM file. However, it reports the positions somewhat arbitrarily: if there are 10 positions where the read can align, it just gives us a few of them at random; there is an option to report all of them, but then the order of the positions is also random. So it is a little bit tricky. This is just what I said: two red reads coming from here, two other red reads coming from another copy of the repeat — and in the single-end situation we have to drop all these red reads; there is no way to rescue them. Now, this is the paired-end situation, and look what happens: the green read of a pair aligns uniquely outside the repeat, and through it those red reads are now anchored uniquely. That is how we leverage paired-end information.
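Before the numbers, here is a toy sketch of what "mappable fraction" means: the share of positions whose k-mer occurs exactly once in the sequence. Real mappability tracks are computed genome-wide with proper alignment; the toy genome below is invented and deliberately repetitive:

    # Toy estimate of single-end mappability: the fraction of k-mers that
    # occur exactly once. Longer k rescues some, but not all, repeats.
    from collections import Counter

    def mappable_fraction(genome, k):
        counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
        unique = sum(1 for c in counts.values() if c == 1)
        return unique / (len(genome) - k + 1)

    toy_genome = "ACGTACGTGGCCA" * 50 + "ACGTTTGCAGT" * 50
    for k in (10, 20, 30):
        print(k, round(mappable_fraction(toy_genome, k), 3))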
Now here I give you some numbers, and this is for mouse. Per chromosome, the blue indicates the fraction of the genome which is uniquely mappable, and the red is the unmappable fraction. The degree of non-mappability, or non-unique mappability, depends on the chromosome; different chromosomes have different properties. X and Y, because of the high homology between them, are more prone to mix-ups between reads and to low mappability. And here you can see single-end: this box plot is the distribution over chromosomes of those bars, and the range is from about 9% of a chromosome non-mappable up to about 16%. Paired-end shifts this by about 3%. So paired-end helps, but it doesn't help that much. This is for reads of length 75 base pairs, say; going to 100 doesn't help much, and 150 doesn't help much either. The reason is that in order to make the next significant step in mappability, one has to sequence something like 350 bases: the mammalian genome has a lot of Alu repeats, which are about 300 bases long. So we shouldn't be too greedy when deciding between 75 and 100; they are almost equally okay. And why is knowing mappability well important for us? A question to you: say I simply get rid of the multi-mapped reads and don't care about them at all — why does mappability still matter? There are several situations. For example, I study one particular sample and I'm interested in an absolute answer: I really want to see where on the genome enrichment happens in my ChIP library and where it doesn't, and then I draw functional conclusions from this. Sometimes I see lots of enrichment, and then a disappearance of enrichment — a hole — and I might say: wow, something happens here, there is no enrichment, maybe dense chromatin, something biologically meaningful. However, this hole can be simply due to a repeat, due to our inability to align reads there. So we have to be careful; we have to remember this is possible. The second situation is when we compare several data sets from slightly different conditions. Although I told you mappability doesn't depend much on read length, that is only true above about 50 bases. If we go down to 30 bases, mappability starts decreasing almost exponentially: for 30-base-pair reads about 20% of the genome is not mappable. So if I compare a library with 30-base reads against a library with 75-base reads, I will be discovering differences that are purely due to the difference in mappability between the two libraries — alignment in one library and no alignment in the other. So that's the thing. On this slide are screenshots from the UCSC browser — most of you are probably familiar with it; it's a great tool for all of us — and there are tracks for mappability: somebody pre-calculated mappability for us, using exactly this kind of synthetic alignment of reads to the genome, for 50, 75, and 100 bases. You can see there is a huge LINE repeat, about 6 kilobases, here, and nothing helps — read lengths of 50 or 75, it's the same, a huge gap. For some nuances 100 bases are better, but not dramatically; it's pretty similar. So let me cruise on. We discussed this already, and I had a question for you.
How do I analyse together one library with 75-base-pair reads and another with 50-base-pair reads? Well, typically people trim the longer reads to 50, although that's kind of sad, right? We sequenced those bases; it's tricky to justify throwing them away. Of course, if one has time, or a co-op student, one can do an easy exercise: take the 75-base-pair library, align it at 75, then take exactly the same library, trim it, and align at 50 — Guillaume already suggested this approach — and see the difference. It's a nice exercise to see in how many places there is a difference; a bit tedious, but that's how it is. So, we talked about alignment and the parameters of the aligner, and now we come to the format of the files. Every aligner on Earth that wants to exist sticks to one format, which is called the SAM format — the sequence alignment format, S-A-M — and then there is BAM, which is the binary version of it. You are probably already a little familiar with this. I put in a reference to the specification: it was updated very recently, it is constantly updated, and it is rather clear. And my suggestion — I really can't influence people much, but my suggestion: even if somebody did the alignment for you and you already have the files, if you want to work with SAM files, spend a couple of hours once in your life and read the documentation carefully, all of it. Then you know it; it's your luggage, and at some point it will be useful. So BAM stands, as I mentioned, for the binary form. We have tools to manipulate SAM and BAM files, and there are also very useful APIs — one in Java, coming from the Broad, and another for Python — with which we can seamlessly read these files. When we have all our sequenced reads in the form of a SAM file versus a binary BAM file, the compression from the ASCII file to the binary file gives us about a 5-fold reduction in size, but ChIP-seq files are still large, so when you plan disk space, keep this in mind. A SAM or BAM file consists of two pieces. The header has information about the genome which was used for the alignment; it also contains information about the aligner and all the post-processing — for example, the marking of duplicated reads, et cetera. In the ideal case it contains exactly the commands which were used, because maybe three or four years later somebody wants to see them, and all the information is there. It's very, very useful, and in the tutorial we will look into the header of a file. This is a summary of what the fields of a BAM record are. The first is the read name; the second is the flag, which is just a number, and there is a very nice website from the Broad — I have the URL on one of the next slides — where you can check boxes for the properties you would like, and it gives you the numerical value of the flag, which can be very useful.
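What that page does is decode a bitmask: each property of a read is one bit of the FLAG value. A minimal sketch of the same idea, with the bit values taken from the SAM specification:

    # Decode a SAM FLAG bitmask into its properties (bit values per the SAM spec).
    FLAG_BITS = {
        0x1: "paired", 0x2: "properly paired", 0x4: "unmapped",
        0x8: "mate unmapped", 0x10: "reverse strand", 0x20: "mate reverse strand",
        0x40: "first in pair", 0x80: "second in pair", 0x100: "secondary alignment",
        0x200: "failed chastity/vendor QC", 0x400: "duplicate",
    }

    def explain_flag(flag):
        return [name for bit, name in FLAG_BITS.items() if flag & bit]

    print(explain_flag(99))    # ['paired', 'properly paired', 'mate reverse strand', 'first in pair']
    print(explain_flag(1024))  # ['duplicate']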
That site didn't exist in the early days, so you had to keep a piece of paper or an envelope with the flag values written down. Then the chromosome; the read start — and I will talk about this, because it created a lot of problems in the past; the read mapping quality, which gives us a property of the alignment; information about mismatches, insertions, deletions, et cetera; and then the fields about the mate read. Here I say "mate" — it's the paired read, but in SAM terminology it's called the mate; it's the pair partner. So don't be confused by questions about mate reads: they are still paired-end reads, but in the format they are called mates. Then the inferred fragment length — and here you can filter yourself: if it's more than 1000 bases, something is wrong with this pair; if it's within range, you see numbers like 200. The inferred fragment length can be negative: if the read is the second, downstream read, the value will be negative. Then there is the sequence itself, and the base qualities. Those are the obligatory, always-present fields, and then there are optional properties of the read. Maybe let's keep certain things for the tutorial — you are familiar with flags, or we will go through them during the tutorial part. There are flags, and we can filter on them: we can filter out reads which are not aligned, we can filter out or collapse duplicated reads, we can filter out those which failed chastity — Martin was talking about chastity, a quality value for every read. We can also select reads aligned only on the positive strand — sometimes we want this — or only on the negative strand, things like that. And this is the website for understanding the flag; it is very nice. Sometime later today you can punch in this URL and look at it; it can be very, very useful. Now, the read start. It seems to be a very unambiguous, very simple property; however, the read start can be interpreted differently. When a read is aligned on the positive strand, it's unambiguous: the start of the read on the positive strand is this point on the genome, the 5-prime end. If the read is aligned on the negative strand, this leftmost point is actually the 3-prime end of the read — not the 5-prime end as you would expect. There were many, many issues in the early days when people considered this point to be the 5-prime end; in order to get the actual 5-prime end of the read, you have to move along the genome by the number of bases equal to the read length. So when we have a single-end read and we extend the read in some direction, that is important: we have to extend starting from the correct end. Is it clear what I was talking about here? Okay. There are still quite a few single-end libraries around — people still use them a lot — so this may still be an issue; that's why I wanted to mention it. And do you understand why we would like to extend the read in the single-end case? That's the usual strategy in the analysis of ChIP-seq data when we have single-end data: the actual DNA fragment which came from the precipitation is much longer; we sequence only part of it, but the binding happened somewhere inside this longer fragment. So when we generate those stacks of reads, we don't want to miss the actual binding location, and the way we do it is to extend each read — for example, by the average fragment length inferred from the data. And if some of you are interested, and we have time, I can talk about how to infer the fragment length from single-end data — it's not on my slides, but it's possible, and it's rather simple. Then we extend by it, usually something like 200 bases, and then our pile-up will be nice. If we only pile up the reads themselves, we can actually miss the binding location; it would be very, very noisy. So this is the way to do it. Mapping quality. Mapping quality is field number 5, and usually we just say minus q 5 when we filter with the tools: we ignore everything with mapping quality lower than 5, and by doing this we discard reads which are multi-mapped, or nearly multi-mapped. It's the same concept as a Phred score: lower quality means the read is not unambiguously aligned. By the way, something I forgot to mention which is interesting: we spent quite a bit of time talking about base quality in the sequencing, but I am not aware of a single aligner — except one, actually; the author of that aligner passed by a couple of hours ago when we started our session — which uses base qualities. Normally BWA, Bowtie, and the rest just ignore base quality. You use base quality to select your reads — to say, well, I don't like this read at all, chuck it — but the aligner doesn't use it: the mapping quality is based only on the properties of the alignment itself. We take the DNA sequence, try to place it on the genome, and depending on how well it fits there, we get a mapping quality. Base quality is not used. So where do we use base qualities? We use them somewhere — yes, SNP calling is one of the applications; probably there is something else. In the extension? I'm not sure it's used in the extension... [Martin] There are many parameters in how you extend the read past the seed — it's heuristic; it can be the sum of the base qualities at the mismatches: if you had a higher base quality at a mismatch, you might have less or more confidence in that extension. That's a parameter that can be set in BWA, where it does use the base quality — but it's true that in the seed it doesn't. [Instructor] Okay, so I missed this part — thanks, Martin. The next one is the CIGAR string, field number 6, and this is quite useful. Looking at the CIGAR: if we see 100M and we had a 100-base-pair read, it means the read is aligned end to end, with no insertions or deletions. If we see something like this, it means the first several bases were hard-clipped — Novoalign, for example, really likes hard clipping: it looks at the read, decides it doesn't like these bases, hard-clips them — removes them — and aligns the rest; those bases are simply removed from the reported sequence. Then two bases soft-clipped: they are left in the read sequence, but they were not actually used in the alignment. Then so many aligned bases, then 35 soft-clipped, 8 hard-clipped; and there is D for deletions and I for insertions, et cetera. So here is the nomenclature: M for an alignment match to the reference (which can include mismatches), I for an insertion relative to the reference, D for a deletion relative to the reference, N for a skipped region — rare in ChIP-seq alignments, but you will often see it in RNA-seq alignments, where we have tons of Ns from split reads spanning introns — then S for soft clip and H for hard clip. Let me skip the rest; I have an exercise where we go through the CIGAR, and maybe you can do it, or we can do it together, in the tutorial.
So, these are the tools we use to manipulate SAM and BAM files. One is called samtools; sambamba is another, more recent tool. The advantage of sambamba is that it uses multi-threading: any server now has multiple cores, you can run your program on many cores simultaneously, and with multiple threads you can speed up your analysis maybe ten times — though for small jobs samtools is perfectly fine. What can we do with both of these tools? First is indexing. We talked a little about indexing the genome; we can also index our alignment file. My file starts with chromosome 1, chromosome 2, but I would like to access chromosome 19: when my index contains the information about where in the file to start looking, access is essentially instant — instead of going through the whole file, with an index I jump to chromosome 19 right away. Next, we can edit the file, we can do filtration, and we can calculate statistics: we can ask how many reads there are, how many reads are duplicated — flagstat-type summaries. Here is the summary; you can see you can do lots of things with these tools. Any questions so far about manipulation? Another tool which we are using, and will be using, is Picard tools, coming from the Broad. One of its tools is MarkDuplicates, for marking duplicates, and another, for example, is for merging. Sometimes we have data coming from several sequencing runs and we would like to merge them: the example would be, I do a trial run, I see that my data is okay, then I sequence more, and then I merge the runs together. We would like to merge seamlessly — reads from chromosome 1 merged together, reads from chromosome 2, et cetera — and the merging can be done with Picard or with sambamba. Now, visualization. After we are done with alignment, we would like to visualize what happened — we trust our eyes a lot, so the first thing is to see the data, although how much we really see, I don't know; at least we have the illusion that we see something. We can't visually study the entire genome — that is very difficult — but we generate files in certain formats which are viewable in IGV, or the WashU epigenome browser, or the UCSC genome browser, and we can visualize at different levels. We can visualize at the read level, or at the read pile-up level, where we lose read identity and only look at the coverage: at every base we know how many reads cover it, but we lose the individual reads. At the read level we can study, for example, SNPs, or insertions and deletions, and all the little details; at the pile-up level we lose that, but we see a somewhat better global picture. Here I show you an example of pile-ups: six marks, six histone modifications — I think this is for human embryonic stem cells, actually the data we are going to look at — H3K4me1, H3K4me3, H3K9me3, K27, and so on. So we look at the different pile-ups, and we look at the input DNA, which is the control and which looks pretty uniform across this whole region, roughly half a megabase to a megabase. The histone marks show different patterns, and there are two transcription factor binding experiments, which look very different from the histone ChIP-seq — they are sharp peaks. Basically, that is our visualization: we look at this and it looks reasonable. And this is just the whole coverage, of course. What we are doing there is the following: we have our reads aligned and, since we have paired-end libraries, we have our paired reads aligned, so we have our fragments, and the fragments have different lengths. If this is my genome and these are the fragments aligned along it, and maybe a single read aligned here, then we generate the coverage profile. Essentially: here, under one read, the coverage is 1; then a second starts and it's 2; then here we have 3 reads, then 4; then one fragment ends and it's 3 again — actually for a very short stretch — then it drops to 2, then there is 1. That's what we see there; if I zoom in — and in the further slides I'll have something to zoom in on — that's the profile we see. Usually, when we run the conversion of the BAM file into the wig format — this is called the wig format — we just get those integer numbers. Now, the total number of reads differs: if I have library 1 and library 2, they can of course have different total numbers of reads. So if I literally look at the number on the left — in one case it's 160, in another it will be 20 — it can be a little misleading. There are lots of complications here, but there is a way out, and the way out is simple. We can do some normalization — but let's talk about normalization a little later, because it will never help us 100%: to normalize properly you would have to ask, should I normalize on the background reads or on the signal reads? And if I normalize on the total, what is the signal-to-noise in my library? Many things. But what I can do is look at my track and ask: are my peaks distinct from my background? If they are, my library is okay, even if the absolute number is 160 and, say, in this other library — I don't even see the number here, but say it's lower — I still see a clear distinction between peak and background, between foreground and background. That's okay. If my IP library looks like this input — it has some kind of peaks, but it's very dirty — then I'm failing. I think we should cruise a little; we have half an hour. So, this was visualization. Are you familiar with the BED format? Maybe a few words, because there are some tricks and gotchas here. If you look at UCSC, there is a nice frequently-asked-questions page, and all the details about the formats are there. The BED format is basically chromosome, start, and end; it is for visualizing objects which are just genomic intervals. When I have something, for example an exon, which has a chromosome, start, and end, that's where I use the BED format. BED has three required columns — chromosome, start, end — and it can have other columns, such as a name. Then there is the bedGraph format, which is similar — chromosome, start, and end — but it also has a score, an amplitude: for this fragment of the genome I have a certain number attached to it. Now, what is the problem with these formats in most cases? What is meant by zero-based? Here is an example: if the real coordinate is 400,601, the number in the BED file will be 400,600. I don't know why they did it; mathematically, of course, it doesn't matter — if you know the rule, you know what to do — but for a regular person it looks like: okay, this is my coordinate, I would assume that here is where my segment starts; actually it starts one base later. Where does it matter?
For example, I have a file with coordinates of CpGs — just dinucleotides — and all of a sudden I see three bases instead of two. These are the things one has to remember. Other formats, also from UCSC, are not zero-based; they are so-called one-based. So this is the BED format — very simple: chromosome, start, end, one interval per line. You can have millions of these lines, you can load the file into UCSC, and you can see something — I will point you to an example later of a BED file visualized at UCSC. Now, another important visualization format is the wig format, which exists to visualize exactly those coverages, in a compact way; we will talk a little about it during the tutorial — we actually generate this file, hopefully we will have time for it — and maybe let's discuss the details later. It is a very unusual format for us: it tells us where a cluster of reads starts. Imagine, as in this example, I have a cluster of coverage, then nothing up to here, then another cluster, then maybe nothing again, maybe some fragments here, then another cluster, then a gap. In order to have a reasonably small file, I only record the start of each cluster and the coverage values within the cluster, then the next start and its coverages, then the next start — very logical, right? That's how the wig format is organized. The bedGraph I already mentioned: again, it is chromosome, start, end, and then a score — in this case just the coverage. To generate the wig file from the BAM we will be using a tool which we created, though there are probably other software tools to convert aligned reads into coverage tracks. Now, an interesting complication, just so you are aware. When we feed the genome into our aligner, the aligner doesn't care how we name our chromosomes: we can call chromosome 1 "apple" and chromosome 2 "peach", and it will be totally fine. The UCSC browser, however, does care, because it has its own nomenclature: it knows what chromosome 1 and chromosome 2 are. Sometimes our genome FASTA file has just "1" instead of "chr1"; then in our BAM file, in the third field, which is the chromosome, we will have "1"; then when we convert to a coverage profile, the chromosome will again be called "1"; and when we try to load that into UCSC — unknown chromosome, because at UCSC chromosome 1 is known as "chr1". So just be careful with that. Now we come to the questions of analysis. Usually the question we have, or our supervisor has, is: here is the alignment, find me the enriched regions. And we can ask, of course, what enriched regions are — let's talk about that in a few slides. But it can be one of several problems. First, we have just one data set and we would like to find all enriched regions for this data set — we are agnostic. Second, we know where to look: we have, so to speak, regions of interest, perhaps because of the experimental design, and we would like to compare and look only in those regions. Sometimes this can simplify the analysis. For example, suppose our regions of interest are repeats, LTRs: with normal alignment, as we already discussed, those repetitive regions are multi-mapped and usually tossed away; but if we are only interested in repeats, we can design the analysis so that we can study them — for example, instead of looking at single genomic locations, we look at repeat families, at all the genomic locations belonging to a certain family, as one region of interest, and then the next one. And then, potentially, we could do comparisons between different marks for the same cell, the same mark for different cells, et cetera. Now — and maybe it's just me, but I had a hard time convincing myself that I understood what enriched regions are. Enriched relative to what? The simple, but probably correct, way of looking at it is the following: we look for locations in the genome which are different from other locations — not due to technical issues, but truly, biologically different from other locations in the genome. And that brings us to the following: if we have an IP where 90% of the genome is actually positive — our antibody pulls down 90% of the genome — is it a good study? It would probably be very difficult to interpret, because we would almost be looking for anti-enrichment: we would almost want to search for the locations where the IP did not happen. This is something to keep in mind. So, first: locations which are different from the rest of the genome. However, we also have a control — almost every IP experiment is required to have a control now, and typically it is the DNA input control, where we sequence material which passed all the steps of library construction and sequencing — all the steps are the same, except using the antibody; that is the only step which is skipped. So we have the input, the control, and we would like to see that the regions which are special in our IP are not special in the control. That is how we look at the data, and that is essentially how peak callers — they are still called peak callers — approach the problem. [Audience] What about MeDIP? Well, it depends; often yes. I know that Martin doesn't sequence input, and almost nobody does for MeDIP, but I have seen data where people did sequence input, and the input is used in the same way — let's come back to the usage of input, and we can discuss it. The question is: what is the ideal input, the ideal control? Probably different people have different opinions; there are publications. DNA input as a control seems to be good enough: it is an easy assay, it is cost-effective, and many people use it. For MeDIP — yes, let's talk a little later. So now: why use DNA input? These are exactly the same data sets which you already saw, the profiles we discussed, and you see something like this in all the marks. Of course, if you have six marks and see a picture like this, each of us would be a little suspicious of what is going on, because we have identical profiles in all of them. But imagine I sequenced just one mark — I have one IP, like this one, and I am perfectly happy with those peaks; I have very strong enrichment there. Now, the black track is the DNA input, and the DNA input has exactly the same enrichment as our IPs. This definitely tells us that it is an alignment artifact happening there — and fair enough, there is a repetitive region there, by the way. Here you see a track from ENCODE ChIP-seq data — K27 acetylation, I think, from many, many cell types — and they also see enrichment here. So that is a location in the human genome where we always see reads. We use the DNA input to weed out those alignment artifacts.
those alignment artifacts that's another example so those alignment artifact are not only specific for humans they're also in mouse so alignment artifact coming we don't understand we are trying to research those a little bit but they're there they're probably due to our not knowledge reference genome and the individual specificity of the sample relative so if there is low low copy number repeat so there is more DNA coming from different locations but in reference genome just one location aligns in one place and it piles up there so it's pure artifact of the alignment and why it's dangerous because having those we will have a mistake we will erroneously call places which are enriched and although they are not that many maybe there are several thousand genome white and they are less than one percent of the genome but they are very persistent and very consistent between different experiments so we can have some kind of discoveries by not removing those artifacts so here I'm showing you in the log scale on that y-axis is log of the coverage and on x-axis is chromosome one so you can see that the coverage is pretty uniform the black however there are those reds red towers which are actually alignment artifact so in some locations we have this huge coverage it's because it's very very high coverage just technically we have variability there which is also bad but also we see that say in this sample which is which is column we see artifact here it's human in another human in another mammary gland or terroid we don't see it here so this is sample specific so if we just did those artifact we found them once and would like to use once and forever we will steal only about maybe 80% of those artifacts will be taken in account but maybe even less so it is somewhat individual specific let me skip this now what are challenging so we used our input we removed alignment artifact what are next challenges next challenges as I said is to find enriched regions can we use input as our background question to you and I ask my question myself no we cannot and the reason why we cannot because well first of all input is still slightly is different experiment so background is somewhat different but why what else argument why we can't use input directly compare our IP to input because our depth of sequencing is different our total number of reads we don't actually know what although presumably the true background very very unspecific so still remember that in our IP it still reads which were pulled out with antibody they are likely not specific but they were still function to the pull out in our DNA input we didn't have this step so we also don't know what is a fraction of my background in IP and we have like say 50 million reads sequence in DNA input what is actually fraction how many reads in IP the sequence IP also sequence million reads or sequence million fragments what will be fraction of background and it depends on many many parameters it depends on the type of antibody it depends on the quality on the signal to noise etc so just comparing input to IP we cannot now if I would know which fraction of my reads in IP is actually background I can try to sub sample my DNA input to this number and to use it but I don't know this fraction so that's that's why it's difficult and we also don't know of course for the IP experiment we don't know analytical form of neither background nor signal so to this matter I think the chip seek analysis is the most complicated it's more complicated than for example sneak calling I 
If we look at SNP calling in whole-genome shotgun data: we have reads aligned, a stack of reads, and at some position the reference genome has a C. We just look at what we see: a C, then an A, then a C, then maybe another A, and we count. If we have 90 reads stacked at this location, and say 55 are C and 35 are A, then although it is not 50/50 it is close enough; there is a simple statistical model, and we can rather easily infer that this position is probably heterozygous. If it were only two A's, we would question it. There is clear statistics in this case: it is binomial, like a die. We honestly throw a die with two outcomes, count how many times we see A and how many times C, assign a p-value, and reach a conclusion. The same with RNA-seq: we have a very small part of the genome, less than 1%, say, which is coding, where our reads ought to fall. There is some background, of course, but we just count reads in our exons, and everything we count is used for quantification of expression and comparison between libraries. It is easy. In ChIP-seq we have a lot of background, and we have a lot of biases coming in at every single step of the pipeline from one of the first slides, and this is the problem to decipher: when we see a little bump, is this bump just a fluctuation of our background, or is it already significant enrichment? That is what we are doing when we analyze ChIP-seq.

When we talk about peak callers, many, many tools are available, and there have been, I would say, heroic efforts by many groups in the community to address this question. Many of those really nice tools actually failed, and one reason is that people honestly addressed the problem, created a tool, used it for their own analysis, and most of the time even showed that on their own data it performed better than some available tool, and that was more or less it, because the tool was never made production-ready or convenient for other people to use. There is one tool which is the winner of this game at the moment, which is MACS; I will talk about it later. And the people at this table, all of us, are involved in ChIP-seq tool development, so we will mention that as well. So maybe a few words about peak calling, unless you want to interrupt me with questions. Yes: the different biases that go into creating data like that? Let me think how best to do it; let's go through the process and I will mention the biases as we go. Biases are very important. What do you think is the major bias, or one of the major common biases? Yes, of course: the specificity of the antibody, and the cross-talk between marks. Say this is an H3K27me3 antibody, but apparently it is also sensitive to the H3K36me3 mark, with, say, 20% cross-reactivity. That is the figure Martin showed, right; you saw that H3K4me3, which has the easiest antibodies, I think 90% specific, can still pull down other modifications. There it is only a few percent, but if it were 30%, that would already be dangerous. This is one of the biases, and it is a bad one, because it is not uniform: it is a location-specific bias, an unknown location-specific bias. That is not good. Another bias, which I will talk about next, is GC bias.
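As a brief aside, the "dice" statistics of the SNP-calling example above really is a two-line computation; a sketch assuming SciPy and the 55-C / 35-A pileup from the example:

```python
from scipy.stats import binomtest

# 90 reads pile up at the position: 55 support the reference C, 35 support A.
# Under a heterozygous model each read is a fair coin flip (p = 0.5),
# so we can ask whether the observed split is consistent with that model.
result = binomtest(k=35, n=90, p=0.5)
print(result.pvalue)   # roughly 0.04-0.05: borderline, but nothing like the
                       # astronomically small p-value an 88-vs-2 split would give
```

The point of the contrast is that nothing this clean exists for ChIP-seq enrichment: there is no equivalent of the fair-coin null model.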
GC bias is a bit of a common denominator of many biases, because it comes in during sequencing and during all kinds of amplification steps, which may or may not be sequencing-related. So it is an important bias, and it can be positive or negative. What is another... sorry, you have a question? Yes: "I always just go down and check all the marks and look at whether there is a peak; I imagine that is what you do when you look." Okay, right. So how do our eyes do it; what are we actually doing when we look? We see that we have a peak, then we look at the flanks of the peak: the signal dies down, nothing. I look at the input: the flanks are at the same level, and under the peak also nothing. Isn't that essentially what the algorithm is doing? Well, this is the principle; mathematically, yes, this is the general principle: we look at the enrichment at location A and check that it is different from the flanks, and we look at the input and check that it is different there too. However, you cannot do this directly. You can infer flanks by eye, but we have to determine algorithmically where the flank is and where the peak is, and we have to teach our algorithm to do this. And we have to do it within the IP itself, because the level in the input can be completely different from the level in our IP. So it is more like: you calculate within your own sample, okay, there is, say, a 40% increase relative to the other regions, and then you go into the input and there you only have, whatever, a 2% increase, something like that. It is very logical. The exact algorithm may do it slightly differently: an algorithm tries to optimally discriminate, within your IP, the genomic locations which are likely enriched from the genomic locations which likely look like input, requires that this discrimination is maximal, or optimal, and then looks at what was called the background, compares it to the input, and checks that it resembles the input. That would be the ideal scenario: resembling the input in terms of the distribution, though not exactly, because we can have 5 million background reads in our IP and 15 million reads in the input. This is count statistics: these are Poisson-like processes, and a Poisson distribution is not like a normal distribution. A normal distribution is symmetrical; we can shift it around and it keeps its shape. A Poisson at low coverage has a long tail: a Poisson with a low mean, say a mean of 2, looks very asymmetrical, while with a mean of 50 the Poisson looks almost normal. So we cannot compare directly; we need to do a lot of scaling. If we want to compare against input directly, that is what all these peak callers do: they try something along these lines, but without using the input directly to determine the properties of the background. I am being very precise in what I say now. Now, about doing it visually: there are even tools which analyze these things visually, and you can do it. The question is, when you have 100,000 enhancers, it will be a little hard to do it for each and every one of them. So you have to train your system somehow: you train it on some locations, maybe, and then you analyze the rest by computation. And on top of this, of course, we are all a little subjective when we look. Also, the UCSC browser makes us subjective, because there is pixelation: we look at the picture and, whoa, that is a perfect peak.
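Back to the Poisson point for a second: the asymmetry claim is easy to verify numerically, since the skewness of a Poisson is 1/sqrt(lambda). A quick check, plain SciPy, nothing ChIP-seq-specific:

```python
from scipy.stats import poisson

for lam in (2, 50):
    dist = poisson(lam)
    skewness = float(dist.stats(moments="s"))  # for Poisson this is 1/sqrt(lam)
    print(f"mean={lam:>2}  skewness={skewness:.2f}  P(X=0)={float(dist.pmf(0)):.3f}")
# mean= 2  skewness=0.71  P(X=0)=0.135
# mean=50  skewness=0.14  P(X=0)=0.000
```

This is why naive rescaling between a 5-million-read background and a 15-million-read input changes the shape of the distribution, not just its position.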
Then I zoom in and zoom in, and I see some kind of spiky, jagged structure there, because of the pixel resolution. So it is a tricky thing. Now, one more bias, since we are talking about biases: digestion and sonication also introduce bias. We know the genome is not one naked string; it is wrapped into a very complex 3D structure. Some of it is heterochromatic and therefore very hard to sonicate, for example, so there are locations which are hard to sonicate. There are also locations which are fragile: at a fragile location, breaks occur more frequently, and then even in our input we will grow a peak, which will be just a spurious peak. That is a bias. Unfortunately, we sometimes have such biases in, for example, nucleosome-free regions. We know that at the transcription start site, exactly at the TSS, where we often have a mark like H3K4me3 or a transcription factor, some protein binding at the promoter, there is a nucleosome-free region and frequent breaks. So if you profile input around TSSs, and I think I have this slide, you see a peak there, and it depends on the technology. (Five minutes? Yes.)

So, we have already touched on many things I was going to talk about; here are just some complications we have to work with. We still use the phrase "peak caller", but we are not actually analyzing only peaks: we are analyzing peaks, which are punctate or localized marks, and we also have dispersed marks. I don't think this is in your printouts; some of the slides are not, but they will be in the PDF online, and everything updated is online. Here are three images, all from the same one megabase of the human genome. You can see one mark is perfect, like peaks, while another has enriched regions of almost half a megabase; when we call that enrichment, I don't think we can call those regions peaks. That one is H3K27me3. And if we look at H3K4me1, which is an enhancer mark, it is very complicated, because it is a mixture: sometimes there is a very strong cluster of enhancer elements (maybe you have heard the word "super-enhancer", a stretch of DNA covered by binding sites), and then we also have really localized peaks. So we have both. Why is this a problem? In every signal-processing task, people like to have a single scale: at what scale do I want to look at my signal? Here we have multiple scales, and that is what creates the problem. Maybe there is a single scale; this is homework for you to think about. I am thinking about it, we are all thinking about it: what is actually the scale of a histone modification? When an enzyme which is a writer of a histone modification acts, how many nucleosomes does it affect? Just the opposing nucleosome? It cannot affect half a megabase, that is for sure. So it probably affects several, and the mark spreads, but it happens consistently across millions and millions of cells. So what tool to use is a question you have to answer for your own problem.

Now, I also already mentioned this: we can have single-sample analysis, or we can have multi-sample comparison. I would like to show you an interesting exercise we did with mouse data, where we just looked at the signal at promoters (and I probably have 2 minutes now, not 5). We took the promoters of all genes, about 20,000 genes, and calculated the raw signal there. By raw signal I mean the integral of the reads, the read count, in the promoter. Around each TSS I took a window of about 3 kilobases.
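The raw-signal computation is nothing more than counting fragments in fixed windows; here is a minimal sketch with sorted fragment midpoints on one chromosome, where all coordinates are invented for illustration:

```python
import numpy as np

def promoter_signal(frag_mids, tss_positions, half_window=1500):
    """Raw signal per promoter: fragment count within TSS +/- half_window.

    frag_mids: 1D array of fragment midpoints on one chromosome.
    tss_positions: 1D array of TSS coordinates on the same chromosome.
    """
    frag_mids = np.sort(frag_mids)
    lo = np.searchsorted(frag_mids, tss_positions - half_window, side="left")
    hi = np.searchsorted(frag_mids, tss_positions + half_window, side="right")
    return hi - lo

rng = np.random.default_rng(0)
mids = rng.integers(0, 1_000_000, 50_000)     # toy fragment midpoints
tss = np.array([100_000, 500_000, 900_000])   # hypothetical TSSs
random_loci = rng.integers(1500, 998_500, 3)  # matched random 3 kb windows
print(promoter_signal(mids, tss))             # promoter counts
print(promoter_signal(mids, random_loci))     # genome-wide background counts
```

The random windows give exactly the genome-wide "blue curve" baseline used in the comparison that follows.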
I compared that with 3-kilobase random genomic locations, and let's see what we get. When we look at input, the blue curve is genome-wide and the red curve is promoters: we get exactly the same distributions. And that is nice, right? Input at promoters looks exactly the same as input anywhere else, so we don't see much bias at 3 kb resolution. When I look at H3K4me3, that is interesting: genome-wide, which is blue, there is almost zero signal, winding down, while at promoters I have a strong signal. Almost by eye, looking at this data, I can ask: do I need a peak caller here? If I have just one data set, probably not. I can plot this and say that everything above, say, 10 is my signal: all promoters with signal above 10 are marked, and everything below is not. Very transparent and easy. It is not always that easy, though. If I look at the H3K9me3 mark at promoters, I don't expect a strong signal; this is not a mark which marks genes. And that is what I see: my promoter signal, red, is anti-enriched relative to genome-wide. It is a heterochromatic, repressive mark, so I expect it around repeats, and indeed the genome-wide signal, blue, is stronger than at promoters; promoters are anti-enriched. Now an interesting mark: H3K27me3, the Polycomb mark. Again it is anti-enriched: the genome-wide blue curve has the stronger peak and the red promoter peak is almost at zero, but with a long tail. So there are some promoters, some genes, with a strong H3K27me3 signal, and those are repressed or bivalent genes. And here is a heat map showing the correlation of H3K4me3 versus H3K27me3, again just calculating signal in the promoter, no sophisticated peak calling, nothing. Mostly they are mutually exclusive: a promoter has one mark or the other, very strong K27me3 with very weak K4me3 and vice versa, and then we have those dots, about 1,000 bivalent promoters, which, as explained... you have all this in your handouts, let me skip.

Here is another, orthogonal way to look at the data: instead of calculating the signal in every promoter individually, we look at the profile around the TSS. These are profiles, and this is a typical picture; seven years ago one could publish these, now it is almost just a QC plot. You see different profiles, and what is interesting: the black is input, and at the TSS, remember what I said, we have a peak in the input, just because of the missing nucleosome there and the easy sonication.

So I am more or less done. I am also showing you the clustering we can do with just the signal itself, H3K27me3, this is NIH Roadmap data, and what is interesting is that with this mark we can cluster cells very close to the cell of origin, or the germ layer of origin, which makes us think that maybe it is true, what Martin said in his introduction, that the epigenome is a determinant of cell identity. So that is good. I am almost done; just a mention of the tool we are developing, which is FindER, and maybe, I don't know, can I borrow 5 minutes from the lab, from the tutorial, instead? Or maybe everybody wants a break now. It will be only a couple of slides before we start the next part. So again, you have to listen to me; I continue talking with the same slide. To come back to what I was saying: a very simple exercise, quite a few, I think more than 60, different primary cells from NIH Roadmap, and we looked at the profile we have already seen multiple times.
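Since these TSS profiles keep coming up, here is what the computation amounts to; a toy sketch assuming a per-base coverage array for a single chromosome and hypothetical TSS coordinates:

```python
import numpy as np

def tss_metaprofile(coverage, tss_positions, flank=2000):
    """Average coverage profile around TSSs (one chromosome, toy version).

    coverage: 1D per-base coverage array.
    Returns a (2*flank+1)-long vector: mean coverage at each offset from TSS.
    """
    windows = [coverage[t - flank : t + flank + 1]
               for t in tss_positions
               if t - flank >= 0 and t + flank + 1 <= len(coverage)]
    return np.mean(windows, axis=0)

cov = np.random.poisson(5, 1_000_000).astype(float)
tss = np.array([100_000, 500_000, 900_000])
profile = tss_metaprofile(cov, tss)
print(profile.shape, profile[2000])  # (4001,) and the value at the TSS itself
```

Run on real input data, this is exactly the plot where the nucleosome-free-region peak at the TSS shows up.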
We took the TSSs of coding genes, very crudely actually: we just took the major transcript of each gene, calculated the signals there, used them as a vector, and performed unsupervised clustering, hierarchical clustering; you are probably all familiar with it, it is a way of looking at the relationships between different data sets. And it clusters really nicely: you can see a cluster of H1 embryonic stem cells plus iPS and H1-derived cells, a tight cluster, well separated from everything else; there is a blood cluster; there is a brain cluster; there are some liver and kidney tissues, and so on. So I think it is encouraging.

Then, a few words about our peak caller. I guess it is bad, and maybe also good, that I talk about something I understand, having thought about it for a couple of years: FindER. We had a tool published some years ago called FindPeaks, developed very early on, out of necessity; we had to analyze ChIP-seq data, and that tool was based on the principle of whole-genome shotgun sequencing, using the statistics of whole-genome shotgun sequencing. However, those statistics broke on modern ChIP-seq data, and we started developing something else with a different statistical model. Let me start from the end, maybe: it is written in Java, it is pretty fast, and it simultaneously analyzes several marks, so if you have six marks you can just run all of them together. What does it do? We talked a lot about mappability, but in real life, how do you detect mappability? That is a big question, because we cannot really, for every experiment and every read length, run a synthetic data set and determine mappability. FindER infers mappability from the data itself, using a very simple trick. Say we have paired-end data aligned to the genome, six different marks plus DNA input, so seven data sets. We bin the genome into, let's say, 75 base pair bins, and assign each fragment's midpoint to a bin. Then we look at the bins which have zero count in all seven assays: those we call unmappable. Even if they are technically mappable, they are not useful to us at all; there is no data there in any of the seven data sets we are analyzing. That is what we call not mappable. Now, input is pretty uniform. Input, of course, is not whole-genome shotgun; it is not as deep, not 30x, more like 4-5x, but sometimes it is pretty deep. And input, whether sonicated or, in particular, digested, is an interesting creature, because if it is deep enough and we start calling peaks on the input, we see single nucleosomes. This is actually one of the things which broke our old FindPeaks tool: the quality of the data became too good. Yes, that is a problem. Instead of input being reads which are randomly and nicely distributed, we actually see something like this: we see nucleosome positioning perfectly well, and if we have MNase-digested data and do, for example, a Fourier transform, analyzing the periodicities in the data, we recover the single-nucleosome periodicity perfectly. So let's just be aware of that. So FindER first infers unmappable regions, but it can also accept them as input: if somebody comes to me and says, this is my file in BED format, chromosome, start, end, these are my unmappable genomic locations, it takes the file and reads it. Then it deals with alignment artifacts: it calculates them the way I showed you in the pictures.
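The zero-in-all-assays mappability trick mentioned above is simple enough to show directly. This is my paraphrase of the idea, not FindER's actual code, assuming a bins-by-assays matrix of fragment-midpoint counts:

```python
import numpy as np

def unmappable_bins(count_matrix):
    """FindER-style mappability inference (paraphrased, not the actual code).

    count_matrix: bins x assays matrix of fragment-midpoint counts
    (six marks + input = 7 columns). A bin with zero counts in ALL
    assays carries no usable information and is treated as unmappable.
    """
    return (count_matrix == 0).all(axis=1)

rng = np.random.default_rng(1)
counts = rng.poisson(2, size=(1000, 7))  # toy: 1000 bins, 7 assays
counts[42, :] = 0                        # force one genuinely empty bin
mask = unmappable_bins(counts)
print(mask.sum(), np.where(mask)[0])     # typically just bin 42
```

The appeal of the trick is that it automatically matches the read length, depth, and aligner settings of the experiment at hand, with no synthetic-read simulation needed.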
Now, the important thing, probably, is that it takes GC bias into account, using the DNA input; maybe using the whiteboard I will explain some things about GC bias. If I look at 75 base pair bins genome-wide, I can calculate the GC content in every bin: just count the nucleotides, how many A's, C's, G's and T's, sum C plus G, divide by 75, and I have my GC content. If I plot the distribution, with 100% here and 0 here, I get a peak around 40%; yes, this is for human, the maximum is at about 40% GC, and the distribution essentially runs from about 20% to 70% and nothing beyond. That is the genomic distribution. Now, if we take our sequenced reads and look at their GC distribution, it is not going to look like that: we have a bias. For an IP, this may be expected; Martin already alluded to this a little. If we have the H3K4me3 mark, it occurs predominantly at gene promoters, and we know those locations are GC-rich, so our reads will be skewed, and we get a peak skewed to the right. But for input we don't expect this: we want to see something which resembles the genome, that genomic curve. However, input sometimes comes out shifted to the left, and sometimes the other way; we see different types of biases. What we do is correct for this: we transform the input, or we use the input to transform our data, back toward the genome. We believe that GC bias is a kind of surrogate for many different biases at different steps: a bias at some step gets converted into a bias in the DNA, and at the end of the day all we have is sequence, that's it, and this sequence is skewed, instead of being what we would expect from random sampling of the genome, which is uniform. In input we would also like to expect uniformity. There is also the nucleosome bias I just mentioned, and of course with nucleosomes we know that MNase preferentially cuts at AT-rich bases, but nevertheless we would like something relatively uniform as our expectation. So if we normalize the read frequency in every bin to the DNA input, we mitigate, at least partially, this GC bias. That is what we do.

Here is an example; again, I am like a salesperson here, but I am showing you how it works. These are ChIP-seq tracks with the results of FindER. It actually gives somewhat more fine-grained detail, a little more resolution, than MACS. Also, compared to MACS, we treat every mark with just one approach; there is no such thing as punctate or extended. This again comes from the idea, or conjecture, that the biological process behind setting a histone modification is similar across marks: the writing of the modification is similar, it is just the recruitment to different locations that differs. So this here is an extended region, for example. The tool is available from this website, and you have it in your slides if you want to have a look.

So now, MACS. MACS2 is the most successful tool in the business, and one reason for its success: well, it is one of the earliest tools, it is highly cited in the literature, and it also works; it is rather easy to install, it will run, and it is supported. One has to be fair: if the data is of good quality and the problem is rather simple, it just works well. The problem comes when the data is more complicated. MACS explicitly assumes a Poisson model for the background.
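Before moving on to MACS, one simple way to act on the GC story above is to stratify bins by GC content and reweight so that the input coverage becomes flat across strata; a rough sketch of that idea (again my illustration, not FindER's implementation, and the number of strata is arbitrary):

```python
import numpy as np

def gc_correction_factors(gc_per_bin, input_counts, n_strata=20):
    """Per-GC-stratum weights that flatten input coverage across GC content.

    Bin the genome by GC fraction, measure mean input coverage per stratum,
    and derive a weight that flattens it; the same weights can then be
    applied to the IP counts.
    """
    strata = np.minimum((gc_per_bin * n_strata).astype(int), n_strata - 1)
    overall = input_counts.mean()
    weights = np.ones(n_strata)
    for s in range(n_strata):
        in_s = strata == s
        if in_s.any() and input_counts[in_s].mean() > 0:
            weights[s] = overall / input_counts[in_s].mean()
    return weights[strata]

rng = np.random.default_rng(2)
gc = np.clip(rng.normal(0.40, 0.08, 100_000), 0, 1)  # human-like GC per bin
inp = rng.poisson(5 * (0.5 + gc))                    # toy GC-dependent input
w = gc_correction_factors(gc, inp)
print(inp.mean(), (inp * w).mean())                  # raw vs flattened means
```

The weights are estimated from the input precisely because, as argued above, the input is where a GC effect cannot be biology.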
Here we are simply lucky that it works, because the background is not Poisson at all for ChIP-seq. It might be Poisson in an ideal world, but with the many, many biases I mentioned, which are not stochastic (they are systematic shifts, something which repeats itself and shifts the variables, not statistical fluctuations), and because there are so many of them acting in different directions, at the end of the day the data looks as if some stochastic modification had happened to it. And we have shown multiple times that if you take DNA input, assume it is Poisson, and fit a Poisson, the chi-square of that fit is just very, very bad; out of 100 data sets there may be one which is fit well by a Poisson. So what did people do who tried to beat MACS? The approach was the following: MACS doesn't work, so let's use a negative binomial for the background. The negative binomial is another distribution, almost like the Poisson, but with two parameters. The Poisson has just one parameter, the mean of the distribution, one value; so to describe data with a Poisson I need one value, while the negative binomial has two, and of course with two parameters there is more freedom to fit the data. Then the negative binomial failed, and people introduced more parameters. That is maybe why more than 50 tools have been published, but they don't last, because they were tuned; the heuristics (and all of this is heuristic, including MACS) were not sustainable for the next data set.

So MACS assumes Poisson. How does MACS derive the single Poisson parameter, lambda? In several modes: first MACS bins the data, and then it evaluates lambda using the background, using the input (which is subsampled), and using 1 kb, 5 kb and 10 kb around the location, taking the maximum. That is a conservative approach. And what tells us that the statistics are not exactly right? Let me step back. We run the peak caller and get our peaks; let's say for H3K4me3 we have 50,000 peaks. That is a bit too many; H3K4me3 typically marks gene promoters. We have a p-value assigned, with a threshold of 1%, which is a perfectly good value (if you talk to statisticians, this would be the recommendation) and is already multiple-testing corrected. Then we say: maybe I don't like it, let me reduce it to 10^-3. That is how it goes with MACS. What does that suggest? If I have to reduce my p-value that strongly in order to get a reasonable amount of enrichment, it means my statistical model was not that precise. But it is still a useful quantity, because we rank our locations by this p-value and choose ourselves what threshold to use, and that is what everybody does, whether we admit it or not.

Okay, I will skip this. Martin already said it is very important to take care of the data, because from experience it is very, very hard to save bad data; we try these computational tricks, but we never really manage. And that is a segue to our tutorial. As we say: if they ask me, or ask you, anything you don't know, just say it is due to epigenetics; people will be happy. So, any questions before we start the much more tedious and much more, uh, hands-on part? Yes... wait a second, wait a second. Note that I did not say you can't compare two IP experiments; we do this.
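For reference, the local-lambda recipe just described can be sketched as follows; this is a simplification of what MACS does, not its actual code, and the bin size and window handling are my assumptions:

```python
from scipy.stats import poisson

def macs_style_pvalue(obs, ctrl_cov, pos, genome_lambda, binsize=10):
    """Poisson p-value with a MACS-like local lambda (a sketch, not MACS code).

    obs:           read count observed in the candidate window
    ctrl_cov:      per-bin control (input) counts, binsize bp per bin
    pos:           bin index of the candidate window centre
    genome_lambda: genome-wide expected count per window
    """
    def local(width_bp):
        half = width_bp // (2 * binsize)
        window = ctrl_cov[max(0, pos - half): pos + half + 1]
        return sum(window) / len(window)

    # lambda_local = max(lambda_BG, lambda_1k, lambda_5k, lambda_10k)
    lam = max(genome_lambda, local(1_000), local(5_000), local(10_000))
    return poisson.sf(obs - 1, lam)   # P(X >= obs) under Poisson(lam)

ctrl = [5] * 2000  # toy flat control coverage
print(macs_style_pvalue(obs=40, ctrl_cov=ctrl, pos=1000, genome_lambda=5.0))
# vanishingly small: 40 reads against a local lambda of 5
```

Taking the maximum over the nested windows is the conservative move mentioned above: a locally elevated control raises lambda and suppresses the call.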
That is the goal of consortia, the goal of many, many publications. What you cannot do is compare, for example, the IP directly to the input: you cannot say, let me look at the promoter of the gene SP1, count the reads in my IP, count the reads in the input, compare the two, and declare enrichment or not. That I cannot do. Now, how to compare two IP data sets? There are multiple ways. You can take your IP set 1, run it through MACS, and get a list of individual regions, a set of peaks; then get another set of peaks for set B; and then do an overlap, for example with BEDTools: I have start and end for one, start and end for the other, and I see how many are concordant and how many are unique. But then you have to think carefully about what you are doing. For the unique ones, you may want to look at the signal there, and you may find that the unique peaks have the weakest signal, which means they sit at the edge of detection and were just barely not captured. An easy visualization: an H3K4me3 peak in one data set looks like this; in another data set I call a peak like that; in a third, like that. If I overlap them, yes, they are concordant; if I count bases, almost 30% or 40% are different; but if I actually look at the signal, the signal in the discordant parts is very weak. It is simply that somewhere my tool had to threshold the data: the signal is continuous, I am binarizing it, and that introduces exactly these problems. And yes, it is always question-specific, enhancer-specific. So you rank the peaks, using for example the p-value from MACS or FindER or any other tool. You will have a score; that is the ideal scenario, some kind of score, and then for every peak you can look at the concordance between those scores. If you see that the majority of peaks are concordant, you can of course question those peaks which have, say, rank 10 in one data set and rank 30,000 in the other: they are very, very discordant. You can even do a kind of differential peak calling using this type of property (if my rank is very different; people do this), or you can set your threshold using concordance if you have replicates: with two replicates, that is a way to set a threshold. Because what is the true threshold? MACS gives you some p-value, but it is almost an arbitrary score: 10^-3, or 10^-4, or 10^-5; some people use thresholds which are unrealistic. So it is up to your problem. If all your peaks are discordant, something is wrong with your data set; if the majority are concordant and some are different, those are your differential peaks. Maybe let's start the tutorial, and we can cover some of this there.
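The overlap-and-rank-concordance workflow can be mocked up in a few lines; a toy sketch (in practice BEDTools would do the overlap step), with made-up peak coordinates and scores:

```python
import numpy as np

def overlaps(peaks_a, peaks_b):
    """Pair up peaks from two replicates by interval overlap (toy bedtools).

    peaks: list of (start, end, score) tuples on one chromosome,
    sorted by start. Returns index pairs (i, j) of overlapping peaks.
    """
    pairs, j = [], 0
    for i, (s_a, e_a, _) in enumerate(peaks_a):
        while j < len(peaks_b) and peaks_b[j][1] <= s_a:
            j += 1
        k = j
        while k < len(peaks_b) and peaks_b[k][0] < e_a:
            pairs.append((i, k))
            k += 1
    return pairs

rep1 = [(100, 200, 50.0), (400, 550, 8.0), (900, 1000, 30.0)]
rep2 = [(120, 210, 45.0), (410, 500, 2.0)]
matched = overlaps(rep1, rep2)
print(matched)  # [(0, 0), (1, 1)]; rep1's third peak has no match in rep2

# Rank concordance of the matched peaks: concordant scores support a peak,
# wildly different ranks flag candidates for "differential" behaviour.
s1 = np.array([rep1[i][2] for i, _ in matched])
s2 = np.array([rep2[j][2] for _, j in matched])
r1 = s1.argsort().argsort()
r2 = s2.argsort().argsort()
print(list(zip(r1, r2)))  # (1, 1) and (0, 0): both matched pairs concordant
```

Unmatched peaks would then be inspected for signal strength, exactly as described above: if they cluster at the weakest scores, the discordance is a thresholding artifact rather than biology.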