Good afternoon again, everybody. We have quite a bit of material to cover, so I'll jump right into Module 2: ChIP-seq Alignment and Identifying Enrichments. We will talk about how to identify enrichments and a little about visualization, and it will be strongly connected to the tutorial session that follows. Certain slides in your package I will simply skip, because we will either cover them in the tutorial or they are reference material; there are too many to cover in this hour and a half, but I decided to keep them for you. I hope you can all hear and see me.

The learning objective of this module is to understand how alignment works for the short sequence reads we obtained in our ChIP-seq experiment. Many parts of this module apply equally to whole-genome sequencing or any other next-generation sequencing experiment, but there are some ChIP-seq-specific features which we will cover. We will talk about alignment files and their standards, then discuss computational tools for analyzing enrichment, and what enrichment actually is. That is the key question: what is enrichment?

I am not going to read the outline for you; let's jump to the next slide, which is a graphical summary. I borrowed it from the URL shown there and annotated it a little; the upper part was already discussed by Martin: how we arrived at the stage where we have these short reads. What is the next logical step? To find what part of the genome these reads come from, and also whether we have recurrence. We sequence many, many reads, but if at location A we have just one read, it is probably not interesting. When we see locations with a stack of many, many reads, that is interesting: it means that in many cells we sequenced the same molecule, coming from different cells, so something is specific about this particular region of the genome, and we would like to know what.

So the first step is alignment. This slide is a graphical presentation of the alignment process, which contains several components. What do we need in order to know where our reads are? First of all, genomic sequence: the reference genome. We will talk about the reference genome and look into it during the tutorial, but it is basically a string of A, C, G and T. Then we have the FASTQ file; you are already familiar with this format, we just worked a lot with it. Then we have a black box, the short-read aligner, and the only thing at our disposal there is parameters: what parameters can we control when we do alignment? We will talk a little about that. Then we have the alignment file, and this is very important, because we have a standard for alignment files: whoever writes a new aligner, or works with alignment files for next-generation sequencing, uses this format, which is rather convenient. The format is evolving a little, but it has enough flexibility to absorb new features while remaining a standard. And then, of course, there is analysis. Let me first mention something which we are not going to discuss in this course.
It's not very relevant for ChIP-seq or epigenetics analysis, although it is possible; it is written in the header of this slide: what is the alternative to alignment? We can do assembly. Instead of using a reference genome, we take the reads in our FASTQ file and try to stitch them together without any reference: de novo assembly. If we pulled out a good chunk of DNA in ChIP-seq, it can also make sense; we could assemble a small contig covering, for example, an extended H3K27me3 region. Technically speaking we can do it, but it is not common practice. However, I would like to mention that this is a very active area of research; there are many assemblers, and people do this as well.

Now, the reference genome. It is a FASTA file, and the main feature of a FASTA file is that we can have a name for every contig: it is organized as a collection of contigs, each with a name followed by its sequence. The name is important because the whole coordinate system references it, and for us the names are chromosome names: chromosome 1, chromosome 2, chromosome X. In modern references, like the reference genome NCBI provides, there are also lots of unplaced chunks, and they are a real pain in downstream analysis, because many tools don't recognize those names, and many tools break on them. Yet they are important: they contain genes and real sequence; it is only that our current assembly knowledge does not allow us to place them properly on chromosomes. We know there are only 24 distinct chromosomes, but we don't know where to put some pieces.

When you look at the genome you mostly see A, C, G, T letters, but there are sometimes Ns, and we should not take those Ns literally: if we see a stretch of, say, 200 Ns, the gap size was estimated only approximately, from cytogenetics; maybe the real gap is about 2 megabases, and then they put 2 megabases of Ns, but it is very approximate. Those Ns are nevertheless very important, because with more or fewer Ns our whole coordinate system would shift. That is why, when we use a genome, we have to pay attention to which build we are using: if somebody gives us an alignment to hg18, the previous build of the human genome, and we try to visualize those reads on hg19, we get completely different coordinates. We have to be aware of this. And here is something I learned the hard way: even in hg19, the build from 2009 which is still by far the most common in the literature, among the three billion positions there are three locations with letters that are not A, C, G, T or N. I discovered this because I wrote a tool that assumed only those four letters plus N, and the tool broke with an error. These are IUPAC ambiguity characters: codes which stand for, for example, A-or-C, or C-or-G; redundant characters.
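As a minimal sketch of the kind of sanity check that caught this (my illustration, not the speaker's actual tool; the file path is hypothetical), one can scan a FASTA for characters outside A, C, G, T, N:

```python
# Scan a FASTA reference for characters outside A, C, G, T, N,
# e.g. IUPAC ambiguity codes such as R (A/G) or M (A/C).
from collections import Counter

def unusual_bases(fasta_path):
    counts = Counter()
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):      # contig name line, e.g. ">chr1"
                continue
            for base in line.strip().upper():
                if base not in "ACGTN":
                    counts[base] += 1
    return counts

# On hg19 this should report a handful of IUPAC codes, per the anecdote above.
print(unusual_bases("hg19.fa"))
```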
This slide is just an example of the sizes of the genomes we work with. Mammalian genomes are several gigabases; the human genome is a little longer than the mouse genome. Then there is a whole range: plants can have very large genomes (the Norway spruce is about 20 gigabases), and we go all the way down to a few kilobases for viruses. I gave you a link where you can find these genome sizes. Any questions so far?

So the genome is quite important; we should really pay attention to which genome we are using. Sometimes an experiment was done on human and people try to align to mouse; for sure the alignment rate will be low, but it will not be zero, because there is good synteny between the two genomes. That happens.

Now we come to the second block (remember, the first block on top was the reference genome): the aligner. There are two major classes of short-read aligners. Short-read alignment is its own special industry, different from Sanger-sequence aligners; it is more or less string matching, because reads are relatively short and we can look for an exact match in the genome. It is like word finding: we have a dictionary, our reference, and we find where in this dictionary our sequence occurs. That suggests how aligners are built, and there are two major classes. One is built on hash tables. What does that mean? We take the reference, and knowing that our reads will be about, say, 50 bases, we shred the reference into, say, 30-mers: we take every possible 30-mer in the human genome and record the locations where it occurs. Some 30-mers occur in a couple of locations, some in many, some are unique. Then we have a table: sequence and location, sequence and location. How will alignment work? Given the FASTQ file which we discussed with Martin, it is a very simple process: we look the sequence up in this huge table, where the key is the sequence and the value is the location, the position. Conceptually very simple, though computationally rather involved, because the table is huge: with four possible letters at every position, there are 16 dinucleotides, 4^3 trinucleotides, and for 30-mers 4^30 possibilities, which is a pretty big dictionary to traverse.

So that is the class based on hash tables. Then, relatively recently, around 2008, people adopted a transformation that compactifies the search space: the search space becomes much smaller, yet the transformation is completely reversible, so although we compress our sequence space, we can always come back to the real sequences. This is the Burrows-Wheeler transform, and there are aligners built on this paradigm, using suffix or prefix tries. These aligners are faster and more memory-efficient, but a little less sensitive. I put a reference here; it is rather involved, but it is a review written by authors of aligners, so they know what they are talking about, and it is a good, detailed review of all the guts of next-generation short-read alignment. For now, the concept to keep is: hash short sequences and look them up. That is how aligners are built.
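To make the hash-table idea concrete, here is a toy sketch (concept only; real aligners use far more compact, engineered structures, and k=8 is chosen just to keep the example small):

```python
# Index every k-mer of the reference with its positions, then "align" a read
# by looking up its first k bases (the seed) in the table.
def build_index(reference, k=8):
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index

def candidate_positions(read, index, k=8):
    return index.get(read[:k], [])   # positions where the seed matches exactly

ref = "ACGTACGTTTGCAACGTTAGCCATGACGTACGT"   # stand-in for a genome string
idx = build_index(ref)
print(candidate_positions("TTGCAACGTTAG", idx))   # [8]
```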
Of course, there are caveats. We make mistakes when we sequence short reads: sometimes a base is wrong, and it would be sad, in a 75-base read with one wrong letter, to just say "let's drop it". That is why short-read aligners typically allow mismatches: we look at, say, the 30 bases at the beginning of the read, but we allow one or two mismatches; beyond that, it is for you to decide how many mismatches you allow.

Here is a comparison of aligners on execution time, because if you have quite a few datasets to analyze, you will be considering how long it takes to align your FASTQ file. This example is from the paper cited; we don't know exactly how many reads were aligned, but that is not the point: alignment time is obviously linear in the size of the FASTQ. So, a question: if you have a very, very large dataset, how would you align it faster? You go for the fastest aligner, and it is still too slow; what do you do? You don't know about chromosomes yet, you just have your reads... yes, you have a very good point: you cut your file into small files and align them in parallel. You can certainly do that. In this comparison, for one particular FASTQ, you can see that the red aligners, those based on the Burrows-Wheeler transformation, are about an order of magnitude faster than the aligners based on hash tables, which is very important for us. In the old days we used the MAQ aligner, and we had to wait several days even for very small datasets; at that time a typical ChIP-seq experiment was about 3 million reads, 10 or 100 times less than current ChIP-seq experiments, but we still had to wait a long time.

[Question:] No, they are all written by different people and based on slightly different ideas. Globally, coming back to the two classes, they are based either on hash tables or on suffix/prefix structures. But all these aligners are heuristics: they do not give you the optimal solution, they always give an approximate solution for your particular set of reads. The heuristics are individual to each aligner, and then of course the implementations differ: different computer languages, and so on.

So that was speed, and the speed is pretty good for BWA; that is the aligner we will be using later, the aligner we use in production at the Genome Sciences Centre in Vancouver, and the one used in this study. Now, of course, speed is not everything; what else are we interested in? How accurate and how sensitive our aligner is. One thing is how many reads from our FASTQ are placed somewhere on the genome; that is the first question. The second question: of those that are placed, how many are placed correctly? For aligners this is, in principle, a clear-cut, definitive question, and we know how to answer it; it is laborious but pretty easy. How would you do it, if somebody gave you an aligner?
Exactly: you take a genome, you shred it into reads (you can choose a million arbitrary locations, or meticulously shred every single position), feed this FASTQ into your aligners, and see how many reads are aligned correctly and how many are not. That is what was done, and here you have a table. Look at MAQ, for example, which was one of the relatively slow ones (green marks the slow aligners, red the fast ones): it has a very high percentage of reads aligned, but only about half of them properly aligned. Of course we would not be happy with this type of aligner, because we would make many "discoveries" from improperly aligned reads; especially if there is a systematic bias in the aligner that erroneously places reads in the same spot, we will see concordance between different experiments, and it will all be spurious. We don't want that. So again our choice would be BWA, which has a relatively high alignment rate and relatively high specificity (quite a few reads are properly aligned), and it is fast.

[Question:] Yes, it is a very similar principle; it is just your choice, and there is nothing wrong with that. [Question:] If I understand your question correctly, you are asking whether, say, 80% aligned is good or not. There is no universal answer. Maybe Martin mentioned (I don't remember him mentioning it) that in the first ChIP-seq experiments the alignment rate was ridiculously low, 10-15%; those were pioneering experiments, and we had to accept that the sequencing platform was not good: we were losing a lot of reads, but there were still useful reads. Nowadays, if you are somewhere below 80%, to take your number, you should worry a little and look at the reads which did not align. Whether it is good or not you will see only after analyzing your data. First of all there is an absolute number: you know the length of your genome; if you work with C. elegans you need fewer reads, with human you need more, and you can roughly calculate your coverage. How would we calculate mean coverage? Say we have a FASTQ file with 20 million single-end reads, the read length is 75, and we know the length of the genome. The mean coverage is the read length times the number of reads, divided by the genome length. Of course that is not what happens in ChIP-seq, because our reads tend to cluster in certain locations; for whole-genome shotgun it is the right estimate, because coverage is relatively uniform, and for DNA input it is also roughly right, for the same reason. Then you can judge. But it is a very good question what those unaligned reads are, and that is where QC of your experiment should be done: if you are really concerned, check whether they come from contamination by another species (maybe they align to some bacterial genome), or maybe, if you work with mouse, your strain differs from C57BL/6 (Black 6), the reference mouse genome, and you have too many SNPs, preventing quite a few reads from being aligned.
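Plugging the speaker's numbers into that formula (20 million single-end 75-base reads, human genome taken as roughly 3.1 Gb):

$$\text{mean coverage}=\frac{L_{\text{read}}\times N_{\text{reads}}}{L_{\text{genome}}}=\frac{75\times 2\times 10^{7}}{3.1\times 10^{9}}\approx 0.48\times$$

If only 80% of the reads align, the effective coverage scales down accordingly, to about 0.39x.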
So maybe you change your alignment parameters a little to improve your alignability; but in a sense, at the end of the day, what you got is what you got: the quality of your experiment. [Comment:] Just to add, as I mentioned: for a recent histone-modification IP you should be over 90%; if you are below that, you have some problems. [Question:] For example, when we get data from, say, Génome Québec, it arrives already aligned. Suppose our lab uses a different version of the genome; there were definitely some reads which were missed, not 100% aligned. Is there a way to extract the reads from the BAM file which were not aligned? Yes, unaligned reads can be kept in the file, flagged as unmapped, and that is a good question.

Now, something to mention about what happens after alignment: the authors of different aligners made different design decisions. Some aligners will not write unaligned reads into the alignment file; they dump them somewhere else or just forget about them. Other aligners keep all reads, and that is a big difference for us, because we often want to come back to our original FASTQ files, and if our aligner does not keep all reads, we have to double the space needed to store our data. A normal BAM file, an alignment file (I am a little ahead of myself, I have not introduced SAM files yet), should be recoverable if it contains all reads: we should be able to regenerate the full FASTQ from it, in case we don't like this aligner anymore and want to use another one. We will talk about this more.

Now, about choices of parameters for aligners. For any aligner, if you run it with no arguments, it prints a long list of parameters you can set; the same for BWA, and we will see it in the practical. But there are two key parameters. One is seed length. When we have a 75-base read, we don't want to require a lookup of the whole 75 bases; this would be a waste, because in the hash-table concept, the longer the lookup sequence, the huger the table: for 75-base keys you would need 4^75 entries, a number so astronomically large we could not build such a table at all. So we choose a smaller number, and the typical value for the BWA aligner is 32 bases (if your read is shorter than 32 bases, the seed is shorter). That is the part of the read, typically at the beginning, used for the lookup. A question for you: why is the seed typically, or always, at the beginning of the read? Exactly: that is where the quality is still good, and toward the end it deteriorates. So what happens to the rest of the read? There are different heuristics; typically an optimal Smith-Waterman algorithm is used to place the rest of the read. Now, if we take a random 20-mer, we can usually place it uniquely in the human genome, even with BLAT, the historical aligner.
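A back-of-the-envelope argument (my arithmetic, not a slide from the talk) for why about 20 bases suffice: the expected number of chance occurrences of a random 20-mer in a roughly 3-Gb genome (counting both strands) is

$$\frac{2\times 3\times 10^{9}}{4^{20}}\approx\frac{6\times 10^{9}}{1.1\times 10^{12}}\approx 0.005,$$

so a 20-mer that occurs at all is almost certainly unique, whereas a 15-mer ($4^{15}\approx 1.1\times 10^{9}$) is expected to occur several times by chance.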
So above 20 base pairs our alignment should be pretty good, and our seed length should be equal to or greater than 20. What is the second parameter? The number of mismatches, and this again depends on your sequencing quality: if your sequencing quality is high, you probably should not allow many mismatches, because then you don't know what you are doing; you can misplace certain reads, or place reads which don't belong there. What happens when we allow more and more mismatches? We get more multi-mapped reads: reads aligned to position A and position B and position C, because we allowed many mismatches. So we would like to keep this number under control too; typical numbers are the defaults. That is the reference for the aligner we are going to use.

Now I will go through this quickly, because we will come back to these questions in the tutorial. When we have aligned our file (I have not introduced the file format yet), what classes of reads do we expect? First, uniquely aligned reads; those are good, and more than 90% of the FASTQ reads should be uniquely aligned. Then unaligned reads; the reason for these can be anything, it is individual to the experiment, and in some experiments we can expect some fraction of unaligned reads. Then, for paired-end experiments, we have properly paired and not properly paired reads. Properly paired reads are, first of all, those within the expected fragment length. Martin already introduced the fragment-length distribution; we see it here not exactly as reconstructed from the alignment, but we typically run traces during the library-construction step, so we get an idea of the average and maximum fragment lengths. Pairs implying a fragment length of, say, one million bases are probably not properly aligned, unless you have structural variation in your data.

Here I show you examples of fragment-length distributions: the first from sonicated data, the second from native ChIP, MNase-digested data, which we already saw in Martin's presentation. In both, the blue curve is the H3K27me3 histone modification and the other is the corresponding input. Although in the sonicated data we don't see the shorter fragments, in both there is a trend that the ChIP fragments are longer. And remember that the fragment-length distribution has a long tail: the mean is informative, about 200 bases, but there is a long tail. What else can make reads not properly paired? For example, read 1 aligns to chromosome 1 and read 2 aligns to chromosome Y. That is probably a mistake, unless, again, we collect all such pairs and look for structural variation: maybe we have a translocation. That is how many structural-variation tools work, and if the genome has structural variations, we will see these signatures in our ChIP-seq data as well. Essentially, as I just said, it could be a deletion, inversion or insertion, and all of these result in non-properly-paired reads.
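A fragment-length distribution like the ones just shown can be pulled straight from a paired-end BAM; a minimal sketch, assuming the pysam library and a hypothetical sorted BAM file:

```python
# Collect the fragment-length (TLEN) distribution from a paired-end BAM.
import pysam
from collections import Counter

lengths = Counter()
with pysam.AlignmentFile("chip.bam", "rb") as bam:
    for read in bam:
        # Count each proper pair once, via read 1 with a positive TLEN.
        if read.is_proper_pair and read.is_read1 and read.template_length > 0:
            lengths[read.template_length] += 1

total = sum(lengths.values())
mean_len = sum(l * n for l, n in lengths.items()) / total
# Expect a mean near the ~200 bp library target, with a long right tail.
print(f"{total} proper pairs, mean fragment length {mean_len:.0f} bp")
```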
The last class of non-properly-paired reads, in a paired-end experiment, is when one read is aligned and its mate is not. Again there can be many reasons, but you will have this class; typically in ChIP-seq we ignore them, but if there are very many, you can consider keeping them and think about how to extend those reads.

Now, when we align, some of the reads will be duplicated, and we already learned about PCR duplicates; in the past, duplicates in libraries were dominated by PCR duplicates. On new Illumina platforms there is another source: optical duplicates. The adapters are now seeded at specific locations on the flow cell, and when two clusters are too close, or the reading is slightly off, two reads get confused into one and we get an optical duplicate. If you have many, many duplicates in your library, that is a red flag, something to check; if the duplication rate is below 10%, or below 5%, it is okay. With duplicated reads you basically lose reads, because what we do with duplicates is collapse them; and "duplicated" can really mean multiplicated: it can be 10 reads coming from exactly the same location.

So what is the signature of a duplicated read or fragment after alignment? Let's first talk about reads: if two reads are duplicates, how do we see it in our alignment? They have the same coordinates, yes; the reads are stacked exactly on top of each other. Now, can that happen legitimately, in principle? Yes. If we have a single-end experiment with 50-base reads, and a stack of more than 50 reads over a spot, then at least one read must share a start with another: with a read length of 50 there are only 50 possible start positions covering a given base, so at coverage above 50 we can have a "duplicate" that is not an artifact. On the other hand, with a paired-end experiment, because we have flexibility at both ends (the start of one read and the start of the other), the chance of a coincidental duplicated fragment is nearly zero in a ChIP-seq experiment. So the current practice is to collapse duplicated or multiplicated fragments and not consider them.

[Question:] Why can't we use duplicated reads to increase coverage, since they carry the same information? You can, but it is dangerous. Biologically it was just one molecule; one cell gave you this piece of DNA, and then it was PCR-amplified, say, 50 times. Imagine this molecule was background noise, not properly and specifically IP'd: after amplification we have 50 copies and a nice tower, a fantastic-looking enrichment, which is completely false, because there is no enrichment. That is the first reason. Second, if we are looking at whole-genome shotgun data and calling SNPs, and there is one mistake in this read, we will call a wrong SNP, because we will have enough coverage and an apparent consensus.
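A conceptual sketch of the coordinate-based duplicate test just described (illustration only; Picard and sambamba, discussed below, implement the real logic):

```python
# Duplicate marking by alignment coordinates: single-end reads sharing the
# same chromosome, 5' position and strand are flagged as duplicates; for
# paired-end data the key would include both fragment ends, which is why
# coincidental duplicates are far rarer there.
def flag_duplicates(reads):
    seen = set()
    for chrom, start, end, strand in reads:
        key = (chrom, start if strand == "+" else end, strand)  # 5' position
        yield (chrom, start, end, strand), key in seen
        seen.add(key)

reads = [("chr1", 100, 175, "+"), ("chr1", 100, 175, "+"), ("chr1", 102, 177, "+")]
for read, is_dup in flag_duplicates(reads):
    print(read, "duplicate" if is_dup else "kept")
```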
[Comment:] Yes, the general comment is that duplicates are removed from ChIP-seq data for that reason: you are trying to estimate enrichment over background, so they should be removed. There are lots of caveats to all of this, but it is important for beginners to recognize the current practice, which is duplicate removal. [Question:] Yes, exact same coordinates. And yes, the read sequences can differ slightly: PCR happens prior to sequencing and then there are sequencing errors, so the sequences themselves can be slightly different, but there is a strict requirement on sequence similarity as well.

Okay, the last class: after duplicates, we can have reads aligned to multiple locations. The aligner struggles to assign a unique location: it finds several, because these reads come from repeated regions of the genome. A read obviously comes from one spot, but when we try to place it we find a perfect match here, a perfect match 5 kb away, and so on. In this case BWA arbitrarily (not randomly, which is not good) assigns a coordinate. As a recommendation, in most ChIP-seq experiments we simply ignore these reads, because we don't know where to place them. There are exceptions: if your experiment actually deals with repeats, there are approaches, and if you want, we can talk about handling multi-mapping after the session.

These multi-mapped reads bring us to the notion of mappability: not the entire mammalian genome is accessible to us. With 75-base reads, only about 80 to 90 percent of the genome, depending on the chromosome, is uniquely mappable, and with the rest we unfortunately don't know what to do. Here is a cartoon of why this happens. The orange regions are repeats on the genome. The green reads are perfectly mapped, no problem; but the red reads, coming from the repeat, align here and also align there, and we don't know where to place them. Paired-end information helps, because in a paired-end experiment one read may align uniquely while its mate aligns to multiple locations, and then we can nail down the second read using the expected fragment length; that is how the alignment works.

[Question:] Yes, paired-end will increase your alignability for sure, because the main danger in the human or mammalian genome is Alu repeats, about 300 bases long: if both reads of a pair sit inside the repeat, that's it, forget about it; but 300 bases is longer than the typical 200-base fragment. Now, about longer fragments: my understanding, and Martin can comment further, is that there is a limitation in Illumina library construction; you can't really go beyond 500-600 bases, because very long fragments don't bridge-amplify. Right, that is correct. But here we are talking about one pair with two reads: we matched the first 32 bases (or 20, depending on settings) in one location, and it happens that the second read's location falls in a repeat, so one read is uniquely mapped and the second is mapped in multiple places.
We then see that one of those places is actually 500 or 300 bases away from our uniquely mapped mate, while the others are on different chromosomes or in distant locations, and the aligner is smart enough to decide that the right location for the second read is the closest one. [Comment:] You will probably get a better understanding as you go through the module. All alignments are done single-ended, and the pairing is resolved after the fact. In BWA the seed length is 32 by default, with two mismatches. Misha is showing that there are other options, such as a smaller seed length. The problem is that once you get past 100 base pairs, what is not mappable is kilobases in length, so going to 200 or 300 or 400 bases is not going to resolve those regions: you reach a plateau, and the incremental increase in mappability becomes very small as reads get longer and longer. And it is true that on the Illumina platform you are limited to about 600 bases; that is just the nature of how the clusters are formed, and the read length is also limited because of phasing. You will go through the mapping in the practical. Is everybody okay so far?

Here is an example showing that alignability, or mappability, differs between chromosomes, which is expected: we know, for example, that chromosomes X and Y share a lot of homologous sequence, so for reads coming from there (the first two bars) the mappability is about 80%, lower than for the other chromosomes; this is for 75-base reads. What I have done here is exactly the simulation we discussed: I shredded the genome into 200-base fragments, took 75 bases at both ends, used BWA to align, and checked how many positions I recover. So about 10 to 20% of the genome cannot be recovered properly.

We can easily visualize this. In the UCSC Genome Browser, which we will talk about later (if you are not familiar with it, just search), there are pre-calculated mappability tracks for different read lengths; this one is single-end. They show a number from 0 to 1: a 1 means the position is 100% mappable; a 0 means every read overlapping this position will fail to align uniquely. Look at this screenshot: there is a repeat, in this case an LTR element (the black bar), and at all read lengths (50, 75 and 100) reads are not uniquely aligned there. Now, maybe jumping ahead a little, an interesting comment: where does this mappability matter? In integrative analysis of several datasets with different read lengths. It happens to us all the time: the new experiment is 100 base pairs on a new Illumina platform, while the old experiment, our wild type, is 50 bases, and we would like to compare them. This is a little dangerous, because mappability at 50 and 100 bases is different; you can see it even in this screenshot: at this location the 50-base track has a big gap, but the 100-base track has some coverage. So in the 100-base experiment we recover this area and see coverage, in the 50-base one we do not, and when we compare the two experiments we see apparent differential enrichment there which is entirely due to mappability, and therefore wrong. That is something we have to remember.
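A rough reconstruction of that mappability simulation (not the speaker's actual script; names and the downstream BWA step are assumed): shred the reference into 200-base fragments, emit 75-base reads from both ends with the true position encoded in the read name, align, and count how many come back to where they started.

```python
# Simulate paired 75-bp reads from 200-bp fragments tiling a reference
# sequence, writing a FASTQ whose read names encode the true position, so
# that after alignment (e.g. with BWA) recovered positions can be counted.
def revcomp(s):
    return s.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def simulate(ref, chrom, frag=200, readlen=75, step=10, out="sim.fq"):
    with open(out, "w") as fq:
        for start in range(0, len(ref) - frag + 1, step):
            fragment = ref[start:start + frag]
            r1 = fragment[:readlen]               # forward end of fragment
            r2 = revcomp(fragment[-readlen:])     # reverse end of fragment
            for i, seq in enumerate((r1, r2), 1):
                fq.write(f"@{chrom}:{start}/{i}\n{seq}\n+\n{'I' * readlen}\n")

# Afterwards: align sim.fq, then compare each read's mapped coordinate with
# the coordinate in its name to estimate per-chromosome mappability.
```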
Yes, you had a question? Oh, no? Okay, I thought so. All right, let's move on. I will skip these slides with the green header, because we will cover that in the tutorial, and I am a little afraid we will not have enough time. This is basically about the SAM file and format. SAM is the standard alignment/map format; it has, sorry, not 10 but 11 obligatory fields, and then some optional fields. A question for you: what should an alignment file contain; what is absolutely necessary to know about an alignment? The coordinate; what else? The quality of the alignment; the mismatches, whether there were mismatches or not; and what part of the read was aligned: for example, sometimes the aligner says "sorry, I can't align the first 10 bases, I soft-clipped them", masking them and starting from base 10. All this type of information is contained in the file, and we will go through it in the practical, so I will skip the details here, with maybe one comment.

Suppose we have a single-end experiment; there is still quite a lot of useful single-end data in public repositories which you may want to use. Consider ChIP-seq, and let's talk about a transcription-factor DNA-binding experiment: say this is DNA and this is the transcription-factor binding site. Our fragments will be covering this spot; if our antibody is specific, we pulled down this type of fragment, but we sequence these fragments from one end only. So this will be a read here on the positive strand, a read there on the negative strand, positive, positive. Now, if we take only the reads and pile them onto the genome, what picture do we get? A tower here from these reads, and one read over there; and this is not what we want, right? Intuitively we would like to profile the entire fragment, because if we profile those four fragments, the picture looks completely different: something like this, with the maximum sitting at the actual binding site. But we only have single ends, so the typical practice is to extend each read to the average fragment length, and that is what I explain here. We take the starting coordinate; we know from library construction that our average fragment length is, say, 200 bases, and we extend. The caveat is to be careful about start and end: positive strand and negative strand are slightly different creatures. For a read on the negative strand, the coordinate is given for the start of the read, but we have to extend from the end of the read, in the opposite direction. This is something to just try to remember.
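The strand-aware extension rule is easy to get wrong, so here it is as a tiny sketch (half-open coordinates; the 200-bp fragment length is the example value from the talk):

```python
# Extend a single-end read to the mean fragment length: on the plus strand
# extend rightward from the read start; on the minus strand extend leftward
# from the read end, because the fragment lies upstream of the read.
def extend(read_start, read_end, strand, frag_len=200):
    if strand == "+":
        return read_start, read_start + frag_len
    else:  # "-": the 5' end of the fragment is the read's rightmost base
        return max(0, read_end - frag_len), read_end

print(extend(1000, 1075, "+"))   # (1000, 1200)
print(extend(1000, 1075, "-"))   # (875, 1075)
```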
The CIGAR string and everything else are things you can play with; I put some questions about CIGAR, and we will talk about them. Now, after we have this SAM alignment file, we need tools to analyze it, and SAMtools is very popular; maybe some of you are familiar with it. Please raise your hand if you have heard of SAMtools. Quite a few; great. SAMtools is very useful and well tested; the entire community uses it, and with our alignment file and SAMtools we can really do a lot: we can index the file, view it, grab statistics, and quite a few other things. There is also a brother, or cousin, of the SAM file, called the BAM file, which is a binary file; it helps us save space without losing any information, and we can convert the ASCII SAM file into BAM using SAMtools. We will do this practically during the tutorial session.

After alignment we also have to mark duplicates as a separate task; the aligner by itself does not know about them, and there are multiple tools for this. We will be using Picard; there is another one, sambamba, which I tried recently, and in fact production at the Genome Sciences Centre is moving to sambamba; it is a very good tool, and fast. The point is that we have to mark duplicates after we have our BAM file, in a separate step, with a separate sorting step; we will go through them.

Now, to the question about what counts as a duplicate: generally speaking it is a matter of aligned coordinates, chromosome 1 base 1000 versus chromosome 1 base 1000. There are certain criteria, such as how many mismatches are tolerated before the molecules are considered different, but it is pretty loose; a read with too many mismatches would barely align at all. The rule of thumb: reads aligned to the same coordinates are considered duplicates. They are then filtered not by quality but by the duplication flag: running MarkDuplicates sets a bit in the FLAG field, which is field 2 of the SAM record, indicating that the read is a duplicate; it is flag bit 11 (0x400, decimal 1024). So we should use flag filtering, and we will be doing this.

And let me also answer the question about quality: we will do quality filtering too, and by quality filtering we remove the multi-mapped reads. The mapping quality tells us how confidently the aligner could place the read on the genome: if it finds a place in one location, and with nearly the same accuracy can find another location, the quality will be low. As Martin said, there is read quality and then there is alignment quality; in BWA the mapping quality is a number from 0 up to several dozen. If a read aligns identically, with no mismatches, in two locations, the mapping quality is 0; if it aligns in one position with no mismatches and in another with a mismatch, the quality will be 1 or 2, and so on. The typical threshold is 5: with a mapping-quality threshold of 5, you resolve locations pretty well. One second, there was a question: once the flags are set and we then convert the file to BED format, will the BED file contain the duplicated reads or not? Well, BED format has no standard place for that information; the quick answer is no.
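Putting the duplicate-flag and mapping-quality filters together, a minimal sketch assuming pysam and hypothetical file names (the last comment shows a command-line equivalent):

```python
# Filter a BAM to reads that are neither duplicates nor low mapping quality;
# thresholds mirror the talk (MAPQ >= 5, duplicate flag bit 0x400).
import pysam

with pysam.AlignmentFile("marked.bam", "rb") as src, \
     pysam.AlignmentFile("filtered.bam", "wb", template=src) as dst:
    for read in src:
        if read.is_duplicate:          # FLAG bit 0x400, set by MarkDuplicates
            continue
        if read.mapping_quality < 5:   # drops multi-mapped reads (MAPQ 0-2)
            continue
        dst.write(read)

# Command-line equivalent: samtools view -b -q 5 -F 1024 marked.bam
```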
[Question:] Instead of just marking duplicates, isn't it better to simply remove them? If you ask me, and this is a very personal view: no, because you are losing information. [But we are looking for enrichment here...] It is a personal preference: either you remove duplicates and forget about them, or you remember at every analysis step to apply the flag and not count the duplicated reads. [Question:] That is what I was asking: if we only set a flag, will other software, whatever I use, respect it? Well, first, why would you convert an alignment file into a BED file at all? BED files are typically used for derived features, not reads. Okay, suppose I want to calculate read counts mapped to transcripts or genes; if I use R or some other tool, will it consider the flags? Typically yes, and then of course your counts are already filtered: multiplicated or duplicated reads are collapsed before your BED numbers are produced. We have not really talked about the BED format, but I am going to; most of you know what it is: basically chromosome, start, end, and some number, a float, giving you a score. So: at this position of the genome we have this value; it could be the coverage of an exon, or the ChIP-seq signal at the TSS of a gene; it could be anything. It is very simple, but it is not a read file: we lose the primary information about reads. [Question:] My question was similar, I guess: does duplicate detection consider the sequence? Effectively it does, up to mismatches and sequence quality. And how else would you do it? You have a file with 100 million reads; to find identical sequences you would have to sort through the whole file, which is a computationally involved step. So yes, duplicates are discovered from the alignment.

Now, where else do we use Picard or sambamba? One typical good practice in a ChIP-seq experiment is to multiplex: we have several lanes and the same reagent, but we place pieces in different lanes, and then we avoid some biases, batch effects. In this case, for one antibody, one histone mark, we have several alignment files coming from different lanes; we run MarkDuplicates, and then we would like to merge them, using MergeSamFiles in Picard or the merge option in sambamba. Now a question for you: if I mark duplicates in every individual lane and then merge, am I ready to do my analysis, or do I have to do something else? Let's think about it a little. I have the same reagent loaded in different lanes; I sequence them; I have FASTQ files coming from the different lanes. Of course I could concatenate those files and align, but, as we discussed, to speed up the process I align them in parallel; I mark duplicates; I merge them. Is everything done correctly? No, of course not, because duplicated reads can end up in different lanes; so we must mark duplicates again after merging.
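A sketch of that order of operations (lane BAMs and tool invocation are illustrative; it assumes a `picard` wrapper on the PATH):

```python
# Align lanes in parallel, merge the per-lane BAMs, and only then mark
# duplicates, so duplicate pairs split across lanes are still caught.
import subprocess

lanes = ["lane1.bam", "lane2.bam", "lane3.bam"]   # hypothetical per-lane BAMs

subprocess.run(["picard", "MergeSamFiles",
                *[f"I={bam}" for bam in lanes],
                "O=merged.bam"], check=True)

subprocess.run(["picard", "MarkDuplicates",
                "I=merged.bam", "O=merged.dedup.bam",
                "M=dup_metrics.txt"], check=True)  # flags, does not remove
```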
Now, visualization, and this is person-specific. I like UCSC, maybe because UCSC has been around for many years and is very customizable, although sometimes it is slow; hopefully it is not going to be slow during our practical today. There is also the WashU Epigenome Browser, which has lots of functionality but is rather involved, more of an advanced usage; or you can use IGV. Both UCSC and IGV allow you to look at the data at the read level or at the level of coverage profiles, and coverage profiles look like this. Here is an example for H1, exactly the dataset we are going to work with: six histone modifications for H1 human embryonic stem cells, here is the input, here are tracks for two transcription factors in the same H1 cells, and here is RNA-seq. This is very visual; it is what we discussed: we drop the reads onto the genome and see where they collapse into the same location, stacked on top of each other; they do not stack exactly, but where they should, they create these little towers we can look at. There is a notion people call peaks, although, genuinely speaking, for histone modifications some of these are extended regions, not peaks, so I prefer to call them enriched regions.

Now, the BED format; we already talked about it a little, and there is one thing to remember. In most cases it plays no role, but in some cases it is important, and I have been bitten by this many times: BED is zero-based. When we have a BED file with chromosome, start and end, the actual start is one base less than the genomic start. Nobody quite understands why the authors chose this, because it is very non-intuitive: the end is exactly where the region we are describing ends, but the start is one base less. Something to remember; it can create problems. The second popular format is the wiggle (WIG) format, which we will generate in the practical; it is a little more complicated, but we will look into it, and I will skip the details. Wiggle is one-based: when we have position 400,601, it is exactly that position on the genome, not shifted by one. And the last one is bedGraph, which is exactly like BED but carries an extra number, a value.
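The off-by-one conventions in one place (a small sketch; BED is zero-based and half-open, wiggle one-based and closed):

```python
# Convert between BED (zero-based, half-open) and one-based, fully closed
# coordinates; the interval length is preserved either way.
def bed_to_onebased(start, end):
    return start + 1, end          # BED [100, 200) -> one-based [101, 200]

def onebased_to_bed(start, end):
    return start - 1, end          # one-based [101, 200] -> BED [100, 200)

print(bed_to_onebased(100, 200))   # (101, 200); length stays 200 - 100 = 100
```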
So now we come to analysis. We have our alignments and we have generated files to visualize; what kinds of questions can we ask? I would like you to participate actively in this part. I put down some possible scenarios. The first is single-epigenome analysis; it could be the six popular histone modifications which Martin introduced and whose screenshot I just showed you, and we ask: find the enriched regions (in a few slides we will discuss what enriched regions actually are, because that is an interesting concept). This is de novo: we have data and we mine the dataset for locations which are different from other locations, with no prior bias to the analysis. In the second scenario we already know something, and we want to look at regions of interest. An example of a region of interest could be the gene body: say we only want to know whether transcription factor A binds in the first intron of all coding genes in the human genome. Then we don't need to look at the entire genome; we just look at these regions and compare the introns which have signal to those which don't. That is a somewhat easier, more guided task. It could also be a histone modification flanking some repeats, and so on. So in the second scenario we are not doing de novo discovery; we already know where to look. Tomorrow you will hear about DNA methylation, and DNA methylation has its own ChIP-seq technology; maybe you have heard of MeDIP, or hMeDIP for hydroxymethylation: basically ChIP-seq with an antibody sensitive to methylated CpGs. That analysis differs strongly from any other ChIP-seq analysis precisely because we know where to look: in any other ChIP-seq analysis we explore the entire genome de novo, but in MeDIP we know where the CpGs are, and that is a big difference. The third scenario: we have several conditions for the same cell type, or several tissues, and we would like to compare them. That is what came to my mind; if you have something else, please raise your hand and suggest it, or we can talk about it later.

Now, here is something that bothers me, and the literature often softly avoids answering it: what is enriched? What are enriched regions, and how do we define them? It is very intuitive, but when we do analysis it is, in the end, computer science or math: we run certain tools and perform analytical studies, so we have to be precise about what we are doing. The best I can come up with is the following two criteria for a region to be enriched. First: in our ChIP-seq experiment, this region is different from the other regions, the ones which are not enriched. That sounds circular, I know, but that is actually it: most of the genome is probably not enriched, and some locations are different. (There are histone modifications for which most of the genome is enriched; in that case the experiment would instead look for anti-enriched, non-enriched areas.) That is the first criterion. The second criterion: in modern experiments we always have controls (in the early days people ran ChIP-seq experiments without controls), typically a negative control, very often DNA input; there is debate, and there are other types of controls, but DNA input is the easiest, and it follows the whole library-construction path except the IP step. Given a negative control, we require that the enriched regions are not special in the negative control: special in the IP, but not in the control. Essentially these are the only two criteria; the rest is heuristics about how we implement them. So: statistically significant locations which are different from other locations, and not different in the input. Now, about the negative control: here is a reference for you; people argue that, for example, for a histone ChIP-seq an antibody raised against the whole histone H3 is a good control, and I agree, but that is a more difficult experiment, so typically input is our control. Why is input important? Here is a good example.
We already generated those coverage files in bigWig format and visualized them in the browser, and all of a sudden we see this picture: six marks, the input control, transcription-factor experiments, and we always see the same type of coverage at the same spot. We would be a little suspicious; but imagine we don't have the luxury of six marks and have just one track: it looks like perfect enrichment. Apparently it is not, and the input control helps us: when we look at the black track, the input control, we see a huge enrichment in the input itself, which violates our requirement that the location not be special in the input control. So we blacklist, exclude, those locations, and that is the first place we use the input. These locations do not do that much harm, because there are typically not that many of them; why do we have this phenomenon at all? It is due to alignment artifacts. To give you an example: suppose there is a tandem duplication in the actual genome which is not captured in the reference. Then we have many more reads in the actual library than the reference sequence accounts for, and those reads pile onto the same spot: they come from two different places in the actual DNA, the actual chromatin, but they align to one place in the reference, uniquely, no problem, not flagged as multi-mapped. That is one example; these are technical artifacts. How many should you expect? About a thousand to five thousand genome-wide. What is dangerous about these spots is that they are the same across different experiments: if we do not filter them using input, we will "discover" concordance between different experiments at those locations, which is probably not what we want. So those regions are good to remove. Here is another example: these are mouse embryonic stem cells, another mark, H3K9me2, different conditions, plus input; you see a perfect huge tower, which would look like nice enrichment, but it has to be removed.

Here I show the entire chromosome 1 of the human genome: the y-axis shows coverage (on a log scale), the x-axis position along the chromosome, for three different cell types: colon, mammary gland and thyroid. It is pretty consistent: there are locations, near the centromere, which always give these huge towers, with coverage much higher than the average (the average coverage is nicely flat). However, some locations are cell-type-specific: here we see a location which should be blacklisted in colon but is not present in the other cell types; here a location present in colon and mammary and almost absent in thyroid; and so on. That is one argument for sequencing an input for every cell type, with the input done in parallel to your IP. Luckily, for a classical epigenome (for Roadmap, and now for the IHEC consortium, it is six different histone modifications) we need only one input; but it does look like these artifact locations are somewhat cell-type- or donor-specific. Here is another example of the same. In the old days, ENCODE published blacklists of regions to be excluded, for both human and mouse.
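A sketch of the blacklist-filtering step (illustration only; in practice one would use something like bedtools intersect):

```python
# Drop candidate enriched regions that overlap a blacklist; both inputs are
# BED-like (chrom, start, end) tuples, grouped per chromosome for the test.
from collections import defaultdict

def filter_blacklisted(regions, blacklist):
    by_chrom = defaultdict(list)
    for chrom, start, end in blacklist:
        by_chrom[chrom].append((start, end))
    kept = []
    for chrom, start, end in regions:
        if not any(start < b_end and b_start < end     # half-open overlap test
                   for b_start, b_end in by_chrom[chrom]):
            kept.append((chrom, start, end))
    return kept

peaks = [("chr1", 1000, 2000), ("chr1", 5000, 5600)]
bad   = [("chr1", 5400, 7000)]
print(filter_blacklisted(peaks, bad))   # [('chr1', 1000, 2000)]
```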
However, the blacklist does not cover everything: the blue dots are what was blacklisted, but you can see some red ones which survived. Conversely, some locations blacklisted by ENCODE show perfectly normal coverage in our cells, because the ENCODE list was based on shorter, single-end reads, a different technology, and we can resolve those alignment artifacts better now. If you have any questions on this part, please ask, because now we are moving to enrichment detection. [Question:] Well, in this case, if you say "I am not interested in locations enriched in both IP and input", that would include those artificial locations too: you would expect the artifacts to be the same in both, so if you look at the difference, they cancel out. Yes; however, it is not recommended. First, input is used by many tools for other purposes (we are coming to one in a second). Second, in those spurious locations the coverage is huge, say 10,000, and at such coverage you have large fluctuations: if your cancer sample has coverage 5,000 there and your normal sample has 15,000, your differential-analysis tool can detect a difference, although the difference is just fluctuation; both are super-enriched. It depends; I think you can sometimes get away with it, but the one thing you will never be able to do is ask the absolute question: forget about my cancer sample, I want just the enriched regions in the normal sample alone. You would not be able to ask that question.

Okay, we have about 20 minutes left, so I think we have enough time to talk about enrichment; maybe I will skip this slide. I just wanted to mention that there are very, very many peak-calling tools; people keep trying different tools, and it is still, you know, a somewhat open area. Many tools have been published, built on different principles, and they are genuinely different. Some tools make an explicit assumption about the background, an analytical form of the background, because the major problem of enrichment calling, given our two criteria for enrichment, is that we don't know the background; we don't know what to expect. That is why I think this type of experiment is even more complicated than other kinds, like RNA-seq: we absolutely don't know the background, and we are at the mercy of the antibody quality and many other things. What is the noise, what is the background? It depends on whether your cells were frozen or fresh; many factors affect it, and we don't know it analytically. [Question:] Yes, people do compare different tools; essentially every publication of the last few years has to compare against existing tools, and they do, but it is a very difficult question. And from the user perspective there are many practical requirements: sometimes you download a tool written in a very exotic language; your system does not have the environment installed; you don't know how to run it; many tools are unsupported, published a few years ago and no longer maintained, and they simply do not run. I have been there many times, unable to run a tool. [Question:] So could you, or some group, try to
OK, we have probably twenty minutes left, so I think we have enough time to say something about enrichment. Maybe I'll skip this slide; I just wanted to mention that there are very, very many peak-calling tools. People keep trying different tools, and in my view it is still a somewhat open area, although many tools are published, built on different principles. Just to give you an idea of how they differ: some tools make an explicit assumption about the background — an analytical form for it — because the major problem of enrichment calling, given our working definition of enrichment, is that we don't know the background; we don't know what to expect. That is why I think this type of experiment is even more complicated than, say, RNA-seq: we simply don't know, and we are at the mercy of antibody quality and many other things. What is the noise, what is the background? It depends on whether your cells were frozen or fresh — many, many factors affect it, and we don't know it analytically.

Yes, there are comparisons: typically every publication, at least in the last few years, has to compare against existing tools, and they report those comparisons — but it is a very difficult question. And from the user's perspective there are many practical hurdles. Sometimes you download a tool and it is written in a very exotic language; your system doesn't have that environment installed and you don't know how to run it. Many tools are unsupported — they were published a few years ago, they are no longer maintained, and they simply don't run. I have been there many times. Someone asked whether we, or some group, could organize these tools and discard the redundant ones. Well, we try — though not to discard: first of all, people put a lot of effort into writing those tools, and they sincerely found the best heuristics for the data in their hands at that moment. In every publication you will see "my tool performs much better than the tools available before", but that was demonstrated only on that particular dataset, with parameters tuned for that dataset. Unfortunately, that's the way we do things — we are all overfitting a little bit. So: visualize the data, run some tool, visualize the data again, see whether you are happy. If I have time — I hope I will — I would like to introduce some approaches that don't use any peak calling at all and can still answer some questions.

But back to peak calling. What people try to do is the following: use some analytical shape for the background. Of course we are dealing with count data, so Poisson or Poisson-like distributions may work, and sure enough, for some libraries it is perfect — the input control is fit by a Poisson beautifully. Then you take another library, try to fit a Poisson and — whoops — you see it doesn't fit. So you say, well, maybe negative binomial, which is a slight extension of the Poisson: the Poisson has just one parameter, the mean, while the negative binomial has two, so you have more degrees of freedom to fit. But then you get yet another library that even the negative binomial cannot fit. Unfortunately, ChIP-seq experiments have lots of biases that are not of statistical origin, so there is probably no single analytical distribution for the background. Hence the many heuristics, the supervised-learning approaches, and so on. We are developing our own tool as well, which is parameter-free — not assumption-free — and it is a bit of a banging-the-wall experience for us at the moment: you move five inches forward and then four and a half backwards. But still, half an inch.
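Here is a rough sketch of the fitting logic just described — not any published tool's code, just the method-of-moments idea: if the per-bin variance is close to the mean, a Poisson is plausible; if the counts are overdispersed, a negative binomial with two parameters may fit; the simulated counts stand in for real input-library bins.

```python
# A rough sketch of the Poisson-vs-negative-binomial check described above;
# the simulated data are placeholders for real per-bin input coverage counts.
import numpy as np

def background_moments(counts):
    """Method-of-moments check on per-bin read counts."""
    counts = np.asarray(counts, dtype=float)
    mu, var = counts.mean(), counts.var()
    if var <= mu * 1.05:                 # near-equidispersed: Poisson-like
        return {"model": "poisson", "lambda": mu}
    # negative binomial moments: var = mu + mu**2 / r
    r = mu**2 / (var - mu)               # dispersion (size) parameter
    p = r / (r + mu)                     # success probability
    return {"model": "negative binomial", "r": r, "p": p}

# hypothetical counts in 200 bp bins from two different input libraries
rng = np.random.default_rng(0)
print(background_moments(rng.poisson(5.0, 10_000)))               # ~Poisson
print(background_moments(rng.negative_binomial(2, 0.3, 10_000)))  # overdispersed
```

And, as the speaker notes, some libraries will fail even the second branch — the biases are not of statistical origin, so no moment check rescues them.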
Now, when we analyze our ChIP-seq data, we see there are different kinds of marks — maybe you have heard of narrow marks and broad marks. This is simply the appearance of the mark. If I come back to this slide: some marks appear as punctate peaks and some as broadly covered regions; it is just the nature of the enzymes setting those marks. Although even the "peaks" for histone modifications typically span tens of nucleosomes — it is very hard to imagine real enrichment happening on a single nucleosome. Here I give you examples: the red one is H3K4me3, about 3 kb, and the purple one is H3K36me3, a histone modification typically associated with transcriptional elongation that appears in gene bodies; in this particular example it covers a 100 kb region. So that is the typology. And here is a detail of the 3 kb peak: when you zoom in, there are actually multiple peaks — a kind of fork. That speaks to your question of which tool to use, and it is a difficult question; very often people just use whatever they can run and then look at the data to see whether it makes sense. I agree that this is, to some extent, an open area, and if you really want to be careful with your data you can do many things without tools: a lot of QC — not at the read level, but at the level of your biology — to check that what you are doing is correct.

What I would like to do with the remaining time is introduce some approaches that come prior to enrichment detection: we can actually learn a lot from the data without doing any peak calling. The example data are from embryonic stem cells, and we will be looking at regions of interest — here, the promoters of coding genes. Of course, how to define a promoter is a question; here we define it as TSS ± 1.5 kb, and we take only the major transcript — we are not dealing with alternative promoters. Then we compare the signal: we calculate the coverage in those promoters and in 3 kb bins genome-wide. Is it clear what we are going to do? We take the profiles — the bigWig files — and compute the integral, the area under the curve, for each track; that is our signal. Then we do the same for every 3 kb bin in the entire genome. That is quite a few bins: three billion bases divided by 3 kb is about one million — a little less, because some bins are all Ns. Then we compare the distributions.

The very first plot is input DNA, in promoters versus genome-wide. It is a histogram: on the x-axis is the mean coverage — the area under the curve divided by the 3 kb length — and on the y-axis is the frequency: how many bins, out of the total, have that mean signal. It is normalized, because we have only about 20,000 promoters and many more genomic bins. And it is not bad — what does this plot tell us right away? In input DNA, our promoters look exactly like everywhere else in the genome, which is good; in my view that already resolves something.

Now the second plot: the same thing for H3K4me3. The red (orange) curve is promoters; the blue curve is genomic bins. As you would expect, the genome-wide curve for H3K4me3 is a falling curve — it is background. If our antibody were perfectly specific, we would have zero reads everywhere except where the modification is; we know this modification is linked to transcriptional activity and marks the promoters of transcriptionally active genes. But our antibody is not perfect, so we have this falling blue curve. What is remarkable is that for some of the promoters the red curve follows the blue curve exactly — those genes are not active, their promoters are not marked, and the curve looks like the genome-wide curve. But then we have this nice peak, at a mean coverage of about 30. And we can do a very simple experiment: looking at this plot, we say everything above 8 or 10 is enrichment, everything below is not, and we immediately get a list of genes marked by this histone modification. Of course I am not recommending this as a final analysis, but if you want a result within an hour or two, or a day — something that lets you see whether the data make sense — this is a good way to do it; there is nothing wrong with it. And this data is not specially selected: it is good data, but it is not a star. It is normal data.
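A minimal sketch of that promoter-versus-genome comparison, assuming you have a coverage bigWig and a promoter list (TSS ± 1.5 kb); the file names "H3K4me3.bw" and "promoters.bed" are placeholders, not course files.

```python
# Mean coverage per 3 kb bin, genome-wide and in promoters, from a bigWig.
import pyBigWig

BIN = 3000  # 3 kb, matching the promoter window size

bw = pyBigWig.open("H3K4me3.bw")

# mean coverage in every 3 kb bin, genome-wide
genome_means = []
for chrom, length in bw.chroms().items():
    for start in range(0, length - BIN, BIN):
        m = bw.stats(chrom, start, start + BIN, type="mean")[0]
        if m is not None:                 # skip fully unmapped (N) bins
            genome_means.append(m)

# mean coverage in each promoter window
promoter_means = []
with open("promoters.bed") as bed:
    for line in bed:
        chrom, start, end = line.split()[:3]
        m = bw.stats(chrom, int(start), int(end), type="mean")[0]
        if m is not None:
            promoter_means.append(m)
bw.close()

# Plot both as density-normalized histograms: an enriched mark shows a
# second promoter mode well above the falling genome-wide background.
```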
The next one is another histone modification, H3K9me3, and maybe some of you are familiar with it — people who work with repetitive regions know this mark very well, because it is a repressive mark, linked to heterochromatin, and it represses retroviruses in mammalian genomes. Again, the blue curve is genome-wide and the red curve is promoters. We can probably say that if you are only interested in the promoters of genes, this mark is not that important, because it is anti-enriched: the promoter peak sits below the genome-wide average. Are you with me? It is almost obvious. Of course, this is a genome-wide average: in the genome there are some locations not marked at all, then a background average, then a tail — and there are also some promoters with large signal. True, there is a handful of genes — maybe 30 or 50, germline-specific genes — that carry this modification at their promoters, but overall this modification is much less abundant in promoters than the other modifications are.

The next mark is again repressive: H3K27me3, plotted exactly the same way. At first look it is also anti-enriched — the majority of promoters have very little of this signal relative to genome-wide. However, you immediately see a long tail: some promoters, maybe 10% of the 20,000, do carry this modification, and again we can put a threshold somewhere and decide which promoters are marked and which are not.

And here is another look at exactly the same data: I plotted a heatmap in which every dot is a single promoter, with the H3K27me3 signal on the y-axis and the H3K4me3 signal on the x-axis. What you can see immediately is that these marks anti-correlate — everybody sees this, right? Bright color means many promoters. Many promoters have zero H3K4me3 and some H3K27me3 signal, along this line, and many promoters, conversely, have strong K4me3 and very weak K27me3. So most of them are mutually exclusive — but there is a class of promoters with strong, or intermediate, signal for both marks.
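A sketch of that per-promoter two-mark plot and the hand-drawn thresholds, assuming you already have arrays of mean promoter coverage for the two marks; the simulated values and the cutoff numbers are illustrative, not the ones used in the lecture.

```python
# Two-mark heatmap per promoter, plus a simple threshold classification.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)             # stand-in data for ~20k promoters
k4  = rng.lognormal(1.0, 1.2, 20_000)      # H3K4me3 mean promoter coverage
k27 = rng.lognormal(0.5, 1.0, 20_000)      # H3K27me3 mean promoter coverage

plt.hist2d(np.log1p(k4), np.log1p(k27), bins=100, cmap="viridis")
plt.xlabel("log1p H3K4me3")
plt.ylabel("log1p H3K27me3")
plt.savefig("marks_2d.png")

K4_CUT, K27_CUT = 8.0, 5.0                 # illustrative "hand-drawn" lines
bivalent = (k4 > K4_CUT) & (k27 > K27_CUT)
k4_only  = (k4 > K4_CUT) & ~(k27 > K27_CUT)
k27_only = (k27 > K27_CUT) & ~(k4 > K4_CUT)
print(bivalent.sum(), k4_only.sum(), k27_only.sum())
```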
Maybe you have heard of bivalent promoters — promoters that are co-marked by both. The mechanism of this co-marking is debated — it can be cell heterogeneity or true bivalency — but at least operationally, in this particular experiment, we can count promoters that carry both marks. Looking at this plot, I effectively drew two lines by hand and said: everything here is K27me3 without K4me3, everything here is K4me3 without K27me3, and everything in the middle quadrant is bivalent. Having done that, I can correlate my marks with expression — and I know what to expect: genes repressed by K27me3, by the Polycomb complex, should be silent or lowly expressed, while K4me3-alone genes should be highly expressed. Look at the lower box plots: here is K4me3-alone, plotted against log10 RPKM (reads per kilobase per million mapped reads — an expression metric). K4me3-alone has much higher expression — and that is a log scale, so from here to here is about a thousand-fold difference. K27me3-alone has very low expression; not-marked-by-either is almost as low as K27me3-alone; and bivalent is in the middle. I plot both embryonic stem cells (H1) and breast myoepithelial (basal) cells, and you can see the patterns differ slightly — the "not marked" class, for example, differs between the two. And when you look at the numbers of promoters: in myoepithelial cells many more promoters are marked exclusively by H3K27me3, while in H1 we mostly have bivalent promoters. This is a known phenomenon — it is thought that during differentiation bivalent promoters are resolved into either repressed or active; they lose their bivalency. Just one second — I will finish this and then answer the question. The top panel is basically what I just said: normalized H3K27me3 signal versus expression, separating highly and lowly expressed genes. In embryonic stem cells it is almost the same for high and low — the K27me3 signal does not differ — but in myoepithelial cells the genes with low expression have a much stronger H3K27me3 signal; they are repressed.

That is a good question: of course I need a name for each promoter, and that is the gene ID. I keep the gene ID for every promoter, I have a gene ID for every expression value, and then I just join the files. Yes — I have RNA-seq quantification, an RPKM for every gene labeled by gene ID, and every promoter labeled by gene ID, and I join them agnostically against those signals. Without doing any sophisticated analysis, we can already learn something.
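The gene-ID join the speaker describes is a one-liner in practice; here is a minimal sketch, assuming one table of per-promoter mark classes and one of RPKM values. The file names ("promoter_marks.tsv", "rpkm.tsv") and column names are placeholders.

```python
# Join per-promoter mark classes with expression by gene ID, then summarize.
import numpy as np
import pandas as pd

marks = pd.read_csv("promoter_marks.tsv", sep="\t")   # columns: gene_id, mark_class
expr  = pd.read_csv("rpkm.tsv", sep="\t")             # columns: gene_id, rpkm

merged = marks.merge(expr, on="gene_id", how="inner")
merged["log10_rpkm"] = np.log10(merged["rpkm"] + 0.01)  # small offset for zeros

# Median expression per class: K4me3-alone should sit roughly a thousand-fold
# above K27me3-alone, with bivalent promoters in between.
print(merged.groupby("mark_class")["log10_rpkm"].median())
```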
Here is another example of looking at the raw signal without any peak calling. Instead of calculating the signal for every individual promoter, we can look at the coverage around the TSS and aggregate it genome-wide. Let me draw it on the board: we have our genome, genes here and here, a gene on the opposite strand this way, and say H3K4me3 marking them here, here, and here — all three genes carry the mark. I take the region ± 1.5 kb around each TSS and aggregate them into one combined profile, placing all of them on the same 3 kb ruler, genome-wide, and then I compare. This is a good metric for how a mark looks globally and a good way to compare how different marks behave near the transcription start site. Maybe five to seven years ago this type of plot would have been publishable; now it obviously is not — it is very easy to generate and not novel at all, but a few years ago it was. You can see that H3K4me3 has this very specific profile around the TSS: it has a nucleosome-free region, a dip right at the TSS — while a mark like K36me3, which typically covers the gene body, only starts there; and you can also see the relationship to input. All of this can be done from those profiles, the wig files.

And one more example of using the raw signal without any peak calling: we calculated the raw H3K27me3 signal in promoters across, I think, forty-plus different cell types here, and simply did unsupervised hierarchical clustering — you are familiar with that. We generated a vector over all coding genes, a vector of size 20,000, for every sample — many cell types, all from the NIH Roadmap Epigenomics project — and then looked at the clustering in that space. What you can see is that the cells cluster according to biology. We did nothing sophisticated — you could even just count reads in promoters straight after alignment — yet all the brain cells come together, all the blood cells come together, H1 and H1-derived cells come together, as do the iPS cells. At the very least you can treat this as a good QC: you have done it, and you can state that your data look right.

Now to actual peak callers. This is a tool we are developing called FindER — maybe we will talk a little more about it in the practicum; I think my time is almost up, so I have to finish soon. Some details on FindER: this is version 1; we now have version 2, which we hope to finish and publish soon. And then there is MACS2. There are many, many tools, as we said, and I present only MACS here, for two major reasons. First, if you search PubMed for MACS/MACS2 you get something like 2,000 hits — this tool has been cited an enormous number of times, and the rate does not really drop off, even though more and more people in the field are aware that the tool's behavior is sometimes odd. Nevertheless, it roughly does the job, especially in the older days when library quality was not so great. Second, you can actually run it: you download the tool and it runs, no problem. So, very quickly, what is the principle behind MACS? The assumption of a Poisson background. It wants input, but it can run in input-free mode as well. It estimates the fragment length — it is designed for single-end data, and for paired-end it also has a mode mimicking single-end — then shifts the reads on the positive and negative strands towards each other, and counts reads in every bin: for each 200 bp bin it calculates the read count. Then it looks at 1,000 bases around, 10,000 bases around, and genome-wide, and assuming a Poisson distribution it estimates lambda — the single Poisson parameter — for each window; the background lambda is taken as the maximum of the 1 kb, 10 kb, and genome-wide values. So you are local, but on the other hand rather conservative. Then it compares the actual read count in the bin to that background expectation. That is how the tool works.
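A toy sketch of that test, following the description just given — Poisson background with a conservative local lambda. This is illustrative code, not MACS source; the coverage values are hypothetical.

```python
# MACS-style local-lambda Poisson test for one bin, as described above.
from scipy.stats import poisson

def peak_pvalue(bin_count, mean_1kb, mean_10kb, mean_genome, bin_size=200):
    """P-value for the read count in one bin under the local Poisson null."""
    # per-base mean coverages from each surrounding window, scaled to the bin
    lam = max(mean_1kb, mean_10kb, mean_genome) * bin_size  # conservative: largest
    # P(X >= bin_count) under Poisson(lam)
    return poisson.sf(bin_count - 1, lam)

# hypothetical per-base mean coverages around one 200 bp bin with 60 reads
print(peak_pvalue(bin_count=60, mean_1kb=0.05, mean_10kb=0.03, mean_genome=0.02))
```

Taking the maximum of the three lambdas is exactly the "local but conservative" trade-off the speaker mentions: a bin must beat the worst-case local background, not just the genome-wide average.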
My almost-last slide is about comparing two or more datasets. The challenges: different depths of sequencing — we are at the mercy of normalization; different signal-to-noise ratios — again at the mercy of normalization; different read lengths, as already mentioned; different fragment-length distributions — even if the read length is the same, a different fragment-length distribution gives the data its own biases; and many other biases besides, sequence-specific ones, GC content, and so on. So we somehow need to harmonize the data, and for any big epigenomic project this is a challenge — and typically it means a loss of data, because you have to go to the lowest common denominator. The worst library dictates what you do: trim all reads to the shortest read length, sub-sample everything down to the most shallow library. That is how people do it, but there are compromises, and there is no ideal solution.

One possible approach, which is not frequently used but I think worth considering, is like a pediatrician: when a doctor has a baby sick with some unknown disease, the doctor typically compares the sick baby with a healthy one. Do the same: take one library and unambiguously call enrichment — identify enrichment — in that library alone, then use the other library only as a lookup. You define your regions of interest in one library and use the second library to look them up. That is the idea. And finally, just to second Martin's point: garbage in, garbage out. One piece of wisdom we learned the hard way is that with ChIP-seq, special care should be taken at all the QC and alignment steps, because if something went wrong in any preliminary step, the analysis step has no value. Thank you.