Okay, so we're going to move into the next module, module two. What I'm going to do here is catch up to where Edmund has taken us in terms of the alignment, and provide a little color commentary on some of the topics, file types, and parameters that Edmund discussed. Then I'm going to catch us up and move us into the actual ChIP-seq module, so MACS2. All right, so the learning objectives of this module are really to get into a little more of the conceptual detail around what's going on in sequence alignment. Then I'm going to define mapping quality, which is different from base quality but scored on the same Phred scale. We'll go into a little more detail on the SAM/BAM file that Edmund has already introduced. And by the end of this module, together with the work you've already done as part of the tutorial, you'll be able to perform a basic BWA alignment and understand the scope and purpose of some of the selected non-default runtime parameters. So where are we in the workflow? We're here now; we're getting ready to do the alignment. So what is the issue we're trying to solve computationally? It's really one of similarity searching. When we think about alignments, we can think about them in two main ways: global alignment or local alignment. A global alignment tries to find the optimal alignment that includes all the characters from the sequence you give it; Clustal, which some of you might be familiar with, is an example of something that provides a global alignment. The objective of a local alignment is to find the optimal alignment that only includes the most similar local region or regions within a particular sequence string.
Examples of that would be BLAST or, of course, BWA, which we're going to get into. When we think about local alignments, we can broadly classify them into two main groups. One is where you're trying to provide a ranked solution of all possible alignments of a particular sequence string to a reference, or many references. An example would be a BLAST search: say you had a sequence string and you wanted to see what the best match was in the NT nucleotide database at GenBank; you might use something like BLAST. The other is where you want to find the optimal, or best, solution for the alignment, so you're searching many, many sequences against one reference. That's the example we're going to work through today, and one tool that works in this framework is BWA. We talked about Bowtie, we talked about STAR, we talked about other algorithms that do the same thing. When we work in this space, statistics are important in sequence similarity searching. We need these statistics to discriminate real alignments from artifacts, so true positives versus false positives. We do this using an estimate of the probability that the match might occur by chance; in BLAST, of course, things like scores and E-values are used, but short-read aligners, BWA being one example, report a Phred-like mapping probability, as we'll discuss. Like anything we do in the wet lab or the dry lab, you've got to know your reagents. Changing the reference you use for your alignments changes the search space, and changes the statistics.
And in the context of ChIP-seq, it would also change the coordinate structure you're using. So you must know the reference build, the reference sequence, that you use if you want to do any downstream analysis. You can't be mixing data aligned, for example, to a previous patch of hg38 with the newest patch, and you can't use coordinates from the T2T genome against GRCh38, because the coordinates are going to be different. So the first thing you should ask yourself is: what's the appropriate reference? Then use that reference throughout your analysis. Obviously, crossing references, say, half your data aligned to one version of hg38 and half aligned to another, is problematic. Okay, so we talked a little bit about indexing, and I'm going to go through that briefly in the next few slides just to give you a conceptual framework for what the heck is going on under the hood. I find that can be helpful as you're going through this. This concept of indexing is really all about increasing the speed of the alignment. Think about the problem here: we have a read, let's say 100 nucleotides in length, and we're trying to align it to the human reference, which is 3.2 billion nucleotides in length, summing up all the chromosomes. We could start at chromosome one, position one, see if it matched there, then move to position two, position three, position four; that would take an enormous amount of time.
It would be highly accurate, but it would probably never complete. So how can we speed this challenge up? The way that's really been solved for short reads in the last few years is by generating these indices, and I think someone put this in the Slack channel: you can really think of it like the index of a book. I think that's a great way to think about it. I made a note down here that there's a Burrows-Wheeler transform used during the indexing; I'm not going to get into that, but if you're interested in how this compression works, you can follow this YouTube link. This is a super toy example just to get your head around what we're talking about. When we're generating an index, here's our reference sequence, and we just talked about chromosome 19. In bioinformatics, when we count reference strings, they can be either zero-based or one-based, and that's why I'm using zero as the first position here. You need to know whether you're zero-based or one-based, and that depends on the browser you're using. But nevertheless, we basically start from the beginning of the FASTA file, in this case, say, the first position of chromosome 19. When we're generating an index of, let's say, 4-mer words, so four-nucleotide-length strings, we can compress the reference sequence into a table where each 4-mer is listed and the positions at which it occurs in the string are recorded. And you can see right away, even in this very toy example, I was able to compress position zero and position six, which have exactly the same sequence string, into a single entry. So now, instead of storing all six positions separately, the table is smaller, and hopefully you can appreciate how this would scale.
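To make the indexing idea concrete, here's a toy sketch in Python (my own illustration, not how BWA actually stores its index; BWA uses a Burrows-Wheeler transform/FM-index rather than a plain hash table):

```python
def build_kmer_index(reference, k=4):
    """Map every k-mer in the reference to the 0-based positions
    where it occurs."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index

# The repeated 4-mer collapses into a single table entry with two
# positions, just like positions 0 and 6 in the slide's example.
index = build_kmer_index("ATTAGCATTA", k=4)
print(index["ATTA"])  # -> [0, 6]
```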
If you're looking across the human genome at a short enough word length, the words will be highly repetitive within this table. So then how do we actually do the alignment? Now that we've generated our index, and this is exactly what we're doing when we run bwa index, we take our sequence string and extract the first 4-mer (in this toy example) from the five prime end of the sequence read, and then look it up in the table. You can see in this toy example that ATTA matches position zero and position six, so now I know that my sequence read aligns to the reference at either position zero or position six. So here it is: I'm either here with respect to the reference, or I'm here with respect to the reference. I've done the seed matching to the index. The final step of the seed-based aligner is to extend the read. I know where I am in the reference, so I go back to my reference and say, okay, what's the next base, the next base, the next base, and I see, as I extend my sequence, where it aligns optimally with respect to the reference. You can see in this toy example that it actually aligns to position six, and so that's the position the aligner selects as the optimal solution for the alignment. So that's a toy example; hopefully it's a little useful. Practically, how does BWA do this? For BWA aln, the module commonly used for shorter reads in BWA, the seed is a 32-mer: it generates a 32-mer index table; for MEM the default minimum seed length is 19. Then essentially we go through the same process: we take the first 32 bases of your sequence read and align them to the table. And we allow for up to two mismatches in the seed region; that's just a default parameter in BWA, which you can actually change.
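Here's the seed-then-extend lookup sketched the same way (again a toy illustration: I'm simply counting mismatches over the whole read, whereas real BWA works on an FM-index and scores the extension properly):

```python
def build_kmer_index(reference, k=4):
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index

def seed_and_extend(read, reference, k=4):
    """Seed with the read's first k-mer, then extend each candidate
    position and keep the one with the fewest mismatches."""
    index = build_kmer_index(reference, k)
    best = None  # (mismatch count, position)
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[0]:
            best = (mismatches, pos)
    return None if best is None else best[1]

# The ATTA seed matches positions 0 and 6, but extension fits perfectly
# only at position 6, so that's the alignment this toy aligner reports.
print(seed_and_extend("ATTAGG", "ATTAGCATTAGG"))  # -> 6
```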
The read is then extended from where the seed matches the reference, up to some set threshold, and we use the base qualities as a measure of how far to extend. By default, it's essentially the sum of the mismatched bases' base qualities that decides how far to extend the read, and the read can be truncated after that threshold is reached. So let's say we started extending with a read-extension threshold set at 100. If we had two mismatches with a Phred score of 40 each, that would equal 80, but a third mismatch with a Phred score of, say, another 40 would exceed the threshold, and the read extension would be truncated after that position. There are two types of truncation recorded in BAM files: soft clipped or hard clipped. Soft clipped just means it's clipped conceptually in the file, and it's actually just lower case; hard clipped means the sequence string is actually removed. Hard clipping is not typical; most typically, reads are soft clipped. Okay, and once that's done, the read is assigned a mapping quality. So what is a mapping quality? It's very similar to a base quality, except it's with respect to how confident the aligner is that the read is mapped in the correct place: the probability that the read is misplaced in the genome. So if I place a read with a mapping quality of 20, one time out of 100 that read is going to be misplaced; mapping quality 30, one time out of 1,000, and so on. The mapping qualities themselves are derived from the base qualities and the number and frequency of mismatches for the best alignment versus all other possible alignments. You can imagine: I align a read to a particular position in the genome, and there's another position where the read can also align, but with one more mismatch, and the base quality at that mismatch is slightly lower.
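The truncation rule described above, summing the Phred base qualities at mismatched positions and clipping once a threshold is exceeded, might look something like this (assumed logic for illustration only; BWA's actual extension scoring is more involved):

```python
def extend_until_threshold(read, ref_window, base_quals, threshold=100):
    """Walk the read against the reference, accumulating the Phred base
    qualities of mismatched positions; return the index where extension
    stops (the full read length if the threshold is never exceeded)."""
    penalty = 0
    for i, (base, ref_base) in enumerate(zip(read, ref_window)):
        if base != ref_base:
            if penalty + base_quals[i] > threshold:
                return i  # everything from here on would be soft clipped
            penalty += base_quals[i]
    return len(read)

# Two Q40 mismatches (sum 80) fit under a threshold of 100, but a third
# would push the sum to 120, so extension stops at that position.
print(extend_until_threshold("AAAAAAAA", "TAATAATA", [40] * 8))  # -> 6
```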
Then it's going to rank the position with the slightly lower mismatch base quality as having the higher mapping quality. Again, these are reported on the Phred scale, and here's the primary literature that discusses how they're computed in more detail. What's important to know, and probably the most common use of mapping qualities, is with respect to repetitive alignments. As you probably know, the majority of the genome is actually repetitive sequence, and that causes challenges when we do alignments: if a read can align to two places in the genome with identical sequence for the length of that sequence string, what do we do with it? Those reads essentially get assigned a mapping quality of zero, so in the BAM file they'll have a MAPQ value of zero. Typically we then filter to remove these repetitive alignments, and typically that's done using the bit flags, as Edmund just discussed, to remove sequence reads that are multi-mapped in the genome, although there are other strategies you can use. In some cases the multi-mapped reads are assigned one position on the genome randomly, and then they can be included in downstream applications, but in most cases we strip out these multi-mapped reads. Okay, so I've shown you this before: FASTQ sequence plus reference into BWA, and the output is your SAM file. You've already been introduced to BWA, but if you're interested, this is the primary paper, and the manual and the GitHub repo are listed there. There are two main modules: aln and mem. aln is for reads shorter than 75 nucleotides, and when I first started in ChIP-seq, that was the vast majority of the data we generated; actually, all the data we generated was less than 75.
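One quick aside on the Phred scaling of mapping qualities before we go further: because MAPQ is Phred-scaled just like a base quality, converting it back to a misplacement probability is a one-liner.

```python
def mapq_to_error_probability(mapq):
    """Phred scale: MAPQ = -10 * log10(P(read is misplaced)),
    so the inverse is 10 ** (-MAPQ / 10)."""
    return 10 ** (-mapq / 10)

print(mapq_to_error_probability(20))  # -> 0.01 (1 in 100)
print(mapq_to_error_probability(30))  # -> 0.001 (1 in 1,000)
```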
But over the last five or six years that's really changed, and now the vast majority of data is greater than 75 nucleotides, so mem has really become the default. The difference between aln and mem, as I introduced, is really just differences in seed length and in how gaps are handled in the alignment. If you're interested in how the gaps are handled, I'd refer you to the manual; I'll just add that indels in short reads are notoriously difficult to deal with and are still problematic today. There's also another module for much longer reads, although I've never actually used it. In terms of how the alignment is run, and again, you've been introduced to this, I'm going to go through it quickly. The first step is to index your reference, shown here: bwa index, your reference sequence, and this is a FASTA file. That FASTA file can contain one chromosome, so basically the greater-than sign and the name of the chromosome, or it can contain 10 chromosomes, or 20, or 23 or 24; it can contain as many entries as you want. The second step is to use your index and give it your reads, in this case your paired-end reads, read one and read two, remembering that these two files need to be in the same read-name order for BWA to work with them. It then outputs a SAM file or BAM file depending on the runtime parameters you use. One thing to keep in mind, and I can't remember where Edmund said this, is that BWA only needs the path to the reference FASTA file you used to generate the index, but that reference FASTA file must be in the same directory as all the index output files. So when you ran the index command, you would have done it inside a directory.
Then when you point BWA to it, you point it at that original FASTA file, but the FASTA file must sit within the directory containing all the index files for it to run; otherwise it'll throw an error. Oh, and here's the manual for you. Okay, so we've done that; no need to go through it again. Edmund talked about reference genomes; here are two sources. We use the UCSC Genome Browser reference genomes. The other major source is the NCBI, and I want to briefly go through that just to introduce one other concept. There are many reference genomes available at the NCBI; you can use the genome search and look up, for example, Homo sapiens, and it'll take you to a page that looks like this or similar. You can select your reference genome, and this is GRCh38 patch 14, which is likely, or I believe is, going to be the last linear representation of the original reference human genome. We're now transitioning to a new world, which we won't talk about today, where we move from linear representations of genomes to graph-based representations. This is work that's been pioneered, for example, by the T2T consortium working closely with the HPRC, the Human Pangenome Reference Consortium. I imagine that in the next five years we may be transitioning to these graph-based representations as opposed to linear ones, and I'm sure that will be the topic of a workshop in the future. So you can go to that link and it will take you to the main page. I'm going to give you accessions for the assemblies, and then you can go and look on the FTP site. This gives you a list of the various file types. What I like about NCBI is that it also gives you all of the annotation files, so the GFF files, which are basically gene (or protein, or whatever you're interested in) coordinates on the same genome build.
Obviously, the gene annotations are also specific to a particular build, so you need to make sure that whatever resources you're using are aligned with the genome you're working with. And again, here's the FASTA file for the entire genome, so this is all the chromosomes you'd be working with. You can just download this one file, it's about a gig in size, then go through that bwa index process, and you've generated an index for the entire genome. It's that simple; exactly the same as we did for chromosome 19. And then, one of the reasons I went through this is to point out something you might not be familiar with: the MD5 checksum. So what is an MD5 checksum? It's a way of uniquely identifying, or generating a digital fingerprint for, your file. I can't tell you how many times I've downloaded a reference or some dataset from the internet and had the connection drop so that not all of the file was downloaded. No error is thrown; the file is just sitting there on my file system, but it's truncated, missing some part. You need to check for that; these files are huge and sometimes the download takes a long time. MD5 checksums provide one mechanism to do it; Edmund introduced others, and obviously looking at the file size and things like that is also important. The checksum tool is in your bash shell: you run the command md5 (or md5sum), give it the name of the file you downloaded, and it computes an alphanumeric string. Then you look at this file here, which basically has a list of all these checksums, and you check that your string matches the string that's there; that tells you the file is identical.
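If you'd rather compute the checksum from a script than from the shell's md5/md5sum, Python's hashlib does the same thing; a small sketch (the file name in the comment is just a placeholder):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading in chunks so a
    multi-gigabyte reference FASTA never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# You would then compare the digest against the string listed in the
# checksum file on the FTP site, e.g.:
#   md5_of_file("reference.fa.gz") == digest_listed_on_ftp
```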
And if you decompress the file, the checksum will be different, so it has to be in the compressed format to match the checksum that's presented there. A very useful tool, and one I would highly recommend you get to know. Okay, so once we've downloaded our reference and indexed it, we align our reads; you know this now from going through Edmund's material. I did want to take a brief minute to talk about how read one and read two are actually leveraged in short-read alignment; I've gone over the algorithm for you. Just so you understand: BWA aligns each of the reads independently, and following alignment, it then pairs the reads. So how do we actually use the read pairs in a way that's useful? That's shown here, and I apologize for the pencils; the pencils are from my class, where I put them on slides the students need to understand, and I think they bled over into this presentation, so you can just ignore them. But essentially, the issue is this. This is an example where read one, so one read of the pair, aligns to a repeat in the reference that's present in two different places. At this point, if we had no other information, we would have to flag read one with a mapping quality of zero, because it's multi-mapped, and remove or filter it from downstream analysis. However, if we have the read-two alignment, so the mate or pair for that read, and that read two aligns to a unique place in the genome that is within some specified distance of read one (typically the mean insert length of the library that you've sequenced, I think plus one standard deviation, I can't remember exactly), then the alignment for read one can be rescued, because now we know that read one is actually in this repeat and not that repeat.
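Here's the rescue logic sketched as code (my own simplification: a fixed insert-size window rather than BWA's learned insert-size distribution, and all names and positions are illustrative):

```python
def rescue_multimapped_read(read1_positions, read2_position,
                            mean_insert=300, sd_insert=100):
    """Read 1 hit several identical repeat copies; keep the copy that
    falls within the expected insert-size window of the uniquely
    mapped read 2, if exactly one copy does."""
    window = mean_insert + sd_insert  # mean plus one standard deviation
    candidates = [pos for pos in read1_positions
                  if abs(read2_position - pos) <= window]
    return candidates[0] if len(candidates) == 1 else None

# Read 1 aligns equally well at 10,000 and 2,000,000; read 2 maps
# uniquely at 10,250, so only the first copy is consistent with the pair.
print(rescue_multimapped_read([10_000, 2_000_000], 10_250))  # -> 10000
```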
So we're able to rescue this alignment, and that gives us a little window into alignments in repeats. Guillaume and many others have leveraged this feature to ask questions about the epigenetic states of repeat regions where all we have is short-read data, because it allows us to ask what's going on with read one with respect to its mate. Overall, is that a big improvement in terms of alignment? A little bit: you're able to rescue about 2% of reads at 75 nucleotides in length by using this read-pairing strategy. Okay, so you've seen this to death; I won't show it again. Now that we've done our alignment, we've generated the SAM file. What is the SAM file? Edmund gave you an introduction; I'm going to go into a little more detail to give you a better understanding. SAM stands for Sequence Alignment/Map file, and it's basically a generic format that almost all short-read aligners, maybe all of them, output. This is a great standardization that has developed within the genomics community over the last 10 years; it allows us to work together and allows tools to operate interoperably on the output of the alignments. The details are here: this is the primary publication that first described the SAM format, and these are the authors who drove that process forward. So this is what it looks like; you've already been introduced to this, but if you opened up your SAM file, you'd see something like this. How do we break it down? It's basically a tab-delimited file, where each column has a specific feature associated with it.
In case that's not already clear to you, a BAM file is identical to a SAM file; it's just the binary format of the SAM file. Instead of storing it as a text file, we store binary values, because that's how a computer thinks, and it means we don't have to interconvert from text to binary, do some computation, and then go back from binary to text. By keeping it in a binary format, (a) it's smaller and (b) it's the way your CPU thinks, so it saves time and space. It's a good thing to do. You can't read it, because it's binary, not text, so you need to convert it back to text using samtools or Picard or whatever you're using to be able to read it again in your bash shell. If you just try to open a BAM file, it's going to give you a bunch of gibberish, and you have to use a tool to convert it back to text, if that's not clear already. Edmund introduced the header, and I can't tell you how many times, when we provide data to collaborators, they've asked me what genome build the data was aligned to. Of course, that information is in the header, so the header is a super useful component of the BAM file. You should get familiar with your header. When you get a file, let's say a BAM file, take a look at the header, because the header has information such as what reference was used, what the reference length was (there can be an MD5 checksum in there), what program was run, whether it's been duplicate-marked, what kind of sorting was used, those kinds of things. So the header can be very, very useful. One word of caution: if you just use samtools view, for example, it'll strip the header off.
So if you do samtools view and write the output to a new file, you'll have a new SAM file without a header, and that won't be good, because downstream tools could choke on it. You have to make sure that when you're viewing your files and doing some kind of manipulation, you keep the header intact. Okay, so here are the 11 columns; I'm going to go through them briefly just so you can understand them. I'd encourage you to do this with the BAM files you've generated: open up one of them, input or K27 acetylation, take a look yourself, and try to get your head around what you're looking at. The first column here is basically a string that gives you the query name; this is the same name I showed you in the FASTQ file, right? So you've got your instrument, your lane, your tile, your x and y coordinates, and we know these two reads in the red box are paired because they have the same x-y coordinates, right? So those are paired. In the primary output of BWA, the reads will be adjacent as pairs, because that's how they come through the alignment process; the read pairing happens at the end. But of course this will not be true if you coordinate-sort, because you can have read pairs that span each other, so they'll no longer be paired up the way they are straight out of BWA. And here's another example of a read pair. The next column is a super useful one: the bitwise flag that Edmund introduced, which is used quite extensively for filtering and for generating subsets of data from the BAM file. Essentially it's computed as a sum of bits, and the bits each have meaning. So here is the breakdown of what the bits are.
I think there was a little bit of discussion in the Slack channel about this, but this gives you the full description of how these flags are generated. And just if you're interested, this is actually how it's computed: for a flag of 83, it's 1 (read paired) plus 2 (read mapped in proper pair) plus 16 plus 64, which adds up to 83; that's where the 83 comes from. Similarly, the 163 is summed the same way. So it's basically just summing these values for each of the bit flags, and they can be used in a variety of different ways to select reads; that's what I'm saying here, and I think Edmund has already introduced it to you. For example, here's a bit flag: you can use 3844 to filter out this set of reads, including chastity-failed reads, which we didn't talk about. Chastity filtering is essentially a quality filter that was used on older instruments; it's not really a big thing now, but it used to be a big problem on older systems. Very briefly: it filters out polyclonal clusters, clusters that are nucleated from two different DNA sequence strands. When the sequencer reads such a cluster, it's obviously problematic, because the signal is actually derived from two independent sequence strings, and there was a filter called Chastity that was developed to remove those. You might see those flags in some of the data you download from the SRA, for example; that's what they are, and typically those reads are filtered out. So you can use these flags to filter out reads; it gives you a quick way to do it. What else do I want to say? I think, again, Edmund talked about secondary alignments.
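The flag arithmetic is easy to play with in code; the bit meanings below are taken from the SAM specification:

```python
# Bit meanings from the SAM specification.
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "secondary alignment",
    512: "fails platform/vendor quality checks (e.g. chastity)",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}

def decode_flag(flag):
    """Return the names of the bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(83))
# -> ['read paired', 'read mapped in proper pair',
#     'read reverse strand', 'first in pair']

# The common exclusion filter 3844 is just the sum of the five
# "bad read" bits:
print(4 + 256 + 512 + 1024 + 2048)  # -> 3844
```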
You can set the parameters in BWA to store more than one alignment, so you can store 10, or all, of the secondary alignments in your BAM file if you're interested. Then, for each FASTQ entry that aligned multiple times, let's say it aligned equally well in 100 different places, that read would be represented 100 times in your BAM file, and those records would carry the secondary alignment flag. Then there are supplementary alignments, which are a little more tricky, but basically they're where non-overlapping parts of a sequence align to multiple locations. So those are just some of the additional cases handled by the bit flag. The next column provides the reference sequence name; in this particular case it's a bit redundant, but it's there. Then the position with respect to that reference: that would be, say, chromosome one and the position on chromosome one. The next column is the mapping quality, so here it's 29, 29, 37; a MAPQ of 37 is roughly one-in-5,000 odds of being misplaced. And here you can see an example of a mapping quality of zero: a read that, given its string, which you can see is a bunch of Ns here, was unable to be uniquely aligned to the genome, so it's given a mapping quality of zero. So there's a toy example of that. The next column is the so-called CIGAR string; you may have heard of that. The CIGAR string is not particularly useful for ChIP-seq, but just as an FYI, it's used to encode information about the aligned segment of the sequence. The descriptor is here, and it's basically telling you how many nucleotides of your sequence string align to the reference genome.
This was an Illumina sequencing run that was 250 base pairs in length, and this is telling you that all 250 base pairs of that read matched the reference. Here's one that's a little more tricky: 4 bases soft clipped at the beginning, 202 matches, and then 44 bases soft clipped at the end. That's how the information is stored in the CIGAR string, and again, here's one that shows 250 matches. How is this actually computed? Again, this is a good toy example to get your head around what's going on: you've got your reference sequence, you've got your read, and the read is aligned to the reference. In this particular case there's an insertion and a deletion. So the CIGAR string that results from this alignment is 3 matches, an insertion with respect to the reference, 3 more matches, a deletion with respect to the reference, and then 5 matches. So that's a toy example of how the information is stored; typically for ChIP-seq you wouldn't use it, but it can be useful in other strategies. Following that, the next column (and for some reason this box is really big) gives you the position of the alignment of the mate, so the pair. You can see, if you look across here diagonally, that this value is the same as that value, as it should be, because these reads are paired. And then the next column, the observed template length, is a value that's used downstream: it gives you the length of the alignment with respect to the reference, and I'll talk about that briefly.
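Parsing a CIGAR string is simple pattern matching; here's a sketch using the toy 3M1I3M1D5M alignment and the soft-clipped read from above:

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op)
            for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

print(parse_cigar("3M1I3M1D5M"))
# -> [(3, 'M'), (1, 'I'), (3, 'M'), (1, 'D'), (5, 'M')]

# Operations M, I, S, =, and X consume read bases, so their lengths sum
# to the read length: 4 soft clipped + 202 matched + 44 soft clipped = 250.
read_len = sum(n for n, op in parse_cigar("4S202M44S") if op in "MIS=X")
print(read_len)  # -> 250
```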
I want to point out this one, where you can see that although the reads are mated, the second read didn't align to the genome and was given a mapping quality of zero. And so the position of the one read of the pair that did align is given to both mates — that's just how BWA handles it. So what are these values, −532 and +532? Well, they're calculated like this. One of the features of the SAM file that you should be aware of is that it stores the leftmost position of the read. I'm deliberately not saying the 5' end or the 3' end, because it's the leftmost with respect to the reference that you're using. So let's say this is our reference and this is position 1: the position we store in the SAM file is the lowest coordinate. That's on the 5' end of this read and on the 3' end of this other read — it obviously depends on what strand you're on. So to compute the total length of the insert, you take the leftmost position of the rightmost mate, add the number of matched bases — in this particular case 250, because it matched 250 — and then subtract the leftmost position of the other mate. That gives you the observed template length, which in ChIP-seq is the length of the DNA fragment that was associated with whatever protein you're immunoprecipitating. Okay, so that was essentially what I wanted to go through. Obviously the other two columns are the sequence itself and the base qualities. And then there are some additional fields shown here — if you're interested, I'd take a look at the manual — optional fields where additional information about the alignment can be stored.
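The TLEN arithmetic just described can be written out explicitly. The coordinates below are made up for illustration; only the resulting ±532 mirrors the example on the slide:

```python
# Sketch of the TLEN (column 9) arithmetic for a proper pair on the same
# reference: (leftmost position of the rightmost mate + its matched
# bases) minus the leftmost position of the other mate. Coordinates are
# illustrative, not from a real alignment.
def observed_template_length(left_pos: int, mate_pos: int, mate_match_len: int) -> int:
    return (mate_pos + mate_match_len) - left_pos

# mates whose stored positions are 282 bp apart, each matching 250 bp
print(observed_template_length(40_893, 41_175, 250))  # 532
```

The leftmost mate records +532 and the rightmost mate records −532, so the pair's records sum to zero — a handy sanity check when eyeballing a SAM file.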
So now that we've got a better understanding of the alignment process — what a SAM file is, what a BAM file is — we're going to go through the ChIP-seq analysis workflow. So we're here in the workflow: we've got our sorted BAM file that Edmund went through. Now we're going to go into MACS2, and we're going to generate so-called BED files or bedGraph files. This is the last new file type you'll be introduced to as part of this module, and it's how we store the information that lets us visualize the data or do differential analysis, as Edmund will be talking about in a later module. Why do we use MACS? MACS works. It's been highly used in the community, was developed by Shirley Liu's group and Wei Li as part of the ENCODE consortium effort, and last I checked it had over 8,000 citations. It really has become a default. There is a newer version, MACS3, which I haven't really played with; what we do is still in MACS2, and that's the peak-calling module that's within the IHEC consortium container as well. ENCODE uses MACS for the majority of their peak calls too. Okay, so how does MACS work? Again, what is the objective? The objective of this experiment is to identify regions of enrichment in your IP compared to a control. In the nomenclature of MACS, we call the IP the treatment and the input the control. Recall that the input is typically the material that you sheared prior to subjecting it to immunoprecipitation. So you're doing your experiment — let's say you used MNase to chew up the genome — you would take a little aliquot of that MNase-digested material and build a library out of it, and that would become your input; you take the rest of it and do an IP, and that becomes your treatment.
Okay, so I'm going to go through the steps. When you provide MACS with a BAM file, it will scan across and look for significantly enriched bins. A bin is simply a region of the genome — it could be 100 base pairs, 200 base pairs, 1,000 base pairs — an arbitrary size; in this particular case it's two times the average fragment size. How does it compute fragment size? It uses the observed template length that we just talked about to generate an average fragment size. It then looks for bins where the number of reads counted within that bin — and this is for your treatment — is m-fold higher than the random genome average, which is basically what you'd expect from taking those same reads and scattering them randomly across the genome. It takes a thousand of these randomly chosen enriched bins and uses them to build the model, as I'll explain in a little more detail: the distribution of the reads on the negative strand and the positive strand, and from that it computes something it calls d. It then uses that d value to shift all the reads depending on what strand they're on — on the positive strand it adds d over two, and on the negative strand it subtracts d over two. Okay, and I'll explain that in just a minute. It then scales the control experiment to the same number of reads as the treatment, and this is important because you can imagine that if you're using your control experiment to estimate background and your control is sequenced much, much deeper — let's say the control was sequenced to 30× coverage but your treatment is only 10× — then nothing is going to look enriched, because the vast majority of the genome is going to sit below the control signal, except maybe blacklisted regions.
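The model-building scan can be caricatured in a few lines. This is a loose sketch of the idea described above, not MACS2's actual implementation, and the counts are toy numbers:

```python
# Loose sketch (not MACS2's real code) of the model-building scan: given
# per-bin read counts, flag bins whose count is at least m-fold above
# the genome-wide average, as described above. Counts are toy numbers.
def enriched_bins(counts, genome_mean, mfold):
    return [i for i, count in enumerate(counts) if count >= mfold * genome_mean]

# five bins, genome-wide mean of 3 reads per bin, m-fold threshold of 10
print(enriched_bins([2, 3, 40, 50, 2], genome_mean=3.0, mfold=10))  # [2, 3]
```

MACS2 then samples a thousand such bins to estimate the fragment-level model rather than using every enriched bin, which keeps the model building fast.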
So your control must be down-sampled to the same number of reads as the treatment. But there's no way to inflate your control, and that's why it's important that your control is sequenced to at least the same depth as your treatment for the MACS2 algorithm to work appropriately. If you give it a control that has been sequenced to a much smaller number of reads, it's going to underestimate the background, and that's going to give rise to many more false positives, for the reasons I'll show you in just a minute. Okay, so what's all this repositioning nonsense? Well, it's really to account for this effect. We have a protein–DNA interaction that we're interested in assaying — let's say, in the epigenetics space, H3K4 trimethylation. When we shear our genome and then add our adapters, the reads have an orientation with respect to the DNA from which they were generated. Because this orientation is retained — when we sequence from one end versus the other end we're on different strands of the DNA — we end up with a set of reads from the random shearing on one side and another set on the other side, each with an orientation associated with it. So if you look at a density plot of where those reads align, you get a bunch of reads on one side, a bunch of reads on the other side, and no reads in the middle, because of course that region is sterically occluded or blocked by the presence of the protein — and in our case the protein is on the DNA, which is obviously not an area that's going to be subject to either tagging or digestion. You can plot the distribution of these read pile-ups on each strand.
That's shown here, and then you can compute the distance between the peak on one strand and the peak on the other strand. This gives you d — in this case, I can barely read this on a very small laptop, but I think it says d = 126 or 120. That's the shift value, the d value, that's used to shift all the reads such that they become positioned at the center of the protein–DNA interaction. Okay. So once we've calculated d and done our shifting, we then actually go into the peak calling, and for the peak calling we have to make an assumption, because we need to assign a statistical confidence to our enrichment. MACS2 assumes a Poisson distribution for the read-count distribution. You're probably familiar with other tools that use different distributions — we can certainly have a discussion about that — but MACS2 uses a Poisson, and I'll get into why in a minute. So you essentially break the genome into 2×d bins: if d was, say, 125, you'd be searching the genome in 250-base-pair bins. For each bin you calculate the mean number of sequence reads — you're looking at the read starts from your IP, in this case the treatment, that align within it. That's basically the number of reads divided by the length of the bin in nucleotides, which gives you the mean read count for that bin. And then you calculate the mean number of sequence reads for your input, which is your control.
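The shift step above is simple enough to write down directly. The positions and the d of 126 are illustrative:

```python
# Sketch of the d-shift described above: plus-strand read starts move
# forward by d/2 and minus-strand read starts move back by d/2, so both
# strands pile up at the midpoint of the protein-DNA contact.
# Positions and d are toy values.
def shift_read(pos: int, strand: str, d: int) -> int:
    return pos + d // 2 if strand == "+" else pos - d // 2

print(shift_read(1000, "+", 126))  # 1063
print(shift_read(1300, "-", 126))  # 1237
```

After shifting, the two strand-specific pile-ups that flanked the binding site converge on the same coordinates, which is what makes the subsequent bin counting meaningful.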
You do this for the bin that you're looking at, plus 1 kb, 5 kb, and 10 kb bins around it — those are fixed bin sizes; they're not dependent on d. Essentially, you're trying to see where, near the bin you're looking at, you have a lot of input signal — where is the maximum input signal in these larger bins — remembering that this is a mean: the number of reads that align within those bins is divided by the length of the bin, so you're not going to get really inflated values. It gives you a way of sampling the local background across that space. The maximum value among the bin you're looking at and the 1 kb, 5 kb, and 10 kb bins becomes your Poisson lambda value. Okay. And that's the one parameter you need to compute a Poisson distribution, as I'll talk about in just a minute. So the control mean becomes the Poisson lambda for that bin. Using the Poisson distribution with that lambda, you calculate the p-value for your IP mean, and then of course, because we're doing many, many tests across the genome — in fact hundreds of thousands of tests — we need to correct for multiple testing; in the case of MACS2 it uses Benjamini–Hochberg for the correction. So what the heck am I talking about — what is lambda, what is all this stuff? I'm assuming most of you are familiar with distributions. The Poisson distribution is one type of distribution where the only parameter you need to know to compute it is lambda, and from that you can compute a distribution for that particular bin. So how do we use that? Essentially, let's assume we had a bin where your treatment bin had a mean number of 16 reads aligned within it.
And then you had a control bin with, let's say, a lambda value of 10 — the mean number of reads aligned within the control bin was 10. You would take that lambda of 10, plot your Poisson distribution, take your treatment value, see where it falls in that distribution, and from that calculate the significance of the enrichment. That's how we assign p-values to each of the enriched bins in the genome. And to correct for multiple testing, as I just said, MACS2 uses Benjamini–Hochberg, which is essentially a rank-based strategy: it ranks the p-values and then adjusts for the number of tests done in that particular experiment. The output of that is a q-value, or FDR value. When you're doing any work downstream with MACS2 you should always be using the q-value, not the p-value, if you want to assign significance. By default, the reporting threshold for MACS2 is 0.05 FDR — that is, a q-value of 0.05. So what does MACS2 require to actually run? Again, you're going to go through this with Edmund, so I don't need to spend a lot of time on it, but the minimum it requires is the IP — in the context of ChIP-seq, that's your IP. Remember that for ATAC-seq we don't have an input, so we run it with just the treatment alone — just the ATAC-seq data. The treatment file is the only required parameter, as I say here. The file can be a BAM file, which is what you're going to be using for ChIP-seq, or it can be in BED format — I'm going to talk about what BED format is in just a minute.
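The two statistical steps above — the Poisson upper-tail p-value for a bin, and the Benjamini–Hochberg correction that turns ranked p-values into q-values — can be sketched as follows. The counts (16 treatment reads against a control lambda of 10) are the toy numbers from the lecture; MACS2's internals differ in detail:

```python
from math import exp, factorial

# Sketch of the bin-level statistics described above; toy numbers only.
def poisson_pvalue(k: int, lam: float) -> float:
    """Upper-tail Poisson probability P(X >= k) for background rate lam."""
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

def bh_qvalues(pvals):
    """Benjamini-Hochberg: scale each ranked p-value by m/rank, then
    enforce monotonicity walking down from the largest rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # from the worst p-value downwards
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        qvals[i] = running
    return qvals

print(round(poisson_pvalue(16, 10.0), 4))   # ~0.0487: borderline before FDR
print(bh_qvalues([0.01, 0.04, 0.03, 0.5]))  # adjusted q-values, same order
```

Note how 16 reads over a background of 10 is only marginally significant before correction — which is exactly why the q-value, not the raw p-value, is what you should threshold on.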
And BED format is what's used in the ENCODE pipeline for ATAC-seq, because of the adjustment for the Tn5 digestion that I talked about in module one. The other thing you need to know is the effective genome size, although MACS2 has those pre-computed as well, so you just have to tell it what genome you're using. There are seven major functions in MACS2; we're only going to be covering callpeak in this lecture. If you're interested in learning about some of the other functions, you can issue the -h flag and it'll give you the full list. callpeak is the main function in MACS2, and you call it like this — this should be second nature to all of you by now: on the command line, macs2 callpeak. That will give you the set of options, and I'm just going to focus on a few of the most commonly used ones in the next series of slides. You can see, for example, that you can set your own q-value or p-value threshold if you wanted to; you can force the extension size of your reads, force the bin sizes, etc. In terms of the input file options: -t is the option you use to specify the treatment data file — again, the only required parameter. -c points to the path of the control, as Edmund will go through. -f tells it the format of the input file, but you don't need to set this — it will decide automatically based on the extension. And -g is the mappable genome size I just talked about; I think you can also specify the genome build, and it has the effective genome size built in. Just to be clear, the effective genome size is the amount of the genome that's alignable in that particular genome build.
You may or may not know, but for hg38, for example, there are about 200 million bases annotated as N in the genome, so the effective genome size is obviously lower than the full assembly length. So what are the output arguments? You need to tell it where to put the output, and you can add a prefix for the output files so that you can control your file naming. One of the most commonly used parameters is -B (or --bdg, which does the same thing). Basically, that stores the fragment pileup for the treatment and the control lambda, and computes and stores the negative log10 p-values and negative log10 q-values in bedGraph files, as I'll talk about in just a minute. Peak-calling arguments: again, the default q-value threshold is 0.05, but you can adjust that, and you can also adjust the p-value. And then, as Edmund will talk about, the other major argument that's probably worth knowing is --broad. By default, MACS will run in narrow-peak mode. Narrow-peak mode works well for punctate marks like H3K27 acetylation or H3K4 trimethylation, but for marks that occupy larger contiguous segments of the genome it's kind of problematic, because of the way the MACS2 peak caller works — as you've seen, it scans across with bins. So sometimes you get regions where, let's say, you had a megabase marked by H3K27 trimethylation, but there was a dip in alignment — say there was a repeat sequence there, or for whatever reason fewer reads accumulated at that position — and when MACS assesses the bin in that region it says, okay, that region is not significant, so it doesn't call it.
So you can get these kind of chunky blocks of peaks within these large segments which you actually know, based on the biology and how the mark is propagated in the genome, are contiguously marked regions. Broad mode essentially calls the peaks using the narrow-peak strategy — using the binning approach we talked about — and then stitches the peaks together based on some parameters: for example, if two peaks are within some distance of one another, it will stitch them together, call that entire segment as enriched, and assign a q-value to the whole segment. So that's, at a very high level, the difference between narrow and broad. Broad isn't a perfect solution to the problem, because there are lots of nuances in how the stitching is done. So it's something you need to get familiar with — whether you want to run your analysis in broad or narrow mode. Edmund will go through both in the tutorial, so you'll get a feel for how each is done. Typically for H3K27 trimethylation we do use broad mode, as we'll do in the tutorial. Okay, so what are the outputs of MACS2? You get a series of files, and I've highlighted some of them here. You'll get a file of the peaks — these are the regions of enrichment. You'll get a tabular file of the peaks, basically just a column-delimited file. You'll get two bedGraph files, which are similar to the BED file as I'll discuss — one for the treatment and one for the control lambda. And then you'll get files for the broad peaks or gapped peaks, depending on whether you've run in broad or narrow mode.
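The stitching idea behind --broad can be sketched like this — a simplification of what MACS2 actually does, with toy coordinates:

```python
# Sketch of the gap-stitching idea behind --broad (the real MACS2 logic
# has more nuance): merge sorted narrow calls on one chromosome whenever
# the gap between consecutive peaks is at most max_gap.
def stitch_peaks(peaks, max_gap):
    merged = []
    for start, end in peaks:  # peaks must be sorted by start position
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(stitch_peaks([(100, 300), (350, 500), (2000, 2200)], max_gap=100))
# [(100, 500), (2000, 2200)]
```

The 50 bp gap between the first two calls falls under the threshold, so they become one broad region; the distant third call stays separate. That's why a dip caused by a repeat no longer splits a contiguously marked domain.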
So what do these things look like? Again, I'm going to go through what a BED file is specifically in a few slides, but for now just think of it as a column-delimited file where you have the chromosome or contig you're aligning to — in this case chromosome 1 — and then the start and end position for the particular feature, which here is the enriched region. Then there's the name of the peak and an integer score for that peak, which is what's displayed in, for example, the UCSC Genome Browser. Then come the statistics for how confident MACS2 is in that particular region: the fold change over control, the −log10 p-value, and the −log10 q-value. Of course, you'd most often want to be looking at the −log10 q-value for any kind of analysis. The final column is the relative summit position from the peak start — basically, where the summit of the peak is relative to where the peak starts. So that would be, say, 840,081 plus 158, which gives you the position of the maximum peak height for that particular feature. Then there's the tabular file, which looks like this — again a column-delimited file with a few less fields. You can load it into R or Excel or whatever tool you're using, which allows you to sort by various things; in this particular case the only significance value stored is the negative log10 q-value. You'll also get bedGraph files if you specify the -B option — I'll talk about what bedGraph files are in just a minute — essentially the treatment (your IP) pileup bedGraph file and your control lambda bedGraph file.
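The peak-file columns just described can be read with a few lines of code. The example record below is fabricated to match the coordinates quoted above (840,081 + 158); real files hold one such tab-separated record per peak:

```python
# Sketch: reading one narrowPeak record (BED6+4). The example line is
# fabricated for illustration; field names follow the column order
# described above.
FIELDS = ["chrom", "start", "end", "name", "score", "strand",
          "fold_change", "neg_log10_p", "neg_log10_q", "summit_offset"]

def parse_narrowpeak(line: str) -> dict:
    rec = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for key in ("start", "end", "score", "summit_offset"):
        rec[key] = int(rec[key])
    for key in ("fold_change", "neg_log10_p", "neg_log10_q"):
        rec[key] = float(rec[key])
    rec["summit_pos"] = rec["start"] + rec["summit_offset"]  # absolute summit
    return rec

line = "chr1\t840081\t840400\tpeak_1\t85\t.\t4.2\t8.1\t5.6\t158"
print(parse_narrowpeak(line)["summit_pos"])  # 840081 + 158 = 840239
```

In practice you'd filter on `neg_log10_q` and use `summit_pos` for downstream work like motif analysis, where the summit is a better anchor than the peak boundaries.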
There's also a q-value bedGraph file, and these are the file types you can use to do visualization, as we'll talk about in the tutorial. So once we've run through MACS2, we generate BED files and bedGraph files, which become the file types we do downstream analysis with. There's a variety of tools you can use — IGV being one that you can run locally — and you can also load these onto public browsers like the UCSC Genome Browser to visualize the files. BED stands for Browser Extensible Data format; again, it's just a column-delimited file. The minimum number of columns is three: a chromosome, a chromosome start position, and an end position. One thing to note is that in the BED file the coordinates are half-open, which refers to how coordinates are stored. You can imagine conventions where both the start and end positions are included in the interval, or where the start is included and the end is excluded — it's just a convention, and half-open (start included, end excluded) is the convention that's used. The bedGraph file is very similar, except that after the chromosome, start, and end positions we add a data value, which can be a q-value, for example, for that particular interval. At the header of these files we can include a track definition line, which looks like this and essentially just provides information if you're using a genome browser — the height, how you want to scale the track, the color; all of that can be encoded within the header of the bedGraph file.
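The half-open convention trips people up constantly, so here is the one-line consequence worth remembering — BED intervals are 0-based and half-open, so the feature length is simply end minus start:

```python
# Sketch of the half-open convention: a BED interval [start, end) is
# 0-based, includes start, excludes end, so length is just end - start.
def bed_length(start: int, end: int) -> int:
    return end - start

# the first 100 bases of a chromosome are written as start=0, end=100
print(bed_length(0, 100))  # 100
```

Compare that with SAM, which is 1-based and fully closed; mixing the two conventions is a classic source of off-by-one errors in downstream analysis.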
You can then convert these bedGraph files into binary indexed versions, in the same way that we converted the reference from a FASTA file into an indexed version. That just allows browsers, or whatever tool you're using to visualize the file, to operate much more efficiently — it reduces the space required and lets the tool navigate through the data quickly. And that's my last slide. All right, so