Good morning everybody. So, like I just explained, this lecture is usually covered by Jared Simpson. The slides were prepared by him, but unfortunately he can't be here. I'll try my best to explain everything he's included. If you have any questions, feel free to ask at any point. Or if something seems ambiguous and is extremely lengthy, I might just talk to you about it afterwards, or ask Jared and he can get back to you with the answer as well. Everything in this lecture is also covered by the Creative Commons license, which means you can reproduce or share anything presented as long as your slide deck is also covered by Creative Commons as well. Some of the lab material was also taken from Ben Langmead and Aaron Quinlan, who both work on sequence alignment and analysis and develop widely used programs as well. So, module 4 is about mapping reads to a reference genome. The learning objectives for this module are understanding how mapping reads to a reference genome works and how we generate these reads in the first place, and learning some of the basic formats we use when mapping reads: the FASTQ format and the SAM/BAM file format. FASTQ is usually your raw sequencing output, and the SAM and BAM file formats are the standard for alignments. We'll also learn the common terminology used to describe alignments, like insert length, coverage, mapping quality, Phred scores, and so on and so forth. And then in the actual tutorial, we're going to run a mapper and explore the results, to get you familiar with how to run your own sequences. So, the original human genome was sequenced using Sanger sequencing, a chemistry developed in the 1970s. Although this gave us a fantastic reference, it cost a few billion dollars.
But once we had this reference established, researchers looked for a way to sequence multiple individuals and compare them to the reference, so that we could associate phenotypes with any of these genetic changes. Just like Trevor was talking about yesterday, we do this for cancer a lot. We sequence our cancer genome and compare it to the human reference to see what might have changed, whether it be SNPs, indels, structural variants, copy number variations, and so on and so forth. So, Sanger sequencing was the first chemistry developed, and then multiple chemistries were attempted to create next-generation sequencing, which brought about Solexa, the company that Illumina actually bought. Their platform was massively parallel sequencing, so being able to sequence all of these reads at the same time. Now, the reason why Illumina is dominant in this field of second-generation sequencing, even though others exist, is because they have massive advantages over their competition. You get a huge amount of throughput. On the Illumina HiSeq, you get about 3 billion pairs of 100 base pair reads, so about 600 gigabases. The Illumina HiSeq X Ten gives you about 1.8 trillion bases in just less than three days. The accuracy of each of the bases is extremely high, which for third-generation sequencing has actually been a huge problem. The read lengths are fairly good for second-generation sequencing. So, even though 150 base pairs is short compared to other technologies, this is actually quite workable. But the disadvantage is still that 150 base pairs is too short a read to map uniquely in the repetitive elements of our human genome. For repetitive stretches longer than 150 base pairs, it would be completely unknown to us where our read would actually go.
But because the chemistry is fairly quick to develop and run through, and the process is fairly standardized, we still use Illumina for most of our cancer analysis, because you can get sequences in as little as one to two days. So, now I'm just going to explain how parallel sequencing actually works; it's called sequencing by synthesis. What you do is take your DNA and shear it to a target length. You take these sheared DNA fragments as templates and attach them to the surface of a microscope slide. You can see two of them attached as an example below. You then run PCR, bridge PCR, on these templates to build clusters of each of these templates. So, even though six templates are shown per cluster, these are usually on the order of 10,000 to 20,000 templates per cluster. Now, this is where the actual sequencing comes in. Each of your clusters is represented by these single strands over here, and you mix fluorescently labeled nucleotides and DNA polymerase in with these cluster templates, and there's a very accurate, very expensive digital camera watching the position of each one of these from directly above. You then run your sequencing machine for one polymerase reaction, one nucleotide incorporation at a time, and as each nucleotide is inserted onto the replicated strand, it releases a fluorescent signal. The camera picks up these fluorescent colors at each of these positions, and the computer is able to trace back and figure out what base was just inserted. So, the first base is added, the camera sees positions like this on the microscope slide, and as your cycles continue, the color changes tell the computer what bases were added and in what order. You end up getting these time-lapse images, and those time-lapse images are converted into sequences for each and every one of these positions.
The problem with this comes up when replication in our clusters doesn't happen in synchronization. For each one of our clusters, where we have 10,000 or 20,000 different templates, the number of nucleotides added could sometimes be one more or one less than the current position, and if that happens, you end up getting different fluorescent colors for that specific cluster, which creates some ambiguity. So, the computer has to deconvolve this combined signal of red, red, green, and orange (it's supposed to be orange on the screen) and work out which of these template strands has moved out of sync, by keeping track of the bases that were added before and keeping in mind that another base is going to be added after. Because you get these ambiguities, quality scores are also included for each one of these calls, keeping track of which base calls were confident and which ones were difficult to resolve. When you first start off your sequencing, your first batch of bases is going to be extremely high quality because all of your clusters are in sync, but as your read lengths get longer and longer, you get less confident about what the bases actually are. At the end of all of this, you get your output in a standard FASTQ file. This is what you'll mostly be seeing whenever you're dealing with DNA sequencing output. The FASTQ format is a set format, and it follows just as it's labeled on the slide. You have the name of the sequence, the name that the sequencing machine assigns to that specific read; you have the actual nucleotides, the bases that were called by the machine; you have a separator line, which is just a plus sign; and then you have quality scores for each one of these bases in that given read.
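That four-line layout is regular enough to parse with a few lines of code. Here is a minimal sketch, assuming the standard four-line FASTQ record structure just described (the function name and example read are illustrative, not from the lecture):

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) tuples from a FASTQ file handle."""
    while True:
        name = handle.readline().rstrip()   # line 1: read name, starts with '@'
        seq = handle.readline().rstrip()    # line 2: the called bases
        sep = handle.readline().rstrip()    # line 3: separator, a plus sign
        quals = handle.readline().rstrip()  # line 4: per-base quality string
        if not quals:                       # end of file reached
            break
        yield name.lstrip("@"), seq, quals

# A made-up two-record example, held in memory instead of a file:
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIHHGG\n")
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)  # read1 ACGTACGT IIIIHHGG
```

Note that the quality string on line 4 is always exactly as long as the sequence on line 2: one character per called base.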
So the quality scores, these Hs, Gs, and so on and so forth, are actually Phred scores. You have to convert them back into a Phred score, and the Phred score itself is a probability saying how likely it is that that position is inaccurate. A Phred score of 3 means that base call has only about a 50% chance of being right. A Phred score of 40 means the probability of the base call being inaccurate is about 0.01%. 40 is the upper limit for these Phred scores. So when you get your reads as output, you want to keep in mind which bases were and were not high quality, because when you're aligning them, this becomes an important parameter for finding the best alignment match. But now that you have your sample sequences, we want to go to our reference and find where each read maps uniquely onto the reference. It's equivalent to taking a line out of a book and then skimming through the book to find the exact page it came from. But just like with a very long book, or a book that keeps repeating its sentences or its words, this becomes a problem, because we have repeats and therefore multiple sites where a read could be found. So this is what the mapping problem basically is: finding the best match for a read in a given reference. As I said, genomes, especially the human genome, are extremely large and repetitive. So a read, which is only about 150 base pairs, for example, could align to any one of these red positions. And if it were a perfect match to each one of them, we would have no way of figuring out which one it exactly is. The other problem is that because you're searching through the entire genome for each one of these reads, using simple algorithms that just scan through the entire genome is extremely slow, extremely computationally expensive, and not realistic at all.
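The conversion from a quality character to a Phred score, and from a Phred score to an error probability, can be written out explicitly. This sketch assumes the common Phred+33 ASCII encoding used by modern Illumina FASTQ files:

```python
def phred_score(char):
    """Convert one FASTQ quality character to a Phred score (Phred+33)."""
    return ord(char) - 33  # subtract the ASCII offset of 33

def error_probability(q):
    """Probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# 'I' encodes Q40, i.e. about a 0.01% chance the call is wrong:
q = phred_score("I")
print(q, error_probability(q))  # 40 0.0001
```

The same scale explains the numbers above: Q3 gives an error probability of 10^(-0.3), about 0.5, which is why a Phred score of 3 means roughly a coin flip.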
So we need to be able to efficiently find where each read maps onto the reference genome, all the possible sites at the exact same time. The other problem is that the reads themselves also contain errors. So there's no guarantee that a read will map exactly onto the reference. We have to be tolerant of the fact that some bases might be miscalled, and we might have insertions or deletions in the read itself. And this is completely ignoring the fact that you might have SNPs or indels or mutations in your actual sample. So this is purely the mapping problem. Mappers are more tolerant of mismatched bases than of indels. So if the whole task of your project is to find indels, it's more difficult for programs to identify these than SNPs. The mapper that we choose must be aware of these facts and be able to tolerate them. So although many software packages came out, with different speeds and different sensitivities, what the community ended up landing on was the Burrows-Wheeler Aligner (BWA). The reasons for this are that it's extremely well supported: you can find plenty of documentation online where people help, explain, and troubleshoot any problems you might have. Most of the SNP and somatic mutation callers downstream of it are compatible with BWA's output. And that's also something to be aware of: all the tools that you use must have a compatible input, and that input comes from the previous tool's output. Sometimes you'll end up spending a few days, if not a week, trying to configure your inputs just to run a specific program. The main advantage of the Burrows-Wheeler Aligner is that it uses the FM-index, which I cannot exactly explain to you how it works, but Jared certainly can. It's a computational method for finding all the exact alignments of a given read in a reference genome at the same time.
So it helps save us all that computation of finding where a read might map onto the genome at multiple locations. Once we have our Burrows-Wheeler Aligner, it'll take our read and map it to all the different locations in the genome it could possibly map to. But then we arrive at a different problem: which one of these is true? Is a read mapping to chromosome 10 position 1,020 with one mismatch a better mapping location than chromosome 2 position 2,139 with two mismatches? This is where the number of mismatches, as well as the base qualities at those mismatches, is important. If the base quality of a mismatch is extremely low, we weren't expecting a good match there anyway. But if the base quality called for this A is extremely high, we would have expected a match there. So what the mapper does is take the number of mismatches, the number of indels, and the base qualities at each of these positions, and convert them into a mapping quality. The mapping quality helps the aligner figure out which of these two alignments to report: the one with the higher mapping quality is the one it ends up selecting. Yep, and this works so that if you have lower base quality scores at a mismatch, you give it a lower penalty, and if the rest of your alignment is high quality with high base qualities, your mapping quality goes higher and higher. The highest mapping quality you can have is 60; that's just the upper threshold. The probability works the same way as Phred scores: the probability that the mapping is wrong is 10 to the power of negative this number over 10. So in this case, it'll be a 1% chance of being misplaced versus a 10% chance over here.
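Because mapping quality uses the same logarithmic scale as Phred base qualities, the conversion in both directions is one line each. A small sketch (the function names are illustrative):

```python
import math

def mapq_to_prob(mapq):
    """Probability that the reported mapping location is wrong."""
    return 10 ** (-mapq / 10)

def prob_to_mapq(p):
    """Convert an error probability back to a mapping quality."""
    return round(-10 * math.log10(p))

print(mapq_to_prob(20))    # 0.01 -> 1% chance the mapping is wrong
print(mapq_to_prob(10))    # 0.1  -> 10% chance the mapping is wrong
print(prob_to_mapq(0.01))  # 20
```

So the two candidate alignments in the example correspond to mapping qualities of 20 and 10, and the aligner reports the MAPQ 20 location.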
There are other cases where you'll end up having matches to the reference genome where both positions are exactly the same, with the exact same mapping quality. In these cases, there's no way of resolving which one of these positions is true and which is false. Because they're both exactly the same, the BWA aligner will actually pick one of these at random, but assign it a mapping quality of zero, because it's totally ambiguous. You have no idea where it's going, and you can't actually infer any information from it. So Illumina understood these problems as well, and realized that having just single-ended reads creates these ambiguous positions, and so we ended up arriving at paired-end sequencing. The way paired-end sequencing works is you take your DNA fragment, ligate adapters onto both ends, and you end up with reads that start from either end and work towards the inside: 5 prime to 3 prime, and 3 prime to 5 prime. Your total DNA fragment over here is actually called your template size, and the distance between these two arrows over here is actually called your insert size; this is just mislabeled on the slide. But now, because we have this extra piece of information from our paired-end read, and we know what our average template size should be, we can use that information to go back to our reads over here and see where our second read ends up mapping. And you can see that for the first one, our paired-end read actually aligns with high confidence onto the same chromosome just further down, whereas for the second one it doesn't really map all that well. Because of this, we're able to resolve this ambiguous read to map to chromosome 16 location 4987, and this other mapping position is disregarded from the alignment entirely.
So because you're trying to collect all of this information: where your read maps on the reference genome, what the mapping qualities are, and all of these different metrics, like where the mismatches happen and where you actually have insertions or deletions, we have a standard format for that called the sequence alignment map (SAM) format. It's used for storing read mappings. SAM is the human-readable text format, so you can do a head on a SAM file and just read it. The BAM file is a binary version of this format. It's much more efficient in terms of space; it's how you usually store your aligned files and how programs usually take them in. But if you do a head on a BAM, your entire terminal will end up going all over the place, because it's machine-readable, not human-readable. So in the tutorial itself, we'll see how we can convert SAM files to BAM files, because it's something you'll need to do often. What you see over here is a typical line from a SAM file, and we're just going to go over each of the fields in a SAM file to be able to decode what is what. What we first have is the read ID. This is the exact same read name that came from your FASTQ file, so you can go back and pull up which read this SAM alignment is talking about. Then we also have a flag. The flag isn't something that you can just look at and automatically determine what it indicates. There's an online table, and there's a link to that table from the SAM utilities, that'll decode your flag and explain how that alignment actually works. The flag can indicate things like whether the read came from the reverse strand, whether the read pair was not mapped successfully, whether only one of your reads is aligned, and so on and so forth. So it's a way of getting explanations for each one of your alignment metrics. Then you have the actual chromosome and position that your read ends up mapping to.
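Under the hood, those online flag decoders are just reading bits out of one integer. Here is a sketch of the idea, using a subset of the bit meanings defined in the SAM specification:

```python
# Each bit of the SAM flag encodes one yes/no property of the alignment.
FLAG_BITS = {
    0x1:   "read is paired",
    0x2:   "pair mapped in a proper pair",
    0x4:   "read is unmapped",
    0x8:   "mate is unmapped",
    0x10:  "read maps to the reverse strand",
    0x20:  "mate maps to the reverse strand",
    0x40:  "first read in the pair",
    0x80:  "second read in the pair",
    0x100: "secondary alignment",
}

def decode_flag(flag):
    """Return the human-readable meaning of each bit set in a SAM flag."""
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]

# Flag 99 = 0x1 + 0x2 + 0x20 + 0x40: a properly paired forward read,
# first in its pair, with its mate on the reverse strand.
print(decode_flag(99))
```

So the single number 99 that you see in a SAM line compactly stores four separate facts about that alignment.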
Your mapping quality is also stored right there. So 60 is the ceiling; this is the best match we can have. Then you also have a CIGAR string. The CIGAR string explains how your read maps onto the reference. In this case, we have 76M, which means 76 aligned bases. Over here, there are two examples with different CIGAR strings. The first one says that you have four matches, ACGA, one deletion, and then six matches (this is supposed to be one dash rather than two). The other one is four matches, one, two, three, four, then one insertion, because that base is missing in our reference (it's supposed to be one dash again, down here), and then four matches once again, AACC. So the CIGAR actually explains how your read maps onto the reference. Now, the important thing to note is that the CIGAR string only cares about which positions align; it doesn't say that it's a perfect match. If you look at the second one, this A and this T aren't the same, but because your aligner forced these two to count as an aligned position rather than a mismatch, your CIGAR string will only encode that information. So if you want the actual sequence from the reference, you'd have to go back and pull up what position and what number of bases actually mapped in that reference. The last two fields relate to paired-end sequencing: where your pair ends up mapping. If it's an equal sign, it's the exact same chromosome, followed by what position the pair is at. And then the last piece over here, the insert size, isn't really the insert size; it's actually your template size. So it's the entire length from end to end. So these are some resources that are typically used. The SAMtools toolkit is extremely useful for processing and manipulating your SAM/BAM files. You can sort them, and you can create indices for various purposes.
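The fields just described are tab-delimited, so pulling a SAM line apart, including the CIGAR string, is straightforward. A sketch, where the example line itself is made up for illustration (M = aligned base, I = insertion, D = deletion, S = soft clip):

```python
import re

# A fabricated SAM line with the field order described above:
# name, flag, chromosome, position, MAPQ, CIGAR, mate chrom, mate pos,
# template length, sequence, base qualities.
line = "read123\t99\tchr10\t1020\t60\t4M1D6M\t=\t1500\t480\tACGAGGTCTA\tIIIIIIIIII"

fields = line.split("\t")
read_id = fields[0]
flag = int(fields[1])
chrom, pos = fields[2], int(fields[3])
mapq = int(fields[4])
cigar = fields[5]

# Split the CIGAR into (length, operation) pairs.
ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
print(chrom, pos, mapq, ops)  # chr10 1020 60 [(4, 'M'), (1, 'D'), (6, 'M')]
```

Note how the D operation consumes a reference base but no read bases: the 4M + 6M account for the read's 10 bases, while the read spans 11 bases of reference.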
You can take out only the reads that map within a specific region of interest. If you have a BAM file or a SAM file, the specifications can also be found at this link over here, and they go well in depth into all the different headers and fields that you can have. If you ever get stuck with questions about alignments outside of this course: Biostars has already been shown in the previous two lectures, but there's also a discussion board for SAMtools help, just in case you're running into problems with SAMtools. And if you have sequencing-specific questions, there's SEQanswers as well, where people talk about different alignment metrics, or if you're using a new technology, you can look in there too. And then if you want to decode whatever your flag stands for, or if you want to go the opposite way and find out what flag corresponds to a specific set of parameters for your alignment, you can go to the Broad Institute's page and they have an explanation for each of those flags. Once you have your BAM file, you can make a BAM index file, and then, just like you guys did yesterday, you can actually view your alignments in IGV. What we have over here is, I believe, a control and a tumor sample. And you can see that at this position, you most likely have a heterozygous SNP, with a G instead of a C, I believe. So this would be where a read maps perfectly, but this read over here just doesn't align to your reference, and that information is stored in your BAM file as well. A typical thing to look out for after you do your alignment is coverage. If you have 754x coverage on average across your whole genome, you don't want huge peaks that are around 8,000x. You expect your coverage across the genome, if you sequence deeply enough, to be relatively stable and relatively comparable. Huge spikes mean there are most likely misalignments happening at those positions.
If you actually look at this position on the ideogram above, you can see that this is the centromeric position, so having these mistakes kind of makes sense. But if it happens in the rest of your chromosome, or in the rest of your genome, those become areas to flag, look out for, or try to realign, to try to get rid of these major spikes. The other thing that can sometimes happen is that if you have extremely long deletions that span much, much more than your paired-end fragment or your reads, you can get soft-clipped reads. What you can see over here is that portions of your read map perfectly until you get to this breakpoint over here, and then the rest of the read just doesn't align at all. What ends up happening is only portions of the read are actually mapped, and the remaining portion is removed, or soft clipped. Your BAM file will actually indicate in the CIGAR string whether part of your read is soft clipped. Soft clips recurring at the same position are usually an indication that you've had a deletion or some kind of structural variant happening there; if instead you see them all across your whole genome, just all over the place, that's more likely a sign of problems in your reads. That's actually where we're going to segue into the mapping exercise in the tutorial, just to give you guys some hands-on experience. Are there any questions? Yes. So if you do high-throughput sequencing across the whole genome, because it's high throughput and you have, say, 30x coverage, you don't expect there to be major, major spikes, regardless of whether there might be higher GC content at a specific location. Having twice as much coverage might be okay, but having 10 times more, or say 50 times more, does raise a few flags. Yes. Yes.
If you have higher coverage at a specific location, let's say over here, what you might have is a very repetitive sequence, a very repetitive element, and so your ambiguous reads that map to those repetitive elements all just pile up over here. Yes. So your mapping quality would also be lower, because if they were fully ambiguous, they would end up multi-mapping, and so they'd have mapping qualities of zero. Yes. Yes. So this is just assuming that if you end up mapping your reads perfectly, and say this second read of yours maps to chromosome 3 position 7112 perfectly as well, you'd have high qualities on both of these, and so that read would be conserved. The issue becomes, if you have perfect matches on chromosome 16 and chromosome 3, where does the second read end up mapping to? Because this one maps to a unique place on the genome, whereas this one really doesn't, you end up disregarding this potential position. If you had structural variants, then the sequence of this one would map uniquely onto a different chromosome at a different position with high confidence, so you would end up keeping that position. This is just comparing downstream against that ideal template length, and trying to see which one of these two positions would end up being correct. The Phred scores? So the sequencers have their own internal models, because they keep track of the bases that have come before and the bases that come after, and they're looking at the combined fluorescence that comes from each one of those clusters, so it does its own mathematical modeling.