All right, morning everybody. I hope you've had a coffee or are going to have a coffee. So, my name is Jared Simpson, and I'm here at OICR as a principal investigator, which means I run one of the research groups. My lab is pretty much just above this room, on the sixth floor. Heather, one of our instructors and TAs, is one of my grad students. What we work on in my lab is essentially developing software for analyzing genome sequencing data, so it's very appropriate that I'm giving today's talks: I'm going to talk about some of the software that's been developed for mapping reads to reference genomes, and then later on this afternoon we'll talk about genome assembly. I also have an appointment as an assistant professor in the Department of Computer Science across the street at the University of Toronto, where I interact with other computational faculty to try to solve the challenges of dealing with the very large data sets that we get from high-throughput sequencers. Did I cover everything required by the introduction? Something that's not on the CV: according to my grad student, I'm getting a dog. She's been finding dogs for me for the last week, because my wife and I just bought a house, so apparently it's time to get a dog, according to virtually everybody that I know. Also, as Michelle was miming out, I'm a rock climber, so if you can't find me in the lab you can find me at one of the local climbing gyms, usually Rock Oasis, Wednesdays and Saturday mornings. All right, so let's get into the lecture then. Just some brief preamble: all the slides are available under Creative Commons licenses, and part of Creative Commons is that you attribute the individuals who contributed to your materials. So here are my credits: some of the material I'm going to present today was originally developed for courses by Ben Langmead at Johns Hopkins University and Aaron Quinlan at the University of Utah. We appreciate them making these teaching materials openly available so that everyone can benefit, and if you're interested in other topics in computational genomics and how to analyze sequencing data, their websites have a lot of additional material. So my topic this morning, for the next, I'd say, 45 minutes to an hour, is mapping reads to a reference genome. Yesterday, when Trevor presented the introduction to cancer genomics, he had a figure in his slides that visualized data processing as a pipeline: you put data in, in our case sequencing data, it flows through a number of steps, and at the end you get some results out, maybe some mutation calls, maybe a list of driver genes, or any of the other topics we'll be covering in the course. The focus of this lecture is the very start of that pipeline. When you first get sequencing data, maybe after a little bit of QC or quality control, what do you do with it? The first step in most workflows is taking the reads and aligning them, or mapping them, to, say, the human reference genome, and then passing the alignments down to downstream tools like somatic mutation callers, which identify where the somatic mutations are. So many of you have probably run read mappers before. Who's run a read mapper before?
So maybe about a third or a quarter of the class. For the rest of you, you're going to have a chance to run a read mapper in the practical session, but for everybody, what I hope you get from this is an understanding of what we mean when we talk about mapping reads to a reference genome, and some of the terminology around it. I'm a computer scientist by training, so I think a lot about how to design efficient methods that can very quickly and very accurately align reads to reference genomes. I'm going to talk about the algorithms only at a very high level; I don't want to go into the details today, but if anybody's interested, I can talk about some of the techniques in the tutorial or in the coffee breaks. I also want to introduce some of the main file formats we use when talking about sequencing data: the FASTQ format, which is the format sequencing reads come in when you get them from the sequencer, and then the SAM and BAM file formats, which are how we represent reads aligned to a reference genome. We'll talk about common terminology, things like base quality scores, mapping quality, and paired-end reads; I'll introduce all of these concepts and discuss the terminology throughout this lecture. And then finally, after I'm done talking, you'll get a chance to run a read mapper. Our preferred read mapper is called BWA; it's written by Heng Li. We'll talk about that a little bit later, and you'll get a chance to explore some mapping results on your own and look at some aligned data. All right, I'd like to make this as interactive as possible. I get really tired if I talk for an hour unbroken, so please feel free to put your hand up and ask questions, so we can make this as much of a dialogue as possible. We have a question already. Thank you. Yeah, so usually what will happen is, BCL is a file format that Illumina instruments output; I think it stands for Base Call Likelihood, and we'll talk about base calls a little bit later. Usually, very early in the pipeline, something will convert BCL to FASTQ, and then all of the downstream analysis tools, read mappers like BWA, or another read mapper called Bowtie that some people might have used or heard of, they all expect FASTQ files as input. So there'll be a step upstream, hopefully upstream of when you get the data, that converts BCL to FASTQ. That's a good question. It's pretty much just a transformation, so the exact same information is in the FASTQ as in the BCL and vice versa; it's just structured slightly differently. But the larger question is what we mean when we talk about the accuracy of sequencing reads. I'm going to talk about that a little bit later, and then maybe we can touch on some aspects of BCL. I want to come back to that. Anything else before we go in? Is there anything people want to hear about in particular? Yeah, right, specialized aligners for ancient DNA. That's a really good question. I know Trevor talked about cell-free DNA yesterday. Sorry, a little bit. Okay, not a lot, a little bit; he mentioned it. This lecture today is going to be very much tailored for bulk whole genome sequencing, what we expect the data to look like when we say we've sequenced a genome. When you have things like ancient DNA, which can be highly degraded,
it has certain types of sequencing artifacts, and you definitely have to handle that specially. I'm going to try to point out as I go where you might vary from the pipelines we're talking about, so I'll definitely try to cover that. Anything else people want to hear about? Yeah, batch effects. Michelle is busy right now; is someone talking about batch effects sometime during this course? So there isn't a dedicated statistics section on batch effects. Okay, I'll try to bring that in. Usually we think about that a bit downstream of primary data analysis. What we're going to talk about here today, and I think tomorrow as well, is taking your sequencing data and generating, say, somatic mutation calls or copy number calls. Batch effects usually come in later, when you're jointly analyzing the results of maybe 50 to 100 sequenced genomes; then you need to think about things like whether you're seeing certain variants only in samples from one sequencing center versus another. So I could try to bring that in, but it's a little downstream of what we're going to talk about today. Yeah, CRAM. Ah yes, I can talk about CRAM. I'd like you to remind me of that when I talk about the BAM format in about 40 minutes from now, because that's definitely an interesting one and very topical for 2019. All right, that's a good list of stuff to talk about; let's jump in. I think the starting point is really how sequencing itself works. This lecture is going to be entirely about Illumina sequencing. As many of you are aware, Illumina is essentially the market-dominating sequencing technology. It has incredible throughput: you can sequence hundreds of gigabases, even up to a terabase per run with the latest Illumina NovaSeqs. The reads are very accurate, accurate to about, let's say, 99.9%, or one error per thousand bases. It's very fast and robust: as long as you can extract DNA, you can probably make a sequencing library and get it to sequence on an Illumina sequencer. This is important for things like ancient DNA, where the DNA is very short and very degraded, because you can amplify it and get enough to sequence on Illumina. But it has one main disadvantage: there's an inherent limit to the read length. You can only sequence about 100 to 150 bases per DNA fragment. Now, does anybody know why it's a limitation that you can only sequence what we consider short reads? Might be a problem with deletions. So it might be difficult to detect events that are larger than your read length; that's definitely a limitation, that's a good one. Any other limitations of only having really short reads? So how big is the human genome? Three billion bases. And reads are, let's say, 100 bases, so a read is about seven orders of magnitude shorter than the human genome. The reason this is important is that the human genome is not only very large but also very repetitive. A lot of segments of the human genome, roughly 50% of it, are present in multiple locations of the genome. That means that when a 100-base read comes from one of these repeated segments present in multiple copies, it's very difficult to say whether it came from this copy versus that copy. We call this the repeat problem, or the repeat mapping problem.
It's going to be much more important when we talk about genome assembly in the afternoon, but it's one of the main challenges of working with short-read sequencing: you can't resolve where the reads come from if they come from these very high copy repeat regions. How long can the reads get? I usually say about 100 to 150; I think people have run MiSeqs out to about 300 bases, but you're not getting more than, let's say, an upper limit of about 300 bases. And the amount of the genome you can cover with a 300-base read versus a 150-base read isn't that much more. With paired ends you can do a bit better, as we'll see a little later. But the big jump is to the new single-molecule, what we call third-generation, sequencers developed by PacBio and Oxford Nanopore, which can sequence reads of 10,000 bases or greater, and which allow you to cover much more of the genome, including more of the repetitive regions. That's much more important for genome assembly, so I'm going to leave the entire discussion of long-read sequencing until this afternoon. One of the main things I want you to get out of this lecture is an understanding of some of the ways sequencing can go wrong. A lot of the concepts we're going to talk about are relatively straightforward in the case where you have perfect data and your genome doesn't contain a lot of repeats, so I want to introduce you to the complications that errors in your data and repeats in your genome can cause. To start with, I want to walk through this cartoon of how Illumina sequencing works. In the first step of Illumina sequencing, you take some genomic DNA and shear it, using, say, sonication or an enzymatic method, down to short DNA fragments of maybe 300 to 400 bases. We then take single-stranded fragments, which we call templates, and attach them to the surface of what is essentially a microscope slide, which we call an Illumina flow cell; the molecules stick up perpendicularly from the surface of the flow cell. We'll call each one of these our templates here. Then there's an in-place PCR reaction to amplify these molecules, so starting from a single template molecule we get a bundle, or cluster, of identical copies of that same sequence, all localized in the same region of the flow cell. You get maybe 10,000 copies of the molecule per cluster. So in this little cartoon we have two templates, and they generate two clusters. Now, Illumina sequencing is very similar to classical Sanger sequencing technology: we're going to use chain-terminating bases that are fluorescently labeled, one color for A, one color for C, one color for G, one color for T, and then we're going to use DNA polymerase to copy in nucleotides complementary to the template molecules attached to the slide. We're only showing one molecule per cluster here for clarity, but you can imagine this happening across the whole bundle of identical copies. So what happens is: we've got our fluorescently labeled nucleotides, we've got DNA polymerase, and we add them to our flow cell. DNA polymerase finds the complementary base to the first base of each template molecule. Here it's a T, so the complementary base is A; for this one it's a C, so the complementary base is G.
You then excite the fluorophore attached to the nucleotide with a laser, which emits some light, and then, using a very expensive digital camera hooked up to a very expensive microscope, you take a picture, which registers the color emitted by each of those clusters. So we've got a flash of green light here and a flash of this orangey-yellow light over here. Yeah. So the question was whether this PCR step also generates the complementary sequence. The way the Illumina chemistry works, I think, is that this is the 5' end down here and this is the 3' end, and there's a bed of adapter molecules down on the surface; the bridge amplification, at least for the Illumina HiSeqs, bends the sequence over like this and amplifies in place, and then I think it cleaves one strand so they all stand up again, so that only the template strand stays attached to your flow cell, or else you'd have a mixed signal. I don't remember exactly how it works, but you end up with this cluster of single-stranded templates, all with the same sequence and all with the same strandedness, or else you'd get this weird mix. Yeah? We've been experimenting with laser capture samples, so with low input we had to PCR amplify first. Right, that's separate from this. Yeah, in a typical Illumina library prep there are two amplification steps: there's an amplification step that happens beforehand, where you're expanding your DNA to enough concentration that you can add it to the flow cell, and then there's this in-place PCR step on the flow cell. That first PCR step is optional; if you have loads of input DNA, you can build what's called a PCR-free library, which gives you less bias in coverage across your genome. But this bridge amplification step on the flow cell is necessary; you can't get away from that one. All right, both good questions. So at this stage we've taken a picture of our flow cell: we've registered a flash of green light and a flash of orange light, and that allows us to sequence the first base. This is the top-down view of our flow cell; we've registered some green light in this lower-left quadrant and some orange light in the upper-right quadrant. What happens next is that you add a chemical, or an enzyme, to the flow cell that removes a blocking group from the nucleotides that were added, which then allows the reaction to proceed to the second base of your template. You then do another addition of fluorescently labeled nucleotides and DNA polymerase, excite it with your laser, capture the light, in this case we see red in both clusters, and then remove the blocking group again. Each one of those rounds of chemistry is called a sequencing cycle. So an Illumina sequencer, if you have 100 base pair reads, runs for about 100 cycles, each cycle sequencing a single base. Do you have a question? Yes. So the classic Sanger sequencing technology used what are called chain-terminating, or chain-inhibiting, bases: nucleotides that were chemically modified so that when DNA polymerase copied them into a strand of DNA, the strand couldn't be extended any further. So the copying reaction would continue until it hit a chain-terminating nucleotide, and then it couldn't be extended any further.
That's how Sanger sequencing worked. In Illumina sequencing, we use what are called reversible chain-terminating bases. You have this blocking group on your nucleotide; it gets incorporated, but then you can enzymatically remove the blocking group to allow the reaction to proceed later. It just allows you to control how many nucleotides get incorporated at every step of the sequencing, and you take pictures at each step. Does that make sense? So it's a way of stopping the reaction so that we can picture it. Yeah, if you didn't have these, if you just used regular dNTPs, regular nucleotides, then when you added the nucleotides and DNA polymerase it would just go and copy the entire complementary strand. If you have a chain-inhibiting one, it's going to copy exactly one base in, and then if you remove the block, you get to do the second base in the next cycle. All right, so at the end of this we have a stack of images, one per base, one per cycle. And the instrument has software called a base caller, which takes these images, tries to figure out where the clusters are in each image, and determines the color of each base; using the mapping that says green represents A, and so on, it can figure out what the sequence of each molecule was. Now, errors can creep into this process, and the predominant mode of sequencing error for Illumina sequencing is when these reactions get out of sync. So now we're showing all the molecules in one cluster. The idealized version, if this reversible chain-terminating chemistry worked perfectly, is that all of the molecules get the first base sequenced, then we remove the blocking group, all the molecules get the second base sequenced, then the third base, and so on, all the way through our DNA template. Unfortunately, chemistry is never this perfect. What can happen is that some of the nucleotides might not have that chain-terminating group on them, and sometimes the chain-terminating group may not be successfully removed. So some of the molecules in this cluster can either jump ahead, by going too far in the sequencing reaction, or lag behind. In this cartoon, let's say we're on the second cycle, sequencing the T base in these two templates. But this one has lagged behind by one step, so we're still sequencing that A base and seeing a flash of green light. Conversely, this molecule has jumped ahead to the third base, maybe the chain-terminating group was missing from this T, and we've jumped ahead to the G base. Now, when you take that picture, you're going to see a mixed signal of 50% red, 25% green, and 25% orange, and the base-calling software has to try to deconvolve that and figure out what the true base was at this cycle. Did you have a question? So the one that falls behind stays behind when cycle two moves to cycle three, and it keeps accumulating this kind of mistake? Exactly. So the mistakes compound: as the reaction proceeds from the first base to the hundredth base, more and more of your fragments get out of sync, and you have a more and more mixed signal. If you've ever looked at a plot of the error rate of Illumina sequencing, it looks like this: at the 5' end it's very flat, at maybe, let's say, one error in 10,000 bases, and then as you proceed along the read towards the 3' end, it goes up very, very sharply.
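To get a feel for why that curve bends upward, here's a toy back-of-the-envelope model in Python. The per-cycle de-sync rates are made-up numbers for illustration, not Illumina's real chemistry figures; the point is just that even a small per-cycle loss compounds multiplicatively across a hundred cycles.

```python
# Toy dephasing model: each cycle, a small fraction of the ~10,000
# molecules in a cluster falls behind (phasing) or jumps ahead
# (pre-phasing) and stops contributing to the correct signal.
# The rates below are invented purely for illustration.
p_out_of_sync = 0.004 + 0.001   # per-cycle chance a molecule lags or jumps

for cycle in (1, 25, 50, 75, 100):
    in_sync = (1 - p_out_of_sync) ** cycle
    print(f"cycle {cycle:3d}: {in_sync:.1%} of the cluster still in sync")

# By cycle 100 only ~61% of the cluster is on the right base, so the
# camera sees an increasingly mixed signal and base quality drops.
```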
That's because, as the reaction proceeds, there are more chances for these molecules to get out of sync, so you see more and more mixed signal in these clusters, and it's very difficult for the base caller to figure out what the true base was for these later positions. Yeah, the term we use for this is phasing; I think pre-phasing is when a molecule goes too far ahead, and phasing is when it falls behind. Yeah, I guess two kinds of questions. One: this would be the same process for RNA also, right? Yeah, for RNA on Illumina you convert to cDNA first, and then it works exactly the same as this. And second: are you incorporating the thing that makes it stop, the brakes if you will, into the DNA itself, or is it attached to the nucleotides? I guess I'm wondering, could you have free nucleotides floating around and binding to the DNA, like free, naturally occurring nucleotides that don't have a fluorophore and don't have a blocking group? Yeah, I think it's sort of a question of rate and purity. We have all of these nucleotides with fluorophores and the chain-inhibiting group on them; those are the ones we want incorporated. Your question is whether regular nucleotides can get into the reaction and cause issues. My guess would be probably, but the question is what the rate is, what the proportion of those is, and I don't know that one. But definitely those are the things that generate issues with the sequencing: if you have these unobserved bases, or bases that don't have the chain-terminating group on them. All right. And just to confirm, this chain terminator, is it on each nucleotide? So each nucleotide that's incorporated has it, prior to going through this? That's right. I'll try to dig up the exact chemical structure and put it in the Slack if you're interested. This is all what happens on the Illumina sequencer and in their library preparation kits. But the reason I've gone through this is that we want to understand this error mode: you see the error rate going up at your 3' end, and to understand why that is, you need to understand why these molecules get out of sync. All right, all good questions. Yeah. So for Sanger sequencing, you'd get this chromatogram, the Sanger trace, where you see the peaks of color, and you can figure out your sequence that way. We don't exactly get that from Illumina sequencing. What we get instead are what we call quality scores, which are a measure of how confident the base caller is that what it predicted at a certain position of the read is correct. And that leads perfectly into the next slide, which is about quality scores. So here is a FASTQ file; there are five reads in this FASTQ file. When you work with sequencing data, you look at these files a lot, far more than you ever thought possible. It's a pretty simple text format, where each read is represented by four lines. The first line is the ID, or the name, of the read. The second line is the nucleotide sequence; this is the sequence the base caller predicted for one of those clusters, based on the images we saw. The third line is sort of a relic from older sequencing technology.
It's just a plus sign that we consider a placeholder; it's largely ignored, and it just takes up absolute terabytes of disk space across the world. And then, finally, the thing I want to talk about: the quality scores. The quality score represents how confident the sequencer was that, say, there was an A at the first position of this read, which has been assigned a quality score of H. I'm going to come back to that, but first I want to talk about the scale we use to represent quality scores. It's called the Phred scale, developed by Phil Green, a pioneering bioinformatician from the University of Washington, and it's a logarithmically scaled measure of how accurate the sequence is. If we see a quality score in our reads of, say, Q40, it means the base caller thinks there's roughly a one in 10,000 chance that that base is incorrect. If it's Q30, it's about one in 1,000, or 0.1%. Q20 is 1%, Q10 is 10%, and single-digit quality scores are very, very low confidence indeed. So essentially, when you're looking at sequencing data, you want your quality scores to be as high as possible. With modern Illumina sequencing, the quality scores are typically around, I'd have to look it up, but I'm going to say around Q30 or maybe a little above, where an error is expected at about one in 1,000 positions. Now, we use these numbers between, let's say, zero and 40 to represent the quality of our sequencing data, but in our FASTQ file we just have these single letters. The reason is that we want to represent the quality score for each base using a single character, so that each quality score lines up nicely with the base call it represents. So we use the ASCII character encoding system: we take the quality score, add 33 to it, and look it up in the ASCII table. A quality character of H therefore represents a Q39 base, so roughly Q40. Down here at the 3' end, a quality character of pound sign, or hashtag if you like, works out to about Q2, very low indeed. You can check these in an ASCII table if you're interested. But essentially, you want these quality scores to be as high as possible, and if you're not comfortable looking up ASCII character codes, and pretty much nobody is, you can find online converters from ASCII characters to quality scores quite easily. So to go back to your question, this is the way we represent the quality of our data: we don't have chromatograms or Sanger traces, we have quality scores. And when we look at reads aligned to a reference genome in something like IGV, it will shade the bases according to their quality scores, and we can make a prediction of how reliable they are. When we're looking for somatic mutations, you want all of the bases in the reads carrying evidence for that mutation to have a very high quality score, in sort of the Q30 range. Okay, that is the end of talking about how Illumina sequencing works. Are there any questions before I move on? Yeah. Right, that's a great question. The question is how the base caller estimates these phasing issues in regions of low complexity, like a homopolymer run of all A's or all T's. And homopolymer runs are pretty much the universal problem for all sequencing technologies.
We'll talk about long-read sequencers later on; the biggest source of error there is that they're trying to estimate the signal off of just a single molecule. But definitely, you can see an increased error rate in homopolymer runs, because you can't accurately estimate the phasing there. I think it'd be an interesting one to look at. I've worked on variant calling for Illumina data, and we have to treat variants near homopolymer runs with a lot of caution. I'm not sure whether this problem is fixed with the modern NovaSeqs, but definitely we worry a lot about whether you can call variants near long homopolymers, issues like that. Yeah. So strand bias is an interesting thing to bring up, because when we do call mutations, and hopefully you'll get this in the lecture on somatic mutations, I think tomorrow, they'll talk about the importance of seeing mutations on both strands. It's very common, when you call somatic mutations or just germline variants, to require that you observe the variant on both sequencing strands, to get over these strand-specific errors. Particularly for homopolymer runs, though, if you have a homopolymer A on one strand, you're going to have a homopolymer T on the other. So you've still got a homopolymer; maybe a T run is slightly better behaved than an A run, but just observing the variant on both strands isn't going to entirely get around the issue. Did you have a question? I'm completely unqualified to answer that one; that's a really good question. We can maybe talk about it later, but I don't know whether the strength of the laser, or the emission spectrum, would cause issues like that. It's definitely something we can look up and follow up on later. That's a great question. Anything else about the sequencing tech? Yeah. So you've got a sequence where the quality is very high at the beginning, but at the end it drops down; what happens to that read? Typically, the aligner is going to look at the whole read: if the read is in your FASTQ file like this, the aligner will try to align the entire read to your reference genome. Now, there are two caveats. The first is that sometimes people do quality-based trimming, where they identify low-quality sections of the read and just remove them. So let's say you wanted to quality-trim here, and these number signs represent low quality scores: you might just truncate the read at this G base and throw away the rest of it. In that case, when you align it to your reference genome and look at it in IGV, you wouldn't see that tail. That's not so common when mapping reads to a reference genome; you'll typically just align the reads to the reference and let the variant caller figure out whether or not to use those segments. What might happen, though, is that if there was a problem with the sequencing, and this tail, the suffix of the read, is completely unreliable, the sequence is nonsense, then the aligner might only align the read up to this position and soft-clip the rest, so it won't be aligned to the reference. When we look at BAM files later on, we can see examples of reads that have been soft-clipped, where part of the read just wasn't aligned. That's a good question. All right, any other questions?
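Since you'll be staring at FASTQ files in the practical, here's a minimal sketch in Python of the four-line record structure and the Phred+33 decoding we just went through. The filename reads.fastq is a placeholder, and this toy reader skips the niceties (gzip, validation) that real parsers handle.

```python
def read_fastq(path):
    # Each FASTQ record is four lines: @name, sequence, '+', qualities.
    with open(path) as f:
        while True:
            header = f.readline().rstrip()
            if not header:
                break                      # end of file
            seq = f.readline().rstrip()
            f.readline()                   # the '+' placeholder line
            qual = f.readline().rstrip()
            yield header[1:], seq, qual    # drop the leading '@'

def phred_scores(qual):
    # Phred+33 encoding: 'H' -> 72 - 33 = Q39, '#' -> 35 - 33 = Q2
    return [ord(c) - 33 for c in qual]

def error_probability(q):
    # Q30 -> 0.001, i.e. one expected error per 1,000 bases
    return 10 ** (-q / 10)

for name, seq, qual in read_fastq("reads.fastq"):
    scores = phred_scores(qual)
    print(name, seq[:10], scores[:10], error_probability(scores[0]))
```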
All right, so let's talk about what we're actually going to do with the sequencing data. Typically our goal, particularly when we're analyzing cancer genomes, is to discover genetic variation: differences between the individual genome we've sequenced and some reference genome, or some population of reference genomes. To do this, we typically want to take our reads and compare them to the reference genome. So we might take our read, figure out where it came from in the reference genome, and then look for places where the read differs from the reference. Here I'm drawing this alignment with a vertical bar wherever the bases match up, and there's no bar here because in our read we had a T, where the reference genome had a C. This could be caused by one of two things: maybe the individual had a SNP or a somatic mutation at this position, a C-to-T change, or this is a sequencing error. And pretty much all the work in this primary bioinformatics analysis is trying to work out whether these mismatches are SNPs, somatic mutations, or sequencing errors. Now, there are two main classes of algorithms for analyzing your data. We could take all of the reads, perform a de novo genome assembly of them, and then compare the resulting assembly to our reference genome; or we can map the reads individually, read by read, and then look at the pattern of the alignments to discover variation. The read mapping approach, where we take our reads and compare them to our reference genome, is by far the most common for analyzing cancer genomes with Illumina sequencing data, so we're going to talk about that problem this morning, and then later on, as I mentioned, we'll talk about the other approach, where we do genome assembly. All right, so to do this we need to figure out where our read best matches the reference genome. It's a bit like looking through your lecture notes for a particular page or topic: let's say you couldn't remember who gave the lecture for module four, so you take all your lecture notes, flip through until you find module four, look it up, and see, okay, it was Jared Simpson. Now, to map reads to a reference genome, to find where each read places on the reference, we need really sophisticated algorithms, because the scale of this matching problem, trying to find where in the reference genome each read came from, is just huge. An Illumina sequencer might generate, let's say, a billion reads, and the human genome is three billion bases in length. There's no way we can just take each read and compare it to every position of the reference genome to figure out where it came from. So we're going to have to use data structures called text indices to do this. The main challenges for mapping reads to a reference genome, then, are that the genome is very large and very repetitive, and naive algorithms, like comparing the read to every position of the reference genome, are just going to take too much time. We also need to avoid comparing a read to every copy of a repeat within the genome. So let's say we have our genome here, these three red segments are repeats, and we get a read that comes from one of these repetitive segments. We don't know where the read came from.
It could have come from repeat copy one, repeat copy two, or repeat copy three, and the mapping software has to very quickly decide whether the read can be uniquely mapped to any of these locations, and if not, flag it as a read that comes from a repetitive region. We'll talk about that a little bit later. The third challenge when mapping reads to a reference genome is something we've already briefly discussed: reads contain sequencing errors, and we know the error rate increases towards the 3' end of the read. They also contain SNPs and indels. So the reads aren't going to perfectly match the reference genome; we can't just assume that our hundred base pair read will have an exact hundred base pair match somewhere in the reference. The mapping algorithm needs to be able to find inexact alignments, where we allow some differences between the read and the reference genome, like the C-to-T SNP we saw here, or even larger differences like insertions and deletions. A general principle is that it's much easier to align reads that contain mismatched bases, substitutions, than it is to align reads that contain indels. The effect of this is that when you're calling mutations from Illumina data, it's much easier to identify somatic substitution mutations than somatic insertions and deletions. So a list of somatic indels coming from a mutation caller is typically going to be much lower accuracy than a list of somatic substitutions; that's an important thing to keep in mind when looking at your own data. All right. I started in bioinformatics around 2008, a little over a decade ago, which is when Illumina sequencing first became available and really became widely adopted in the genomics community. An incredibly popular bioinformatics problem to work on was how to efficiently map reads to a reference genome without making a lot of errors, and I think hundreds of different mapping programs have been published over the last ten years, with various trade-offs between being very fast and being very sensitive, that is, finding most of the true mapping locations in your data. But luckily for us, we don't need to benchmark 200 different mappers to figure out which one to use for your project, because the field has widely adopted software from Heng Li called BWA, and in particular a subprogram of BWA called MEM, which stands for Maximal Exact Matches. The reason BWA is now the best practice for Illumina data is that it's very well supported. I think BWA-MEM came out around 2013, and it's been so widely used in the community that basically all the bugs that can be found in this program have been found and eliminated by now. It's very robust, very stable, and very fast. And because it's been adopted as a standard for mapping reads, most downstream tools, like SNP callers and somatic mutation callers, are designed to expect alignments, or mappings, from BWA. A more technical reason why it's a great choice is that it uses an advanced data structure called the FM index to solve this problem of efficiently mapping a read against all copies of a repeat. I teach a grad class in the computer science department that basically spends four weeks just describing how these data structures work.
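Just to give the flavour of what a text index buys you, here's a toy version in Python: a suffix array, the sorted list of all suffix start positions, searched with binary search. This is a stand-in for the general idea of pre-processing the reference once so every read query is fast; it is emphatically not BWA's FM index, which answers the same kind of query in far less memory using the Burrows-Wheeler transform. The tiny genome string is made up.

```python
def build_suffix_array(text):
    # Sort all suffix start positions by the suffix they begin.
    # (A quadratic toy construction; real indexes are built in
    # roughly linear time and far less memory.)
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    # Binary search for the contiguous block of sorted suffixes that
    # start with `pattern`; each one is an occurrence, repeats included.
    lo, hi = 0, len(sa)
    while lo < hi:                       # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix past the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

genome = "AGTAACTAGTAACT"                # toy 'reference' with a repeat
sa = build_suffix_array(genome)
print(find_occurrences(genome, sa, "AGTAACT"))   # -> [0, 7]: both copies
```

Notice that both copies of the repeat come back from one contiguous block of the sorted index; that locality is the same property the FM index exploits to handle repeats efficiently.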
I'm not going to go into those details here, but if anybody's interested in how the FM index solves this repeat mapping problem, I'm happy to talk about it. Yeah. So we're going to talk a little later about the file formats we use to represent mappings, which are called SAM and BAM, and every sequence aligner outputs its alignments in the SAM and BAM formats, so if you want to use an alternative mapper and then put the alignments through MuTect or some other somatic mutation caller, you can, because they're in SAM and BAM. The reason I said the tools are designed to expect alignments from BWA is that there are some subtleties in how it handles corner cases, and all those corner cases have been worked out with BWA mappings in mind. Yeah. Sure. Right, I'm going to talk about that in a couple of minutes, when we get to mapping qualities. But the FM index in particular is built by a pre-processing step where we take our reference genome and perform what's called the Burrows-Wheeler transform, taking this long linear string and essentially converting it into a sorted data structure that can be efficiently searched, where all copies of a repeat are contained in the same region of the FM index. So if we have some Alu repeat that's present a thousand times in the human genome, BWA can essentially align to all copies of the repeat simultaneously; it has an understanding, from the output of this FM index transformation, of how many copies of that repeat there are, and then it can decide whether the read scores better against this copy of the repeat versus another copy. This is the behind-the-scenes magic that makes BWA so fast. And I don't want to undersell the technical details here, because developing this FM index mapping approach was incredibly important for solving this problem: the first aligners, back in 2008-2009, were really quite slow, but this new data structure drastically sped up the whole mapping process, to the point where we can run it on just a laptop, as we'll see later on. All right, good questions. Okay, so let's go right into the first part of your question, which was how to determine what the best mapping is. The point of this part of the lecture is to give you an understanding of some of the pitfalls of read mapping and what to look out for in your data, and I'm going to discuss it through these cartoons of different mappings for an example read. We have a read here called read A; it's about 20 bases in length, with the sequence A, G, T, A, A, C, T, and so on. Our mapping algorithm, like BWA, has found a candidate mapping location at chromosome 10, position 1020, and another candidate location at chromosome 2, position 2139. Now, the mappings aren't perfect; there are some mismatches. One of them, the mapping to chromosome 10, has a mismatch at the fifth base, and the other has two adjacent mismatches near the end, where this G, T was mismatched against T, C. It's the job of the mapper to figure out which of the two mappings is better. Naively, we would look at this and say, okay, one mismatch is better than two mismatches, so it's more likely that the mapping to chromosome 10 is correct rather than the one to chromosome 2. But then again, we might remember that sequencing errors are more likely at the 3' end of the read.
So maybe it's more likely that these two mismatches are sequencing errors, versus this single mismatch early on at the beginning of the read. We don't know; this is just an example I made up for illustration. But a mapper like BWA needs to formalize this process and give you some measure of the confidence in the two possible alignments. BWA, and Heng Li in particular, pioneered a score called the mapping quality, which, just like base quality, is an estimate of the probability that the chosen mapping for the read is incorrect. It's on the exact same scale as base quality: Q40 means there's a 1 in 10,000 chance the mapping is wrong, Q10 means there's a 1 in 10 chance. BWA will look at the pattern of mismatches, look at the quality scores of the mismatched bases, and use some model of the prior probability of observing a SNP in the human genome, which is about 1 in 1,000, and integrate all that into a final score that weighs the probability that this mapping is correct versus that one. Here we've assigned this one a mapping quality of 10 and this one a mapping quality of 1, where 1 is a very low score. So BWA would choose the mapping to chromosome 10 as the best alignment, and it would put that mapping quality in the BAM output file so that downstream tools have some estimate of how confident BWA was in the mapping. Yeah? Does it take the other reads at that position into account? That's a really good question. No, this calculation is done on a per-read basis. Something you might think of, and there are post-processing programs that do this, is that there are two models here: this mismatch is a SNP, versus this mismatch is a sequencing error. If other reads also have a mismatch at the same position, that gives you confidence that it's actually a SNP, and you might weigh the mismatch differently than if it were a sequencing error. But it's really computationally expensive to jointly look at all of the reads, so all the mappers make the decision to look at the data read by read and just use some prior probability that there's a SNP at a given position. You definitely could do post-processing to weigh evidence from other reads, though. All right, any other questions about mapping quality? Right, so sometimes we get reads that are perfectly ambiguous. Here's another example: another read, also called read A, which has a mapping to chromosome 16 and a mapping to chromosome 3, and for both mappings there's a G-to-A mismatch about two-thirds of the way down the read, in the same position. This is the fundamental repeat problem: this short segment of the genome is present in two different locations with the exact same sequence, so the mapping algorithm has no information to figure out which of the two mappings is correct. We call these ambiguous mappings, because they're perfectly equivalent, with no way of determining the true location. The way BWA handles this is that it randomly selects one of the ambiguous mappings to output into the BAM file, and it gives it a special mapping quality of zero. A mapping quality of zero flags the read as an ambiguous mapping that can't be trusted. So if you're ever looking at alignments in a BAM file and you see reads with a mapping quality of zero, you should automatically think that those reads are repetitive, and you can't rely on them to call variants.
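Here's a toy version of that calculation in Python, so you can see where the numbers come from. A real mapper like BWA weighs the per-base quality scores and a SNP prior rather than the flat per-base error rate assumed here, so treat this as the shape of the idea, not BWA's actual formula.

```python
import math

def mapping_quality(mismatch_counts, p_err=0.001):
    """Phred-scaled confidence in the best of several candidate mappings.
    mismatch_counts: number of mismatches at each candidate location."""
    # Crude likelihood of each candidate: each mismatch costs a factor p_err.
    likelihoods = [p_err ** m for m in mismatch_counts]
    best = max(likelihoods)
    if likelihoods.count(best) > 1:
        return 0                    # perfect tie: ambiguous, flagged like BWA does
    p_wrong = 1.0 - best / sum(likelihoods)
    p_wrong = max(p_wrong, 1e-6)    # cap the score at Q60
    return min(round(-10 * math.log10(p_wrong)), 60)

print(mapping_quality([1, 2]))      # read A: 1 vs 2 mismatches -> ~Q30
print(mapping_quality([2, 2]))      # identical repeat copies   -> 0
```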
Now, a common pitfall: if you have these mapping quality zero reads, you might want to know where else the read might have aligned in the reference genome. Unfortunately, for really high copy repeats, simple repeat sequences, STRs, homopolymer runs, if you tried to align a read from one of these super high copy repeats and output all possible mappings, you'd get millions of mappings for those repetitive reads, and the size of your output files would be huge. So Heng Li made the decision that BWA just outputs one randomly selected mapping and flags it as unreliable. So you're not getting all possible mappings for the read, you're only getting one. Now, Francis brought this up at the beginning: there is something we can do to help this situation, and that is to generate what we call paired-end reads. We have this inherent read length limit in Illumina sequencing, where you can sequence 100 to 300 bases, but there's a feature where you can take a DNA fragment and sequence 100 bases from one end and 100 bases from the other end, generating a paired-end read: you have one read here, then some unknown sequence in the middle, then another read at the end. We approximately know the length of the DNA fragments, they're usually 200 to 400 bases, and if the read mapper knows the reads are paired together, it can use this information to constrain the alignments and help resolve some of these ambiguous mappings. So let's go back to this example with chromosome 16 and chromosome 3, where we have this read that's ambiguously mapped. If we took a read pair from this DNA fragment, we might see that the second half of the pair maps much better to the downstream sequence on chromosome 16 than to the downstream sequence on chromosome 3. The fact that we don't know the sequence in the middle doesn't matter; all we need to know is that this sequence was followed by a gap of about 100 to 200 bases, followed by this sequence. And since both reads have good alignments to chromosome 16, in close proximity, BWA is able to resolve that the true mapping was chromosome 16 rather than chromosome 3. So using read pairs allows us to convert what would be ambiguous mappings into unambiguous mappings. Yeah. Yeah, that's a good question. I used to know how the paired-end chemistry works; I don't anymore. It comes down to how you get both ends of the same template molecule sequenced on the flow cell, with this bridging PCR where the 3' end binds to adapters on the flow cell and the template molecules bend down onto the surface. I could look it up and drop a link in the Slack. You do have to tell the sequencer whether you want paired-end reads or single-end reads. Most whole genome sequencing applications use paired-end reads, just because they're so useful; as you'll see later, they're not just useful for resolving ambiguous mappings, they're also useful for calling structural variation. So it's usually paired-end, unless you have a special application where you only want single-end reads. Yeah. Yeah, you don't observe them, you can't. You'll have other reads that might start in this gap, but for this individual read pair you don't observe any of that sequence, so you can't say anything about what goes on in there. But that's just a single read pair.
You're sequencing a billion read pairs per Illumina run, so if this pair doesn't cover the gap, some other ones will. Yeah. So you have a long DNA fragment of, say, 400 bases; some on-flow-cell chemistry happens, and the sequencing reaction starts from the 5' end and proceeds for 100 bases, then it starts again from the complementary strand at the 3' end and reads another 100 bases. The total length of the DNA fragment is maybe 400 bases, and you've read 100 bases from either end, which leaves about 200 bases unobserved in the middle. All right, so we're coming to the end here. The last thing I want to talk about before we go into the practical is the SAM and BAM formats. SAM stands for Sequence Alignment/Map, and as I've mentioned throughout the lecture, this is the file format for storing reads mapped to a reference genome. SAM is a text format, which means it's human readable: you can open it in a text editor, and you can use various UNIX tools to transform and manipulate the file. But it's very inefficient to store on disk, the file sizes are huge, so there's a binary version called BAM, which is what we usually use to store sequencing data. In the practical, we're going to see how to convert SAM files to BAM files. Most aligners will natively output BAM, so the data goes directly to this binary format, and then you can use a program called samtools to view the text representation of your alignments for regions of interest. So here's what a SAM record looks like, and I'm going to walk through the fields one by one. This should all be on one line, but because the screen would be too wide, I've broken it up into three lines here. The first field is just the read ID: if you think back to when we talked about the FASTQ file, the first line of the FASTQ record was the identifier for the read, and that's copied directly into the SAM file here. The second column is what we call the flag. This is a binary bitmask that compactly represents single-variable flags, like whether the read aligned to the forward strand or the reverse strand of the reference genome, whether the read had a pair, and whether the pair was mapped successfully; it's just a compact way of representing these different states of the read. Later, I'll send you a link to a tool that converts this flag to human-readable output. Next is the location the read mapped to: the chromosome, this read mapped to chromosome 19, and the position along the chromosome, which here is about 8.8 megabases. Next along is the mapping quality, the mapper's estimate of how reliable the mapping is. This read mapped with Q60, which means the mapper estimated there's about a one in a million chance that the mapping is incorrect. When you have paired-end Illumina reads, a very high proportion will map with Q60, where the mapper had very high confidence in the mapping position it chose. That's the maximum; it won't estimate higher than Q60, so it's probably slightly an underestimate, but it's difficult to estimate very rare events, so it caps out there. Next along, we have the CIGAR string. The information up to this point just tells us where in the reference genome the read is; we also want to know how the bases of the read line up against the bases of the reference genome. So I drew these pictures before, where we have these vertical bars drawing the alignment.
The CIGAR string is a compact representation of how the read should be lined up against the reference genome. The CIGAR string for this read is 76M: the read is 76 bases in length, and every base of the read is matched one-to-one against the reference genome from end to end, so there are no gaps in the alignment, no insertions or deletions. If you do have insertions or deletions in the alignment, like in these two diagrams at the bottom here, you're going to get a CIGAR that has D and I operations. The alignment of this read against the reference genome says there's 4M, so four matches, 1, 2, 3, 4, then 1D, one deletion, where we skip this reference base by putting a gap here, followed by another 6M, six matches, 1, 2, 3, 4, 5, 6. So this is just a very compact, disk-space-friendly way of representing how the bases of the read line up against the reference genome. Here's an example with an insertion in the read: we have four matches, one inserted base, and then another four matches. Does anybody notice something odd about this one on the right? I purposely did something tricky here. The reference doesn't have the extra base; it's an insertion in the read, so we've said 1I, this T is inserted here. Sorry? I think somebody said the answer: there's a point mutation, yes. The second base is different between the read and the reference genome. The reason I put this trick in here is that there's a really common mistake, which is thinking that M stands for match, meaning that the base in the read and the base in the reference are identical. It doesn't. It just means they were matched up by the aligner, lined up against each other; it does not mean they have the exact same nucleotide. This was a design choice made very early on in the SAM/BAM format, and it's very confusing. You might want it to say something like: one identical base, one mismatch, two identical bases, one insertion, four identical bases. But for reasons of disk space efficiency, they decided to encode identical bases and mismatched bases using the same operator, this M. It trips up a lot of people, which is why I spend time on it here. Moving on, the next three fields all deal with paired-end reads. If you sequenced a read pair, the record tells you the chromosome the read's mate aligned to, with an equals sign if it's the same chromosome, then the position along that chromosome, and then the length of the DNA fragment, as calculated from the distance between the two alignments. An insert size of 119 means these two reads came from a 119 base pair fragment of DNA. And then finally, we have the read sequence itself, just like in the FASTQ file, and the quality string, also just like the FASTQ. Any questions about the SAM format before we wrap up? What's the N at the beginning? That's a good question. N stands for an ambiguous nucleotide: essentially, the base caller wasn't able to confidently predict what base was incorporated in the first cycle. So we introduce this N symbol: when we don't know whether a base is A, C, G, or T, we represent it with an N. We have many questions; we'll start here. Can you have another operand besides the equals sign there, if the mate is on a different chromosome? I think it's either an equals sign if it's the same chromosome, or the identifier of the other chromosome directly.
So this read was mapped to chromosome 19; if its mate was mapped to chromosome 2, it would say 2 here, or whatever the ID of that reference sequence is. So it's either equals, or the chromosome name. And what would be the point of that, one read on one chromosome and its mate on another? What can you learn from it at this step? So, when we talk about structural variation later: a major class of structural variation is translocations between different chromosomes, and Trevor will probably talk about this later on. Read pairs where one half maps to one chromosome and the other half maps to another chromosome are evidence for there being a translocation, and that's what this field is giving you. Of course, you wouldn't just look at one read pair; one chromosome 19 to chromosome 2 pair is not enough evidence. You'd look for many read pairs mapping to the same position, where one half maps to chromosome 19 and the other to chromosome 2, to call a structural variant. Francis? Oh yes, CRAM. Great, but the first question first. So most aligners and mappers like BWA will also output the reads that couldn't be mapped to the reference genome, so every read that was in your FASTQ file will be present in your BAM file, even if it doesn't align to any chromosome, because it's a novel sequence, or maybe the sequence quality was too low to align it. That's indicated in the flag field we talked about: there's a flag that says the read was not mapped. And there are programs, I think it's called SamToFastq, that will take a BAM file and convert it back to the original FASTQ that the sequencer generated. The reads will be in a slightly different order, but you can still go back to FASTQ from your BAMs. And then the CRAM format. So this BAM format requires about one byte of disk space per sequenced base. A typical Illumina sequencing run might produce 100 gigabases, so your BAM files will be about 100 GB in size. After BAM was developed, Ewan Birney at the EBI and his student Markus Fritz developed a new way of encoding alignment information that allows it to be compressed better: it uses reference-based compression to shrink the size of the alignment information.
The way it works is that if you have a read that exactly matches the reference, you don't need to store the read sequence at all: you can just say this read is a substring of the reference from this position to this position, and you save a lot of space by encoding the reads that way. This is what's called the CRAM format; I don't know exactly, but the C probably stands for compressed, reference alignment map, something like that. It's growing in adoption, to save sequencing archives and genome centers disk space by outputting their alignments in CRAM instead of BAM. The reason it wasn't adopted very quickly is that access to the reads used to be a little slower, but they've done a lot of engineering work on the CRAM format to make it just as fast as BAM, so a lot of pipelines are now switching over to CRAM. It's typically used for archiving data: if you send your sequencing data to the EBI or the NCBI, they're going to convert it from BAM to CRAM so that their enormous disk bill is a little bit lower. How much smaller is it? I'd have to look that up. A lot of the remaining size is the quality strings, which are actually very difficult to compress; pretty much all the remaining space goes to these quality scores, so people are working on ways of representing a quality score with fewer than 8 bits, at the cost of some precision. You back there, and then I'll come back here. So, let's say you wanted to look at your original FASTQ; to take a CRAM file, would you have to go CRAM to BAM and then to FASTQ? No. We're going to work with tools later on that let you directly view parts of your BAM file, and those tools work the exact same way for CRAM files. The main tool package we're going to talk about, which I have on this slide, is called samtools; as I've put here, it's a toolkit for working with SAM and BAM files, and it supports CRAM as well, so you can use the exact same set of tools on CRAM files as on BAM files. Do you have one? So samtools, when you're making BAM files, also generates... which files? Yeah, that's a good one. Often we don't want to look at the entire BAM file in its entirety; we only want to look at, say, chromosome 19 around position 8.8 million, which we were just looking at. There's a command called samtools view, which we'll be working with, where you give it a genomic coordinate and it gives you all of the reads aligned to that coordinate. To make that work, we need something that tells us where in the file the reads are for a particular segment of the reference genome: that's what's called the BAM index, which has the suffix .bai. To generate one, you run a program called samtools index on a BAM file, and it produces that .bai file. You'll run those programs in the practical session.
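If you'd rather explore alignments programmatically instead of through samtools on the command line, the pysam library wraps the same underlying htslib code. Here's a minimal sketch that ties together several things from this lecture: region queries (which need the .bai index), mapping quality, flags, CIGAR strings, and insert sizes. The filename sample.bam and the contig name "19" are placeholders; your reference may call it "chr19".

```python
import pysam

# Assumes sample.bam exists alongside its index, e.g. from:
#   samtools index sample.bam
bam = pysam.AlignmentFile("sample.bam", "rb")   # mode "rc" opens a CRAM instead

# Jump straight to the region we looked at, like `samtools view` with a coordinate.
for read in bam.fetch("19", 8_800_000, 8_800_500):
    if read.is_unmapped or read.mapping_quality == 0:
        continue   # skip unmapped reads and ambiguous (repeat) mappings
    print(read.query_name,
          read.reference_name,
          read.reference_start,      # 0-based, unlike the 1-based SAM text
          read.mapping_quality,      # Phred-scaled, capped at Q60
          read.cigarstring,          # e.g. "76M" or "4M1I4M"
          read.is_reverse,           # decoded from the 0x10 bit of the flag
          read.template_length)      # insert size for paired-end reads

bam.close()
```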