So I'm Jared Simpson, I'm a principal investigator here at OICR, I work just upstairs. I work on algorithms and data structures for sequence analysis. So I design the types of methods that we're going to hear about today for mapping sequence reads to reference genomes, doing de novo assembly, that sort of thing. This module is mapping and genome rearrangements, and I'll talk in two parts. First I'll give a short lecture about what we mean when we talk about mapping sequence reads to a reference genome. Then we'll do a tutorial, then we'll have a coffee break, then I'll talk about what genome rearrangements are and how we can find them, and then we'll do a tutorial on that as well. Please just ask questions as they come up. If anything's not clear, feel free to put your hand up and ask me to clarify something or go over something in more detail. So the objectives of this module are really to understand what we mean when we talk about mapping reads to a reference genome. As you know, this is the cancer genomics bioinformatics course, so often our goal is to sequence a cancer and understand what mutations are present in that cancer genome. The way we do that: first we'll sequence an individual's tumor, we'll sequence the individual's normal genome from, say, a blood sample, and we'll then compare all of those reads to the reference genome to see where mutations are. Mapping reads to the reference genome, which just means taking a read and finding where it might have come from in the human reference genome, is the first step of all of those processes. So I'm going to talk about mapping reads in detail, and I also want you to understand the different file formats that we work with in high throughput sequencing. I'll talk about the FASTQ format and the SAM and BAM formats, which are the common ways of working with sequence data.
Also throughout this lecture I'll talk about common terminology that we use to describe alignments, like base quality scores, mapping quality scores, paired-end reads, that sort of thing. And then later on we'll find out how to find genome rearrangements using read pairs, and you'll have the chance to run a mapper, BWA, and a rearrangement caller during the tutorials. I'll talk a little bit about sequencing platforms to start. I think Francis talked about this a little bit yesterday in the introduction lecture. So, John did? Okay, thanks. For about 10 years now we've had high throughput sequencers, which can produce massive volumes of data, and this is a figure of different sequencing platforms, the time that they run for, and the amount of data you get per run. Up here at the top is the Illumina HiSeq platform, which is by far the most common way of sequencing human genomes these days, and I'm going to really focus on data from that platform, the Illumina platform. But understand from John's talk yesterday that there are alternative sequencing platforms that give different types of data, like much longer reads from the PacBio or Oxford Nanopore systems, or the Ion Torrent system, which produces somewhat longer reads than Illumina but at different throughput. Because Illumina is the dominant player in sequencing, we'll be focusing on that here today. Okay, so Illumina sequencing works cycle by cycle. You start with a single strand of DNA fixed to a slide, and then labeled bases are added one by one. As each base is added, a camera records a fluorescent signal from it. Each of the four bases is labeled with a unique color, and the sequencing process just records which color of base is added at each step. Sorry, do I need to be on both sides? Will that record too? I see. Okay, fantastic.
Yeah, so here in the first step of the sequencing, this T was added, and it gave off a blue fluorescent signal that was measured by the camera over here. Then the sequencing reaction proceeds to the next base. It was a G, and it was recorded here as a green color, then a C, which was recorded as orange, then blue, purple, orange again. So the sequencer is just adding bases, base by base, and recording the color that was incorporated into the growing strand of DNA at each step. Now, Illumina software translates these images, which are incredibly large images where each dot is an individual sequencing read, by decoding them color by color into a nucleotide sequence. So it looks at the stack of images, one taken for each cycle of the sequencing process, and for one of these dots it will turn that into T, G, C, and so on. What we end up with is a FASTQ file, which records the sequence of every one of your reads, along with an estimate of whether the base at each position is correct or not. That's called a quality score, and I'll talk about it in just a few minutes. Now, the sequencing is very reliable.
You get very high accuracy data from Illumina sequencers, but various things can go wrong that cause the machine to record the wrong base at certain positions. When you have these strands of DNA fixed to a microscope slide, they come in a cluster of thousands of molecules all with the same sequence, and this cycle-by-cycle sequencing reaction is happening in the same order for every one of those molecules in the cluster. But sometimes molecules can lag behind, where an incorporation didn't happen, or sometimes they can jump ahead, where too many bases have been incorporated, so they get out of sync. That's called phasing, when some of the strands in the reaction are lagging behind, or pre-phasing, when some of the strands are too far ahead. The problem is that this causes a mixed signal. Instead of having a pure color coming out of each cluster — for example, here we record a gray one, but now we've got one signal that's red at this step and three that are gray — the base caller, when it's trying to translate these images back to base calls, needs to try to deconvolve the fact that some of those strands might be out of sync. That's the source of errors within Illumina data: the signal becomes essentially diluted by strands getting out of sync during the sequencing process. Now, the Illumina software can measure this and estimate whether bases are reliable or not, and the way it indicates that to the user is by outputting what are called base quality scores. This is just a log-scaled probability that the sequencer estimates the base call is incorrect. So if you have a quality score of 10, it thinks there's a 10% chance that the base it outputs in the FASTQ file is incorrect. If you have a quality score of 30, there's about a one in a thousand chance that it's incorrect, and most of the data will fall into this Q30 to Q40 range, which is quite reliable.
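To make the scale concrete, here is a small Python sketch of the conversion between a Phred quality score and its error probability. This is an illustration of the formula only, not any part of Illumina's software:

```python
import math

def phred_to_error_prob(q: int) -> float:
    """Estimated probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> int:
    """The inverse: Q = -10 * log10(P), rounded to the nearest integer."""
    return round(-10 * math.log10(p))

# Q10 -> 10% chance the base is wrong; Q30 -> about 1 in 1000
print(phred_to_error_prob(10))   # 0.1
print(phred_to_error_prob(30))   # 0.001
print(error_prob_to_phred(0.001))  # 30
```

So a Q40 base, at the top of the typical Illumina range, carries only a 1-in-10,000 estimated chance of being miscalled.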
Often, if you see reads that have these very low quality scores, less than 10, you might want to just discard that data, and most software that uses sequencing reads will simply ignore bases where the quality scores are very low. But quite often the quality scores are above 30 and the data is quite reliable. Now, different sequencing platforms have different error profiles. The Illumina sequencer has a fairly low error rate, usually about one in 200 bases on average, and the errors are mainly substitutions, where the sequencer has said maybe there's a T at a position when there's actually a C. The 454 and Ion Torrent sequencers mainly make insertions and deletions in homopolymer runs, and the PacBio sequencer, one of these long-read sequencers, has a much higher error rate of about 15%, where the errors are a mixture of insertions, deletions, and substitutions. When you're working with your sequencing data, it's very good to be aware of these different error profiles for the platforms you're working with, because when you're, say, looking at your reads aligned to a reference genome in IGV, you want to be aware of whether the differences you're seeing with respect to the reference are due to errors or due to true mutations or, say, SNPs. Sure — I should have brought one. It's about the size of a USB stick; I've got one in my office. The Oxford Nanopore works by feeding a single strand of DNA through a nanopore, and it measures the blockage of the electrical current flowing across the pore. From that it tries to predict what sequence was in the pore at the time the measurements were made. And because the current measurements are so small — they're on the picoamp scale — it's very difficult to deconvolve what sequence passed through the pore. So it has a fairly high error rate, roughly equivalent to the PacBio, with accuracy of about 80 to 85%.
It's still very early in the development of the Oxford Nanopore sequencer — it's gone from about 70% accuracy, a 30% error rate, about six months ago, to about a 15% error rate with current data. How does the Illumina system infer the quality scores — based on the intensity of light? Really intense light would have a better quality score, for example? Yeah, that's right. It's essentially the signal-to-noise ratio. It's taking four pictures, one per color channel — it's got filters and a particular laser that excites across some wavelength. One of the signals will be very strong, and the other three signals will be weaker; that's the background, and the score is essentially a ratio between the peak and all the background signal. And those artifacts you get from molecules going out of phase contribute to the background, so if molecules are getting out of phase, then you get lower quality scores and errors start to creep in. How is it able to read one dot without being influenced by all the other fluorescence around it, given the concentration of spots? So the question is, how do they base call off the individual dots — each one of these is a sequencing reaction — without being influenced by the neighboring ones around it, is that right? Yeah, so they've spent a lot of time tweaking the chemistry and how many clusters they put in the flow cell to make sure they're well separated. And the latest Illuminas, the HiSeq X Tens, actually have a patterned array where the clusters only form in certain positions, so that they're well separated from each other. What's the maximum you could actually fit on a single lane? That'd be a good one for John. I think you get something like on the order of 50 million reads per lane, so 50 million clusters. And if clusters do overlap — they can overlap by chance — the software will detect that and not output those reads.
Okay, so here's what the Illumina error profile looks like. Here we're looking at the error rate as a function of position within the read. These are six different sequencing runs that I used as part of an assembly project. You can see that at the 5' end of the read, at the first base, the error rate is very low, and it continually goes up towards the end of the read — if we look at this green line here, the error rate goes up to about two and a half percent at the last base of the read. This is just an effect of the signal-to-noise ratio we just talked about going down as the sequencing reaction proceeds. It's harder to measure the last base than it is the first base. Yup. This one? No, these were different runs. These were six completely separate data sets. It was all Illumina sequencing, but it was done by different centers, different institutes — it wasn't the same sequencer, it wasn't the same run. Yeah, I'm wondering why some stop so early — I mean, 100 positions versus 150. Yeah, so you can tune your read length, how long you want the sequencing reaction to go for. This one was about 150 bases, but the common Illumina runs are about 100 to 125 now. But they are mostly in the Q40 range? Yeah, yeah, especially down here at this end. On average it's Q30 to Q40. Okay, so the raw output that a user gets from an Illumina sequencer is a FASTQ file. This is just a file format that stores sequence reads and their qualities, so I'll briefly walk through the different fields in this file. The first one is the read ID: the sequencer will assign a unique identifier to every read. There's some structured information here. This first field, HWI and so on, is the identifier of the sequencer, and then I believe this is the fourth lane.
And these numbers are the X and Y coordinates of the cluster on the flow cell. You don't commonly need to know that information — sometimes it's good to look at which sequencer produced the data — but often you just use this as a unique ID. Now, the slash one at the end means that this is the first read in a pair. Something I haven't talked about yet is that reads can come in pairs, where you sequence one end of a DNA molecule and then sequence the other end of it. We'll be using that in the genome rearrangement part later, but I'll just remark at this point that the slash one indicates the first read, and a slash two would indicate the second read. Question? There are many definitions of paired-end read versus mate pair. Is there any difference between them, or are they synonyms? Yeah, the difference is how they're prepared. Mate pairs are typically a much further distance apart. But I'm going to leave that, because I have a slide about this later — we'll come back to mate pairs and paired ends in a little bit. Is it advisable to store the images from the Illumina? No, not anymore. There was a lot of debate around this when Illumina sequencing started to take off, because people would argue that the images are the raw data of the experiment and you have to store and archive them. But it's just impractical. The images are very, very large — terabytes per run. So basically the community decided against it. Can you re-process an image, because it's a JPEG file? It is, yeah — if they were stored, you could, but it's just too much data to keep around. If there's a problem with a run, you just resequence it. It's cheaper to resequence than to store all the images. So the second line in the FASTQ file is the actual nucleotide sequence — the sequence of the read, which is fairly straightforward. Does that include the tag or anything like that?
This should be the actual sequence you want to use. In here, there's this GGCTAC in the read ID — that was the barcode that was read, and then it was stripped from the sequence. So this should be usable sequence and you shouldn't have to process it further. Yeah. So when you design an experiment, what do you consider when you decide on read length? These days, within Illumina sequencing, you don't have a really big range to choose from — you can maybe do 100 to 125 — so often it's just whatever your sequencing facility, or whoever you have access to that's going to provide the data, is most comfortable generating, which is usually 100 bases. More generally, longer reads are better for things like structural variation. If you can get very long reads, say you're doing de novo assembly of something, then you want to think about going to PacBio or Oxford Nanopore, because they give you much better genome assemblies. But if you're doing a cancer genome right now, you'd just choose paired-end 100 base pair reads. The third line is again the ID — it's technically the quality ID. Oftentimes this field isn't filled in and there will just be a plus sign at the beginning of the line, because it's essentially redundant information, the same as the first line of the record. That's separate from the quality score? Yeah, it's separate from the quality score. The fourth line here is the quality score, which I've marked with an arrow. I talk about quality scores numerically, say Q30 to Q40, and this is just an ASCII representation of those scores — you can represent the quality of each base in a single character. There are ways to convert these characters to numeric quality scores online; essentially, you just take the ASCII character code for the symbol and subtract 33 from it. Okay, so that's FASTQ. I'll now talk about mapping reads.
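The four-line record and the subtract-33 trick can be sketched in a few lines of Python. The read ID below is made up in the Illumina style, just for illustration:

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record:
    @read_id / sequence / + separator / quality string (Phred+33)."""
    read_id = lines[0].lstrip("@").strip()
    seq = lines[1].strip()
    # lines[2] is the '+' separator (quality ID, usually left empty)
    quals = [ord(c) - 33 for c in lines[3].strip()]
    return read_id, seq, quals

# Hypothetical record, Illumina-style read ID
record = [
    "@HWI-ST1234:4:1101:1000:2000/1",
    "ACGTACGTAC",
    "+",
    "IIIIHHHH??",
]
name, seq, quals = parse_fastq_record(record)
print(quals)  # 'I' is ASCII 73 -> Q40; '?' is ASCII 63 -> Q30
```

Running this shows the last two bases at Q30 and the rest at Q39 to Q40, the reliable range we just discussed.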
Mapping reads to reference genomes: the goal is simply, now that we have this huge set of reads that the Illumina sequencer gave us, we want to find out where each one of them came from. The issues here — and something that people like me who work on algorithms for sequence analysis have had to deal with over the last 10 years — are that the human genome is incredibly large and very repetitive. Over half of the genome is made up of transposons and other types of repeats, and finding where a very short 100 base pair read came from in that repetitive genome is quite challenging. So there's been a lot of work in designing data structures and algorithms that can accurately find the placement of an individual 100 base pair read on the reference genome. And of course high throughput sequencers produce a huge amount of data, and there are a lot of differences between any individual and the reference genome — for example, one in a thousand positions will be different due to SNPs, then you have insertions and deletions and structural variation. So the mapper needs to consider all of this: it needs to allow for differences between the reads and the reference genome, it needs to not place reads at the wrong copy of a repeat, and it needs to be fast enough to handle all of the data that Illumina will produce. So here are really the three criteria we look for in a mapper. It needs to be accurate: if a mapper aligns reads to the wrong place in the genome, for example if it gets confused by repeats, that's the main source of false positive variant calls when you go to find mutations from your sequencing data — just the mapper putting the read in the wrong place. The mapper needs to be sensitive: it needs to account for the differences between the individual that we sequenced and the reference genome. And it needs to be fast: we need to be able to do this within a reasonable timeframe, so you can finish your project and the informatics cost doesn't outweigh the cost of sequencing the genome.
So there have been many, many mappers published in the last 10 years — that's probably an understatement; I think there were at least 70 at last count. Luckily, you don't need to pick from one of 70: the field has standardized around a few very popular mappers. But if you're keen to understand the differences between mappers and how they address the three criteria I listed on the last slide, you can look at this review paper that was in Bioinformatics a few years ago, which fairly comprehensively lists the strengths and weaknesses of each of the mapping programs. That said, we're going to be using a mapping program called BWA, particularly the BWA-MEM variant of the program. This is a very popular program written by Heng Li, and it's largely the standard way of mapping Illumina reads to a reference genome. So we'll just take that and run with it, without having to worry too much about picking a specific mapper for the problem. Okay, so here's sort of a cartoon of what a mapper does. It's taking a single sequencing read, which is depicted by this red bar at the bottom, and trying to predict where on the reference genome that read came from. It does that by computing an alignment between the read and all the different places it could have come from on the reference genome, scoring them in some way, and then picking the one with the best score. So in this example here, we have three alternative alignments: it could have aligned here with no mismatches, here with one mismatch, or here with two mismatches. Yeah? Later, can it figure out — say you're doing a cancer project and you end up with a whole bunch of reads that have those two mismatches, you know, that are missing from our reference...
...does it ever go back and try to actually map them there, taking into account that you see that again and again — does that make sense? So, yes and no. Yeah, it does make sense. Analysis pipelines are very long — they consist of 10 to 20 steps, where mapping is essentially just the first step. And when you go to find mutations, often the mutation callers, which I think you'll hear about tomorrow, will do realignment. They'll take putative mutations and say, okay, now that I know there might be a mutation here, do the reads fit better to the reference, if that makes sense? So yeah, they're essentially trying to learn the differences and then remap to them. And depending on the quality of those bases, the aligner might actually choose one alignment over another? Yeah, exactly — it does exactly that. The scoring function weights it. It's essentially a probabilistic model: it asks what the probability of the data is given this reference location or that reference location, and those calculations incorporate the quality scores. So how is it calculated? Yeah, that's the next slide. So mapping quality is essentially on the same scale as the base qualities, and it's an estimate of the probability that the position the mapper chose for an alignment is incorrect. If we go back to this example, there are three different alternatives, and the mapper will give a quality score. It will choose this one, say, because it has no mismatches, but the mapping quality will reflect the probability that the true alignment was actually this one or this one. And to calculate that, it's going to use the base quality scores, as we just talked about, to estimate the mapping quality. So if you have a mapping quality of Q30, about one in a thousand reads with a Q30 alignment will actually be mapped incorrectly.
This concept of mapping quality is incredibly important when you're trying to find mutations, because the alignments have to be correct. There might be a very strong signal of a SNP at a certain position, but if the read is mapped in the wrong place, then that's not a true SNP — it's just an artifact of the mapping algorithm. So often when you go to look at mutations in IGV, IGV will color the read alignments by mapping quality. A white alignment is a Q0 mapping quality, which is essentially just a randomly chosen position and can't be trusted at all. You want to keep this in mind when you're looking at your alignments. Even with short reads, how does it map — how does it know which position is the best? Yes, so internally the mappers will usually have a data structure which is essentially an index of the genome, and they use that index to look up candidate locations where the read could map. That's usually something quite simple, based on exact matching of strings between the reads and the reference genome. The mapper then takes those candidate locations — it might say, okay, this read might align to this position on chromosome one, this position on chromosome seven, et cetera — and performs a more detailed alignment using dynamic programming, which gives you this sort of picture of the exact differences between the read and the reference genome. Then it uses a statistical model to estimate the probability that the mapping is correct or not. So what's the relationship between base quality and mapping quality? Yes, so the base qualities are the individual probabilities that the sequencer has called the wrong base at a position. The mapping quality is for the whole read, and it's the probability that the read was placed incorrectly on the reference genome — that it was mapped incorrectly. Sorry?
No, not directly related. They're not directly related, no. The mapping quality calculation uses base qualities when it's trying to assess whether a difference is a true variant or just a sequencing error, but it's not a direct relationship. If you have paired-end reads, does it take the second read into account when trying to resolve low-quality base calls? That's an excellent question — but this one back here first. No, it's good, it's good — it makes my talk feel like it flows really well together. And how reliable are the quality scores? That's a good question. They're better than they used to be. It used to be that low quality scores would be underestimated and high quality scores would be overestimated, quite badly, actually. The Broad Institute wrote a program called Base Quality Score Recalibration — you can guess what it does. It tries to fix this by adjusting the low quality scores upwards and the high quality scores downwards to reflect the true error rates. Nowadays the quality scores are good enough that often you don't have to do recalibration and you can just take them on faith. Did they ever publish their recalibration for resolving indels? Yeah, so there are two separate things: there's Base Quality Score Recalibration and there's Indel Realignment. GATK's Indel Realignment will take a database of known insertion and deletion positions and then realign the reads around those indels to try to improve the alignments. Indels are sort of a difficult case for mapping programs, because if a read aligns across an indel near its end, the mapper will often just introduce mismatches rather than opening up a gap at the end of a short read. So the Broad wrote this Indel Realignment program that fixes that problem and allows gaps at the ends of reads. Yeah, it does look at all the possible alignment positions. So it would say...
So the mapping quality would take, say, these three mapping positions — one with no mismatches, one with one mismatch, one with two mismatches — and calculate the probability, given that this is the true alignment, that this mismatch is a sequencing error. That uses the base quality. It would also calculate the probability that these two mismatches are sequencing errors, using their base qualities, and the probability that this alignment has no sequencing errors. Then the mapping quality is set as a function of those three calculations. Does that make sense? No, usually, for most regions in cancer genomes, mutations are fairly well isolated — it'll be a single point mutation — and mappers are usually pretty good at accounting for a single difference, or just a few differences, between the reads and the reference genome. If you have a very rearranged region, or somewhere there's been hypermutation, you can get misalignments in those regions. But most of the mutations in a cancer genome are on their own, and the mappers are well able to account for that. So how does it distinguish a sequencing error from a heterozygous variant? That's where the quality scores come in. A sequencing error will tend to have a very low quality score — it might be a Q5 base, and the mapper will think, well, there's roughly a 30% chance that base is a sequencing error — whereas at a heterozygous position it might be a Q40 base, so the mapper would be very unlikely to call that a sequencing error rather than a true het. Some mappers report alternative hits; BWA doesn't. BWA is what's called a best mapper: it reports the best alignment for the read and a mapping quality score, but it won't report alternative hits. If we go back to this review, it talks about different mapping strategies, and multi-mappers will be included in there. Is CASAVA included in this paper? CASAVA is Illumina's aligner that comes bundled with the HiSeq. I think it might be too recent.
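The calculation just described — scoring each candidate location by how plausibly its mismatches can be explained as sequencing errors, then asking how likely it is that the best one is wrong — can be illustrated with a toy Python model. This is a sketch of the idea only, not BWA's actual statistical model:

```python
import math

def alignment_likelihood(mismatch_quals):
    """Likelihood of the read given one candidate location: each
    mismatch is explained as a sequencing error with probability
    10^(-Q/10). Matching bases are ignored in this toy model."""
    p = 1.0
    for q in mismatch_quals:
        p *= 10 ** (-q / 10)
    return p

def mapping_quality(candidates):
    """Phred-scaled probability that the best-scoring candidate
    location is NOT the true origin of the read."""
    likelihoods = [alignment_likelihood(c) for c in candidates]
    best = max(likelihoods)
    p_wrong = 1 - best / sum(likelihoods)
    if p_wrong <= 0:
        return 60  # cap for effectively unambiguous alignments
    return min(60, round(-10 * math.log10(p_wrong)))

# Three candidates: no mismatches, one Q30 mismatch, two Q30 mismatches
print(mapping_quality([[], [30], [30, 30]]))  # 30
```

With a single unique hit the toy model returns the Q60 cap; with two equally good hits it would return Q0, the "randomly chosen position" case we saw in IGV.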
I think — CASAVA, John, do you know when CASAVA was introduced? It's been around a while? Okay, so it might be in there, but we don't know. Okay, so on to paired-end reads. Paired-end reads are, as I said earlier, when you sequence both ends of a DNA fragment. So you might take a 400 base piece of DNA and sequence the first 100 bases of it and then the last 100 bases of it. Now you've got 200 bases of sequence separated by a gap of unknown sequence of length 200. This is actually very useful when you're trying to map reads to a reference genome. If we go back to our example, now we're trying to map this read pair to the reference genome, where we know the sequence at the start and the sequence at the end — these black bars here — and we don't know this red sequence in the middle. If we just map the first read of the pair to the reference genome, we might find two hits that are roughly equivalent; they both have a mismatch at this position. But because we have the additional information of the second read of the pair, when we map it to the reference genome, it might resolve what the true mapping is. Here we've mapped the second read with no mismatches in the first alignment, and with six mismatches in the second alignment, so it's much more likely that this is the true alignment rather than this one. This again goes into the mapping quality calculation, and a mapper like BWA would certainly choose this as the true alignment rather than the one with a much poorer alignment for the second read of the pair. Okay, so BWA will write all this information into what's called a SAM or BAM file. This is a standardized format that was developed around the 1000 Genomes Project as a way of working with alignment data. SAM is a tab-delimited text file — at the bottom here is one line of SAM — and BAM contains the same information in a compressed binary format. So they're inter-convertible.
You can convert from SAM to BAM and BAM to SAM. This is really an important format to understand when working with sequencing data, so I'll go through the fields of the SAM file in order. The first one is just the read ID — the same read ID that was in the FASTQ file I showed earlier. It lets you figure out which read each line of the SAM file refers to. The second field is what we call a flag. This flag indicates which strand of the reference genome the read aligned to, whether it has a paired-end read, whether the paired-end reads mapped in the expected orientation, and all sorts of other information that BWA has inferred from the alignment. There's a very nice document online — I should have included a slide on it — which will take one of these numerical flags and tell you what information is set in it. Yeah, there's a link; it's on Picard's webpage. That's a very good resource, because this is just a bit flag and it's impossible to interpret by eye — I can't look at 99 and tell you what that means. You essentially have to look up what the flag is trying to get across. The third and fourth fields are the chromosome of the reference genome that the read aligned to and the base position; that's relatively straightforward. The next field is the mapping quality that we've discussed. In this example the read was mapped with Q60, which is roughly a one in a million chance that the alignment is incorrect. Often when you map 100 base pair reads with BWA, the strongest alignments will come out as Q60 — these are very trustworthy alignments. Does the SAM file tell you which reference genome was used? No — you could argue it should. But you can add extra information into the header of the SAM file, and often there'll be a header line in there with the path to the reference genome. So when you run the mapping, yeah, you specify that extra documentation.
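The lookup that the online tool performs can be sketched directly from the bit definitions in the SAM specification — each flag is just a sum of these powers of two:

```python
# SAM FLAG bit definitions, per the SAM specification
SAM_FLAGS = {
    0x1:   "read is paired",
    0x2:   "pair mapped in proper orientation",
    0x4:   "read unmapped",
    0x8:   "mate unmapped",
    0x10:  "read on reverse strand",
    0x20:  "mate on reverse strand",
    0x40:  "first read in pair",
    0x80:  "second read in pair",
    0x100: "secondary alignment",
    0x200: "failed quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag: int):
    """Return the human-readable meaning of each bit set in a SAM flag."""
    return [desc for bit, desc in SAM_FLAGS.items() if flag & bit]

# 99 = 1 + 2 + 32 + 64
for meaning in decode_flag(99):
    print(meaning)
```

So that opaque 99 decodes as: paired, properly paired, mate on the reverse strand, first read in the pair — a perfectly ordinary forward read of a concordant pair.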
Yeah, BWA can take in essentially an arbitrary header that just gets pasted into the output SAM file. So the next field is the CIGAR string. This lets you represent the alignment between the read and the reference genome, and it's encoded as a number followed by a symbol. So 76M means that 76 bases of the read were matched to the reference genome. Now, if there are insertions and deletions with respect to the reference genome, they show up like this. Here the read has a single gap where this T is missing, and the CIGAR string says there are four matches — that's here, ACGA — then one base deleted from the reference, then six more matches. And if there was an insertion in the read with respect to the reference genome, it would look like this: 4M, 1I — one inserted base — 4M. So this lets programs, or you, decode how the read actually aligns to the reference genome, and SNP callers use this to figure out where the variants are. What if it's a mismatch? Yeah, so it always uses M for matches and mismatches — M just means that a read base aligns to a reference base. There are extended conventions where you can indicate the positions of mismatches. For example, here there's a mismatch, and you could represent this as 1M, 1X — which is a mismatch — 2M, 1I, 4M. But BWA and most mappers use the simplified convention where M just represents aligned bases, rather than distinguishing matches and mismatches. Then, if you have paired-end reads, there are a couple of fields that describe the relationship between the pairs. If there's an equals sign in this field, it means the mate mapped to the same chromosome as the read that this SAM record refers to. The next field is the position of the mate, and the field after that is the insert size. The insert size is the distance between the ends of the DNA fragment that was sequenced, which is shown here at the bottom.
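The CIGAR decoding described above is mechanical enough to sketch directly — a minimal parser (not how any particular caller implements it, just the convention from the SAM spec):

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples, e.g. '4M1D6M' ->
    [(4, 'M'), (1, 'D'), (6, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar):
    """Number of reference bases the alignment covers. M/D/N/=/X consume
    reference positions; insertions (I) and soft clips (S) do not."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
```

So the deletion example `4M1D6M` spans 11 reference bases (the 10 aligned read bases plus the deleted T), while the insertion example `4M1I4M` spans only 8 — the inserted base consumes read sequence but no reference position.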
And then the next two fields are the actual sequence and its quality scores. That's the same information as in the FASTQ file; it's just duplicated here so the SAM is self-contained. So when you sequence a paired-end read, you have this DNA molecule, you sequence from one end, and you sequence from the other end. The insert size is the length of the total DNA molecule you sequenced. Could it be zero, though? Well, most of your fragments will be much larger — I think in the next part we'll talk about the distributions of insert sizes that you get. Usually you'll size-select the DNA in some way, usually by running it on a gel, and then you'll get an insert size distribution which looks like a normal distribution around 300 bases, with, let's say, a 30-base standard deviation. And most of your reads will fall within that distribution. But I'll leave that for the next part. Okay, so some resources. SAMtools is the toolkit that everybody uses for working with SAM/BAM files. It's got a converter to go from SAM to BAM and vice versa; you can sort the alignments either by read name or by aligned position on the reference genome; and you can extract alignments for given regions of the genome that you're interested in. We'll be working with SAMtools in the tutorial that's coming up in a few minutes. This document here is the SAM specification. It's the technical document describing what each field of SAM means and how to interpret it — if you were the author of a mapper, this is what you'd look at to figure out how to write SAM. If you have questions or you need help, the SAMtools help mailing list is very good. Heng Li is the owner of that mailing list, he'll answer questions from users, and there's a large community of people who are keen to help new users of SAMtools. Or you can ask on Biostars or SEQanswers, which are Stack Overflow-like sites for sequence analysis.
These are all great resources for getting help. So I just want to go over a few issues in IGV, as we'll be using it in the tutorial. I think there was an IGV session yesterday — is that right? Yeah. So when you're looking at alignments in IGV, you usually want to look at the mapping quality of the reads. In this case, all of these bars are colored gray, which indicates quite high mapping quality. This example is a tumor/normal pair, and this is what a standard somatic mutation will look like. The tumor is on the bottom here, and there's a consistent difference — where I can actually read what base that is, I think it's a G — where there's a G in the tumor, and at the same position in the normal it's just the reference base. So there's no mutation in the normal. This is about as good as mutations will look when you're sequencing a cancer genome. Since not all bases were changed, does it mean that the mutation is heterozygous, or that the tumor is heterogeneous? It can be either, and telling those apart is actually rather difficult. But with a cancer mutation, it's essentially never going to be entirely one base, because that would require the reference copy to have been lost. If the depth is even and you see one alternate base like this, it usually just means one of the two chromosome copies was mutated. So I put this slide up because I'm going to show you what alignment problems look like. As I said before, alignments are the biggest source of false positive variant calls, and a ton of work has gone into understanding where alignment goes wrong and trying to eliminate the false positives. These are BWA alignments for a reference sample called NA12878, and this is the centromere of chromosome 2. As you can see, all of these colored bars are differences between the reads and the reference genome, and there are vast numbers of differences here.
And the read depth, which is plotted in this track here, goes up and down — the depth is well over 100,000 in this region, and in the 10,000s here. Really, it's a very difficult region in which to understand what's going on. Now, the problem here isn't that this individual is very different from the reference genome. It's that the centromere is so repetitive that BWA just can't reliably map reads to it, so it places reads wherever they best fit, and you end up with a lot of very poor quality alignments that would cause false positive variant calls if you ran a variant caller over them. So what you'll often do is screen out these very difficult regions. If you see variants in the centromere, temper your excitement, because they're probably not very reliable. A different way of looking at this: this is one of the figures from the 1000 Genomes paper, which uses inheritance patterns — calculating the probability that the variant calls are in Hardy-Weinberg equilibrium as a proxy for whether the variants can be trusted. And we see that the variants around the centromere of this chromosome — I'm not sure which one it is, I think 17 — are far out of Hardy-Weinberg equilibrium. This indicates that the variants found by the 1000 Genomes Project around this region are probably not true variants, just mapping artifacts. So that's something to watch out for. You also see at the telomeres, which tend to be quite repetitive too, that you get a lot of variants that are probably artifacts. Now, one weak spot of mapping 100-base-pair reads, using any mapper really, is that they can't deal very well with large gaps in the alignments. This is a real tumor genome where there's a large deletion in the tumor with respect to the normal, and BWA has only found one read — this one here, shown with this gap, this thin black bar — that spans it.
And the other reads that should show evidence for this long deletion were soft-clipped: BWA just stopped aligning the read at a certain point and gave up. So if you ran a variant caller on this, you probably wouldn't see this deletion, because there's only one read showing the evidence for it. If BWA were able to handle larger gaps in the alignments, this would come across as a very strongly supported deletion, with about 10 reads showing evidence for it. I think this will be covered in more detail in the mutation calling section tomorrow. Okay, so we'll start the read mapping exercise now. Before that, are there any more questions about mapping and alignments? Okay, welcome back. This part of the module will be a little bit shorter, and I'm going to specifically talk about structural variation and how we can find it. Other parts of this course are dedicated to different mutation types, like single nucleotide variants, which we'll hear about tomorrow, and short indels — Sarah will be talking about those too. But here I'm going to talk about structural variation, and what I mean by that are very large-scale changes to chromosome structure: large insertions and deletions, where a lot of sequence has either been added into the genome or deleted from it; inversions, where an entire stretch of the genome has been flipped; translocations, where material has been swapped between chromosomes; and copy number variation. Copy number variation is a little bit special, and again, Sarah is going to talk about that tomorrow morning. So I'll be dealing with the other three structural variation types: the large insertions and deletions, inversions, and translocations. The way we infer structural variation goes back to paired-end reads. We had a question this morning about the difference between paired-end reads and mate-pair reads, and I'll describe that now.
So here's how we produce paired-end sequencing data, the type of data we heard about earlier today. We start with genomic DNA, shear or fragment it, and size-select it into roughly 200-to-500-base-pair fragments. We then add sequencing adapters to both ends of the fragments, and then sequence from one end in and then the other end in. We call those paired-end reads — that's what we aligned this morning. The important thing here is that the information you're getting is fairly short-range: 500 bases is still fairly small compared to the size of an entire chromosome. Now, to discover even larger structural variation, we want to increase the separation between the ends of the pairs, and to do that we make what are called mate-pairs. Mate-pairs have a different preparation where, again, we take genomic DNA, but we fragment it this time into pieces of up to 20,000 bases, and then we make this long 20 kb piece into a circle. We then enrich for the sequences that were correctly circularized, and then we shear the DNA near the point where we joined the two ends. We get a fragment that looks like this, where this yellow bar here is the point of circularization. We then do the same sort of sequencing as paired-end sequencing, where we sequence from either end of this fragment. But now, because this blue part of the DNA was from one half of the circle and this part was from the other half, the physical locations of these can be up to 20,000 bases apart. So the circularization procedure — we call these mate-pair reads — allows a much larger separation when we map the pairs to the reference genome. Is that all clear? You'll get opposite strands for this too, right? Yes — the orientations are now different, and I'll come back to the orientations of reads in a minute. With paired-end reads, we expect the reads to be pointing toward each other, like this.
With mate-pair reads, they point away from each other. Yeah? Why not do a size selection of 20 kb instead of 500? So the Illumina sequencer has a limit on how long the molecules can be when you put them on the sequencer, which is effectively about 600 bases, and that's why we need this molecular biology magic to get much larger separations. Okay, so going back to paired-end reads: as I just said, there's an expected orientation. The reads should be pointing toward each other, where one read is on the forward strand of the reference and the other read is on the reverse strand. This is important when we try to infer structural variation, as the orientations may change depending on the type of variant. Now, when we perform the size selection, which is typically done by cutting out a band on a gel, it puts a constraint on the size of the DNA fragments we're going to sequence. Typically they'll follow a normal distribution with a mean of about, say, 300 to 400 bases and a standard deviation of about 10%. And that's really important information when we go to infer which type of variant is present: if pairs map much further apart than the range of insert sizes our library preparation says they should have, it might indicate a deletion — I'll give you examples of that. So when we align our read pairs to the reference genome, the ones that match the expected orientation — pointing toward each other like this — and fall within the fragment size distribution are called concordant read pairs. Those are the ones mapping as expected. Discordant read pairs are the ones mapping not as expected, and those are the ones that might give evidence of structural variation. Note, though, that not all discordant read pairs necessarily indicate structural variation.
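The concordant/discordant distinction above reduces to two checks — orientation and separation — which can be sketched as follows. This is a simplified toy (it assumes read 1 maps upstream of read 2, measures separation between start positions rather than fragment ends, and uses a made-up three-standard-deviation cutoff; real callers estimate the insert distribution from the data):

```python
def classify_pair(pos1, strand1, pos2, strand2, mean_insert, sd, n_sd=3):
    """Label a read pair concordant or discordant by orientation and separation.

    strand1/strand2 are '+' or '-' reference strands; a concordant paired-end
    pair has the upstream read on '+' and the downstream read on '-'
    (pointing toward each other), at roughly the library's insert size.
    """
    inward = strand1 == "+" and strand2 == "-"       # reads face each other
    separation = pos2 - pos1
    within = abs(separation - mean_insert) <= n_sd * sd
    return "concordant" if inward and within else "discordant"
```

With a 300 ± 30 base library, a pair mapping 300 bases apart and facing inward is concordant, while the same pair 5,000 bases apart — or facing the same way — is discordant and becomes candidate evidence for structural variation.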
As we talked about this morning, there are a lot of mapping artifacts where BWA might just put reads in the wrong place, and those can also look like discordant read pairs. So typically we require evidence from multiple discordant read pairs to call a structural variant, and the reads must have fairly high mapping quality as well to count as evidence. What kind of coverage would you need to get enough read pairs to call? Yeah, that's somewhat difficult to answer. Usually, if you're doing an integrated analysis of everything — SNPs, indels, structural variation — you'll have 30x coverage, which is more than enough to call structural variation. But you can get away with a lot less. Maybe I'll ask Andrew — Andrew, you're a structural variation guy, what coverage would you suggest for an SV analysis of a genome? 10x, yeah. But then what we did is we leveraged the fact that we also had matching data. I see — right, so if you have other data, then you can get away with lower coverage. The standard recommendation is 30x, though. What about different individuals or different sequencing runs — can you merge information from different individuals to increase the power to find structural variation? In a population context, you can. If it's a structural variant that's segregating in the population, you can sequence more individuals at low coverage and then find things that are at high frequency in the population. That's essentially the design strategy for the 1000 Genomes Project, where they sequenced 1,000 individuals at 4x and then called structural variation off that. For a tumor sample, you can't, because the things you're interested in are essentially private mutations, so you're stuck with sequencing one individual at high depth.
But if you're interested in polymorphic structural variation, you can merge information from different samples. And since most somatic structural variants are private, if you see something across multiple samples, it's probably a false positive or a germline variant rather than something tumor-specific. Yeah — that's a good way of getting rid of false positives. So just to say that in a different way: often what we want to do is sequence the individual's normal genome and the tumor genome, call structural variation in both, and then see what's only in the tumor and not the normal. And what Andrew's suggesting is that if we pooled the normal samples across multiple individuals, we'd have more power to find the polymorphic germline events that we're not so interested in. So that's one strategy. Okay. Now I'll talk about patterns of paired-end reads that indicate structural variation. I'll be using this sort of diagram for the next few slides. Here we've got a reference genome and a donor genome, which is the individual that we've sequenced. In this example, the donor has a deletion with respect to the reference: this red block that's present in the reference is not present in the donor genome, and we've introduced a gap here, depicted by these dashed lines. Now, when we sequence paired-end reads, we're sequencing the donor genome, and they should look like this: read one is pointing in this direction, read two is pointing at read one, and the distance between them is roughly 300 bases. But because the donor genome doesn't have this large block of sequence that's in the reference, when we go to align these reads to the reference genome, they map further apart than expected.
This read would map over here, this read would map over here, and when BWA writes out the SAM file, the insert size is going to be whatever the true separation is plus the size of this red block of sequence. So the signature of a deletion is reads mapped in the expected orientation, but much further apart than we'd expect based on the insert size distribution. The opposite of that is when there's an insertion in the donor genome — the individual's genome that we sequenced — with respect to the reference. Here, the red sequence is present in the donor but not the reference, and when we sequence around it, the pairs will map much closer than expected, because now the insert size is the true insert size minus the length of the sequence that's not present in the reference genome. And again, the pairs map in the expected orientation, pointing toward each other. Then we have more complicated structural variation, like tandem duplications. In this case, some part of the donor genome has been duplicated — a second copy made in the same place. Now, if we have a read pair where one read is at the end of the first copy and the other read is at the beginning of the second copy, when we map them to the reference genome, this read, at the end of the copy, goes here, and this read, at the beginning of the second copy, goes here — and now the reads have mapped in the wrong orientation. They're pointing away from each other rather than toward each other. So when you look for structural variation patterns, you can look for this signature of tandem duplications. Similar to that, we have signatures of inversions. Here, this blue sequence, which runs in this direction in the donor genome, has been flipped — it's been inverted, and it now runs in the opposite direction relative to the reference genome.
So when we sequence it, we have one read here and one read here; this one gets mapped all the way over here, and it's been reversed; this one gets mapped in the same orientation. Now the reads are pointing in the same direction — that's the signature of an inversion — and the reads are also much further apart, because this one has been moved all the way over here. In the exercise coming up, we'll see an example of what a real inversion looks like in the BAM file you mapped earlier. So here's a summary of what we expect for the different types of structural variation. When there's an insertion in the donor with respect to the reference genome, we expect the mapped distance to be too small, but the read pairs will be in the correct orientation. When there's a deletion, we expect the distance to be too large, again in the correct orientation. Inversions have the pattern where the reads point in the same direction; tandem duplications have the pattern where they point away from each other. And if there's an interchromosomal exchange, we'll have mappings where one read maps to one chromosome and its mate maps to the other, and they can be in any orientation. Okay — now, if there's a very long insertion in the donor genome with respect to the reference, we might actually miss it. If the inserted sequence is longer than the insert sizes in our distribution, we won't have pairs that map with one read on one side of the insertion and the other read on the other side. So we might not be able to see very large insertions. What will typically happen is we'll have one read here and the other read in the novel sequence, so maybe only one end of the pair maps to the reference genome. That's something to watch out for: there's a limit to how large an insertion we can find from, say, 300-base-pair paired-end data. This one?
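The summary above amounts to a small decision table, which can be sketched as code. This is only an illustration of the classification logic — real callers like LUMPY combine many pairs, split reads, and mapping qualities before making a call, and the orientation labels here ('inward', 'outward', 'same') are informal names for the diagrams, not standard terminology:

```python
def sv_signature(chrom1, chrom2, orientation, distance, mean_insert, sd, n_sd=3):
    """Map one discordant pair's geometry to the SV type it suggests.

    orientation: 'inward'  (-> <-, the expected paired-end pattern),
                 'outward' (<- ->, tandem duplication pattern),
                 'same'    (-> ->, inversion pattern).
    """
    if chrom1 != chrom2:
        return "translocation"           # mates on different chromosomes
    if orientation == "same":
        return "inversion"               # reads point the same way
    if orientation == "outward":
        return "tandem duplication"      # reads point away from each other
    if distance > mean_insert + n_sd * sd:
        return "deletion"                # correct orientation, too far apart
    if distance < mean_insert - n_sd * sd:
        return "insertion"               # correct orientation, too close
    return "concordant"                  # nothing unusual about this pair
```

Running it against the lecture's examples: an inward pair 5,000 bases apart on one chromosome suggests a deletion; a same-direction pair suggests an inversion; mates on two chromosomes suggest a translocation.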
Okay, so let's say we've sequenced 300-base-pair paired-end reads, and there's a 1,000-base-pair insertion into the genome. To find this with paired-end reads, we typically need one half of the pair on one side of the insertion and the other half on the other side. But because the event is much larger than our insert size, we can't have that — there aren't going to be pairs that span all the way from one end of the insertion to the other. This isn't a problem with deletions, because deletions spread pairs out, and you'll always have one read on one side of the deletion breakpoint and the other read on the other side. I think this will become a lot clearer when we do the exercise, where we'll go through these types of events and see the actual patterns in IGV. Do you have a question? So then where do these bits, the red bits, end up mapping? It depends on what type of insertion it is. Transposons, for example, jump around the genome, so you can have transposon insertions — and if it's a transposon that just moved around the genome, then reads containing sequence from the red bit will simply align somewhere else in the genome. If it's truly novel sequence that's just not in the reference genome, it's likely the read won't map at all. Yeah? Could it happen that a read runs into the insertion — catches it halfway through — and then gets truncated by the aligner? Yeah, that's definitely a signature of these events. You'll have exactly what you describe: the part of the read that matches the reference will align, and then BWA will truncate the read and add soft-clip tags that say, I haven't aligned the last 30 bases of this read. And that leads into the next slide, which is about a different type of signature.
So far we've talked about using just the distance and orientation between paired-end reads to infer structural variation, but you can also have reads that are split, where the alignment is broken into two pieces. For example, here there's a deletion at this point. This read sequenced across the deletion breakpoint, and when BWA went to map it, it mapped one part here and the other part here, with a large gap in between indicating the deletion. Now, if this were an insertion, it might just align half of the read, as we just said. Do you have a question? No? Okay. So, this is a cancer bioinformatics course, and we're interested in sequencing tumors, so I'll talk a little bit about somatic versus germline variant calling. We're almost always going to sequence both a tumor sample from an individual and the individual's normal genome, and we're typically only interested in the structural variants that occurred in the tumor, not the ones present in the normal. There are two approaches to finding the tumor-specific structural variation. The first is running a structural variation caller on just the tumor, running it on just the normal, and then looking for things that were called in the tumor and not the normal. The second is finding structural variation in the tumor and then using a program that searches for evidence of the same structural variant at the read-pair level in the germline sample. There are programs implementing both of these approaches. We're going to use a program called LUMPY in the tutorial, which will call both tumor variants and normal variants internally and then filter out the germline variants automatically for you. But structural variation detection is a very active field, and different programs implement different approaches for enriching for just the tumor mutations.
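The split-read signature described above often shows up in the SAM file as large soft clips in the CIGAR string. A small sketch of how one might flag such reads as candidate breakpoint evidence — a toy filter with a made-up threshold, not any particular caller's heuristic:

```python
import re

def soft_clip_evidence(cigar, min_clip=20):
    """Return (left_clip, right_clip) lengths when either soft clip is at
    least min_clip bases long, else None.

    Many reads with a large clip at the same reference position are one
    line of split-read evidence for a breakpoint there.
    """
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    if not ops:
        return None
    left = int(ops[0][0]) if ops[0][1] == "S" else 0
    right = int(ops[-1][0]) if ops[-1][1] == "S" else 0
    if max(left, right) >= min_clip:
        return (left, right)
    return None
```

So a read with CIGAR `70M30S` — seventy aligned bases, then thirty the aligner gave up on, exactly the "I haven't aligned the last 30 bases" case from the previous answer — is flagged as `(0, 30)`, while a fully aligned `100M` read is not.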
On that point, I'll talk about the performance of structural variation programs. As I said, this is a very active field of development. Unlike read alignment, where the field has largely settled on BWA-MEM as the read aligner, there's no one true structural variation caller. There are a lot of programs that perform well in certain contexts and not so well in others. Paul Boutros here at OICR is organizing essentially a benchmarking project, which benchmarks different structural variation callers, and there's a live leaderboard you can look at to see which programs and pipelines are currently the state of the art. So this is a good resource when you're trying to select which program to use. For the tutorial, I chose LUMPY, as it's a very easy program to use and good for demonstrating how structural variation callers work. But I encourage you to have a look at this leaderboard and see what alternative programs are out there. Now, jumping back to this slide about gene fusions — we'll hear about those in the afternoon. This is a special type of rearrangement where there's been a rearrangement between, say, different parts of a chromosome or two different chromosomes, which can create a new gene that's a bit of gene X, in this case, and a bit of gene Y. Andrew will talk all about finding these in the afternoon. So with that said, we'll start the second exercise. As before, it's on the wiki — just a list of commands that you can enter into the terminal — and in this case we'll rely heavily on IGV to view the actual alignments and look at these structural variation patterns based on the diagrams I've just shown. But before we launch into that, are there any more questions?