All right, good morning, everyone. So I'm Jared Simpson, a few words about myself before I start into the lecture. So I'm a principal investigator here at OICR. I work just upstairs on the sixth floor. My research group develops algorithms for processing sequencing data, so the type of things we're going to talk about here today: mapping reads to reference genomes, finding rearrangements, genome assembly. That's the sort of thing that I work on in my group. In addition to being a PI here at OICR, I also have a position in the computer science department across the street at the university. I've been working in bioinformatics for just under 10 years now. Prior to bioinformatics, I made video games; I was a software engineer at Electronic Arts out in Vancouver. And before that, I was a computer science student at the University of British Columbia. So this module is really about some of the most fundamental things that we do with DNA sequencing data, whether for cancer genomes or indeed normal genomes as well, and that's finding out where the reads might map onto a reference genome, and then trying to use those mapped reads to discover patterns of variation. So I'm going to go over that. This module has really four parts. There's going to be a lecture on genome mapping, which is right now. Then you'll do a tutorial that's on GitHub, where you get a chance to be hands-on with some of the mapping tools that we most commonly use in the field. Then we'll have a bit of a break. I'll come back, talk about genome rearrangements. And then there'll be a tutorial on genome rearrangements as well. I like answering questions better than lecturing, so please feel free to ask questions as they come up while I'm going through this material. So here are the objectives for this module. First and foremost, I want you to understand what we mean when we talk about mapping reads to a reference genome. I'll elaborate on that in just a little bit. 
Also, I want you to understand the common file formats that you'll be using when you work with sequencing data. So for the inputs, the data that comes off the sequencer comes in the FASTQ format. And then the files that you use to work with reads mapped, or reads aligned, to a reference genome are in what's called the SAM and BAM format. I'll go over all of those in this lecture. Also, I want you to understand the common terminology we use when we talk about mapping reads, like mapping qualities, mismapped reads, and so on, so that when you're getting help from someone or when you're trying to talk to people about some aligned data, you know what people mean by these various terms. And in the second half of the module, we'll talk about how we find genome rearrangements using read pairs; read pairs is one of these terms I'll explain later. And then finally, in the tutorial, you'll have a chance to actually run a mapper on your own, and a rearrangement caller as well, on real data. Right, so when I entered bioinformatics about 10 years ago, it was really the start of this era of very high throughput sequencing. Prior to the introduction of the Solexa instruments, which eventually became the Illumina sequencing instruments, the dominant way of sequencing genomes was capillary-based sequencing using Sanger chemistry. Those were relatively low throughput instruments; you could get maybe a few hundred megabases of data per run, but not much more than that, and really not enough to routinely sequence human genomes. But since these high throughput short read sequencers have been invented and widely deployed in the field, we're now able to sequence genomes routinely. At the top end of the scale here, which is really just a depiction of the throughputs of various sequencing instruments, is the Illumina HiSeq sequencer, which gives you about 600 gigabases of data over a 10-day run. This has completely changed how we do cancer genomics. 
It's now cheap enough that we can sequence cancer genomes routinely. And the content of this module is really how we analyze data coming off of those sequencers. So a little bit about the Illumina sequencing chemistry and how it works. It's a cycle-based instrument where nucleotides are added cycle by cycle to a growing DNA molecule, and it's based on fluorescence. Do we have a pointer? No? OK, I'll go on without one. What happens is that the nucleotides that are being added to this growing molecule of DNA are labeled with a fluorescent tag in one of four colors. A laser is used to excite that fluorescent tag, and then a camera captures what the color is for that stage of the reaction. So on the left-hand side here, we have a single T that's being added to this molecule of DNA. And then as the reaction progresses, a G was added, a C was added, a T was added, and so on. Now, on the flow cell that these DNA molecules are attached to, you perform amplification to amplify what's a single molecule into this cluster of thousands or tens of thousands of molecules. And this gives you this circular area on there. All of those molecules should have the same color at the same time. And literally, sequencing the DNA is just reading off the patterns of colors. So if you see blue followed by green, that means it's TG, and so on. You're just reading the color. Yeah. What does it mean when your sequencing provider tells you that the clusters were too close to each other and they couldn't be base called? Yeah, so the base calling algorithm is trying to figure out what the order of colors was, what the sequence of colors was. If the clusters overlap, it will look like a mixed signal, where it might look like it's both blue and green at the same time, and then it can't say, for that part of the flow cell, what the sequence was. 
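The color-by-color readout described above is easy to sketch in code. This is a toy illustration only: blue = T and green = G follow the example in the slide, while the colors used here for A and C are placeholders I've assumed, not the actual Illumina dye assignments.

```python
# Toy base caller for a single cluster: map the color observed at each
# cycle to a nucleotide. Blue = T and green = G follow the lecture's
# example; red = A and yellow = C are assumed placeholders, not the
# real Illumina dye assignments.
COLOR_TO_BASE = {"blue": "T", "green": "G", "red": "A", "yellow": "C"}

def call_bases(colors_per_cycle):
    """Read off the base sequence from the color seen at each cycle."""
    return "".join(COLOR_TO_BASE[color] for color in colors_per_cycle)

# Blue followed by green reads as TG, as in the example above.
print(call_bases(["blue", "green"]))  # TG
```

In the real instrument the hard part is the image analysis, finding the clusters and dealing with mixed signals; this lookup is only the very last step.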
Is that an issue with the DNA you're sending for sequencing, or is that just how it's operated? You have to fairly carefully control the concentration of DNA you add to the flow cell. If you add too much, the density is too high and you start to get overlapping clusters. So the software that does the base calling will try to detect that, and then either it will say, OK, these clusters aren't useful for sequencing, or it might just get confused if it can't detect those. So basically you want to try to get your concentration in sort of the Goldilocks range, where it's not too high, where you're getting overlapping clusters, but it's not so low that you're just not sequencing very much. So if you're at the perfect spot, you get more throughput? Exactly. Yeah, yeah. So can you explain how the clusters are generated from that DNA? Yeah, so you start with a DNA molecule, and on the surface of the flow cell there are adapter sequences. The DNA molecules come in, they bind to the adapter sequences, and then one strand of the DNA gets cleaved off. The DNA is then sticking up vertically on the flow cell. It bends down to another adapter that's complementary to the other half of the DNA, and then there's what's called bridge PCR that happens, where that bent molecule of DNA gets progressively amplified into a cluster. Did you have a question back here? OK. Right, so here's a term that you should start to be familiar with, and that's base calling. Base calling just means to take this stack of microscope images, where each one of these individual colored dots is one of these clusters, and figure out what the sequence of nucleotides was for each one of these clusters. So it's just transforming from these images to a text file in FASTQ format, with the sequence of each one of these molecules. So at every point of time these images are sequentially captured and analyzed? That's right. Yeah, so the cycle-based chemistry is, so it'll flow nucleotides on there. 
They'll get incorporated. A laser will excite them. It'll image them. And then there's a blocking group that doesn't allow the DNA molecule to grow any further. That blocking group then gets removed, and then the cycle continues. Excuse me. Sure. So it reads in this order, line by line? Yeah, when we read the file, we read... No, not the file. How do you get from the image on the left to the file on the right? So this image is just for one cycle. So you'll have 100 of these images stacked up. So let's say this image is for the first base of the read. For each one of these clusters, we'll say it's A, C, G or T. And then 30 minutes later, you take a second image, after the chemistry has progressed. And then from that later image, you'd find the same clusters and you'd read off the color for the base at that time. So this whole image is for one base? This image is for one base for however many dots are on there, maybe 10,000. So it's 10,000 reads. One base of 10,000 reads. Is that all the same part of the DNA? They're all the same; they're all doing the first base, or the second base. It's lockstep chemistry. So in the 50th cycle, you're reading the 50th base of every one of the molecules. But each spot is a different piece, potentially? Each spot is a different DNA molecule. Yeah, actually, in the first line here, this is the name of the read. And these, I believe it's these two, oh, sorry, it's these two numbers, I'm looking at the last two numbers in the first line, that's the coordinate of the spot on the flow cell. Okay, so no sequencing technology is perfect. All sequencers make errors in various ways. One of the sources of errors in the Illumina sequencing chemistry is what's called phasing and pre-phasing. I've just explained that these molecules are being extended one base at a time in lockstep, where all of the molecules in the cluster should be at the 50th base and have 50 nucleotides incorporated in them. 
Now, that doesn't always work perfectly: some molecules can go ahead and some molecules can fall behind. When some molecules in the cluster fall behind it's called phasing, and when some of them go ahead it's called pre-phasing. And what happens is, when you get phasing and pre-phasing, the color in that cluster becomes a mixture. If some molecules are behind on the G base you'll have some red signal, whereas the other ones are blue or green, and that gives you a mixture, and this makes it more difficult for the base caller to determine what the sequence of that colony of reads, or that cluster of reads, was. Yeah, exactly. So when I come to quality scores, this is what determines the quality score: this phasing and pre-phasing. It's a bit more complicated than a majority vote; it has a statistical model of how often things fall behind or go ahead, but yeah, that's the general idea. So right now I'm talking about within each one of these dots, within a cluster. Within the cluster they're all copies of the same molecule, but the different dots are different molecules. So this is the errors that occur within one cluster. Okay, because of these chemistry problems, or error modes, each sequencing technology has a unique error profile, which is dependent on its chemistry. So the Illumina sequencer, which is what we mostly work with in cancer genomics, has a very low error rate: roughly 1 in 200 bases are errors, and most of the time when an error occurs it's a substitution error, where the sequencer just said, okay, there's a G at this position when it was actually a T, or any other type of substitution. 
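To get a feel for what a roughly 1-in-200 substitution error rate means over a 100-base read, here is a small simulation. It's a deliberately simplified sketch: errors are drawn uniformly and independently per base, whereas real Illumina errors depend on position and context.

```python
import random

def add_substitution_errors(seq, error_rate=0.005, seed=None):
    """Return a copy of seq where each base is independently replaced
    by one of the three other bases with probability error_rate
    (0.005 is roughly 1 error in 200 bases)."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        if rng.random() < error_rate:
            # substitute with one of the three other nucleotides
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

read = "ACGT" * 25  # a 100-base read
noisy = add_substitution_errors(read, seed=1)
print(sum(a != b for a, b in zip(read, noisy)), "substitutions")
```

On average you'd expect about half an error per 100-base read at this rate, which is why most Illumina reads map cleanly.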
Other sequencers like the 454 or the Ion Torrent have mostly insertion and deletion errors within homopolymer runs. And then the single molecule sequencers, which aren't measuring a cluster of molecules but are measuring off of a single molecule of DNA, which is much more noisy, have a higher error rate of about 10% to 20%, and the errors tend to be a mixture of insertions, deletions, and substitutions. The single molecule sequencers are by PacBio, and one that I work on quite a lot, which is the Oxford Nanopore sequencer. So maybe you're going to cover this in a bit, but with the single molecule ones, they have a higher error rate even when you account for the error introduced by amplification and the other steps? Yes, this is just intrinsic to having to measure off of basically a single molecule. And something important to understand when you look at your own data is that the Illumina error profile isn't constant across the read. The chance of having a substitution is dependent on the position from the start of the read, and errors tend to increase towards the 3' end of the sequence. Here's a plot of the error rate per position for six different Illumina runs, and you can see that for all of the runs the error rate is starting to creep up towards the end of the read, at this 100th base position. So while the error rate's around 1 in 1,000 at the start of the read, it can be a few percent at the end of the read. This will be reflected in the quality scores of your reads, and it's also something you should be aware of when you're trying to, say, call SNPs, as the bases towards the end of the read aren't as reliable. So then you just trim to 100 in this case? For this, yes. So one of the runs here was up to 150 bases, and you can see that the error rate's around 3% at the end. It's probably worth trimming that back to, say, maybe 120 or even 100, just to keep the best data. 
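The hard trimming mentioned here, cutting every read back to, say, 100 bases, is normally done with dedicated tools, but the operation itself is just truncating the sequence and quality strings together; a minimal sketch:

```python
def trim_record(name, seq, qual, max_len=100):
    """Hard-trim one FASTQ record to at most max_len bases, keeping
    the more reliable 5' end and discarding the noisy 3' end. The
    sequence and quality strings must be cut to the same length."""
    return name, seq[:max_len], qual[:max_len]

# A hypothetical 150-base read trimmed back to 100 bases.
name, seq, qual = trim_record("@read1", "A" * 150, "F" * 150, max_len=100)
print(len(seq), len(qual))  # 100 100
```

Real trimmers typically cut each read adaptively based on its quality scores rather than at one fixed length, but the bookkeeping is the same: sequence and qualities always stay in sync.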
So when you said towards the end of the read, and then you said towards the 3' end: for the reverse reads, are the errors at the beginning, towards the 5' end? It's always towards the 3' end. So when you sequence a read pair, which is something I'll come to, the first read will be on the forward strand of the reference and the second read will be on the reverse strand, so when you look at it in reference coordinates it would be the lower coordinates with the higher error rate. So think about the errors in terms of the strand that was actually sequenced: it's always the 3' end of the strand that was sequenced. I can come back to that. Do we have a question right here? So if you sequence a genome and then you had, like, 100 reads at some position, do you mean? So this is averaged over a lot of reads. Yeah. Do we have a question? Why does it happen? So it's exactly this: you have this cluster of 10,000 molecules, they're all in lockstep incorporating the same base, and as that reaction progresses and you're getting towards the end of the reads, more of them have gotten out of phase, so the signal purity, as they call it, is lower at the end of the read. So if that's the case, then in theory, if you had a shorter reaction, the error rate would be lower? It's like the longer the reaction goes, yeah. So you can see it in this one: the set of reads that's 150 bases in this case had the highest error rate, and for all of the longer ones the error rate creeps up. Is the error rate of a specific sequencer random, or does it occur at the same sites? So, is the error rate random, or is there some systematic component? 
It's not random, and that's a very good point, because it depends. This phasing and pre-phasing is one source of the errors; PCR amplification can introduce errors, and it can introduce stutter in simple repeats; FFPE damages DNA and you get these cross-links between thymines. All of those things contribute to the error, so it's not random, and it depends on how you prepared the DNA. So the highest accuracy way that people sequence is doing PCR-free libraries, where you don't do any PCR step, you're just getting the unmodified DNA onto the sequencer, and that tends to have better properties than if you're amplifying it or using FFPE. FFPE is the hardest to sequence from. Okay, so now to explain... Even without PCR, you still want to have the bridge amplification to create the cluster, right? Yes, yes. When we say PCR-free we mean that when you ligate adapters for sequencing there's usually a PCR step as well, but the bridge amplification is always there. So the FASTQ file is the format that we use when the data comes off of the sequencer. It's a fairly simple text file format where each sequencing record, so each one of these clusters, which was a colony of DNA molecules, is represented by four lines. The first line in the FASTQ file is the read name, so this is the identifier of the read. There's structured metadata within this read name, like I was explaining over here, where the X and Y coordinates of where this cluster was on the flow cell are in here; there's the instrument name, the flow cell it was run on. But oftentimes we don't deal with that metadata too much and we just take this long string as being a unique identifier of this read. The second line is the actual base called sequence of the read, so this is just the sequence of nucleotides that the base caller determined from the stack of images. The third line is just a separator that separates the base called sequence from the base quality scores. And the base quality scores are an estimation that the base caller made of how reliable each one of these nucleotides 
are. So it's an estimate of whether there was an error at that position or not. Now, you look at this and it's just these ASCII characters, CCC, FFF, and so on, and this is just an encoding of a number that we can print in the file. So when you convert this to a numeric quality score, you just subtract 33 from the ASCII character code, and you get what's called the base quality score. The base quality score is a number between 3 and 40, and it's a log-scaled probability that that base call is incorrect. So if you see a base with a quality score of 40, it has about a 1 in 10,000 chance of being wrong; the base caller has said, I'm very confident in this particular base. Conversely, if you see a base quality score of 10, it has about a 10% chance of being incorrect. So if you are looking at SNPs and you're trying to decide whether they're real or not, and all of the reads have a score of Q10, you wouldn't trust it as much as if the reads had a quality score of Q40. Okay, so a read mapping program will take those reads in FASTQ format... Yeah, so you would take the ASCII character code of this, say capital F, which is 70, and then you subtract the offset, so 70 minus 33 gives you Q37. There's a conversion table, a direct lookup table, but yeah. So what we want is this numerical code for it. The first 33 characters of the ASCII table are unprintable control characters, you can't print them in a file, so you add an offset of 33 to get into the printable ASCII characters, and that gives us these codes. But for the purposes of working with the data, you can just treat it as a lookup table, where F means Q37. So is there a one-to-one relationship with the ASCII character? Yeah, exactly. So this character is the quality score for this base. Question in the back? Say again? Yeah, it depends on a few factors. Usually you'll get quality scores of 3, which are the lowest quality scores, sometimes at the end of the 
reads, and usually you just want to throw away that information; the sequencer just basically gave up on calling those bases and it put the lowest quality score on. Usually Q20 to Q30 are the ones that we have reasonable confidence in. When you work with a SNP calling or a somatic mutation calling program, it'll internally take these quality scores, put them into its probabilistic model, and then make a call for you, where you don't have to directly worry about the quality scores. So really it's looking at a product of how many reads were showing a particular mutation and the quality scores of those reads as well. Right, so the read mapper is going to take that FASTQ file and it's going to try to figure out where in your reference genome, let's say the human reference genome, those reads came from. The human reference genome is very large, it has about 3 billion bases, and the reads are extremely short, about 100 bases. So it's commonly referred to as finding a needle in a haystack, where you're trying to find where that 100 base pair read could have come from in the human reference genome. So why do we want to do that? Well, of course, when we're sequencing a cancer genome or sequencing someone's germline genome, we want to know how they're different in comparison to the reference genome; we want to know all of the genome variants that are present in that individual versus this reference. So when we map the reads, or align the reads, to the reference genome, we can then look at those alignments to call SNPs or things like rearrangements. And as I said, to do this we need to map the reads to the reference genome first. Now, you might think that, okay, well, let's just try to compare the read to every possible location in the reference genome, but unfortunately, because the human genome is so large and because it's so repetitive, that's really not computationally feasible. So what we have to do is build what's called an index of the reference genome that lets us efficiently look up 
possible locations where that read might have mapped onto the reference genome. So the most popular mapping program, which is called BWA, uses an index called the Burrows-Wheeler Transform to rapidly identify the possible locations where a read might have mapped. Working with things like the Burrows-Wheeler Transform is what my research group does, so I'm happy to talk to anybody who's interested about how that algorithm works, but it's a bit beyond the scope of this lecture. Now, what does the read mapper need to consider when it's mapping reads to a reference genome? Well, first it needs to consider the fact that there are sequencing errors. We can't just look for exact matches between the read that we're mapping and the reference genome; we need to allow for differences between them because of these sequencing errors that we talked about. So the read mapping program must be tolerant of differences between the reads and the reference genome. Typically what read mappers do is they use this index of the reference genome to find exact matches, what we call seeds, and then they refine the seeds using dynamic programming to determine what the best mapping location for that read is. The general principle here is that reads that have a lot of errors, and insertion and deletion errors in particular, are a lot harder to map than reads with just a few substitutions, like Illumina data. So if you're mapping, say, Oxford Nanopore data, which has a 15% error rate with a mixture of insertions and deletions, it's much more challenging to map those than, say, an Illumina read that has a single substitution in it over 100 bases. Would you still trust BWA to map Nanopore reads, compared to the mappers designed for them? Well, yes. So that's a good point, and something I wanted to say on the last slide is that you typically choose your mapping program depending on the type of sequencing technology. So when we're mapping Illumina reads later on, we're going to use BWA mem; when I 
map Oxford Nanopore reads, I use BWA mem but with specific options that are designed for it, -x ont2d. There are other specialized read mappers for Nanopore data, like GraphMap, but I like to use BWA mem. The second issue that a mapper needs to keep in mind is that short read sequencers generate huge amounts of data. As I said in the second slide, an Illumina run is about 600 gigabases of data over a 10-day run, and that's just a huge amount of data for these programs to churn through. So we need to select programs that are extremely efficient, or else we'll end up spending a whole lot of money on just compute time rather than on sequencing genomes. So, right, how do you choose a mapper? Well, it needs to be accurate: we don't want it to be making errors where it places the reads on the wrong location of the reference genome; that's the major source of false positive variant calls. It needs to be quite sensitive: it has to allow for differences between the sequenced individual and the reference genome. And as I just said, it needs to be fast as well. So about four years ago there was a very nice review paper in Bioinformatics that surveyed all the different mapping programs that are out there. If you're interested in this, and in looking at the tradeoffs between, say, speed and accuracy, this paper goes into it in quite a lot of depth. As you might have figured out, I like the mapping program BWA; a lot of people use it, it's sort of the standard for Illumina data now, and it's what we're going to use in the tutorial. But just be aware that there have probably been a hundred different mapping programs published over the last ten years for mapping short reads to a reference genome, and they all sit at a different position on this sort of accuracy versus speed curve. Okay, so now I'm going to talk about some of the terminology around mapping reads to a reference genome, mainly mapping quality. We can visualize the process of mapping a read to a reference genome like this: we take our sequence read 
which is at the bottom of the slide in red here, and we have our long reference genome, which is this black bar here, and we need to figure out where on this reference genome we want to place this short read. So let's say the mapping program found three possible locations for it. From left to right: there's a mapping location where it mapped perfectly to the reference genome, there were no differences between the read and the reference genome; in the middle one, there was a single mismatch between the read and the reference genome; and in the one on the right, there are two mismatches between the read and the reference genome. How do we evaluate which of these is the most likely true position for the read? We represent this using what's called a mapping quality. Just like a base quality, where we had this measure of how reliable the base was, the mapping quality is a measure of how reliable the mapping was. So in the output file from the aligner it will say, okay, this is a Q30 mapping, and that means there's roughly a one in a thousand chance that the mapper selected the wrong mapping location for that read. So just like when you're looking at SNPs, where you want to just consider the ones with fairly high quality, when you're looking at your read alignments you also want to just consider the ones that were mapped very confidently, and in the exercise there'll be examples of this that you can look at. Now, what can cause mapping errors? Well, if there's a lot of errors in the read, it can confuse the mapper and put the read in the wrong position. If the genome is very repetitive, and the human genome is very repetitive, it can cause the mapper to put the read in the wrong location. And if there's any variation, like SNPs and indels, between the reference and the individual that you've sequenced, that can cause problems as well. So the mapper is the program? 
Yes, the mapper is the program; BWA is the mapper. Now, because Illumina reads are fairly short, around a hundred bases, that causes these problems where we might mismap the reads to the wrong location. So what some people did is they developed what we call paired end reads, which allows us to increase our confidence in the read mappings. The idea of paired end reads is that you would take a longish molecule of DNA, let's say 500 bases, and you'd sequence 100 bases from one end of the molecule and 100 bases from the other end, with the 300 bases in between left unread. So now we have this constraint: we know we have 200 bases of sequence and they're separated by 300 bases, which gives the mapper more information to know where to place reads. To illustrate that, let's say we're now sequencing a read pair and we want to figure out where it maps. If we map the first half of the pair, there might be two equivalent mapping locations. So let's say it maps to these two positions with one mismatch each. Now we wouldn't be able to say which of these two mapping locations is the true one; they are exactly the same and there's no information to resolve the mapping. So we sequence the read pair, and we then map the second half of the pair. Because we have this constraint that the second half of the pair needs to map about 300 bases away, we get a much more reliable alignment. In this case, for the first alignment, the second half of the read pair mapped with no differences, whereas in the second alignment, the second half of the pair mapped with six differences. So we'd be much more confident that, okay, the true alignment is this first one, where cumulatively there was only one mismatch, versus seven mismatches for the other alignment. Yeah, go ahead. Do we always use an insert size of, like, 300 bases between the two ends? It's usually tweaked for different applications, but it doesn't change much, more than say 200 to 500 bases, just because it's really hard to get very long fragments 
to work with this bridge amplification, so the upper limit you usually see is about 500. I think the field has kind of standardized around 300 as being the common insert size. Well, it's not quite independent, but people do do that, and it definitely helps. Like I mentioned before, the error rate goes up towards the end of the read, and so you're reading the noisy bit twice if they overlap. And for things like metagenomics and 16S sequencing, where you want to sequence, say, 200 base pair fragments at very high accuracy, people do that quite commonly: you'll make the insert size about 180 bases so the reads overlap a little bit at the end. That's quite common, and it makes it a lot better; it gets rid of just the statistical variability caused by these phasing problems and the overlap between the fluorescent spectra for the nucleotide tags. What it doesn't get rid of is amplification artifacts: if you've amplified your DNA and the amplification made an error, that's going to be read the same way twice, and it won't fix that. The amplification for the cluster, or the PCR that you do before it? Mainly the PCR that you do before it. So here the reference is covered by one read pair, but then you'd have overlapping read pairs over that region, yes? Are those used as well to figure out how to put this together, like independent reads that are from the same location? Sort of. Usually, and when we calculate mapping quality in particular, we just view these two reads as a pair, independent of everything else: we map them to the reference, we estimate mapping quality, then we move on to the next pair. So you go pair by pair? We go pair by pair, that's right. The red line depicts sequence that would have to be covered by another read pair? Yes, but that's not used by the mapper; it's used downstream. The mapper just treats every read pair as independent, and downstream programs like SNP callers and indel callers will then use that information; it's the next layer of information. Okay, coming to the end 
of this half of the lecture, I'll just explain the SAM and BAM formats. This was a standard that arose out of the Sanger Institute and the Broad Institute about, say, seven or eight years ago, as a way of representing alignments such that you could write a mapping algorithm, it would write its aligned reads, or its mapped reads, in some standard format, and then any sort of SNP calling program could just take this standard file, instead of having a different file type for every mapping program. So SAM stands for Sequence Alignment/Map. It's a text file, just a tab delimited text file you can read in your terminal. BAM is the compressed version; it contains all of the same information, but it's a much more compact representation of the alignments, so your files don't take up a huge amount of disk space. So I'll walk through the different fields in the SAM file. The first field is just the read ID; this comes straight out of that first line of the FASTQ file I showed you earlier. The second field is what we call the flags. This is a number that's just encoding a binary string that gives small bits of information, like what strand of the reference genome the read mapped to, whether it has a read pair, whether the read pair was mapped at the expected distance and orientation. To decode this, you can't just look at this number, which is 99, and figure out what it means; there are online tools that will decode the different bits of information, and I think in the exercise after this that's described a little bit more. The next fields are the actual chromosome that the read mapped to and its position along the chromosome. In this case the read mapped to chromosome 19 at about 8.8 megabases. The next field is the mapping quality. This read got a mapping quality of Q60, which is actually the upper limit for mapping quality; you won't see reads mapped with a higher than Q60 quality. That means there's about a 1 in a million chance that it was mapped in 
correctly, so we can rely on this read quite a bit. Next over is what's called the CIGAR string, which is a representation of how the bases of the read line up against the bases of the reference genome. I've given two examples at the bottom of the slide, showing an insertion with respect to the reference genome and a deletion with respect to the reference genome. The one on your left is showing a deletion, and we encode it as 4M1D6M. The way we read this is that four bases of the read lined up against four bases of the reference; then one base of the reference was deleted, so there is no base in the read that maps to it (it's shown as just a dash character here); followed by six bases which match between the reference and the read. The other side, on the right, is an insertion: here we have 4M, so four bases aligned to the reference, one base inserted, and then four bases again aligned to the reference. Now, something important here, and it's a bit subtle, is that these M codes don't tell you whether the bases of the reference and the read are identical. In the second position on the right, the reference has an A and the read has a T; that's still counted in the 4M. It's not saying these are identical bases, just that they line up against each other. That's something people commonly get confused about early on when working with these files: they want to treat 4M, or 10M, or whatever it is, as an exact match. Next in the SAM file is the pairing information: the chromosome that the read's pair mapped to (here it's represented as an equals sign, meaning the same chromosome, chromosome 19 again), followed by the position of the read's pair, and then the inferred insert size, the length of the whole fragment of DNA that was sequenced. Then we have the actual sequence of the read, the same as what's in the FASTQ file, and the quality scores as well.
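To make the FLAG and CIGAR fields concrete, here is a small sketch that decodes them using the bit meanings and operation codes from the SAM specification; flag 99 and the CIGAR strings 4M1D6M and 4M1I4M are the values from the slides.

```python
import re

# FLAG bit meanings, per the SAM specification.
SAM_FLAGS = {
    0x1:   "read is paired",
    0x2:   "proper pair (mapped at the expected distance and orientation)",
    0x4:   "read unmapped",
    0x8:   "mate unmapped",
    0x10:  "read on reverse strand",
    0x20:  "mate on reverse strand",
    0x40:  "first in pair",
    0x80:  "second in pair",
    0x100: "secondary alignment",
    0x200: "failed vendor quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag: int):
    """Describe every bit set in a SAM FLAG value."""
    return [desc for bit, desc in SAM_FLAGS.items() if flag & bit]

def reference_span(cigar: str) -> int:
    """Reference bases consumed by an alignment with this CIGAR.

    M, D, N, = and X advance along the reference; insertions (I) and
    soft clips (S) do not. Remember that M means 'aligned against',
    not 'identical to': mismatching bases are still M.
    """
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")

# Flag 99 = 1 + 2 + 32 + 64: paired, properly paired, mate on the
# reverse strand, and this is the first read of the pair.
print(decode_flag(99))
print(reference_span("4M1D6M"))  # deletion example: 11 reference bases
print(reference_span("4M1I4M"))  # insertion example: 8 reference bases
```

This mirrors what the online flag decoders and the IGV read popups show you.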
The next slide is just a list of resources for working with these SAM and BAM files. The program we'll be using is SAMtools, which is a toolkit for manipulating these files. We can use it to convert from SAM to BAM and the other way, from BAM to SAM; you can sort the alignments into reference coordinate order, or extract all the alignments for a particular reference location. SAMtools now offers threaded sorting, so it's a bit faster than simple sorting without threads. [Audience: Is that faster than Picard, which also uses threaded sorting?] I haven't done a head-to-head comparison of them, but I trust SAMtools: it's a little easier to install, I know Heng Li very well, and it's my preferred software. [Audience: Has anybody done a benchmark between SAMtools and Picard in terms of speed?] I was going to say, but I didn't want to offend any Java programmers: generally, compiled C programs are a little faster than Java programs, so I'd expect SAMtools to be a bit faster. The second bullet here is a link to the specification for the SAM format. If there's anything you read in a SAM file that you can't figure out, what does this flag mean, what is the exact meaning of these fields, this is the technical document that all the developers work from, so it's the ground truth for SAM files; have a look at it to answer any questions. If you want help online, you can ask on the SAMtools mailing list, and on forums like Biostars or SEQanswers you can find very helpful people to help you out. Right, so I believe you've all had a tutorial on IGV already, is that right? In the module coming up, where you'll actually run BWA and use SAMtools, we'll be using IGV to view the aligned reads. IGV gives you a lot of information about these reads: you can get all of the fields from the SAM file, like where the read was mapped, where its pair was mapped, and so on, and you can view the mismatches between the reads and the
reference genome. This is an example of a somatic mutation: on the bottom is a sequenced tumor and on the top is a sequenced normal, and we can see there's a single base substituted in the tumor that isn't present in the normal. When you find a mutation in a gene that you think is quite interesting, you open up IGV, navigate to that location, and then look at the read alignments to see whether you trust that mutation. This is quite a good one: all of the reads are aligned at Q60, and all of the bases that differ from the reference have quite high quality scores. But you have to be aware that there are various bits of the reference genome where mapping can go quite wrong. This is an example of reads mapped to the centromere of chromosome 2. When we're sequencing a genome, we're randomly sampling reads from that genome, which means we expect the read coverage to be roughly uniform across the whole genome. So when we see things like really high piles of reads, where the read depth is much higher than we expect, it usually indicates some sort of mapping problem. In the centromere the mapped depth is around 700x (typically we sequence a genome to around 30x), and there are these huge peaks where over 8000 reads pile up at single locations. We wouldn't trust any of these alignments; they're almost certainly artifacts of the fact that the centromere is incredibly repetitive, with the same sequence, or the same repeat structure, shared across all the chromosomes. So we would have little to no confidence in these alignments, and although you can see vertical colored bars that look like SNPs, because these are probably artifactual alignments we wouldn't trust any of the SNPs called here. And finally, something to be aware of is soft clipped reads. When BWA is trying to map a read to the reference and gets to a point where it can map the start of the read but then
not map any of the rest of it, it will just stop aligning and only map the first part that it can confidently align; the rest of the read is considered to be soft clipped. Now, this is an example of a real indel in a cancer genome that was too long for BWA to align the reads with a gap in them, so what happened is that it soft clipped the reads up to the breakpoint and mapped only the part of each read that could align. This is one of the patterns it can look like, and we'll get into this more in the next lecture, when I talk about genome rearrangements. But that's the end of these slides; before we start the read mapping exercise, are there any more questions about mapping reads to a reference? [Audience: A question about the reference genome: for a tumor from a patient, what do you use as the reference genome?] You usually sequence both the normal, which is say a blood sample, lymphocytes, or some adjacent tissue, and you also sequence the tumor. When you go to compare them, when you go to find variants, you make sure there's a lack of evidence for the variant in the normal sample before calling it a somatic mutation. This will come up in a few more modules as well, when I talk about rearrangements and when we talk about somatic mutations, I think tomorrow. [Audience: So do you create a reference genome for each patient, their own?] What the programs usually do internally, so let's go back a few slides: in this case we would look at this position and see that there's a C to G mutation; we would look at all of the evidence for this mutation in the tumor, and then we would look at the same position in the normal, which is shown here, and see that there's no evidence for it. The program integrates all of this evidence internally and says, OK, there's a somatic mutation, C to G. [Audience: So you don't use the patient's own genome as the reference?] Right, that's the difference: both samples are compared to the standard reference, not to the patient's own genome. You usually don't
reconstruct a genome for each patient; the program compares the cancer and the normal, each mapped against the reference. [Audience: For the CIGAR string, you mentioned a common misconception about M meaning an identical match; can you go over that?] Sure. For this one here, we represent the fact that this four-base stretch aligned against the reference genome, but one of those bases mismatches: there's a T in the read and an A in the reference. So why is that a match? M means that the bases align against each other, not that they're the same. It goes back to how we represent alignments in dynamic programming: when you align using dynamic programming, if you follow the diagonal, it means the two bases are expected to line up against each other, and we call that a match. That's where the confusion comes from: you say, OK, they match, so they should be exactly the same, but it just means they're aligned, not necessarily identical. [Audience: So the CIGAR string captures the indels but not necessarily the SNPs; would you see something in there for larger structural variations?] You usually can't represent large structural variations with CIGAR strings; there's no representation of inversions, for example. Usually what happens, and it's a CIGAR example I didn't use, is soft clipping: if a read gets soft clipped, the CIGAR might say 50S, which just means the last 50 bases weren't aligned. So usually when you have structural variation you end up with soft clipped reads. OK, if that's all the questions, let's move on to the exercise. Here's the second part of the module, which covers genome rearrangements. I hope you all got on well with the read mapping exercise. I'll give a brief lecture, probably only about 10 to 15 minutes, on finding rearrangements, then we'll do a second lab where you'll have a chance to run a rearrangement caller on the BAM file you generated in the first part of the lab. So there are a lot of different types of variation that we want to find when we sequence a genome. We have
single nucleotide variants, which are just substitutions, or SNPs; insertions and deletions, where bits of DNA have either been deleted from the genome or inserted into it; and then the different classes of structural variation, which are larger types of variants, usually quite drastic changes to the chromosomes. This module is going to focus on the structural variations; later on you'll hear about gene fusions, and then tomorrow you'll hear about calling single nucleotide variants and insertions and deletions as well. The class of structural variations encompasses quite different variant types: very large insertions and deletions, like megabase-scale deletions; inversions, where bits of a chromosome get flipped; translocations, where material is swapped between different chromosomes; and copy number variation, where large parts of a chromosome can be duplicated. The typical way of finding structural variation is using the paired-end reads we discussed in the first part of the lab, so here's an overview of how you generate paired-end reads from genomic DNA. You start with your genome and fragment it into short pieces, selecting a size range of about 200 to 500 bases. After size selection, you add sequencing adapters to both ends of those DNA fragments, then you put them in your Illumina sequencer and sequence from both ends: for one read of the pair you sequence the forward strand, and for the other half of the pair you sequence the reverse strand. What this means is that these read pairs have an expected orientation when you map them to the reference genome. If the read pair comes from a part of the genome that matches the reference well, you'd expect one half of the pair to map on the forward strand of the reference and the other half to map on the reverse strand. That's depicted here with these arrows: one arrow facing right, depicting a forward strand map, and the other arrow facing left, depicting a reverse
strand map. Another source of information we have is that the lengths of the DNA fragments you sequence come from a distribution, roughly Gaussian, or normal, in shape, with a peak at around 400 bases. The exact shape depends on how you sheared and size selected the DNA, but when we map these read pairs to the reference genome we can use this distribution to figure out how far apart we expect them to map. That gives us information: if they map much further apart than expected, or much closer together, it might indicate some kind of structural variation. I'll give examples in the upcoming slides, but first a bit of terminology. When read pairs map to the reference genome in the expected orientation, and the distance between them is consistent with the fragment size distribution, we call them concordant read pairs; they carry no evidence for structural variation. When they map too far apart, too close together, or outside of the expected orientation, we call them discordant read pairs, and these are the ones that give us evidence of structural variation. In the SAM flags you had a look at, there's a bit that says whether a read pair is concordant (a "proper pair"), and in the lab later we'll be extracting reads from the BAM files that are discordant, meaning they have some evidence of structural variation. So let's look at examples of the patterns of discordant read pairs for various types of structural variation; I'll show simple diagrams of a sequenced sample and a reference genome to illustrate them. This first example is a deletion: there's this red part of the genome that's only in the reference, and it's not present in the individual we've sequenced.
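The concordant/discordant test just described can be sketched in a few lines of code. The fragment size mean, standard deviation, and three-sigma cutoff below are illustrative assumptions, not values used by any particular mapper:

```python
def classify_pair(left_strand, right_strand, span,
                  mean_insert=400, sd=50, n_sd=3):
    """Classify a mapped read pair as concordant or discordant.

    A concordant pair points inward (leftmost read on the forward
    strand '+', rightmost read on the reverse strand '-') and spans a
    distance consistent with the fragment size distribution. The
    mean, sd, and n_sd cutoff here are illustrative assumptions.
    """
    expected_orientation = (left_strand, right_strand) == ("+", "-")
    expected_distance = abs(span - mean_insert) <= n_sd * sd
    if expected_orientation and expected_distance:
        return "concordant"
    return "discordant"

print(classify_pair("+", "-", 410))  # within the distribution: concordant
print(classify_pair("+", "-", 900))  # too far apart: discordant
print(classify_pair("-", "+", 410))  # wrong orientation: discordant
```

A real mapper sets the proper-pair bit in the SAM flags from essentially this kind of check, with the distribution estimated from the data.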
I'm depicting the deletion in the sample, on the top, as just this dashed line: that sequence isn't present, and the breakpoints are these red vertical bars; that's where the deletion occurred. Now, when we sequence read pairs from this sample genome and map them to the reference, because the genome we've sequenced doesn't have this red block, when BWA maps them to the reference genome they map much further apart. So where you'd expect them to be, say, 300 bases apart, if this deletion is 500 bases, then because the sample genome doesn't have that red 500 base piece, they map roughly 800 bases away from each other. That makes them discordant, because that distance doesn't come from the expected fragment size distribution, and it's evidence of a structural variation, a deletion. So the signature of deletions is reads mapping further apart than expected, and later on you'll view some deletions in IGV, which should help reinforce this point. The second type of structural variation signature is an insertion. Here the sample we've sequenced has some block of sequence that's not in the reference genome; again the novel sequence is this red block, and the breakpoint is this red vertical line in the reference. When we map read pairs that span the insertion, one read to the left of the insertion and one read to the right, they map much closer together on the reference, because the reference doesn't have this red block of sequence. So the signature of an insertion is pairs mapping closer together than expected. [Audience: What about pairs where one of the reads falls inside the inserted sequence?] It depends on the size of the insertion, and on what the inserted sequence is as well; I'm going to come back to that in a few slides. So if you have a type of amplification known as a
tandem duplication, where within the sample there are two copies of a sequence located head to tail like this, then if we sample a read pair around the breakpoint of the tandem duplication, when the second half of the read pair maps, it has to map to the start of the other copy of the duplicated sequence in the reference. What happens is that the read pairs point away from each other, like this: whereas we expect them to point towards each other, they point away. So the signature of tandem duplications is this wrong orientation signature. Similarly, if we have an inversion, where a bit of DNA gets flipped, the half of the pair at the start of the inverted sequence maps much further away and in the wrong orientation, because this DNA has been flipped in the chromosome. So we get the two reads of the pair mapping in the same direction, and with a much larger separation than you'd expect. This slide is just a summary of the evidence for these structural variation types: for an insertion, the mapped distance is too small but the orientation is correct; for a deletion, the mapped distance is too big but the orientation is correct; and for inversions and tandem duplications, we have pairs mapping in the same direction or away from each other, respectively. A type of structural variation I didn't mention is interchromosomal rearrangements; there we'll have pairs that map to different chromosomes, and they can have any orientation. As we just had a question about, this can go wrong in various ways. One way it can go wrong is missing large insertions: if the inserted sequence is much longer than your fragment size distribution, you won't have read pairs that span across the whole insertion, so the reads that fall in the middle of the insertion won't map anywhere, and you won't have pairs that map closer together than expected.
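That summary maps onto a simple decision rule per read pair. This is only a sketch of the logic in the slides: the distance cutoffs are illustrative assumptions, real callers demand clusters of supporting pairs rather than a single one, and pairs mapping to different chromosomes (interchromosomal rearrangements) would be checked before any of these rules:

```python
def sv_signature(left_strand, right_strand, span,
                 mean_insert=400, sd=50, n_sd=3):
    """Candidate SV type suggested by one read pair, following the
    summary in the lecture. Cutoffs are illustrative assumptions."""
    if left_strand == right_strand:
        return "inversion (pair maps in the same direction)"
    if (left_strand, right_strand) == ("-", "+"):
        return "tandem duplication (reads point away from each other)"
    # Expected inward orientation ('+','-'): judge by distance alone.
    if span > mean_insert + n_sd * sd:
        return "deletion (pair maps too far apart)"
    if span < mean_insert - n_sd * sd:
        return "insertion (pair maps too close together)"
    return "concordant (no SV evidence)"

print(sv_signature("+", "-", 900))    # deletion candidate
print(sv_signature("+", "-", 150))    # insertion candidate
print(sv_signature("-", "+", 400))    # tandem duplication candidate
print(sv_signature("+", "+", 2000))   # inversion candidate
```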
You might have pairs where only one read of the pair maps and the other is unmapped; that's a line of evidence that some rearrangement detectors use to find these types of insertions. For small insertions, there's usually a threshold below which the reads will be split by BWA, where it will align the read with a nice gap in it, and you can detect the insertion from the read alignments alone; usually you can find those up to maybe 30 to 40 bases. For larger events, differences between the fragment size distribution you expect and what you actually observe let you find insertions of maybe 100 to a few hundred bases. So there's a gap, between about 30 and 100 bases, where we don't have very good methods to find these types of events. Right, so another line of evidence for structural variations is split reads. This is where BWA has introduced a large gap into the read alignment, as I was just explaining; here's an example where there's a deletion, and the aligner has split the reads, introducing a gap corresponding to the deletion. When this happens, these are easier events to find than when you're trying to infer them from the fragment size distribution alone. A special type of rearrangement which is quite important in cancer is gene fusions. This is where some rearrangement has shuffled exons around between different genes, and when those genes are expressed you get a fusion protein. Andrew is going to talk about this in detail later on, but it's a very important type, so we've dedicated a whole module to it. Right, just before we start the exercise, a word about calling somatic versus germline mutations. We have variation in our genomes with respect to the reference genome of all types: SNPs, indels, structural variation. On top of that, if we have a tumor, there are additional mutations that occur, and usually when we're sequencing cancer genomes we're mainly interested in these, what are called somatic changes.
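The tumor/normal subtraction described earlier can be caricatured as a read count filter. Real somatic callers integrate base qualities, mapping qualities, and error models, so the function name and cutoffs here are purely illustrative assumptions:

```python
def is_somatic_candidate(tumor_alt_reads, normal_alt_reads,
                         min_tumor_support=5, max_normal_support=1):
    """Toy somatic filter: keep a variant only when the tumor shows
    strong support and the matched normal shows (almost) none.
    All cutoffs are illustrative assumptions, not a real caller.
    """
    return (tumor_alt_reads >= min_tumor_support
            and normal_alt_reads <= max_normal_support)

print(is_somatic_candidate(12, 0))  # -> True: somatic candidate
print(is_somatic_candidate(12, 8))  # -> False: seen in the normal, germline
print(is_somatic_candidate(2, 0))   # -> False: too little tumor evidence
```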
Somatic changes are the mutations that are only present in the tumor genome and not present in the patient's normal, inherited genome. As I touched on earlier, what we usually do is sequence both the tumor sample and the normal sample, then compare them and subtract off all of the events that have evidence in the normal sample, since those are germline events rather than somatic. That's an important thing to keep in mind when you're looking at these events in the next lab. So we'll now start