Okay. Good morning everybody. As Anne said, I'm Jared Simpson. I'm a principal investigator here at OICR. I have a lab upstairs on the sixth floor where we develop computational methods for processing and analyzing DNA sequencing data. That's part of my appointment here at OICR. I also have an appointment over at the University of Toronto in the Department of Computer Science, as an assistant professor, where I do some more theoretical work on developing methods for analyzing very large collections of data, like the sequencing data we're going to hear about.

So I'm going to be with you for about the next four hours today. We're going to talk about the two primary ways of analyzing DNA sequencing data: mapping reads to a very well characterized reference genome, which is what we usually do in cancer bioinformatics because we have a human reference genome, and, in the second half of this module, pure de novo assembly, where we try to reconstruct the genome sequence directly from the reads that come off, say, an Illumina sequencer. For both of those, we're going to have laboratory modules where you have a chance to run some mapping programs and assembly programs as well. And in between, we'll have a break, just so I don't go on and on for four hours here.

So for the first part of the lab, we want to talk about mapping reads to a reference genome, and what I hope you learn from this is to understand what that process means. I'll give you a little bit of insight into how the algorithms work for taking a set of sequencing reads off, say, an Illumina sequencer and mapping them to a human reference genome. I'll introduce the FASTQ, SAM and BAM file formats. These are the primary files that you'll store on your computer: they represent either the sequencing data that's come off a DNA sequencing instrument or, after you've aligned the reads to a reference genome, how we represent those alignments in SAM and BAM. I'll explain those to you in detail. I'll describe some common terminology that we use to describe alignments and sequencing data, things like base quality scores and mapping qualities. And then in the last part, you'll get a chance to run a mapper on your own and explore the results.

So I'm going to start by just talking a little bit about how DNA sequencing works, and about the main DNA sequencing instruments that we now use, which are the Illumina HiSeq and HiSeq X Ten instruments. The human genome was sequenced using what's called Sanger sequencing, a sequencing chemistry developed in the 1970s. It was fantastic. It allowed us to sequence the human genome, albeit at very great cost, billions of dollars. But once we had sequenced the initial reference sample of a human genome, there was this realization in the field that we needed to reduce the cost of sequencing, so that we could sequence many, many genomes, discover a lot of genomic variation, and then link that to the observable phenotypes that we all have. An example is cancer genomics: we want to be able to sequence an individual's genome and see if they have particular germline SNPs or indels that predispose them to getting cancer. Or if they get cancer, we want to extract DNA from their tumor samples, sequence it, find where the mutations are, and then link that to genes that might be drivers of the progression of the cancer. And in Trevor's lecture yesterday, I'm sure he told you about that process in a lot of detail.
Now, the way that we do this nowadays is using these Illumina sequencers. Illumina is a second generation sequencer, the generation after Sanger sequencing, and we call it a massively parallel or high throughput sequencing instrument. The innovation in Illumina sequencing is that you can sequence many, many reads, many, many fragments of DNA, in parallel using an imaging technique, which I'm going to talk about over the next few slides. The advantages of Illumina sequencing are that it gives you the best throughput by far, up to a terabase of DNA from a single Illumina run; the reads are incredibly accurate, with an error rate probably less than 1%, maybe 1 in 200 bases; and it gives you a pretty good read length for a second generation sequencer. Also, the library preparation is very fast and very robust, so it's quite easy to get high quality Illumina data compared to some other sequencing instruments. The disadvantage of Illumina sequencing is the read length. You can only read about 100 to 150 bases of DNA at a time, and because the human genome is so large and so repetitive, something we're going to be coming back to again and again this morning, that read length isn't enough to resolve a lot of the most difficult repetitive regions of the human genome. When we get to genome assembly a little later on, I'll be introducing third generation, single molecule, long read sequencers like PacBio and Oxford Nanopore that aim to get around this. But for cancer genomics, because we want to sequence so many samples to get the power to find driver genes, we typically use Illumina sequencing, because it's so cheap and because you can sequence a human genome for around $1,000 to $1,500.

Now, I'm going to walk you through how the Illumina sequencer works, because I think it's incredibly important to understand the actual physical and chemical processes that are going on. When you analyze your own data and something doesn't quite look right, you can go back and think about what the error modes of Illumina sequencing are, and that might give you insight into why you see some of the errors. Illumina sequencing is called sequencing by synthesis. The original chemistry was developed by a company called Solexa, which was subsequently bought by Illumina. The way it works is you take your DNA sample, which contains many copies of the genome; this could just be DNA that you extracted from white blood cells. You fragment that DNA into a lot of small pieces, the pieces being between a few hundred and maybe 400 bases, and then you attach those DNA fragments, which we're going to call templates, to the surface of a microscope slide. At each position on the microscope slide, we have a single template molecule. We're showing two examples here: one template molecule with the sequence TACAC, and another one over here with the template sequence CAGGG. Then, in place on the slide, there's a process called bridge amplification, or bridge PCR, which amplifies each single template molecule into a big cluster of the same DNA sequence, all in place. In these clusters here I'm showing six molecules; in practice they're more like 10 to 20,000, all in a bundle, lying in the same region of the microscope slide. So that's the library preparation and the cluster generation step. Next, the sequencing starts. The way this works is you inject color-labeled nucleotides.
These are nucleotides that have a fluorophore that will emit light after being excited with a laser. You have one color for each of the four different nucleotides, and you add DNA polymerase as well. The DNA polymerase is going to find the colored nucleotide that's complementary to the next base in these fragments. Here we're just drawing single molecules, but pretend that these are clusters. The polymerase attaches the nucleotide to this growing strand of DNA: here it found an A that's complementary to this T, here it found a G that's complementary to this C. The fluorophore gets excited by a laser, it emits light, and a very, very expensive digital camera attached to a microscope images what color of light was emitted. Then the sequencing reaction for that cycle ends, and you proceed to the next base with another injection of the color-labeled nucleotides, addition of DNA polymerase, and capturing of light. So in each one of these steps, you're sequencing a single base, sequentially, in order down this DNA fragment.

Yeah, a question: is the error rate a function of both the polymerase misincorporating a base and the camera mis-detecting the signal? I'm going to hang on to that and come back to it in two slides. Yeah, another question: how does it release the antisense strand during this step? Because during the bridge amplification it goes like this, then it anchors, then it pops back up; where does the antisense strand go? There's a step in there that will just get rid of it. I don't know the chemistry of how, during the bridge PCR, it actually gets rid of the antisense strand, whether it separates them or just one strand goes away. I'll look that up and maybe come back to you at the break.

Okay, so in the fifth step, we're just going to line up all of the images, one for each cycle. In the first cycle, we had a green flash of light for this cluster and a yellow flash of light for this one. In the second cycle, we had red, in the third yellow, blue, and so on. The base caller is going to take these images, try to detect where each cluster is across the five different images, so it's going to say, okay, this is the leftmost cluster across these five, and then predict what color of light was emitted in each one of these images. So it would say green, red, yellow, red, yellow, and that would then get turned into the base-called sequence for that DNA fragment, because it knows which nucleotide each color corresponds to.

Now, coming back to the question we just had about the errors of Illumina sequencing. The primary error mode is due to what we call phasing, or dephasing, problems. Because we have this cluster of 10 to 20,000 molecules, all being sequenced simultaneously, as the reaction progresses some molecules in that cluster are going to lag behind the sequencing reaction, or they're going to jump ahead. By that we mean that when we're on sequencing cycle two, we might have some molecules that didn't get a base incorporated in the first cycle, so they're lagging behind by a step, or that got too many bases incorporated in that cycle, so they've jumped ahead to this G base. The effect of this is that the signal we capture with these digital cameras becomes a mixture of the different colors. Here we have two molecules in this cluster which emitted a red light for this T, one that emitted green, and one that emitted yellow.
Now, what we'd register here is a signal that's 50% red, 25% green and 25% yellow, and the base caller has to deal with this in some way. It has to try to deconvolve this signal to figure out that, okay, this red-green signal is because there's a fragment that's lagging behind, and this orange signal is because this fragment has jumped ahead. Early in the run, the color is very pure; we get mostly the same color from every one of these templates. But as the sequencing process continues towards the three prime end of the fragment, towards the end of the read, in the later cycles more and more templates fall out of sync, so you get an impure signal near the end of the DNA fragment. This is what gives you the characteristic Illumina error profile, where the sequencing error rate is very low at the beginning of the read and increases as you go towards the three prime end of the read. The base caller has fairly sophisticated probabilistic models to deal with this: it's going to try to model this process of fragments either falling behind or jumping ahead, and it's going to try to deconvolve these mixed signals. But it's also going to give you a confidence level in how sure it is about the assignment of the base calls. So when it says, okay, in the second cycle here I think there's a T, it's going to give you what we call a quality score, which measures the base caller's confidence in whether there's an error at that position. The quality scores are directly related to the error rate you expect to observe at that position.

So the main file format that you'll have your sequencing data in is called the FASTQ format. This file format has four lines per read, or per record, and each one of these lines describes something about the sequencing read. In this example we have five sequencing reads, four lines per read, so 20 lines total. The first line has an @ symbol at the start of it to delimit the record; that's just a marker saying that this is the start of the FASTQ record. Then it has the read ID following that @ sign. This is just an identifier for the read that the Illumina instrument assigned to that cluster. Next, we have the nucleotide sequence that was predicted by the base caller. This is about 100 bases here; these are the bases that were predicted for every cycle of that sequencing reaction. The third line is a placeholder, it's just a plus sign. Sometimes the sequencing read ID is duplicated in this field, but in modern FASTQ it's usually just a plus without any other information. And then the fourth line is the string of quality values, which is shown here as those H's all the way down to these number signs. Now, these quality values, as I just said, reflect the base caller's confidence in the assignment of those bases. So for the A base that it called in the first cycle, it assigned a quality value of H. It's not very intuitive to turn H into a probability that there was an error at that site. The way that the FASTQ format encodes this information is using what's called a Phred score, which is a numerical value for the quality that's then turned into one of these printable ASCII characters. You can go online and look up FASTQ on Wikipedia, and it will give you a table mapping from one of these character codes back to the Phred score for the base.
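To make the four-line record structure and the quality encoding concrete, here's a minimal Python sketch. The filename is just a placeholder, and it assumes the standard offset-33 ("Sanger", modern Illumina) encoding, where the Phred score Q and the error probability p are related by Q = -10 log10(p).

    # Minimal sketch of reading FASTQ records and decoding the quality string.
    # Assumes the standard offset-33 encoding used by modern Illumina data;
    # "reads.fastq" is just a placeholder filename.
    def read_fastq(path):
        with open(path) as f:
            while True:
                header = f.readline().rstrip()
                if not header:
                    break                        # end of file
                seq = f.readline().rstrip()      # the base calls
                f.readline()                     # the "+" placeholder line
                quals = f.readline().rstrip()    # one quality character per base
                yield header[1:], seq, quals     # drop the leading "@" from the read ID

    def error_probability(qual_char):
        q = ord(qual_char) - 33                  # ASCII character -> Phred score
        return 10 ** (-q / 10.0)                 # Phred score -> probability the base is wrong

    for read_id, seq, quals in read_fastq("reads.fastq"):
        print(read_id, len(seq), "bases,",
              "first base error probability %.5f" % error_probability(quals[0]))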
So these Phred scores are estimates of the probability that the base is incorrect, and this is just a table with a sampling of Phred scores and the corresponding probability that the base is incorrect. A base quality score of three, which is very low, corresponds to about a 50% chance that the observed base is incorrect; a base quality score of 10 is a 10% chance the observed base is incorrect; a quality score of 40 is a one in 10,000 chance that the base is incorrect. Typical Illumina data will have an average quality score between 30 and 40, so a 0.1% to 0.01% chance that the bases are incorrect. But the base quality scores are going to drop towards the three prime end of the read, because of the chance that these signals go out of phase. Okay, so that's the description of Illumina. Any questions about Illumina at this point? Did I answer your question about the errors?

Just to follow up on that: it gives you the probability of an error, but could it tell you what the other base could be? It could be a T with a one in 1,000 chance of error, but it could also maybe be a C, and definitely not an A. Yeah, it's a good question. When Illumina sequencing first became available, the image files, these images of the flow cell, were kept, and you could go back and look at the images, or get software that would go back over the images and make alternative base calls at every position. Unfortunately, these image files are huge, so the field decided collectively to throw out the image data as it's being generated, just so you don't have multiple terabytes of files coming out of your sequencing run. Essentially, because you lose that information, and because the Illumina base caller is not designed to make alternative base calls, it will just tell you: I think there's a T here, and here's the probability that it's incorrect. You don't have that alternative information, unfortunately. One extension of that, though, is that the error modes of Illumina sequencing are quite well understood. We know that if there's a T called in error, with a certain probability it might have been a C instead, because of how the fluorophore spectra overlap. But typically that lower level information is only used in things like very sophisticated variant callers; it's not used by people who are just looking at their own primary data.

So the rest of this lecture is going to be about mapping reads to a reference genome, and then in later modules of this course, other instructors will take these ideas and describe how mapped reads are used to find genetic variation like SNPs, somatic mutations, copy number variation, and so on. Now, the human genome is incredibly large; we can think of it as a 3 billion nucleotide string. And Illumina reads are very short, only about 100 base pairs in length. So to discover variation, we want to compare reads to their corresponding position on the reference genome, the position where that read may have come from, and then look for differences in the alignment. So if we have our reference here and our read here, we see that there's a mismatch here where the reference has a C and the read has a T. This is a candidate SNP position. It also might be a sequencing error. By taking many, many reads overlapping the same position, we'd be able to differentiate between the two.
But the part of this process I'm going to talk about, and others are going to cover how the exact variant calling process works, is just finding where this read matches on the reference genome. It could have come from any one of 3 billion different places in the reference genome, and we need to find the exact one, or the set of possible mappings of that read onto the reference. So to do this we need to figure out where the read best matches the reference, and this is what we call the mapping problem.

Now I'm going to go over some of the challenges of mapping reads to a reference genome. The primary one is that the genome is incredibly large and repetitive. The human genome is roughly 50% repeats, where the same sequence is present in multiple copies spread throughout the genome. The main repeats are caused by retrotransposons, which have copied themselves around the genome over millions of years. And because the genome is so repetitive, naive algorithms that just look for simple matches between a read and the reference genome take too much time. The most naive mapping algorithm you could think of is to take our read and compare it against every position of the reference genome, sliding left to right, as if you were just scanning down the pages of a book for a certain sentence: if it matches there, we emit that mapping location, otherwise we just move along. For the amount of sequencing data that we get from an Illumina sequencer, this sort of algorithm wouldn't work. So we need to use much more clever techniques, which build what we call indexing data structures of the reference genome to very quickly screen out unsuccessful matches. And the key aspect of developing these indexing data structures is that we need to efficiently avoid aligning a read to every copy of these repetitive regions individually. There have been hundreds, if not thousands, of papers written about this topic of how to efficiently align very short reads to a repetitive reference genome, and this is the type of problem that my computational lab works on.

The second thing a read mapper needs to take into account is that reads contain sequencing errors, SNPs and indels. Something we might want to do is just look for exact matches between our reads and the reference genome, like a program such as grep might do. But because sequencing errors accumulate in reads, and because the reads differ from the reference genome at places where there are SNPs and indels, the mapping program needs to be tolerant of these differences between the reads and the reference genome. I gave you an example where there was a single base mismatch, shown again here, where there's this possible C to T substitution. And here's a slightly more difficult case where there's a deletion of the C base in the read and an insertion of this TC dinucleotide here. Now, when you go to map your own data, you should understand that tolerating substitutions, these differences where there's just a difference in the identity of the base, is much easier for the mapping program than tolerating insertion and deletion mutations. And the effect of that is that when you go to do things like SNP or indel calling or somatic mutation detection, the accuracy of finding SNPs or substitutions is much better than the accuracy of finding insertions and deletions, just because it's so much harder to map reads with insertions and deletions than it is to map reads with only substitutions.
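To make the naive scan I described a moment ago concrete, here's what it looks like in a few lines of Python: slide the read along the reference and count mismatches at every offset. It works for a toy example, but it's hopelessly slow for billions of reads against a 3 billion base genome, which is why real mappers build an index instead.

    # Naive mapping by sliding the read along the reference and counting mismatches.
    # Fine for a toy example, hopeless for billions of reads against a 3 Gb genome.
    def naive_map(reference, read, max_mismatches=2):
        hits = []
        for start in range(len(reference) - len(read) + 1):
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(window, read) if a != b)
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
        return hits

    print(naive_map("ACGTACGTTACCGT", "ACGTT"))
    # [(0, 1), (4, 0), (9, 2)]: an exact hit at position 4 plus two near-matches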
So in earlier versions of this course, I've gone over different mapping algorithms and talked about the various properties of a good mapper. That was a daunting task because, as I said, this has been an incredibly active area of computational biology research, and literally hundreds of mappers have been published over the last 10 years with different speed, sensitivity, and accuracy trade-offs. Luckily for us, the field has now settled on one mapping algorithm, so we're only going to learn about that one today. That mapper is called BWA-MEM. This is a program written by Heng Li when he was at the Sanger Institute; he's now at the Broad Institute. The reason it's become the standard way of mapping Illumina reads to a reference genome is that it's very well supported: Heng is actively updating BWA to fix bugs, and there's a huge community of users, so if you need to ask a question, a lot of people have experience working with this mapper. Also, most of the downstream tools that we're going to be working with, like the SNP callers and the somatic mutation callers, expect alignments to come out of this program, so they're optimized to take alignments from BWA-MEM.

Now, as I mentioned, the main problem when mapping reads to a reference genome is that we need to efficiently handle all the repeats that occur in the reference. BWA-MEM uses a data structure called the FM-index, which is also sometimes called the Burrows-Wheeler transform, to avoid mapping reads to every copy of a repeat individually. It organizes the reference genome into this data structure, which is closely related to what we call a suffix tree, and which allows you to take a read and align it to all the different copies of a repeat simultaneously. This is what gives BWA-MEM its great speed. Going into how this mapping algorithm works and how the FM-index works is a little bit too much for this course, but if anybody's really interested in the details of how read mappers operate under the hood, I'd be happy to go over that during the lab session.
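The FM-index itself is beyond what we'll cover, but to give a flavour of what an index buys you, here's a sketch using a much simpler index, a sorted suffix array queried by binary search. This is not the data structure BWA-MEM actually uses; it just illustrates how an index finds every copy of a repeated pattern without scanning the whole genome.

    from bisect import bisect_left, bisect_right

    # Not the FM-index that BWA-MEM uses, just a simple sorted suffix array to
    # illustrate how an index finds exact matches without scanning the whole genome.
    def build_suffix_array(text):
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find_exact(text, suffix_array, pattern):
        suffixes = [text[i:] for i in suffix_array]      # already in sorted order
        lo = bisect_left(suffixes, pattern)
        hi = bisect_right(suffixes, pattern + "~")       # '~' sorts after A/C/G/T
        return sorted(suffix_array[lo:hi])

    genome = "ACGTACGTTACGT"
    sa = build_suffix_array(genome)
    print(find_exact(genome, sa, "ACGT"))   # [0, 4, 9]: all copies of this repeat at once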
So we talked a little bit about base quality scores, which are the base caller's estimate that a base call is incorrect. There's a corresponding concept in mapping, which is called mapping quality, and what a mapping quality tries to do is assess how reliable an individual mapping is. Coming back to this idea of how repetitive the human genome is: even though we have 100 base pair reads, which can cover about 90% of the genome uniquely, there's still a lot of sequence in the human genome that can't be covered uniquely by a 100 base pair read. So when BWA-MEM takes the reads and maps them to the reference genome, it might have multiple candidate mapping locations with very similar alignments, or very similar scores, and it's going to need to decide what the true alignment is among these different possible alignments. Here's an example. We have a very short read, which is just short for illustration. It has two candidate mapping locations: one to chromosome 10, where there's a single base pair substituted, the reference has a T and the read has an A, and one to chromosome 2, where there are two base pairs substituted, the reference has GT and the read has TC. Now, if BWA-MEM found these two possible mappings, it needs to decide what the true mapping is and give you some level of confidence in that. So it's going to calculate this mapping quality score, which, just like a base quality score, is a Phred-scaled probability that the mapping BWA-MEM chooses is incorrect.

And the way that it calculates this is that it's going to use the base quality scores to decide whether the mismatches it observes are sequencing errors or true SNPs, and then weigh the two mappings against each other: is a mapping with one mismatch at this position more likely to be true than a mapping with two mismatches at this position? So BWA-MEM takes all the possible mappings, looks at the quality scores around these differences, assigns this probability score to each one, and outputs that into the SAM and BAM file as the mapping quality. For this one, it would assign a mapping quality of 10, and just like we saw with the base quality scores, that means there's about a 10% chance that that mapping is incorrect.

Now, complicating this is that sometimes, when a read matches a very repetitive region of the genome, we might have two mappings that are exactly equivalent, and there's no information to resolve them. Here we have another read, and it maps to chromosome 16 and it maps to chromosome 3, and they both have a mismatch in the same position. In the absence of any other information, we can't say anything about which one of these two mappings might be correct. We say that these reads are perfectly ambiguous: you can't pick chromosome 16 or chromosome 3 as the correct position for where this read might have come from. In this situation, and this is sort of a BWA gotcha, something that's not intuitive and that people get confused about when they first start working with mapping data, BWA-MEM is just going to pick one of these two positions randomly. It's going to select one at random and output that as the mapping for the read. It doesn't output the alternative mapping; it doesn't say that this read could have also come from chromosome 3 if it picked chromosome 16. But what it's going to do is set the mapping quality to zero, which flags that the mapping is perfectly ambiguous and you can't trust it at all. Downstream programs like the mutation callers, and particularly the copy number variant callers that you'll hear about this afternoon, are designed to tolerate these mapping quality zero reads. When you're looking at your sequencing data, if you're looking for somatic mutations and you see a bunch of reads that show a somatic mutation with a mapping quality of zero, you should automatically think, okay, I'm not going to trust these reads as much as reads that have a higher mapping quality.
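Here's a toy illustration of the Phred-scaled mapping quality idea. This is not BWA-MEM's actual model; it just shows how, given a likelihood for each candidate mapping, the probability that the chosen mapping is wrong gets converted to a Phred scale, and how a perfectly ambiguous read ends up with mapping quality zero.

    import math

    # Toy illustration of a Phred-scaled mapping quality (not BWA-MEM's real model).
    # Each candidate mapping gets a likelihood, e.g. the probability that its
    # mismatches are all sequencing errors; the posterior that the best candidate
    # is wrong is converted to the Phred scale and capped at 60.
    def mapping_quality(candidate_likelihoods):
        best = max(candidate_likelihoods)
        if candidate_likelihoods.count(best) > 1:
            return 0                     # perfectly ambiguous: report zero, like BWA-MEM
        total = sum(candidate_likelihoods)
        p_wrong = 1.0 - best / total
        if p_wrong <= 0:
            return 60
        return min(60, int(round(-10 * math.log10(p_wrong))))

    print(mapping_quality([1e-3, 1e-6]))   # one clearly better candidate -> about 30
    print(mapping_quality([1e-3, 1e-3]))   # two equally good candidates  -> 0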
Yeah, that's a very good question. Right now it wouldn't. So the question was: if it's known that there's a SNP on chromosome 16, that there's an A SNP at this location, but there isn't a known SNP on chromosome 3, would BWA-MEM be able to resolve that by saying this position is more likely to be the alternative allele that we know is segregating in the population? BWA-MEM doesn't do that. It only has a view of this fixed, single reference genome. There are a lot of people, though, who are working on new strategies for mapping reads to a reference genome, using what we call population reference graphs, where rather than representing the reference genome as a single string, you represent it as a graph data structure where branches in the graph denote where there's variation within the population. So a population reference graph would in principle be able to resolve that, by leveraging this extra information that we know there's a polymorphism at that position. But right now, those methods are very much still in development. My group works on this, I have a graduate student who works on this, but they're not yet ready for people to use at a very large scale.

Why pick a position at random? It's mainly for copy number variation prediction. The main way that we detect copy number variation is that we look at read depth along the genome; if you have a duplicated region of the genome, you expect twice as many reads in that region. By randomly selecting one of the possible mappings for a read, you kind of even out the coverage across all the copies of the repeat that it could have come from. Yeah, so if you had a copy number variant of just the repetitive region, we're not going to be able to detect that, because, exactly, it would smooth it out. If it's a copy number variant of, say, a megabase of the genome, which is a mixture of unique and repetitive sequence, then you wouldn't lose all your depth in the repetitive regions, you'd just smooth the copies across them, so you'd still be able to pick that up. When mapping algorithms were first developed, this was a big point of discussion: how do we deal with these very repetitive regions and these repetitive mappings? Because in principle, the thing you might want to do is catalog all of the possible locations that the read maps to. Unfortunately, for very high copy repeats, like simple sequences such as CG dinucleotides or homopolymers of length a thousand, the number of possible mapping locations of a read that consists of only A's is going to be in the millions, and it's just impractical to output all possible mappings. So this idea of randomly selecting one mapping, to sort of smooth the coverage across all copies, became the best compromise.

Yeah. Is the mapping quality ever resolved by looking at the other reads mapped to the same place? It isn't. Each read is treated by the mapper independently, so it wouldn't look at other reads to try to resolve this, with a caveat that I'm just about to explain.

Okay. So when we first had Illumina sequencing, or when it was still known as Solexa sequencing, the read length was about 27 bases, then it increased to 36 bases; now it's about 100 bases. But for a long time we were working with 36 base pair reads, and the number of ambiguous mappings for a 36 base pair read against a very large reference like human was very, very high. We didn't have a lot of unique coverage across the whole genome; I think something like 80% of the genome could be covered uniquely by 36 base pair reads. So Illumina developed a chemistry trick to increase the number of unambiguous mappings we can get, and this is done using what we call paired-end reads, where we take a DNA fragment of length about 400 bases and we take an Illumina read from one end and an Illumina read from the other end. We don't know the sequence in the middle. But what the mapping program can do is use the fact that these two reads are from the same source fragment of DNA, and that the two ends should be around 400 bases apart, to constrain the alignments and help resolve some of these ambiguous mappings. So here's how BWA-MEM would map a paired-end read. Let's go back to this example where we had this ambiguous mapping to chromosome 16 and chromosome 3, and there's no information to figure out which is the true mapping.
Now, if we sequence reads in pairs, and the sequence of this read's pair is here, we can map that pair to a nearby location, roughly our DNA fragment length away, and then see whether there's a good mapping for it downstream of the first half of the pair. In this case, there's a mapping to chromosome 16 where there's a single mismatch, but when we look at chromosome 3, there are many mismatches. So by sequencing this distant end of the DNA fragment, we're able to resolve this, and the mapping to chromosome 16 becomes unambiguous because it's constrained by this good mapping for its pair. Is this clear to everybody? So when you work with sequencing data now, almost all Illumina data comes as paired-end reads. It's very, very rare, unless you're doing things like RNA-Seq or some other assays, to sequence single-end reads. You'll almost always have paired-end reads: we've sequenced 100 base pairs from one end of the DNA fragment and 100 base pairs from the other, and then the mapping algorithm is going to try to place them together to resolve some of these ambiguous mappings.

Okay, in the last few minutes of this lecture, before we get into the lab, I'm going to go over the main file format that we use to store mappings and alignments. Early in high throughput sequencing bioinformatics, every mapping program output its own file format, each with slightly different and slightly inconsistent behavior. As I said, there were hundreds of different mapping programs developed, and this was a bit of a nightmare for the field, because you'd have to write converters to take the output from mapper A and put it into SNP caller B, and vice versa. So everybody got together, realized that this was silly, and decided to make a standardized file format that everybody could write their sequence alignments into, and then all the downstream tools we work with, like the SNP callers and the somatic mutation callers, would take this standard file format as input. The file format that was developed is called the Sequence Alignment/Map format, or SAM for short. It comes in two variants. SAM is a human readable text format, and there's a record from the SAM format shown down here. SAM is useful if you want to examine the alignments yourself for quality control, or just pull out some information like where the reads map, but primarily the alignments are stored in what we call the BAM format, which is a binary file format that is much more efficient to store on disk and to read into these downstream SNP callers and mutation callers. Later in the tutorial we'll look at some command line tools you can use to convert between SAM and BAM.

So I'm going to walk you through each one of the different fields in the SAM file format now. The first field here, SRR013667.1, is the read ID. This comes directly from that first line in the FASTQ file, the one with the @ symbol, which is just the unique identifier for the read. The second field is what we call the flag. This is a numerically encoded bit array which gives information about whether the read came from the reference strand or the reverse complement of the reference strand, whether the read came with a pair, and whether the pair was mapped successfully. This isn't a human readable field: you can't look at a flag of 99 and say, okay, this mapped to the reverse complement strand. You need to use a converter to do this.
In the last slide of this lecture, I link to a page on the Broad Institute website where you can put in this numerically encoded flag and it will tell you which bits are set and what those bits mean. The next two fields are the chromosome and start coordinate of the mapping that was selected; this read was mapped to chromosome 19 at this position. The fifth field is the mapping quality. For mapping quality, higher is always better, and a mapping quality of 60 is the upper threshold. This means that BWA-MEM had incredibly high confidence that this is a unique, correct alignment, where the probability that it's incorrect is something like 1 in 10^6. So BWA had a lot of confidence that this is a correct mapping.

The next field is what we call the CIGAR string. The position field gives you the coordinate where the mapping starts, and the CIGAR string tells you how the bases of the read line up against the bases of the reference genome. This is a slightly obscure format; it's designed to be a compact representation of where the gaps in the alignment are. The CIGAR strings for a couple of example alignments are down here. Here's our reference sequence, here's our read sequence, and there's a gap in the read, which is shown by this bar here. The CIGAR string says 4M, there are four matches between the read and the reference, here, here, here and here; then 1D, one deletion from the reference, so this T needs to be deleted; and then 6M, six matches between the reference and the read sequence here. So using the CIGAR string, and for the example up here it's 76M, 76 matches, the SNP caller can figure out which base of the read lines up against which base of the reference genome. Here's another example of a CIGAR string where there's an insertion into the read: again we have 4M for these matches and then 1I showing that insertion. Now, I purposely drew a substitution difference here, where there's this A over T. The CIGAR string only tells you which bases of the reference and the read line up against each other; it doesn't tell you whether they're the same nucleotide or not. So you can't just take the CIGAR and say there's a single mismatch at the second position; you have to actually compare the first four bases of the reference with the first four bases of the read to say that there's a SNP, or a substitution, there. And all tools will have code to do that comparison.

The next fields all deal with the paired-end information. We have the chromosome of the other half of the read pair, which is shown as an equal sign if it mapped to the same chromosome; in this case they both mapped to chromosome 19. Then the position of the read's pair, and then the insert, or fragment, size, which is the inferred length of the piece of DNA that was sequenced, based on the alignments to the reference genome. And the remaining two fields are just the read sequence as it was seen in the FASTQ file, and the base quality scores as well. So yes, it's the full fragment length; I should change this slide. The proper, and clearer, terminology that the SAM file format uses is the template length, which is the size of the complete DNA molecule end to end. It's also commonly referred to as the insert size, but I realize there's some confusion around that term, whether it refers to the unsequenced distance or the entire length of the fragment; this field refers to the entire length of the DNA fragment.
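If you ever need to interpret these fields programmatically, here's a small sketch of decoding the flag bits and walking a CIGAR string. In a real pipeline you'd normally use a library like pysam rather than parsing SAM by hand; the bit meanings below are the ones defined in the SAM specification.

    import re

    # Decode a SAM flag: each bit has a fixed meaning defined in the SAM specification.
    def decode_flag(flag):
        bits = {1: "paired", 2: "proper pair", 4: "unmapped", 8: "mate unmapped",
                16: "read reverse strand", 32: "mate reverse strand",
                64: "first in pair", 128: "second in pair"}
        return [name for bit, name in bits.items() if flag & bit]

    print(decode_flag(99))   # ['paired', 'proper pair', 'mate reverse strand', 'first in pair']

    # Walk a CIGAR string and report how many reference bases the alignment spans.
    # M/=/X consume both read and reference, I consumes only the read, D only the reference.
    def reference_span(cigar):
        span = 0
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
            if op in "MDN=X":
                span += int(length)
        return span

    print(reference_span("4M1D6M"))   # 11 reference bases
    print(reference_span("4M1I4M"))   # 8 reference bases (the insertion doesn't consume reference)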
Okay, before we move into the lab, just a few more resources and then a few slides. So, to the question: usually you want to tell BWA-MEM that both reads came from the same pair. There are two ways you can do this, and we'll see one of them when we map reads in the tutorial. You can either have the read pairs in the same FASTQ file, where the first read of the pair is in one record and the other read of the pair is in the next record, or, and this is how your sequencing data will usually come off the Illumina instrument, you have two files, where all of the reads from the first half of the pair are in one file and all the reads from the second half of the pair are in the other. Then you give those two files to BWA-MEM, and it will figure out where the pairs are and do it all automatically. The reason this is important is that if you mapped, say, the first half of the pairs with one run of BWA-MEM and the second half of the pairs with a second run, BWA-MEM wouldn't be able to use the information between them. It needs to be told which reads go together during the mapping step.

Yeah, it's okay, it's just slightly different. The older BWA workflow writes out an intermediate file format, one of those intermediate files for the first half of the pair and one for the second half, and then a subsequent step takes in the pair of intermediate files and outputs a SAM file. But BWA-MEM internalizes all of that. I encourage you to use BWA-MEM if you can; everything has moved to BWA-MEM, and it does a lot better with 100 base pair reads and with aligning around indels. BWA-MEM internalizes all of that: you just give it the two halves of the pair as two FASTQ files, or one interleaved FASTQ file, and it handles it automatically. But yeah, you're right, the older workflow needs to handle it a little differently. I can talk to you a bit more about the details if you're interested. Any more questions before I move on?

Okay, so here are some links and resources. SAMtools is the primary toolkit for working with SAM and BAM files. There's a command called samtools view, which will convert from SAM to BAM and vice versa. You can also sort a BAM file by its alignment coordinate or by read name, and you can use it to extract the reads for a given reference region or genomic location, if you're only interested in doing, say, variant calling at a particular gene. If there's anything that you think is ambiguous or that you don't understand about a SAM file, you can go directly to the SAM and BAM specification, which is linked here; it's a technical document describing what each one of those fields means. If you run into trouble, there are a lot of different resources for asking for help: there's a SAMtools help mailing list, and Biostars and SEQanswers both have large communities of experienced users who can help out. And finally, there's a link here for decoding that flag field, where you can put in the numerical value of a flag and it will tell you which bits are set and what they all mean. I encourage you, when you're looking at your files later on, to go to that web page and have a look at what these flags mean.
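Since you'll run these tools yourself in the lab, here's a rough sketch of what a minimal mapping pipeline looks like when driven from Python. The file names are placeholders, and you should check the bwa and samtools versions installed on your own system; the commands shown (bwa index, bwa mem, samtools sort, samtools index) are the standard ones.

    import subprocess

    # A minimal sketch of a standard mapping pipeline; the file names are placeholders.
    ref = "reference.fa"
    r1, r2 = "sample_R1.fastq", "sample_R2.fastq"

    subprocess.run(["bwa", "index", ref], check=True)                 # build the reference index once
    with open("sample.sam", "w") as sam:
        subprocess.run(["bwa", "mem", ref, r1, r2], stdout=sam, check=True)   # paired-end mapping
    subprocess.run(["samtools", "sort", "-o", "sample.sorted.bam", "sample.sam"], check=True)
    subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)    # enables region queries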
Okay, so the last few slides here are just going to be examples of some alignment issues that you might see when working with real data, and that you should be aware of. Have you had an IGV tutorial already? Okay, great, so you're all now familiar with this screen. This is IGV for a tumor-normal pair; the tumor is on the bottom, the normal is up on the top. Each one of these gray bars is a read aligned to the reference genome, and the shade of the gray bar indicates its mapping quality. This moderate shade of gray here indicates that these reads mapped with very high mapping qualities. If the reads mapped with low mapping qualities, they'd be a lighter shade of gray, and if they mapped with a mapping quality of zero, meaning they're completely ambiguous, they'd be white bars here. So you can look at them and automatically down-weight the evidence that you think there's a mutation there. IGV draws substitution mismatches as colored nucleotides, and we see here in this tumor sample that there's very good evidence for a substitution, while there isn't corresponding evidence for that substitution in the individual's normal. So a good somatic mutation caller would take this, look at the high level of evidence for the substitution, look at all these mapping qualities, which are quite high, and call a somatic mutation at this position, which is a C to G substitution.

Now, you might run into regions of the genome that look like this. This is an incredibly awful picture of reads aligned to the reference genome. When I look at a screen of reads aligned to the reference genome, what you want is roughly uniform coverage of reads across the genome. Here we have these huge spikes in coverage where there are a lot of reads, and this jagged coverage here, and we can see there's probably something odd going on just based on the huge variation in coverage. The reason is, if you look up here at this diagram of the chromosome, this red bar indicates where this window is: we're right at the beginning of the centromere of chromosome 2. The centromeres are incredibly repetitive and very difficult to map reads to, so BWA-MEM has struggled to accurately place reads in this region of the reference genome. I would automatically look at this and think that none of the variants seen here are likely to be true; they're probably just artifacts of the mapping process.

Finally, you might get reads that look like this. Here BWA-MEM has only been able to align the start of some of these reads, and then it has performed what's called a soft clip on the reads, where the end of the read hasn't been aligned to the reference genome. In this view of IGV, the soft-clipped bases show up as these multicolored tails on the reads. What's happened here is that there's a very large deletion in this tumor sample that isn't present in the normal. Because the deletion is so large, BWA-MEM wasn't able to map reads that span across it, so it has only aligned reads up to the point at which the deletion starts, and then discarded the rest of the read by emitting what's called a soft clip. These soft-clipped bases are denoted in the CIGAR strings with an S code, and some programs that look for somatic mutations or indel mutations will take those soft-clipped reads and try to perform a little local assembly around here to gain evidence that that's a true mutation. Okay, I'm going to stop talking now, and we're going to go into the read mapping exercise.

Okay, welcome back everybody. So this lecture is a new addition to the cancer bioinformatics course. There's a growing need for doing genome assemblies, both to characterize novel species that don't have a good reference genome, and I was chatting with someone about this just at the break, but also to look at the more highly rearranged parts of cancer genomes.
So we decided to add this new lecture, which gives a high level overview of how genome assembly works in the case where you don't have a reference, or don't have a very good reference. Just to motivate what genome assembly is, here's how I view the genome assembly process. We've got our genome, which is made up of a mixture of unique and repetitive sequence; again, I'm visualizing the repetitive sequence using these red bars. We fragment many copies of our genome, we randomly sample DNA fragments, and we determine their sequence with a DNA sequencing instrument. And then the genome assembly problem is to take those randomly sampled fragments and reconstruct the genome using just the sequence of those fragments. In the mapping lecture, I gave the analogy of mapping as trying to find a sentence in a book by flipping through the pages and trying to match up that sentence. With genome assembly, we're going to take many copies of our book, put them through a paper shredder, shred them into a lot of pieces, dump them on the floor, and then try to look at those fragments of the book and reconstruct what each page said. This is a much, much harder problem than just scanning through and trying to find matches, but in the case where we don't have a reference genome, it's the best we can do.

Now, the assembly problem is very close to my heart; this is what I've spent most of my career working on. I wrote an assembler called ABySS when I was at the BC Genome Sciences Centre out in Vancouver, and then continued working on genome assembly algorithms during my PhD in the UK. I'm going to talk in three parts here. First, I'm going to talk about how assemblers work, sort of a high level theory of assembly. I'm then going to talk about assembly algorithms, both for short reads, like the Illumina reads we've already discussed, and for the emerging long read sequencing platforms like PacBio and the Oxford Nanopore sequencers, which can sequence much longer fragments of DNA, often in excess of 10,000 bases. And finally, I'm going to give an overview of what genomic features make short read assembly difficult, and how we can measure and assess them before actually setting off on a genome assembly.

So we talked about assembly using whole genome shotgun sequencing data. Whole genome shotgun sequencing, or WGS, takes many copies of our input genome and fragments them into many pieces randomly. This is what the shotgun part refers to: it's as if you blasted the genome into billions of pieces, and then you figure out the sequence of each one of those pieces using a DNA sequencing instrument. At the end, we've got all of our fragments, which are sequence reads, and we're going to try to use them to reconstruct what the original genome sequence was. Now, if we knew the ordering of these fragments along the genome, if we knew, for example, that this read came from the leftmost position of the genome, and this read came from the fifth position, and so on all the way along, if we knew this total ordering of the sequences, genome assembly would actually be very easy. We could just look at the base at each position and put it into our output sequence to reconstruct the genome. Unfortunately, because of the shotgun nature of genome sequencing, because we've done this random fragmentation process, we don't know where these reads came from.
We don't have this total ordering of the DNA fragments across the genome that would allow us to reconstruct the sequence easily. We need to infer the ordering of the reads computationally, and this is where labs like mine come in, where we work on algorithms to do this efficiently.

Now, a key term involved with genome sequencing, and indeed with reference mapping as well, is the coverage of your genome. The coverage of the genome is a measure of how redundantly you've sampled the genome. Usually it's shorthand for the average coverage, which is the average number of reads covering a particular position in the genome. We calculate that coverage by summing up the total number of bases we've sequenced, which here comes to 177 in these blue reads, and dividing that by the length of our genome, here 35 nucleotides; 177 divided by 35 is about 5, so we say we have roughly 5x coverage of this genome. We can also refer to the coverage at a particular base of our genome, the number of reads crossing that position; here the coverage of this T is 6, because we have 6 reads that cover this position. When doing any sort of sequencing analysis, the coverage of your genome is incredibly important. If you want to find somatic mutations, you need a lot of sequencing coverage to determine whether a somatic mutation is a true mutation versus sequencing error noise. If you have a position of the reference genome that's covered by a single read and there's a mismatch in that read, even with a high quality score, you're not going to trust that that's a true somatic mutation; you need multiple reads covering each position of the genome to give redundant evidence that there's a mutation there. Similarly, when we're trying to do a genome assembly, we want high coverage, because we need the reads to overlap so that we can determine that they come from the same position of the genome by sharing very similar sequences.

So the basic principle of genome assembly is that the more similar two reads are, the more likely they are to have come from overlapping stretches of the genome. If we sequence a genome with 100 base pair reads to 50x coverage, we expect the reads to overlap by about 90 bases or more. And if we find a 90 base overlap between two reads, where an overlap means the end of one read matches the beginning of another, we can infer that they might have come from the same or overlapping pieces of the original genome, and then we can stitch them back together to reconstruct that little bit of the genome. Genome assembly software formalizes this notion of using sequence similarity, or overlaps between reads, to reconstruct the genome.
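Coming back to the coverage arithmetic from the slide, it's easy to write down directly; here's a small sketch using made-up reads that sum to the same 177 bases.

    # Average coverage = total sequenced bases / genome length.
    read_lengths = [25, 25, 25, 25, 25, 26, 26]    # hypothetical reads summing to 177 bases
    genome_length = 35
    average_coverage = sum(read_lengths) / genome_length
    print(round(average_coverage, 1))               # about 5x for this toy example

    # Per-base coverage: count how many reads overlap a given position.
    reads = [(0, 10), (3, 13), (5, 15), (8, 18)]    # hypothetical (start, end) intervals
    position = 9
    depth = sum(1 for start, end in reads if start <= position < end)
    print(depth)                                     # 4 reads cover position 9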
Now, the genome assembly field developed when Sanger sequencing was invented by Fred Sanger in the 1970s. The first genomes were essentially reconstructed manually, by comparing Sanger sequencing gels to each other. It became pretty obvious, though, that automated, software-based sequence assembly would be needed for assembling larger genomes like bacterial genomes. The first methods for automating assembly were developed for fairly long reads, Sanger reads of around a thousand bases with a moderate error rate. The field then moved to devising all new methods for short read sequencing, like Illumina sequencing, with its 100 base pair reads, very high accuracy and very high throughput. We worked on short read assembly for about five or six years, but with the invention of the PacBio and Oxford Nanopore sequencers, which can sequence reads greater than 10,000 base pairs with a higher error rate, the methods that the field originally developed for Sanger sequencing are now back in use for these long read sequencing platforms. So as I describe the two assembly approaches, I'm first going to describe the assembly algorithms that were developed for Sanger sequencing data and that are now applied to long reads like PacBio and Oxford Nanopore, and then I'm going to describe the variants of those algorithms that are used for short read sequencing. I'm describing these in two completely separate parts because the properties of the data sets are very different, so we need very different assembly methods to handle them.

So a long read assembly pipeline looks like this. We take the sequencing reads from our PacBio or Oxford Nanopore instrument as input. We then build what we call an overlap graph, which looks for similar pairs of reads and links them in a graph; I'll describe this in the next slide. We then take our graph and process it to compute a layout of the reads, which is an ordering of the reads from left to right. This describes the contigs of the assembly. We then do the final step, which is called the consensus step, where we pick the most likely nucleotide sequence for each contig, and then we output our contigs at the end. The definition of a contig: it's short for contiguous sequence, and it's just a region of the genome that has been assembled back together. It's unlikely to be the full length of a chromosome, but it might be a few megabases in length if we've stitched many reads together into that contig.

So the first step of this assembler is finding all overlaps between reads. As I said, the fundamental principle of assembly is that we're looking for sequences that share a lot of sequence similarity, where the suffix at the end of one read matches the prefix, the beginning, of another read. What an overlap-layout-consensus assembler does is take all of your long reads, take all possible pairs of them, align the ends of them, just like BWA aligns a read to a reference genome, and check whether there's a significant match at the ends of those reads. If there is a match, it creates an edge from one read to the other in this overlap graph. If there's not a significant match, which is decided by a threshold on the overlap length and the number of mismatches it will tolerate, there's no edge between those nodes. So our overlap graph has a vertex for every one of our sequencing reads and an edge between any two reads that have this sort of relationship, where a suffix of one read matches a prefix of the other.
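Here's a sketch of the overlap-finding step in its simplest form: check whether a sufficiently long suffix of one read exactly matches a prefix of another. Real overlappers tolerate mismatches, because of sequencing errors, and use indexes so they don't have to compare every pair of reads.

    # Find the longest exact suffix-prefix overlap between two reads, above a minimum length.
    # Real overlappers tolerate mismatches and use indexes to avoid comparing all pairs.
    def overlap(a, b, min_length=3):
        for length in range(min(len(a), len(b)), min_length - 1, -1):
            if a.endswith(b[:length]):
                return length
        return 0

    reads = ["to_ever", "_everyt", "erythin", "ything_"]    # toy seven-character fragments
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j and overlap(a, b) >= 3:
                print(a, "->", b, "overlap of", overlap(a, b))
    # to_ever -> _everyt (overlap 5), _everyt -> erythin (4), erythin -> ything_ (5)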
So here's an example overlap graph, with a minimum overlap length of three, for a handful of seven base pair strings from an original genome that's shown down here. Once the assembler has constructed this overlap graph, its job is to find a path, or a walk, through the graph, visiting each one of these vertices, that reconstructs the original string. We don't know the sequence of that string; we need to figure it out from the structure of this graph. Now this graph, even for a very small example, is a little complicated. It's not obvious here whether there's a walk through this graph that reconstructs the original string. So what the assembler has to do is pre-process this graph to clean it up a little bit, so that it can compute this layout of the reads into the reconstructed genome.

That's the next step of the OLC assembler, where it bundles stretches of the overlap graph into contigs. To illustrate this, we've shown an overlap graph here for a fragment of a song that goes "to everything turn turn turn, there is a season". The reason we've selected this fragment of a song is that there are three copies of the word "turn" in here, and this represents a genomic repeat, where we need to figure out what the copy number of the repeat is. So if we take this string, break it into seven-character subsequences, and construct an overlap graph of those subsequences by just looking at overlaps between the different fragments, we get a graph that looks like this. This graph is quite messy, it's very difficult to look at, and there's no obvious reconstruction from just looking at it. What the assembler is going to do is try to infer whether all of the edges in this graph are actually necessary or not. If we look at this subgraph consisting of these three nodes, there are two paths here: we could go from here to here to here, or we could go directly from here to here via this green edge. Now, the sequences spelled by these two paths are exactly the same: they both spell "to every" with an underscore at the end. It's the same whether you go from here to here, or from here to here to here. They spell the exact same sequence, which means one of those paths is redundant; it's not necessary to represent it in the overlap graph. In terms of theoretical computer science, we say that this green edge is a transitive edge that can be inferred from the other edges. Can you see it up here? From here to here it's the exact same sequence as going from here, down here, to here. So the green edge is a transitive edge, and the blue edges in this example are what we call irreducible edges; they're not redundant. So what the assembler is going to do is search the graph for transitive edges and remove them, to clean up the graph and simplify it a little bit.

In the first step, we remove transitive edges with this pattern, where there's a chain of edges from this node to this node to this node, and an edge that bypasses the middle node; we get rid of these bypassing edges. So let's do that, and we get a graph that's much cleaner in structure than what we saw up here; we've removed a lot of these transitive edges, cleaning up the graph. In the second step, we remove another set of edges which bypass two nodes: instead of bypassing one node, these edges go over these two, rather than taking the chain along the bottom from here to here to here to here. And again, we get a much simpler graph, where a lot of the vertices have only one edge coming in and one edge going out. This is exactly what an assembler wants to see, because it's unambiguous what the reconstruction of this part of the genome is. There's no ambiguity; we can just merge all of these chains of nodes together to get our assembled genome. And that's exactly what the assembler is going to do in the layout stage, after it performs this transitive reduction.
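Here's a toy version of that first transitive reduction step, removing edges that bypass a single node. Real assemblers do this more carefully over the string graph, so treat this as an illustration of the idea rather than the actual algorithm.

    # Toy transitive reduction: remove an edge a->c whenever some a->b and b->c also exist.
    # Real assemblers do this more carefully over the string graph, but the idea is the same.
    def transitive_reduction(edges):
        edge_set = set(edges)
        reduced = set(edges)
        for a, c in edges:
            for b in {y for x, y in edges if x == a}:
                if b != c and (b, c) in edge_set:
                    reduced.discard((a, c))   # a->c is implied by a->b and b->c
        return reduced

    edges = [("to_every", "every_th"), ("every_th", "thing_tu"), ("to_every", "thing_tu")]
    print(sorted(transitive_reduction(edges)))
    # [('every_th', 'thing_tu'), ('to_every', 'every_th')]: the bypassing edge is gone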
It's going to find these unbranching, these non-branching stretches of the graph and output them as the contigs of the assembly. So we're going to have one contig here that runs up until this branch point; it has the sequence "to everything turn". We're going to output another contig here, coming out of this branch point, which reads "turn there is a season". And then the bit in the middle, with this complicated branching and looping structure, is the unresolvable repeat that we can't assemble. It's not obvious from the graph how many times we should put that word "turn" into our output, so the assembler just gives up at this point and outputs two contigs and one unresolvable repeat. That's all the assembler is trying to do: infer these relationships between reads, find the stretches of the genome that can be unambiguously assembled, assemble those together, and give up on the repetitive regions of the genome. Now the final step, after it's computed those contigs, is to calculate what we call the consensus sequence, which means inferring the true sequence of the genome for each one of those contigs. The reason this step is necessary is that long read sequencers like PacBio and Oxford Nanopore have a very high error rate, and we need to overcome that error rate when we're constructing our contigs. So what we do is take all of the reads that we used to assemble the contig and construct what we call a multiple alignment of those reads, where we line up every base from every read into a column, just like when we were aligning our reads to our reference genome and we had that column of all G's aligned to the reference. Then we're going to process this multiple alignment from left to right and take the most frequent base in every one of these columns. So here we've got five T's in our reads, so the sequence in our output genome is going to be T; all the reads agree that there is a T here. Likewise, in the second column we have five A's; all the reads agree there's an A, so we're going to put an A into the output. Now for the third column, one read thinks that there is a deletion there, it's got a gap, but four reads say that there's an A. The rule that we're going to use to make a consensus sequence is to take the most frequent base in the column and put it into the output. So there are four A's and one gap, so we say the consensus is an A and that goes into the output. Okay, so let's carry on; all these columns are unambiguous. If we go to this column, four reads said there is a C and one read said there is a T, so we're going to put a C into our output. Here, four reads said there's a C and one read said there's a gap, and so on, until we've processed every column of this multiple alignment and picked out the most frequent base. Now this is a bit of a simplification, because the error rate of PacBio and Oxford Nanopore sequencing data is so high that what the consensus algorithms typically do is take the raw signal that the DNA sequencer measured and put that raw signal into a probabilistic model, which weights each read's vote by the actual base quality score, or the inferred probability of that base from the sequencer, and that gives a bit better accuracy in reconstructing this consensus sequence. I'm going to pause at this point and ask whether there are any questions about assembly for long reads up to here.
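A minimal version of that column-by-column majority vote, assuming the reads have already been padded into a gapped multiple alignment; as noted above, real long-read consensus tools work from signal-level or quality-weighted models rather than a plain vote.

```python
from collections import Counter

def consensus(msa_rows):
    """Majority vote per column of a gapped multiple alignment ('-' marks a gap).
    A winning gap means the column is treated as a deletion and emits nothing."""
    out = []
    for column in zip(*msa_rows):                       # walk the alignment left to right
        base, _ = Counter(column).most_common(1)[0]     # most frequent symbol in the column
        if base != "-":
            out.append(base)
    return "".join(out)

# Five reads padded to equal length; the third column has one gap, the fourth one disagreeing base.
msa = ["TAACG",
       "TAACG",
       "TA-CG",
       "TAACG",
       "TAATG"]
print(consensus(msa))   # TAACG
```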
So for PacBio the reads are typically between 10,000 and 20,000 bases. For Oxford Nanopore it depends on how you prepare the DNA; we typically go for around 8 to 10 kb to maximize throughput, or yield, but there are groups who have taken the approach of maximizing the read length instead, and they've been able to get 100,000 base pair reads, up to 500 to 600,000 bases. But there's a drop-off in the number of reads and the total amount of data you get if you shoot for the longest read length possible. Did you have a question? So that is really about the sequencing, I mean the chopping of the fragments and all that; for RNA sequencing it's usually 75 or 100 bases, but now they're talking about thousands, right? So for the de novo assembly it's a completely different way of doing it? Yeah, so typically, between 2010 and 2015, we did de novo assembly using Illumina reads, whole genome shotgun Illumina reads, 100 base pair reads, and we wouldn't get very good assemblies out of it because of the short read length. So what the field has moved to now is using different sequencers, not Illumina sequencing but the PacBio and the Nanopore instruments, which can sequence these 10,000 base pair reads and give you much better quality genome assemblies. And in the practical session after this lecture you'll get a chance to do a short read assembly and a long read assembly for the same genome and then compare the two to see how the qualities of the two assemblies differ. Okay, thank you. The other question I had was: when you say a contig, that's after you align the reads and then finally get this one quite big piece for a given section, right? So we usually look at it as generating contigs from this step: we take our reads, we've now got an ordering of the reads from left to right, and we merge our sequences together to get strings that look like this; these are the contigs. Then this step takes our contig, aligns all the reads back to the contig, and takes the most frequent base at every position to output the consensus sequence for the contig. So the original contig probably has a few errors, it might only be 99% accurate; after the consensus step it might be 99.99% accurate.
Okay, any more questions before I move on? Okay, so we're now going to talk about short read assembly. This is still a very popular thing to do, because getting Illumina reads is so cheap; you can sequence even large genomes for maybe a few thousand dollars. And when we first got Illumina data, we even tried to assemble these 36 base pair reads that I said gave us so much trouble even mapping them to reference genomes. Unfortunately, the assembly methods that were developed for Sanger sequencing data, which I just described, these overlap layout consensus methods, completely failed at assembling these short reads. The problem was that if you try to find an overlap between two 36 base pair reads, the overlap is going to be very, very short, maybe 30 bases, and because the genome is so repetitive you would have this huge overlap graph with hundreds of billions of edges. This was completely impractical to build for human-sized data sets. So if you tried to use an assembler developed for Sanger sequencing data on Illumina data, you would have run out of time, you would have run out of memory on your computer, and it would have completely failed. So what the field did was take an entirely new approach to assembly, which is based on using short exact matching words within your reads, which we call k-mers. So these are the different stages of a short read assembly pipeline. There's usually an optional step which will try to error correct the sequencing reads. We'll then build what we call the de Bruijn graph, clean up the graph just like we saw for the layout step of OLC assembly, then find contigs from the graph, scaffold our contigs together, and then optionally fill in gaps at the end. I'm going to talk about each one of these steps in turn. So here's a plot of the characteristics of Illumina sequencing data for six different genomes that we sequenced to assemble de novo; I'm going to tell you why I use these six later on. Here's the error rate of each one of these six data sets as a function of the base position within the read, and for each of these data sets we can see the error rate going up by orders of magnitude as you get towards the 3′ end of the read. The reason for this is the phasing and pre-phasing problem that I talked about in the first lecture, where more of the molecules in a cluster get out of sync towards the end of the read. That makes it more difficult for the base caller to resolve what the true base was, so it makes more mistakes towards the 3′ end of the read, up to a few percent, as shown here. Now the problem with this is that when we use k-mer based assembly graphs, these sequencing errors bloat the size of our assembly graph and obscure what the true relationships between reads are. So what a lot of assemblers do is try to error correct the reads: they want to detect which base in a read is incorrect and then figure out what the true base was at those positions. Now a naive strategy for finding sequencing errors and correcting them is to take all of our reads, find overlapping pairs of reads, construct a multiple alignment, and then take the most frequent base, just like we saw for that consensus step. Unfortunately, this isn't going to work for Illumina data because, as I just said, finding overlaps between really short reads is computationally too expensive. So what we do is take short fragments, which we call k-mers.
A k-mer is just a sub-word of a read of a particular length k. We're going to count how many times these k-mers are seen across our entire sequencing data set, and we're going to look for k-mers that are rare, which denote where sequencing errors are, and compare their counts to k-mers that are seen much more frequently. So let's say we take a k-mer size of 20, so k of 20. We're going to slide a 20 base pair window across our read, starting from the first 20 bases, then from the second position, and so on until we've processed the entire read, and we're going to count how many times those k-mers are seen. Now we always sequence genomes to quite high coverage; we want 30x coverage, 40x coverage. So the intuition here is that k-mers that are perfect, that don't contain a sequencing error, are going to be seen about 30 to 40 times, depending on how much coverage we obtained for that genome. But since sequencing errors are quite rare, and they flip genomic k-mers to some novel sequence, we should only see those one or two times at most. So as we're sliding this k-mer window across our read, counting the number of times each k-mer occurs, if we see the count drop down to one or two, we're going to infer that there was an error at that position. That's shown here, where we have this function counting the number of times each k-mer is seen. This k-mer is perfect; we saw it 40 times. This k-mer has a sequencing error; it's only seen once. We would infer there's a sequencing error which disrupts the k-mer count. We would then search for fixes of this C by trying the other bases, A, G, and T; if one of those changes the k-mer to a more frequent k-mer, we would replace the C with the corrected base, and that would correct the sequencing error. This is an incredibly popular area of bioinformatics to work on, and there are probably about a dozen different k-mer based error correctors available, with different time and memory properties, so some are faster and some use less memory. I'll put a listing of some of the most popular ones up here; of this list, BFC is probably the best one in terms of time and memory usage.
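To make the k-mer counting idea concrete, here is a toy corrector in that spirit: count every k-mer in the read set, and wherever a read position sits in a rare k-mer, try substituting bases until the window becomes a frequent k-mer again. The k size, the "solid" threshold, and the single-substitution assumption are all simplifications; real correctors such as BFC are considerably more careful.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across the whole read set (the k-mer spectrum)."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, solid=3):
    """Greedy single-base correction: wherever a window's k-mer is rare (< `solid`),
    try each alternative base at each position of that window and accept a change
    that turns it back into a frequent ('solid') k-mer."""
    bases = list(read)
    for i in range(len(bases) - k + 1):
        if counts["".join(bases[i:i + k])] >= solid:
            continue                                   # looks like true genomic sequence
        for j, b in ((j, b) for j in range(i, i + k) for b in "ACGT" if b != bases[j]):
            trial = bases[:j] + [b] + bases[j + 1:]
            if counts["".join(trial[i:i + k])] >= solid:
                bases = trial                          # accept the fix and move on
                break
    return "".join(bases)

# Forty error-free copies of a toy read plus one copy with a single substitution.
reads = ["ACGTACGGTA"] * 40 + ["ACGTACCGTA"]           # C instead of G at position 6
counts = count_kmers(reads, k=5)
print(correct_read("ACGTACCGTA", counts, k=5))          # ACGTACGGTA
```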
So once we've error corrected our reads, we have a much cleaner data set that we can then use to build our assembly graph. The way that short read assembly works is based on an idea that a great computer scientist named Pavel Pevzner developed, which is called the de Bruijn graph. As I mentioned, computing overlaps between pairs of short reads is computationally infeasible, so we're going to again use k-mers: break our original sequencing reads into k-mers and then link k-mers that are adjacent within a read with an edge. Here's an example of constructing a de Bruijn graph with k equal to four for this set of six base pair reads. We take the first four bases of the first read, CCGT, and add that k-mer into our graph as a vertex, which is shown down here. We then take the second k-mer of this read, CGTT, and add it into our graph as a vertex down here. Because the k-mer CCGT is followed in a read by CGTT, we link them up with an edge, as shown here. We do that for all of our reads, adding these k-mers as vertices and linking up adjacent k-mers with edges, and then we get the de Bruijn graph with the structure shown here. Now this graph has a repeated four base pair subsequence, CGTT, where in one read it's followed by the sequence GTTA and in another read it's followed by the sequence GTTC. This gives us ambiguity in the graph, and it represents a repeated k-mer that we would have to resolve. Luckily, in this case there's a unique path that goes from this k-mer down to here, follows this lower branch, and then comes back around and goes here, which reconstructs the complete genome sequence. But often, just like we saw in the overlap layout consensus assembly, the assembler is just going to stop at these branch points and output contigs where the graph is completely unambiguous. Now this might seem like a bit of a silly idea, to break our reads, which are short already, into even shorter sub-sequences, these k-mers. But the idea is that the de Bruijn graph can be constructed extremely quickly; we don't need to compare all pairs of reads to each other. It can be constructed in time that's linear in the size of our sequencing data set, which for computer scientists is an incredibly important property to have when you work with 100 gigabase sequencing collections, like a de novo assembly of a human genome from Illumina reads. So the next step of the short read assembly pipeline is to clean up any sequencing artifacts from our graph. There are two main types of artifacts we want to get rid of. The first is what we call tips, which are residual sequencing errors that couldn't be corrected and which give these short spurs or branches off the graph; they go on for a few nodes and then terminate without being connected to anything else. So the k-mers that contain sequencing errors typically lead to these disconnected branches off of the graph; we have two nodes in a tip here, four nodes in a tip here. Now as I said, the assembler is going to look for contigs by trying to find these unbranched structures in the graph and bundle them together. If it sees a branch caused by a tip, it's going to stop the assembly there, and we don't want that, so in a few slides I'll describe how the assembler gets rid of these branches. The second sequencing artifact we want to get rid of is what we call bubbles. These occur when we have a heterozygous SNP in a diploid genome, and their characteristic is a branch off common sequence, where one half of the branch follows one allele, like the C allele here, the other branch follows the G allele here, and then they come back together into common sequence a short distance later. Now again, if the assembler just looks for branches in the graph and stops contig growth, it will stop at each one of these bubbles, and that will reduce the size of the contigs that we can reassemble. So the assembly field has developed algorithms for getting rid of these artifacts, which we refer to as graph cleaning. So here might be a de Bruijn graph that we've constructed from Illumina reads. Again, the contig structure in this graph isn't very obvious, but we can clean it up pretty simply to improve the continuity of our assembly. In the first step we're going to look for tips: we mark all of the nodes in our graph that only have a connection, an edge, in one direction, we walk backwards until those nodes rejoin a branch point, and then we just remove all those nodes from the graph. Just by getting rid of these tips with this simple procedure, we end up with a graph that's much cleaner, that looks like this. In the next step we're going to look for branch points in the graph and for the nodes where those branches rejoin, and that's where these bubbles occur, where we have this diamond structure that branches out and then comes back together.
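Going back to the k = 4 construction described above, here is a small sketch of building that de Bruijn graph and spotting the branch nodes where contig extension has to stop. The two toy reads are invented to reproduce the CCGT/CGTT/GTTA/GTTC pattern from the slide; a real implementation would also track reverse complements and per-k-mer coverage.

```python
def de_bruijn_graph(reads, k):
    """Nodes are k-mers; an edge joins each k-mer to the next k-mer in the same read."""
    nodes, edges = set(), set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        edges.update(zip(kmers, kmers[1:]))        # adjacent k-mers within the read
    return nodes, edges

def branch_nodes(nodes, edges):
    """k-mers with more than one outgoing edge: the points where contig extension stops."""
    out_degree = {n: 0 for n in nodes}
    for u, _ in edges:
        out_degree[u] += 1
    return {n for n, d in out_degree.items() if d > 1}

# Two made-up 6 bp reads that share the k-mer CGTT, as in the slide's example.
reads = ["CCGTTA", "CGTTCC"]
nodes, edges = de_bruijn_graph(reads, k=4)
print(sorted(edges))
print(branch_nodes(nodes, edges))   # {'CGTT'}: followed by GTTA in one read and GTTC in the other
```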
Now the assembler is going to remove the heterozygosity by picking one of the two alleles to be the output sequence and removing the other one, and we get a structure that looks like this in the graph. Our graph is now a lot simpler, and we can start to see the contig structure in the graph as these unbranched segments. So the assembler just looks for these unbranched segments and merges them together into contigs, and we end up with a sequence that looks like this, where each one of these grey boxes is a contig. Now our contigs are typically going to break for two reasons: either we've hit a repeat that we couldn't resolve, or there just wasn't enough sequencing coverage to make the connections in our graph. So when we have Illumina data, which comes as paired-end reads, we want to build higher order structures next by scaffolding our contigs together. Scaffolding takes our paired reads, maps them to our contigs, and then looks for pairs where one half of the pair maps to the end of one contig and the other half maps to the beginning of another contig. So here we have a set of blue pairs mapped to the end of this contig and the beginning of this contig; here we have red pairs mapped to the end of this one and the beginning of this one; and so on. We can build up an ordering and an orientation of the contigs using these paired-end reads, to say that this contig is followed by this contig in the genome, which is followed by this contig, then this one, then this one. We don't know the intervening sequence; we're just jumping over these unresolved portions of the genome, but at least we can get, say, gene structure in the right order by building these scaffolds. Now in the output file our scaffolds contain gaps, which are these unresolved parts between the contigs, and we typically fill in the gaps with N symbols, which stand for completely ambiguous nucleotides. So we know that there's some sequence there separating these two contigs; we don't know what it is, so we just put N's into our scaffolds. Some assemblers will then try to perform a localized de novo assembly with less stringent parameters to fill in these gaps; other times, if you have complementary sequencing data like low coverage PacBio long reads, you can use the long reads to try to fill in these unresolved gaps as well. So what should you expect from a genome assembly? It depends on the size of the genome and the technology that you used to assemble it. For bacterial genomes assembled using short reads, Illumina reads, you'll typically expect hundreds of contigs with lengths of 10 to 100,000 base pairs. So you'll reconstruct your genome into roughly 100,000 base pair pieces, and there might be about 50 of them in your assembly. For long reads with bacterial genomes, because the read length is so long and because bacterial genomes typically aren't very repetitive, you can often reconstruct the genome into one single contig that spans the entire genome, or in the worst case, if there are some long repeats, maybe you'll assemble it into two to five contigs instead. For large genomes, long reads give you much, much better contiguity of the assembly. For a short read assembly of, say, a human genome, you can expect on average around 10,000 base pair contigs, which is maybe the size of a gene. For long reads you can assemble up into megabase-sized contigs, where you're putting many genes into their right order and orientation in those contiguous sequences.
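The scaffolding step just described places contigs in order and fills the unknown gap with N's; a simple way to size that gap, anticipating the question answered just below, is to let each spanning read pair vote based on the known mean insert size. This is only a back-of-envelope stand-in for the maximum likelihood fit mentioned in the discussion, and the insert size and offsets are invented for the example.

```python
def estimate_gap(spanning_pairs, mean_insert_size):
    """Each spanning pair is (d1, d2): d1 = bases from the leftmost base of read 1 to the
    end of contig A, d2 = bases from the start of contig B to the rightmost base of read 2.
    If insert = d1 + gap + d2, each pair votes for gap = insert - d1 - d2."""
    votes = [mean_insert_size - d1 - d2 for d1, d2 in spanning_pairs]
    return max(0, round(sum(votes) / len(votes)))      # never report a negative gap

def scaffold(contig_a, contig_b, spanning_pairs, mean_insert_size):
    """Join two contigs, writing the estimated number of ambiguous N bases between them."""
    return contig_a + "N" * estimate_gap(spanning_pairs, mean_insert_size) + contig_b

# Roughly the worked example from the discussion: ~300 bp inserts, reads ~100 bp from each edge.
print(estimate_gap([(100, 100), (120, 85), (90, 105)], mean_insert_size=300))   # 100
```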
The drawback, however, is that long read data is much more expensive, probably by a factor of about five to ten. So whether you want to use long reads depends a bit on your question, whether you need a complete, finished assembly, and of course whether you have the budget to sequence with long read data. Typically, if someone asks me for a recommendation on whether they should use long read data, PacBio or Nanopore, and it's a large genome, I think it's worth the expense, just because you get such a better quality reconstruction, as you'll see in the practical coming up. Okay, I'll take another pause here. Any questions about short read assembly or anything we've talked about? So typically what the assembler will do is align all the read pairs to the long contigs and infer what the fragment, or insert, size distribution of your paired-end data is. Then, when it aligns the pairs that span between contigs, it solves a simple maximum likelihood problem to infer the best size for that gap. So if you know your pairs are on average 300 bases apart, and your read aligns 100 bases from the end of one contig and 100 bases into the next one, you know the gap size should be roughly 100 bases. Yeah, the human genome was sequenced with a really interesting, conservative strategy, where rather than doing whole genome shotgun sequencing, they cloned the human genome into BACs of about 150 kb, made a tiling path of BACs, sequenced the BACs individually, and then assembled them together. This allowed them to close a lot of the repetitive sequences, because rather than a repeat being globally present across the entire human genome, which is what you have to deal with in whole genome shotgun sequencing, the only repeats you have to resolve are internal to BACs. So the gaps in the human reference genome that are continuously being closed tend to be very, very difficult, low complexity, highly repetitive sequence, things that even Illumina reads just don't map to. With Oxford Nanopore reads, where we're getting 100 kb to 500 kb reads, we're starting to look at whether we can close some of these reference gaps by just spanning over them with an entire Oxford Nanopore read. Yeah? Can you explain, when you do the k-mer correction, where do the k-mers come from? Is it from a separate sequencing experiment? No, so you have your set of reads, let's say you've got a billion Illumina reads; you break every one of those reads up into k-mers, and then you put them into, say, a hash table or some dictionary data structure that says this k-mer was seen 40 times across my entire read data set. Okay, that doesn't make very much sense to me, because what it looks like is you're just taking the same data and rearranging it into a less informative form, right? It's the same issue I have with bootstrapping; I don't understand how that gives you more information if you're breaking it up. Well, just by comparing the frequency of k-mers we can detect which ones contain errors. So let's say we've sequenced our genome to 50x coverage. Now if all of our reads were perfect, without sequencing errors, when we break the reads into k-mers we would expect every k-mer to be seen in roughly 50 reads, because we have 50x coverage, so there are 50 reads covering every position, and every k-mer should be seen in 50 different reads. Now sequencing errors come in and take a k-mer and flip a random base, and even though genomes are very repetitive, if we use a long enough k-mer size, like a k-mer of 40, that will probably generate a unique k-mer that's never been seen before.
So it takes a k-mer that was seen 50 times and converts it into a k-mer that is seen just once. By looking at the k-mers that are only seen very few times, we can detect which ones contain sequencing errors: the set of k-mers containing sequencing errors is the set that have only been seen once or twice across our entire data set. We then look for changes to those k-mers that convert them from a low frequency k-mer into a high frequency k-mer. And how would that be different from just stacking all the reads up and then looking at each point? It's not different, it's just much more computationally efficient. If you wanted to, you could stack up all the reads together; it just takes a very long time. So this is a clever heuristic which gets around the need to align very short reads against each other, which we want to avoid. Does this also mean that when you pick the k, picking a bigger number would be more expensive? It's a good question. Longer k-mers require more memory to store in a hash table, but you also reduce the effective sequencing coverage when you pick a long k-mer. If you picked, say, the most informative option, which, let's say your read length is 100 bases, would be taking 100 base pair k-mers as well, you wouldn't have a lot of redundancy, because not many reads overlap by 100 bases. So the k-mer size is a trade-off between getting a lot of k-mers that are shared between reads and picking a k that's long enough that you can resolve repeats. This is one of the magic numbers that goes into genome assembly: what k-mer size should I use? Typically, if you sequence to 50x coverage, we use k-mers between 60 and 70 and that works pretty well, but in the next part I'm going to talk a little bit more about this trade-off. Yep? When you were talking about tip removal, you highlighted the start and end as tips, but then... Right, that's very observant. So usually what you do, going back, is initially mark the ends of contigs, the first node in a contig, as a tip, as I've done here, and what the assembler will do is count how many steps it has to make to get back to a branch point. The k-mers that were marked as tips at the ends of contigs typically have to go much further to reach a branch point than the ones that are caused by sequencing errors, so you only remove a tip if it's very short in terms of the number of nodes back to the branch. So this one here, you only had to go one step to a branch; this one, two steps to a branch; this one, one step to a branch; so the assembler would remove all of those, because they're what we call short tips. This one probably would be marked as a tip too, but that's just because this graph is very small; typically the nodes at the ends of contigs are hundreds or thousands of bases away from a branch point, so you can protect them. So typically there are two different strategies: if you have a very good reference genome, you'll just map reads to the reference genome and call SNPs, indels, and structural variation; if you don't have a reference at all, or your reference is low quality, you might do de novo assembly instead. Some programs do a hybrid of these strategies: they'll map reads to the reference genome, and if the reads look very different from the reference, they'll do a local assembly in that region. But typically it's either mapping or assembly.
Okay, I'm going to move on to the last part of the talk now, which describes some of the difficulties of doing short read assembly and some of the genomic features that cause us problems. Short read assembly was a very active area of research; there are at least a dozen, probably more like 20 or 30, short read assemblers that have been developed. I wrote three of them, which is a bit of a shame because I never could get it right. And what the community wanted to do was develop methods for benchmarking how the different short read assembly strategies performed and for comparing them to each other. So there was a project called the Assemblathon, and Assemblathon 2, which took three different species that don't have a high quality reference genome, sequenced them, and released the data to the community. Groups would take those sequence sets, reconstruct the genome as best they could, and submit the genomes to the organizers, who would benchmark how well each one of those assemblies performed. The results of this project were a bit shocking, because they were highly variable, both across the three different species we tried to assemble and across the different assembly strategies that were used. So what I started to think about were the different genomic features and the different factors about the data that make a given assembly difficult, and whether we can help users, people like you, decide how to select an assembler or select parameters for the assembly, like this k-mer size we just discussed. So I sat down and wrote a list of all the things that make genome assembly difficult, and came up with this list. Something that I've gone on and on about is repetitive sequence: if your genome is highly repetitive, it's going to cause a lot of problems during your assembly. If there are high levels of heterozygosity, so a lot of heterozygous SNPs or indels, this causes more of these bubble structures in the graph, which can give the assembler problems. If you didn't sequence deeply enough, if you have low coverage sequencing, your graph becomes disconnected. If your sequencing is biased towards GC rich or GC poor sequences, that can cause problems for the assembler; the classic example I use here is the parasite that causes malaria, Plasmodium falciparum, which has 80 percent AT content. It's incredibly difficult to sequence, but because it's so important we want to be able to sequence it, and a lot of work has gone into sequencing and assembling Plasmodium. If your reads have a very high error rate, or if you have a lot of chimeric reads, where non-adjacent sequences in your genome have been joined together in the reads, that causes problems, as does having a lot of sequencing adapters in the reads. If your sample was contaminated by your genomics core, that can look like multiple genomes within your assembly graph, which causes issues. Or if you're sequencing something that's very small, like, say, a fly, you might not be able to get enough DNA from a single individual, so you have to sequence multiple individuals, and now rather than sequencing a single genome you're sequencing a population, which also causes difficulty. So I wrote a program to assess these different factors and write out a report that the user can look at to see how difficult their assembly is going to be. The way this program works is that it takes in the de Bruijn graph and analyzes its structure to determine how many errors there are, how high the heterozygosity is, and what the repeat content is, by classifying the branches in the graph into these different groups.
So I've come back to this idea of k-mer coverage, and the first output of this program is a histogram of how many times each k-mer was seen across your sequencing reads. Here's a histogram of 51-mer counts for a human genome sample. We see that there's this roughly Gaussian distribution, centred around a k-mer coverage of 30, which reflects that we had about 30 to 40x sequencing coverage. This is good; this is the minimum we need to do a de novo assembly, and we see that we have a very nicely behaved distribution here. Now, going back to the question we had about error correction, we see a secondary peak here, where there's this high proportion of k-mers that have been seen just a single time. All of these k-mers seen a single time are the k-mers that contain sequencing errors. So the assembler can look at this k-mer count histogram, find all the k-mers seen a single time, classify them as errors, and then try to correct them back to the true genomic sequence. Now humans aren't particularly heterozygous, there's a SNP roughly every thousand bases, and we see that reflected in this k-mer count distribution: there's a little shoulder on the distribution, which corresponds to a slight increase in k-mers that have half the sequencing depth, the k-mers that cover heterozygous positions. These are the k-mers that contribute to those bubbles in the graph. Now not all genomes are this clean, so let's look at a k-mer count distribution for a different genome. We see something that's quite shocking, and that as a developer of assemblers upsets me, in that we see a bimodal distribution in the k-mer counts. Here we're showing the k-mer counts for an oyster genome, which has about a one percent SNP rate, so at about one in a hundred positions of the oyster genome there's a heterozygous SNP. This clearly shows up as a bimodal distribution in the k-mer counts, where there's a large proportion of k-mers that have half the sequencing coverage, the ones that cover heterozygous positions. Now this is an incredibly poor distribution, and if someone was to show this to me, I'd gently advise them that they're going to have a very difficult time assembling this genome from short reads, and maybe push them towards getting PacBio data instead. This is indeed what happened with the oyster genome project: the organizers sequenced the genome, found that it was very heterozygous, abandoned the short read assembly approach, and used fosmid sequencing instead. So the next step of this preqc program, which measures assembly difficulty, is to classify the branches in the assembly graph and measure the branch rates, to assess how often you have SNPs or repeats in the graph. The way this works is that we built a statistical model to examine the coverage at each branch point in the graph and classify whether it might have come from a sequencing error, from a heterozygous SNP, or from a repeat. The intuition here is that if you have a sequencing error which causes one of these tips, one of the two branches is going to have much higher coverage than the other: we'd expect one k-mer to have, say, 40x coverage and the other k-mer to have only 1x coverage. Conversely, if it's a heterozygous SNP, we expect there to be balance between the two branches; we expect the two alleles of the heterozygote to each have roughly half the coverage.
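Here is a toy version of that intuition, not preqc's actual statistical model: compare the k-mer coverage on the two sides of a branch against the genome-wide coverage. A very lopsided split looks like an error, an even split at about half coverage looks like a heterozygous variant, and an even split at around full coverage looks like a repeat. The thresholds are invented.

```python
def classify_branch(cov_a, cov_b, genome_coverage):
    """Rough classification of a two-way branch from the k-mer coverage of its two sides
    and the genome-wide k-mer coverage. Thresholds are arbitrary illustrative choices."""
    low, high = min(cov_a, cov_b), max(cov_a, cov_b)
    if low <= 2 and high >= 0.5 * genome_coverage:
        return "error"                      # one side is a near-singleton spur (a tip)
    balanced = abs(cov_a - cov_b) <= 0.25 * genome_coverage and low >= 0.25 * genome_coverage
    if balanced and low + high <= 1.5 * genome_coverage:
        return "heterozygous variant"       # the two alleles split the diploid coverage
    if balanced:
        return "repeat"                     # both sides look like full-coverage sequence
    return "unclassified"

# 40x genome-wide k-mer coverage, using the lecture's example numbers.
print(classify_branch(40, 1, 40))    # error
print(classify_branch(22, 18, 40))   # heterozygous variant
print(classify_branch(41, 38, 40))   # repeat
```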
So this just formalizes that notion of comparing the coverage across the two branches to classify which type of branch it is. The program will then emit a report that looks like this, which gives you the frequency of branches caused by heterozygosity across the different samples. Here are the six different species that we're using for this example report: one is a yeast genome, a haploid yeast, which shouldn't have any heterozygosity; one is a Lake Malawi cichlid fish, which was part of this Assemblathon project; one is a boa constrictor snake, also part of the Assemblathon; one is a human genome; one is a parakeet, also part of the Assemblathon; and then there's this oyster, which I'm using to demonstrate what a very difficult genome looks like. We see the branch rate of the oyster genome is roughly one in a hundred bases, which is what we expect from the heterozygosity that we saw in its k-mer count distribution. The branch rate of the human genome is roughly one in a thousand bases, which is what we expect when sequencing a human genome. And our negative control here, this yeast genome, which is haploid and shouldn't have any heterozygous branches, has a branch rate of about one in 150,000 bases, which represents the false positive classification rate of this program. So the user can quickly take a look at this, compare the branch rates, and figure out how difficult the assembly is going to be based on the heterozygosity of the genome. You look at this, see that the oyster genome has the highest heterozygosity, so it's probably going to be the most difficult to assemble out of these six. The program will also classify repeat branches. Here the oyster and the human genome are determined to be the most repetitive; they have the highest frequency of repeat-induced branches, and we know the human genome is very repetitive. The parakeet from the Assemblathon had the lowest repeat-induced branch rate, so it's going to be a little bit easier to assemble than the others, and again this simple yeast genome, which is only 12 megabases, has a very low repeat branch rate as well. The program will also predict the genome size. Often, if you're sequencing a new genome, you might not precisely know what the genome size is, so when you're deciding how much sequencing coverage to order, you might under- or overestimate how much sequencing you need. This program can predict the genome size to within about 10 percent, to let you know whether you need to order more sequencing data or not. It will also do basic quality comparisons by calculating the mean quality score across your reads by position; this gives you a hint at where the errors might occur. Here we see that the Lake Malawi cichlid has lower quality scores across most of the length of the read, so that data is going to be a little lower quality and might be more difficult to assemble. It will also directly calculate the per-position error rate in the reads, and this is the figure that I showed earlier on, showing the increase in error rate across each one of these data sets from the 5′ end to the 3′ end of the read.
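Coming back to the genome-size prediction mentioned above: the report's exact model isn't described here, but a common back-of-envelope estimate divides the total number of solid k-mer occurrences by the modal k-mer coverage from the histogram shown earlier. The histogram values below are made up.

```python
def estimate_genome_size(kmer_histogram, error_cutoff=2):
    """kmer_histogram maps k-mer coverage -> number of distinct k-mers at that coverage
    (the curve in the report). Genome size is roughly the total number of solid k-mer
    occurrences divided by the modal coverage; counts <= error_cutoff are treated as errors."""
    solid = {cov: n for cov, n in kmer_histogram.items() if cov > error_cutoff}
    modal_coverage = max(solid, key=solid.get)                 # the homozygous peak
    total_occurrences = sum(cov * n for cov, n in solid.items())
    return total_occurrences // modal_coverage

# A made-up, idealised spectrum: an error spike at coverage 1 and a peak around 30x.
histogram = {1: 5_000_000, 2: 200_000, 28: 90_000, 29: 150_000,
             30: 200_000, 31: 150_000, 32: 90_000}
print(estimate_genome_size(histogram))   # 680000 distinct genomic k-mers, roughly the genome size
```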
To assess whether the sequencing data is biased by GC content, it will construct a 2D histogram where each bin is a GC content and k-mer coverage pair, and what you look for is independence of sequencing coverage across the range of GC content. This fish data set looks pretty good; we don't see a dependency between k-mer coverage and GC content. If we look at the yeast data, we see a little bit of a dependency, where higher GC content has slightly lower sequencing coverage, shown by this slope here from left to right. If we look at this oyster genome, we see two blobs, one for the homozygous k-mers and one for the heterozygous k-mers, and we also see a little bit of a dependence between k-mer coverage and GC content. The program will also estimate your fragment size histogram, if you have paired-end data, by looking at the distance between pairs in the assembly graph. These are for the six different samples again; we see, say for the snake data set, that the fragment size distribution is around 350 to 400 bases. Finally, the program will do a simulated assembly to help you select the k-mer size you're going to use. So here is the length of your contigs as a function of the k-mer size. For the yeast data, which I've said is very easy to assemble, we get very long contigs across a large range of k-mer sizes. For the other genomes, which are harder to assemble, like the oyster or the human genome, we get much shorter contig lengths across this range of k. And we see that for most of the assemblies there's a sweet spot, where the contig length hits a peak, and then there are diminishing returns, where we get shorter contigs by increasing the k-mer size further. The reason for this goes back to your question: you get lower and lower k-mer coverage as you increase the k-mer size, which increases the fragmentation of your assembly graph, so the assembler isn't able to find long contigs there. Okay, so this program is designed to be quite easy to use; you just need to run three commands to run it and generate this PDF report. In the lab that's coming up, I'm not going to have you run the program, I've just provided the report directly, because it takes about an hour to an hour and a half to run, and that would take up most of the lab time if I had you run it. But here are the commands used to run it. The software is available on GitHub, and a paper describing it is up on bioRxiv if you want to read about the technical methods. Okay, so this lecture has given you an overview of how assemblies work and talked about the different assembly strategies for short and long reads. I've given you the general recommendation that long read assembly, particularly PacBio, is generally better but more expensive than Illumina sequencing, and I've walked you through some of the factors that make assembly difficult and how we can measure them using the program I just described. If you have any questions, I'm happy to take them now or during the lab, or if you think of something later on, feel free to just email me. So before we go into the lab I'll take some questions now. So this is a cancer genomics sort of workshop; are there applications for assembly in cancer research, given that we have references for humans and most of the model organisms? When would you want to use this? So usually, a lot of the variant callers are now assembly based internally. A big one that you may use for calling SNPs and indels, the GATK HaplotypeCaller, uses de novo assembly internally within its data structures, so understanding how that algorithm works, which is based on the de Bruijn graphs that are part of this lecture, is useful. We also know from sequencing cancer genomes that we miss a lot of structural variation and large rearrangements, particularly when they occur in repetitive regions of the genome.
So something my group is involved in, with collaborators around Toronto, is sequencing with long reads to try to figure out where these more cryptic structural variants occur, and there we use assembly based strategies as well. So right now, for most cancer genomics, you'll typically be mapping to a reference, and the software you use may be using assembly internally, but in the upcoming years you may be using more and more assembly based tools. So is there any difference between what you get from de novo assembly and the standard reference; is there a difference in quality? At least for the human reference: the human reference is very high quality, and billions of dollars have gone into generating and curating it, so a typical assembly will be lower quality than the human reference genome, particularly in terms of base-level accuracy. But the contiguity of long read PacBio and Nanopore assemblies is starting to approach the continuity of the human genome assembly. For short reads, though, the assembly would be much, much inferior to the human reference.