Hi, everybody. I'm Jared Simpson, a principal investigator here at OICR; I work just upstairs in the informatics and biocomputing department, and I work on developing algorithms for de novo assembly, that is, putting genomes back together from sets of sequencing reads, and on other types of novel sequence analysis. As Michelle said, this lecture is going to be focused on what de novo assembly is. I'm going to talk about some theoretical aspects of genome assembly, what's actually going on under the hood when you run a de novo assembler on a set of sequencing reads, and also some of the practical aspects: what makes assembly challenging, and what sort of things to watch out for when you're trying to assemble your own data, to hopefully help you get better results when you go to assemble a genome. Now, the theoretical aspects can be a bit difficult, but I think it's really important that people understand what happens when you give data to an assembler. Assembly can go wrong in a lot of different ways if you have a genome that's very large or repetitive, or if there's a lot of heterozygosity. And there are different tuning knobs and switches that you can play with on a genome assembler that can give you better results; to parameterize assemblies and work with those tuning knobs, it's important to understand what the assembler is actually doing. So I'll try to give you some indication, at a high level, of what's going on under the hood. But we'll start quite simply by answering the question: what is genome assembly? We've been talking about sequencing all day. When we take a genome, which I'm depicting with this cartoon here, and we sequence it, what we're doing is breaking the genome into a bunch of random fragments that are much shorter than the genome; we then put those onto a DNA sequencer, determine their sequence, and then we want to put them back together into the genome. So the assembly process can be thought of as inverting the sequencing process: sequencing is fragmenting the genome into many pieces, and assembly is taking those many pieces and trying to stitch them back together into a genome. Now, if you work with human genomes, maybe this isn't necessary; we have a very good quality, and very expensive, reference genome that we can map reads to. But if you're working on almost any other organism, you'll often have to assemble the genome before running downstream analysis, which is why I want you to understand this process from this lecture. And at this point, I'll just say that if there's anything that's unclear or any questions that you have, please feel free to ask as we go along. So here's an overview of what I'll talk about. First, the theoretical aspects: assembly graphs and the data structures that we use to work with genome assemblies. Then an example assembly pipeline, walking you through step by step what an assembler is actually doing. Then the more practical aspects: what makes assembly difficult, and what can cause problems when you're trying to assemble a genome. And finally, what the future of assembly is going to look like when we have much longer reads, for example from the Oxford Nanopore sequencer, where assembly becomes a lot easier because we're able to better resolve repetitive sequences. Okay, so assemblers work with the data in the form of a graph.
Now, you've probably seen graphs in a lot of different contexts. They're collections of vertices and edges, but an assembly graph is a specific type of graph which represents relationships between sequences. For example, each sequencing read might be a vertex in the graph, and we might connect two reads with an edge if they overlap, where the end of one sequence is the same as the start of another. This overlap relationship suggests that the sequences come from adjacent positions in the genome, so if we assemble them together, we extend the information that we have into a slightly longer bit of the genome. The way of working with this type of data at a large scale is to put the sequences into a graph, model the relationships between the different sequences as edges, and then have the assembler traverse this graph to reconstruct the genome. I'll give you an example of that over the next few slides. Now, some terminology: there are two predominant types of assembly graph. You might hear the term de Bruijn graph used a lot. This is a type of assembly graph where we take each sequencing read and break it into even shorter pieces called k-mers; those are just fixed-length substrings of length k, and we make each k-mer in the data a vertex in the graph. This is the predominant way of assembling Illumina data, as these de Bruijn graphs are extremely fast to construct and extremely easy to work with; almost all the assemblers that you'll download and run are based on this type of model. An older representation of sequencing data is called a string graph. This represents all overlaps between reads, and I'm going to describe it over the next few slides as it's a slightly easier way to depict what's happening in an assembly graph. So over the next few slides, I'm just going to progressively build up an assembly graph from a set of sequencing reads. We start with a single read, which I've labeled read 1 here, and we add it to our assembly graph as this vertex labeled v1 down here. Now if we sequence a second read and add it to the collection, we add a second vertex v2 to the graph, and as the start of read 2 is the same sequence as the end of read 1, we link those two vertices with an edge. That just tells us that the read representing v1 overlaps read 2. We've labeled the edge with the string TAC, which is just the overhanging part of read 2 that's not matched by read 1. If we sequence a third read, we add a third vertex to the graph, and as its start matches the ends of both read 2 and read 1, we link it with edges to these other sequences. And as we sequence more and more reads, we just keep building up this graph by adding more vertices and more edges. Now, the fundamental principle of de novo assembly is that walks through this graph, where a walk is just a path visiting vertices in a specific order, reconstruct the genome. In this case it's quite trivial: there's just one path, it goes from v1 to v2 to v3 to v4, and if we take the sequences and join them together along these edge labels, we end up with the sequence of the genome.
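To make the overlap-graph idea concrete, here's a minimal sketch in Python. The toy reads, the minimum overlap length, and the brute-force all-pairs comparison are illustrative assumptions; real string graph assemblers like SGA use indexing structures to find overlaps efficiently rather than comparing every pair.

```python
# Minimal string/overlap graph sketch: vertices are reads, and a directed
# edge u -> v (labeled with v's overhang) means a suffix of u equals a
# prefix of v. Brute-force all-pairs comparison, fine for a toy example.

def find_overlap(a, b, min_len=3):
    """Return the length of the longest suffix of a that is a prefix of b."""
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:olen]):
            return olen
    return 0

def build_overlap_graph(reads, min_len=3):
    edges = {}  # (i, j) -> overhang sequence of read j
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olen = find_overlap(a, b, min_len)
                if olen:
                    edges[(i, j)] = b[olen:]  # edge label: unmatched tail of b
    return edges

reads = ["GATTACA", "ACATAC", "TACGGC"]  # hypothetical toy reads
for (i, j), label in build_overlap_graph(reads).items():
    print(f"read{i+1} -> read{j+1} (overhang: {label})")
```

Running this prints an edge from read 1 to read 2 with overhang TAC, like the edge label in the slide example, and following the single path read 1, read 2, read 3 while appending the edge labels spells out the toy genome.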
So you can think of genome assembly as constructing this big graph representing all of the possible sequences that can be spelled from the set of reads, and then the assembler trying to find the one true path through the graph which spells the sequence of the actual genome. And that's really the only theory you need to understand for assembly: we make this representation of the data and then find a path through it that spells the genome. So now I'll go into what a generic assembly pipeline looks like. As a computer scientist, I'd like to think of data that's perfect and clean, where we can work with these strings without any preprocessing. Unfortunately, we work with real biological data, where there are sequencing errors and other types of errors in the data that we need to clean up before we can actually assemble it. To deal with this, we've built assembly pipelines, which are just fixed sets of steps that progressively clean up the data and allow us to get an assembly out of it. For example, we might start with our reads off the sequencer, trim and filter them to remove bad data, perform some form of error correction, then construct the graphs that I was just talking about, clean them up a little, assemble contigs, scaffold them together, and finally fill in gaps of repetitive sequence at the end. I'll talk about these individual stages one by one. Now, one of the most important quality control measures you want to perform on your data is removing sequencing adapters. When Francis talked about sequencing this morning, he might have told you that there's non-biological sequence added to the ends of the fragments of DNA to prime the sequencing process. Sometimes the sequencer will actually read those adapters and output them in the reads. That non-biological sequence, which might be present at the end of almost all of your reads, will look like an extremely high copy repeat to the genome assembler, and it can completely break your assembly if the adapters are present at the ends of your sequences. So a common thing to do is to perform adapter trimming: identify those adapter sequences and remove them from the reads, which gives a cleaner representation of the data to the genome assembler. There's a second type of trimming that you might want to perform, called quality-based trimming. The sequencer outputs sequencing errors where it misidentifies bases, and in these cases it might output low quality scores. So you might want to run a quality trimmer that removes low quality bases at the ends of the reads, again cleaning up the data so the assembler has an easier job. Now, a lot of assemblers will do this internally; they'll have a built-in quality trimmer that takes the data and gets rid of bases based on the quality scores. But most assemblers don't have built-in adapter trimmers. So what you'd want to do is take an off-the-shelf piece of software; I've listed three of them here, cutadapt, Trimmomatic, and Scythe, and run it on your data if you think there's adapter contamination. All of these are quite good pieces of software, with good help text and tutorials, that will let you take your data and trim adapters from it without too much trouble.
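As an illustration of the quality-based trimming just described, here's a minimal Python sketch, assuming phred+33 encoded quality strings; real trimmers such as Trimmomatic use sliding-window rules rather than this simple walk-back from the 3' end.

```python
# A minimal sketch of 3' quality trimming, assuming Illumina-style phred+33
# quality strings. Real trimmers use more sophisticated sliding-window
# schemes; this just clips the low-quality tail of a read.

def quality_trim(seq, qual, min_q=20):
    """Trim the 3' end of a read back to the last base with quality >= min_q."""
    phred = [ord(c) - 33 for c in qual]     # decode ASCII quality scores
    end = len(seq)
    while end > 0 and phred[end - 1] < min_q:
        end -= 1                            # walk back over the bad tail
    return seq[:end], qual[:end]

# Hypothetical read with a noisy tail ('#' is phred 2, 'I' is phred 40)
seq, qual = quality_trim("ACGTACGTAC", "IIIIIIII##")
print(seq)  # ACGTACGT
```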
Now, once you've performed that initial read trimming and filtering step, we want to get rid of the sequencing errors that can be present in almost all of the reads. Many of these errors can be readily identified and corrected, and there's a variety of programs, and I'll list some over the next few slides, that will take your sequencing reads, identify sequencing errors, and flip the bases to the correct sequence. Now, the important thing to know about sequencing errors is that they tend to accumulate towards the ends of reads. I've indicated this in the figure on the slide for six different sequencing runs; this was for a big assembly benchmarking competition covering a variety of species. You can see that some data sets are quite good. For this snake data, which was a boa constrictor, the error rate is less than half a percent along the entire length of the read, going up to just under 1% at the last base. Conversely, this bird data, which is a parakeet, has an almost 3% error rate at the end of the read. So we can intuitively see that the bird data would be more challenging for the assembler to deal with, and we might want to perform some error correction to clean that data up even further. Now, the way we do this is with k-mer counting. The idea here is that positions in a read that have sequencing errors tend to be quite rare; we only expect to see a given error once or twice if errors happen randomly across the entire data set. So we can count k-mers across the reads and identify rare k-mers, and those might be the ones that contain sequencing errors. For instance, if we count the number of times a sequence that doesn't contain an error is present across our entire data set, we'd expect to see it about as many times as the depth of coverage: if we sequenced the genome to 40x, we'd expect these sequences to be seen about 40 times. Conversely, if we have a sequencing error, the substring that has the error might only be seen once or twice. So we can use this idea to scan reads for errors. We do that by generating a k-mer count profile, counting the number of times we've seen each k-mer substring of the read across the entire data set, and we'll see places where the count drops down to one at the positions of errors. The error corrector will then search for an alternative sequence that brings all the counts back up to about 40, and that gives the corrected base. Now, there are many programs that implement this idea of k-mer-based correction: Quake, SGA, SOAPdenovo, BFC, BLESS, Lighter, Musket, and this isn't even a complete list; there are probably 10 to 15 of these published by now. They all have various trade-offs in terms of memory usage, correction accuracy, and speed. This one in the middle here, BFC, is written by Heng Li, and I think it's currently the state of the art in terms of speed and accuracy, so that might be a good one to try for running error correction on your data. There are also error correctors that are based on finding overlaps between reads. The idea there is that you compare overlapping reads, find mismatches between them, and correct them to a consensus sequence. Unfortunately these tend to be too slow to run on very large genomes.
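Here's a minimal Python sketch of the k-mer count profile idea; a plain dictionary stands in for the compact counting structures real correctors use, and the toy reads (40 error-free copies plus one read with a single substitution) simulate 40x coverage.

```python
# A minimal sketch of k-mer counting for error detection. In practice tools
# like BFC use compact hash tables or Bloom filters; a Python Counter is
# only workable for toy data. The reads and k are illustrative choices.
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i+k]] += 1
    return counts

def kmer_profile(read, counts, k):
    """Count of each k-mer along the read; dips to ~1 flag likely errors."""
    return [counts[read[i:i+k]] for i in range(len(read) - k + 1)]

reads = ["ACGTACGGTA"] * 40 + ["ACGTACCGTA"]  # 40 good copies, 1 with an error
counts = count_kmers(reads, 5)
print(kmer_profile("ACGTACCGTA", counts, 5))  # [41, 41, 1, 1, 1, 1]
```

The profile stays near the depth of coverage over correct positions and collapses to one over every k-mer that touches the error, which is exactly the signal the corrector searches for.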
If you're trying to sequence and assemble a bacterial genome, overlap-based error correction might be appropriate, but if you're trying to assemble, say, a human genome or a plant, you're really stuck with k-mer-based approaches; they're the only ones that can scale to the volumes of data that you'll be working with. Now, once we've performed this initial cleanup of the data, we're able to start constructing these assembly graphs and working with the data in a little more detail. So I'll now talk about how de Bruijn graph construction works. The idea here is, again, we're going to work with k-mers. In this example, we're going to work with substrings of length 4: we take every substring of length 4 that's in our read set, and we have five 6-base reads here, and put it into the graph. We do that by sliding a 4-base window over every read, and every 4-base sequence we find becomes a vertex. For example, the first 4-mer of this read is CCGT, and that becomes the first vertex in the graph. The second 4-mer is CGTT, and that becomes this vertex, and so on. So we just slide this window over all of our reads and place those sequences in the graph. Now we see that one of these 4-mers is special: this one, CGTT, which I've highlighted in red, is a branching point. There are two possible alternative 4-mers that follow it, GTTA and GTTC, and it's the assembler's job to find a path through this graph that resolves this repeat. In this case, it's actually quite easy: we just go from this vertex to this vertex, follow this branch here, go around, come back here, and then up to this structure there. That's a path that visits every vertex in the graph, resolves the repeat, and reconstructs our genome. Now, a quick note about computational efficiency. When we first had Illumina sequencing data, back in the days of 36 base pair reads, assembling a human genome was incredibly challenging. If you took an off-the-shelf assembler, for example Velvet, and tried to assemble a human genome using a de Bruijn graph-based method, it might take a terabyte of memory. People generally didn't have terabyte-memory machines around, so a lot of the work that's gone into developing assembly algorithms in the last 10 years has been about improving the computational efficiency of these algorithms. There are two main approaches: one is distributing the graph across a cluster of computers, to spread the memory load over an HPC system, and the second is using compressed data structures to shrink the amount of memory required to represent these de Bruijn graphs. That 10 years of development has shrunk the memory requirements from about a terabyte down to about 10 gigabytes for a human genome. So we've gone from this being a really difficult research question to something that people can run on their desktop computers, in the span of about 10 years, and that's been one of the most exciting developments in the field: just how much algorithmic development has gone into genome assembly.
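Here's a minimal sketch of de Bruijn graph construction with k = 4, mirroring the slide: each 4-mer becomes a vertex, and consecutive 4-mers within a read are linked by an edge. The five 6-base toy reads are made up to reproduce the CGTT branch described above.

```python
# A minimal de Bruijn graph sketch with k = 4: every k-mer becomes a
# vertex, and consecutive k-mers within a read are linked by an edge.
from collections import defaultdict

def build_debruijn(reads, k=4):
    graph = defaultdict(set)  # k-mer -> set of k-mers that follow it
    for r in reads:
        for i in range(len(r) - k):
            graph[r[i:i+k]].add(r[i+1:i+1+k])
    return graph

reads = ["CCGTTA", "CGTTAC", "TTACGT", "ACGTTC", "CGTTCA"]  # toy 6-base reads
graph = build_debruijn(reads)
for kmer, nexts in sorted(graph.items()):
    tag = "  <- branch" if len(nexts) > 1 else ""
    print(kmer, "->", ", ".join(sorted(nexts)), tag)
```

With these toy reads, CGTT is the only vertex with two successors (GTTA and GTTC), so it prints with the branch tag; walking the one path that visits every vertex spells out the toy genome, just as in the slide.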
Okay, so once we've constructed the graph, we then want to perform a little more cleanup on it. There are typically still sequencing errors that remain uncorrected, and if we've sequenced a diploid genome, heterozygosity can cause various structures in the graph that we want to get rid of. I'll talk about those now. So the assembly graph ends up having these things that we call tips in it, which are artifacts caused by sequencing errors. Here's an assembly graph built from three different reads, two of which have one sequencing error towards the end. These sequencing errors cause what we call tips in the graph: little structures that end up going nowhere. They cause a branch in the graph that might confuse the assembler, but they don't actually contribute new sequence; it's just erroneous sequence caused by the sequencing errors. Since this will confuse the assembly, we want to design algorithms that find these structures and remove them from the graph, to make it a little easier to find the true path that spells the genome. Now, the second type of graph artifact that I want to talk about is called a bubble. If you sequence a diploid genome like a human genome, any allelic variation, for example heterozygous SNPs or indels, will cause this structure in the graph. What happens here is that the two alleles share sequence, so all the shared sequence becomes one chain of vertices here. But once we hit this point of variation, for example this C-to-G heterozygous SNP here, the graph branches, and once we rejoin the shared flanking sequence on the other side, the graph comes back together. This would again confuse the assembler: it wants to find just one path through the graph, but because of this branching structure, it needs to resolve that. So I'll give you an indication of how these algorithms work. We start with a graph that looks like this. It looks like a big mess: vertices everywhere, edges connecting all over the place. But we want to identify the tips and bubbles and get rid of them. First, we identify tips. To do this, we look at all of the terminal vertices in the graph, every vertex that's connected in only one direction, and we slowly walk back until we find a branch point. Once we've found the branch point for each of these tip structures, we remove them. Once we've done that, the graph is cleaned up quite a lot, and we're left with a much cleaner graph. Second, we remove the bubbles. To do that, we identify all the branch points in the graph, traverse the graph until the branches come back together, and then collapse the bubble into one sequence, arbitrarily choosing one of the two heterozygous alleles to represent the sequence in the genome. And we can see that the graph is a lot simpler after we've performed those two steps: removing tips, which gets rid of residual sequencing errors, and removing bubbles, which gets rid of heterozygosity. Finally, we're now able to build contigs from the graph. Unfortunately, the graph isn't perfectly unique; there are still branches in it, which limits the length of the contigs we can build. So when we build contigs, we just compact all of the chains of vertices that can be unambiguously merged together. We end up with a set of contigs that looks like this: some of them are long, some of them are just single vertices representing the branch points. And that's really all the assembler is doing: finding the unambiguous sections of the genome that can be assembled together.
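Here's a minimal Python sketch of the tip-clipping step just described, on a toy directed graph. It only handles single-vertex tips and ignores coverage, both simplifications; real implementations bound the tip length and check read depth before deleting anything.

```python
# A minimal sketch of tip clipping on a toy directed graph: a tip is a
# dead-end vertex (no outgoing edges) whose parents are all branch points,
# so removing it cannot disconnect the true path.

def clip_tips(adj):
    """adj: dict vertex -> set of successor vertices. Modified in place."""
    parents = {v: set() for v in adj}
    for u, vs in adj.items():
        for v in vs:
            parents.setdefault(v, set()).add(u)
    tips = [v for v in parents
            if not adj.get(v)                              # dead end
            and all(len(adj[p]) > 1 for p in parents[v])]  # off a branch
    for t in tips:
        for p in parents[t]:
            adj[p].discard(t)                              # detach the tip
        adj.pop(t, None)

# Toy graph: A -> B -> C -> D is the true path; B also branches to the
# erroneous dead end X (a tip caused by a sequencing error).
adj = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": set(), "X": set()}
clip_tips(adj)
print(adj)  # X removed: {'A': {'B'}, 'B': {'C'}, 'C': {'D'}, 'D': set()}
```

Note that D, the genuine end of the path, is kept because its parent C is not a branch point; only the dead end hanging off the branch at B is clipped.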
Now, how long should you expect contigs to be? If you're sequencing simple bacterial genomes, your contigs might be a few hundred thousand bases in length. If you're sequencing large, difficult genomes, for example a human genome, your contigs might be on the order of gene sizes, about 10 to 20 kilobases in length. Now, finally, we want to scaffold these sequences together. You heard this morning about paired-end data, and we can use paired-end data to span over repetitive regions, or regions that weren't covered deeply enough to assemble. The way we do that is we map all of the read pairs to the set of contigs that we've just generated, and then we find positions where one half of the read pair aligned to one contig and the other half aligned to another contig. So I've highlighted these read pairs in blue here, with their mates over here, and they might say that this contig is followed by this contig. Similarly, these reads are clustered together and these reads are clustered together, so you might say that this end of this contig is followed by that one. We can use this information to build what we call a scaffold graph, which represents the connectivity between contigs, and we can estimate the distances between the contigs from the fragment size distribution. So if we know that on average our pairs are expected to be about 500 bases apart, then by looking at where they map on the contigs we might be able to say, okay, I think this contig is followed by a 100 base pair gap, then by this other contig. And once we've connected all of these contigs together into a scaffold, with the gaps represented by runs of Ns, we can perform an additional step of local assembly to fill these in. There are various programs that will do this; there's one that I wrote as part of SGA (sga gapfill), or GapCloser from SOAPdenovo, that will try to fill in these sequences between contigs. I'll note here that you can also use other sequencing technologies to fill in these gaps. The gaps between the contigs are typically the most difficult parts to assemble, because they're the repetitive sequence that the assembler can't resolve. So you might want to bring in a different type of sequencing technology, for example one that generates much longer reads, like PacBio or Oxford Nanopore, to try to fill in the repetitive sequence that lies within the scaffolds. Okay, so that's really how assemblers work. Now I'm going to switch to talking about what makes assembly difficult from a practical standpoint, but I'll just ask now if there are any questions about that part, about how the inner workings of assemblers go. Michelle, do you have a question? [Question inaudible] Yeah, that's right, you reuse the same reads. You can do things like only take reads anchored nearby by a pair to assemble the gap; that's a way of making the problem less repetitive, making it easier to assemble locally. Or if you wanted to use a different type of technology, that's a really strong way of filling in the gaps as well.
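To illustrate the gap-size estimate mentioned in the scaffolding discussion, here's a minimal sketch assuming a known mean fragment size; the coordinates and the 500-base library are made-up numbers, and real scaffolders also model the variance of the fragment-size distribution rather than taking a plain average.

```python
# A minimal sketch of estimating the gap between two scaffolded contigs
# from read-pair placements, given a known mean fragment (insert) size.

def estimate_gap(pairs, mean_insert):
    """pairs: list of (tail_a, head_b), where tail_a is the distance from
    the first mate's start to the end of contig A, and head_b is the
    distance from the start of contig B to the second mate's end. Each
    fragment spans tail_a + gap + head_b, so gap = insert - tail_a - head_b."""
    gaps = [mean_insert - tail_a - head_b for tail_a, head_b in pairs]
    return sum(gaps) / len(gaps)  # average over all supporting pairs

# Three hypothetical supporting pairs from a ~500 bp library
print(round(estimate_gap([(210, 190), (230, 180), (205, 195)], 500)))  # ~100
```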
Okay, so about two years ago there was a large benchmarking competition in the assembly field called Assemblathon 2, and the results were a bit shocking, in that the results between the different species that were sequenced for this competition, and between different assemblers, were highly variable. One assembler might do really well on one genome and very poorly on another, and conversely, one species might be much easier to assemble than another. So it got the field thinking about what properties of genomes make some difficult to assemble, and how we can help users, people who want to assemble a genome, select an assembly strategy and parameterize their software. So I came up with a list of what I think makes assembly difficult, and it's unfortunately quite long. Repetitive sequences, obviously, are one of the main factors contributing to assembly difficulty. Human genomes are littered with transposons, and all of those different copies of transposons cause branches in the assembly graph. It just confuses the assembler and makes it very difficult to find the true path through the graph, and that's why when we assemble a human genome we might only get 20,000 base pair contigs. A second thing that makes assembly challenging is very high heterozygosity. Human genomes have a heterozygous SNP about every thousand bases. For some outbred organisms, like the oyster genome in the example I'm going to show, SNPs can appear about every 70 to 80 bases. That's an order of magnitude higher heterozygosity than human genomes, and what happens is that the bubbles in the assembly graph that I just showed you end up appearing all over the place. It becomes very difficult to resolve these bubbles and collapse them together to come up with a linear sequence, so anything with very high heterozygosity becomes more challenging to assemble. Low coverage, of course: if you sequence to 10x, you're not going to be able to assemble the genome. The graph just becomes very fragmented, not connected with edges in places, so the assembler is not able to reconstruct the full sequence. I typically recommend people sequence to at least 30 to 40x if they want to attempt an assembly of a challenging genome, and more is almost always better. Some genomes are very GC- or AT-rich, and that biases sequencing coverage towards certain regions of the genome. For example, the malaria parasite Plasmodium falciparum is 80% AT; it's extremely hard to sequence it and get good coverage across the whole genome, so it's a classic example of a difficult genome to sequence and assemble. Of course, if the error rate in your reads is very high, assembly is going to be difficult, and if you have chimeric reads or sequencing adapters in the reads, these quality problems can be quite challenging to get around. Or if there's sample contamination, if the lab has made an error and sequenced two things together, or if there was bacterial contamination along with the organism you're trying to sequence, that's going to cause additional challenges for the assembler. The final one is something of a special case: sequencing multiple individuals. Often, if you're sequencing things that are very small, you just can't get enough DNA from one individual, so what people might do is say, okay, I'm going to sequence ten of these fruit flies and then try to assemble them. But that just increases the heterozygosity, because they all have slightly different genomes, and that's going to make it a lot more challenging for the assembler. So I almost always suggest never sequencing multiple individuals, unless they're completely inbred and have no heterozygosity, if you want to get a good assembly out of them. Okay, so I showed you a picture of an assembly graph that looked like this earlier.
Now I'm going to talk about how we can look at these features in the assembly graph to quantify how difficult a given assembly will be. So I've annotated these features in the ways that we've talked about before: these tips on the graph are sequencing errors, these bubbles are caused by SNPs and indels, and this complex branching structure here is just a sequence repeat. So I'm going to talk about methods that we can use to estimate, for example, heterozygosity, to predict whether the genome is too heterozygous to assemble. The way that we do this is we build a probabilistic model of the coverage of the graph. I talked about k-mer coverage before; this is just the number of times each one of these k-mers appears across the entire set of sequencing data. For example, if we sequenced our genome to about 40x, we expect each k-mer to be seen about 40 times, and we can build a histogram of the k-mer count distribution. This is what a good k-mer count distribution should look like. We see this very strong peak here at about 30x; we sequenced this genome to 30x. We see an additional peak here where k-mers have been seen only a single time: these are all the k-mers that have sequencing errors in them. Now, this is a human genome, and we see a little bump here. You probably have to look closely for it, but these are k-mers that are at half depth. We've sequenced to 30x, and this is a small subpopulation of k-mers that sit just below 15x: these are all the k-mers that cover heterozygous positions in the human genome. So if you sent me this plot, I'd say, yeah, this is a great k-mer distribution, it's basically well behaved, go ahead and assemble this. Now let's look at one that's not so well behaved. This is the oyster genome that I was talking about before. Here we see the peak at 1x of k-mers that have sequencing errors, and then we see two peaks. Going back to human, there was one dominant peak; here there are two. This peak here is all the k-mers that are present on both alleles of the oyster genome, and all of the k-mers here at half depth are the ones present on only one of the two alleles. Because this half-depth peak is so high, it indicates extreme heterozygosity in this genome. This data actually came from a paper where they sequenced the oyster genome using normal Illumina data, tried to assemble it, and completely failed, so they fell back on older, longer-read technology to assemble the data. It's basically so heterozygous that you can't get a good assembly out of it from short-read sequencing. [Audience question] Yeah, I think the best thing to do would just be to inbreed it, to decrease the heterozygosity. If you sequenced a population, then instead of two peaks you'd end up with what would look like an exponential distribution of allele frequencies, and a lot of it would just not be covered deeply enough to be able to assemble. And yeah, I guess I meant multiple individuals there; I still think you'd have the same problem with heterozygosity. Did you have a question too? That's a good question, I wish I had a plot for that: if the sample is contaminated, for example you sequenced a eukaryotic genome and there was bacterial contamination, would you see a second peak? You probably would, though it wouldn't look quite the same: these heterozygosity peaks are special in that they sit at exactly half depth, whereas a contamination peak would appear at some arbitrary position, depending on the percentage of contamination.
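Here's a minimal sketch of how a k-mer count histogram like the ones on these slides is computed; tools such as sga preqc or Jellyfish do this at scale with specialized counters, and the toy reads here are just made up to show the shape of the output.

```python
# A minimal sketch of building a k-mer count histogram. The x-axis is
# "times a k-mer was seen", the y-axis is "number of distinct k-mers seen
# that often": an error peak sits near 1, the main peak near the
# sequencing depth, and a heterozygosity bump at about half depth.
from collections import Counter

def kmer_histogram(reads, k):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i+k]] += 1
    hist = Counter(counts.values())   # multiplicity -> number of k-mers
    return sorted(hist.items())

# With real data you would stream a FASTQ file here; this is toy input.
print(kmer_histogram(["ACGTACGTGG", "ACGTACGTGG", "ACGTACGTCG"], 5))
```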
[Audience question about resequencing] To be clear, if you sequence the same sample multiple times, that's fine; that just increases coverage. What you don't want is to sequence, say, different samples, different individuals. Okay, so now I'm going to turn back to this question of whether we can actually measure how difficult an assembly is going to be. I'm going to go over this very quickly. Essentially, what we can do is build a probabilistic model of whether branches in the graph, which are represented by this structure here, come from sequencing errors, heterozygous variation, or repeats. The idea, without going into the mathematics, is that if it's heterozygous variation, we expect both branches to be roughly equally covered: we expect a 50-50 proportion of either allele to be sequenced. If it's a sequencing error, we expect one branch to have normal depth and one branch to be seen only a few times. And if we run this model on our data, we can then predict the frequency of heterozygosity without actually running the whole genome assembly and without having a reference genome. So these are the results on the six different Assemblathon genomes. The human genome is this bluish-green here, and the predicted heterozygosity is about one in a thousand bases, which is what we expect. The oyster genome, which is this nightmare case that I've been showing, is about one in a hundred bases, with the other genomes falling in between. This one down at the bottom is a haploid yeast, which is a negative control in this case; it shouldn't be predicted to have any heterozygosity, but because of classification errors, it turns out to have one of these branches about every 50,000 bases. We can also use this to predict how repetitive the genome is. Here the human genome tops the list as one of the most repetitive, along with the oyster genome, while the yeast genome, which is only 12 megabases, is the least repetitive by this measure. We can also predict the genome size using these methods. The predicted human genome size from this classification is about three gigabases, which is just a little bit lower than what we know it to be; the snake genome is 1.5 gigabases, and so on.
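Here's a minimal sketch of the branch-classification intuition just described, reduced to fixed thresholds; the actual model is probabilistic, so the cutoffs below (5% of depth for errors, a roughly 50-50 split for heterozygous sites) are purely illustrative assumptions.

```python
# A minimal sketch of classifying a graph branch from the coverage of its
# two arms: errors leave one arm nearly empty, heterozygous variants split
# coverage roughly 50/50, and repeats carry extra coverage from the other
# copies. Thresholds are illustrative, not from any real tool.

def classify_branch(cov_a, cov_b, depth=40):
    lo, hi = min(cov_a, cov_b), max(cov_a, cov_b)
    if lo <= 0.05 * depth:
        return "sequencing error"        # one arm seen only a few times
    if lo / hi > 0.5 and lo + hi < 1.5 * depth:
        return "heterozygous variant"    # roughly 50/50 split of one locus
    return "repeat"                      # extra coverage from other copies

for branches in [(40, 2), (22, 18), (45, 38)]:
    print(branches, "->", classify_branch(*branches))
```

Counting how often each class fires per base of graph gives exactly the kinds of genome-wide estimates shown on these slides: heterozygosity rate, repeat content, and, from the number of distinct non-error k-mers, genome size.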
So this is part of a program called preqc, which you can run on just a set of raw sequencing data; it will predict these metrics for you and let you explore your data a little before you actually set off to assemble the whole genome. I run a mailing list called sga-users, where I invite people to run this QC program on their data and send me the reports, and I'll comment on ways I think they might be able to improve their genome assembly based on the characteristics the program predicts. So here are some other things the program will give you. It will plot the quality score by position, as a measure of the quality of your reads; this is the same data we saw earlier, with the bird data being lower quality towards the 3' end. It will also predict error rates for you, all without a reference genome, so you can run it on any sequencing data set and figure out what your error rate is, regardless of whether you have a reference genome or not. One thing I talked about was biased sequencing. When you sequence something like Plasmodium, which is 80% AT, the sequencing tends to be quite biased; you don't have uniform sampling across the entire genome. A way of measuring this is to plot the coverage of the genome as a function of GC content. This is a cichlid fish that was part of the Assemblathon, and we see that the sequence coverage is largely independent of GC; this is what we want to see. Now here's a yeast data set where it's not quite as uniform: the higher-GC sequences are less represented than the sequences up here, so we see a linear drop in sequence coverage as a function of GC. This is something we don't want to see; this is where you might want to complain to your sequencing core that your libraries weren't good, and try to get a little better data out of them. But this program will let you explore these sorts of problems in your data. This is what the oyster genome looks like; again, we see two different distributions here, one for homozygous sequence and one for heterozygous sequence. The program will also estimate fragment size histograms for you. It will take in your paired-end data and let you see whether the size distribution matches what you expect from your size selection. For example, here the snake data had a size distribution of just under 400 bases, and the bird data was about 500 bases. The oyster data was a mixture of three different sequencing libraries, one at 200 bases, one at 450 bases, and one at 700 bases, and that shows up here as three different peaks in this plot. And finally, this program will actually run a simulated assembly for you, as a way of indicating what you should expect when you run the full genome assembly. It just samples vertices from the graph, sees how well connected they are, and puts out a distribution of the contig lengths as a function of the k-mer size used to parameterize the de Bruijn graph assembly. We can see that vary here: for the yeast data, if we use a k-mer size of about 41, we'd expect contigs of about 30,000 bases; going down to the human data, if we use a k-mer of about 71, they'd be about 10,000 bases. So by running this, and it just takes a few hours, you can get a sense of how well your data will come together before you spend a few days or a week completing the whole assembly. Okay, so here's just an example of how to run it. It's just four commands: you build an index of the reads, you run the sampling procedure on it, and then you run a Python script that generates the report and all the plots I just showed you. And it's relatively fast; you can run it on a human genome in about 24 hours. And there's a paper that describes this on... Okay, so that's really the practical difficulty of assembly. Any more questions on that before I move on to long reads? Okay, yeah. [Audience question about whether SGA is still maintained] Yes, it depends on your definition of maintained. SGA is really in bug-fixing mode now: if bugs come in, I fix them, and if people ask questions on the mailing list, I'll answer them, but new algorithm development has largely stopped. I've mainly moved on to long reads, which we're going to talk about, as I think...
We've been working with Illumina data for seven or eight years now, and I think we've sort of topped out on what we can get from short reads; I think a lot of the big improvements will come from integrating long reads, like PacBio. [Audience question] In the graphs you showed, the sequencing error rate increases past 100 bp. How do you think about the Illumina TruSeq kits, which claim 250 bp reads? Yeah, that's a good question. So, just to reiterate the question: typical Illumina reads have been about 100 bases for the last, say, four years, and now, especially on the MiSeq and I think the more recent HiSeq versions, you can get kits to sequence up to 250 or 300 base pair reads. But because the error rate continually increases along the read, the question is whether those long reads out to 250 bases can be trusted. When they first brought these 250 base pair kits out, I think a lot of people had trouble getting good data out to that length, but now I think it's more or less stabilized, and the reads are trustworthy all the way out to 250. At least I've seen very good data sets, particularly from the Broad Institute; they had a good human genome assembly paper last year using 250 base pair reads, and they got an excellent assembly out of it. So I think that data can be used, and it's certainly helpful to have that read length. [Follow-up] So that's replacing some chemistry? Yeah, it's basically just running the sequencer for longer; it's cycle by cycle, and you run it for 250 cycles instead of 100. Because they've been able to improve the chemistry over the years, they're able to get clean enough signals out to that length where they weren't before. The chemistry improvements are getting out of my area, but it seems that they've largely solved the problem of running it for that long. Okay, so in the last few years there's been a lot of interest in using long reads to improve assembly. This was really prompted by PacBio's work. They released an assembler called HGAP, the hierarchical genome assembly process, which demonstrated that they could overcome the high error rate in their reads and assemble complete bacterial genomes. The characteristics of PacBio reads are that the reads can be up to 10,000 to 15,000 bases in length, but the error rate is about 15%. So what the scientists at PacBio did is they found ways to find overlaps between these very long but very noisy reads, calculated a consensus sequence for the long reads, and then used the Celera Assembler to assemble them. This pipeline works, and they were able to demonstrate that they could assemble E. coli into a single contig. So you don't have to worry about a fragmented assembly anymore; it just assembles into one piece, which is really the gold standard and what we shoot for in the assembly field. Now, there are basically two platforms for long reads. There's PacBio, which people are maybe familiar with, and there's Oxford Nanopore. Oxford Nanopore is much more recent; they've only had data available for about one year, but it's a very exciting technology in that the sequencer is about the size of a smartphone, about the size of my phone. You just pipette a little bit of sample in the top, plug it into your laptop with a USB cord, and then you can sequence directly on your desktop.
So a lot of people are excited about using this in the field, for example sequencing things when you don't have the infrastructure to support HiSeqs, or just having rapid sequencing without having to support a large sequencing facility. Now, the long read assembly pipeline largely looks the same as the pipelines that I've been showing you, except it has two differences. First, you can't really use k-mer-based methods to perform error correction on long reads; you go back to these overlap-based methods, which tend to be computationally very expensive, so long read assemblies tend to take a long time. And the final step now, which I've added instead of scaffolding (you don't really need to scaffold if you're assembling genomes into a single piece), is very precise polishing of the consensus sequence of the assembly. This step tries to improve the base-level accuracy of your final assembly, to get over the noise of the raw reads. These methods use a probabilistic model of the sequencer to infer a new consensus sequence, and they tend to be quite computationally expensive. Now I'm just going to point out a few long read assembly papers. This is the one that I talked about earlier, the HGAP paper, where PacBio showed that you could take their data and assemble complete genomes from PacBio data alone. Now that assembling microbial genomes with PacBio is more or less routine, people have set their sights on larger and larger genomes. This paper, where I've put up a screenshot of the bioRxiv preprint, though it's now published in Nature Biotechnology, is assembling a human genome with PacBio data; Adam Phillippy and Sergey Koren developed some very fast ways of finding overlaps between long, noisy reads that scale up to human genome size. And this final paper that I've put up here is a preprint that myself, Nick Loman in the UK, and Josh Quick have been working on, for assembling complete bacterial genomes: assembling one bacterial genome into one contig using nanopore sequencing, and developing this sort of assembly pipeline for nanopore data. I should have put up a link to the GitHub site where the software is published: github.com/jts/nanopolish is our assembly pipeline for nanopore data. And that's really the main research direction that I've been focusing on for the last few months. Okay, so just to recap, these are the different long read assembly pipelines that you can use: HGAP, or Celera Assembler 8.2, to assemble PacBio data, or this nanocorrect and nanopolish pipeline that I've been developing for Oxford Nanopore MinION data. And the results are often much better than short read assemblies. If you can afford it, and if you're trying to assemble something quite small like a bacterial genome, I definitely suggest going for PacBio or nanopore data over Illumina data, as you'll often get a much more contiguous and higher quality assembly. But the compute requirements of long read assembly are so high that efficiently working with this data and accurately computing a consensus sequence is really an open research problem, and that's the type of work that I'm doing in my research group. Okay, so I'm going to end the talk there. If you have any more questions, feel free to ask them now, or if you want help with your assembly, feel free to email me afterwards.