I'm Jared Simpson. I'm a principal investigator at the Ontario Institute for Cancer Research in Toronto. My research group develops algorithms for processing DNA sequencing data of the type you've all heard about today, with a particular focus on de novo genome assembly. This part of the course is going to be a little different from the previous sessions in that it's only a lecture; there's no associated practical session. But if you have any questions about the choice of assemblers or how to run genome assemblers, feel free to ask, either during the next hour or after the course, or email me; my email address is on the last slide.

In this talk I'm going to go through genome assembly at a technical level: what's happening under the hood in genome assemblers, how they take a set of sequencing reads directly off a DNA sequencer and put the genome together. That will be the first part of the talk. The reason I go into the details of how the assembly algorithms work is that it's still quite a difficult process to assemble a genome, and often when you take your own data and try to assemble it, you get results that aren't as good as you'd expect. The key to troubleshooting assemblies and understanding what has gone wrong is this understanding of what's going on under the hood. So in the first part I'll talk about exactly that: what's going on inside the genome assembler. I'll then walk through an example assembly pipeline, starting from the raw reads and going all the way to sets of contigs and scaffolds. And finally I'll go through some examples of the problems that can come up: what features of a genome and what characteristics of the data might give you trouble when running a genome assembly. Hopefully this will give you a better understanding when you go off and run an assembly on your own.

So, to put us all on the same page, what is genome assembly? What do we mean when we talk about genome assembly? Here I'm representing a genome as a set of colored bars, where the genome is a mixture of unique sequences, which are the green, blue, light blue and purple bars, and repetitive elements. Repetitive elements are blocks of sequence that have been duplicated within the genome, and these repeats are what cause trouble for the genome assembler: they are sequences that look identical, and if your read length isn't long enough, you can't resolve where to place one copy of a repeat versus another. I'll depict a genome like this throughout, and I'll come back to the question of how repeats affect assemblers during the talk.

When we sequence a genome, from the standpoint of the genome assembler the sequencing reads are just an unordered sample of fragments of the genome. You randomly select positions from the genome and put the fragments into a bag, and the genome assembler has to reverse this process and put the genome back together, hopefully end-to-end, into whatever the original source sequence was that you sampled all these reads from. So first I'll talk about assembly graphs, which are the data structures we use to work with sequencing reads, then this example assembly pipeline, then some of the features that make genome assembly difficult.
And finally I'll end by talking about genome assembly with long reads, like you get from PacBio sequencers or from an Oxford Nanopore sequencer, and how that makes the problem a little easier due to the read length.

Key to understanding how genome assemblers work is the concept of an assembly graph. An assembly graph is used to represent the relationships between the reads you've sequenced. If two reads overlap, we want to record the information that this pair of reads overlaps, and the genome assembler will use this to try to figure out what the correct sequence of the genome is. I'll come back to this picture a little later, but when we're working with these genome assembly graphs, usually a sequencing read is a vertex in the graph, and if two reads overlap they make an edge in the graph.

There are two types of graph used by genome assembly software. The first is what's called the de Bruijn graph, which breaks all the reads into short fragments called k-mers. If you sequence a hundred base pair read, the assembler might break that read up into 61 base pair substrings and construct a graph of those substrings; I'll give an example of this in a few minutes. When we're building assemblies of Illumina data, most of the algorithms are based on this de Bruijn graph model, the reason being that it's extremely fast and efficient to work with short fixed-length substrings rather than with complete reads and inexact overlaps between reads. So the first type of graph is the de Bruijn graph, which is based on k-mers, and the other type is called a string graph, where we keep the reads intact: every read is a vertex in the graph, and if two reads overlap we connect them with an edge.

I'll give an example of this overlap-based graph now. Here we have a single read, which we'll call read one; it's about forty bases in length. We add a vertex into the graph for that read. If we sequence another read, called read two, we add another vertex into the graph, and since the start of read two is the same sequence as the end of read one, we connect them with an edge. We're going to progressively build up this assembly graph, so now we've got two vertices connected by an edge. If we sequence a third read, the beginning of read three matches the end of read two, and it also matches the end of read one, so we add this vertex into the graph and connect it with edges to the previous two vertices. Then if we sequence a fourth read, we add another vertex to the graph and connect it with edges to all the reads it overlaps. So the beginning of some reads matches the end of another read; that's what we're talking about when we refer to overlaps between reads, just that a suffix of one read matches a prefix of another read.

Now, we've labeled all of the edges in this graph with the overhanging string between the reads. When we connected this edge between read two and read one, the overhanging part of read two was the string TAC, so we labeled the edge with the string TAC. What the genome assembly software is trying to do is find a path through this graph, which is just a sequence of vertices that are visited, and concatenate all of the edge labels together, joining them into a string that spells the original sequence of the genome.
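To make that concrete, here's a minimal sketch in Python of building this kind of overlap graph. The toy reads, the minimum overlap length, and the exact suffix-prefix matching are all simplifications of mine; real assemblers use index structures like the FM-index and tolerate inexact overlaps.

```python
# Minimal string-graph sketch: vertices are reads, an edge a->b is labelled
# with the part of b that overhangs past the end of a (like "TAC" above).
# Assumes exact overlaps and brute-force all-vs-all comparison (toy only).

def find_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:olen]):
            return olen
    return 0

def build_string_graph(reads, min_len=3):
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olen = find_overlap(a, b, min_len)
                if olen:
                    edges[(i, j)] = b[olen:]   # the overhanging string
    return edges

reads = ["GACCTACA", "CTACAAGT", "CAAGTTAG"]
for (i, j), label in build_string_graph(reads).items():
    print(f"read{i + 1} -> read{j + 1}: {label}")
# Walking read1 -> read2 -> read3 and concatenating read1 with the edge
# labels spells GACCTACA + AGT + TAG = GACCTACAAGTTAG, the source sequence.
```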
So all the assembler is trying to do is build up this graph, which we call a string graph, then find a path through it; the string that the path spells is the sequence of the genome. That's really the high-level technical picture of how genome assemblers run.

Now I'm going to use an example assembly pipeline to illustrate these concepts further. This is what an assembly pipeline might look like. The assembly software you download from GitHub or somewhere else on the internet will implement these steps and take the data from your raw FASTQ format all the way through to a FASTA file with contigs and scaffolds. The first thing we might do is some cleanup of the data to get rid of sequencing errors and sequencing artifacts, by either trimming the reads or filtering them out. We then try to correct sequencing errors in the reads, build one of these assembly graphs, clean the graph of additional artifacts, build contigs from the graph, scaffold them together, and finally try to fill in the gaps in the scaffolds. I'll go through all of these stages individually.

The first one, and actually one of the more important stages, is just cleaning up your data before you give it to the assembler. Oftentimes with an Illumina sequencing run there is some processing artifact: for example, adapters were joined onto the ends of the fragments and the sequencer read into the adapter sequence. If you don't get rid of those adapters, they can look like very high copy repeats to the assembler, and what happens is that the assembler just stops at the position wherever it finds these adapters. Here is an example run; you all saw IGV in the last module. This is a sequencing run that had a lot of adapter contamination. All of these colored bars are soft-clipped reads; all of them are adapter sequence that was on the end of one of the reads. I tried to assemble this data and the assembly completely failed, because 40 to 50 percent of the reads had these sequencing adapters on them. If you then run one of the adapter trimmers, some of which are listed on this slide, like cutadapt, Trimmomatic and Scythe, they will get rid of those adapters on the 3' end of the reads and present just genomic reads to the genome assembler, and everything will work a lot better. So this is an incredibly important step if you do have adapter contamination.

Another way you might want to trim your reads is if you have low quality bases at the end. You might have seen an Illumina error profile, and I'll show one a little later, where the error rate increases at the 3' end of the read. If that error rate increases quite drastically, you often want to trim the read back just to get rid of those bad bases; that makes it a little easier on the assembler. Here's a slide with the error rate for six different sequencing runs. This was part of the assembly competition I'll talk about a little later. We sequenced a yeast, a fish (the Lake Malawi cichlid), a boa constrictor, a human, a parakeet, and an oyster, and this is the error rate as a function of position in the read for these six data sets. As you can see, for this parakeet bird data set they sequenced 150 base pair reads and the error rate increased to about 3% at the 3' end of the read. This is a little higher than we're usually comfortable with, so you'd probably want to trim those reads back to about 100 bases just to get rid of those extra errors.
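As a toy illustration of 3' quality trimming, here's a sketch assuming Phred+33 encoded FASTQ qualities; real trimmers like Trimmomatic use sliding-window averages and more robust rules, so treat this as the idea only.

```python
# Trim low-quality bases from the 3' end of a read (Phred+33 qualities).
# A real trimmer would use a sliding-window average rather than a hard cut.

def trim_3prime(seq, qual, min_q=20):
    end = len(seq)
    while end > 0 and (ord(qual[end - 1]) - 33) < min_q:
        end -= 1                      # drop the trailing low-quality base
    return seq[:end], qual[:end]

seq, qual = "ACGTACGTACGT", "IIIIIIIII###"   # 'I' = Q40, '#' = Q2
print(trim_3prime(seq, qual))                # ('ACGTACGTA', 'IIIIIIIII')
```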
As an alternative to trimming, we can perform error correction. This is where, instead of just throwing away sequence that might be bad, we try to fix it to what the true genomic sequence was at that position. We do that using what's called k-mer-based error correction, and I'll illustrate it here. Here's a single read where one position was miscalled; there's a sequencing error where the sequencer called a C. How we're going to fix this is by exploiting the redundancy in sequencing. Usually when you sequence a genome, you sequence it to 30 to 40x, which means each base in the genome is covered by 30 to 40 reads. So if we take a substring of one of these reads and count how many times it was seen in other reads, the substrings that are true genomic sequence will be seen a lot, while the substrings caused by sequencing errors, like this one containing the erroneous C, will be seen very few times. What we want to do is compare the number of times we see the genomic substrings to the number of times we see the substrings with errors, to identify where the sequencing errors occur.

An error correction algorithm does exactly that. It takes every substring of the read, from the first through to the last, and counts how many times each is seen. We get a coverage profile that looks like this for the true genomic sequence, where it bounces between 20 and 30, and then when we hit a sequencing error we see the coverage drop to one. The algorithm will then say: OK, I think there's a sequencing error here. It tries to correct the error by trying different substitutions at that position, and if it finds one that restores the count profile to this range of about 20 to 30, it makes that fix and says it has corrected that C to a T, or whatever the true base was.

This idea of k-mer-based error correction has been quite popular, and there's a lot of software that implements it. Some genome assembly packages you can download have algorithms built in to do the error correction; for others, you can download an external package that only does error correction. Some of the most popular ones are listed here: Quake, SGA (a software package I wrote), SOAPdenovo, BFC, BLESS, Lighter and Musket are all k-mer-based error correction programs.

An alternative way of doing error correction is finding overlaps between reads, like I showed before: you find where the end of one read matches the beginning of another. If you find a lot of overlapping reads, you can calculate a consensus sequence for the reads. This works very well, and it can handle different types of errors than k-mer-based correction, but it's typically too slow to run on very large genomes. If you sequence a human genome you might have a billion reads, and finding overlapping pairs out of a set of a billion reads takes a lot of compute time. So typically we use the faster k-mer-based error correction for very large genomes.
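Here's a small sketch of that correction loop. The counting, the "solid" threshold of 5, and the single-substitution fix are simplified assumptions of mine; real correctors like Quake or BFC model coverage and base qualities much more carefully.

```python
# K-mer based error correction sketch: count k-mers, scan a read, and when
# a k-mer's count drops below a "solid" threshold, try substituting the
# newest base of the window and keep a fix that restores solid coverage.
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, solid=5):
    read = list(read)
    for i in range(len(read) - k + 1):
        if counts["".join(read[i:i + k])] >= solid:
            continue                        # genomic k-mer, nothing to do
        pos = i + k - 1                     # base that just entered the window
        orig = read[pos]
        for base in "ACGT":
            read[pos] = base
            if counts["".join(read[i:i + k])] >= solid:
                break                       # this substitution looks genomic
        else:
            read[pos] = orig                # no fix found, restore the base
    return "".join(read)

reads = ["GATTACAGATT"] * 20 + ["GATTACCGATT"]    # one read has an A->C error
counts = count_kmers(reads, k=5)
print(correct_read("GATTACCGATT", counts, k=5))   # -> GATTACAGATT
```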
Right, so once we've cleaned up the data a little bit, we want to build the assembly graph. One of the popular ways of doing this is building a de Bruijn graph, which I mentioned before; again, this is built on the idea of fixed-length substrings, which we call k-mers. Here's an example of what a de Bruijn graph looks like for a k-mer size of four, built from these five sequencing reads of length six. What we do is take every four-base substring of these reads and make it a vertex. So the first four bases of this read, CCGT, is a vertex here, then the next substring, CGTT, is a vertex here. We put all of the 4-mers that are present in any of these reads into the graph, and then we connect two 4-mers with an edge if they're adjacent in a read. So we've connected CCGT with CGTT because they sit next to each other in this read. When we build up the graph from all the reads, it looks like this. We see that the substring CGTT is present twice in the reads, once followed by GTTA and once followed by GTTC. This causes a branch in the graph, and it's the assembler's job to figure out what the sequence of the genome is and how to resolve this branch. Here there's actually a unique solution: it can go from here to here, follow this branch around, go down, come back up to here, and then follow on. That spells a superstring containing every one of these reads in this little example.

Now, a quick detour into the type of research my group does. For a long time I focused on trying to make genome assemblers very fast and memory efficient. When we first got Illumina sequencing, around 2008 or 2009, you could only sequence about 36 base pair reads, and when you tried to sequence a human genome with that, you'd use a k-mer size of about 27. If you tried to store one of these de Bruijn graphs of 27-mers for a human genome, it would take maybe a terabyte of memory. Typically people didn't have a terabyte of memory to use on genome assembly; you might have to go to a supercomputing center with very large memory machines to do your assembly. But since we started working with Illumina data, we've developed new algorithms and new compressed representations of these graph data structures, and we've gotten that down to about 10 gigabytes of memory for a human genome assembly now. So after about seven or eight years of fairly difficult algorithmic work, we've gotten to the point where we can do fairly routine genome assemblies of humans from Illumina data. If anybody's interested in the computer science of how this works, and how we're now able to do these assemblies so efficiently, I'm happy to answer questions about that.
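Here's a minimal sketch of the de Bruijn graph construction from the example above with k=4; the toy reads are my own, but they reproduce the CGTT branch from the slide.

```python
# Minimal de Bruijn graph sketch with k=4: every k-mer is a vertex, and two
# k-mers are joined by an edge when they sit next to each other in a read.
from collections import defaultdict

def de_bruijn(reads, k=4):
    edges = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k):
            edges[r[i:i + k]].add(r[i + 1:i + 1 + k])  # adjacent k-mers
    return edges

reads = ["CCGTTA", "CGTTAG", "CGTTCG"]
for kmer, nexts in sorted(de_bruijn(reads).items()):
    print(kmer, "->", ", ".join(sorted(nexts)))
# CGTT -> GTTA, GTTC is a branch: the assembler has to decide which
# continuation belongs where, or stop the contig at this junction.
```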
Okay, now after we've constructed a graph, we're not quite ready to build contigs yet. What we want to do first is a step called graph cleaning. This is an additional data cleanup step where we try to get rid of different types of artifacts that appear in the assembly graph. The first type of artifact is what we call tips, and these are caused by uncorrected sequencing errors. When I showed you the de Bruijn graph before, we had a vertex for every k-mer; we're working with the same type of graph here, I'm just representing the k-mers as these gray circles. When you get a sequencing error, you still put the k-mers for that position into the graph, but since errors tend to be rare and only occur once or twice, what happens is you get this spur off the graph: a couple of connected nodes that end up going nowhere. So we've got an error here that causes this spur, or what we call a tip, and this error here causes this tip as well.

If your assembler is coming along trying to assemble the genome, it might get confused by these alternative paths that go nowhere, so what the assembler wants to do is get rid of these erroneous branches that are just caused by sequencing errors. That's one type of graph artifact. The other is something we call bubbles, which are caused by heterozygosity. If you sequence a diploid genome, you're going to sample k-mers from both of the alleles covering heterozygous sites. Here we have a C/G heterozygous site, and when you sample k-mers from those two alleles, it causes a divergence in the graph where you can follow a path for either allele one at the top or allele two at the bottom. This is a very distinctive structure, a divergence that then comes back together, so the assembler can efficiently identify these positions and say: OK, I think there's a heterozygous SNP here.

Do you have a question? Yes: why does the bubble span multiple nodes if there's only one variant base? Yeah, that's a good question. Let's say we're working with a fairly short k-mer. This k-mer, TCGAT, is shared by both alleles, so that would be here. If you slide it one base over, there's a k-mer CGATG and a k-mer CGATC; that's the first point of divergence, and that's this pair of nodes. If you slide it over again, you get the next pair of k-mers, these two, and you keep sliding over to get multiple nodes for that one variant. Thank you. The length of the bubble will be exactly the size of your k-mer. If you're sequencing a haploid organism, would you expect to see bubbles? You don't expect them, but you do get them. If you have a sequencing error that's recurrent, that happens four or five times, that might give you enough coverage that you don't get the tip structure; instead the path rejoins and you get an artificial bubble. Also, if you have diverged copies of repeats you'll get bubbles as well, just from paralogous sequence variants.

Okay, so after the assembler has built a graph, it might look like this, and figuring out what the genome sequence is from this graph is fairly difficult. So the assembler tries to identify these tips and bubbles to make the graph simpler. It starts by finding all of the branches that are only connected at one end; it walks those back to see where they rejoin the graph, and gets rid of them. That simplifies the graph quite a bit. It then tries to find the divergences where the bubbles occur; it walks along to see where they rejoin the graph and collapses them down to a single allele that will be represented in the final assembly. It then builds contigs, which typically just means finding the largest unbranching segments of the graph. So it will compact all of these nodes here; then it sees there's a branch here that can't be resolved; it compacts all these nodes here, and here, and so on, until everything that can be merged together without introducing any misassemblies has been merged.

Now, how long are these contigs? Typically, for an Illumina sequencing run of a bacterial genome, the contigs will be around 100,000 bases in length. If you sequence a large eukaryotic genome, say a human genome, they might be 10 to 20,000 bases in length. If you do PacBio long read sequencing, you can get contig sizes in the megabases, and that's something I'll return to a little later on.
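As a sketch of that contig-building step, here's a compaction of maximal unbranching paths (often called unitigs), assuming the tips and bubbles have already been removed; real assemblers also handle reverse complements and circular paths, which I ignore here.

```python
# Compact maximal unbranching paths of a (cleaned) de Bruijn graph into
# contigs. g maps each k-mer to the set of k-mers that follow it; reverse
# complements and circular components are ignored for simplicity.
from collections import defaultdict

def build_contigs(g):
    preds = defaultdict(set)
    for n, succs in g.items():
        for s in succs:
            preds[s].add(n)
    nodes = set(g) | set(preds)
    outd = lambda n: len(g.get(n, ()))
    ind = lambda n: len(preds[n])

    contigs = []
    for n in nodes:
        # skip nodes that sit in the middle of an unbranching path
        if ind(n) == 1 and outd(next(iter(preds[n]))) == 1:
            continue
        contig, cur = n, n
        while outd(cur) == 1:
            nxt = next(iter(g[cur]))
            if ind(nxt) != 1:
                break                 # nxt is a junction: stop the contig
            contig += nxt[-1]         # unbranching: append the one new base
            cur = nxt
        contigs.append(contig)
    return contigs

g = {"CCGT": {"CGTT"}, "CGTT": {"GTTA", "GTTC"},
     "GTTA": {"TTAG"}, "GTTC": {"TTCG"}}
print(sorted(build_contigs(g)))   # ['CCGTT', 'GTTAG', 'GTTCG']
# The contigs stop at the CGTT branch, just as described above.
```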
Now finally, once we've got our contigs, if we've sequenced paired ends we're not done yet: we can build what are called scaffolds. Scaffolds are a higher order structure where you link contigs together using paired end reads that map between them, and you separate the contigs with a gap, which is just unknown sequence that you can't figure out between the contigs. We call that higher order structure of linked contigs with gaps a scaffold. The way we build scaffolds is that we map paired end reads to the contigs and identify where a group of reads maps to the end of one contig and their mates map to the end of a different contig. I've color-coded them here: these blue pairs map here and here, red here and here, it's hard to see but purple here and here, and green here and here. We link up those contigs by their paired ends and build what we call a scaffold graph, which is really similar to the read graphs I've been talking about, and we construct the scaffolds from that. We can use the paired end insert size distribution, which we learned about in the previous modules, to estimate how far apart these contigs are in the scaffold, and then we fill in the sequence between the contigs with Ns, representing the uncertainty about that sequence.

There's a set of programs called gap fillers which take those scaffolds and try to do a local assembly, which is less strict, to fill in what that sequence is. Those gaps are usually caused by two things: either repetitive sequences that are too difficult for the assembler to resolve, or drops in coverage, because either you didn't sequence your genome deeply enough, or perhaps there was some GC bias in that region that prevented it from being covered by reads. If you have long reads to augment your short read assembly, there are programs that will let you use the long reads to fill in the gaps, which is especially helpful for highly repetitive regions.
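Here's a toy sketch of the link-collection step, assuming we already know which contig each read of a pair aligned to; real scaffolders also track orientation and use the insert size distribution to estimate gap lengths, both of which I leave out.

```python
# Collect scaffold links: contig pairs connected by enough read pairs
# become candidate joins. Orientation and gap-size estimation (from the
# paired-end insert size distribution) are omitted from this sketch.
from collections import Counter

def scaffold_links(pair_alignments, min_support=5):
    links = Counter()
    for c1, c2 in pair_alignments:
        if c1 != c2:                         # mates landed on different contigs
            links[tuple(sorted((c1, c2)))] += 1
    # demand several supporting pairs so one chimeric pair can't join contigs
    return {pair: n for pair, n in links.items() if n >= min_support}

pairs = [("ctg1", "ctg2")] * 12 + [("ctg2", "ctg3")] * 8 + [("ctg1", "ctg3")] * 2
print(scaffold_links(pairs))   # {('ctg1', 'ctg2'): 12, ('ctg2', 'ctg3'): 8}
```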
Okay, before I move on to some of the difficulties for assemblers, are there any questions about that part? Yeah: how does the assembler choose the sequences that go into the graph, and is that random? So it will select a k-mer size, and then every k-mer becomes a vertex in the graph. When it's doing the compaction, where it's finding the paths in the graph, it will usually just pick a node at random, but it actually doesn't matter: no matter what node you pick, you'll arrive at the same set of contigs. And the second question: we might have paths that resolve equally well; is there some sort of scoring that helps to choose between two possible resolutions? Yeah, that's a good question. There is a body of research on trying to find the genome assembly that maximizes the likelihood of the read data. This is a very principled framework for trying to find the best reconstruction of the genome, but those algorithms tend to be quite heavy, because you're looking at a lot of different alternative paths through the graph and then picking whichever one is best, and you have to use hidden Markov models and things like that to score the sequences. So they tend to only be used on very small genomes. Usually when we're doing human genome size assemblies, we're working so hard just to get the assembly to run quickly enough that we do very simple things: take the contigs that are 10,000 or 15,000 bases in length, and that's the result.

Right, so in 2013 there was an assembly competition which aimed to figure out what the best performing genome assemblers were. Genome assembly has been an incredibly popular research topic: I worked on it for quite a few years, many other researchers have worked on it, and there have probably been 10, 15, maybe even 20 assembly software packages developed in the last decade. So the questions were: which assemblers work best, what genomes do they work well on, and why do some genomes assemble very well whereas others have problems? This competition, which was called Assemblathon 2, really aimed to address these questions. There's a quote here from the lead author, Keith Bradnam, who summarized the state of the assembly field in 2013 by saying: don't trust the results of a single assembly; if possible, generate several assemblies with different parameters and different assembly software, and then select whichever is best for your purposes. I think this really underscores the state of the field, in that there's a lot of uncertainty around genome assemblers. We're relatively happy with read mappers now, where you can use BWA and it's going to do a pretty good job; that's not quite the case with assemblers, where performance is much more specific to the genome you're trying to assemble.

On this slide I'm summarizing some of the factors that make a given assembly difficult. I've touched on sequence repeats as one of the key factors. If there are high levels of heterozygosity, so you have these bubble structures occurring in many places in the genome, that can confuse the assembler. If your genome isn't covered very well, that causes breaks in the contigs, because you just don't have sequence reads that allow you to link things together. If your sequencing is highly biased, say you've sequenced something that's either AT rich or GC rich, you might not cover the genome completely, and that can cause problems for the assembler. Of course, if your reads have a lot of errors, or there are chimeric reads that join up different places of the genome into a single read, that causes problems in the graph. I've mentioned sequencing adapters. If your sample is contaminated, or if you've sequenced multiple individuals rather than a single individual, these are all factors that make life more difficult for the assembler.

So what I wanted to do after this Assemblathon paper was put together a package of tools that could measure all of these factors, everything about how a given assembly might go wrong, automatically, without any input from the user. You just give it a FASTQ file, and it will measure whether your coverage is good enough to do an assembly, how repetitive the genome is, and how heterozygous it is, as a way of giving you information about how difficult the assembly is going to be. In the next few slides I'm going to explain how this program works and go over some examples of the reports it generates, which is just a PDF file that lets you explore the properties of the genome you're trying to sequence and choose an appropriate assembly strategy.
I'm going to go back to the slide I showed at the very beginning. Essentially, how the program works is that it analyzes the structure of this assembly graph to estimate how difficult the genome is going to be to assemble. What we do is try to identify how often branches occur due to sequencing errors, how often they occur due to heterozygous SNPs and indels, and how often they occur due to repeats. The way we do this is by analyzing how often the k-mers occur in the reads. Just as I was saying when we talked about error correction, when you sample genomic k-mers you usually see them 30 to 40 times if you've sequenced the genome deeply, and we can use that information to learn about the error rate and, in particular, the heterozygosity of the genome.

If we draw a histogram of the number of times each 51-mer, just a fixed-length string of 51 bases, occurs, we get a plot like this for a human genome: it peaks at around 28x, and that reflects the fact that we've sequenced the genome fairly deeply. If we draw the same histogram for a highly heterozygous genome, like the oyster genome, it looks like this: we have a bimodal distribution. The reason is that the oyster genome has a SNP roughly every 190 bases, so a lot of the k-mers cover heterozygous SNPs and indels, and those form this first peak, while another population of k-mers comes from homozygous positions that are present on both chromosomes. This sort of distribution causes a lot of problems for genome assemblers: they want to treat the genome as a single string, and they don't want to see a lot of these bubbles forming. So this oyster genome will definitely cause a lot of problems when you put it into a genome assembler, just because of this high heterozygosity, and the level of heterozygosity is so high that it's readily apparent just from looking at this k-mer count distribution.

Now we can formalize this concept by calculating how often these different types of branches occur within the graph. In a paper I published in 2014, we have a statistical model for taking an arbitrary branch in the graph and classifying it as coming from a sequencing error, from a heterozygous SNP or indel, or from a repeat, based on the coverage counts I've been talking about. I'm not going to go into the math of how this works, but essentially what we get out is a prediction of how often there is heterozygosity within the genome. So these are the six genomes I showed before, three of which were part of the Assemblathon competition, and we see that the branch rate for the oyster genome is predicted to be about one in 100 bases, reflecting the high level of heterozygosity in the oyster genome. For the human genome, which is in this color here, it's around one in a thousand bases, which is also reflective of the diversity we expect when we sequence a human genome. As a negative control we sequenced a haploid genome, and for that we predict, I think, a heterozygosity rate of about one in 30,000 to 40,000 bases; that's just those bubbles I was talking about that are caused by recurrent sequencing artifacts. So that gives us a lower bound on the sensitivity of predicting heterozygosity by analyzing this assembly graph.
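Here's a sketch of how that histogram is computed; the k of 51 matches the slide, but counting k-mers in a Python dictionary is my simplification, as real tools use disk-based or compressed counters to handle billions of k-mers.

```python
# Build the k-mer count histogram used to spot heterozygosity: count every
# k-mer, then histogram the counts (multiplicity -> number of distinct
# k-mers). A single peak near 28x suggests a mostly homozygous genome; a
# second peak at about half that coverage means many heterozygous k-mers.
from collections import Counter

def kmer_count_histogram(reads, k=51):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return Counter(counts.values())

# hist = kmer_count_histogram(reads)
# hist[1], hist[2], ... are dominated by sequencing errors;
# a large value near hist[28] is the homozygous coverage peak, and a
# second bump near hist[14] is the heterozygous (oyster-like) peak.
```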
We can also do this for predicting the frequency of repeats. Again, for the oyster genome we see a very high rate of repeats, and for the human genome we see a high rate of repeats as well; this is what we expect, as we know the human genome is large and repetitive. What we get from looking at these three pictures is a measure of how difficult the graph is going to be to resolve for this assembly. What you want, when you sequence your genome and hold it up against one of these plots, is for its branch rate to be down here, like this yeast genome, which is relatively simple to assemble.

This program will also predict the genome size for you. Here's the prediction of genome size for these six test genomes: the human genome comes out to about three gigabases, which is what we expect, and the yeast genome is tiny, down here at about 12 megabases. The program will also assess the quality scores of your reads; it just plots the mean quality score by position, like when we looked at the error rate of these six data sets. You can see that the bird data set has problems at the 3' end and might need a bit of trimming there. It will predict the error rate for you as well.

Now, the question of GC bias came up a few times; I heard you discussing it in the previous module. This program will also let you measure GC bias. Here we're looking at a 2D histogram of coverage versus GC content: coverage is on the y-axis, GC content on the x-axis. This is the fish data set, and we see that coverage is relatively independent of GC content in this case. This is what we want: most of the genome covered uniformly, independent of GC content. If we look at the yeast data, we start to see a downward trend in coverage at higher GC, so by looking at these plots you can start to see whether your sequencing coverage was biased for a given run. And again we go back to this oyster data set, which is very problematic: we see that the distribution has actually split because of the high level of heterozygosity, where this part of the density comes from heterozygous positions and this part from homozygous positions.

The program will also estimate the fragment size distribution of your paired end reads, allowing you to check whether it matches what you expect from your library prep. Here again we have a variety of data sets: for the snake data, the boa constrictor, the insert size distribution is around 350 to 360 bases, and for the bird genome it's around 500 bases. And finally, the program will run a simulated assembly for you in just a few hours, rather than you having to wait multiple days to run a real assembly, to give you a measurement of how long it thinks your contigs will be as a function of the k-mer size of the de Bruijn graph. It samples paths through the graph, looks at how branched they are, and tells you how long the contigs would be. The point I was trying to get at in these last ten slides is that for this yeast genome, which is very easy to assemble, you're going to get very long contigs no matter what k-mer size you pick, but for this oyster data set, which was extremely hard to assemble, you always get fairly short contigs. So this program gives you an answer in about six hours of total runtime, and it gives you a measure of how difficult your assembly is going to be.
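As one example of how a genome-size prediction like the one above can work, here's a back-of-envelope estimate from the k-mer histogram: total solid k-mer instances divided by the coverage peak. The error cutoff and the simple mode are my simplifications; real tools fit the whole mixture distribution instead.

```python
# Estimate genome size from a k-mer histogram: the number of distinct
# genomic k-mers is roughly (total k-mer instances) / (coverage peak),
# after excluding low-count k-mers that are mostly sequencing errors.

def estimate_genome_size(hist, min_count=3):
    """hist maps k-mer multiplicity -> number of distinct k-mers."""
    total = sum(mult * n for mult, n in hist.items() if mult >= min_count)
    peak = max((m for m in hist if m >= min_count), key=lambda m: hist[m])
    return total / peak

toy_hist = {1: 5_000_000, 2: 200_000,           # error k-mers, ignored
            27: 900_000, 28: 1_000_000, 29: 900_000}
print(f"{estimate_genome_size(toy_hist):,.0f}")  # -> 2,800,000 (toy genome)
```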
If your assembly then turns out to look very poor, you can go back through these other plots and say: OK, maybe my coverage isn't good enough, maybe the error rate is too high, maybe I have too many branches in this genome. And that helps you troubleshoot your assembly. Here's the link to the code; I won't read it out since it's on your printed slides. It really takes just three commands to run: you build an index from your FASTQ data, you then run this pre-QC program, which computes all these metrics, and then you run a program that prints out a PDF report for your genome.

Okay, in the last few minutes I'm going to talk a little about long read assembly. This has become a really prominent research direction over probably the last four years, since long read sequencing has become more available. The PacBio folks really deserve a lot of credit for building this field, as maybe five years ago it was thought that the error rate of long read sequencers was too high for genome assembly. If you sequence with PacBio, the error rate will probably be between 15 and 20%, and as I said before, trying to find overlaps between reads is computationally difficult, which is only compounded by high error rates: you have to search very hard for reads that have a true overlap when there's such a high error rate in the reads. The engineers at PacBio, mainly led by Jason Chin, developed very fast ways of calculating overlaps, and they developed ways of error correcting PacBio reads such that they can produce very high quality genome assemblies. And of course the reads are much longer than Illumina reads: you can get 10,000 base reads on average, versus 100 base reads for Illumina, and this gives you a lot more information about the repeat structure of the genome than short Illumina reads do. So as I said, PacBio really developed this field. More recently, Oxford Nanopore has developed a miniaturized, USB-powered DNA sequencer with similar properties, and those reads are now starting to be used for genome assembly as well.

Now, the assembly pipeline for long reads looks very similar to the one for short reads, with a few notable exceptions. Here, instead of doing k-mer-based correction, we typically use overlap-based correction for long reads, which tends to be computationally much more expensive. The reason we don't use k-mer-based error correction is that the error rate is too high to have error-free k-mers in your reads; they're usually interrupted by an error, so you'd need to use a very short k-mer if you wanted to use k-mer-based methods. And finally, the last step is typically polishing the consensus sequence. When you take your long, error-prone reads through the assembly pipeline, the final contigs you get will often have quite a high error rate, because there are so many errors in the reads. So we do a step called polishing, where we use fairly sophisticated statistical models of how the sequencer works as a way of accurately correcting the final assembly and generating a better consensus sequence.
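To give a feel for what a consensus step is doing, here's the simplest possible version, a per-column majority vote over aligned bases; this toy is my own illustration, and real polishers like Quiver and Nanopolish replace the vote with signal-level statistical models, which is why they recover far more accurate sequence.

```python
# The simplest possible consensus: per-column majority vote over a pileup
# of aligned read bases. Real polishers model the raw sequencer signal
# instead of voting, trading compute time for much higher accuracy.
from collections import Counter

def majority_consensus(pileup_columns):
    """pileup_columns: one string per position, holding the aligned bases."""
    return "".join(Counter(col).most_common(1)[0][0] for col in pileup_columns)

print(majority_consensus(["AAAAT", "CCGCC", "TTTT"]))  # -> 'ACT'
```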
This slide has a few papers that are worth reading if you're interested in long read assembly. This paper in Nature Methods from three years ago is the one where PacBio first demonstrated that you can get a complete E. coli genome assembled into a single contig using their sequencing data. More recently, Adam Phillippy's group has done great work on speeding up the overlap computation for long reads; that paper was on bioRxiv and I think it's now out in Nature Biotechnology, so you can read it. And this is a paper from my group showing that you can do complete genome assemblies of E. coli using Oxford Nanopore data as well.

Right, so the final slide is just some recommendations for long read assemblers. Luckily, if you do have long reads, and I should caution that they are more expensive to obtain than short reads, you tend to get much higher quality data, and one of the benefits of sequencing with long reads is that the pipelines are a lot more stable and a lot easier to run. Unlike Illumina, where you have this laundry list of problems that can happen when you sequence a genome, a lot of those problems go away when you have very long reads. So I can make a recommendation of just two pipelines for long reads. For PacBio, use Canu to build contigs, followed by Quiver to polish the genome. For Oxford Nanopore data, use Canu to build contigs, followed by Nanopolish to build the consensus sequence. As I've said, the results are typically far better than what you get from short read assemblies: I think the best human genome assembly with Illumina data had contigs around 30 kb, whereas with PacBio they're around 20 megabases. You just get much, much longer contigs; they aren't really comparable. If you're doing serious genome assembly, where you want to produce a reference genome for something important, it's definitely worth paying for very long reads, just because the quality of the assembly is so much higher.

Right, so as I said, there's no tutorial, but feel free to ask me questions here now, or just email me if anything else comes up.

I haven't, but that's a very good thing to bring up. The question is whether I've tried 10x Genomics technology. This is a new way of doing library prep where you can tag very long fragments of DNA, hundreds of kb, such that you know all of the short reads came from the same 100 kb molecule through this library prep. This gives you much longer range information than mate pairs or any other sort of paired ends, and it's becoming a very popular way of getting more information out of short read sequencing. While I haven't used it, I know people are getting megabase-size assemblies using this sort of tag information, so it's a good one to look into as well. I think it's cheaper than PacBio sequencing, but the drawback is that you're still basing everything on short reads: if you have very locally repetitive sequence it will struggle to assemble that, but you will have a lot of long range information.

Yeah, so usually the traditional way of doing assembly is that you do a sweep over the k-mer sizes: you submit 50 jobs to your cluster with different k-mer sizes and then pick whichever one gives you the longest contigs. That takes a lot of time; if you're assembling a very large genome, you're going to spend hours and hours doing that. That's exactly why I wrote this program: so you can narrow it down. Rather than trying the k-mer sizes that are always giving you fairly short contigs, you might just narrow it down to the range that's giving you the best assemblies. So while it's not going to predict exactly the optimal k-mer size, it should narrow the choice down to a smaller range.

Okay, and could you just re-explain the N50 contig length? I can never understand that. Yeah, so the N50 is the length of the contig such that half of the genome is contained in contigs of that length or larger. You can think of it as a weighted median of the distribution of contig lengths.
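Since N50 comes up constantly, here's that definition turned into a small sketch.

```python
# N50: sort contig lengths in decreasing order and return the length at
# which the running total first reaches half of the total assembly size.

def n50(lengths):
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

print(n50([100, 80, 60, 40, 20]))  # total 300, half 150 -> 80
```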
Another question, about how long the reads need to be? Yeah, so when we do nanopore sequencing, we usually shoot for 8 kb reads for a bacterial genome; that works very well. At least in E. coli, the longest repeat is, I think, 6 kb, which is the ribosomal RNA. Now, if you could only sequence, say, 3,000 base pair reads, because you don't have enough DNA, it would probably still give you a better assembly than an Illumina assembly, but it wouldn't be as good as sequencing the 8 to 10 kb reads. Yeah, the reads should be longer than the longest repeat, or at least the dominant repeat. For human genomes, Alu repeats, which are about 300 bases, cause a lot of problems in the assembly graph, so if you can get pieces that are longer than that, you'd be in quite good shape. If you want a complete genome, which even for human isn't really feasible because the centromeres are so repetitive, you'd max out at, say, megabase-sized contigs if you only have 3,000 base pair reads. But yeah, it's going to be a trade-off. One thing a lot of people do, if you're sequencing something like Drosophila, where the organism is very small and you can't get a lot of DNA, is to pool multiple individuals to get more DNA. But that's generally a bad idea, because then you're sequencing a population: there's a lot more diversity, and it causes all these branching structures in the graph. So as much as possible, you want to sequence a single isolated individual.

If you know that you have a GC bias problem, can you prevent it? Typically, you'd want to try PCR-free libraries. There's always going to be the bridge amplification PCR that happens on the flow cell, so you can't get around that, but you can remove the PCR step when making the library. Of course, this requires a lot of DNA if you're not doing PCR, but that's the number one thing for minimizing GC bias. Other than that, there's not much you can do.

Yeah, so the interest in graph-based references is mainly that when we map a read to the reference genome, we're treating the reference as a linear sequence with no diversity. If that read came from something that's very different from the reference, it's not going to map very well. Say it came from some novel sequence, and there's a lot of novel sequence that's not present in the human reference; that's where it's going to go unmapped. If it has a very long insertion or deletion in it with respect to the reference, it might not map very well either. So the idea is that if we can build these reference graphs, which incorporate all that known polymorphism, a read would map to one of the alternative branches better than to just the linear sequence. Rather than losing that information, we'd be incorporating it directly into the reference. That's really the cutting edge of algorithm design right now, graph-based reference genomes. It's not yet in a state where people can use it, but it's an incredibly hot area of research. Yeah, I was just curious what the use is. Yeah, it's mainly so we don't lose anything: we want to use as much information as we can when placing reads onto the reference. We're quite good at accounting for just sequencing errors.
When it's just single-base edits, we can handle that; for larger edits like insertions and deletions, if that information is in the graph, then it becomes, as you said, a much simpler problem. But I guess that doesn't exist yet? Not yet, but it's growing, especially in the last year. There's an organization called GA4GH, the Global Alliance for Genomics and Health, and they're pushing forward the use of these graph-based algorithms. There are five or six groups actively developing methods around this idea, and everybody expects that to be the next generation of mapping and variant calling tools.

To me, that sounds like using a consensus of the data before assembly as a computational trick to make assembly easier, because the same data will be used to generate a consensus in the end anyway, right? Yeah, that's a good question. Something we've worried about a lot is the size of these assembly graphs. A human genome assembly graph can have 10 billion vertices; that's quite a lot, and it takes a long time and a lot of memory to work with. So if you correct the errors up front, it shrinks the size of the graph down and makes it much more manageable to work with. It's mainly a data reduction approach. So you do that conservatively? Yeah, exactly, you'd typically be very conservative so you don't generate a lot of artifacts. The overriding goal in assembly is: don't make errors. Because if you make an error in a reference genome, it goes into some database and then it's there forever; at least, that's how I've always designed my programs. If the genome is fragmented because you were too conservative, you can add in long reads later, fix it up, and get longer contigs, but it's typically a lot harder to fix misassemblies.

Yeah, so usually, when you pick your k-mer parameters, say here, you want to pick a k-mer size that's high enough that you're getting around a lot of repeats, but you don't want to pick a k-mer that's so high that you're no longer covering the genome in those k-mers: you need much higher coverage if you're using a long k-mer size. I don't know if that was clear, but because there's this tension between using long k-mers to resolve repeats and short k-mers to maintain high coverage, if you pick a value that's too high, you just get coverage breaks. What you can do, when you find there's a gap, is use a shorter k-mer to try to fill in that gap; if a gap is caused by a repeat, you might use a longer k-mer. It lets you adapt the parameters so you can fill in that exact sequence.
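That coverage/k trade-off follows from a read of length L containing only L - k + 1 k-mers, so the effective k-mer coverage is C_k = C(L - k + 1)/L. Here's a quick worked example; the 30x and 100 bp numbers are just illustrative.

```python
# Effective k-mer coverage shrinks as k grows: a read of length L contains
# only L - k + 1 k-mers, so C_k = C * (L - k + 1) / L. Long k-mers resolve
# repeats but need higher base coverage to avoid coverage breaks.

def kmer_coverage(base_cov, read_len, k):
    return base_cov * (read_len - k + 1) / read_len

for k in (31, 51, 71, 91):
    print(f"k={k}: {kmer_coverage(30, 100, k):.1f}x")
# k=31: 21.0x   k=51: 15.0x   k=71: 9.0x   k=91: 3.0x
```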
Yeah, that's something my group works on quite a lot: an alternative to using reference graphs is to do everything by de novo assembly and then compare the assembled sequence to the reference genome afterwards, which tends to be easier. To give you a concrete example of de novo assembly based variant calling: we try to find somatic mutations, which occur only in a tumor and not in the individual's normal genome, by comparing the reads in an assembly graph. That works very well for long indels, which are a weakness of mapping based approaches, but unless you have very high coverage you might not see the variant. So the drawback of de novo assembly based variant calling is that you need much higher coverage, and you have slightly lower power in repetitive regions because of the problems we've all talked about, these branching structures in the graph.

Yeah, so this program, which does a simulated assembly to let you pick the k-mer size, is doing a very similar thing to KmerGenie; KmerGenie is trying to predict the best k. If there are no more questions: as I said, my email is on the last slide. Feel free to email me if you have any problems with your assemblies; I'm happy to answer those.