So, this lecture is called the Fundamentals of Genome Assembly. It's meant to give you some theory about how genome assemblers work under the hood, rather than treating them as black boxes that take sequencing data and spit genomes out. The idea, as I said yesterday, is to give you enough of a picture of how assemblers work that if something goes wrong, you can figure out ways to improve your assembly. The first half will be theory: how the assemblers work, and the kinds of data structures and algorithms we use. Later we'll get to more practical matters: which features of genomes make assembly difficult. Assembly is notoriously one of the harder challenges in bioinformatics, and we'll talk about why that is.

Over the last day and a half, we've built up a picture of what de novo assembly is. We take a genome and sequence it by fragmenting it into many, many pieces. We sequence those individual pieces, and the genome assembly problem is then taking those pieces and trying to reverse the process, reconstructing what the original sequence was. Unlike when we have a reference genome, we have to do this in the absence of any other information. All we have is the read sequences themselves, and we need to figure out how they go back together.

Now, in this cartoon of genome assembly, I've drawn some regions of the genome as red boxes. These stand in for genomic repeats, which are really what makes genome assembly challenging. We have three red boxes at three places in our genome, and when we chop the genome up into sequencing reads, we don't know where the red reads came from: they could have come from any one of the three copies of the repeat. So working out the order of this genome is confounded by the fact that there are three identical sequences. If genomes were completely unique strings, with no repeats at all, then even with hundred base pair reads we could completely resolve the genome and assemble it end to end; it would be really simple and really efficient. It's the repetitive regions that cause most of the problems, and that's why this is such an interesting problem to study.

Okay, so in the first part I'm going to talk about assembly graphs and data structures; that's the nuts and bolts of how genome assemblers work. Then I'll go through an example assembly pipeline; just like the SNP calling pipelines Matthew showed, assembly typically runs as a series of independent stages, and I'll talk you through each one. Then, as I said, I'll cover what makes assembly difficult, the genomic features that complicate my life, and finally a little bit about how we now do genome assembly using long reads, which helps with some of these very difficult genomes.

So the key data structure we use in genome assembly is called an assembly graph. When we sequence a genome, we have a huge collection of reads, up to say a billion reads for an Illumina sequencing run, and we don't know how they go back together. We need to infer that.
The way we do this is to compare reads to each other, looking for reads that share some common sequence. If they do, we add them to the assembly graph: each sequence is a node in the graph, and two nodes are connected by an edge if they share some amount of common sequence. The assembler then tries to find a walk through the graph that is consistent with all the relationships we inferred between these pairs of reads, and that walk gives us the final assembled sequence. I'll come back to this graph a little later; it's colored in a special way, but I'm not going to reveal why just yet.

All right, so when you work with genome assemblers, you'll hear about two different types of graph. The dominant way of doing genome assembly in the early days of bioinformatics, the Sanger sequencing era, used a data structure called an overlap graph or string graph. Here you compare the ends of two reads to see if they share common sequence, and the classic assembly algorithms use this idea: look for overlaps between pairs of reads, put them into one of these graphs, then try to find paths that reconstruct the genome sequence. These algorithms didn't scale well once short-read sequencing arrived, so the field switched to a data structure called the de Bruijn graph, where we break the reads into short subsequences called k-mers, construct a graph of all the k-mer sequences and their relationships to each other, and do the assembly that way. The reason the de Bruijn graph became so popular for short-read sequencing is that it's extremely efficient to construct: you just put your k-mer sequences into a hash table and run queries over that hash table to build the graph. That's unlike string or overlap-based assembly, where you have to take all pairs of reads, compare them to see if they have a suffix-prefix overlap, and then put them into the graph. De Bruijn graphs can be constructed in time linear in the size of your data set, whereas overlap graphs take time quadratic in the size of your data set, and that scaling makes many overlap-based methods infeasible on Illumina-scale data sets.

Here's an example of what an assembly graph looks like. We have a single sequencing read with the sequence ACGAT and so on. To construct the assembly graph, we add reads into the graph one at a time as vertices. With one read, we add a vertex we'll call v1. A second read, read2, comes in and is represented by the vertex v2. Now we notice there's an overlap between read1 and read2: the suffix of read1, starting from ATGCA and running to the end, exactly matches the prefix of read2. Because of that overlap, we connect the two vertices with an edge, and we label that edge with the sequence TAC, the overhanging part of read2, that is, the part of read2 that read1 doesn't cover. If we sequence a third read, we add it as vertex v3, and we see it overlaps both read1 and read2: the prefix of read3 matches the suffix of both of those reads. So we put edges into the graph between v2 and v3 and between v1 and v3.
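To make that suffix-prefix comparison concrete, here's a minimal sketch of the naive all-pairs approach (the function names and the `min_len` cutoff are my own illustration, not any particular assembler's code). The double loop is exactly the quadratic cost just mentioned; real overlap assemblers use string indexes to avoid comparing every pair.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (at least `min_len` bases), or 0 if there is none."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads: dict, min_len: int = 3) -> list:
    """Naive all-pairs overlap graph: edges are (source, target, overhang),
    where the overhang is the part of the target not covered by the overlap."""
    edges = []
    for name_a, a in reads.items():
        for name_b, b in reads.items():
            if name_a == name_b:
                continue
            olen = suffix_prefix_overlap(a, b, min_len)
            if olen > 0:
                edges.append((name_a, name_b, b[olen:]))
    return edges

# e.g. overlap_graph({"read1": "GACCTACA", "read2": "CTACAAGT"}) records an
# edge read1 -> read2 with overhang "AGT" (a 5-base suffix-prefix overlap).
```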
Those new edges get labeled with their overhanging sequences as well. Now we sequence another read, read4, and add it as v4. It overlaps everything that came before it, so we add a bunch more edges to the graph. What the assembler wants to do now is find a path through this graph that reconstructs the original sequence, the sequence of the genome. This is a simple example: there are only four reads in our collection, and there's a single path, v1, v2, v3, v4, that spells out our original genome. The way you spell it is to take the sequence at the start of the path, which is read1's sequence, and then concatenate all the edge labels along the path. That's the sequence of our genome in this very simple example. For real data sets, these graphs are much bigger.

Yeah, two questions in the back. Who wants to go first? [Question about the edge labels.] So this is the most general form of an assembly graph; in a de Bruijn graph the edge label would always be a single base, but in this general string graph it's whatever sequence isn't matched by the overlap. Between v3 and v4, the unmatched sequence is AATGAAC, and that's what you label the edge with: everything that isn't part of the common sequence. Does that make sense? Yeah. Did you have one as well? [Question about spelling out the genome.] Exactly, you take the first read's sequence and concatenate TAC, CGT, AATGAAC, and that's the sequence of your genome. Yep.

Are the edges directed? Yes, they are. I didn't show it in this example, but they're implicitly directed: you direct the edge from the read contributing the suffix to the read contributing the prefix. You can also direct them the other way; usually, when we implement these graphs, we'd also have an edge going from v4 to v3, labeled with the overhanging part of the start of read3. The reason we do that is that the graph is actually bidirected, because a sequencing read can come from either strand, and we have to account for reverse complements. It gets really complicated; I'd have to draw four different edges coming out of every node, so for this example I've drawn it simply, but in practice we use bidirected graphs.

Okay, moving into the more practical part, here's what an assembly pipeline looks like. Just as for alignment-based analysis, we do some read trimming or read filtering up front. We then error-correct the data, build the assembly graph, clean artifacts from the graph, build contigs by finding walks through the graph, and then build a higher-order structure called scaffolds and try to fill in the gaps between their contigs. We'll go through each of those steps in detail now.

The first, optional, step is read trimming and filtering. The most important thing, as Matthew was saying for alignment-based methods, is to get rid of adapter sequences before you assemble. The reason is that if adapters are left in the reads, they look like an extremely high copy number repeat, and that confuses the assembly graph: you get a huge sink in the graph where edges from everywhere converge on the repeated adapter substring.
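As a quick sanity check before reaching for a trimmer, you can simply count how many reads contain a known adapter sequence. A minimal sketch (the file name is hypothetical, and AGATCGGAAGAGC is the common start of the Illumina TruSeq adapters; substitute whatever your library prep uses):

```python
import gzip

# Common start of the Illumina TruSeq adapter; adjust for your library prep.
ADAPTER_SEED = "AGATCGGAAGAGC"

def adapter_fraction(fastq_gz: str, seed: str = ADAPTER_SEED) -> float:
    """Fraction of reads containing the adapter seed anywhere in the read."""
    total = hits = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:          # the sequence line of each FASTQ record
                total += 1
                if seed in line:
                    hits += 1
    return hits / total if total else 0.0

# e.g. print(adapter_fraction("reads_1.fastq.gz"))  # hypothetical file
```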
Usually you don't get much adapter contamination and you don't have to do this; when I've done assemblies, I've usually not trimmed adapters. But if your assembly fails drastically and you think you may have adapter contamination, you'd run one of these programs, like Trimmomatic, or one I quite like from the EBI called Kraken, which will automatically determine whether there's adapter contamination and get rid of it. So this is a step you perform if you either suspect adapter contamination or your assembly has gone wrong in some way.

Error correction is the next step, and it's quite an important one. Something that makes assembly computationally difficult is that these assembly graphs can become extremely large. If you build an overlap-type graph, you have a vertex for every one of your sequencing reads, and an Illumina sequencing run usually contains about a billion reads, so your graph will have around a billion vertices. Every read will be connected to something like 30 other reads, so you might end up with a graph of one to three billion vertices and say 30 billion edges. That's extremely large; you'll run out of memory, and you won't be able to run it unless you use an incredibly expensive computer. So we want to reduce the size of the graph by removing as many sequencing errors from the data set as possible, and we use approaches called k-mer-based error correction, which I'll describe now.

This is the standard error profile for Illumina sequencing data. These are six different genomes that were sequenced to be assembled in a project called the Assemblathon, and this is the error profile for those six sequencing runs. Most of the reads are between 100 and 150 bases long. At the beginning of the reads, the error rate is quite low, less than about one in a thousand, but towards the 3' end of the read it typically rises to over 1%. That's for the reasons we discussed last lecture: phasing and pre-phasing problems mix the signal and accumulate errors at the 3' end of the read.

Now, when we sequence a genome for de novo assembly, we typically oversequence it: we sample it redundantly so that we have a lot of coverage, usually aiming for 30 to 50x. That means we should see most subsequences of the genome in many different reads, and we can use this redundancy, this oversampling of the genome, to correct errors. The way we do it is to count the number of times each k-mer within each read is present across your entire data set. Here we're using short k-mers of around 25 bases, and we're showing the counts of these k-mers across the whole data set. This read has a single sequencing error, at the red C. The k-mers that don't contain the error should be seen roughly as many times as your sequencing coverage, 30 or 40 times if you've sampled the genome redundantly.
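Here's a minimal sketch of that counting idea (ignoring reverse complements and the memory-efficient hashing a real k-mer counter would use; k = 25 as in the lecture):

```python
from collections import Counter

def kmer_counts(reads, k=25):
    """Count every k-mer across the whole read set (a real counter would
    also fold in reverse complements)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def coverage_profile(read, counts, k=25):
    """Count of each k-mer in `read`, left to right. A sudden drop to ~1
    flags the windows that contain a likely sequencing error."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
```

Plotting `coverage_profile` for a read reproduces the profile on the slide: a plateau near the sequencing coverage, with a dip wherever the k-mer window contains an error.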
When you get to a k-mer that contains the sequencing error, though, that error turns a genomic k-mer you should see 30 or 40 times into a new, rare k-mer you'll only see once, because sequencing errors are typically random: flip any base in the genome and you're probably not going to see that same sequence again. So we scan from left to right, counting how many times each k-mer is seen, and we see a drop in k-mer coverage when we hit a sequencing error.

Here's what the k-mer count profile looks like for a real read. Take the first k-mer, at k-mer index zero, and count how many times it's seen across the entire data set: about 27 times. The second k-mer: 28 times. We do that for every k-mer, in a sliding window across the read, until we hit a point where the k-mer coverage drops to around one, and we can infer there's a sequencing error at the position where the count dropped. We then search for alternative base calls at that position: say there was a C there, we check whether an A fixes the k-mers, whether a G does, whether a T does. If one of them does, by restoring the k-mer coverage up to the 25-to-30 range we expect, we just change that base to what we think the fix is.

There's a lot of software that does k-mer-based error correction. One of the first based on this principle was Quake; there's one in an assembler I wrote, SGA; and SOAPdenovo, BFC, BLESS, Lighter, and Musket are all published, fairly well-performing k-mer-based error correctors that can be used for this purpose.

[Question, asking to go back a slide.] So when we do this part, the red line here is the k-mer profile after we've performed the correction, after we figured out there was supposed to be a G there. [Question about the profile recovering.] Right: if the position of the error is more than a k-mer length away from the end of the read, the window slides past the error and back into sequence that's truly in the genome, so the count jumps back up. We can actually infer the position of the error from that: say it's a 25-mer, then the error is probably at the 74th base of the read. Usually errors are at the very end of the read, so the profile just drops and we never see it recover; but in this case the error was a little further from the end, so we could figure out exactly where it was.

All right, there's also a class of error correctors called overlap-based correctors. Just as when constructing an overlap graph, you take all pairs of reads, find suffix-prefix matches between them, detect whether there are differences, and correct using multiple sequence alignments. Those are typically too slow to run on very large genomes, say if we're trying to assemble human genomes, so most people in the field just use the very efficient k-mer-based methods. Of the list above, BFC is probably the best one. It was written by Heng Li, who wrote BWA-MEM and the BWA family of short-read aligners. It's particularly memory-efficient and runs quickly, so it can be used on very large data sets.

Okay, next up in the pipeline is graph construction. We've talked a little about this already, and about what string or overlap graphs look like. Here's what a de Bruijn graph looks like.
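Since the appeal of the de Bruijn graph is how mechanically it falls out of a hash table, here's a minimal sketch of the construction before we walk through the slide's example (illustrative code, not any particular assembler's):

```python
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Nodes are the k-mers of the reads; an edge joins two k-mers that
    sit at adjacent positions in some read (i.e. overlap by k-1 bases).
    One pass over the reads: linear time, no pairwise read comparisons."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph
```

Querying the hash table replaces the all-pairs overlap computation, which is where the linear-versus-quadratic difference from earlier comes from.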
A de Bruijn graph is defined over the spectrum of k-mers in a sequence collection. In this case we're using 4-mers, four base pair subsequences, across this set of five sequencing reads, each six bases long. To construct the de Bruijn graph, we take all 4-mers in our reads. The first 4-mer of the first read is CCGT, and that becomes a vertex in our graph. We slide our window one base over, take CGTT, and that becomes a node, and so on. We take all those four base pair subsequences, put them in as nodes, and connect two k-mers by an edge if they appear at adjacent positions in a read. Here CCGT appears before CGTT in the first read, so we connect them by an edge.

Now, in this example we have a repeat: the 4-mer CGTT appears in two different locations and is followed by a different sequence each time. In one read it's followed by GTTA, and in another by GTTC, and that causes our graph to branch, because this k-mer has an ambiguous extension. In this case the graph is still quite simple, and there's a single path that goes through the graph and visits each one of our sequences, so it can be assembled uniquely. That path goes through here, down this lower branch, back around to this one, and then back through here. So we still get a unique reconstruction of our genome despite the repeated vertex in the graph.

Now, another word about memory usage. As I said, genome assembly is a notoriously memory-hungry procedure. The early genome assemblers for Illumina data required almost a terabyte of memory to assemble a human genome. Something I was quite heavily involved in during my PhD, and a little afterwards, was coming up with more succinct representations of genome graphs. Using better data structures, things like probabilistic data structures and compressed data structures, we can now fit a human genome assembly into around 10 gigabytes of memory rather than a terabyte. That's about a two-orders-of-magnitude decrease in the memory it takes to do a human genome assembly, and it's why things Matthew talked about, like using assembly for structural variant detection, are now practical.

Okay, so once we've constructed our assembly graph, we need to clean it of artifacts, things like sequencing errors or heterozygosity, that affect our ability to find paths through the graph that reconstruct the genome sequence. We call this the graph cleaning step. When we perform error correction, we get rid of most errors in our data set, but we can't get rid of all of them; k-mer-based error correction doesn't work in every case, so we can have artifacts in the graph, branches that don't go anywhere and just terminate after a short distance. We call these tips. Say we have a de Bruijn graph built from three reads, where read2 and read3 have sequencing errors near their ends, at the red bases. Because sequencing errors create unique new k-mers that we see nowhere else, they create branches in our graph that just don't go anywhere. One splits off for two k-mers and then stops, and this one, from the sequencing error in read3, splits off for a few k-mers and then ends here.
The fact that it has this disconnected structure, with no extension in either direction, is why we call it a tip of the graph. We also get artifacts in the graph called bubbles, structures where the graph splits and then rejoins a short distance later. The classic source of bubbles is heterozygous SNPs when you're assembling a diploid genome: for all of the k-mers covering the heterozygous position, you get one path through one allele and another path through the other allele. All of the sequence to the left is common between the two haplotypes; then one path follows the C allele in this example and the other follows the G allele; and when you rejoin the common sequence, the bubble closes. Usually an assembler will look at these structures, find the bubbles, and remove one half to give a linear structure in the graph that's easier to traverse. But a lot of people have built de novo assembly-based variant callers, or reference-free variant callers, which search for exactly these structures in the graph as a way of finding SNPs or indels without having to align your reads to a reference genome. So this is a very important source of information if you're working with, say, an organism that has a lower-quality reference genome: you can look directly in the assembly graph to find sequence variation.

Here's a slightly more realistic example of what an assembly graph looks like than what we saw before. After all of these steps, adapter trimming, error correction, and building the graph, you might have an assembly graph that looks like this. This is quite complicated, and we're not going to find a single path here that reconstructs our genome.

[Question: how many nodes does a bubble contain?] Sure. It depends on the size of the k-mer you use. For a single heterozygous SNP, there will be exactly k k-mers covering that position: the first k-mer has the difference in its last base, then you slide over and it's in the second-to-last base, and you can slide over k positions until the difference is in the first base; the next k-mer won't cover the variant at all. So if we see a bubble in the assembly graph containing exactly k nodes, we know it's probably a SNP. For indels it might be shifted a little because of the difference in allele size, but it's usually around k as well.

[Question: can sequencing errors create bubbles too?] That's right, you can get these bubbles if a sequencing error is right in the middle of a read. Usually you annotate the bubble with how many times each k-mer occurred, and, as we'll come to in a bit, if the two sides are roughly balanced, you classify it as a heterozygous SNP; if one half of the bubble was seen in only a single read, you'd say it's probably just a sequencing error.

Okay, so once we've loaded this assembly graph, we want to clean it up, applying tip-clipping and bubble-popping algorithms to give us a structure that's more manageable. The way the assembler does this is to first look for all the end points of the graph, vertices that are connected in only one direction; I've highlighted these in red. Then you recursively walk backwards from each tip you found until it rejoins the graph.
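In code, that walk-back might look something like this sketch (a simplified directed version; real assemblers work on bidirected graphs and use coverage thresholds too, and the `max_tip_len` cutoff here is purely illustrative):

```python
from collections import defaultdict

def clip_tips(graph, max_tip_len=4):
    """Sketch of tip clipping. `graph` maps every node to its set of
    successor nodes (empty set for a dead end). Returns the nodes that
    form short dead-end paths hanging off a branch point; the symmetric
    case (dead ends with no predecessors) is handled the same way in
    the other direction."""
    preds = defaultdict(set)
    for node, succs in graph.items():
        for s in succs:
            preds[s].add(node)

    to_remove = set()
    for node in graph:
        if graph[node]:                      # has successors: not a dead end
            continue
        path, cur = [node], node
        while len(path) <= max_tip_len and len(preds[cur]) == 1:
            (cur,) = preds[cur]
            if len(graph.get(cur, ())) > 1:  # rejoined the graph at a branch
                to_remove.update(path)       # clip everything past the branch
                break
            path.append(cur)
    return to_remove
```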
In the slide example, we found two tips here. We walk backwards, mark this node as a tip extension, and then when we reach this node, we see it has an alternative extension to another base, so it's no longer part of the tip, and we get rid of these three nodes. Bubble popping works in a similar way: we find branch points in the graph, like here, and then traverse both halves of the path until they come back together. If they successfully rejoin after around k vertices, which is what we expect, we collapse them down to a single path, usually recording the alternative allele in a file that can be used later to analyze where the variants in the genome were.

Once we've cleaned up the graph, it's a much more manageable structure than what I showed four slides ago, and we want to perform contig assembly. The problem of finding a single path through the graph that reconstructs our entire genome is too hard; computationally, you can't do it. So instead we just find the unambiguous stretches of the graph, where we know we can compact all these sequences into a stretch of the genome that won't contain a misassembly. We can compact these four nodes together until we reach this branch point, then compact all of these together, and these, and so on, merging every run of vertices that contains no branch into what we call contigs. (There's a small sketch of this compaction step below.)

Now, how long are contigs? If you assemble a simple bacterial genome, like we're going to do in the practical session after the break, your contigs from short-read sequencing will be around 100 kb in length. If you assemble a very large genome like a human genome, they'll typically be between 10,000 and 20,000 bases. That's for short reads. If you use long reads, PacBio or Oxford Nanopore reads, you can typically finish, or close, bacterial genomes, assembling them into a single piece, and for human genomes the best long-read assemblies give you contigs in the range of maybe four to five megabases. Much more contiguous, because long reads span many repeats that short reads can't.

[Question: why aren't these structures treated as bubbles?] This one? Yeah, the reason those are not also considered bubbles is that their size goes beyond what we expect. Usually when you try to collapse a bubble, you assemble the sequence of each side and compare them to each other. If there's just one base of difference, you pop it and say it's probably variation. If the differences are much larger, say a few indels, you'd say this may be two copies of a repeat, and you keep them separate rather than collapsing them. It's easy to do these heuristic checks to make sure you're not collapsing repeats.

Okay, so there are a few reasons that contigs stop growing. Running into these unresolvable repeats is one of them; if your coverage is too low, so that part of the genome simply wasn't sequenced to enough depth, that will also stop a contig growing. To get over these problems, we build higher-order structures called scaffolds. A scaffold is just a set of contigs that we think are in some order along the genome, but we don't know the sequence in between them. And the way we build scaffolds is to use the read pairs we heard about in the structural variation lecture to link contigs together.
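Here's the compaction sketch promised above: follow each unambiguous run of k-mers and concatenate one base per step (again a simplified directed version with illustrative names):

```python
from collections import defaultdict

def build_contigs(graph):
    """`graph` maps each k-mer to its set of successor k-mers (overlapping
    by k-1 bases). A contig starts wherever a path can't be extended
    unambiguously to the left, and grows while each node has exactly one
    successor whose only predecessor is that node."""
    preds = defaultdict(set)
    for node, succs in graph.items():
        for s in succs:
            preds[s].add(node)

    def is_start(node):
        if len(preds[node]) != 1:
            return True
        (p,) = preds[node]
        return len(graph.get(p, ())) > 1     # predecessor is a branch point

    contigs = []
    for node in graph:
        if not is_start(node):
            continue
        contig, cur, seen = node, node, {node}
        while len(graph.get(cur, ())) == 1:
            (nxt,) = graph[cur]
            if len(preds[nxt]) != 1 or nxt in seen:
                break                        # ambiguity or a cycle: stop here
            contig += nxt[-1]                # adjacent k-mers share k-1 bases
            seen.add(nxt)
            cur = nxt
        contigs.append(contig)
    return contigs
```

Now, back to scaffolding and those read pairs.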
We don't know the sequence in between the contigs, so what we do is take all the read pairs in our data set, align them to our contigs, and look for contigs with a consistent pattern where one half of a read pair aligns to the end of one contig and the other half aligns to the end of another contig. Here I've colored the read pairs by which contigs they link: the read pairs for this contig go to the beginning of this contig, this one goes to the beginning of this one, and so on. We can then build another type of graph from these read-pair relationships, which we call a scaffold graph; it's just a way of visualizing how the contigs might be arranged. We put an edge from the first contig to the second using the blue read pairs, from the second to the third using the red, and so on. And since we know the insert size distribution of the read pairs, roughly a Gaussian with, say, a 400 base pair mean, we can use that information to estimate the distance between two contigs along the genome. When we build our scaffolds, we fill the gaps between the contigs with N characters, which stand for any of the four nucleotides, putting in a number of Ns based on the gap size we estimated. We can then try local assembly, which Matthew mentioned: you pull in all the reads in that neighborhood and assemble locally to fill in the gap. There are a few programs for this: one is SGA's gap filler, another is GapCloser from the SOAPdenovo package. Or you can fill in the gaps using another sequencing technology: if you have PacBio data, which gives much longer-range information, and scaffolds from Illumina data, you can fill those gaps with the PacBio data.

Okay, now I'm going to talk about some of the factors that make assembly difficult. I was involved in an assembly benchmarking project called Assemblathon 2, in which the organizers sequenced three different, very large genomes: a snake, a boa constrictor; a cichlid, one of those highly variable fish from the African Great Lakes; and a parakeet. None had been sequenced before, so we knew nothing about their genomes. The organizers released the data to the community, and those of us who develop assembly software ran our programs, submitted the assemblies, and the organizers used orthogonal data like long reads and optical maps to score how well the different assemblies performed, as a way of getting an unbiased benchmark of the state of the art in genome assembly.

Here's a nice quote from the main organizer of the Assemblathon: based on the findings of Assemblathon 2, we make a few broad suggestions to someone looking to perform a de novo assembly of a large eukaryotic genome. First, don't trust the results of a single assembly. If possible, generate several assemblies with different assemblers and/or different assembler parameters. Some of the best assemblies entered for Assemblathon 2 were the evaluation assemblies rather than the competition entries. The takeaway from this is that there's a lot of variability in assembly.
Different users of the same software can get different results just based on the parameters they chose for the assembly, like the k-mer size or how they performed error correction. Results also varied across species: the fish genome was very repetitive and very heterozygous, and very few of the entries performed well on it; entries that did well on the fish didn't necessarily do well on the snake genome. So there was a lot of variability, and the recommendation was to run multiple different programs on your data and then select the best assembly for your biological question.

After this project, I got thinking about what features of genomes make a de novo assembly difficult, and here's a non-exhaustive list of what I came up with. The main one people think about is the repeat content of the genome: lots of transposons or a lot of segmental duplication cause branches in the graph that are difficult to resolve. If your genome is highly heterozygous, with a very high density of SNPs, that causes a lot of those bubble structures in the graph, which can be difficult to resolve and lower your contig lengths. If you don't get enough sequencing coverage — we aim for 30 to 50x, but sometimes you can only afford 20x — that lowers the quality of your assembly. If your sequencing coverage is biased with respect to AT-rich or GC-rich sequence, that can break the assembly simply by failing to cover those regions; a classic example of a fairly small genome that's difficult to assemble is the malaria parasite, Plasmodium, which is about 80% AT: it's very difficult to sequence, and an assembly of it will almost certainly fail. A high error rate in your reads matters too, of course: you want your reads as correct as possible. Chimeric reads, where a read contains sequence from different parts of the genome, cause problems, as do leftover sequencing adapters and sample contamination. And if you have an organism where you can't get enough DNA from a single individual, so you extract DNA from multiple individuals, you're effectively sequencing a population. That causes a lot of problems; it's very commonly done when sequencing very small organisms, but it is definitely a factor that impacts the quality of your assembly.

Now I'm going to come back to the assembly graph I showed at the beginning. We've talked about different features of the assembly graph: what sequencing errors look like, what heterozygous SNPs and indels look like, and here's a depiction of what repeats look like. We want to measure how often these structures occur and use that to predict whether an assembly is going to be easy or not. So my group has worked on a probabilistic model for traversing the assembly graph, finding these structures, and estimating the heterozygosity of the genome or its repeat rate, as a way of predicting whether a given assembly is going to be hard. Our probabilistic model, which I'll only very briefly cover, is based on this idea of measuring k-mer coverage:
the number of times each individual k-mer is seen in your data set, used to make predictions about branch rates. The most basic result is the histogram of k-mer counts in the data set. This is a human genome data set sequenced to quite high coverage, about 40 to 50x, and it's a histogram of how many times each 51-mer, each k-mer of size 51, occurs within the data set. It's a very nice distribution, essentially Gaussian. There's a feature at the left, where k-mers with count one make up about 6% of the data set: those are the k-mers containing sequencing errors, the ones we try to correct, which are seen very few times. All the k-mers in the main bulk are just genomic sequence seen 30 to 50 times in the data set.

Now, there's a classic genome I use, which is extremely hard to assemble, as a cautionary tale for what you don't want your sample to look like. Here's the k-mer count distribution for the oyster genome. It's quite clearly bimodal: there's a peak here and a peak here, and the secondary peak has roughly half the depth of the main peak. The oyster genome is interesting in that its heterozygosity rate is around one in a hundred bases; the human genome is about one in a thousand, give or take, so the oyster is about an order of magnitude more heterozygous than that. All the k-mers coming from this part of the distribution are seen on just one of the two haplotypes, which is why the peak sits at half the depth of the main peak; the other k-mers are seen on both haplotypes. This level of heterozygosity is much too high to assemble with Illumina reads, and when people tried to assemble the oyster genome, it essentially failed because of it.

So what we try to do is measure these branch rates to assess whether an assembly is going to be difficult. Here's the statistical model: it essentially derives how many times you should see each of the two halves of a branch depending on whether it's caused by a sequencing error, a heterozygous SNP, or a repeat. Here's all the math; I'm not going to go through it, but essentially we can infer how often a graph branch is due to heterozygosity versus repeats using this model.

Here's a histogram, of the kind we'll be looking at in the practical session, of the frequency of what we think are heterozygous branches in six different genome data sets. At the top, in yellow, is the oyster genome: the frequency of variant branches is about one in a hundred, which is what we expect given that its heterozygosity is about one in a hundred bases. At the other end is a human genome data set, where a branch occurs about every thousand bases, matching how heterozygous we expect a human genome to be. The three Assemblathon data sets, the bird, the snake, and the cichlid, fall between those two extremes. Just by looking at this, we can say that the bird genome may be a bit harder to assemble than the others, because it has much higher heterozygosity by this branch rate analysis. And this one down at the bottom is a haploid yeast sample, S. cerevisiae: its predicted branch rate is very low, and we expect very few branches to be called as heterozygous SNPs, because it's haploid.
That haploid sample is essentially included here as a control, to estimate the error rate of the branch classification. We can do the same for repeats, and we see a similar picture: the oyster genome is the most repetitive, along with the human genome, which matches our expectation, since the human genome is fairly difficult to assemble, as is the oyster; the three Assemblathon genomes come in lower down. And again the yeast genome, which is fairly small at only 12 megabases, has an infrequent repeat branch rate as well.

The program will also predict the genome size for you: you can take that k-mer count histogram and do a pretty simple calculation to get the genome size (a sketch of that calculation follows below). It also calculates QC-style metrics, like the ones we saw with FastQC, such as the variation in quality score along the length of the reads, as a way of measuring base quality. It will also output the error rate plot we saw earlier when discussing error correction, showing the error rate as a function of position along the read. Again, this can flag samples that may be more difficult to assemble: this one in green is looking a bit high on error rate, and it's the fish genome from the Assemblathon, which did prove to be a pretty difficult sample to assemble. [Question: could that be run-to-run variation?] Yes, there is run-to-run variability in sequencing error rates, so this might just have been a lower-quality sequencing run. With caveats, you'd expect the sequencing error rate to be roughly the same across species, so I'd say this is more likely run-to-run variation. Genomes with really high AT or GC content can cause difficulties too, but here it's probably just the run. [Question about the oyster.] Yeah, this is the error rate in the sequencing reads, so the oyster genome actually gets a check mark for this QC: it had a very clean data set, but the genome itself is very difficult because it's heterozygous and repetitive. It's really the sum of all these different metrics that determines whether a genome is going to be difficult to assemble.

We also calculate whether the coverage of the genome is GC-biased. This is a 2D histogram of the k-mer coverage, like the 1D histogram we just looked at, but as a function of the GC content of the read. This one is pretty unbiased; there might be slightly lower coverage as GC increases, but it's pretty good. Here's an example of what a not-so-good one looks like, much more biased: this is a yeast data set, which is easy to assemble, but as the GC content increases, those bases are underrepresented. And here's what the oyster looks like: the histogram has these two blobs. Anybody want to guess why? Exactly. It's the heterozygosity again: this density is from k-mers that occur on one haplotype, and this is from k-mers that occur on both haplotypes. Bad, bad, bad; we don't want to see that.

The program will also estimate the insert (fragment) size histogram from your data, by finding paths through the graph that link up read pairs. This plot has a few different things going on: the snake data has an insert size distribution around 350 bases; in blue is our yeast data set; purple is the bird data set. This gives you a picture of whether your insert sizes match your expectation based on how your sequencing library was prepared.
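That "pretty simple calculation" is, in its most basic haploid form, total k-mer occurrences divided by the coverage peak. A sketch (my own illustration; published tools fit proper mixture models rather than taking the raw mode):

```python
def genome_size_estimate(hist, error_cutoff=3):
    """hist[i] = number of distinct k-mers seen exactly i times.
    Genome size ~ (total k-mer occurrences) / (homozygous coverage peak),
    after dropping the error spike at the low end of the histogram."""
    solid = {i: n for i, n in hist.items() if i >= error_cutoff}
    total = sum(i * n for i, n in solid.items())
    peak = max(solid, key=solid.get)       # mode of the histogram ~ coverage
    return total / peak
```

For a histogram like the human one above, dividing total k-mer occurrences by the 40-to-50x peak recovers a genome size on the order of gigabases; a bimodal histogram like the oyster's needs the heterozygous peak handled separately.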
And finally, the program will perform a simulated assembly. It won't assemble the whole genome; it runs a simulation of the assembly, which is much faster, to show you what your contig lengths would be as a function of the k-mer size parameter used for the de Bruijn graph assembly. In the simulations, look at the yeast data set first: across the whole range of k-mer sizes, it has the longest contig lengths. That's exactly what we expect: it's not heterozygous and it has very low repeat content, so we expect very good contig lengths from that genome. Conversely, look at the oyster data set, which I've gone on and on about being difficult to assemble: no matter what k we pick, we get very short contigs, only around 5,000 base pairs at maximum. So the simulation gives you some insight into whether a genome is going to be easy to assemble or not.

There are some interesting things here. For the yeast data set, the contig lengths stop increasing at around k = 50 and then decrease. That's due to not having enough coverage to use really long k-mers: as you increase the k-mer length, you need higher coverage to make sure you've sampled all the distinct k-mers in your genome, so that the graph stays well connected. Some samples, like the snake here, were sequenced extremely deeply, and there we can keep increasing the k-mer size and keep increasing our contig lengths; but only if you sequence to, say, 100x can you use those extremely long k-mers.

All right, this program is relatively easy to run: you just run three commands. We're not going to run it in the practical, because it takes around an hour on the example data sets, and we wouldn't have enough time for the rest of the steps. But I'm providing the PDF report as a link from the tutorial web page, so you can look at the report for the data sets we'll be assembling, explore these different metrics, and answer the few questions I put there. It's all on GitHub; there's a link to the preprint, though it was published about three years ago, so I really should update this slide.

All right, finally, I'm going to talk about long-read assembly. I mentioned yesterday, when talking about sequencing technologies, that the field is starting to move from short-read assemblies, which are fragmented and struggle to resolve repetitive regions, to using longer reads. The two dominant long-read platforms are the PacBio sequencers and the Oxford Nanopore sequencers. Both give much longer read lengths, over 10,000 bases compared to around 100 bases for Illumina sequencers, but the error rate is much higher: between 15% and 20%, versus about 1% for Illumina sequencing. The pipeline for assembling long reads looks quite similar: some preprocessing, read trimming or filtering; error correction; constructing an assembly graph; cleaning it of artifacts; assembling contigs; and then a new step called assembly polishing, which I'll come to. The difference here is that the error correction is usually not k-mer-based; it's based on finding suffix-prefix overlaps between pairs of reads using dynamic programming. Because we're using overlaps, and because of the very high error rate we have to account for, this tends to be computationally very expensive.
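Just to put rough numbers on "expensive" (illustrative figures only, not measurements from any particular tool):

```python
# All-pairs overlap detection scales quadratically in the number of reads.
n_long = 5_000_000                     # plausible long-read count, ~30x human
pairs = n_long * (n_long - 1) // 2     # ~1.25e13 candidate pairs
print(f"{pairs:.2e} pairwise comparisons before any filtering")
# Each comparison is a dynamic-programming alignment over reads that are
# thousands of bases long and 15-20% divergent, so practical overlappers
# first use shared-k-mer indexes to whittle the candidate pairs down.
```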
So long-read assembly typically takes longer than short-read assembly, because of this expensive error correction step. Another difference is that the graph construction doesn't use de Bruijn graphs. The de Bruijn paradigm of assembly requires short k-mers, and the k-mers have to be exactly correct; they can't contain errors. I showed you in the assembly simulation that a lot of the assemblers performed well at around k = 51 to 61, and it's very unlikely that a PacBio or Nanopore read with a 15% error rate contains 51-mers free of sequencing errors. So we can't use the k-mer-based methods; we have to use overlaps.

The final stage of the assembly pipeline for long reads is consensus polishing. This is where we use a fairly sophisticated probabilistic model of how the sequencer operates to infer a better consensus sequence, meaning fewer errors in the final genome sequence, from the raw measurements coming off the sequencing platform. For example, yesterday I talked about how we use hidden Markov models to analyze Oxford Nanopore sequencing data; those hidden Markov models are used to calculate the final genome sequence when doing an Oxford Nanopore assembly. The downside is that, just like overlap detection for error correction, they're incredibly expensive to run: running a hidden Markov model across a whole human genome probably takes a few hundred thousand CPU hours. So you have to pay for this extremely expensive consensus process when doing genome assembly from long reads. In the practical, we're going to look at a way of using short-read data to improve the consensus sequences of long-read assemblies, and you'll get a chance to do that in the exercise.

Here are a few examples of very successful long-read assembly projects, where people were able to sequence bacterial genomes and assemble them into a single piece, or to assemble very large contigs for human genomes. I suggest having a look at these papers if you're interested in seeing just how well long-read assemblers can perform. As for software, there's really one program that's dominant in this field, written by Adam Phillippy's group at the NHGRI: it's called Canu. Canu is the standard way of assembling both PacBio and Oxford Nanopore data, with the difference between the two pipelines being which statistical model you use for the assembly polishing at the end. For PacBio data, we use a program developed by PacBio called Quiver, and for Oxford Nanopore data, we use a program called Nanopolish, which is from my group. We're not going to run those two programs in the practical, just because they're too computationally expensive; instead, you'll do the polishing using Illumina data. And as I've mentioned, the results are much more contiguous than short-read assemblies: you can expect complete bacterial genomes, or megabase-sized contigs if you're assembling human genomes.

So that's the end of the assembly lecture. My email's here; if you have any questions about assembly, either ask in the next few hours or feel free to email me.