So I'm going to give a lecture on how genome assembly works, go over some of the theory behind it and how we assemble genomes from short and long reads, then we'll have a coffee break, come back, and you'll work through a lab where you'll actually get to assemble a small genome using some PacBio and Illumina data. There are a lot of different ways of presenting a high-level picture of assembly; here's how I like to think of it. We have a genome, we chop up many copies of the genome into fragments, we then sequence those fragments, and the process of genome assembly is essentially reversing the sequencing process and stitching the original genome back together from the reads that we've sampled. People have used different analogies, like putting together a puzzle without knowing what the picture looks like, or cutting up a bunch of newspapers. The one I always use: let's say I grab all the binders off your desks with the course materials, take them to the paper shredder, shred them into pieces, dump the pieces on the floor, and say, reconstruct the pages that were in your binder. You'd be quite mad at me, but you'd get a good understanding of what an assembly is. Now, the key things here are that if I shred the pages into really tiny pieces, you'd have a lot of leftover white pieces that give you no information about how they go together. Those are the repeats in the genome, where we just don't have enough information from, say, short reads to assemble them back together. Conversely, if I chop it up into very large fragments, just each page into three or four pieces, you'd have a lot of information to stick those pages back together and it would be relatively easy to assemble. So the talk is in three parts. I'll give you an overview of how assemblers work and a little bit of the theory behind it. Then I'll talk about assembly algorithms for short and long reads. Those are two distinct parts, because the methods we use for short read assembly, like Illumina reads, are completely different from the methods we use for long reads. They're all based on the same underlying theory, which I'll present first, but I'll give you a view of what the assembly pipelines for the different problems look like. And then finally, the third part, which is going to lead into the practical session: I'm going to talk about what properties or features of a genome make short read assembly difficult, and how we can measure that beforehand using some software that my group develops. Okay, so here's a view of how assembly works. We have an input, which is our genome sequence, shown up here in red. (Is the pointer working? Yes, kind of.) We take many copies of that genome, fragment it into pieces, and then read each one of those individual pieces using the DNA sequencers that I introduced to you yesterday. Now, we call this process whole genome shotgun sequencing, "shotgun" meaning that we randomly fragment the genome. We don't know where the pieces came from in the genome. We just blow up the genome, sequence each one of these fragments, and then we need to put it back together. And this is a term that goes back to the Human Genome Project; they talked about shotgun methods of genome assembly.
Now, if we knew where every one of these reads came from in the genome, if we knew that this read down at the bottom is the leftmost read, or covers the first base through to the 15th base of the genome, that this read starts at the fifth base, and so on, it would be really easy to do the reconstruction. We could just merge all of these sequences along the genome to stitch it back together. So we'd start with this read, merge it with this one, which adds that CG, merge it with this one, which adds GC, TCT, AGG, and so on, just walking along these ordered fragments until we've reconstructed the whole genome. Now, the difficulty is that we don't actually know where the reads came from. We don't have this total ordering of the reads. We only have their sequences, and we need to computationally infer what the order of those reads was on the original genome. And unlike the other practical sessions you've done, where you've mapped the reads to a reference genome, which gives you this ordering or an approximation of it, in de novo genome assembly we don't know anything about the genome. We only have this set of reads, and we need to look at the reads and try to figure out how to reconstruct the sequence. So one of the key terms in genome assembly is the idea of coverage. Coverage has probably come up in the other practicals, but I'll just reiterate how we think of it in assembly. It's the average number of reads covering any position in the genome. So if our read sequences, which are shown in blue here, contain 177 nucleotides in total and our genome length is 35 nucleotides, our average coverage is 177 divided by 35, which is about 5x, or what we'd call 5x coverage. What we mean by that is that each base of the genome should be covered by around five reads. Another way to look at coverage is that it can refer to the number of reads covering a particular position in the genome. So the coverage of this T here is six, as six of these blue reads cross this T position. Now, when you're doing a genome assembly, coverage is an incredibly important thing to consider. To assemble the genome, we need a lot of redundancy in our read set. We need reads that overlap each other, that we can compare to each other to say that they might have come from the same region of the genome. And to ensure that reads overlap, we need to redundantly oversample the genome. We need to sequence it to 30, 40, 50x coverage to make sure that we have enough overlapping reads that we can walk across the whole sequence. Coverage requirements are a really common question I get. Someone will come to me and say, I want to assemble this genome, how much sequence do I need? How many Illumina lanes do I need to put towards it? Or how many nanopore flow cells? And it depends on the size of the genome, how repetitive it is, and what sequencing technology you're using, but typically we aim for something between 30 and 50x coverage to get a good assembly. Okay, so the basic principle of assembly is that if we compare a pair of reads that might have come from overlapping positions of the genome, so the same location on the genome, they should be quite similar to each other in their sequence. We should be able to find reads where the end of the read, the suffix of the read, is similar to the prefix, or the beginning, of another read. So if we look at this example here, we have our genome sequence in red. This read started from this position, and this read started from this position.
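Jumping back to the coverage arithmetic for a second, here is a minimal Python sketch of the two calculations just described, average coverage and per-position depth. The reads and start positions below are invented for illustration (only the 35 bp genome length echoes the example above); in a real de novo project we would not know the start positions at all, which is exactly the assembly problem.

```python
# Minimal sketch of the coverage calculations described above.
genome_length = 35

# Hypothetical reads; in reality these come from the sequencer.
reads = ["GGCGTCTATATCTCG", "TCTATATCTCGGCTC", "ATCTCGGCTCTGCGT"]
total_read_bases = sum(len(r) for r in reads)

# Average coverage = total sequenced bases / genome length
average_coverage = total_read_bases / genome_length
print(f"average coverage ~ {average_coverage:.1f}x")

# Per-base coverage, assuming (hypothetically) that we knew each read's start.
read_starts = [0, 5, 11]
depth = [0] * genome_length
for start, read in zip(read_starts, reads):
    for pos in range(start, min(start + len(read), genome_length)):
        depth[pos] += 1
print("coverage at position 12:", depth[12])
```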
And they share this overlap of maybe, say, 15 bases here, so they're very similar in their sequence. We can find that the suffix of one read matches a prefix of another, and this gives us some information about what the genome sequence might be. So I mentioned in the intro that we use different assembly strategies for short and long reads. I'm going to talk about long read assembly approaches first. As I introduced yesterday, long reads are produced by the Pacific Biosciences instrument and the Oxford Nanopore sequencer. We can get reads in excess of 10,000 bases, that's quite common, but they have a much higher error rate of 5% to 15%. And the key computational challenge, the thing that labs like mine work on, is trying to overcome this high error rate in the sequencing data. On the other hand, short read sequencing, which we've heard about throughout the course from Illumina, is very high accuracy, less than 1% error rate, and very high throughput, so you can get a lot of short, 100 base pair reads. The drawback of Illumina sequencing, which I've touched on before, is that the short read length limits our ability to resolve repeats, and the key computational challenge is just being able to efficiently deal with the massive amount of sequencing data that Illumina produces. Yeah? Are those typical 30x to 50x coverage guidelines different between long read and short read sequencing? Yeah, it's a very good question, and very insightful about the long reads. To cover the genome and have long enough overlaps, you can get away with shallower coverage for long read sequencing. But because you have a higher error rate, to get accurate base calls at the end you need more coverage to go into your consensus model, which we're going to talk about later. So to get the same, say, overlap percentage, you can get away with shallower coverage with long reads, but the trade-off is that you'd have lower consensus accuracy. So can you get the same quality of assembly with less read data? Yeah. For nanopore assembly I usually suggest at least 50x, and 100x is better, just because the error rate is so high and the errors are particularly biased towards difficult sequence motifs, and you need statistical power from many, many reads at each base to get over those biases. With PacBio, you can get pretty good assemblies between 25 and 50x. We did an assembly of the Canadian beaver genome with 25x of PacBio; there was a point of national pride in doing that. So you can get away with a bit lower coverage there, but we also used Illumina data to correct some of the errors that we weren't able to fix with the PacBio reads alone. For Illumina sequencing, the method that we're going to talk about is based on de Bruijn graphs, which use k-mers, and you want to sequence really deeply to be able to use long k-mers to get around these repeat problems. So coverage is a huge topic; I could give an entire lecture on just coverage considerations and the different trade-offs. But yes, it's different between the short and long reads; mainly the trade-off is between overlap length and accuracy. So the key computational challenge for Illumina sequencing is just making your software and your algorithms computationally efficient enough that they'll complete the assembly in a reasonable amount of time.
So the methods used for long reads really have their origin in assembling Sanger sequencing data, which we talked about yesterday. The very first genomes were essentially assembled by hand by comparing gel images; you could do that for a viral genome, but when larger and larger genomes were sequenced with Sanger, computational, software-based assemblers were developed. The dominant paradigm was called overlap-layout-consensus assembly, or OLC. Short read sequencing took over for a while, but now that long read sequencing is becoming popular again, OLC-based assembly has seen a resurgence, and the main assemblers for long reads, like PacBio data, are all based on this idea. So OLC assemblers take your sequencing reads as input. They first find overlaps between reads and construct an overlap graph. They then process that overlap graph to compute a layout of the reads, which is this ordering of the reads along the genome. They then calculate a consensus sequence at the end to overcome the high error rate, and they output contigs. A contig here is a stretch of the genome that was assembled into one piece; it's short for contiguous sequence. So it's a set of reads whose sequences have all been merged together into some segment of the genome. I'm going to talk about each one of these stages in turn. As I mentioned earlier, we have this unordered set of reads, and we want to figure out relationships between the reads: which reads may have come from the same location of the genome. The way that we do this is we take pairs of reads, we use dynamic programming, which is a way of performing string comparisons, and we look for significant overlaps between the reads. The overlap has to be at least a certain length; say, for a 1,000 base pair PacBio read, the overlap might have to be 500 bases, with few mismatches or insertions and deletions. So here's an example of a short pair of reads that overlap with one mismatch in the middle, but you can see the identity between the two reads is fairly high. When constructing our overlap graph, we create a vertex in the graph, depicted by one of these circles, for each one of our reads, and when we find one of these significant overlaps between a pair of reads, we connect them up with an edge. Now, for early Sanger sequencing data, where we didn't have that many reads, maybe a few thousand reads for your sequencing project, you could just take every pair of reads in your collection, compare them to each other, look for an overlap, and if they have one, great, put an edge in the graph. When you assemble human genome-sized data sets with overlap-based methods, there are just too many pairs of reads to exhaustively compare all of them. So groups like mine develop ways of very efficiently screening for the pairs of reads that might actually have an overlap, and the algorithms we use are quite similar to the mapping and alignment algorithms that you heard about earlier in the course. At the end of this, we get an overlap graph. An overlap graph is just a graph where every read is a vertex. So the reads that we had on the previous slide are each a vertex in here: we had one read starting GCATTA, one read starting ATT, and so on, and each one of those is a vertex of the graph. For all of the reads that had a significant overlap, we've connected them up with an edge here.
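To make the overlap stage concrete, here is a small Python sketch that finds exact suffix-prefix overlaps between every pair of reads and records them as edges of an overlap graph. It is only a toy, with invented reads and an arbitrary minimum overlap length; real OLC assemblers use dynamic programming so they can tolerate mismatches and indels, and k-mer based filtering so they never compare all pairs exhaustively.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def build_overlap_graph(reads, min_len=3):
    """Return edges (i, j, olen): read i overlaps read j by olen bases."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            olen = suffix_prefix_overlap(a, b, min_len)
            if olen:
                edges.append((i, j, olen))
    return edges

# Tiny made-up read set for illustration.
reads = ["GCATTATCT", "TTATCTCGG", "TCTCGGAAT"]
for i, j, olen in build_overlap_graph(reads):
    print(f"read {i} -> read {j}: overlap {olen} bp")
```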
Now, the assembler's task is to find a path, or a walk, through this graph, a path that visits these nodes and reconstructs the original genome sequence. Just looking at this graph, it's a little bit complicated; we can't tell exactly what the genome sequence is just by looking at the structure of the graph. So the assembler has to work a little bit harder and try to figure out what the true path through the graph is, the one that reconstructs the genome, out of all the possible paths you could take through this graph. That's the job of the second stage of the assembler, which is called the layout stage, which is going to bundle stretches of the overlap graph into these contiguous sequences, or contigs. So I'm going to turn to an example of reconstructing a fragment of a song. The fragment is "to everything, turn, turn, turn, there is a season". The reason we've selected this to demonstrate assembly is that there's a repeat in here: the word "turn" is present three times. Now, if we fragment this sentence into a lot of different copies of about five characters each and construct an overlap graph from those fragments, it looks like this. It's pretty nasty. We don't really see what the reconstruction is by walking through this graph. So what the assembler is going to do is examine the graph and try to figure out whether all of these edges in the graph are necessary. Overlap-layout-consensus assembly is based on the idea that a lot of these edges are redundant, in that they spell the same sentence as another path through the graph. As an example, we have a series of blue edges here that go from this node through this middle vertex and on to the next one, and the sequence spelled by that path is exactly the same as the sequence spelled by this path that bypasses the middle vertex. So in assembly terminology, we say that this edge, the green one, can be transitively inferred from the other edges, and therefore its presence in the graph is redundant. What the assembler does is remove all these transitively inferable edges, starting from edges that bypass a single vertex. So edges that have this pattern, that go from here all the way over and skip some middle vertex, are going to be eliminated from the graph. Let's do that now. We get a graph that's much simpler in structure after doing that transitive reduction. If we take that one step further and eliminate edges that skip two vertices, these types of transitive edges, we again get a graph that's even simpler. And now we have these segments in our graph that are unbranched chains, where there's no ambiguity. What the assembler then does is build contigs from these unbranched chains in the graph. So the first contig would be "to everything, turn", and then the second contig over here is "turn, there is a season", and this structure here is an unresolvable repeat, as we don't know how many times we should put the word "turn" into our output sentence. So the assembler would stop at that unresolved repeat and just output these two contigs as the assembly. We haven't assembled the full thing, but we've done a pretty good job: we've gone all the way up to the repeated unit, where we couldn't resolve how many copies of the repeat there are. The final step of overlap-layout-consensus assembly is picking the most likely nucleotide sequence for each contig.
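Before moving on to the consensus step, here is a hedged sketch of the transitive reduction just described: an edge is dropped if the same destination can be reached through some intermediate vertex, because that longer path spells the same sequence. The node labels are just toy stand-ins for fragments of the song, and real layout algorithms also track overlap lengths and read orientations; this only shows the basic graph operation.

```python
def transitive_reduction(edges):
    """edges: set of (a, b) pairs. Drop edges that skip a single intermediate node."""
    out = {}
    for a, b in edges:
        out.setdefault(a, set()).add(b)
    reduced = set(edges)
    for a, b in edges:
        # If b is reachable from a through some middle vertex, a -> b is redundant.
        if any(b in out.get(mid, ()) for mid in out.get(a, ()) if mid != b):
            reduced.discard((a, b))
    return reduced

edges = {("to", "every"), ("every", "everything"), ("to", "everything")}
print(transitive_reduction(edges))   # the ("to", "everything") shortcut is removed
```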
So we're using overlap-layout-consensus assembly for noisy long reads like PacBio and Oxford Nanopore, and we need an algorithm that will eliminate the errors in the sequencing reads. The way that we do that is we make what's called a multiple alignment, where we take every read that makes up one of our contigs and line up the bases, so that the first base of every read is lined up, then the second base, and so on. Then we walk from left to right and pick the most frequent base in each one of these columns, and that's what we call the consensus, or the majority base, at that position. So this first column is completely unambiguous: there are five T's here. This one's unambiguous, it's five A's; five G's. In this one, there was a sequencing error, a deletion in this read, so we have four A's and one deletion, and the consensus nucleotide here is A. Let's keep walking along. Here we had four C's and one T; the consensus is a C. Here we had four C's and one gap. Four G's, one C. And over here we had four gaps and one A, so we infer that this A is an erroneous insertion in this read, and we'd output the consensus as a gap character here, which then would be removed. So we're just walking along the sequence, taking the most frequent base at every position. Now, this is how the old Sanger sequencing consensus algorithms worked. For the very noisy PacBio and Oxford Nanopore data, we use more sophisticated methods which build generative probabilistic models of the underlying raw data, whether that's the current traces from Oxford Nanopore or the light pulses from PacBio, and then we infer what the true sequence is using these generative probabilistic models, which improves accuracy quite a lot but is very, very expensive to run. Okay, so overlap-layout-consensus assembly was developed for Sanger sequencing. When we first got Illumina sequencing, the natural thing to do was take our old OLC assemblers and try to run them on Illumina data. Unfortunately, this profoundly failed because the reads were just too short. The very first Illumina reads that I saw, and it was still the Solexa company when I got involved, were about 27 base pairs in length. If you think about trying to calculate an overlap between 27 base pair reads, you're looking at an overlap of about 18, 19, maybe 20 bases. There are very many repeated 20-mers in the human genome, and any reads that cross over these 20-mers are going to have edges between them, which are just spurious edges between different copies of the repeat. So if you tried to construct an overlap graph of 27 base pair reads for a human genome, you'd have a graph with billions, probably hundreds of billions, of edges and billions of vertices; it wouldn't fit into memory, your assembler would crash, and you'd have a very bad time. So what people developed were methods that were overlap free; they didn't want to have to compute pairwise overlaps between Illumina reads. The field settled on a technique called de Bruijn graph based assembly, which uses short k-mers. A requirement of de Bruijn graph based assembly is doing error correction on the reads prior to building the assembly graph. So rather than tolerating sequencing errors by allowing inexact overlaps, we're going to pre-process the reads first to find where the errors might be, and then correct them to the true genomic base, and that makes our assembly task a lot easier later on.
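Before moving on, here is a minimal sketch of the majority-vote consensus described a moment ago: walk the columns of the multiple alignment and take the most frequent character, dropping gap calls. The toy alignment below is invented; as noted above, modern PacBio and Nanopore consensus callers use probabilistic models of the raw signal instead of a simple vote.

```python
from collections import Counter

# Each row is one read already aligned to the contig, with '-' for gaps.
alignment = [
    "TAGA-CCG",
    "TAGAACCG",
    "TAGA-CCG",
    "TAGA-CTG",
    "TAGA-CCG",
]

consensus = []
for column in zip(*alignment):               # walk left to right, one column at a time
    base, count = Counter(column).most_common(1)[0]
    if base != "-":                          # a gap consensus means the base is dropped
        consensus.append(base)
print("".join(consensus))                    # -> TAGACCG
```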
I'm going to come back to this figure later. It shows the error rate for six different Illumina data sets, and it's just here to show you what the error profile of Illumina reads looks like, where the error rate increases from the five prime end to the three prime end, up to around a few percent at the very last base. Now, error correction is one of the major problems in short read assembly, and people have worked very hard to develop computationally efficient methods to error correct Illumina reads. These are all based on counting the number of times k-mers, windows of length k, occur across our entire data set. To walk you through an example, let's consider a read that has a single error; let's say this C was miscalled by the base caller. The way the error corrector works is it takes a k-mer length, let's say 20, counts the number of times the first 20-mer in the read occurs across your entire data set, then the second, third, fourth, all the way to the last k-mer in the read, and you compare the counts as you go across the length of the read. Now, the k-mers that don't contain sequencing errors, that are true genomic sequence, should be present in high copy number, depending on how much sequencing depth you acquired. So if you have 30x sequencing coverage and a very low error rate, you should see each of those k-mers around 30 times. If you have a sequencing error, and sequencing errors are quite rare, it's going to flip a base in true genomic sequence to something that's never been seen before. So the count, the number of times that k-mer is seen across your entire read set, should be very low; it should probably be one, maybe a couple of times if it's a recurrent sequencing error. So the idea here is that we can use the k-mer count as a proxy for the number of times these bases are seen across our entire data set, and we're going to switch the ones that are low frequency to sequences that are high frequency to fix errors. As it says down here, we're going to look for rare k-mers, k-mers that are only seen once or twice across our read set, and flip them to common k-mers to make our corrections. Now, the benefit of doing it this way is that k-mer based error correction is incredibly fast and memory efficient. You can error correct an Illumina data set for a human genome in a few hours using one of the state-of-the-art programs like BFC, whereas if you tried to calculate overlaps between Illumina reads that allowed mismatches, it would take orders of magnitude longer and wouldn't be practical for very large data sets. Okay, so once we've performed our error correction step, we then want to build this de Bruijn graph and use it to extract contigs. The de Bruijn graph is also based on k-mers, and the idea here is that we're going to break our reads into shorter fragments of a uniform length, which we call k, and then construct a graph of the k-mer relationships seen in the reads. So here's an example. We have five reads here; the first one is CCGTTA. We're going to take the first four base pair sub-sequence, which is CCGT, and put that k-mer into our de Bruijn graph down here. We then take the second k-mer in this read, CGTT, and put it into our graph here. And now, because the k-mer CCGT is followed by CGTT in some read, we link them up with an edge.
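Before going further with the graph, here is a hedged sketch of the k-mer counting error correction just described: count every k-mer in the read set, treat rare k-mers as likely errors, and try single-base substitutions until all k-mers covering a position are common ("solid"). The k-mer size, solidity threshold, and reads are toy values; real correctors such as BFC are far more sophisticated than this.

```python
from collections import Counter

K = 5          # tiny k for the toy example; real Illumina correctors use k ~ 20-30
SOLID = 3      # k-mers seen at least this many times are trusted ("solid")

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def correct_read(read, counts):
    """Try single-base substitutions so every k-mer covering a position is solid."""
    for i, base in enumerate(read):
        window = kmers(read)[max(0, i - K + 1): i + 1]      # k-mers covering position i
        if all(counts[w] >= SOLID for w in window):
            continue                                        # position looks fine
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = read[:i] + alt + read[i + 1:]
            cand_window = kmers(candidate)[max(0, i - K + 1): i + 1]
            if all(counts[w] >= SOLID for w in cand_window):
                return candidate                            # one fix was enough
    return read

# Toy data: several error-free copies of a region plus one read with an error.
reads = ["ACGTTGCATG"] * 5 + ["ACGTCGCATG"]   # the lone 'C' at position 4 is the error
counts = Counter(k for r in reads for k in kmers(r))
print(correct_read("ACGTCGCATG", counts))     # -> ACGTTGCATG
```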
So this is trying to get across the same idea as overlap assembly: where the overlaps represented sequences that might have been adjacent in our original genome, these edges between k-mers represent k-mers that might be adjacent in our original genome. If we carry on with this process, just extracting all the k-mers from these reads, adding them into the graph as vertices and linking them up with edges, we end up with this graph here. Now, this graph is fairly simple in structure, but it does have one branching point, because this k-mer, CGTT, is repeated in our genome: it's followed by GTTA at one position and GTTC at another position. And because this graph is simple, there is a unique reconstruction of the genome, where we just walk from here down through this lower branch, back around, visit this vertex a second time and then go up here, and that's the assembly of this little simulated genome. The nice thing about de Bruijn graph based assembly is that all you're doing is taking k-mers from your reads, these fixed length strings, and linking up adjacent k-mers. So this graph can be constructed in time linear in the size of your input data set, and it can be very efficiently implemented using things like hash tables, which is what allowed Illumina sequencing reads to be assembled with this technique. There's been a tremendous amount of work on building faster and more memory efficient de Bruijn graph based assemblers, and that's actually what I worked on for my PhD. Right, so after we've constructed our graph, we need to do some post-processing to get rid of any artifacts that we find in our sequences. The main artifacts are residual sequencing errors that our error correction algorithm wasn't able to remove. What happens is, if we construct a de Bruijn graph from, let's say, these three reads, and there's an error at the end of one of the reads, it causes these spurs off the graph, where some k-mers diverge from the graph but don't really go anywhere; they stop at some position. So we have a couple of k-mers, a couple of nodes here from read two, a couple of nodes here from read three, and we call these tips, or spurs, off of the graph. The reason that they just branch off and don't go anywhere is that the errors tend to be random, unique, and at the ends of reads. So there are no k-mers that contain this erroneous A base in the first position of the k-mer, and you don't have well-connected structures that go back into the graph; they're just these little tips that don't really go anywhere. If you're sequencing a diploid genome and you have heterozygous SNPs, that can cause what we call bubbles in the graph, which are divergences in the structure. If you imagine all of the k-mers for this allele, where there's this heterozygous C/G SNP here, all of the k-mers before the SNP and all of the k-mers after the SNP are common between the two alleles, and that gives rise to this common gray structure here. Then the k-mers that cover one of the two alleles diverge, causing the split, where one path represents allele one and one path represents allele two. And then they come back together once the divergent base has worked its way out of the k-mer, back into the common sequence. So these are two very common structures that you find in de Bruijn assembly graphs, and getting rid of them is what allows a de Bruijn graph assembler to construct very long contigs from short reads.
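As a concrete illustration of the construction just described, here is a minimal de Bruijn graph builder: every k-mer in a read becomes a node, and consecutive k-mers in a read are joined by an edge. The reads below are invented so that the k-mer CGTT is repeated, giving the same kind of branch point discussed above; reverse-complement handling, which a real assembler needs, is omitted for clarity.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k=4):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

reads = ["CCGTTAC", "GTTACGT", "TACGTTC"]
for node, successors in sorted(de_bruijn_graph(reads).items()):
    print(node, "->", ", ".join(sorted(successors)))
# CGTT ends up with two successors (GTTA and GTTC): the branch caused by the repeat.
```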
So here's a cartoon of how these graph cleaning steps improve our de Bruijn graph. Let's say this is our input graph here. It's quite messy; it's not really clear what the genome sequence is. We'll first make a pass identifying all of the one-sided vertices in the de Bruijn graph, the nodes that only have edges on one side. We'll then walk those edges back until they rejoin the main part of the graph. When they do, we'll annotate them as being a tip or a spur and then get rid of them, giving us a much cleaner graph. We'll then walk through the graph looking for these divergent structures where there's a branch point that then comes back together. We select one of the two alleles to represent the sequence in the genome and remove the other one, giving us a much cleaner structure that looks like this. Now, this is looking a bit like our overlap-layout-consensus graph, where we have these long unbranched segments of the genome that can be compacted together into contigs, and that's exactly what the de Bruijn graph assembler does. (The question was how the assembler selects which allele to keep.) You select it based on frequency: what you'll usually do is calculate the k-mer coverage along each branch and pick the one with higher coverage. Beyond that it's an arbitrary decision; the ratio should usually be about 50-50, so it doesn't really matter. The assembler doesn't care which allele it puts in there, it just wants to put one of the two in. You can also get these bubble structures from sequencing errors that occur right in the middle of a read, and that's why we pick the higher frequency branch: if you have 30x coverage on one branch and 1x on the other, the 1x branch is probably a sequencing error and you want to get rid of it. Okay, so the assembler is then going to take this much cleaned up graph and compact the nodes together into contigs, which is just merging all of these unbranched segments into the final genome assembly. It's going to ignore the places where the graph branches and just compact everything else together. Okay, so we've heard about paired end reads. The final step of short read assembly is leveraging the paired end information to try to jump over these repeats that stop contig assembly. If you have long insert paired end information, you can map your reads back to your contigs and use the pairs to reach, say, 500 base pairs downstream, where the other read of the pair continues. So the first step is to take the contigs that we produced in the last stage and map our paired end reads to those contigs. We're going to have paired end reads that map internally on the contigs, which are shown as black bars here linked by a line. But we're also going to have overhanging pairs, where one half of the pair maps to the end of the contig and the other half of the pair maps to the end of some other contig. So we can infer that in the genome, this contig might be followed by this contig here; likewise, this contig might be followed by this contig. Now, we can formalize this idea by constructing a scaffold graph, where we put each contig as a vertex in the graph and link a pair of contigs with an edge if there are read pairs supporting those two contigs being linked together. We then post-process that scaffold graph to construct our scaffolds, which are just an ordering of our contigs with gaps in between, where the gaps are the sequences that we can't resolve: either they're too repetitive or we don't have enough coverage there. So our scaffolds contain gaps.
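Here is a rough sketch of the scaffold-graph idea: each read pair whose two ends map to different contigs votes for a link between those contigs, and links with enough supporting pairs become scaffold edges. The contig names, mappings, and support threshold are all made up; a real scaffolder also uses mapping orientation and the insert size distribution, which are ignored here.

```python
from collections import Counter

MIN_SUPPORT = 2   # require at least this many read pairs before trusting a link

# (contig of read 1, contig of read 2) for each overhanging pair, hypothetical data
pair_mappings = [
    ("contig_1", "contig_2"),
    ("contig_1", "contig_2"),
    ("contig_1", "contig_2"),
    ("contig_2", "contig_3"),
    ("contig_2", "contig_3"),
    ("contig_1", "contig_3"),   # a lone, probably spurious link
]

link_support = Counter(pair_mappings)
scaffold_edges = [(a, b, n) for (a, b), n in link_support.items() if n >= MIN_SUPPORT]
for a, b, n in scaffold_edges:
    print(f"{a} -> {b}  (supported by {n} pairs)")
```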
We fill in the sequence between the contigs with N, the IUPAC code for an unknown nucleotide. Some assemblers take the approach of trying to use a local assembly to fill in the gaps: they'll do a less stringent assembly of the reads that have pairs in this region to try to fill in that more repetitive sequence. There are also some programs out there that will take in other sequencing data, like low coverage PacBio data, to try to fill in the gaps in your scaffolds as well. So what can you expect the output of an assembler to give you? Well, it varies quite a lot depending on the size of the genome you're trying to assemble and the type of sequencing data you have. For bacterial genomes, short read assembly will give you hundreds of contigs with an average size of between, say, 10,000 and 100,000 bases. With long reads like PacBio or nanopore, it's typical to have the genome in only a handful of contigs, maybe one to five, with most genomes being able to be completely assembled into one piece; you get a complete genome assembly. For large genomes with short reads, you can expect your contig sizes to be around 10,000 bases; with long reads, they'll be around a megabase in size. Long read data, as I mentioned yesterday, is much more expensive. So the right technology to use depends on your question, what you're interested in getting out of your genome assembly, and also your budget, how much money you can put towards the assembly. I typically recommend that people use long reads if they're assembling very large genomes. If you want to make a new reference genome for some crop that you're interested in, try to use long reads; they'll give you a much higher quality reference if you're able to pay for it. Okay, I'm going to switch gears now. Are there any questions at this point about the theory of assembly? Are you representing double-stranded sequence here, or just a single direction? Yeah, this is one of those ugly bits of implementing a genome assembler that are hard to deal with. Usually this vertex represents both the k-mer sequence and its reverse complement, because, as you know by asking this question, every sequencing read could have come from either strand, so we want to be able to incorporate that information. Some assemblers have slightly different representations, where they have these mirrored graphs with one set of k-mers for one orientation and one set for the other, but that gets pretty complicated to implement, so usually assemblers will just represent both in the same vertex. Yeah, it's a very good question; the question was about the benefits of de novo assembly for human genomes. Structural variation, I think, is the best application right now. There are other types of variation that are very difficult to call with Illumina sequencing: indel calling, even calling short insertions and deletions of a few bases, is notoriously difficult, especially for somatic mutations. And if you sequence a genome, let's say my genome, and compare it to the reference genome, you'll find something like five megabases of sequence that's not present in the reference at all. Obviously, if you're not doing de novo assembly, you have no information about the function of that sequence. So it's the combination of structural variation, short insertions and deletions, and novel sequence that gives you the benefit of assembly for human genomes.
Okay, so in the last part of the talk, I'm going to discuss some of the difficulties of performing short read assembly and how we can make predictions about how hard or easy a given assembly will be just from your set of sequencing reads. Short read assembly algorithms were an incredibly active area of research between about 2010 and 2015, and there were a lot of community projects to benchmark how well short read assembly software worked. One of them was called the Assemblathon, and Assemblathon 2 sequenced the genomes of three species: a boa constrictor, a Lake Malawi cichlid fish, and a parakeet. There were no existing reference genomes for these species. The organizers released the read sets for the three genomes publicly; everybody who was interested in doing assembly would download the data, assemble it, and submit it to the organizers, and they benchmarked how well the individual software performed. My group took part in that, along with probably a dozen other groups around the world. The results were somewhat disconcerting, because they were very variable. Some assemblers that did very well, say, on the snake genome didn't do very well on the Lake Malawi cichlid, and some genomes just came out much, much better across all assemblers than other genomes. So this got me thinking about what makes a given assembly difficult, and how we can help you, the users, figure out, once you've sequenced something, whether it's going to be an easy genome to assemble or a very hard genome to assemble. So I spent about a year working on a program called PreQC, which will take a set of sequencing reads and tell you whether you're in for a hard time during your assembly project. I'm sorry about that. Okay, so I sat down and made a list of the features of a genome that make assembly difficult. Something I've hammered on about for the last 40 minutes is repetitive sequence; this is an obvious one. High heterozygosity: if you have a very high rate of heterozygous SNPs and indels in your genome, you're going to have a lot of these bubble structures, which can confuse the assembler. If you didn't sequence enough, maybe because you didn't have an accurate estimate of your genome size and you only have 15x sequencing coverage, this is going to make the assembler's job a lot more difficult. If you have very biased sequencing coverage, and by that I mean your sequencing coverage is very poor for low or high GC content sequences, that causes a lot of problems. The classic example is the parasite that causes malaria, Plasmodium falciparum, which is about 80% AT. This is awful for Illumina sequencing; it's very, very difficult to get uniform, good coverage of Plasmodium with Illumina, but because it's so important biologically, people really, really want to sequence and assemble the genome. So the assemblers have to try to account for this, but it is one of the factors that makes assembly more difficult. High error rate in your reads, that one's a bit obvious: if you have more errors, it's going to be more difficult to assemble. And if your reads are chimeric, where they join sequences that aren't adjacent in the genome, that confuses the assembler.
If you have a lot of sequencing adapters in your reads, if your sequencing center contaminated the sample with something else, or if you simply can't get enough DNA from a single organism, so you've taken multiple individuals from the population and sequenced them, you've now increased the amount of variation in that sample, which makes the assembler's job a lot more difficult as well. So the way this PreQC program works is that it looks at the structure of the assembly graph and tries to label each of the branches it sees with the cause of that branch: whether it's caused by a sequencing error, by a heterozygous SNP, or by a repeat. And the way it does that is it goes back to the methods that we talked about for error correction, where it looks at how many times each of these k-mers occurs across our sequencing data set. Now, the first output of this program, and you're going to have a look at these reports in the lab practical coming up next, is a histogram of the k-mer count distribution, in this case for 51-mers. This is an extremely clean example, and it's what we hope to see when we run a genome assembly. We have this nice peak around a k-mer count of 30, which corresponds roughly to 30x coverage. We have a peak over here of k-mers that are only seen a single time; these are k-mers that contain sequencing errors. And you see there's a little shoulder here of k-mers at half the coverage of the main peak, which are k-mers that cover heterozygous SNPs, the ones that cause bubbles in the graph. Now, if we look at this one, I'm immediately horrified: it's a bimodal distribution. The reason it's bimodal, this is from an oyster genome, is that the heterozygous SNP rate in this oyster is roughly 1%; there's a SNP every 100 bases. So what happens is you get this huge peak of k-mers at half coverage, which correspond to k-mers covering heterozygous SNPs: these are the homozygous k-mers, these are the heterozygous k-mers. Very few assemblers are designed to deal with this level of heterozygosity, and the authors of the oyster genome paper actually started with a short read assembly approach, saw this heterozygosity, and gave up and used long reads instead. So if you see something like this, it's going to be a very difficult genome assembly. Going one step further, within PreQC we have a statistical model which will classify each one of these branches. I'm not going to go into the details of how the statistical model works; it essentially measures the coverage of the three vertices involved in a branch and compares the coverage of the two branch vertices, which are labeled CA and CB. The idea is that if it's a sequencing error, we expect one of the branches to have much better coverage than the other. If it's a heterozygous SNP, we expect the branches to have roughly balanced coverage. If it's a repeat, we expect the branches to have coverage corresponding to their copy number in the genome. So the output of this that the user sees is a prediction of how often the graph branches due to heterozygous SNPs, and this is the plot I'm showing here. Now, this oyster genome that I just told you was horrible to assemble has a heterozygous SNP branch prediction of about one in a hundred, which matches what we know. The human genome is at about one in a thousand, shown in blue here.
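To show roughly how a k-mer count histogram like PreQC's first plot arises, here is a small, purely illustrative simulation: reads are sampled from a random toy genome at about 30x with occasional errors, and the counts of distinct k-mers are tallied. Everything here, the genome, read length, and error model, is invented just to reproduce the shape of the plot, with error k-mers piling up at count one and genomic k-mers peaking near the coverage.

```python
import random
from collections import Counter

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(2000))   # toy random genome
K, COVERAGE, READ_LEN = 21, 30, 100

reads = []
for _ in range(COVERAGE * len(genome) // READ_LEN):
    start = random.randrange(len(genome) - READ_LEN + 1)
    read = list(genome[start:start + READ_LEN])
    if random.random() < 0.5:                        # put one error in about half the reads
        read[random.randrange(READ_LEN)] = random.choice("ACGT")   # may pick the same base
    reads.append("".join(read))

kmer_counts = Counter(r[i:i + K] for r in reads for i in range(len(r) - K + 1))
histogram = Counter(kmer_counts.values())            # k-mer count -> number of distinct k-mers
for count in sorted(histogram):
    print(f"{count:3d}x  {histogram[count]} distinct k-mers")
```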
The negative control here is a haploid yeast genome, which has a heterozygous branch rate of about one in, I think, 30,000 bases; that's effectively the false positive rate of this approach. The different Assemblathon genomes are listed here: the cichlid fish in green, the boa constrictor in red, the parakeet in purple. And their difficulty of assembly roughly corresponds to the heterozygosity shown here. We can also predict how often branches occur in the assembly graph due to repeats. This largely tracks our expectation of which genomes are going to be most difficult to assemble: the human genome, which is known to be very repetitive, is up here with the highest repeat-induced branch rate, along with the oyster genome, and the yeast genome, which is very small and easy to assemble, is shown down here. So by plotting where your genome lies on the spectrum from hard to assemble to easy to assemble, it gives you some idea of what kind of assembly you're looking at. This program will also predict the size of the genome. It will just take your sequencing reads, calculate coverage statistics based on these k-mers, and make a prediction of what the size of the genome is. So if you want to check that it matches what the lab told you, you can. It will also do more basic quality control, like looking at the mean quality score at each position along the read. So again, looking at the same six data sets here, we see that the snake genome, the boa constrictor, which was actually sequenced by Illumina, has the highest quality data across the whole read, whereas the Lake Malawi cichlid is the lowest quality data set here; you see a fairly steep drop-off of the quality scores towards the three prime end of the read. I showed you this image earlier, which is a per-position error rate. This program will estimate where the errors in the sequencing reads are and plot the error rate as a function of position along the read, again giving you an indication of how clean your data is. It will also try to get at GC bias, whether there's GC bias in your data set. So this is the Lake Malawi cichlid again, and it's a 2D histogram where each bin is the k-mer coverage for k-mers with a particular GC content. We see that this is a fairly clean data set, where there's not a strong dependency between k-mer coverage and GC content. If we look at a yeast data set instead, we see there's a bit of a dependency between GC content and k-mer coverage, with these higher GC k-mers having somewhat lower coverage. For the oyster genome, again, we see two peaks here because of this heterozygous versus homozygous k-mer problem. The program will also infer the fragment size histogram, just like we saw when mapping to a reference genome, but it does this by looking at where the pairs link up in your assembly graph, and it will automatically output these histograms for you. And finally, it will actually do a simulated assembly. It doesn't do a full assembly, because that would take too long, but it will simulate a genome assembly process and output the length of the contigs that you'd expect to get for a particular k-mer size, which tells you directly how easy or how difficult your genome assembly is going to be. For the yeast data set, which I've said is quite easy, you get the longest contigs across a very wide range of k-mer sizes. For this oyster data set, which I'm using to demonstrate how difficult things can be, you get very short contigs for pretty much all k-mer values.
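As a rough illustration of the k-mer based genome size estimate mentioned above: divide the total number of solid k-mer observations by the k-mer coverage at the main peak of the histogram. The function and toy counts below are hypothetical and not PreQC's actual model, which is considerably more careful about error and heterozygous k-mers.

```python
from collections import Counter

def estimate_genome_size(kmer_counts, min_count=4):
    """kmer_counts: mapping k-mer -> number of times it was seen in the reads."""
    histogram = Counter(kmer_counts.values())
    # Coverage at the peak of the histogram, ignoring low-count error k-mers.
    peak = max((c for c in histogram if c >= min_count), key=lambda c: histogram[c])
    total_kmers = sum(n for n in kmer_counts.values() if n >= min_count)
    return total_kmers // peak

# Toy input: 1,000 distinct genomic k-mers seen ~30 times each, plus error k-mers.
toy_counts = {f"genomic_{i}": 30 for i in range(1000)}
toy_counts.update({f"error_{i}": 1 for i in range(5000)})
print(estimate_genome_size(toy_counts))   # ~1000
```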
So in the practical session, you're going to look at a report for the data sets we're going to use. You're not going to generate the report on your own, because it takes a little too long to generate for the practical session, but the commands used to generate the report are shown here; they're also in the text file for the lab. If you want to read the paper, it's up on arXiv, and the code is available as well. I'll wrap up now. So this was an overview of how assemblers work, a little bit of theory, and more detail on how short and long read assembly pipelines work. We need different approaches for different types of data; my standing recommendation is to use long reads if you can. And then I just talked about the different factors that determine whether an assembly is going to be difficult or easy. I'm here for the rest of the afternoon, so please feel free to ask questions now or during the lab practical. If you think of something after today, feel free to email me; I'm always happy to help people with assemblies. So, do you have any questions at this point? In the back? Yeah. Yeah, that's a good question. So the question was about optical mapping. This is a technology where you stretch out very long fragments of DNA and then you can figure out where little markers are on the DNA. So you can't sequence the entire piece, but you can figure out where certain sub-sequences occur. A famous one is called BioNano, which is used for optical mapping, and that's really useful for building very long scaffolds from your genome assembly. So if you have an existing assembly with 100 kb contigs, you can use an optical map to order and orient those contigs along the map to give you much, much longer scaffolds. This is becoming a really popular way of generating complete, nearly finished, chromosome arm level assemblies, and we're starting to see a lot of that in human assembly as well. Since you asked that, I'm going to use this to talk about a couple of other technologies that I didn't get a chance to mention. One that you might hear about is 10x Genomics. 10x Genomics is similar to the molecular technology Matthew was just talking about: it takes very large pieces of DNA, around 100 kb, and barcodes the reads within droplets so that you know all of the reads came from the same source 100 kb molecule. This can be used for scaffolding, and it can be used for haplotype phasing as well, and that's becoming pretty popular in assembly too. All right, any more questions?