Okay, we're in the home stretch now, everybody. This is the last module of the course. So again, we're going to split this into a theory lecture and then a practical session after the coffee break. So this module is on genome assembly. I alluded to this yesterday morning when I talked about genome sequencing. And the goal of the lecture here is to give you an understanding of the fundamentals of what we mean by assembling a genome, then an overview of how some of the algorithms work. And I'll talk a little bit about short and long read assembly, some of the differences between approaches depending on whether you have Illumina reads versus longer sequencing data. And then at the very end, I'll give you some pitfalls about how assembly can go wrong and some of the things that we can do to fix that. Then, of course, you'll have a chance to practice assembly on some data that we've provided. All right, well, let's start with a very basic question: what is a genome assembly? It's essentially reversing the sequencing process. When we sequence a genome with whole genome shotgun sequencing, we take our genome and fragment it into many pieces, millions or billions of pieces. We figure out the sequence of those individual pieces using whatever our sequencing technology of choice is. And then the assembly process is just reversing that. We're going to take all of those reads and try to reconstruct what our original genome sequence was. Now, this is in complete absence of a reference genome. So when you do a de novo genome assembly, you're not using a reference genome. You have nothing to align the reads to. You just need to compare the reads against each other and figure out what the structure of the genome was. Now, in this cartoon that I've drawn for genome assembly, we've segmented the genome into portions that are unique and portions that are repetitive. The portions that are repetitive are shown in these red bars here, where the repeat is present in three different copies.
Now, these are going to be the real challenge of assembling the genome. And when we have one of these repeats that's longer than the length of our reads, like if you have a repeat that's 1,000 bases and you have a 100-base Illumina read, we can't actually figure out the order these pieces go in, because the read doesn't cross the repeat boundary into the unique sequence. So that's why we're going to want to focus on trying to assemble either the unique regions, if you have short reads, or getting longer reads such that you can span the different copies of the repetitive regions. All right, so here's the overview. I'm going to talk about how assemblers work, some of the theory, some assembly algorithms for short and long reads. And then finally, what are some of the factors that make genome assembly a difficult problem? So typically when we assemble data, we talk about whole genome shotgun sequencing. Whole genome shotgun sequencing is the idea that we just randomly fragment our genome into these millions of pieces, sequence the individual pieces, and then try to reconstruct the sequence. So the shotgun refers to the idea that you've just kind of blasted your genome into these millions of pieces. You have no idea how the pieces relate to each other, but we need to computationally figure that out. So in this cartoon here, we have our input sequence, which is our genome. We take many copies of our genome, we then fragment it, and then figure out the sequence of these individual fragments. Now if we knew the ordering of the fragments along our genome, if we knew that this sequence that starts GGC is taken from the first position in the genome, and then this one's taken from the fourth position, and this one's taken from the last position, if we know this complete order, it's really easy to do the assembly.
We would just lay out our reads like this and then collapse their sequences back together by joining them at the positions where they overlap to reconstruct what that genome is. Unfortunately, we don't know that. We don't actually know the ordering of the individual pieces. We only know their sequences. We don't know how they relate to each other. So we just have this big jumble of reads, and we need to computationally figure out what their ordering was along our genome sequence. Now, a key term that we're going to talk about, and it's probably come up already before in the course, is the idea of genome coverage. Now, to sequence our genome and to assemble it, we need to have redundant information; it's not enough to sequence every base a single time. We need to sequence enough bases such that the reads are going to overlap on the genome, so we can find overlaps and then merge the reads together. So we typically aim for high coverage sequencing of around 30X, which is like what Matthew was talking about. The way that we define coverage is the average number of reads across a single position of our genome sequence. So here we've sequenced 177 nucleotides in total. Our genome sequence is 35 nucleotides. So the average coverage is 177 divided by 35, which is about 5X. You wouldn't want to try to assemble a genome from only 5X coverage; we typically want to get something like 30X, so the reads overlap by quite a lot. They have long overlaps, and we're almost guaranteed to have a read covering every position of our genome. Okay, so the basic principle behind genome assembly is that in the absence of this information about the order of the reads along the genome, we can try to approximate it by looking for reads that have very similar sequences.
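As a quick illustration, the coverage calculation is just the total number of sequenced bases divided by the genome length. This small Python sketch is my own toy function, not code from any assembler:

```python
def average_coverage(read_lengths, genome_length):
    """Coverage = total sequenced bases / genome length.
    read_lengths: lengths of the individual reads.
    genome_length: length of the genome in bases."""
    return sum(read_lengths) / genome_length

# 15 reads of 100 bases over a 500 bp genome gives 3X coverage.
depth = average_coverage([100] * 15, 500)
```

For a human-scale project you would plug in, say, a billion 100-base reads over roughly three billion bases to land near the 30X target mentioned above.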
So the idea is that if we find a read where the suffix, or the end, of one read matches the prefix, or the beginning, of another, where they have very high sequence similarity, where they share a lot of bases that are matched to each other, there's a pretty good chance that they came from the same position of the genome and they're overlapping that position. So in this example here, because the suffix of this read matches the prefix here, we can infer that they probably came from overlapping regions of the genome, and we could merge their sequences together to build a slightly longer sequence by just adding this terminal suffix CC onto here. And that's extended that read by a couple of bases, okay? Is this all clear, why we want to find overlaps between our reads? So genome assembly is sort of a classic problem in bioinformatics for people who are interested in sequence analysis algorithms. And the algorithms have been in development for probably over 30 years now, since Sanger originally developed his sequencing method. I think the very first genomes were assembled by hand, by comparing these sequencing gels to each other and just lining them up into longer and longer pieces, but it was pretty obvious that we'd have to use computers to do this once we started sequencing things that were more than maybe 5,000 bases in length. So the classical way of doing genome assembly was using what's called the overlap-layout-consensus paradigm. I'm going to walk through those three different steps in a minute. And those methods were used for all the genomes that were sequenced with Sanger sequencing; they used this overlap-layout-consensus method of assembly. When short read sequencing, like Illumina sequencing, came into prominence, we found that we needed to develop entirely new classes of algorithms to handle the fact that the reads were very, very short and in huge abundance, where you could have a billion reads sequenced from a human genome.
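The suffix-prefix merge just described can be sketched in a few lines. This is a simplified exact-match version of my own (real assemblers tolerate mismatches and use much faster indexing); the function names and the 3-base minimum are illustrative choices:

```python
def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that exactly matches
    a prefix of read b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def merge(a, b, min_len=3):
    """Glue two overlapping reads into one longer sequence,
    appending only b's non-overlapping tail."""
    n = longest_overlap(a, b, min_len)
    return a + b[n:] if n else None
```

So merging `"TTACGGA"` with `"CGGATCC"` finds the 4-base overlap `CGGA` and extends the first read by the terminal `TCC`.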
So my start in bioinformatics was developing algorithms for short read assembly using a data structure that's called the de Bruijn graph. And that's something I'm going to describe a little bit later on. But now that long reads are back in prominence with PacBio and Nanopore, as I described yesterday, these overlap-layout-consensus methods of assembly are back being used for this type of data. And the reason that we need to use different methods for short reads versus long reads is that they have very different characteristics. Long reads are maybe 10,000 bases in length but have a 15% error rate. Short reads are 100 bases in length but very, very accurate, where they might only have a 1-in-200 per-base error rate. So the algorithms that we use have to be matched to the type of data that we have. So I'm going to walk through example pipelines for long and short read assembly. Here's what a long read assembly pipeline looks like. We start with our read sequences that come out of our sequencer, let's say the PacBio sequencer or the Oxford Nanopore sequencer. We're then going to build what's called an overlap graph, and I'll give an example of what an overlap graph is. We're then going to construct a layout, which means we're going to try to figure out the ordering of the reads along the genome. We're then going to merge their sequences together into contigs, which are the assembled segments of the genome. We're then going to run what we call a consensus algorithm to get over sequencing error and improve the accuracy of our genome assembly. We're then going to output that into a FASTA file, which is our assembly in a set of contigs. So the first stage is finding overlaps. To find overlaps, we're going to start with our big list of sequencing reads. In this example, it's only a little bit over 10 reads. But for a real sequencing project, it's going to be a million reads.
And then we're going to start taking pairs of reads, comparing their sequences to each other, and figuring out where reads have a highly similar subsequence at the end of one read and the beginning of the other. So just like I showed in the previous example, the suffix of this read matches the prefix of this read. So we're going to say that they overlap, and we're going to start to construct an assembly graph where each one of our reads is a node, or a vertex, in the graph. And we're going to add an edge from the read that has the suffix to the read that has the matching prefix. And we're going to do this for essentially all pairs of reads in our sequencing data set. And we're going to build up this huge graph which models the relationship between these reads, where an edge going from one read to another read says that they have an overlap, and maybe they came from adjacent positions in the genome that we sequenced. Now, here's an example assembly graph for a very short sequencing project where we're just sequencing this string, and we've taken seven base pair reads. So each one of these seven base pair reads is a vertex in our graph, and we've put an edge between them if they have an overlap. So here, this read ends with the sequence A-T-T-A-T. This read begins with the sequence A-T-T-A-T. We put an edge between them and we denote that as a five base pair overlap. Just for this half of the room, we're looking at this read, and the suffix of this read matches a prefix of this read, so we put an edge between them. And likewise, the suffix of this read matches the prefix of this read, and so on. So each time we see an arrow, or an edge, in this graph, it's just saying that there's a relationship between the suffix and prefix of those two reads. Yeah? Yeah, exactly, we would. Here, we've only put an edge between reads that have an overlap of at least three bases. This is just a toy example.
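The all-pairs overlap graph construction described above can be sketched like this. It's a deliberately naive quadratic version of my own (which is exactly why, as discussed later, this approach breaks down for a billion short reads); containments are skipped by never testing the full read length:

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Longest suffix of a matching a prefix of b, at least min_len
    bases; full-length matches (containments) are not considered."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """One vertex per read; an edge (a, b, olen) whenever a's suffix
    overlaps b's prefix by at least min_len bases."""
    return [(a, b, n) for a, b in permutations(reads, 2)
            if (n := overlap(a, b, min_len))]
```

Real long-read overlappers avoid the all-vs-all string comparison entirely by indexing k-mers or minimizers first, which is what makes tools like minimap2 fast.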
If you're sequencing with PacBio reads, where maybe the read length is 10,000 bases, you might put a minimum overlap length of, say, 1,000 bases. The reason you want to put a minimum length on the overlap is that you don't want spurious matches between repetitive sequences in the genome. And the general principle is that the longer your overlap length is, the less likely it is to be just a repeat-induced overlap. How do you determine the threshold? Typically, the assembler will do it for you. They'll have it built in. The things you want to look at are the coverage of the genome. If you know the coverage of the genome and your average read length, you can calculate the probability of finding a 1,000 base pair overlap. And the assembler will take all that information in and then set the length of your overlap based on coverage and read length and genome size. So it's essentially automatically done. And if you want to read about the theory behind this, it's called Lander-Waterman statistics, where they developed this model of sequencing coverage and how long your overlaps should be depending on how much you sequence; it's sort of a classic paper in this field. Does the overlap have to be at the ends, or can one read be entirely inside another read? Yeah, it can. That's a special case of overlap that we call contained reads, where one read is a perfect or almost perfect substring of another read. Typically, when the assembler sees that a read is contained within another one, it will just discard it from the assembly. It doesn't actually add any information about how the contigs should be built. So you just throw out any reads that are contained. So it won't be in the graph? It won't be in the graph. It'll just get discarded. That's right. All right, so we have our overlap graph here. It's not too complex. We maybe can see a path through here that reconstructs this string. Of course, this is just a toy problem and real genomes are gonna look a lot more awful.
Okay, so we've built our overlap graph. The next step is to construct a layout by bundling stretches of our overlap graph into contigs. So I'm gonna quickly move from giving examples in DNA to giving examples in words, just because it's a little bit easier to visualize what a repeat looks like in one of these graphs. So we're now gonna construct the overlap graph for this fragment of a famous song from the 1960s, which goes 'to everything, turn, turn, turn, there is a season'. So you can see why I picked this string. There's this repeat of the word 'turn' three times in the middle of the string. So this is a good way of visualizing what happens to these assembly graphs when we have repeats. So we've broken up our string into pieces of length seven again, and the individual pieces of length seven are the vertices in our graph. Then we found overlaps between these words, and the assembly graph is at the bottom here. This is quite a bit more complicated than the example we saw before. We can see there's some structure going on here, something quite awful happening here, and then something a little more simple here. But just by looking at this graph by eye, it's not really apparent what the structure of this sentence is. So we need to first do a step to clean up the graph a little bit. One step that we can do to clean up the graph is removing redundant edges from our assembly graph. So let's look at a structure that has these blue edges and this green edge here, and up here, these two blue edges and this green edge here. We say that this edge from this vertex down to this vertex is a transitive edge, as it spells out the same sequence as these two blue edges here. So this transitive edge, which bypasses this node, is actually giving us redundant information in our graph and we don't actually need it. It doesn't add any information, just like those contained reads.
So we're gonna pre-process the graph by looking for these transitive edges, which are edges that just bypass one of the nodes, like here or here, and remove them from our overlap graph. So let's look at transitive edge reduction. Here's our original graph, and we're gonna get rid of edges that bypass a single node like this. So we've gotten rid of those edges and the graph is now quite a bit cleaner than before. We can see some more structure here. It's a little bit simpler by getting rid of those edges. But still, there are some redundant edges, like this one that bypasses here or this one that bypasses here. So we're gonna take this a step further and now remove edges that bypass two nodes. So we've gotten rid of second degree transitive edges and now we have a much simpler graph structure. We can see that there's this linear chain of nodes here, this linear chain of nodes here, and then something more complicated going on here. And now this is exactly what an assembler wants to see. It doesn't want these graphs that have these tangles and edges going in all these different directions. It wants these linear stretches, because whenever you see a linear stretch in the graph, you can unambiguously merge those sequences together without any chance of a misassembly. So whenever you see a linear unbranched path, you can just say, okay, this is a contig. I know this is a unique stretch of the genome. I'm gonna put those together and I'm gonna stop. So that's exactly what we're gonna do here. We're gonna find our linear stretches here. The first one spells out this fragment, 'to everything, turn'. This one spells out 'turn, there is a season', and then we have this unresolvable repeat in here, which has a branching structure that we can't resolve without having a chance of making a misassembly.
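The first-degree reduction step can be sketched like this: drop any edge u to w when some two-step path u to v to w spells out the same walk. This is my own structural simplification; production assemblers follow Myers' transitive reduction algorithm, which also compares overlap lengths, and they remove transitive edges of all degrees, not just one:

```python
def reduce_transitive(edges):
    """Remove edges that bypass exactly one node: keep (u, w) only
    if there is no v with both (u, v) and (v, w) in the graph."""
    succ = {}
    for u, w in edges:
        succ.setdefault(u, set()).add(w)
    kept = []
    for u, w in edges:
        # (u, w) is transitive if some successor v of u also reaches w
        if not any(w in succ.get(v, ()) for v in succ[u] if v != w):
            kept.append((u, w))
    return kept
```

Removing edges that bypass two nodes, as on the next slide, would take the analogous check over three-step paths.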
And the reason we have this repeat is because you could have had three turns there, turn, turn, turn, or you could have had four turns, turn, turn, turn, turn, and you don't know the actual copy number of that repeat. If we had longer reads, if we had taken longer stretches that went all the way across those three occurrences of that word, we could have resolved it uniquely, but unfortunately we don't, we only had seven character reads here, so we had to stop. So we put out two contigs for the unique stretches of the genome. In the unresolvable repeat, there's also like a transitive edge, right? This edge, which one? The bottom one. This one? Yeah. It's actually not transitive because, it's hard to see on here, but the arrow is pointed into this one, in this direction. If the arrow was on the other side, then it would be a transitive edge and you could remove it. Yeah, but this edge, if the arrow was going the other way here, we could remove it, but because it's going back, it's a loop rather than a transitive edge. So this loop would allow the assembler to just keep cycling in this position and keep inserting the word turn, turn, turn, turn, turn, turn, turn. We don't want that, so we're just going to give up on trying to assemble this piece. Yeah. It does appear to have a transitively inferable edge if you skip three nodes. No, it's just for illustration here. You could get rid of this edge. Yeah, that's all fine. But fundamentally, the problem is this loop. Right, but you could keep simplifying three nodes, four nodes, five nodes. Real assemblers will get rid of all transitive edges, not just one or two jumps. But here, as it says, for illustration we only did two steps. There's another question over here; thought somebody had a hand up. I just want to know, when a new genome is published, are those unresolved positions something that would be fixed later? Yeah, absolutely.
So when we were sequencing genomes with Sanger sequencing, there was quite a high standard to publish a genome. People would want very long contigs. They'd want the contigs built into higher order structures which we call scaffolds, which I'm going to come to in a little bit. But since high throughput, short read sequencing came out, the standards for publishing a genome dropped, and it was OK just to assemble things into 10,000 base pair pieces and then put them into a genome database. And people do go back, get long reads, get different types of information to try to finish or improve the genomes by filling in these repetitive regions. Yeah, that's right. The reads will span every position of that repeat. But all the assembler will know is that there are at least three copies. It could be four copies or five copies or any number of copies, just based on the structure of the graph. So it can't resolve the number of copies, which is the problem here. When we go back to constructing the assembly graph, the overlap detection part, we'll run an overlap detection algorithm; there are standard programs out there, and a really fast one is called minimap2. And it will output these types of relationships. But if this read, say, was four bases shorter, so that it only aligned to this position, we'd be able to tell from the output of minimap2 that it ended internally within this read, and then we would discard it. So the repeat, in this case, only becomes apparent when we look across multiple nodes and we see that cycle in the graph. Does that make sense? So in the two contigs that it outputs, does it show the repeat? Yeah, it will show. Let's go back. Yeah, so both contigs start with part of the repeat. And then embedded within this structure here, there are multiple other copies. So the assembler has essentially said, OK, I'm going to have two copies of this at least, but it's possibly more, varied in that structure.
Yeah, in this case there are more than two. It outputs two, but it knows, just because of the fact that there's a loop in here, that there are more copies. But it doesn't know how many. Would the assembler only throw out strings with perfect matches to another string, or if there are some mismatches, would it keep them? It's another one of these threshold parameters. When we're finding overlaps, how many mismatches are we going to tolerate in the overlap? Typically, it's set to be a percentage, like, say, 3% or 4%. But of course, it depends on your sequencing technology as well. So when it finds overlaps and it finds containments, it will throw a read out based on the percentage of differences between them. If you have two PacBio reads, which have a 15% error rate, you're essentially doubling the error rate when you compare the two. So it might allow up to, like, 30% differences, which was why, when PacBio reads first became available, it was thought to be virtually impossible to assemble them, just because the error rate was so high and finding overlaps with 30% differences is quite daunting. But then, as with all good scientific work, somebody just showed it was possible. And then we all went, OK, it is possible, so let's work on that. OK, so the last step. Sorry, was there a question that I skipped? I want to make sure I got to everybody. Are we OK with the layout? OK. So the last step is this consensus step, where we want to compute the final genome assembly. As we've talked about, all reads have sequencing errors, and we don't want to just stitch the read sequences together, or else we're going to start adding sequencing errors into our assembly. So what we want to do is take all of our reads and then take the majority base at every position of our reads as the output of our genome assembly.
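The majority-base idea can be sketched as a column-wise vote over a precomputed multiple alignment. This is a toy of my own, assuming the reads are already padded to equal length with '-' gap characters; real consensus modules (such as Canu's) do far more, including realignment:

```python
from collections import Counter

def consensus(alignment):
    """Majority vote over each column of a multiple alignment.
    Rows are gap-padded strings of equal length; columns where the
    gap character '-' wins the vote are dropped from the output."""
    out = []
    for col in zip(*alignment):
        base, _ = Counter(col).most_common(1)[0]
        if base != '-':
            out.append(base)
    return ''.join(out)
```

So three aligned reads out-vote both a substitution error and a spurious single-read insertion.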
Because we're taking the majority, we can imagine it as if each read had a vote on what the true base is at every position. So we call this the consensus algorithm. We're taking a consensus over every one of the reads. To do this, we construct a multiple alignment where we line up all of the reads against each other. And then we just iterate over every column of this multiple alignment and take the most frequently observed base, and that goes into our final genome assembly. So here, one read had a little insertion here. The other reads said that there was no base there. So we put a gap in the final consensus sequence. And likewise, although we have sequencing errors at some positions, we take the majority, which is the true nucleotide sequence, and we get a high accuracy consensus down here. So that's how long read assemblers work. Later on, we're going to run the Canu assembler, C-A-N-U, which is by Adam Phillippy's group at the NHGRI. And it's basically doing these three steps of finding overlaps, constructing a layout, and then calling a consensus sequence. Very early on, when we had Illumina data, people would try to use these assemblers on Illumina data. You'd try to run the Celera assembler on Illumina data, and it would just completely fail. The reason it would fail is that when we construct our overlap graph, we're essentially comparing all pairs of reads to each other. And if you have an Illumina run for a human genome, you might have a billion reads. That's a billion squared pairs of reads, which is far too many to practically screen to find overlaps between them. So we had to develop an entirely new class of algorithms, which was pioneered by Pavel Pevzner at UC San Diego, and which is based on a data structure called the de Bruijn graph. So here's what a short read assembly pipeline looks like. We're gonna start by error correcting our reads. So we're gonna look for sequencing errors in our reads and try to get rid of them.
We're then going to construct our de Bruijn graph. We're then gonna process the graph to clean it up of edges that we don't want, similar to those transitive edges that we just talked about for long read assembly graphs. We're then gonna assemble our contigs and then do some paired end resolution by constructing scaffolds and filling in the gaps. So here is an error profile for a set of Illumina data. Here are six different Illumina data sets. This is data that we used in an assembly benchmarking competition called the Assemblathon. I'm gonna come back to the Assemblathon later on. We have a yeast data set, a fish, which was a Lake Malawi cichlid, a boa constrictor snake, a human genome, a parakeet, and an oyster genome. The oyster wasn't part of the Assemblathon, but it's a useful data set to describe anyway. This is essentially looking at how often bases are incorrect as a function of the position of the base within the sequencing read. And just as I mentioned yesterday when talking about sequencing technology, the classic Illumina profile is that the error rate increases towards the three prime end of the read. And we see this in each one of our data sets, where the reads are very, very accurate at the beginning, and then as you go into these later cycles, the error rate increases. Now, these errors are gonna cause erroneous structures in our assembly graph, and they're actually gonna balloon the amount of memory it takes to run a genome assembly. So what we want to do is run a pre-processing step on our data, which is going to error correct our sequencing reads. And the way that we error correct our sequencing reads is by looking at what we call k-mer counts. So let's consider a read that has a single sequencing error sort of near the end. It's this red C here. Now I've already talked about why we wanna sequence the genome deeply. It's because we want redundant information.
We wanna sequence to 30, 40, 50x depth, which is gonna give us power to find overlaps, but it also helps us error correct our reads. We just count the number of times the short subsequences of our reads, which we call k-mers, occur; in this case, we're taking 21 base pair sequences. We count the number of times each k-mer occurs across all of our sequencing reads. The k-mers that don't contain sequencing errors are gonna be seen many, many times, in this case, 40 times. And the k-mers that do contain sequencing errors are gonna be seen few times, in this case, just a single time. So whenever we have a sequencing error, that essentially flips a k-mer that's seen 40 or 50 times to a k-mer that's seen a single time. So this gives us an incredibly efficient way of finding where the sequencing errors are in our reads: we can just count the number of times these short words appear, and then if a count is lower than a threshold, we're gonna say it's an error, and we'll then try to fix the base to flip that from a low coverage k-mer to a high coverage k-mer. Now the reason this is efficient is that it doesn't involve any sort of dynamic programming. We're not aligning any reads to each other. All we do is insert these short words into something like a hash table, and then we can count the number of occurrences that they have. Do you do that for every read? You do that for every read, yeah, basically sliding a window through the whole read. So what you're gonna do is initialize an empty hash table, iterate over every read and every k-mer of that read, and insert it into the hash table. Once we've done that, we have a hash table that's populated with every k-mer and its frequency, the number of times it's been seen across our entire dataset. And the k-mers are based on breaking up what you see in a read? Yeah, exactly. So let's take just 5-mers here. The first 5-mer of the read is ACGAT, the second is CGATG.
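The hash table procedure just described is only a few lines of Python; this sketch of mine uses `collections.Counter` as the hash table (real counters like Jellyfish are far more memory-efficient, since a human dataset has billions of k-mers):

```python
from collections import Counter

def kmer_counts(reads, k):
    """Slide a k-wide window over every read and tally each k-mer
    in a hash table."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i+k]] += 1
    return counts
```

With 5-mers, the read `ACGATG` contributes `ACGAT` and then `CGATG`, exactly the sliding window from the example above.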
So we're just gonna slide that five base window across our entire sequence and insert those into the hash table. And so how are errors handled? Because if you're basing your k-mers on that read, you're gonna create k-mers that include the error. Yeah, exactly, but because errors tend to be rare and uniformly distributed across the genome, we expect that whenever we have a sequencing error, it creates a unique k-mer that's only been seen that single time, in that read. If it didn't have an error, we would see it a number of times in proportion to whatever our sequencing depth is. So it allows us to distinguish between reads with errors versus reads without errors just by looking at these k-mer frequencies. Right, so once we've identified the k-mers that contain sequencing errors, we just start flipping the bases to see if one of the changes turns it into a k-mer that's seen 40 or 50 times. If it has, we make that change and output a new set of reads with the corrected sequences. So there's a lot of software that will do this. One of the first k-mer-based error correctors was called Quake. There's one that I wrote in the assembler SGA, SOAPdenovo has one, there's BFC, BLESS, Lighter, Musket, and probably many others since I made these slides. The reason that we do this is that the older way of doing error correction was looking for pairwise overlaps between the reads and then calculating a multiple alignment. That's far too slow for large Illumina data sets, so we use these k-mer methods. All right, we've error corrected our reads. Now we want to construct our graph. And as I mentioned, we're going to use what we call a de Bruijn graph. Now the de Bruijn graph is the fundamental data structure for short read assembly, so I'm going to take some time to go through it. It's also based on k-mers, just like the k-mers that we saw for error correction.
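The flip-and-check correction just described might look like the sketch below. The function name, the threshold of 3, and the single-error assumption are all my own simplifications; real correctors like Quake, SGA, and BFC are far more careful (they handle multiple errors, read ends, and solid/weak k-mer boundaries):

```python
from collections import Counter

def correct_read(read, counts, k, threshold=3):
    """Find a position whose covering k-mers are all rare, then flip
    it to the base that makes them all frequent again. counts is a
    k-mer count table (e.g. a Counter, so missing k-mers count as 0).
    Returns the corrected read, or the original if nothing qualifies."""
    for i, orig in enumerate(read):
        lo = max(0, i - k + 1)
        hi = min(i + 1, len(read) - k + 1)
        kmers = [read[j:j+k] for j in range(lo, hi)]
        if kmers and all(counts[m] < threshold for m in kmers):
            for b in "ACGT":
                if b == orig:
                    continue
                cand = read[:i] + b + read[i+1:]
                fixed = [cand[j:j+k] for j in range(lo, hi)]
                if all(counts[m] >= threshold for m in fixed):
                    return cand
    return read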
And to construct the de Bruijn graph, we're going to take all of our reads, and here we have five reads, each six bases in length. We're going to take the first four bases of this read, which is our k-mer size, and we're going to insert it into our graph: CCGT. We're then going to take the next four base pair subsequence of this read, CGTT, and we're going to put it into our graph. And because these two k-mers were adjacent in a read, we're going to put an edge between them here. And we're going to do that for every k-mer in our read set. Whenever we see k-mers that are adjacent in a read, we put an edge. So for the benefit of this side of the room, we're going to put a vertex in the graph for this k-mer, CGTT, here. And we're going to put an edge, sorry, a vertex in the graph for this k-mer, GTTA, here. And because they're adjacent in our reads, we put an edge between them. Now this seems like a silly thing to do. We're taking our short reads, and we're breaking them up into even shorter pieces. Why would we do this? Well, this is really, really fast to do. You can do this just using hash tables, and computer scientists and programmers like me spend a lot of time optimizing the performance of hash tables. So we can do this in a few hours for an Illumina data set for a human genome, rather than months or years if we were trying to construct overlap graphs. And it also has a really interesting property in that it handles repetitive regions of the genome really elegantly. Now this k-mer in red here is special in that it has two different extensions. It's connected to this k-mer, GTTA, and this k-mer, GTTC. So we'd say that this k-mer is a repetitive k-mer in our genome, and the graph branches, where there's two possibilities: we either follow this path or this path here. So why this makes it nice to work with is that it compacts all of the repeats in our genome down to this k-mer level information.
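Matching the slides, this sketch builds the graph with one vertex per k-mer and an edge between k-mers that sit next to each other in a read (textbooks often instead put (k-1)-mers on the nodes and k-mers on the edges; the two views are equivalent). The function name is my own:

```python
def de_bruijn_edges(reads, k):
    """One vertex per k-mer; an edge between the two k-mers that
    appear adjacent (shifted by one base) within some read. Using a
    set means repeated k-mer pairs collapse onto the same edge."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i+k], read[i+1:i+1+k]))
    return edges
```

On reads containing `CCGTT` followed by either `A` or `C`, the node `CGTT` ends up with two outgoing edges, `GTTA` and `GTTC`, which is exactly the branch at the repetitive k-mer described above.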
We don't have this explosion of edges like we saw in that sentence example before. We just have these k-mers that are branch points in our graph. So it makes a very compact representation of the repeats in our genome. So it depends on your read length, right? Because it can also keep expanding. Yes, yeah. If there are sort of nested repeats it can keep expanding, but in an overlap graph the number of edges grows quadratically with the repeat copy number. In this graph, each repeat copy gets collapsed down to single nodes, so it grows linearly with the repeat copy number. So it has a much better growth structure than overlaps. Now, this graph is interesting: there is actually a unique path that reconstructs the genome. If you start here, follow this path, loop back around and go up here, you have a unique reconstruction of the genome. But again, this is just a toy problem; for real assemblies there isn't usually a unique path. All right, we're not quite ready to assemble contigs yet. Next, we need to clean up our graph of erroneous edges. There are two types of erroneous structures that we need to get rid of. One type is called tips, which are short branches off our graph that just terminate nowhere. So we have this nice linear part of the graph, which is what our assembler wants, and then it diverges here for four nodes, and it diverges here for two nodes. The reason these structures appear is that sometimes we can't error correct every error in our reads. So we have a leftover error here and a leftover error here. Because errors tend to be unique, they just create these divergences that hang off here. These branches, these k-mers, all just correspond to errors and they don't connect anywhere. They're not giving us any information about the structure of our genome. So what we want to do is just trim those branches off the graph, removing those sequence errors.
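Tip trimming can be sketched roughly as follows. This is a simplified illustration rather than what any real assembler does (real assemblers also weigh coverage, and use a length threshold tied to k): find dead-end nodes, walk backwards along unbranched nodes, and if a branch point is reached within a few nodes, prune the hanging path. The graph representation is the same out-edge dictionary style as before.

```python
from collections import defaultdict

def trim_tips(edges, max_tip=4):
    """Prune dead-end branches (tips) of at most max_tip nodes.
    edges maps node -> set of successor nodes."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    rev = defaultdict(set)                 # reverse adjacency
    for u, vs in edges.items():
        for v in vs:
            rev[v].add(u)
    removed = set()
    for node in nodes:
        if edges.get(node):                # has successors: not a dead end
            continue
        path, cur = [node], node           # walk back through the tip
        while len(path) <= max_tip:
            preds = rev[cur] - removed
            if len(preds) != 1:
                break                      # ambiguous: leave it alone
            pred = next(iter(preds))
            if len(edges[pred] - removed) != 1:
                removed.update(path)       # pred branches: prune the tip
                break
            path.append(pred)
            cur = pred
    return {u: vs - removed
            for u, vs in edges.items() if u not in removed}
```

Note the asymmetry: a one-node dead end hanging off a branch gets pruned, while a long dead end (the genuine end of a contig) exceeds `max_tip` before reaching a branch and is kept.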
Now, there's another type of graph structure that we're interested in. If you sequence diploid genomes, which have heterozygous SNPs, your graph is going to have branch points wherever there are allelic differences. So if there's a heterozygous SNP at this position, at this blue G, we're going to have one path through the graph that follows allele one and one path that follows allele two, where they diverge at the SNP position and then come back together once our k-mer window has slid past the SNP. Now, the assembler wants linear structures in the graph, so it's going to look for these bubble structures and try to get rid of them. It usually just makes an arbitrary choice; usually it'll pick the path with higher coverage, but if it is a true heterozygous SNP then you just get a random allele in there. That's actually something that trips people up quite often: they want to know what SNPs are in their genome, but the assembler does not, so it will get rid of them. Some assemblers will write it to a file and say, I removed a bubble at this position and it had this alternative allele. Okay, so we've constructed a de Bruijn graph from our Illumina reads and it might look something like this. It's complicated, though not as complicated as what we saw from the overlap graph, but again, we want to clean it up to make the true structure of the genome a little more apparent. So first we're going to find all of the tips in the graph, these branches that lead nowhere. We're then going to, sorry, Microsoft wants to auto-update something, I do not want to let it. Right, so we're going to find these tips, traverse backwards until they rejoin the graph, and then prune off these structures. Already we can see that just by getting rid of these alternatives that lead nowhere, we have a much cleaner graph. We're then going to iterate over the points of divergence in the graph and see if they rejoin at some common vertex.
If they do, we'll say that there's a bubble there, call it allelic divergence, and then collapse it down by just removing one half of the bubble. Now we see the graph is quite clean, and what we can do is start compacting these linear stretches of the graph, which the assembler likes to see, into our contigs. We still have some repeats left over, like here and here, that have broken up our contigs, but that's typically unavoidable with short-read assemblies, just because the read length isn't long enough to cross all the repeats in the genome. If we have paired-end data, which has come up a few times in this course, we can do a little bit better though, in that we can try to scaffold our contigs together. What we mean by scaffolding is just constructing higher-order structures where we stitch our contigs together: we're going to say contig A is followed by contig B, but we don't know the sequence in between them. So we're going to skip over the repeats and just fill in the scaffold with Ns in between those contigs. The way that we're going to construct scaffolds is that we take our paired-end reads, align them to our contig sequences, and see where there's a group of pairs that align to the end of one contig and the start of another contig. I've color coded them here: for these blue pairs, half of the pair aligns here and half aligns here; for these red pairs, half aligns here and half aligns here; this one is purple, it's hard to see, half here and half here. So the fact that there are pairs going from one contig to another contig allows us to say, okay, these probably came together on the genome, let's join them up.
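Once the contig order and the gap sizes are known, building the scaffold record itself is trivial; here's a minimal sketch (my illustration) where each inferred gap between consecutive contigs is filled with that many Ns:

```python
def scaffold(contigs, gaps):
    """Join ordered contigs into one scaffold string, padding the gap
    between contig i and contig i+1 with gaps[i] 'N' characters."""
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gaps):
        parts.append("N" * max(gap, 1))  # at least one N marks the join
        parts.append(contig)
    return "".join(parts)
```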
So what the assembler will do is formalize this: it will look for contigs that have a statistically significant number of pairs joining them, it will then use the insert or fragment size distribution to estimate how far apart those contigs are, and it'll output a scaffold with Ns in between them based on the gap size that it inferred from that fragment size distribution. You can only do this if you have paired-end data? Yeah, you can only do it if you have paired-end data. Back when people were really interested in doing a lot of Illumina genome assemblies, they developed protocols for getting very long-range paired-end reads, called mate pairs, where you could sequence read pairs that were five kilobases apart. Those were really useful for scaffolding your genome, because you could jump over much longer repeats if you have pairs that are very, very far apart. Those aren't as common anymore, just because long reads have become fairly cheap to get. All right, in the final stage you can try to do a local assembly, like we were talking about for variant calling, to try to fill in those gaps, by using slightly more permissive assembly parameters and only looking at reads that align into those gaps between our contigs. You can also use other sequencing technology: if you have a draft genome assembly with Illumina scaffolds, you could then align PacBio data to your scaffold to try to improve it, like we were talking about a few minutes ago. Okay, what should you expect from a genome assembly? How good is your genome assembly going to be? For bacterial genomes, which are typically very small, on the order of a few megabases in length, if you have short reads your genome will probably assemble into a few hundred contigs, where the average contig length is about 10 to 100,000 bases. If you have long reads, like PacBio or Nanopore, you'll typically get a handful of contigs, maybe between one and five, and often the entire genome will be assembled into a single piece.
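The gap-size inference just described can be sketched as follows, under the simplifying assumption that for each linking pair we know how many bases of the fragment land in each contig; whatever is left of the mean fragment length is an estimate of the gap, and averaging over pairs smooths out the fragment-size variance:

```python
def estimate_gap(mean_insert, spans):
    """spans holds (bases in contig A, bases in contig B) for each
    linking read pair; the rest of the fragment length is the gap.
    Average the per-pair estimates."""
    estimates = [mean_insert - a - b for a, b in spans]
    return round(sum(estimates) / len(estimates))
```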
The reason for this is that in many bacterial genomes the longest repeat is around 6,000 bases, which is the ribosomal RNA sequence, and if you have reads that are longer than 6,000 bases you'll typically be able to span that repeat and then assemble your genome. If you have large genomes, like a human genome or say some crop genome, your contigs from short reads will typically be around 10,000 bases in length; from long reads they'll be a few million bases in length. Of course, long read data is much more expensive to generate, so whether that's the right technology depends on how many genomes you want to sequence and what you want to get out of your genomes. Okay, that is the assembly theory section. Are there any questions? Yes. [Question about how sequencing coverage affects these numbers.] Yeah, it does. If you have low coverage, like less than 30x, you're going to have more contigs and they're going to be shorter. This sort of assumes, ideally, that you've sequenced to 30, 40, 50x, which is what most people recommend, for what length of contigs you'll get. If you did 30x, from short reads you'd get about 10,000 base pair contigs, and from long reads you'd get about a million or more. Was there another question? Yeah. About the motivation for doing assemblies to begin with, can you give an example of why I would want to use assembly versus a reference genome?
Sure. Well, the best example is if you don't have a reference genome, and that's the primary use: there are a lot of bacterial genomes that haven't been sequenced before. We're here in a cancer research institute, and obviously there's a very high quality human reference genome, so most of our analysis will be reference based. But there are pitfalls to aligning to a reference. Things like finding very complex structural variation don't work very well when you're aligning to the reference. And there's the issue of novel sequence: if we take any one of us, sequence our genome, and compare it to the reference genome, there's probably something like five to ten megabases of sequence that just isn't present in the reference at all, and you have no information about that sequence if you use a reference-based strategy. So essentially, any time you want to find something that's more than SNPs, small indels, or very clean structural variants, it's going to be better to use an assembly strategy. It's sort of self-serving to say this, because I work on genome assemblies, but as long read technology becomes better, I think we're going to switch from things being primarily alignment based to being primarily assembly based, and I think we're at that inflection point now, where it's starting to be possible to do multiple long-read human genome assemblies as another source of reference material. When you do assemblies, you don't have a way to compare yourself within yourself, right?
Well, if you do an assembly of a human genome, then you would align it to the reference to do your comparison. If you're doing a complete de novo project and you have no reference information, then there's no direct way to verify it, but you can get orthogonal sequencing data. Typically, if you do a PacBio assembly, you'll then get some Illumina data to map to your newly assembled reference to assess how accurate the genome is. But yeah, you'd need to use different, orthogonal ways to verify that your assembly is correct. Okay, any more before I move on to the last part of this talk? Okay. So this next part is about the Assemblathon project. The goal of the Assemblathon, as I mentioned already, was to benchmark genome assemblers. A group at UC Davis went out and sequenced three genomes that don't have a reference genome. They released them to the assembly community, people like me who develop de novo assemblers, and asked us to run our assemblers, and then they did experimental comparisons to say which the best assemblers were. And the conclusion, which you can read yourself in the citation here, is that essentially there's no single approach that works well for all genomes or all species. Some assemblers did really well on one of the three species and then poorly on the others, and some species were really easy to assemble by any of the assemblers tested while others were more difficult. So this got me thinking about the question of what makes it difficult to assemble a given genome, and I want to throw that out to you. Don't look ahead to the next slide, because that's where the answers are, but what kind of features would make a given assembly difficult, either about the genome or the data quality? We've heard a couple already. Repetitive sequence, exactly, that's a big one: how repetitive is the genome you're trying to sequence? GC content? GC content, exactly, both because it's going to make the genome look more repetitive
if you have very biased sequencing data, but also because it's much more difficult to get uniform coverage across the genome if there's extreme GC content. Low coverage? Low coverage, exactly. If you don't know the genome size, you might say, okay, it's one gigabase, and order the amount of sequencing data to give you 30x coverage for a one gigabase genome. If it turns out to be two gigabases, you only have 15x, and your genome is going to be very difficult to assemble. Sorry? Poor quality reads, exactly. If your reads have a very high error rate, it's going to make it more difficult to error correct the data, and it also lowers the effective coverage of your data set. Anything else? Yeah, it's going to be difficult to do the analysis of whether your assembly is good or not, for sure. If it's a tumor sample? Yeah, exactly. What I'm going to take from that is that tumors can be seen as a population, where there's variation in the population at different frequencies; you can have mutations that are clonal. The analogy in assembly is that often, if somebody's sequencing something that's quite small and difficult to get a lot of DNA from, they might take multiple individuals and pool the DNA. Assembling from a population is very, very difficult, because you're introducing all sorts of variation between the individuals in the population, and it's the exact same problem as trying to assemble a cancer genome. Anything else? Another one, launching off that, is heterozygosity in the genome. Human genomes have a SNP about every thousand bases, and that's not too difficult for the assembler, resolving those bubble structures, but there are some organisms, as we're going to see, that have a SNP every hundred bases, and those are very, very difficult to assemble. I think that's a pretty good list. Anything else before I
switch over to the next slide? So we got most of them: repetitive sequence, high heterozygosity, low coverage, biased sequencing because of GC content, high error rate. Something we didn't get is chimeric reads. Often, if your library prep goes wrong, you might end up sequencing two different DNA molecules in the same read. The assembler really, really does not like that, because it looks like structural variation, so you want to avoid chimeras at all costs. If you have sequencing adapters in your reads, those also look like very high copy repeats, so you want to avoid having any sequencing adapters; you can use the trimming programs that we heard about to get rid of those. Sequencing multiple individuals, like the population case, or having your sample contaminated by somebody else's sample that they're preparing at the same time, is also going to make your assembly much more difficult. All right. So, sometimes when you're doing a library prep, particularly PCR-based, PCR can ligate two different molecules of DNA together. Then, when they get sequenced, a read that came from two independent pieces of the genome gets put into the same read, and we call that a chimeric read. They confuse the assembler because they create a path through the graph from one part of the genome to a different part of the genome, so they make it look like there are repeats, but they're just artificially induced repeats from a flaw in the library preparation step. They typically occur at a very low rate, so you don't have to worry about them too much, and the assemblers will typically have special code to look for structures that are only supported by one or two reads and try to get rid of them. Okay, so with all these things in mind, I set out to write a program that would take a raw set of Illumina reads and assess whether any of these factors are going to cause difficulty during your genome assembly.
This program is called preqc, for pre-assembly quality control, and it's similar to the FastQC program that we saw early on, except it's more focused towards doing a de novo genome assembly. The way that it works is that it's based on k-mer coverage, just like we saw for error correction. We take short k-mers, count how many times they occur, and this gives us a lot of information about the different properties of our genome. So here's a histogram of how many times 51-mers occur in this test human data. There are two features I want to point out. There's this large peak here, where the k-mer coverage is around 25 to 30x. Those are genomic k-mers; these are the k-mers that are useful for genome assembly, and they occur roughly 25 to 30 times, which was our sequencing coverage for the sample. There's another peak, up here, at a k-mer count of one. Those are k-mers that contain sequencing errors; those are the ones that we want to correct. So just by looking at the relative height of these two peaks, we can essentially figure out how accurate our sequencing reads are. If the reads have more errors, the height of this singleton k-mer peak is going to be higher; if there are fewer errors, it's going to be lower. But there's another feature that's not so apparent on this histogram. If we look at this other data set, we see there's something odd going on. This is an oyster data set that I've included just because I love showing this plot. Anybody want to guess what's happening, why we have this extra mode, this extra peak, in this distribution? It is diploid, and you're close. So what would these two peaks be if this is a diploid genome? Exactly, heterozygosity. So this oyster genome, and they only found out about this, or
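The histogram in question is easy to compute once you have the k-mer counts. A small sketch (my illustration, not preqc itself): hist[m] is the number of distinct k-mers seen exactly m times, and the genomic coverage peak is the mode of the histogram once the low-multiplicity error peak is ignored. The cutoff of 4 is an illustrative choice.

```python
from collections import Counter

def kmer_histogram(counts):
    """hist[m] = number of distinct k-mers that occur exactly m times."""
    return dict(Counter(counts.values()))

def coverage_peak(hist, min_mult=4):
    """Mode of the histogram, skipping the low multiplicities that are
    dominated by sequencing-error k-mers."""
    return max((m for m in hist if m >= min_mult), key=lambda m: hist[m])
```

On a toy histogram with a big error spike at count 1 and a genomic peak near 26, the function picks out the genomic mode, which corresponds to the sequencing coverage described above.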
it only became apparent after they sequenced it, has a SNP about every eighty bases. It's extremely heterozygous, so heterozygous that we can quite clearly see this cluster of k-mers that have half the depth of the others. So we can partition this into k-mers that are homozygous, present on both alleles, and heterozygous, present on only one of the two. Now, this assembly failed spectacularly. It did not work at all, because this genome has such high heterozygosity, and they had to use an entirely different way of sequencing to get around the fact that there's so much heterozygosity. If we flip back to the human one for a moment, if we look really closely we can see a little shoulder on this distribution here, which is the heterozygous k-mers in the human genome. They add a little bit of density at half the coverage peak, but not so much as to affect the assembly, because only about one in a thousand k-mers have that property. All right, so we can take this a step further: build the entire de Bruijn graph, process the graph, and classify the structures that we've talked about into different classes. We can have structures caused by sequencing errors, in red; structures caused by SNPs and indels, heterozygosity, in green; and structures caused by repeats, in blue. I'm not going to go over the model of how we do this classification, but if we run this on these different Assemblathon genomes, we can use it to figure out how heterozygous the genomes are. So here, for this oyster genome, the heterozygosity is calculated to be about one in a hundred bases,
and the human genome is calculated to be about one in a thousand bases, with the other Assemblathon genomes falling in between those two extremes. So immediately, if you ran this program and looked at this data set, you'd see that the heterozygosity is about one in a hundred bases, and that should put up a big red flag that this is going to be a very, very difficult genome to assemble. Conversely, if you have a human genome, you see one in a thousand bases, and that's not too bad; the assemblers are pretty good at assembling genomes that are that heterozygous. We can also look at the frequency of repeat branches. Here the human genome and the oyster genome are pretty comparable in repeat content, with a repeat-induced branch around one in a few hundred bases, which is fairly well known for human genomes. And for the other genomes, like this bird genome here, we see that it's much less repetitive compared to the other ones, and when we went to assemble it in the Assemblathon data set, it was one of the easiest genomes to assemble. The program will also predict the genome size for you. So here the human genome is predicted to be, well, three gigabases, which is good, because that's about how large it is. These other genomes, like the boa constrictor, are about one and a half gigabases; the Lake Malawi cichlid is about a gigabase; and our easy control data set here, the yeast genome, is correctly predicted to be about twelve megabases. So what this program gives you is that you can take a raw set of sequencing data that you know nothing about, run it through this program, and it will estimate these kinds of characteristics for you. It's going to tell you how big the genome is, roughly how heterozygous it is, and how repetitive it is, and that will give you some information about how difficult the assembly is going to be.
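The genome-size prediction works from the same k-mer histogram. Here's the standard back-of-the-envelope version of that idea (a sketch of the general k-mer method, not preqc's exact model): sum the k-mer instances outside the error peak and divide by the coverage peak, since each genomic position should have been sampled roughly peak-many times.

```python
def estimate_genome_size(hist, error_cutoff=3):
    """hist[m] = number of distinct k-mers seen m times. Total genomic
    k-mer instances divided by the coverage peak approximates the
    number of distinct genomic positions, i.e. the genome size."""
    genomic = {m: n for m, n in hist.items() if m > error_cutoff}
    peak = max(genomic, key=lambda m: genomic[m])       # coverage mode
    total = sum(m * n for m, n in genomic.items())      # k-mer instances
    return total // peak
```

For an idealized 30x data set over a 1,000-base genome (1,000 distinct k-mers each seen 30 times, plus an error spike at count 1), the estimate recovers the genome size.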
It will also calculate basic summary stats, like the average quality scores of your reads across the length of the read; as we saw, quality scores decrease towards the three prime end of the read because of the Illumina sequencing error model. And it'll actually run a simulated genome assembly for you with different parameters, different values of the k-mer size that we use for the de Bruijn graph, and estimate how long the contigs will be for those k-mer sizes. As we see, the yeast genome, which is the easiest to assemble, has fairly long contigs, around thirty thousand bases, while this oyster genome, which is very difficult to assemble, has the shortest contigs, less than ten thousand bases, again just giving you some indication of how difficult these assemblies are going to be. I think it's time for the coffee break, so just to wrap up: I gave you an overview of how assemblers work, and tried to get across the idea that short-read and long-read assemblies require different methods because of the different characteristics of those data sets, and in the last few minutes we talked about what factors make a given assembly difficult or easy. I'll take any questions now, but if I don't get to your question, or you have questions later on when working through your own data set, feel free to drop me an email at the address here. Before coffee, any more questions? Yeah, so this one here, it's not exactly the same as the genome coverage that I described to you before, because k-mer coverage is going to be a fraction of genome coverage. We're at about 27 to 28x 51-mer coverage here, which corresponds to about 40x base-level coverage. It's a little confusing, but this is correlated with genome coverage. [Question about how to choose the k-mer size.] Yeah, that's a really good question, and something people, you know, when we started doing de Bruijn graph assembly,
a very common thing to do was to take your sequencing reads, run the assembler with all possible values of k, and then pick whichever one gave you the longest contigs. After a few years of experience, and once data standards solidified, with everybody getting about 40 to 50x coverage, the field essentially settled on using 51- to 61-mers. The k-mer size is a balance: you want to use long k-mers, because they're better at crossing repetitive regions, but if you use k-mers that are too long, your genome is not going to be completely covered by k-mers. So if you have a hundred bases per read, you might think, okay, great, I'm going to use 99 base pair k-mers or 100 base pair k-mers, but it's very unlikely that you have a read starting from every single position in your genome, so you need to shorten the k-mers such that the genome is well covered by the short subsequences.
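The relationship between base coverage and k-mer coverage follows from the fact that an L-base read contains only L - k + 1 k-mers of length k; a quick sketch:

```python
def kmer_coverage(base_coverage, read_len, k):
    """Expected k-mer coverage: a read of length L covers L bases but
    contributes only L - k + 1 k-mers, so coverage shrinks by that ratio."""
    return base_coverage * (read_len - k + 1) / read_len
```

With 150-base reads and k = 51 (the 150-base read length is my assumption), 40x base coverage gives about 27x 51-mer coverage, consistent with the numbers quoted for the human data set a moment ago. It also shows why k near the read length is a bad idea: at k = 100 with 100-base reads, only a read starting at exactly the right position contributes a k-mer, and the k-mer coverage collapses.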