Hello everybody and welcome to an introduction to genome assembly. My name is Simon Gladman and I'm a bioinformatician from the University of Melbourne in Australia. During this slide presentation, I'll be giving you a bit of an overview of what genome assembly is and why we want to do it, and then hopefully preparing you for the upcoming tutorial.

Okay, just a couple of requirements before we start. Hopefully you've all done the introduction to Galaxy analysis, so you've used Galaxy before and you understand how to use it. If not, please go and have a look at that video first. And hopefully you understand a little bit about sequence analysis, mainly around quality control. We're not really going to touch on quality control here, as it's covered in another topic, but quality control is vitally important in genome assembly.

Okay, so the question we're going to try and answer is: how do we perform a very basic genome assembly from short read data? So we're going to talk about de novo genome assembly. And I'd like to thank Torsten Seemann, Dieter Bulach and Ira Cooke for a lot of the slides in this presentation.

Okay, so we'll get started. We'll talk about de novo assembly, which is the process of reconstructing the original DNA sequence from the fragment reads alone. And why do we have fragment reads? Well, the simple reason is that none of our equipment can read a full genome at the moment, so we can only read small pieces at a time. So what we do is we break the DNA up into small fragments. We send it off to a sequencing lab somewhere. They determine the sequence of some parts of those fragments and then send those results back to us. And what we get is a file full of reads, which are pieces of the DNA that came from the sample that we sent. Now, unfortunately, these are usually in very, very small pieces compared to the length of the genome.
And so what we need to do is fit them together, almost like a jigsaw puzzle, to put them back together. We need to find reads that fit together by moving them around and comparing them against one another. There could be some problems though. There could be some missing pieces, and some of the pieces could be dirty, in that there might be some errors in them.

Another view of what we're doing: imagine you have a stack of newspapers and you put those newspapers through the shredder. Then you give them to a kindergarten class and you say to the kindergarten, hey, glue the original newspaper back together. And they try and glue a draft newspaper back together again. So the draft sequence is little bits of stuff stuck together, and the closed, original newspaper is what we're trying to get.

Okay, maybe it might work better if I give you an example. So here we have a string of characters. We'll call it a small genome: "Friends, Romans, countrymen, lend me your ears;". Okay, so this is one sentence. We send our sample containing that string off to a sequencing lab and they chop it up into little bits and make lots of copies. Then they read the bits that they can read of each of those fragments and send the resulting reads back to us: "ds, Romans, count", "ns, countrymen, le", "Friends, Rom", "send me your ears;" and "crymen, lend me". You'll notice that there might have been an error when the sequencing centre read our sample.

So what we need to do now is find all the overlaps. As you can see here, "Friends, Rom" kind of overlaps with this "ds, Rom" here, doesn't it? So we can see this "ds, Rom" overlaps with this "ds, Rom", and so maybe they fit together. This read and this read fit together, and so we'll lay it out like this. And then we'll say "ns, countrymen, le": oh look, there's an "ns, count" here. So maybe they overlap as well, and so on and so on.
And we find overlaps that make sense between our reads and then we lay it out like this. Okay, so you can see we've laid out all the reads with all their overlaps. The next step is to find what we call a consensus. That is, we look at the evidence of the reads and we try and figure out what the original sequence or sentence might have been.

So for this position, the first position, we have an F, and that's the only evidence we have. So maybe our majority consensus is an F, and so on for the R and the I and the E and the N. But here we have a D, and this is a bit stronger. We have two of our reads that suggest that there's a D in this position in our original sequence. So we can say with a bit more confidence that this is probably going to be a D. And this one here is probably going to be an S, and this one's going to be a comma, and then we'll have a space, and so on as we go along to here.

And then when we get to here, you see we have three pieces of evidence. We have three reads that overlap here. Two of them are saying there's a T here and one of them is saying a C. There are two T's versus one C, so maybe the C was a mistake. So we'll assume that our original sequence had a T in this position. And exactly the same with this position here: "le", "lend me" and "send me your ears;". We're going to say, well, perhaps this S is an error, because we have more evidence to say that it was an L. And so we have a final majority consensus of "Friends, Romans, countrymen, lend me your ears;". We have reconstructed our original sequence from our reads alone by doing overlap, layout, consensus.

So far so good. However, the awful truth is that one does not simply assemble a genome. Professor Mihai Pop, who did a lot of research into the area of genome assembly, actually stated that genome assembly is impossible under certain circumstances.
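The overlap, layout, consensus walk-through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the read spellings are reconstructed from the talk, the reads are handed over already in their left-to-right order (a real assembler has to discover that order), and the function names and the `min_overlap` and `max_mismatches` parameters are made up for this example.

```python
from collections import Counter

# The five reads from the example, already in the order they were laid out.
# Assumption: the order is given; finding it is the hard part in practice.
READS = [
    "Friends, Rom",
    "ds, Romans, count",
    "ns, countrymen, le",
    "crymen, lend me",
    "send me your ears;",
]


def overlap_length(left, right, min_overlap=3, max_mismatches=1):
    """Length of the longest suffix of `left` that lines up with a prefix of
    `right`, tolerating up to `max_mismatches` differing characters.
    Returns 0 when no acceptable overlap exists."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(left[-k:], right[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def layout(ordered_reads):
    """Start position of each read, chaining each read onto the previous one."""
    offsets = [0]
    for prev, cur in zip(ordered_reads, ordered_reads[1:]):
        k = overlap_length(prev, cur)
        if k == 0:
            raise ValueError(f"no overlap between {prev!r} and {cur!r}")
        offsets.append(offsets[-1] + len(prev) - k)
    return offsets


def consensus(ordered_reads, offsets):
    """Majority vote over the characters stacked up at each position."""
    columns = {}
    for read, offset in zip(ordered_reads, offsets):
        for i, char in enumerate(read):
            columns.setdefault(offset + i, Counter())[char] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


offsets = layout(READS)
print(consensus(READS, offsets))
```

With these five reads the vote is two T's against one C, and two L's against one S, so both read errors are outvoted and the original sentence comes back.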
And we'll talk about what those circumstances are soon. So why is it so hard? Well, unlike the example that we just talked about, we don't have five reads. We have millions of pieces, and they're much, much shorter than the genome. A human genome has three billion bases in its entirety, and some of the chromosomes are very, very large. If we've only got short read technologies, the longest reads we can get are about 300 bases. If we've got long read technologies, we can get them out to maybe 70,000 bases, but that's still well short of the length of the original genome.

Another problem is that a lot of the sequence is repeated throughout our genomes and throughout the genomes that we're studying, and so the pieces look very, very similar. And there are lots of missing pieces. Some of them just can't be sequenced: they may be too high in GC content, or there could be lots of other reasons. And there could be a lot of errors in some of those reads. So it's kind of like doing a jigsaw puzzle where you don't know what the picture is, your dog's got to some of the pieces and chewed them so they don't quite fit anymore, and one of your children has drawn over the top of some of them with texta. So it's actually quite a difficult problem.

But we have a basic recipe that we can follow to help us. One of the things we will do is find all the overlaps between the reads. We'll see where each read overlaps all the other ones. That sounds like a lot of work, especially for millions and millions of reads. Then we're going to build a graph, which is a picture of the read connections. So we're going to place a read on the table, then place another read on the table and draw a connection between them. Then we'll add another read and draw the connections in, and add another read and draw the connections in. And so we'll slowly, slowly build up this giant picture of all the reads and all their connections to one another.
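As a rough sketch of the first two steps of the recipe (find the overlaps, then draw the connections), the snippet below builds a tiny overlap graph as a Python dictionary. The three reads are made up, the function names and the `min_overlap` threshold are hypothetical, and only exact overlaps are considered. Comparing every read against every other read like this is exactly the "lot of work" mentioned above, so this is not how production assemblers do it.

```python
from itertools import permutations


def exact_overlap(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`,
    or 0 if no overlap of at least `min_overlap` characters exists."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Map each read to the reads it overlaps on its right-hand side,
    together with the overlap length."""
    graph = {read: {} for read in reads}
    for left, right in permutations(reads, 2):  # every ordered pair of reads
        k = exact_overlap(left, right, min_overlap)
        if k:
            graph[left][right] = k
    return graph


reads = ["ATGGCG", "GCGTGC", "TGCAAT"]  # made-up reads for illustration
graph = build_overlap_graph(reads)
for read, successors in graph.items():
    print(read, "->", successors)
```

Here each read has at most one connection leaving it, so the path through the graph is obvious. The hard cases are the ones where a read has several connections going in or out.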
And so we call this thing a graph. Then what we need to do is simplify that graph, because it's going to be very, very complicated. We're going to have reads that overlap many, many other reads. So maybe we can remove redundancy, and we can do a bunch of other things that might help us simplify that graph. Unfortunately though, sequencing errors will show up in this graph and will mess it up a lot. And then finally we will traverse the graph. We'll try to trace a sensible path through the graph to produce the consensus. Now, this is not a trivial problem at all. In fact, it can be quite difficult to do, but this is what our general assembly recipe is.

In pictorial form, you can see here we have our reads. We find all the overlaps, then we lay them down and draw all the connections between them. Then we start to look for paths that visit each of these things once. So we can go from this read to this read to this read to this read, and we'll pull out this green to purple part here. So green, blue, purple, and we'll pull out this consensus sequence here. But then what do we do with these parts? And what do we do with the fact that this one has lots of connections going into it, and this one has lots of connections coming out of it? It's a lot more complicated than we think.

And this is what a realistic graph looks like. It's very complicated. Take tracing a path around this part: sure, we can go from here down, but then do I go this way or that way? I don't know. I can't trace the thing. It's actually a lot more complicated than it looks. It's a nightmare. It is genuinely no fun.

So what ruins this graph? What makes it so complicated? Why can't we just jigsaw it back together and make it really simple? Well, first, read errors: they introduce false edges and nodes, that is, false connections, in our graph. Then there are non-haploid organisms.
So non-haploid organisms, like humans, who are diploid, have this thing called heterozygosity, where one version of a chromosome might be slightly different to the other version of the chromosome. These can cause detours in our assembly graph, where we have two possible paths through the graph. This makes it very, very difficult. How do we choose which one to go down? And how do we know whether one is just an error or is real? There are so many things we need to think about.

And then repeats. If the repeats are longer than the read length, it's actually impossible for us to assemble them. But if the read length is longer than the biggest repeat, then maybe we've got a chance of assembling it. Repeats cause nodes to be shared, and we get locality confusion. So we can get into one of those shared reads, but then we have no way of knowing which path to take out of it.

Okay, so we're going to talk about repeats, because they're really important in genome assembly. What is a repeat? Well, a repeat is a segment of DNA which occurs more than once in the genome sequence. And they're very, very common. Things like transposons, satellites and gene duplications occur regularly in pretty much all genomes, except for viruses, because they're usually very compact. But pretty much everything else (bacteria, fungi, higher eukaryotes, plants, animals) has huge numbers of repeats.

Okay, how does this affect our assembly? Well, imagine that these black lines here are our reads. We've found overlaps, we've laid them out, and then we've found this nice consensus in this black region. But we know that this red region here is identical to this other red region here. We have different sets of reads from each, because when we broke the DNA up and sent it off to the sequencing centre, they read this part and this part and this part and this part and this part.
But to us, or to the computer, they look identical. So this red read here looks identical to a combination of these yellow reads here. And so when we do overlap detection, these red reads get piled up on top of the yellow reads. Even though they came from completely different parts of the genome originally, they get lumped together. And we can't split them, because, well, for this part of the genome here, we know that maybe it's connected to this red section. But then when we're coming out of it here, we don't know whether to go into the yellow section or the red section. So we can come into it, no problems, but we don't know which way to go out of it. This part here assembled up really nicely with no repeats in it, so, yeah, sure, we can assemble that part. But for this repeat, we're going to have this collapse of the consensus, and we're going to produce just one repeat instead of two.

All right. So the law of repeats is: it is impossible to resolve repeats of length S unless you have reads longer than S. And I've written it here twice because you need to keep it in the back of your mind.

However, we can do some tricks. I'm going to talk about scaffolding and what that means. So how do we go beyond contigs? Contigs are contiguous pieces of DNA that we can put back together. Take this nice long section here where there are no repeats and there don't seem to be any problems. When we extract that out of our assembly, we get this piece here called a contig, whereas the repeats get collapsed into another contig.

Okay. So we want to go beyond contigs, but their sizes are limited by the length of the repeats in our genome, which we can't change, and by the length or the span of the reads. We can use long read technology to hopefully overcome these repeats. But we can also use tricks with other technology. We can change the type of reads that we produce. So say we have this example fragment. This is a fragment of our genome.
So we've sent our sample off to the lab. They've broken it down into small pieces so it'll fit into the machine. Originally, what they would do is read one end until the machine stopped working, and then they would call out the read. So they would only sequence one end of the fragment. However, they got a bit cleverer, and they thought, well, why don't we sequence both ends of the fragments? And so they sequence the first bit, and then they sequence the last bit.

Now, we know that this particular read is related to this particular read, right? Because they're on the same fragment. And we can also measure roughly the length of this fragment, so we kind of know how far apart these reads need to be when we're doing our assembly. So when we do our overlap detection, et cetera, what we do is record where the two reads of a pair lie in relation to one another in our consensus. And we can ask: are they roughly the right distance apart? Are they too far apart? Are they too close together? So we can exploit that kind of information in our assembly algorithms.

Okay, so when we do scaffolding, we're doing exactly that. We know that the sequences are related to each other because they came from a single fragment, and we roughly know how long the fragments were. Most of the time the pairs will occur in the same contig. But occasionally the two reads of a pair will be on different contigs, and this is evidence that these contigs are linked together. That is what we're talking about here. So sometimes we have one of the paired reads on this contig and the other one on this contig. That's evidence that these two contigs are linked. We can also get a direction for these contigs, so we kind of know which way they need to face in relation to the other ones. And then we can say, you know, this paired-end read matches one here, so we can connect those together.
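The idea of using read pairs as evidence that contigs are linked can be sketched as follows. This is a simplified illustration, not how any particular scaffolder works: it assumes each read of a pair has already been placed on a contig, the contig names and numbers are made up, and `count_contig_links`, `supported_links`, `spacing_ok` and the `min_pairs` threshold are hypothetical names chosen for this example.

```python
from collections import Counter


def count_contig_links(pair_placements):
    """Count read pairs whose two reads landed on different contigs.
    Each placement is (contig of first read, contig of second read)."""
    links = Counter()
    for contig_a, contig_b in pair_placements:
        if contig_a != contig_b:
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return links


def supported_links(links, min_pairs=2):
    """Contig pairs backed by at least `min_pairs` read pairs."""
    return sorted(pair for pair, n in links.items() if n >= min_pairs)


def spacing_ok(observed_span, expected_fragment, tolerance):
    """Are the two reads of a pair roughly the right distance apart?"""
    return abs(observed_span - expected_fragment) <= tolerance


# Made-up placements: most pairs sit within one contig, a few span two.
pairs = [
    ("contig1", "contig1"),
    ("contig1", "contig2"),
    ("contig2", "contig1"),
    ("contig2", "contig3"),
    ("contig3", "contig3"),
]
links = count_contig_links(pairs)
print(supported_links(links))     # [('contig1', 'contig2')]
print(spacing_ok(480, 500, 50))   # True
```

Requiring more than one supporting pair is a design choice made here so that a single misplaced read does not join two unrelated contigs. Orientation, which the talk also mentions, is left out of this sketch.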
And there's another one that came from a bigger fragment, and, you know, they sort of join up together there. So when we lay the contigs out and try and assemble them using the evidence that we have, we can see that there's a contig here, a contig here, and a contig here. We don't really know what goes in the gaps between them. It's probably going to be some kind of repeated element, or simply an area of the genome that we didn't sequence or couldn't sequence. But this gives us what we call a scaffold. And then we can go back and target these kinds of areas and try and fill them in.

Okay. So how do we assess how good our assemblies are? Well, we want the total length of all of our contigs to be similar to the genome size. We want to assemble as much of our original genome as possible. So if we've got a five megabase genome, say an E. coli or something, and we've got three megabases in contigs, then we've done a pretty poor job of assembling. But if we have close to five, say 4.9 or 5.1 megabases of contigs, then perhaps we're getting close to having an assembly of most of our genome.

We want fewer, larger contigs. We want our contigs to be big. We don't want to have lots and lots of little ones, because that's not going to tell us anything. We want to have a few large contigs that make up most of our assembly. And we want them to be correct, and we need ways of checking to see if they're correct. Because remember, we don't really know anything about this genome before we do any of this work. We don't know what it looks like. We don't know what the contigs are meant to look like. We kind of know the length, but that's about it. So how do we know they're correct? Well, there are some ways we can figure that out, and to do this we have some things called metrics. There's no really generally useful measure, because you don't have any prior information. So we don't have a truth set.
We can't say, yes, this is 99% true, because we don't really know. But what we can do is measure the number of long contigs, or look at the total number of bases in contigs and compare it with our genome length. We can also calculate this thing called the N50, which is a statistic that gets used a lot in assemblies. It's basically a measure of how well put together my assembly is.

So the N50 is the length of that contig for which 50% of the bases are in it and shorter contigs. Imagine we have seven contigs in our assembly, with lengths 1, 1, 3, 5, 8, 12 and 20. The way you calculate the N50 is you lay them out in order like that, and then you sum them all up. When you add all these numbers together, you get 50, so our total number of bases in contigs is 50, and half of that is 25. Now what we want to do is start at the smallest and add them together until we get to 25 or above. The length of the last contig that we add to our sum is the N50. So we go 1 plus 1 is 2, then 5, 10, 18. It's not 25 yet, so we need to add the 12 on. We get 30, which is greater than 25. The last one that we added into this sum was the 12, and so the N50 of our assembly is 12. Basically we're saying that at least 50% of our bases are contained in contigs of length 12 or more.

Okay. So there are two levels of assembly: a draft assembly, and a closed or finished assembly. The closed or finished assembly is usually our goal: to have a finished reference sequence for our organism of choice. However, sometimes we get to the draft assembly stage, where we get to the end of the scaffolding step and we've got a number of non-linked scaffolds. We've got gaps, and we've got unknown sequence in bits of them. But we've got probably 80% of the genome put together and laid out in a plan. This is fairly easy to get to, and sometimes that's enough.
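Going back to the metrics for a moment, the N50 example with seven contigs can be checked with a short Python sketch. One caveat, stated plainly: the talk adds the contigs up starting from the smallest, whereas the widely used convention adds them up starting from the largest. The function below follows the largest-first convention; for this example both routes stop at the contig of length 12. The function names are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    largest contig downwards, first reaches half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")


def assembly_summary(contig_lengths):
    """The simple metrics mentioned in the talk, gathered in one place."""
    return {
        "contigs": len(contig_lengths),
        "total_bases": sum(contig_lengths),
        "longest": max(contig_lengths),
        "n50": n50(contig_lengths),
    }


contigs = [1, 1, 3, 5, 8, 12, 20]  # the seven contigs from the example
print(assembly_summary(contigs))
# {'contigs': 7, 'total_bases': 50, 'longest': 20, 'n50': 12}
```

Largest first, the running total goes 20, then 32, which passes 25, so the N50 is 12. The total of 50 bases is the number you would compare against the expected genome size.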
But for a closed or finished assembly, we want one sequence for each chromosome instead of a bunch of scaffolds. It takes a lot more work, because we need to look at each gap individually and figure out what goes in it. Small genomes are becoming much easier to do. We can do a whole bacterium now, and we can almost guarantee we're going to get a closed genome out of it at the end with long read technologies, using Oxford Nanopore or PacBio or one of the other long read technologies. Large genomes like the human genome are much more difficult, even with long read technologies. It's still not easy, it's still very expensive, and it's still the province of consortiums, for example, the human genome consortium.

So how do you actually go about doing an assembly? We have an example. We culture our bacteria. We extract their genomic DNA. We send it off to a sequencing centre for, say, Illumina sequencing. And what we get back is 250 base pair paired-end reads: two text files from the little vial that we sent off to our sequencing centre. Now what do you do?

Well, we use a tool. We can use some assembly tools. In some of the early tutorials that we have on the GTN, we use tools like Velvet, the VelvetOptimiser and SPAdes. I have to point out, please, that Velvet and the VelvetOptimiser are training tools. They are, however, very good for teaching people how to run assembly tools and what's going on behind the scenes. So if you want to learn about assembly, then Velvet's OK. But if you actually want to do an assembly, then use something else, something like SPAdes or ABySS. Newbler doesn't really exist anymore. There's SGA, ALLPATHS, SOAP, Canu. There are hundreds of them. The list is endless.

We can also assemble things other than genomes, like metagenomes. And we can assemble transcriptomes using things like Trinity and Trans-ABySS. And there are many, many others.
In fact, if you look up genome assembly on Wikipedia and then go to the list of tools, you almost have to scroll through five or six pages of tools. You'll be doing an exercise, but not using Velvet, a bit later on, hopefully, if you do the assembly tutorial.

So thank you for listening. This concludes the introductory slides for genome assembly. If you would like to learn a little bit more about genome assembly, or would like to have a look at some of the algorithms that genome assemblers use, click up here on the top left of the slide deck and it will return you to the topic page on assembly. You can see here there are a lot of different slide decks in this section. The ones we did were these ones, but there are ones that go into the details of the de Bruijn graph. This one here, a deeper look into genome assembly algorithms, is actually really good and explains exactly what's going on inside the tools, if you're interested.

All right. Thank you very much and goodbye.