Hi guys, once again. So what we just covered was mapping to a reference genome: you have an existing reference, and you take your reads and see how they compare to it. This talk covers the opposite situation: what if you have no reference at all, and you're trying to reconstruct your genome from short reads, essentially from scratch?

So what is genome assembly to begin with? Say we take the genome of some species; this is how it currently looks, and we can see it contains a few repeats as well. When we do our sequencing, we end up reading smaller chunks of this original genome at different positions. The task in genome assembly is to take these individual reads and connect them back together to reconstruct the original genome. That doesn't seem too complicated at the outset, but it is actually quite a difficult computational problem.

The overview of this talk: how assemblers actually work; assembly algorithms for short and long reads; and what makes short-read assembly difficult to begin with. Short reads are, for example, our Illumina paired-end sequencing, which we had before. Long reads are the third-generation sequencing technologies, PacBio and Oxford Nanopore, which come with their own set of problems, which is why they need different assembly strategies.

The way whole-genome shotgun sequencing works is we start by copying and fragmenting the DNA. Ideally, we make enough copies of our DNA, and fragment them at enough different positions, that every position of the original sequence is covered multiple times by different reads. Then, assuming we get enough fragments that almost all genome positions are covered, we want to align the reads one after another based on their overlaps and build a consensus sequence by reading off the stacked bases at every position. But because we don't actually know where the reads came from, and because there can be repeats in between, aligning them in that nice neat layout usually isn't that simple.

One of the essential characteristics we need is coverage. Since somebody asked yesterday about the difference between coverage and depth: coverage usually means the average depth across all positions of the sequence, but it's also used for the depth at a single position. In this example, the average coverage of our input sequence is about 7x, but if we focus on this G over here, it's only about 3x or 4x, and so on. The average coverage is just the average over every single base. So basically the same thing; in this case they're referring to this G over here, which has a coverage of 6. When you're doing assembly, you need enough coverage to accurately call each base at the end.
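To make the coverage bookkeeping concrete, here is a minimal sketch in Python; the genome length and read placements are made-up toy values, since in a real experiment you would only know them after alignment or assembly.

```python
# Minimal sketch: per-base depth vs. average coverage, using a toy
# "genome" and invented read placements purely for illustration.
genome_length = 20
reads = [(0, 8), (2, 8), (5, 8), (9, 8), (11, 8), (12, 8)]  # (start, length)

depth = [0] * genome_length
for start, length in reads:
    for pos in range(start, min(start + length, genome_length)):
        depth[pos] += 1

# Depth at one position vs. the genome-wide average ("7x" style figure)
print("depth at position 6:", depth[6])
print("average coverage: %.1fx" % (sum(depth) / genome_length))
```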
The more similarity there is between any two of your reads (and remember, for these purposes we're comparing reads from the sequencer against other reads from that same sequencing run, not against a reference, so everything is compared against everything else), the more likely those two reads sit one after another in the original sequence. So in this case we can see that if we align these two, we get a long exact match over the first seven or so bases, a mismatch, and another perfect match. So we can assume these two came one after another, unless some other read spans between them or has a better overlap with one of them.

Now, depending on the kind of data you have, you get intrinsic problems either way. With long reads, PacBio and Oxford Nanopore, you get average read lengths greater than 10 kilobases. PacBio is pretty consistent, with a mean read length of about 10 kilobases; it's just very, very expensive. With Nanopore it depends on the chemistry, but they typically target fragments of around 8 kb so they get the highest throughput. The problem with both of these is the error rate of the sequencing technology itself: roughly 5% to 15% of positions have a miscalled base, or a single insertion or deletion, at a given site. So with long reads you have to overcome the high error rate by taking advantage of these huge reads you end up getting. With Illumina, you have very short, highly accurate reads, but because the reads are so short, it's difficult to span long stretches of repeats, since there's no way to determine which copy of a repeat a read came from. So with short reads, efficient assembly is what the problem becomes. It's also high throughput: remember that with Illumina you can get up to about 600 million reads, and you have to compare each of these reads against each other to build your overall graph, so from a computational standpoint it becomes much more difficult.

For each of these, there's a different assembly strategy. For long-read sequencing, we have the overlap-layout-consensus method; we'll get to short reads afterwards, because they use a different approach. This is the general layout of a long-read assembly pipeline: you take your reads, you overlap them one after another, you lay them out into long stretches called contigs, you take the consensus at every single base to get an accurate call within each contig, and then there are further steps you can do with the contigs afterwards.

Overlaps work exactly as we were just seeing: you lay out your reads and find the longest suffix of one read that matches the prefix of another. In this case, you can see that GGCTC and so forth over here matches perfectly onto this prefix over here, so computationally you know that this read precedes the second one. You do this across all of your reads and build a directed graph. So if we take this original string over here and break it up into small pieces, we get this directed graph, where this connects up here, which also connects down here, which connects up here, but also goes up here. The numbers above the edges indicate how many reads support each specific connection.
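Here is a minimal sketch of that suffix-prefix test in Python. Real assemblers use indexes such as suffix arrays or an FM-index rather than this quadratic scan, and they tolerate mismatches; the two reads below are invented to echo the GGCTC example.

```python
# Find the longest suffix of read `a` that exactly matches a prefix of
# read `b`: the basic operation behind each edge in an overlap graph.
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-olen:] == b[:olen]:
            return olen
    return 0  # no overlap of at least min_len

# Toy reads: the suffix GGCTC of the first matches the prefix of the second
print(suffix_prefix_overlap("TTACGGCTC", "GGCTCAATG"))  # -> 5
```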
Once you have this overlap graph constructed, we can start doing the layout and try to resolve and simplify the graph. Overlap graphs can end up looking like this; this one is built from the simple sentence "to every thing turn, turn, turn, there is a season". The reason this example is used is the repeat of "turn, turn, turn", which causes all of this ambiguous matching back and forth. So what's basically happened over here is you've taken the sentence, fragmented it into much smaller pieces, and then connected them up.

What you do when laying them out (this is just a subset of the graph) is notice that this green edge over here, which connects directly across, can also be explained by going through the node in between, along the blue path. So this branch over here can be resolved by just taking the blue directed path, and we can get rid of this green directed edge. We can simplify our graph by removing edges that can be explained by other paths instead. Does that make sense? So in summary, we check whether two connected nodes can also be connected through an intermediate node in between, and if so, we drop the direct edge. Our graph from before now looks like this, which is a drastic improvement on the raw overlap graph. We can continue by checking whether any two nodes are connected through two intermediate nodes, and the graph simplifies from that down to this simple graph over here.

We still have that unresolved blue repeat in the middle, and there's no way of explaining away any of those edges by following this layout strategy. So this is as simple as our graph can get. We now see one long stretch of directed nodes, another long stretch of directed nodes, and the unresolved blue repeat. With overlap-layout-consensus, in the layout step you take this stretch as contig one, this stretch as contig two, and this remains an ambiguous region. You now have long stretches of DNA, your contigs, to continue the assembly with.

For the consensus step, because you know how each of those fragments connects to the others within a contig, you can align all of the reads that make up a specific contig, line them up, and call the consensus at each position, meaning the most common base. At this position over here, we see two C's and three T's, so you end up calling the consensus as a T. (I think the PDF, or the PowerPoint, is shifted to the side, so the column looks off, but two C's and three T's works as well.) So basically, you take the majority vote. If it's a split decision, I believe some assemblers will turn that into a bubble, which you can then resolve afterwards; otherwise they'll just pick one, because you don't know what it actually is. It's a limitation of the program itself, but that's basically it. Once you have your larger contigs, you can then use those contigs as your reference assembly. There's no way of bridging the ambiguous regions, at least not with just one sequencing technology.
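A minimal sketch of that majority-vote consensus step, assuming the reads are already laid out into aligned, gap-padded columns; real assemblers also handle indels properly and may keep split votes as bubbles.

```python
# Call the most common base in each column of a stack of aligned reads.
from collections import Counter

def call_consensus(stack):
    """stack: equal-length aligned read strings, '-' marks a gap."""
    out = []
    for column in zip(*stack):
        bases = [b for b in column if b != '-']
        out.append(Counter(bases).most_common(1)[0][0])  # majority vote
    return "".join(out)

aligned = ["ACGTC-",
           "ACGTCA",
           "ACCTCA",
           "-CGTCA"]
print(call_consensus(aligned))  # -> ACGTCA
```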
Now with short-read assembly, it's actually a fair bit different. You have a pre-step, error correction, which can sometimes be skipped; you have graph construction, graph cleaning, and contig assembly; and then you do scaffolding and gap filling. Gap filling is also sometimes an extra step that can be skipped.

Let me explain error correction together with the next step, because it's easier to explain that way. The way you work with short-read sequences is you take a sequence like this and break it down into substrings called k-mers. If I say I have a k-mer size of 40, that means I read this string over here 40 characters at a time, shifting by one each time, and collect all the possible k-mers that exist within my sequencing data. What you can then do is count the number of times you see each specific k-mer. K-mers that result purely from sequencing errors, which is what this position over here shows (because as your reads get longer, the per-base error rate rises and you can get miscalled bases), will be very rare. So by counting k-mers, and knowing that sequencing artifacts are fairly rare with short-read sequencing, we can disregard all the low-count k-mers and focus only on the well-supported ones. Once we have this subset of solid k-mers, we can continue to the next step. You could also do exact overlaps between the reads instead, but that's extremely slow when you have millions of reads to compare, and there are plenty of programs that do k-mer-based correction.

Yes, you select the k-mer size yourself; it's a parameter you give when you run your assembly program. We'll get to that afterwards. Jared actually wrote a program that helps you decide what your best k-mer size is, but because assembly is not an exact science, you generally have to try several different k-mer sizes and see which gave you the best assembly.

So once you have your set of k-mers and the set of reads you actually want to use, you can start your graph construction. You take your reads; in this case the reads are only six bases, for ease of convenience, and the k-mer size you selected is four. You lay out all of your k-mers as nodes in a graph, and then, using the reads, you see how each k-mer connects to the next. You can see that CCGT and CGTT connect to each other, and the next four-character string is this one up here, so you draw a connection up there. You do this for every single read independently, and you build up the edges that connect all of your k-mers together. You end up with this initial graph layout, where you have this long stretch that goes to ACGT, loops back to CGTT, and then has this trailing end over here. The issue is that you don't know whether your sequence is actually repeated multiple times, because there's no way of resolving that, or whether a branch is just a sequencing artifact, and so on. Other problems come up too. And this is just for four or five reads of six bases each; the problem becomes much, much larger when you start looking at longer sequences or more reads.
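Here is a minimal sketch of those two steps together: counting k-mers, discarding the rare ones, then linking each solid k-mer to the k-mer that follows it within a read. The reads, k, and the cutoff are toy values; real tools pick these per dataset and also handle reverse complements.

```python
# k-mer counting for error filtering, then de Bruijn-style graph edges.
from collections import Counter, defaultdict

def build_graph(reads, k=4, min_count=2):
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    solid = {kmer for kmer, n in counts.items() if n >= min_count}

    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            a, b = read[i:i + k], read[i + 1:i + 1 + k]
            if a in solid and b in solid:
                edges[a].add(b)  # a is followed by b in some read
    return edges

reads = ["CCGTTA", "CGTTAC", "GTTACG"] * 2  # duplicated toy reads
for kmer, nexts in build_graph(reads).items():
    print(kmer, "->", sorted(nexts))        # e.g. CCGT -> ['CGTT']
```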
So those are one source of error that you get in short-read assembly. The other thing you get is at heterozygous positions: you can get these little bubbles, where one k-mer results from one haplotype and another k-mer results from the other. Because it's exactly the same position, you get the exact same flanking fragments going into both paths, and the exact same number of nodes along either side of the bubble.

So our main problem becomes: if this is our example graph layout, how do we clean it up so we can arrive at contigs that work? The first strategy is to look at all the tips in the graph. Any k-mer at the end of a path is considered a tip, so the very first and the very last nodes of this longest sequence are also considered tips. Then we work backwards until we arrive at the longest sequence. Because this is the first node and this is the last one, and we can see they sit on a long run of connected nodes, we end up keeping them rather than removing them as tips. But we've now highlighted all of these short random branches that go nowhere, and we remove those k-mers from our graph. The other thing we have to deal with is the little bubbles that form. Because they're heterozygous positions, we don't necessarily know which of the two options to follow. Once the program knows it's a clean bubble, with the exact same length on both sides, it randomly picks one of them, and that node gets resolved. Now, because you have these longer stretches of sequence, with the formerly ambiguous positions resolved, the program goes along and connects all these stretches of k-mers into your overall contigs.

Once you have these contigs, you also want to be able to tell, because you're using paired-end sequencing and therefore have extra information, how the contigs actually relate to each other and connect to one another. That's where scaffolding comes in. What you do is take your contigs and, using the paired-end information, check whether one read of a pair maps to the end of the first contig while its mate maps to the start of the second. In this case, the blue over here and the blue over here show that these two contigs are connected; the red over here and the red over here show that these two are connected; the green and the green show that these two are connected; and that one is supposed to be purple, so the two purples connect the last contig as well. Because of that, you know the overall sequence must follow this pattern, with contigs one, two, three, four, five arranged in this order. Unfortunately, you don't know what's actually missing in between, or what the base sequence in those gaps is, because the gaps usually come from repeated sequences and could be anything. So what ends up happening is you fill them with N's for ambiguity.
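A minimal sketch of that scaffolding logic: read pairs whose mates land on different contigs vote for joining those contigs, and accepted joins are concatenated with N-padding. Contig names, sequences, link counts, the vote threshold, and the gap size are all invented for illustration; real scaffolders also use mate orientation and insert size to estimate gap lengths.

```python
# Paired-end links between contigs vote for scaffold joins.
from collections import Counter

# (contig of mate 1, contig of mate 2) for pairs spanning two contigs
mate_links = ([("ctg1", "ctg2")] * 12 + [("ctg2", "ctg3")] * 9
              + [("ctg1", "ctg3")] * 1)   # one spurious/chimeric link

MIN_LINKS = 5                             # ignore poorly supported joins
support = Counter(mate_links)
joins = [pair for pair, n in support.items() if n >= MIN_LINKS]
print("accepted joins:", joins)           # ctg1-ctg2 and ctg2-ctg3

# Implied order ctg1 -> ctg2 -> ctg3; pad the unknown gaps with Ns
contigs = {"ctg1": "ACGTACGT", "ctg2": "TTGACC", "ctg3": "GGATCA"}
scaffold = ("N" * 10).join(contigs[c] for c in ("ctg1", "ctg2", "ctg3"))
print(scaffold)
```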
Alternatively, because third-generation sequencing is now available, where you can get 10-kilobase reads from a single fragment, you can use long-read technology, PacBio or Oxford Nanopore, to try to bridge these gaps, or you can do local assemblies at those locations and see whether you can somehow resolve them. But that's basically the overview of the two different assembly strategies. There are hybrid assembly strategies now, where you pool your short-read and long-read sequencing data and try to reconstruct a better assembly from both, but those are still experimental and there's still work being put into them.

The quality of an assembly also matters a lot. For bacterial genomes with short reads, you get hundreds of contigs of 10 to 100 kilobases. If you use long reads, you get a handful of contigs, and if your genome is small enough, you can actually get just one contig spanning the entire genome. If you give it enough depth, you don't actually need any error correction or any short reads; you just get the entire sequence. For larger genomes, short reads give us contigs of about 10 kilobases, versus contigs of around a million base pairs for long-read sequencing. But it's just far more expensive to do PacBio or Oxford Nanopore than Illumina short reads: Illumina costs about $1,000 to $1,500, while PacBio will give you much lower coverage and cost you about $10,000 to $15,000, so an order of magnitude more for a fraction of the coverage.

Assembly is also really important for the way we approach newer analyses of, say, cancer genomes. For genomes with large structural variations, if we use our traditional mapping strategies, we might be losing information that would otherwise be resolved by assembly, just because we're forcing matches to occur against a normal diploid reference genome. So there's been a lot of work put into assembly-based strategies as well, to discover the underlying differences between normal genomes and mutated, structurally varied genomes.

There's a competition called Assemblathon 2, where different labs come and compete to see whose software is the best. In 2013, they were given three different genomes to try to reconstruct and assemble, and what they found was that the assemblies were actually quite varied. There wasn't a consensus on whose assembly was true or which was the best program to work with. Jared's specialty is assembly, so he ended up looking at it and trying to figure out: what parameters of these assemblers could cause these differences? What makes a given assembly difficult? And if you're comparing different species, what problems might the variability between species bring up? He ended up making a list of factors that make assembly difficult. Repetitive sequences are pretty obvious: it's difficult to resolve which contig follows from another when you aren't able to span fully across the repeat. High heterozygosity is also a problem, because you get a lot more bubbles in your graph, you don't know how to resolve those bubbles, and you don't really have a single reference sequence to aim for either. Low coverage makes it really difficult to determine whether you have a consensus sequence or not.
Then there's biased sequencing: regions with high GC content might be sequenced to greater depth than regions with lower GC content, just because of the chemistry. The Plasmodium parasite that gives rise to malaria actually has an AT content of about 80%, so it's very, very difficult to assemble, because your coverage ends up really low and extremely biased; but because it's malaria, there's still a lot of work put into it. People are still trying to overcome these factors, but the problems remain. High error rates create spurious bubbles in the k-mer graph. Chimeric reads, where part of the read just doesn't map anywhere, cause problems. If you still have adapters inside your reads, that will cause problems too, because you're trying to force bases into the assembly that don't actually exist in your organism. Sample contamination, obviously, will throw off whatever you're trying to assemble. And if you sequence multiple individuals and try to reconstruct a single assembly from them, you're introducing heterozygosity to begin with, so you're not actually able to reconstruct it efficiently.

So Jared ended up writing a program to help you estimate how difficult any given assembly you're approaching will be. This is the structure of a graph you would have after the initial graph construction in short-read assembly. You can see that errors give you these dead-end branches over here (some of the error branches are supposed to point backwards), SNPs and indels give you the bubbles, and repeats give you these unresolvable structures that just keep going back and forth. Based on these different metrics, his program can determine whether your assembly is going to be easy or hard, how hard it might be, and what your expected genome length is, which matters if you're working on organisms that aren't fully characterized.

The way it does this is by looking at k-mer coverage: for every given k-mer, how many reads contain that k-mer? Just like in error correction, we throw out the really rare ones; we want to conserve everything else. A good, easily assembled genome is one where you have a clean, roughly normal distribution of k-mer coverage. This very, very small bit at the low end, which is basically k-mers seen once or twice, represents k-mers that probably arose from sequencing artifacts, so you'd end up disregarding them anyway. For the human genome, you can also see this little shoulder over here: humans are heterozygous at roughly one position per thousand bases, so you get this small bump of k-mers at half the expected coverage, just because of the SNP positions. You get this little bias over here, but it's nothing that can't be overcome. Now if you take something like the oyster genome, the oyster genome actually has SNPs every 100 base pairs or so, about ten times the human rate. You can see this bimodal distribution, where some of your k-mers still follow the average coverage, but because the SNP rate is so high, a much bigger portion has about half the coverage, and this makes it much, much more difficult to resolve what your overall assembly should be.
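Here is a minimal sketch of how such a k-mer coverage histogram is built; the toy reads are invented, and real tools work from millions of reads, which is why their histograms show the smooth peaks described above.

```python
# Tabulate "how many distinct k-mers were seen n times" across all reads.
from collections import Counter

def kmer_coverage_histogram(reads, k):
    counts = Counter(r[i:i + k] for r in reads
                     for i in range(len(r) - k + 1))
    return dict(sorted(Counter(counts.values()).items()))

reads = ["GATTACAGATTACA"] * 8 + ["GATTACA"] * 3  # toy data
print(kmer_coverage_histogram(reads, k=5))        # multiplicity -> #k-mers
```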
The more skewed your distributions become, the more difficult, or outright impossible, it gets to reassemble the genome accurately. So once we have the coverage, we also want to take these different branch parameters and assess how difficult our assembly would be, and we can do this with different models: error modeling, variant modeling, and repeat modeling. These are all different parameters and mathematical models that Jared built, so if you guys have questions on them, I can try to answer them, but I don't think I'd be able to. Basically, the program he ended up writing will give you a set of figures that help explain each of these problems.

The variant branch rate plot tells you, for the different k-mer lengths down here, the frequency of variant branches in the de Bruijn graph. If you have higher heterozygosity, you get more variant branches, which makes it more difficult to assemble the genome. You can see that oyster over here is at about 100, which fits with its SNP rate of one per hundred base pairs; humans are at about 100,000; and this yeast over here is used as a control, because it's not supposed to have any variant branches. You can also model your repeat branch rate, to see the frequency of repeat-induced branches at different k-mer lengths. You can see in this case that as your k-mers get longer, the repeat branch rate does begin to fall, but because humans have a lot of long repeats in our genome, we're still higher than all the other organisms. These organisms were chosen deliberately: the yeast is a control; human, snake, and fish were, I believe, part of the Assemblathon competition; the oyster is used as the worst-case scenario; and the bird was added in to compare against the other programs.

What this program will also do is estimate, from the assembly graph you give it, what your expected genome size is supposed to be. The reason for this is that you want enough coverage relative to your expected genome size to be able to accurately resolve repeats and call specific contigs. The program is accurate to about 10%: you can see the estimated human genome size is almost three billion base pairs, which is roughly the length of the human reference genome we accept. It will also estimate the quality scores at every base position, and you can see that further along the read, quality scores begin to fall. Falling quality scores make graph assembly much more difficult, because you get more of those branching effects, and nodes that don't, or shouldn't, belong there at all. What you would ideally want is something steady, without quality scores falling, but this is just a result of the sequencing technology itself. You can also measure your error rate at any given position, and the higher your error rate becomes, the worse the assembly will go. All of these are metrics for determining how difficult the set of sequences you just gave it is going to be to assemble.
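A minimal sketch of the kind of genome-size estimate a k-mer histogram allows: total solid k-mer observations divided by the modal coverage approximates the number of distinct genomic positions. The histogram values and cutoff here are invented, and this is only the general idea, not necessarily the exact calculation the program performs.

```python
# Estimate genome size from a k-mer coverage histogram.
# multiplicity -> number of distinct k-mers seen that many times
hist = {1: 500_000, 2: 20_000,   # mostly error k-mers, discarded below
        28: 90_000, 29: 160_000, 30: 210_000, 31: 150_000, 32: 80_000}

ERROR_CUTOFF = 5
solid = {m: n for m, n in hist.items() if m > ERROR_CUTOFF}

total_kmers = sum(m * n for m, n in solid.items())  # all solid observations
peak = max(solid, key=solid.get)                    # modal k-mer coverage
print("estimated genome size: ~%d bp" % (total_kmers / peak))  # ~689,000
```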
One of the last things you also want to check is GC bias: you don't want your k-mer coverage to depend on the percentage of GC. What the program does is plot a heat map, and you don't want any trend to be present. For our fish, which doesn't have any GC bias, we see a single hotspot right in the middle, without any trend. If you look at the yeast genome, you can see a bit of a linear trend, where increasing GC percentage actually causes a decrease in coverage, but it's slight. And if you look at the oyster genome, because of that high SNP rate and heterozygosity, we actually get two of these hotspots, and we see a slight bias toward GC in k-mer coverage as well.

The program will also output the expected fragment size: because we're using paired-end reads, you end up getting a nice estimated distribution of your fragment sizes from the read pairs. And it will do a simulated assembly. Since you were asking about which k-mer size to ideally choose, it spans a range of k-mer sizes and reports the simulated contig N50 for each. We'll talk about N50 in the tutorial; it's one of the metrics we use to determine how good an assembly is. So it helps you pick, or focus on, which k-mer sizes to actually try when you finally run the assembly on your entire genome.

The program is called preqc, and it's part of SGA. I don't know if it's still a preprint, it might have been published by now, but the preprint is still available at this link, and the code showing how it does all of its metrics and calculations is also available on GitHub. These are the commands you basically have to run: three commands, and you end up getting a summary for your assembly strategy.

So that's just an overview of how assemblers work. Short-read and long-read assembly require different methods because they're different data sets; you can't just use whichever strategy you want, because each is tailored to that specific technology. Many factors determine whether an assembly is difficult or easy. For short reads, SGA's preqc can help assess these factors, but just because it can assess them doesn't mean you can necessarily overcome them. And the strategy nowadays is to pool short and long reads to try and create the best assembly possible. And that's basically it. Any questions?