So this part of the workshop is a little different in that it's a section that only has a lecture; there's no associated practical. And as Michelle said, this is on de novo genome assembly. So I was glad to hear that at least some people are not working primarily on human, that they've sequenced something else and will want to use genome assembly as a starting point. In the next hour or so I'm going to go over genome assembly from both a theoretical standpoint, how the assemblers actually work and what they're doing internally when they take a sequenced genome and try to reconstruct it, and the practical aspects of genome assembly, how you can hopefully get good results out of the assembly software that you use.

So, I'm Jared Simpson. I'm a principal investigator at the Ontario Institute for Cancer Research. I've been working in genome assembly for about seven or eight years now. At last count I've written four different assembly software packages. You can interpret that in two ways: either assembly is very hard and the field is always changing, or I'm just not very good at it and need to constantly rewrite new software. It's the former. Thank you, Francis, I appreciate that.

Right, so just to get started: what is genome assembly? So we're all on the same page here, I'm going to refer to cartoonish-looking pictures of genomes like this one to illustrate some points throughout the talk. Genomes are made up of both unique sequence, which I'm visualizing with these colored bars, and repetitive sequence, which I'm visualizing with these red bars. The repeats are what makes genome assembly difficult. If you have sequences present in multiple different places in the genome, the assembler has to try to put those repeats in the right place, and when the read length is a lot shorter than the length of the repeats, which is commonly the case when we use Illumina sequencing, that's very difficult. That's why genome assembly is so hard, and why we're constantly trying to come up with new ways of resolving these repetitive structures.

So going back to the question of what genome assembly is: we start with a genome that looks like this, and then we sequence it, which can be thought of as taking random pieces of the genome and figuring out their sequence from end to end. Usually the length of the reads is much, much shorter than the genome. Even if you sequence a bacterial genome, which might only be, say, five megabases in length, you sequence it with roughly 100-base-pair reads, which are four or five orders of magnitude shorter. The genome assembler then has to take all those reads and reconstruct the genome into its original structure. So we take a genome, we sequence it by breaking it into pieces, and the genome assembler is essentially inverting that process, taking the small pieces and trying to put them back together.

Here's an overview of what I'll talk about. First is a theoretical section where I'll talk about assembly graphs and the data structures that assemblers use, just to get you familiar with common terminology. Then I'll walk through an example assembly pipeline, taking a sequenced genome in FASTQ format and going through the full assembly process. Then I'll talk about some of the features that make assembly difficult. Unfortunately, assembly is a process that's ripe for failure.
If your genome is, say, extremely heterozygous or extremely repetitive, you can get very poor results, and I'll try to describe the features that lead to difficulty during assembly so you can recognize them when you sequence a genome you're interested in. And then I'll talk a little bit about the future and how assembly is changing with the introduction of long reads, where the read lengths can be 10 kilobases or longer, for example from the Pacific Biosciences sequencer or the Oxford Nanopore instrument.

OK, so I'm also going to come back to this structure a lot during the talk. This is an assembly graph. I mentioned what the structure of genomes looks like, how they're made up of repetitive segments; the way assemblers deal with the repeat content within genomes is to make a graph where each vertex is a sequence, for example a sequencing read, and two vertices are linked by an edge if those sequences are related in some way. For example, if two reads overlap, you put an edge in the graph. This is how assembly deals with the complexity of the genome. It's a very concise representation of the sequencing data, and we can perform inference over this graph to try to figure out what the original structure of the genome was by looking at certain features within it. I'm going to come back to this a few different times during the talk.

Now, when you work with a genome assembler, it will use one of two types of assembly graph. If you're working with next-generation sequencing data, like Illumina data, the assembler will use a data structure called the de Bruijn graph. This is an extremely popular model of genome assembly because it's a very efficient way to represent the sequencing data, and it can be built extremely quickly. The older way of assembling genomes was to use overlap-based methods, or what's now called string graphs, which represent overlaps between reads; that's much slower and much heavier to work with. I'm going to give you examples of both of these types of graph over the next couple of slides.

I'm going to start with the string graph, or the overlap graph, because it's conceptually the simplest representation of the data. Here we're just looking for overlaps between reads, and by overlap I mean that the end of one read matches the start of another read, for example here. When we have two reads that overlap, we just put an edge between them. So here we've got two reads: read one, which starts ACG and so on, and we represent it as a vertex in the graph. We have a second read, read two, which is ATG, and we represent it as a second vertex. Because the start of read two matches the end of read one, we've linked them by an edge in our graph. As we add more and more reads to our collection, we've added a third read here, we get a third vertex; its prefix matches the ends of the other two reads, so it has edges to both of them, and so on, and we start to build up a picture of what the genome might look like.

Now, the fundamental bit of theory about genome assembly (yep, on the bottom, that's right) is that when we go to assemble the genome, all we're doing is trying to find a path through this graph and concatenating the strings on the edges, these edge labels. This fundamental piece of theory is that the genome is a path through this graph of all of the reads.
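Just to make that idea concrete, here's a minimal Python sketch that builds an overlap graph by all-against-all suffix/prefix comparison. This is only a toy (the reads, the minimum overlap length, and the function names are made up for the illustration); real assemblers use indexes such as suffix arrays or the FM-index rather than comparing every pair of reads, and they tolerate mismatches.

    def overlaps(a, b, min_len):
        """Length of the longest suffix of a that exactly matches a prefix of b."""
        for olen in range(min(len(a), len(b)) - 1, min_len - 1, -1):
            if a[-olen:] == b[:olen]:
                return olen
        return 0

    def build_overlap_graph(reads, min_len=3):
        graph = {}  # read index -> list of (other read index, overlap length)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlaps(a, b, min_len)
                    if olen:
                        graph.setdefault(i, []).append((j, olen))
        return graph

    reads = ["ACGTTAGC", "TAGCCATG", "CATGGTAA"]
    print(build_overlap_graph(reads))
    # {0: [(1, 4)], 1: [(2, 4)]} -> a single path 0 -> 1 -> 2 through the graph,
    # spelling ACGTTAGCCATGGTAA when the overlapping reads are concatenated

Walking the single path through this toy graph and concatenating the non-overlapping parts of each read is exactly the "genome is a path through the graph" idea.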
And that's really the only thing you need to understand about how the assembler builds these graphs: it builds the graph and then it tries to find some path through it that reconstructs the sequence of the genome. Of course there are plenty of caveats, dealing with things like heterozygosity, dealing with the repeats I've been referring to, and dealing with sequencing errors, but really we're just trying to construct a graph representing our data and find a walk through it. So for that part, are there any more questions at this point? I usually like to say that I'm up here talking, but this is a workshop, so please just jump in with questions as we go along if anything's not clear. That's right, yeah: we would call those contained reads, where one read is contained within another read; it doesn't add any additional information about the structure of the genome, so those are removed up front.

OK, so here's what a generic assembly pipeline looks like. At the start of the pipeline we have FASTQ data, fresh off your sequencing instrument, and we might perform read trimming and read filtering to clean up the data and remove artifacts. We then might perform error correction, build the assembly graph I've just been talking about, clean it of more artifacts, construct contigs, scaffold the contigs together, and finally fill gaps. That would be the result, the assembler's view of what your genome looks like. I'm going to talk about all of these steps in detail; this is moving into the practical aspects, the things you need to do when you're assembling a genome.

Sequencing adapters came up a few times already. The way the Illumina sequencers work is that synthetic sequences are ligated to the ends of your biological DNA and then used to prime the sequencing process. Usually the Illumina software will strip those adapters off, and what you get in your FASTQ files is just biological sequence. If for some reason this process went wrong and these synthetic adapters remain in your reads, they can completely ruin your assembly: from the point of view of the assembler, they look like a repeat with an extremely high copy number, and it won't be able to construct long contigs. So if you have adapter contamination in your data, you have to remove those adapters before giving the reads to the genome assembly software. Some assemblers come packaged with trimmers that will hunt for these adapters and remove them. Otherwise you might want to use a standalone trimmer; I quite like the Trimmomatic software. It can take a library of Illumina adapters, hunt for them at the ends of your sequences, and get rid of them so they don't ruin your assembly.

Another way you might want to trim is by quality score. We heard about quality scores this morning: the Illumina sequencer has a characteristic error profile where the quality of the data drops towards the 3' end of the read. If you have particularly bad data, you might want to just remove the bad bases at the end of the read to reduce the work the assembler has to do to resolve that low-quality sequence. And again, these tools will perform both adapter and quality trimming.
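As an illustration of what a quality trimmer is doing, here's a toy sliding-window 3' trim in Python. The window size, the threshold, and the example qualities are assumptions made up for this sketch; it shows the general idea, not the exact algorithm any particular trimmer implements.

    def quality_trim(seq, quals, window=4, min_q=20):
        """Cut the read at the first window whose mean Phred quality drops below min_q."""
        for start in range(0, len(seq) - window + 1):
            if sum(quals[start:start + window]) / window < min_q:
                return seq[:start], quals[:start]
        return seq, quals

    seq   = "ACGTACGTACGT"
    quals = [38, 37, 36, 35, 34, 30, 28, 25, 12, 8, 5, 2]   # quality falling off at the 3' end
    print(quality_trim(seq, quals))
    # ('ACGTAC', [38, 37, 36, 35, 34, 30]) -> the noisy tail of the read is removed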
Yes, if you have exceptionally poor data, then just resequence it. Complain to your sequencing core; they love to hear that their data is not good enough to use. But sometimes you want to weigh how much time you have to put into working with your data against just running a new HiSeq lane, which is really quite cheap these days.

So, to go back to the point I just raised: the error rate at the ends of reads increases. I think Vitae talked about this earlier this morning and might even have shown this same plot. These are six different sequenced genomes: a yeast, Saccharomyces cerevisiae; a fish, a Lake Malawi cichlid; a snake; a human genome; a parakeet; and an oyster genome. These six sequencing runs have extremely different error rates. The human data, which is this light blue line here, is quite good: the error rate is less than half a percent across the entire length of the read. But for, say, the fish data set or the bird data, the error rate is much higher towards the end. This is data you might want to perform quality trimming on, or run an error corrector on.

The way error correction works is that it looks for sequences that are very rare within the data set and tries to fix those rare sequences into more common sequence. It's looking for sequencing errors within your data and trying to recover what the real sequence was once the errors are removed. To illustrate the process, consider this read here, which has a single error, the red C colored in here. A lot of error correctors are based on k-mer counting. K-mer counting is where we take fixed-length strings of length k, in this case k is 20, so we're looking at 20-base segments, and we just count how many times each one occurs across our entire run of sequencing data. This is a very efficient thing to do. For sequences that don't contain errors, we expect to see them quite often: if we sequenced to 40x coverage, we'd see these 20-mers about 40 times. However, if there is a sequencing error, because sequencing errors are roughly random and roughly uniform, we expect to see the substrings that contain the error very few times. The substring containing this red C, our sequencing error, is seen only once across the entire data set.

So how do we tell if we have an error? We do exactly this process: start from the beginning of the read, count how many times the first 20-mer occurs, how many times the second 20-mer occurs, and so on all the way along the read. When we see a 20-mer that's very rare, that's when we say, OK, we've detected an error. What are the possible alternatives? Let's change this C to one of the three other bases, and one of them will flip it back to the true sequence. Looking at that a slightly different way, this is a graph of the number of times each 20-mer has been seen as a function of its position in the read. You can see we start from the first 20-mer at around, I can't quite read that, around 27, and it goes up and down along the read between 20 and 30; then when we get to this point here the 20-mer frequency drops to one. We detect an error at this point, and then we look for a correction that makes all of those 20-mers well represented in the data. You can use standalone k-mer-based error correctors; I've listed a bunch here. Quake is very popular, and there are SGA, SOAPdenovo, BFC, BLESS, Lighter and Musket.
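Here's a toy Python version of that k-mer-counting check (the reads, the value of k, and the count threshold are made up for the illustration; real correctors use k around 20 to 31 and much more compact counting structures):

    from collections import Counter

    def count_kmers(reads, k):
        # count every k-mer across the whole data set
        counts = Counter()
        for r in reads:
            for i in range(len(r) - k + 1):
                counts[r[i:i + k]] += 1
        return counts

    def suspicious_positions(read, counts, k, min_count=3):
        # positions whose k-mer is rare and therefore probably contains an error
        return [i for i in range(len(read) - k + 1) if counts[read[i:i + k]] < min_count]

    reads = ["ACGTACGGT"] * 5 + ["ACGTACCGT"]      # one read carries an error (G -> C)
    counts = count_kmers(reads, k=4)
    print(suspicious_positions("ACGTACCGT", counts, k=4))
    # [3, 4, 5]: exactly the k-mers that overlap the erroneous base are rare

To actually correct the read, you would try the three alternative bases at the suspect position and keep the substitution that lifts all of the overlapping k-mers back up to a typical count.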
Those correctors are all fairly good algorithms that run very quickly on your data, and by performing this error correction process you're giving much cleaner data to the assembler and reducing the amount of work it has to do to resolve these sorts of artifacts. There are also error correctors that are based on finding inexact overlaps between reads, aligning reads to each other and looking for differences. These tend to work quite well if the error rate is very high, but they're extremely slow: rather than just counting 20-mers, you're computing inexact overlaps between a vast number of pairs of reads, and that's computationally very inefficient. So we typically don't run overlap-based error correctors on very high-throughput data sets.

OK, so now we've performed a little bit of data quality control and cleanup through read trimming and error correction, and we're getting into the meat of the assembly, which is constructing the assembly graph and assembling it into our genome. I talked about overlap graphs before; now I'm going to talk about de Bruijn graphs. De Bruijn graphs are based on the idea of k-mers, which I just introduced during error correction. Essentially, to build the de Bruijn graph we take all strings of length k (this is an example where k is four), make a vertex for every k-mer in our data set, and connect two k-mers with an edge if they overlap by k minus one, in this case three, bases. So in this example data set we have five different reads; the first one is CCGTTA, so it has the 4-mers CCGT, that's this vertex here, CGTT, this vertex here, and GTTA, here. If we do that for all of the reads, we get this graph. Now we can see this graph branches, because this CGTT, which I've labeled in red here, is present in two copies within our reads with different sequences following it: we have GTTA following CGTT here and GTTC following it here. What the assembler is going to try to do is resolve that repeat, and we can see there's one path through the graph that visits every single vertex: we can go down here, follow this chain of vertices back through here to the second copy of CGTT, and then up through here. Again, this is just the principle that we're trying to find walks that reconstruct the sequence of our genome.

Now, a short diversion into the efficiency of genome assemblers. When I started working in genome assembly in 2008, we weren't so concerned about making the assembly software very fast or very memory efficient; we just wanted to be able to take Illumina data, these very short reads, and get some meaningful result out of it. The first genome assemblers were incredibly inefficient: they would construct these k-mer-based graphs using roughly a hundred bytes per k-mer, and if you translate that to assembling a human genome, which might have 10 billion k-mers in the graph, the amount of memory would be about a terabyte. You'll almost never have access to a computer with that much memory, so a lot of the work that's gone into de novo assembly in the last ten years has been about reducing the amount of memory required to assemble large genomes. Because of all that work, we can now assemble a human genome using a de Bruijn graph in about 10 gigabytes of memory, rather than the thousand gigabytes, a terabyte, that it took about seven years ago.
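To make the de Bruijn graph construction concrete, here's a toy Python sketch; the two made-up reads are chosen to reproduce the CGTT branch from the slide, and real implementations store all of this far more compactly, which is where the memory-efficiency work goes.

    def de_bruijn_graph(reads, k):
        edges = set()
        for r in reads:
            for i in range(len(r) - k):
                # consecutive k-mers in a read share k-1 bases, so they get an edge
                edges.add((r[i:i + k], r[i + 1:i + k + 1]))
        return edges

    for a, b in sorted(de_bruijn_graph(["CCGTTA", "CGTTC"], k=4)):
        print(a, "->", b)
    # CCGT -> CGTT
    # CGTT -> GTTA
    # CGTT -> GTTC   (the graph branches at CGTT, just like the repeat in the slide)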
So after the assembler has constructed the graph, it wants to perform additional graph-cleaning steps to get rid of artifacts. I'll talk about two different types of artifact that appear. The first one is what we refer to as tips. These are structures in the assembly graph that branch and then just don't go anywhere: they branch, have a few connected vertices, and then stop; they have a connection on one end and not the other. These structures are caused by residual sequencing errors that the error corrector or the trimming wasn't able to get rid of. If we have an uncorrected sequencing error at the end of a read, all of the k-mers at the end of that read aren't going to be connected to anything, because they're rare and unlikely to have been seen in multiple reads, so they cause this structure here that goes nowhere; likewise this sequencing error causes this structure here that goes nowhere. Now, the assembler wants to take walks through the graph, and we want those walks to be very long, meaning we're reconstructing large segments of the genome in single pieces. You want your assembly to represent your genes in single contigs, so that you can then run a gene-finding program and see what your genome codes for, and to do that you need to be able to make very long walks along the graph. Because these artifacts branch off, they confuse the assembler, so the assembler wants to identify these residual sequencing errors and remove them from the graph, leaving only this gray path along the bottom here.

There's a second type of artifact the assembler has to deal with, which we call bubbles. These are caused by heterozygosity. If you sequence a diploid genome like a human genome, there will be heterozygous SNPs within it, and because of the allelic differences, for example a C on one haplotype and a G on the other, they cause the graph to branch, just like those tips did. Yes, and if there are nearby SNPs on the same haplotype, say there was another SNP here, it would just extend the length of these bubbles, so you'd have a longer structure here, and you can phase them, saying that both of these alleles came from one haplotype rather than the other. But in general you get isolated structures like this: human genomes have a heterozygous site roughly every thousand bases, so you end up with these bubbles about every thousand vertices in the graph. As I said before, the assembler wants to reconstruct long paths through the graph, so it will remove these structures and just represent one haplotype.
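To give a feel for the graph-cleaning step, here's a toy tip-clipping pass in Python. The graph, the vertex labels (single letters standing in for k-mers), and the maximum tip length are all made up for the illustration; real assemblers also weigh coverage and use their own heuristics, so treat this as a sketch of the idea only.

    def find_tips(out_edges, in_edges, max_tip_len=3):
        """Short dead-end chains that hang off a branching vertex (likely error tips)."""
        tips = []
        for v in out_edges:
            if out_edges[v]:                      # only start from dead-end vertices
                continue
            chain, cur = [v], v
            while len(chain) <= max_tip_len:
                preds = in_edges[cur]
                if len(preds) != 1:
                    break                         # unattached or ambiguous: leave it alone
                prev = next(iter(preds))
                if len(out_edges[prev]) > 1:      # the chain hangs off a branch: it's a tip
                    tips.append(chain)
                    break
                chain.append(prev)
                cur = prev
        return tips

    # main path A -> B -> ... -> G with a one-vertex dead end X hanging off B
    out_edges = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": {"E"},
                 "E": {"F"}, "F": {"G"}, "G": set(), "X": set()}
    in_edges = {v: set() for v in out_edges}
    for v, nbrs in out_edges.items():
        for w in nbrs:
            in_edges[w].add(v)

    print(find_tips(out_edges, in_edges))   # [['X']]: the long main path is left alone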
So is that clear? Are there any questions at this point? Yes. So, to paraphrase, you're asking how we can standardize which software to use at each step of the pipeline: how do we choose a trimmer, how do we choose an error corrector, and when should we use one program versus another. A lot of this comes down to your data set, unfortunately. There are certain error correctors that are very good at, say, very aggressively getting rid of all of the errors in your data. So if you have a high error rate in your data, you might want to use a more aggressive error corrector. On the other hand, if your genome is highly repetitive, you don't want the error corrector to "correct" copies of repeats into a different sequence, which would itself be an error, so then you want to be more conservative. So it depends a lot on the repeat content and the properties of your data, and I'm going to come back to this in a little bit. What I suggest for anybody who's just starting out, you've sequenced a genome and now you want to assemble it, is to use a fully featured assembly pipeline like ALLPATHS-LG, written by the Broad Institute, which integrates all of these steps together. I also recommend going to papers from people who have run a full assembly pipeline, seeing what programs they used for each step, and then trying to work your way through it.

OK, so those are the two types of artifact. After we've sequenced our genome and performed some QC on the data, we might construct an assembly graph that looks like this. The assembler wants to assemble this, but unfortunately the graph is quite complicated: there are branches every few vertices and there's not a lot the assembler can do with it. So the first thing it will do is perform this tip-removal step, where it identifies the branches in the graph that just go nowhere and removes them; now we have a much cleaner graph. Then it will look for the bubbles in the graph, which indicate heterozygosity, and collapse them down to a single allele, and now we have a much cleaner structure, something the assembler can deal with. Yes, you do. Usually the assembler will take the path with the higher coverage of the two, the higher k-mer count or the higher number of reads on the branch. The reason is that, rarely, sequencing errors can construct these structures as well, for example a systematic sequencing error seen in multiple reads, so the heuristic of taking the higher-coverage branch biases you towards selecting the true allele. If it is a heterozygous SNP, it's essentially arbitrary which allele you represent in your assembly, so it just picks the higher-coverage one. A lot of assemblers will write out a file saying that they removed this structure and think it's a heterozygous SNP, so you can go back to that file and see where the allelic variation in your graph was.

OK, so now we have a much cleaner representation of the data and the assembler can start to build contigs. The contig-building process just merges the simple chains of vertices that can be unambiguously connected together. There are no branches here, so it will just merge these together. The assembler merged the first four vertices, but since this node branches, it could go either here or here, it had to stop: it couldn't merge that node any further without taking a chance on a misassembly, because there are two different possibilities from that one vertex. It then skips over that, carries on, merges this together, merges this together, and so on until it finishes with this graph, which is what we call a contig graph or a unitig graph, where we've made all the unambiguous merges to construct these contigs. At this point you might ask how long these contigs are going to be, and again it depends on your data.
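Here's a toy Python version of that unambiguous-merging step, using the CGTT branch from the earlier de Bruijn example; the graph and vertex labels are made up, and production unitig construction is more involved, so this is only a sketch of the idea.

    def build_contigs(out_edges, in_edges):
        contigs, visited = [], set()
        for v in out_edges:
            if v in visited:
                continue
            preds = in_edges[v]
            if len(preds) == 1 and len(out_edges[next(iter(preds))]) == 1:
                continue                      # interior of a chain: reached from the chain's start
            contig, cur = v, v
            visited.add(cur)
            while len(out_edges[cur]) == 1:
                nxt = next(iter(out_edges[cur]))
                if len(in_edges[nxt]) != 1 or nxt in visited:
                    break                     # a branch or an already-used vertex stops the contig
                contig += nxt[-1]             # k-mers overlap by k-1 bases, so add one base
                visited.add(nxt)
                cur = nxt
            contigs.append(contig)
        return contigs

    out_edges = {"CCGT": {"CGTT"}, "CGTT": {"GTTA", "GTTC"}, "GTTA": set(), "GTTC": set()}
    in_edges = {v: set() for v in out_edges}
    for v, nbrs in out_edges.items():
        for w in nbrs:
            in_edges[w].add(v)

    print(build_contigs(out_edges, in_edges))   # ['CCGTT', 'GTTA', 'GTTC']: merging stops at the branch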
If you sequence a simple bacterial genome like E. coli with Illumina data, your contigs might be around a hundred thousand bases in length on average; if you sequence something more complex, like a human genome, the contigs might only be 10 to 20 thousand bases in length. At this point we want to bring in paired-end information, which we heard about this morning when Mark was talking about structural variation, as a way of linking these contigs together and jumping over the repetitive branching segments that can't be resolved by the reads alone. The bacterial one is larger because it has less variation and, typically, fewer repeats: larger genomes tend to be more repetitive, and repeats cause more of these branches in the graph that the assembler can't resolve. Also, and I'm going to come back to the variation as well, bacterial genomes usually don't have variation: they're clonally grown and you sequence a single clone, which makes assembly a lot easier as well.

This process of taking your contigs and paired-end data and trying to jump over repeats is called scaffolding, and I'll give an example of it now. We start by mapping read pairs to the contigs; here I've just depicted a handful of pairs, each linked with an arc: this is one read, this is the second read, and they're linked together as a pair. From this we look at pairs where one read aligns to the end of one contig and the other read aligns to the end of another contig. I'll use the red ones, they're a little more visible: we have red pairs linking here and here, so you might conclude that this contig is linked to this one, because there's a consistent group of pairs here; likewise this purple one might be linked here, green to green, blue to blue. Yes, with pairs, a lot of the time that's right. A lot of the time the assembler won't look at the pairing information while it's building contigs, just because it makes the problem a lot more complicated, and tracking which pair is linked to which takes a lot of memory. If you're doing a large assembly, which as I mentioned is already very hard, keeping track of all that information is usually too much, so the assembler uses the heuristic of constructing contigs first and bringing in the pairing information afterwards. This can either be your original paired-end data, or sometimes you'll have longer-range mate-pair libraries, where the pairs are separated by maybe five kilobases, to jump over even very large repeats; you can think about using multiple different paired-end libraries as a way of building longer and longer scaffolds.

Once we've mapped all of these reads to our contigs, the assembler looks at consistent pairs that map with one read on one contig and the other read on another, and it builds what we call a scaffold graph. This is similar to the assembly graphs I was talking about, where we have sequences and edges linking them; here the sequences are our contigs, and the edges are where paired-end links map to the ends of two contigs. The assembler will then try to estimate the distances between the contigs using the known fragment-size distribution. I think this was talked about this morning: when you make a paired-end library, you expect the two reads of a pair to come from a fragment of roughly 400 bases.
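Here's a toy sketch of how that fragment-size information turns into a gap-size estimate during scaffolding. The 400-base mean fragment size and the per-pair numbers are assumptions made up for the illustration; real scaffolders also model the variance of the library and weigh conflicting links.

    def estimate_gap(links, mean_fragment_size=400):
        """links: (bases of the fragment on contig A, bases on contig B) for each linking pair."""
        gaps = [mean_fragment_size - on_a - on_b for on_a, on_b in links]
        return sum(gaps) / len(gaps)

    # five read pairs bridging contig A and contig B
    links = [(120, 95), (110, 100), (130, 80), (100, 90), (115, 105)]
    print(round(estimate_gap(links)))   # 191: roughly 190 bases of unknown sequence between the contigs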
The assembler can use that information to say: OK, I've got this contig on one half of the scaffold and this contig on the other half, and I think there are about 200 bases in between. That's gap-size estimation, and it's one of the steps the assembler uses to put contigs together. Once it has made these gaps with their estimated sizes, we can perform a second round of localized assembly to try to fill in the gaps with real sequence. When the assembler makes scaffolds, it just puts Ns in between the contigs, meaning unknown nucleotides, so between this contig and this contig there would be about 200 Ns; but we can do a second round of assembly to fill in that gap using more localized information. There are standalone tools for this: there's one called SGA gap-fill, and GapCloser from SOAPdenovo, and you can also use other sequencing technology, for example PacBio, which gives much longer-range information, to figure out what the sequence is within those gaps. If you have PacBio reads of 10 kb, then even if there's very repetitive sequence within the gap, the 10 kb read should span it and we can fill it in. Yes, there is a difference: the SOAPdenovo GapCloser tends to be extremely aggressive, it will try to fill in gaps even at the cost of making errors, while the SGA gap filler, full disclosure, I wrote that one, is very conservative: it won't fill in as many gaps, but it tries to avoid making errors when it does. There's a nice paper reviewing scaffolding and gap-filling programs by a guy called Martin Hunt; it was in Genome Biology, I think about a year ago, and it's a good resource for this. My email address is somewhere in the slides; if you email me I can send you that paper.

OK, now, something I've alluded to a few times is that the results of assembly can vary widely, both across different software run on the same data and across different genomes run through the same software. Around the time this paper, Assemblathon 2, was published, I started to explore the question of what makes a given assembly difficult: what features of a genome, or properties of the data, are going to cause problems for the genome assembler. The Assemblathon project was a competition in which three different genomes were sequenced, the fish, the parakeet and the snake I referred to earlier, and the data was released to the community of people who write assemblers; they assembled it, submitted their assemblies to the organizers, and the organizers judged whose assemblies were good and whose were bad. There were a lot of lessons learned from that, but the main one is that the results are largely inconsistent, between different software run on one genome and between different genomes run through the same software package. The reason is that a lot of factors go into making a genome assembly. I've talked a little about repeats: if your genome is highly repetitive, assembly is a lot more difficult. If there are high levels of heterozygosity, so there are a lot of these bubbles in the graph, that makes assembly difficult. If you haven't sequenced deeply enough, or if the sequencing coverage is biased in regions of low or high GC, that complicates the assembly. There are also properties of the data: a very high error rate, a lot of chimeric reads, leftover sequencing adapters, or sample contamination.
By contamination I mean you've sequenced two different things by accident instead of just one. Or sometimes, when you're sequencing very small organisms, you can't get enough DNA from a single individual and you have to sequence multiple individuals; that then becomes sequencing a population, and it causes a lot of these bubble-like artifacts in the assembly graph. So what I worked on last year was coming up with ways of measuring these different properties that contribute to difficult or failed assemblies, and writing a program that gives the user a report on, effectively, how difficult their assembly is going to be. I'm going to walk through this as the more practical part of the talk.

I showed you this picture of structures in the assembly graph earlier; now we can start labeling the different features in it. We talked about tips, which are caused by sequencing errors; we talked about bubbles, which are caused by heterozygous SNPs and indels; and I talked a little about repeats, which cause a lot of branching in the graph. So now the question is whether we can measure how often tips occur in the graph, how often SNPs and indels occur, and how repetitive the genome is. How am I doing on time? Twenty minutes left, OK, great. The way this works is that we build a probabilistic model of the graph using sequencing coverage. We can annotate every node with how many times it has been seen in our data set: this k-mer has been seen 40 times, this one 39, this one 38, and so on. When we reach a branch in the graph, we can predict whether it's caused by a repeat, by heterozygosity such as a SNP, or by a sequencing error. The way we do that is to make a model of what the overall k-mer count histogram looks like: if we look at every vertex in the graph, what's the histogram of how many times that vertex has been seen? The mean here is about 30, that's the 51-mer count here, and it has this nice normal distribution. This is a human data set: there's not a lot of heterozygosity, the data is well covered, and on average we see each vertex of the graph about 30 times in our reads. If we compare that to the oyster data, which is one of these organisms where multiple individuals had to be sequenced and which is very heterozygous, we see that the 51-mer count distribution is bimodal: there's a population of k-mers at about 45x and a second population at about 22x. Everything that came from this part of the distribution is heterozygous, and everything that came from this part is homozygous. If you compare that to human, there's just a little bump of heterozygous k-mers, whereas the oyster data has a huge bump here. This tells us there's very high heterozygosity within the oyster data set, and assembling it is going to be really challenging, because the graph is going to have bubbles everywhere that the assembler will have to try to resolve. This one? Yes, those are sequencing errors: things seen exactly one time in the data set, and there are lots of them. You'll always see this spike, which falls off roughly exponentially: k-mers seen once are fairly frequent, but you rarely see the same sequencing error twice, three times and so on, which is why it drops so sharply. It's assuming that sequencing errors are random, that's right.
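Computing this kind of k-mer coverage histogram is straightforward; here's a toy Python version (the reads and the value of k are made up, and on real data the x axis runs out to a hundred or more, with the error spike near one, a heterozygous peak near half coverage, and the main homozygous peak at full coverage):

    from collections import Counter

    def kmer_coverage_histogram(reads, k):
        kmer_counts = Counter()
        for r in reads:
            for i in range(len(r) - k + 1):
                kmer_counts[r[i:i + k]] += 1
        # multiplicity -> number of distinct k-mers seen that many times
        return dict(sorted(Counter(kmer_counts.values()).items()))

    reads = ["ACGTACGTTT"] * 30 + ["ACGTACGTAT"]   # 30x coverage plus one read with an error
    print(kmer_coverage_histogram(reads, k=5))
    # {1: 1, 30: 2, 31: 3, 32: 1}: the error k-mer sits at 1, everything else near 30x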
Now, ideally you want your coverage distribution to be quite far out to the right. You don't want k-mers to be seen zero times, because then you have a break in the graph where you simply haven't observed a k-mer. The human data set is quite good: it's well separated from zero. This oyster data is slightly too close to zero for comfort; if somebody showed me this and asked what to do, I'd tell them to sequence another lane, just to push the coverage distribution a bit further out to the right.

OK, so this is the probabilistic model with which we classify branches in the graph as sequencing errors, heterozygous variation, or repeats. I'm not going to go into the math; I'll go straight to the results. If we run it on these six genomes, the yeast, the fish, the snake, the human, the bird and the oyster, we start to see a stratification based on how often branches due to heterozygous variation occur in the graph. Down here is the yeast data set, with a branch rate of 10 to the minus 4, that's one in 10,000; because this yeast isn't diploid, we see very few branches due to variation. At the other end of the spectrum is the oyster data, where the branch rate is about one in a hundred bases, and that's consistent with what I just showed you, where the coverage distribution has those two distinct modes, one of heterozygous k-mers and one of homozygous k-mers. The other genomes fall in between: the human genome is here, with a branch about every thousand bases, which is roughly what we expect from population genetics. So this oyster data is extremely hard to assemble, the yeast is very easy to assemble, and the others fall in between. If we also classify how frequently the graph branches due to repeats, we see a similar pattern. The human genome is one of the most repetitive, which makes sense because it's the largest genome here and we know it has tons of transposons, which contribute to assembly difficulty. Again the oyster genome has a high level of repetitiveness, and the yeast genome is not too bad; we should be able to assemble that quite well. Are these plots clear? Any more questions about this part before I move on? OK.

We can also use a statistical model to estimate genome size. The human genome, as expected, is estimated at about three gigabases, the snake at about a gigabase and a half, the cichlid just below a gigabase, and the parakeet at about 1.2 gigabases. This program can be run on just FASTQ data: you take your data in FASTQ format, run the program on it, and it produces a PDF of all of these results, predicting how large the genome is, how heterozygous it is, how repetitive it is, and so on, which gives you a sense of how difficult your assembly is going to be. The program also performs QC on your data: it plots quality scores as a function of position in the read, so you can see whether your data is of high or low quality; it makes the error-rate plot we looked at a couple of times; and it tries to assess whether there's GC bias. Illumina sequencing is often amplification based, with a PCR step, and that PCR step can cause sequences of high or low GC to be poorly represented in your data. This is a good thing to assess before your assembly, because if your genome is very GC rich, you want to make sure it's well covered. So this is an example of a good data set, where we're plotting k-mer coverage as a function of GC content and it's roughly uniform with respect to GC content.
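The GC-bias check itself is simple to sketch: bin k-mers by their GC fraction and look at the typical coverage within each bin. The k-mer counts below are invented for the illustration, and this is just the shape of the idea, not the exact computation the report uses.

    from statistics import median

    def gc_fraction(kmer):
        return (kmer.count("G") + kmer.count("C")) / len(kmer)

    def coverage_by_gc(kmer_counts, bins=5):
        by_bin = {}
        for kmer, cov in kmer_counts.items():
            b = min(int(gc_fraction(kmer) * bins), bins - 1)
            by_bin.setdefault(b, []).append(cov)
        return {b / bins: median(covs) for b, covs in sorted(by_bin.items())}

    kmer_counts = {"ATATAT": 28, "ATTTAA": 30, "ACGTAT": 31, "ACGTGC": 30,
                   "GCGCAT": 29, "GCGCGC": 18, "GGGCCC": 17}   # made-up counts
    print(coverage_by_gc(kmer_counts))
    # {0.0: 29.0, 0.2: 31, 0.6: 29.5, 0.8: 17.5}: coverage sags for the GC-rich k-mers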
Now here's a data set that's not quite as good, where we see a trend towards lower coverage in higher-GC regions, so this might indicate a problem with the data set. And again, if we look at the oyster data we see two distinct distributions here: once more there's a population of heterozygous k-mers and a population of homozygous k-mers. The oyster really is the nightmare case for assembly; I use it in all my talks just to show how hard assembly can be. Essentially, the people who published this data weren't able to get a good assembly out of it, it just didn't assemble at all, and they had to use much longer reads to assemble the oyster genome rather than just 100-base-pair Illumina data.

OK, this program will also plot the fragment-size histogram, the paired-end fragment sizes, for each data set. Taking the snake as an example, the histogram has a mean of about 380 bases, which should match whatever library prep was performed on the data. This is another way of checking the quality of your data: you can check whether the empirical fragment-size distribution matches what the lab tells you. Some data sets in here might be a problem; for example, for this fish data set the insert size is less than 200 bases, so since we're sequencing a 100-base read from either end, the reads are expected to overlap in the middle by about 20 bases. That's something the assembler should know, as it changes the coverage distribution, and it has to be aware that the two paired-end reads overlap. And finally, this program will actually perform a simulated assembly for you, so you can predict how well your genome is going to assemble. As I said, the yeast data gives much longer contigs; contig length is on the y axis here, as a function of the k-mer length I talked about, which is the length of the vertices in the de Bruijn graph. The yeast data is the easiest to assemble and has the longest contigs, and as I've said before, the oyster data, in yellow here, has very short contigs and is hard to assemble no matter what k-mer size you use. This program is open source; you can download it, run it on your data, and it will give you a PDF report on how difficult your assembly is going to be. If you're going to ask for help with your assembly, for example on Biostars or on the mailing list of the assembly tool you're trying to use, it's very useful to include this type of report, so that the authors of assembly software, people like me, can look at it and say: OK, your genome is really heterozygous, you might want to change this parameter up or down, and give you tips on how to parameterize your assembly given the unique features of your data set.

OK, to end, I'm just going to talk about long-read assembly. We've had Illumina sequencing for about seven or eight years now, and it's been very successful in that we can sequence tons of genomes from all over the tree of life and get a picture of the relationships between species and the unique gene content of different organisms, but the assemblies tend to be very fragmented: as I said before, if you sequence a large eukaryotic genome, the contigs might only be 10 to 20 kb in length. In contrast, if you take long single-molecule sequencing data, like that from the Pacific Biosciences sequencer or the newer Oxford Nanopore sequencer, the reads themselves might be 10,000 to 20,000 bases long.
That makes the assembler's job of resolving these repeats a whole lot easier. Unfortunately, it comes at the cost of a much higher error rate: with PacBio data the error rate is around 15 to 20 percent, compared to about one percent for Illumina data. But because the reads are so long, the read length compensates for the high error rate in some ways: you can perform error correction on the long 10 kb reads and then assemble the corrected data, just like the Illumina error correction I was talking about. Now, for Illumina reads we had to develop all this machinery for working with de Bruijn graphs, and none of that works when the error rate is this high, so we're going back to the overlap-based methods of assembly, which take a whole lot of compute time but can deal with a 15 to 20 percent error rate. A long-read assembly pipeline looks quite similar to a short-read pipeline: again we perform read trimming and filtering, then error correction, construct a graph, clean the graph, and build contigs, and then we've added this polishing step. I'll talk about the two features unique to long-read pipelines in turn.

Error correction is now typically overlap based rather than k-mer based, and because finding overlaps between high-error-rate long reads is very expensive, this can take thousands of compute hours. So there's a shift: for short-read assemblers we were mainly worried about memory efficiency, we wanted to perform the assembly in as little memory as possible; for long-read assembly, memory is no longer the concern, we're worried about how long the assembly takes to complete, because we've gone back to these overlap-based techniques. The final step, which we call consensus polishing, is where we compute the final sequence of the genome using a probabilistic model of the sequencer. The idea is that if we take our raw PacBio reads, which have a 15 percent error rate, and construct contigs from them using overlap-based techniques, the final assembly might have an error every hundred bases, and that's not good enough if you want to find genes in the assembly or perform any sort of population analysis. So people have been developing models of the sequencing data where we can run a hidden Markov model, take all of the data into account, and compute a new consensus sequence with more like a one in 10,000 error rate rather than one in a hundred. These two steps, error correction of long noisy reads and consensus polishing of long noisy reads, are really the fundamental difficulties of working with long-read data, and the problems that people who write sequence-analysis algorithms are dealing with now.

I've put three different papers up here if you want to read more about long-read assembly. This is the classic paper where Jason Chin and his colleagues at PacBio showed that, using PacBio data, you can assemble bacterial genomes into a single contig, so the genome isn't fragmented at all: essentially you've taken your genome, sequenced it, and put it back together perfectly. Now, this is extremely computationally expensive, and some colleagues of mine, Adam Phillippy and Sergey Koren, are applying these long-read assembly approaches to very large genomes; they now have a pipeline that's efficient enough to assemble the human genome using PacBio data.
And this paper here is from my lab, where we collaborated with Nick Loman in the UK to assemble bacterial genomes into one contig using Oxford Nanopore data. For those of you who aren't aware, the Oxford Nanopore device is a sequencing instrument roughly the size of a smartphone; you can fit it in your pocket, it's extremely portable. Our colleague Josh Quick is right now in West Africa sequencing Ebola samples in clinical labs, as a way of tracking the outbreak and the population genetics of Ebola. It's an incredibly exciting technology in that we can now move sequencing anywhere and rapidly sequence outbreaks. Do you have a question? So, the error rate is improving all the time: initially it was almost 30 to 40 percent, which is incredibly difficult to work with; now it's comparable to PacBio, around 20 percent, or 15 percent at best.
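To close the loop on the consensus-polishing idea from a couple of slides back, here's a deliberately crude Python sketch: a per-column majority vote over reads that have already been aligned to a stretch of the draft assembly. Real polishers such as Quiver (PacBio) and Nanopolish (Oxford Nanopore) use a probabilistic model of the underlying data rather than simple voting, so this is only meant to show what "computing a new consensus" means; the aligned reads below are invented.

    from collections import Counter

    def majority_consensus(aligned_reads):
        """aligned_reads: equal-length strings over A/C/G/T/-, with '-' marking a deletion in that read."""
        consensus = []
        for column in zip(*aligned_reads):
            base, _ = Counter(column).most_common(1)[0]
            if base != "-":
                consensus.append(base)
        return "".join(consensus)

    aligned = ["ACGTTAGC",
               "ACGATAGC",      # one substitution error
               "ACG-TAGC",      # one deletion error
               "ACGTTAGC"]
    print(majority_consensus(aligned))   # ACGTTAGC: the individual read errors are voted away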