Right, so I'm Jared Simpson. I'm a group leader here at OICR, just over in the West Tower. I work primarily in genome assembly. Since about 2007 I've been working in bioinformatics, and my research has always been aimed at making assembly more scalable, faster, and more memory efficient, and that's really what I'm going to talk about today. There are really two key parts to my talk: a theoretical part and a practical part. At the beginning I'll talk about the theory of assembly and how we use assembly graphs to represent sequencing data and the relationships between sequence reads, and then I'll go into the more practical aspects of assembly. I'll walk through what an assembly pipeline looks like step by step and give examples of programs that perform quality trimming and error correction of sequencing data. Then, continuing on the practical side, I'll discuss the various factors that make assembly a difficult problem: why so many assembly projects fail or run into difficulty, where you sequence a genome and get back an assembly that's maybe not as good as you were expecting. And I'm also quite happy that some people here don't have reference genomes, because I think everybody should be doing de novo assembly for as many things as possible.

Okay, so I'm going to start with this picture. It's a little cryptic. This is an assembly graph, or at least how I think of them in my mind. Essentially an assembly graph is just a way of representing sequences and their relationships to each other. Typically, each vertex or node in the graph is some sequence, and we put an edge between two vertices, two sequences, if they share some overlapping sequence. The idea is that edges represent sequences that could be merged together to perform the assembly. Now, you see that there are four different colors in this graph, gray, red, this light green, and blue, and all of these depict different structures that appear in assembly graphs. I'm going to leave it at that right now, but I'll return to this picture later on and describe the different structures that form in an assembly graph and what they actually tell us about the genomes we're working with.

If you read assembly papers or look at assembly software, you'll hear about two different types of assembly graphs. There's the extremely popular de Bruijn graph model of assembly. This is where you take each sequence read, break it into a set of subsequences that we term k-mers, and then construct a graph of these k-mer sequences. This is by far the most popular way of performing assembly on next generation sequencing data, because these graphs, where we're just breaking the reads into k-mers and finding (k-1)-base overlaps between them, are extremely fast to construct. We can build the graph in linear time, so it scales extremely well for large volumes of data. In contrast, we have string graphs, or overlap based methods of assembly, where each vertex in the graph is a sequencing read and we link vertices with an edge if they share an overlap. This is a very rich representation of the sequencing data; it's about as much information as you can possibly represent in an assembly graph. But the drawback is that these graphs are very expensive to construct. In early approaches to assembly you'd have to take all pairs of reads and compare them to each other, and this is extremely slow.
So these string graph or overlap based methods of assembly didn't work very well for Illumina data. This is now changing, and there's been a lot of work on fast construction of these types of graphs. I've put one reference here, a paper I published at ISMB (in Bioinformatics, 2010) that allows these graphs to be constructed very quickly if you restrict yourself to exact matches between the reads. This form of assembly is the basis of the SGA assembler, which is one of the primary software programs that I work on.

So I'm going to use this string graph to go into more detail about how these assembly graphs are constructed. We start with just a single sequencing read and add that read as a vertex in the graph. Then as we add a second read, we add a second vertex, and since read two overlaps the previous read, read one, we add an edge between them. We've labeled the edge here TAC, which is the portion of read two that read one does not contain. The idea is that if we appended the string TAC to read one, we would get a string that contains both read one and read two, so we'd have an assembly which is a superstring of both of these two vertices. We can extend this idea to larger and larger collections of reads. We add a third read here, we add a third vertex to the graph, and we add two more edges. If we add a fourth read, it overlaps all the previous reads, so we add more and more edges, all of them labeled by these overhanging stretches of sequence. Now, the fundamental property of assembly that I want to get across here is that walks through these assembly graphs give us reconstructions of the genome. If we find a walk that starts at V1 and goes to V2, V3, V4, and we collect all the sequences along these labeled edges, we reconstruct the sequence of the genome for this toy example. So assembly just boils down to making these large graphs and then finding walks through the graph to reconstruct the sequence of your genome. That's where most of the work in assembly goes: building these graphs and trying to find walks.

But there are a lot of steps to an assembler. When we have real data, it often doesn't behave in the ways that we expect or want, so we have to perform a lot of different cleaning steps before assembly. This is going into the more practical aspects, where we talk about what my picture of a generic assembly pipeline looks like. We start from raw read data up here and go down the pipeline this way. We might perform some read trimming or filtering to get rid of low quality reads. We might then perform error correction, build the assembly graph, clean it up a little, construct contigs from the graph, and then scaffold them together and perform gap filling. I'm going to go through each one of these steps in detail now.

Yeah. So what we usually do, and I'm going to talk about the contig assembly step in more detail later, but right now what we usually do is we only find paths through the graph that are unambiguous. We don't want to follow branches in the graph, because that might traverse some repetitive structure whose exact order we don't know. So typically we just find the simple paths in the graph, where each vertex has one edge going out and one edge going in. It's almost universal that assemblers will do this, at least as a first step: just assemble all the simple paths that can be unambiguously merged. If you have paired end data, you might try to do some more sophisticated resolution, but for this purpose I think it's fair to say that they just find unambiguous parts of the graph. And you're right that there is no start or end; you just sample each vertex and then assemble locally from there. Yeah. Please ask questions as we go; I don't want to stand here just talking for 50 minutes.
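To make this concrete, here is a toy sketch in Python of the idea, my own illustration rather than anything from SGA itself: reads become vertices, edges carry the overhanging sequence, and concatenating the edge labels along a walk spells out a superstring of the reads.

```python
# Toy string-graph sketch: vertices are reads, edges carry the
# non-overlapping "overhang", and a walk reconstructs the assembly.

def overhang(a, b, min_olap=3):
    """If a suffix of `a` matches a prefix of `b`, return the part
    of `b` that `a` does not contain, else None."""
    for olap in range(min(len(a), len(b)), min_olap - 1, -1):
        if a.endswith(b[:olap]):
            return b[olap:]
    return None

reads = ["GACTAC", "CTACAA", "ACAAGT"]          # toy reads
edges = {}                                       # (i, j) -> edge label
for i, a in enumerate(reads):
    for j, b in enumerate(reads):
        if i != j:
            label = overhang(a, b)
            if label is not None:
                edges[(i, j)] = label

# Walk v0 -> v1 -> v2 and append the edge labels onto the first read:
assembly = reads[0] + edges[(0, 1)] + edges[(1, 2)]
print(assembly)   # "GACTACAAGT": a superstring containing all three reads
```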
Okay. So really the first step is read trimming and read filtering. If you've seen the error profile of Illumina data, it drops towards the end of the read. If your sequence data is very bad, you might have very low quality bases at the end of the read, and you might want to trim them back to the higher quality sections. A second problem that trimming can solve is adapters in your data. If you've sequenced longer than your fragment length, you can read right into the Illumina adapters, and your sequences will be chimeric: you have the actual DNA that you're interested in and then whatever the Illumina adapters are. You always want to trim these off. This example down here is an IGV screenshot of some data I had with quite bad adapter contamination. All of these colored bars here were adapters attached to my reads. If I had tried to assemble this without getting rid of the adapters, the assembly would have completely failed, just because all of these adapter sequences would collapse into branches in the graph and we'd have tons of what looks like repeats that couldn't be resolved. So I just ran a simple trimmer on this data and got quite a good assembly out of it. Some trimming software that I like: there's a program called Kraken, which was developed at the EBI; it will actually infer what the sequence of your adapters is and automatically remove them. Trimmomatic is a very popular quality based trimmer; it will look at quality scores and get rid of bases that are much lower quality than expected. Some assemblers have a quality based trimmer built in, and for others you need to use one of these standalone trimmers that you can get online.
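As a rough illustration of what quality based trimming does, here is a minimal sketch, my own toy rather than Trimmomatic's actual algorithm: scan in from the 3' end and cut at the first base whose quality climbs back above a cutoff.

```python
# Minimal 3' quality trimming sketch: drop trailing bases whose
# Phred quality is below the cutoff.

def quality_trim(seq, quals, cutoff=20):
    """Trim low-quality bases from the 3' end of a read.
    `quals` are Phred scores, one per base."""
    end = len(seq)
    while end > 0 and quals[end - 1] < cutoff:
        end -= 1
    return seq[:end], quals[:end]

seq   = "ACGTACGTAC"
quals = [38, 37, 35, 36, 30, 28, 19, 12, 8, 2]   # quality drops at the end
trimmed, _ = quality_trim(seq, quals)
print(trimmed)   # "ACGTAC": the noisy tail is removed
```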
As an alternative to quality based trimming, or in addition to it, you can perform error correction. This is the Illumina sequencing error profile that I just mentioned, for six different data sets. Here we're looking at the average error rate from the first base to the last base, over a sample of a million reads. At the first base the error rate is quite low, but for all of the data sets it goes up towards the end of the read. If we leave this data uncorrected it's going to cause problems in the graph: it's going to cause a lot of new branches, because sequencing errors create essentially new sequence that contributes new structures to the graph. We want to get rid of as much of this as possible. So what we can do is perform error correction. Error correction is just taking our sequencing reads, trying to infer the positions in the read that are wrong, and setting them to the right bases. In this example here I've marked this C in red to indicate that it's a sequencing error, and I'm going to explain k-mer based error correction.

Here we just look at the abundance of each k-mer and try to infer which k-mers in the read are errors and which are not. A k-mer, again, is just a subsequence of some fixed length; here I think it's 20. If we look at every 20 base stretch in the read, starting from the beginning, say this first 20 bases, and count how many times it appears across your entire read set, the k-mers that are error free are going to be seen many times. We always sequence genomes to about 40 or 50x when we're trying to assemble them, so we expect a lot of redundancy in the data set. So error free k-mers are in high abundance, but when we get to the k-mers that contain the sequencing error, like this k-mer over here, because the error generates new sequence that's unlikely to be seen elsewhere in the data set, the abundance is quite low. Error free k-mers have high abundance; k-mers with errors have low abundance, and typically they're only seen a single time. The error creates a k-mer that's unique to this read and not seen anywhere else, just because sequencing errors are quite rare. Now we can use this observation to perform error correction. This is a k-mer count profile for a single read. We start from the leftmost base, the first base of the read, and count how many times that first k-mer has been seen; typically they're seen about 20 to 30 times, depending on your sequencing coverage. As we move along the read from left to right, this fluctuates just due to sampling noise, and then we hit this position here, around the 60th k-mer in the read, and the count drops down to one. The corrector would infer that there's an error at this position, because it's expecting to see the next k-mer about 20 to 25 times but only sees it once. So it would infer there's an error there and search the other three possibilities to see if it could correct it, and in this case one of the three other possible bases successfully recovers the count profile and matches the rest of the read. So k-mer based error correction is just looking for these low frequency k-mers and then trying alternatives to see if it can fix the read.

There are a lot of different k-mer based error correctors available. A very popular one is Quake, which was developed by David Kelley in Steven Salzberg's lab. In my assembler, SGA, there's a k-mer based error corrector which is tuned to be very memory efficient. Paul Medvedev wrote a program called Hammer, and the very popular SOAPdenovo assembler also has a k-mer based error corrector. An alternative method of performing error correction is by finding overlaps between the whole sequencing reads and then calling a new consensus sequence: looking at how consistent the base calls are for a group of reads that share the same sequence, and setting the incorrect bases to whatever the most prevalent base is at that position. A good example of this method of error correction is in the ALLPATHS assembler, which was developed by David Jaffe's group at the Broad Institute. It's an extremely good error corrector; it can find a lot of errors, even more than the k-mer based correctors, but it's much, much slower. Finding overlaps between all these sequencing reads is computationally very demanding, so overlap based correction tends to be about an order of magnitude slower than k-mer based correction.
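Here is a minimal sketch of the k-mer correction idea just described; the real correctors like Quake and SGA are far more careful statistically, so treat this as an illustration only.

```python
# K-mer error correction sketch: count k-mers across all reads, then
# for any rare k-mer in a read, try the three alternative bases and
# keep an edit whose k-mers are all "solid" (seen often enough).

from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def correct_read(read, counts, k, solid=3):
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] >= solid:
            continue                      # this k-mer looks error free
        for j in range(i, i + k):         # try editing each base it covers
            for b in "ACGT":
                if b == read[j]:
                    continue
                cand = read[:j] + b + read[j + 1:]
                if all(counts[km] >= solid for km in kmers(cand, k)):
                    return cand           # the edit restores solid coverage
    return read

reads = ["GATTACAGATT"] * 5 + ["GATTGCAGATT"]     # one read has an A->G error
counts = Counter(km for r in reads for km in kmers(r, 4))
print(correct_read("GATTGCAGATT", counts, k=4))   # -> "GATTACAGATT"
```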
Okay, next we take our cleaned up data and put it into an assembly graph. I alluded to de Bruijn graphs before, and now I'll take you through the construction of a de Bruijn graph in more detail. Again, this is a k-mer based model, where we break the reads up into short subsequences and construct a graph of those subsequences. Here is the de Bruijn graph, with k equals four, of this collection of reads, these five six-base-pair reads. We take every 4-mer in each read, so CCGT, and put that as a vertex in the graph. Then we take the next one, CGTT, and put that as a vertex, and we just keep sliding this k-mer window across every read, constructing a vertex for every one of those k-mers. Now, this vertex in red is special, because it's present in multiple reads with different next bases, and it represents a branch in the assembly graph. There's CGTT followed by an A in this read here, and CGTT followed by a C in this read, and that's this branch here. The assembler might look at this and consider it ambiguous: you wouldn't be able to resolve what the true sequence is. There's a path that goes down here, back around, and up here, but you can also have, just as likely, an assembly that goes down here, through this loop, and then again up here. So the fact that this k-mer branches means that we can't figure out what the true assembly is for this sequence.
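Here is a minimal sketch of that construction, using the same k equals 4 idea as the toy example above: every k-mer becomes a vertex, and consecutive k-mers, which overlap by k minus 1 bases, are joined by an edge.

```python
# De Bruijn graph construction sketch: slide a k-mer window over each
# read; each k-mer is a vertex, each consecutive pair of k-mers an edge.

from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)              # k-mer -> set of successor k-mers
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
        graph.setdefault(read[len(read) - k:], set())  # last k-mer of the read
    return graph

reads = ["CCGTTA", "CGTTAC", "CGTTCA"]    # toy 6 bp reads
graph = de_bruijn(reads, k=4)
for km, succs in sorted(graph.items()):
    print(km, "->", sorted(succs))
# CGTT has two successors (GTTA and GTTC): a branch in the graph,
# like the red vertex on the slide.
```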
So when I started in assembly in 2007, de Bruijn graph assembly was becoming popular. The theory was worked out around 2001 by Pavel Pevzner, but it didn't really become a popular method on real data until NGS data turned up and we had to deal with these huge volumes of reads. I'm just going to spend a few slides on fairly technical material about what my research involves and how we can efficiently encode these assembly graphs. The first next generation sequencing assemblers that used de Bruijn assembly, these k-mer based methods, typically used pointer based graphs. For every k-mer you would have an 8 byte pointer; if you've programmed in C or C++, you know this is how you represent memory locations in a program, and all of these assemblers are written in these sort of low level languages. Essentially you take 8 bytes for every vertex and 8 bytes for every edge in the graph, and it turned out that you required a few hundred bits per k-mer to store this graph in memory. That's why assembly became such a notorious memory hog for very large data sets: if you want to do a human assembly, where the graph has something like 10 billion vertices, you'd require hundreds to even thousands of gigabytes of memory. This is why everyone recommends that you get the largest memory server you can when performing de novo assembly.

But this isn't required; we can do much better than a few hundred bits of memory per k-mer, and it starts with a very simple observation. Because we're constructing a graph that consists of just k-mers, each vertex in the graph can have at most four neighboring vertices: one with an A in the last position, one with a C, one with a G, and one with a T. Because there's such a restricted neighbor set, we don't need to store each edge in the graph using an 8 byte pointer. In fact, we can get rid of the edges entirely and make them implicit in the graph. This is what ABySS does, an assembler that I wrote in 2007 when I was at the Genome Sciences Centre in Vancouver. Instead of explicitly storing the edges of the graph, I just take a large hash table, put every single k-mer in the sequencing collection into it, and when I want to look up the edges of a particular vertex to see what its neighbors are, I just query the hash table for the four possible neighbors. So if I'm trying to find the edges of CGTT, I would look to see if GTTC is in the hash table, and I'd look to see if GTTA is in the hash table, and if they are, I know that there are these two neighbors of this vertex. This allows us to represent the assembly graph in a much, much smaller amount of memory. We can shrink the memory by about an order of magnitude, from about a thousand gigabytes down to something like a hundred gigabytes.

This idea of making the assembly graph implicit by just storing a set of k-mers has become the jumping off point for developing even more efficient methods of representing assembly graphs. Two computer scientists in Australia, Thomas Conway and Andrew Bromage, took an approach based on sparse bit vectors for representing the k-mer set. They take this set of k-mers here and construct a bit vector of size 4^k; here k is four, so 4^k is 256. Then they set a bit in this bit vector for every k-mer that's present in the set. This allowed them to compress the sparse bit vector and get an extremely compact representation of the k-mer set, about 28 bits per k-mer. So now we've gone from hundreds of bits per k-mer down to about 28, and this was a really large theoretical breakthrough in representing assembly graphs. We can write each k-mer as a number between 0 and 4^k minus 1 by just saying A is 00, C is 01, G is 10, T is 11, concatenating that bit string, and that gives the number. We set the bit at that numerical index in the bit vector. So it's never going to be larger than 256, or rather, 255. Yeah, you can actually think of it that way; I just drew this as a 2D grid because it's much more compact on a slide, but you can think of each one of these k-mers as one of these filled in boxes. We have nine k-mers here, so we have nine bits set in this bit vector, and then what we do is just compress this down. It turns out to be very sparse, because we don't actually use a k of four in real assembly; we usually use a k of around 60, so much longer k-mers. Then 4^k is an extremely large number that we couldn't represent in memory directly, but we have this very sparse bit vector that we can compress down. Each one of these k-mers corresponds to one of these black boxes here. Is that good?
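Here is a small sketch of those two tricks together, simplified from what ABySS and the Conway-Bromage structure actually do: implicit edges found by probing the k-mer set for the four possible successors, and the 2-bit encoding that turns a k-mer into its index in the bit vector.

```python
# Implicit edges: to find a vertex's neighbours, probe the k-mer set
# for the four possibilities instead of storing pointers.

kmer_set = {"CCGT", "CGTT", "GTTA", "GTTC", "TTAC", "TTCA"}

def neighbours(kmer, kmer_set):
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmer_set]

# 2-bit encoding: A=00, C=01, G=10, T=11, concatenated, gives each
# k-mer a number in [0, 4^k), the bit to set in the bit vector.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def kmer_to_index(kmer):
    idx = 0
    for base in kmer:
        idx = (idx << 2) | CODE[base]
    return idx

print(neighbours("CGTT", kmer_set))   # ['GTTA', 'GTTC']: two edges found
print(kmer_to_index("CGTT"))          # 0b01101111 = 111, out of 256 possible
```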
Now, a method of representing these graphs that's even more recent, and gaining a lot of popularity, is using a Bloom filter. In a Bloom filter, again we take a bit vector and set bits corresponding to each k-mer. But now we don't just calculate the numeric index of each k-mer; instead we apply multiple hash functions to the k-mer sequence, in this case three hash functions. For whatever values the hash functions return, we set the corresponding bits in the bit vector. So each k-mer gets hashed three times, and we set three bits out of however large our bit vector is. When we want to check whether a k-mer is present in the Bloom filter, we run the same three hash functions on it and check whether those three bits have been set. If they are all set, we know that the k-mer is probably in our set. Bloom filters have this nice property that if you've added a k-mer to the filter, it will always report that it's there, but you can have false positives: if you didn't add the k-mer, the filter might still report that it's present. Titus Brown's group has been championing this method of representing de Bruijn graphs, and they've shown that you can tolerate a certain false positive rate within your Bloom filter. You can have a 5% false positive rate and it doesn't affect how well you can assemble the sequence. Within the Bloom filter you have this tradeoff between your bit vector size, which is the memory usage of your assembler, and how often you get false positives. As long as you keep your false positive rate under a certain percentage, you can perform assembly just fine, and this allows extremely compact representations of graphs, about 10 bits per k-mer.

So this slide just summarizes the different ways that we can represent these graphs: starting from the early methods, which were pointer based and took hundreds of bits per k-mer, down to somewhat more exotic data structures like the FM-index and this data structure called GFM that I worked on last year, which allows you to represent the assembly graph in about 4 bits per k-mer. This progression of improvements in the memory usage of assembly graphs has been one of the driving forces that have made assembly a fairly routine part of bioinformatics.
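To make the Bloom filter idea concrete, here is a minimal sketch; real implementations such as khmer use carefully chosen hash functions and sizes, so the details below (SHA-256 hashing, 1024 bits) are just placeholders for illustration.

```python
# Bloom filter over k-mers: three hashes set three bits per k-mer.
# Membership queries may give false positives, never false negatives.

import hashlib

class KmerBloom:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)          # one byte per bit, for clarity

    def _positions(self, kmer):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos] = 1

    def __contains__(self, kmer):
        return all(self.bits[pos] for pos in self._positions(kmer))

bloom = KmerBloom()
for km in ["CCGT", "CGTT", "GTTA"]:
    bloom.add(km)
print("CGTT" in bloom)   # True: it was added
print("AAAA" in bloom)   # almost certainly False (small false-positive rate)
```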
Okay, so that was the most technical part of the talk; I'm going to return now to the practical aspects. Once we've constructed this graph, we want to clean it up a little. Here's another schematic of an assembly graph. It looks like a bit of a mess: we have branches hanging off here, we've got these small little structures here, we've got edges crossing, and we want to make this graph more manageable. The first thing we do is something called tip removal. Any sequencing errors that remain after error correction will cause branches in the graph that diverge a little and form these short little structures we call tips. All of these vertices here colored in red correspond to sequencing errors. What the assembler does is look for these short branches and trim them back towards the main part of the graph. So in the first pass it would identify all these red nodes, walk back towards the main part of the graph until it finds it, and then get rid of all of those tips. That's just a way of cleaning more sequencing errors out of your assembly graph, and you end up with a much simpler structure with fewer of the branches that would stop the assembly when we get to the stage of building contigs.

Next we want to remove what are called bubbles. If the genome that you've sequenced is diploid, any SNPs in the genome cause a diverging structure in the graph, a branch where one side covers one allele and the other side covers the other allele. So if we have a heterozygous SNP, this might be the A allele and this might be the C allele. Since we want linear structures in the graph, we try to get rid of these. The assembler runs a simple algorithm to find these bubbles: we start at each branch point and see if the branches rejoin a short distance later. So we run the algorithm starting at all these blue nodes and traverse until we find the places where the bubbles rejoin. So you're asking about indels: do we remove small indels as well, and up to what size? Typically these would be indels of up to maybe 100 bases. Most of the bubbles you remove would be SNPs, and then short indels as well. Usually the assembler will compare the sequence of this path to this path, and as long as they're close enough it collapses them down to a single representation of the allele. If you had something very complex it might not actually remove it, because it might look like two different copies of a repeat, which it wants to keep. And now you can see that the graph we started with was quite complicated, but we've ended up with something much simpler and more manageable. Sorry, so when you're collapsing a bubble, do you record the second allele? It depends on the assembler. Some of them just throw out the second half of the bubble, the second allele. Some will write the divergent structure into a FASTA file that you can look at after the assembly. And some will actually write IUPAC ambiguity codes into the sequence, so they'd represent an A/C site with the appropriate ambiguity code. Most of them won't try to phase; usually the reads are so short, 100 bases, that you can't do a very good job of phasing.

Okay, and then finally we get to the contig assembly part, and this goes back to the question we had earlier about how assemblers actually generate contigs. It's just looking for all of the parts of the graph that are branch free. So we'd find a contig here, we'd find a contig here, and so on, merging vertices together into longer and longer sequences until we end up with our final contig set: one contig for each of what we call the unambiguous paths in the graph. The result is just a collection of contig sequences.
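Here is a minimal sketch of that contig step: extend from a vertex while there is exactly one way forward and one way back, and stop at any branch. The toy graph below is the CGTT example from earlier.

```python
# Contig extraction sketch: follow unambiguous (simple) paths in a
# de Bruijn-style graph and stop at branches.

def contig_from(start, succ, pred):
    """Walk forward from `start` while the path stays unambiguous."""
    path = [start]
    node = start
    while len(succ[node]) == 1:
        nxt = next(iter(succ[node]))
        if len(pred[nxt]) != 1 or nxt in path:   # branch or loop: stop
            break
        path.append(nxt)
        node = nxt
    # consecutive k-mers overlap by k-1, so append one base per step
    return path[0] + "".join(p[-1] for p in path[1:])

succ = {"CCGT": {"CGTT"}, "CGTT": {"GTTA", "GTTC"},
        "GTTA": set(), "GTTC": set()}
pred = {"CCGT": set(), "CGTT": {"CCGT"},
        "GTTA": {"CGTT"}, "GTTC": {"CGTT"}}
print(contig_from("CCGT", succ, pred))   # "CCGTT": stops at the branch
```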
Now, what are the lengths of these contigs? If you work with fairly simple genomes, if you're interested in bacteria, then you can expect contigs from Illumina data to be around 100,000 base pairs on average, a median contig length of around 100,000 base pairs. If you work with large eukaryotic genomes, for instance the human genome, you typically get 10,000 to 15,000 base pair contigs from Illumina data. Now, something I haven't talked about yet is newer sequencing technology like PacBio. If you do PacBio sequencing on bacterial genomes, it's quite likely that you get a single contig that represents the entire genome. So PacBio sequencing of bacteria has become a very effective way of getting finished genomes for these relatively simple organisms, but the cost is so high that for large eukaryotic genomes it's still quite prohibitive. I'm not going to talk about that in detail, but if anybody is interested, I'm happy to talk about it a little later.

Okay, and then once we have contigs, we want to try to link them together into larger structures. Since the assembler stops at these branch points, it has in a sense thrown away information: we don't know which repeats go together, and our assembly might stop at repeat boundaries, so we want to recover as much of that as possible. This is where we can start to bring in paired end data. So hopefully you had an introduction to paired end data yesterday; Chelsea, you're shaking your head, which is good. Here's what we do: we take all these contigs that we just assembled, map all of our read pairs back to the contigs, and see where the pairs map. For pairs where both ends map within a contig, great, we have this arc here. But at the ends of contigs, only one end of a pair might map, and what we can do is look at where one end of the pair maps and where the other end maps. Here we have this group of read pairs that are joined this way, and we might conclude that this contig is followed by this contig in the genome. We keep running this logic for all of the different groups of read pairs and link the contigs together. We then build what we call a scaffold graph, where we put arcs between contigs that we think are ordered, or neighboring, on the genome. When we do this, we can take the paired end insert size, which we know from the data, to infer the distance between the contigs, and then put Ns in the gap between pairs of contigs. That's what we call scaffolds.

As a final step, which most assemblers will perform, we try to fill in the gaps in these scaffolds. We have these runs of Ns, the unknown bases that separate contigs within a scaffold, and the assembler will try to perform a localized assembly of what is typically repetitive sequence in between the contigs. These are the things that couldn't be resolved in the graph, but if you take a more aggressive, localized approach, sometimes you can fill in these scaffold gaps and get a more contiguous assembly. Examples of programs: in SGA there's a gap filling module, and SOAPdenovo has a GapCloser module as well. You can also fill in gaps with other technologies: if you've done some PacBio sequencing for your large genome that wasn't enough to assemble it completely, you can see whether the PacBio data resolves these repetitive sequences between the contigs.
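Here is a toy sketch of that scaffolding logic: read pairs whose ends map to different contigs vote for links, and links with enough supporting pairs become edges in the scaffold graph. The contig names and the support threshold below are made up for illustration.

```python
# Scaffolding sketch: tally read-pair links between contigs and keep
# the well-supported ones as scaffold-graph edges.

from collections import Counter

# (contig of read 1, contig of read 2) for pairs spanning two contigs;
# hypothetical mappings for illustration
pair_mappings = [("ctg1", "ctg2"), ("ctg1", "ctg2"), ("ctg1", "ctg2"),
                 ("ctg2", "ctg3"), ("ctg2", "ctg3"), ("ctg1", "ctg3")]

MIN_SUPPORT = 2                     # require at least 2 supporting pairs
links = Counter(pair_mappings)
scaffold_edges = [link for link, n in links.items() if n >= MIN_SUPPORT]
print(scaffold_edges)               # [('ctg1', 'ctg2'), ('ctg2', 'ctg3')]
# -> scaffold: ctg1 -NNN- ctg2 -NNN- ctg3, with gap sizes estimated
#    from the library insert size.
```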
Okay, so are there any questions about that portion of the talk, the assembly pipeline? Plants are hard. They're typically hard, yeah. That's a good lead into this part. What I'm going to talk about now is the factors that make assembly difficult. What really prompted this work is a competition called the Assemblathon, or Assemblathon 2, which was organized by UC Davis. They took sequencing data from three large eukaryotic genomes and released it to the community, and said to all the people interested in developing assemblers: assemble this. We then gave our assemblies back to UC Davis, and they scored each assembly to see how well all the programs and different pipelines performed. This is a quote from Keith Bradnam, who organized the competition, from the paper: we make a few broad suggestions to those performing de novo assembly of a large eukaryotic genome; don't trust the results of a single assembly; if possible, generate several assemblies with different parameters and different pipelines. The reason he said this is that the results of the Assemblathon competition were highly variable. Some assemblers performed very well on some genomes but not others, and some genomes were much easier to assemble than others. It's very difficult to predict how well your assembly is going to turn out when you start. You don't know anything about your genome; you don't know its repeat structure, you might not even know its GC content, and all of these factors matter when choosing how you're going to assemble it.

So last year I worked on a program that tries to measure these different things that make assembly difficult. I've just made a list here. If your genome is highly repetitive or highly heterozygous, it's difficult to assemble. If you remember back to when I was talking about assembly graphs, repeats cause branches in the graph that stop the contig building process, and heterozygosity causes bubbles in the graph that make it difficult for the assembler to resolve the allele structure. More factors that make assembly difficult: if you sequence to low coverage, you get a lot of coverage breaks in your graph, where you just didn't sample enough sequence to have a complete graph. If your sequencing is biased towards high or low GC content, then again that causes coverage breaks in the graph. If your data has a high error rate, it causes more branches. If you have chimeric reads, or a lot of adapters in your sequence, again it causes problems in the graph, and so on. All of these things are, I think, not really taken into account when designing an assembly, and users of a de novo assembler don't think about them early in the process when designing their assembly strategy. So I've been working on ways of making it easier to measure whether these factors are going to affect a given assembly.

We do this by returning to the structure of the assembly graph. I've already talked about how errors appear in the graph: they form these little branches that we call tips. I've talked about how SNPs and indels form bubbles. But I haven't talked about how repeats form branches in the graph. It's a fairly natural idea: if you have some repetitive structure, the graph will have a branch between the different copies of the repeat, some structure that you can't resolve, and I'm just depicting that here. This program that I'm developing, which is called preqc, tries to learn the structure of the graph and measure how often the graph branches due to repeats, how often it branches due to SNPs and indels, how often it branches due to errors, and what we can tell about the genome by just looking at the assembly graph. It does this by calculating k-mer coverage. I talked about this before when we were discussing error correction: most k-mers in the graph will be seen many times if we've sequenced to 50x; on average we might see each k-mer 40 times or so. We can use this to perform inference over the graph by just looking at how many times each k-mer has been seen. The most basic output is just a histogram of these k-mer counts.
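Computing that histogram is simple enough to show in a few lines; here is a toy sketch using the same little read set as the error correction example.

```python
# K-mer count histogram sketch: count every k-mer across the read set,
# then tally how many distinct k-mers were seen once, twice, and so on.

from collections import Counter

def kmer_histogram(reads, k):
    counts = Counter(r[i:i + k] for r in reads
                     for i in range(len(r) - k + 1))
    return Counter(counts.values())        # multiplicity -> #distinct k-mers

reads = ["GATTACAGATT"] * 5 + ["GATTGCAGATT"]   # same toy set as before
for mult, n in sorted(kmer_histogram(reads, 4).items()):
    print(f"{n} k-mers seen {mult}x")
# On real data this gives the main coverage peak, the error spike at 1x,
# and (for heterozygous genomes) a half-coverage peak.
```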
This is a histogram of k-mer counts for a human genome. We see this roughly normal distribution with a mean of, I think, around 28, but we see a few different features. There's this sharp peak here: all of these k-mers, about 6% of the total, have been seen a single time. These are the k-mers containing sequencing errors, as I mentioned during the error correction section; k-mers that have been seen a single time are probably just k-mers generated by a sequencing error at some position. Then we see a little feature here, a bump at about half of the main coverage. These are k-mers that cover heterozygous SNPs. In the human genome this isn't very pronounced; it's just a little bump that's barely perceptible, you might even say it's noise. If we look at a more difficult genome, like the oyster genome, we see a very different count distribution. Here we have two well-defined peaks: one here for k-mers that are present on both parental chromosomes, and one here for k-mers present on only one parental chromosome, the heterozygous k-mers. This corresponds to a SNP about every 80 bases in this genome, and that is much, much too heterozygous to assemble using conventional methods with just Illumina sequencing. This data came from a paper by the BGI, where they sequenced the oyster, tried to assemble it using normal whole genome shotgun techniques and failed, and ended up sequencing fosmids to assemble the genome, just because the heterozygosity was so high.

We applied this to a lot of different genomes to try to infer how often these branches occur. We built a statistical model of the structures in the assembly graph, and I'll just briefly go through this because I'm starting to run out of time. Essentially, what we do is count the number of k-mer occurrences on each half of a branch. Here we have a branch where the incoming node is seen 33 times, this side is seen 30 times, and this side is seen 2 times. Because this side of the branch is so rare compared to the other side, we would probably say that this is just a sequencing error, and we could get rid of that half of the branch. If the coverage between the two halves of the branch is balanced, with about half the k-mers on each side, we would probably say that this is a SNP or an indel that's heterozygous in the genome. If the coverage coming out of the branch is much higher, so 45 plus 24 is much higher than the incoming side of the branch, we'd probably say this is a repeat structure. We can run this over every branch in the graph and estimate how often these structures appear due to SNPs and indels, and how often they occur due to repeats.
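Here is a toy sketch of that classification; the thresholds below are invented for illustration, and preqc's actual statistical model is more involved.

```python
# Branch classification sketch: given the k-mer count entering a branch
# and the counts of its two outgoing sides, guess why the graph branched.

def classify_branch(in_count, out_a, out_b, error_frac=0.1):
    total = out_a + out_b
    minor = min(out_a, out_b) / total
    if minor < error_frac:
        return "error"            # one side far too rare: sequencing error
    if total > 1.5 * in_count:
        return "repeat"           # more k-mers come out than went in
    return "variant"              # balanced sides at normal coverage: SNP/indel

print(classify_branch(33, 30, 2))    # error   (like the 33/30/2 example)
print(classify_branch(30, 16, 14))   # variant (balanced halves)
print(classify_branch(24, 45, 24))   # repeat  (outgoing coverage much higher)
```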
So here are six different genomes. In yellow is the oyster genome, and this is an estimate of how often the graph branches due to SNPs and indels; we see that it's very often for the oyster. This is the parakeet from the Assemblathon competition; it's also highly heterozygous and branches quite often. Down here in light blue is a human genome, which doesn't have that many SNPs, only about one in a thousand bases, so the branch rate is about one in a thousand: we've correctly estimated the SNP rate of the human genome here. Down here in dark blue is a yeast genome, which is haploid, and that's reflected in the graph branching very infrequently with these bubble-like structures. So you can run this program on your raw data and estimate how heterozygous your genome is without knowing anything about it before starting the assembly, and that helps you understand how difficult the assembly is going to be. Yes, you probably would. You have two strategies. Say you wanted to assemble this bird genome, which is fairly heterozygous but maybe not cripplingly so; you might take an assembler that's designed for highly heterozygous genomes. Your second option is to parameterize the assembly: most assemblers have a lot of tuning knobs that change how they work, and you might say, okay, try harder to remove these bubbles from the graph, or try to resolve heterozygosity more aggressively. Yeah. Yes, exactly, that's the error rate: we've misidentified bubbles in the graph at a rate of about one in 40,000 bases.

If you look at the frequency of repeat branches, which we also estimate here, it tells a slightly different story. As we expect, the human genome, which is littered with transposons, is highly repetitive, and that's reflected in this repeat branch plot: the rate of branches for the human genome is the highest here. The oyster genome has a similar rate of repeat branches; it's only a fifth of the size of the human genome, but it's comparably repetitive. As we go down, these species become easier and easier to assemble. Here in red is the Assemblathon snake, in green the Assemblathon fish, and again the bird in purple, and then the small yeast genome branches very infrequently due to repeats, because it's only 12 megabases with a fairly simple structure. Now, what I didn't explain earlier is that these plots are per k-mer value. If you remember, k is the size of the subsequences we use to construct the assembly graph, and as we increase k, as we use longer k-mers, the graph becomes less and less repetitive. So the strategy is that when you're sequencing difficult genomes, you want to use more stringent assembly parameters, which in this case means larger k. But of course, to support a much larger k you need higher coverage sequencing, so if you have a very large genome you might want to spend the extra money to sequence twice as much and then use more stringent assembly parameters out here.

The assembler won't annotate repeats for you; they just look like any other sequence. What you can do is map your reads back to the assembly and look to see if there are big pile-ups of reads, or a lot of divergence, a lot of mismatches, in particular regions. But the assembler typically won't annotate repeats. They won't assemble at all. The centromeres are too repetitive, and the telomeres won't assemble either. They'll just be... most assemblers will give you back the complete graph, and all of the centromere sequence would just be single k-mer vertices. Because it branches so much, the assembly just stops immediately, so you get very short, fragmented pieces. Is there another question? When you have longer reads, with less branching in the graph, doesn't that allow you to use a longer k-mer? Yes, that's right. Because you break a read up into k-mers, the number of k-mers in each read is linear in the length of the read, so if you have longer reads you get more k-mers out, and you can use a correspondingly higher k. With 100 base pair reads we typically don't go over a k of about 60 to 70, but if you had 250 base pair reads, you might use a k up to 100 or 120. But then you can start arguing about whether you should use an overlap based assembler instead, and that's a different argument. When you count k-mers, is that within one read or across all of the reads? It's across all of the reads. So this program is just measuring the structure.
You can think of it as building the full assembly graph and then trying to classify each branch into one of these three categories: error, repeat, or heterozygosity. Do larger k-mers make everything much slower? They use more memory; typically, I would say k doesn't have a large effect on the runtime. There are fewer repeats to try to resolve in the graph, so it makes the assembler's job easier; I'd say you should see equal runtime or slightly faster with large k.

Okay, and also this program will predict the genome size for you. This is just the output of the program; sorry, the laser pointer is dead. Here the human genome is predicted to be about 3 gigabases, which is more or less correct, and for these Assemblathon species the predictions are within about 10%. So you can predict the genome size directly from your raw sequencing reads without actually performing the assembly. The program will also estimate the quality of your data. It will look at the quality scores and plot the mean quality score by position, which roughly represents the error rate of your data. The fish data is the lowest quality data set here, and the snake data, which was sequenced by Illumina itself, is the highest quality data set in the Assemblathon. We can also measure the error rate by just looking at overlapping reads. The program will take a read, find all the reads that share common sequence with it, and calculate a consensus at each position. Here we see that most reads show a G at this position and this read shows a C, so we can infer there's an error here, and we can calculate the per-position error rate plot that I showed earlier. This is handy for choosing whether you want to perform the quality based trimming I mentioned earlier. For data where the error rate becomes very high towards the end, this is the Assemblathon bird data, you might want to trim back to about 120 bases, where the error rate is only about 1.5%, rather than using the full length of the read, which becomes quite noisy at the end. So this again helps guide how you're going to do your assembly.

It also lets you look at more complicated metrics, like whether you have a sequencing bias towards particular GC content. This is a plot of coverage versus GC content for the Assemblathon fish data, and it looks quite good; the coverage doesn't look very biased. If we look at a different data set, the yeast, you could draw a line through here indicating that higher GC regions in this data set are sequenced less well than lower GC regions. So we have a bias toward lower GC sequence here, and typically we don't want to see this; we'd like coverage to be relatively uniform across the entire genome, independent of GC content. If we look at the oyster data, we see something quite interesting: there are two blobs in this plot. The lower blob is all of the heterozygous k-mers that I mentioned, because this genome has such a high density of SNPs, and the upper blob is all the homozygous k-mers, the ones present on both parental chromosomes. So again, these plots give us information about the structure of the genome and how difficult the assembly is going to be.
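Going back to the genome size prediction for a moment, here is a minimal sketch of one standard way to do this from k-mer counts; preqc's exact model may differ. Genome size is roughly the total number of solid k-mers divided by the k-mer coverage, with coverage read off the peak of the count histogram.

```python
# Genome size estimation sketch: size ~= total k-mers / k-mer coverage.
# The toy genome, read simulation, and cutoffs are invented for illustration.

import random
from collections import Counter

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(2000))
reads = [genome[s:s + 100]                       # ~50x of 100 bp reads
         for s in (random.randrange(1901) for _ in range(1000))]

k = 21
counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
hist = Counter(counts.values())

# coverage estimate: the histogram peak, ignoring the 1x error spike
coverage_peak = max((m for m in hist if m > 1), key=lambda m: hist[m])
total_kmers = sum(c for c in counts.values() if c > 1)
print(round(total_kmers / coverage_peak))   # a value near the true 2,000 bp
```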
The program will also estimate the fragment size distribution for your library. If you've done paired end sequencing and your lab told you it was on average a 400 base pair fragment, you can run this program and check whether that matches expectation. Just picking one example here: here's the snake data in red, and we see that the average fragment size is about 375 bases, with this distribution here. In the oyster data there are three peaks, because there were three different sequencing libraries for this genome. And finally, the program will perform a simulated assembly for you across a range of k, so you can get some information about which k-mer size you should use when you actually perform your assembly. For most of the genomes, take the yeast for example, with low k the assembly is not very good, then with intermediate k we find a peak here, and then it drops back down at high k. The reason for this is that when you use a small k-mer the graph is more repetitive, because k is not a very stringent parameter, but for high k you haven't sequenced enough to have good coverage of the graph, so the graph breaks a lot due to drops in coverage. You end up with this sort of Goldilocks principle, where there's a k that's just right in the middle. What you can do is visually inspect this plot and say, okay, I'm going to try all of these k values that look like they'll give the best assembly, around here. The other genomes all have the same sort of curve, where low k isn't very good and intermediate k gives the best assembly; for the Assemblathon bird you might pick something around k equals 61, which is a good starting point. Interestingly, the snake data keeps going up and up, because this genome was sequenced to extremely high coverage, I think about 100x, so we haven't actually used a k that's long enough to negatively affect the assembly; you could keep assembling with larger and larger k and keep improving the assembly.

Okay, so just a plug for this: it's quite easy to run and fast. You can run it in less than 24 hours on a human genome, so everybody who comes to me to ask for assembly help, I tell them to run this program first. There's a preprint on the arXiv, and it's now published in Bioinformatics, describing all of these methods, and the source code is on GitHub. It just consists of three simple commands that you run to generate this report and all these figures for your data. Right, so I'll stop there, just on time, and if you have any more questions feel free to email me; I'm happy to discuss anything else. Very recently there's been a good paper from the Broad Institute doing that: when you assemble a human genome, you get a bunch of sequence that doesn't map to the reference, it's just novel sequence and you don't know where it goes, and they've tried to use population data to infer where it's placed on the reference genome. But this is a fairly new idea; people haven't really done it at large scale to try to resolve these structures in the graph, though you could think about using something like LD to infer which pieces of the genome go together.