So, hello everyone. I'm Michael Brudno, and I'm going to tell you a little bit about genome assembly: how genome assembly works in general, and a little more specifically how it applies to high-throughput sequencing. The material I talk about you will actually have a chance to play around with in practice in the lab, which is conveniently going to be held sometime tomorrow.

So, to start with, all of you are here because you want to learn about handling next-generation, or high-throughput, sequencing data and building genomes from it. There are some questions you want to ask yourself before you actually go about doing this. The first and most important one is: what do you want to learn about the underlying genome? If you really want the whole genome from first base to last base, say a bacterial genome assembled from first base to last base using just high-throughput sequencing, odds are you're out of luck. You will not be able to do this. You will be able to get pieces of it, sometimes long pieces, but you will never, or almost never, get the whole genome from start to finish unless it's something like a very small virus. It's basically impossible with very short reads.

So the questions you may want to ask: are you interested in comparing this genome to some other genomes? What kind of variation are you interested in, SNPs, larger structural variations? Maybe you don't really need to do an assembly at all, and all you want is detection of variation. If you're interested in predicting genes, can you somehow leverage other sources of information? Building a genome from scratch is going to be very difficult to impossible. A very important question: do you have a similar genome? Very often you will have a closely related species of bacteria which was already sequenced via Sanger or some other technique, finished, high quality, and then you can use that genome to help you build the one you're actually working on. Another very important question: do you have mate-pair data? If you have mate-pair or paired-end data, that's very good. If you don't, you should basically go back and get some paired data. So: do you have mate-pair or paired-end data? Another important question: do you have other read types? If you have both Sanger-type data and high-throughput sequencing, Illumina or SOLiD data, you should be able to leverage them to work together. And another somewhat important question: what computing power do you have available? For many of the programs we're going to mention today, computing power is not at all trivial, and there are various trade-offs. Some tools are very memory-intensive, other tools are very CPU-intensive. If you have a ton of data and a machine with 32 gigabytes of memory, which is not that much nowadays, you will be able to run some of them, but if you don't, you're stuck using other tools. And by the way, for everything I talk about today, please, please ask questions. I initially thought this was going to be a one-hour lecture. And thank you, Francis, for running over.
That's made my lecture a little bit shorter. But yeah. Someone asked: maybe this is just me, but I'm kind of confused about mate-pair versus paired-end data. That's actually a very, very good question. I'm going to be mainly using the terms interchangeably; that's not exactly the right thing to do, and Francis is shaking his head. I don't think I have a good slide on this in this presentation. But basically, with mate-pair data you have a piece of DNA and you sequence in from each of its two ends: you go in from the left and in from the right. With paired-end data, you grab a piece of DNA and you sequence two pieces of it, one after the other, at some distance apart. The reason the distinction is important is that the distance between the two reads is completely different in the two cases. If you're making mate-pair data, you have to take your DNA and bend it into a circle so that you can sequence out from the same location, and for the DNA to bend into a circle it has to be some minimum length, around 500 bases or so. If you're doing paired-end data, the fragment can't be too big, because you need to be able to sequence two pieces which are next to each other. Does that make reasonable sense?

Follow-up question: yes, but what is the biological relevance of adopting these two techniques? So the biological relevance is not really much; it's that some technologies support one or the other, and some support both. For example, in the case of Illumina, or Solexa; I'll say Solexa because I can't remember to say Illumina every single time. In the case of Illumina, I think they do not support mate-pair data, and Francis can tell me if I'm wrong, but I'm pretty sure they only do paired-end. In the case of SOLiD, they actually only do mate pairs; they do not support reading two reads some distance apart from each other. This is basically a difference in chemistry, so it's more a question of what your technology can give you. Although I've heard rumors that SOLiD will soon come out with a paired-end strategy; I don't know if Solexa is coming out with a mate-pair strategy. We can ask the SOLiD people about that; I'm pretty sure I've heard this from multiple people, so it's probably true.

So, given this, what are your options? Well, it goes back to what you want to learn: variation within the species, or conservation between species. You can try to build a genome from scratch; you'll get some nice pieces, but probably you won't get the whole thing. You can try to do reference-assisted assembly, where you actually use a related genome in order to help assemble, but you will miss some differences between the reference and what you're trying to assemble. Especially difficult are copy number variants, and we'll have a chance to talk about that later: if one of the genomes has two copies of a certain element while the other genome has a single copy, that's very difficult to capture with reference-assisted assembly. You can also just map the reads to a reference genome and not even worry about doing any kind of assembly. This is great if you want to discover variation, but if you don't have a reference, then obviously this doesn't go anywhere.
Also, the same problem: when you're doing reference-assisted assembly, you miss things which are not in the reference, or which are hard to find in the reference.

So, mapping versus assembly. If you have a reference genome, you can find SNPs and indels; this is called genome re-sequencing, and it's great for somatic mutation detection, SNP discovery, and structural variation discovery, as we'll talk about. But one of the problems is dealing with non-uniqueness. Whenever a read maps to multiple locations, what do you do with it? Maybe it maps perfectly to one location and with one change to another; then you can probably send it to the first one. But it's even more difficult if it maps with one change to each one, and maybe different changes, so you really do not know where to place this read. In the case of de novo genome sequencing, you won't have these problems, but you'll get gaps; you'll get pieces whose order or orientation you don't know. It's obviously possible to try to combine these: first assemble de novo and then take the resulting contigs and map them. That's actually a very, very powerful technique which is used a lot.

Okay, so just a little overview. Whole genomes have to be broken up. You start with a piece of DNA. You have a sequencing machine which generates these pieces called reads. These reads are very hard to see here because for whatever reason the colors got destroyed; these should be red and blue, not red and red. But there are two colors, and these are reads coming from the two strands of the DNA. Computer scientists like me tend to think that DNA is a string, and they tend to forget that it's actually a molecule which has two strands. Minor detail. Then there is an assembler, which is basically a program written in your favorite programming language, which will hopefully generate the contigs. And after some finishing, you want to be able to get a final genome. Finishing is usually expensive and time-consuming, so how well the assembler works really dictates how well you can do.

Finally, what does this data look like? This is what Sanger data may look like: lots of relatively short pieces of DNA. Assembling these was considered a really, really big challenge ten years ago, and people talked about how difficult it was to build the human genome. Today we have next-generation sequencing, which gives you a ton more reads, but they're even shorter, so the problem is very difficult. And here is an illustration of a mate pair, which is what we talked about earlier: two pieces of DNA which are some distance apart. With paired-end sequencing they would be placed both ways.

So, what is an assembler? I'm a computer scientist, and I'm going to give you a computer scientist's definition of an assembler: it takes as input a set of strings over A, C, G and T, and it gives as output a common superstring of these strings. So, for example, you have these three strings, and this is a superstring that contains all of them. So, are there any people in the room who consider themselves computer scientists? A couple, a few. So, what's wrong with this definition? Yeah, it's not just many; there are as many possible outputs as you want, because I never specified anything else about that superstring. You can make a superstring by taking these strings and sticking anything you want in between them, and that would be a perfectly valid superstring.
And while that may seem like a very pedantic point, it's actually relatively important, because any way in which you further constrain the problem may lead to problems in the assembly. All you really know is that you had some genome and you sampled some pieces from it, and this is before we even consider such things as sequencing errors. There are as many outputs as you want. So initially, when people thought about this, they said, well, why not the shortest common superstring? There are multiple ways of putting these reads together; why not find the one which is shortest? Well, there are two problems with that. One is a computational problem and the other is a biological problem. The computational problem is that this is what's known in computer science as NP-hard, which means it's very, very difficult to solve computationally. A bigger problem is that this is biologically not very sensible, because it leads to something called over-collapsing of repeats. Imagine that you have some segment which is present in the genome multiple times. There is absolutely no reason to use it multiple times in building the shortest common superstring, so in the end you will get all the copies of the repeat collapsed into a single location. So this was the first, earliest approach, and it turned out to be exactly the wrong thing to do.

There have been alternative approaches proposed, and these are based on something called de Bruijn graphs, which I'll talk more about, and string graphs. These two have been... well, if you've read Gulliver's Travels, there is this war about Big-Endians versus Little-Endians: the Lilliputians had a war between two sides about which end of the egg to crack, whether, when you're cracking an egg, you should crack it from the big end or the little end. It was meant to be a parody of the religious wars of Renaissance and early modern Europe. De Bruijn graphs and string graphs are very much like that. They really are two sides of the same egg, and the groups have had, I wouldn't call it a religious war but close to it, about which of these approaches is right: there are people who will say our de Bruijn graph approach is exactly the right way to do it, versus the string graph approach. They really are two sides of the same egg. One of the key things is that both of these formulations are also NP-hard. If you read the papers, both of them try to hide it; they make arguments that their approach is computationally easier for one reason or another, but it's not true. They're all NP-hard. There is no free lunch.

Well, if they're really hard to solve computationally, why try to do it like this? An answer is that there is a big difference between being NP-hard in theory and what can actually be implemented in practice. To explain this, there is a running joke in which biologists and mathematicians were asked to predict the outcome of horse races. The biologist decided to look at all the horses, study their evolution and their lineages all the way back to the time of Genghis Khan, and came up with a model which was able to predict the winner of a horse race 50% of the time, and was very happy about this.
And the mathematician drew some integrals and some complex formulas and then said, hey, I have a model which predicts the winner of a horse race 100% of the time, but it works for a spherical horse in a vacuum. So, this is the difference between theory and practice. In theory, it's NP-hard and should be very complicated; in practice, we have very good ways of making these things work.

So, what are these de Bruijn graphs, these magical contraptions which supposedly make things easier? Imagine again that you have a set of strings, and we're going to build a graph. Hopefully a graph doesn't scare any of you; it has nothing to do with graph paper. It's a thing with nodes and edges. The nodes in the graph will be the (k-1)-mers from the set of strings. So this is the set of strings that I've got; a k-mer is a substring of length k, and k-1 would be 2 in this case, and every single (k-1)-mer should be somewhere in this graph. So AG is right here, TC is right here, and so on.

Just a little bit of terminology: the set of k-mers in the genome is called the k-spectrum. And finding the shortest string with a given k-spectrum is a version of the Chinese Postman problem, a classical problem in computer science. Pevzner, in 1989, showed the way to do this: if you had every single k-mer that was present in the genome, you could find, really easily computationally, the shortest genome that contains all of them. It works by something called Eulerization, and this goes back to a classic problem called the Bridges of Königsberg. There was a map of the now-Russian city of Königsberg, where the city had islands and bridges connecting the islands, and the question was: could you traverse all of the bridges of the city without crossing any bridge twice, and return to the same place? Whether this is possible turned out to be a really simple problem: you can do it if and only if every single island is bordered by an even number of bridges. If that's the case, you can do it; if not, you cannot. The same thing is true here: if every single node has an equal number of edges going in and going out, you can do it; otherwise, you can't. In this particular case, you cannot: this graph initially had two edges coming in but only one edge going out, so the second time you come into this node, you will never be able to leave, which is why you wouldn't be able to do it. But there is a technique called Eulerization, which makes the graph Eulerian. Euler was one of the first to formulate the solution to this problem, which is why it's called an Eulerian tour. To make the graph Eulerian, you basically add edges to it to make it balanced, so that every node has an equal number of edges going in and going out. So this red edge, actually two edges, one here and one there, will make it balanced. Yes?
Question: so, when you talk about (k-1)-mers, is there a specific way you pick them from the strings? Yes. I should have been a bit more specific about how I did this. Every single (k-1)-mer becomes a node, and every k-mer becomes an edge between two nodes. So if you have a k-mer GCA, that becomes an edge between GC and CA, and this edge really is labeled GCA. The C is common there; exactly, k-1 letters will be common. So GC to CA, and this edge is GCA. Thank you, that explains everything else I said afterwards: every single k-mer becomes an edge in the graph, and now you want to go through the graph so that you traverse every single edge at least once, but as few times as possible. That's exactly the formulation. Eulerization basically adds the fewest number of edges needed to make the graph Eulerian, or balanced, and once it's balanced, you can go through every edge exactly once.

Okay. Another important thing about DNA is that, as I mentioned, it's a molecule. In the graph which I just showed you, everything was treated as a regular string of letters, A, C, G, T. How is DNA different from a string? Well, it's got two strands. This is what DNA actually looks like, in a cartoonish way. So the question you may ask is: how can two DNA molecules overlap? If I have two DNA molecules, how can I join them together? It turns out there are several ways of doing this. Imagine this AC and this TC: they could overlap by this letter C right here. Or imagine this AC and this TCG: I'll have to flip TCG in order to get the G to align with its C, on the reverse strand. And in another case I have to flip the other one instead: flip neither, flip the second, or flip the first. These are the possible ways of having things overlap, and this can be modeled by something called a bidirected graph, where the direction lives not on the edge as a whole but on each of its ends, at the nodes. There are three types of edges in a bidirected graph: a regular edge, an out end pointing to an in end, but also an out pointing to an out, or an in pointing to an in. These model the ways that DNA molecules can overlap, and the fact that DNA is in reality double-stranded, not a string.

Bidirected graphs actually turn out to be really similar to directed graphs in that you can define something called a walk on them, a valid way of traversing them. But this is going to be a little bit tricky. For example, this is not going to be a valid walk. Why? Well, because here you start at an AT, and then you go into a TT: an AT will overlap a TT by one T, so you can go into this node. But a TT does not overlap a GT or an AC; only an AA does. So if you go through like this, you can never parse a valid string of DNA out of it. The key thing for walks in bidirected graphs is that you always have to balance your edge ends: if you come into a node on an end pointing out, you have to leave on an end pointing in. So from here you can go down, or you can go this way, but you can't go straight.
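To make the construction concrete, here is a minimal sketch in Python (made-up toy reads, plain strings, reverse complements ignored for now; real assemblers are of course far more involved) of building a de Bruijn graph, with (k-1)-mers as nodes and k-mers as edges, and checking Euler's balance condition:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read becomes an edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    edges = defaultdict(list)            # node -> list of successor nodes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])
    return edges

def is_balanced(edges):
    """Euler's condition (for a connected graph): every node must have
    the same number of incoming and outgoing edges."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, vs in edges.items():
        for v in vs:
            outdeg[u] += 1
            indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    return all(indeg[n] == outdeg[n] for n in nodes)

reads = ["ACGCGT", "CGTACG"]             # toy reads, k = 3
graph = de_bruijn(reads, 3)
print(dict(graph))
print(is_balanced(graph))                # False here: this toy graph would need Eulerization
```

For these toy reads the graph is not balanced, which is exactly the situation that Eulerization fixes by adding edges before looking for an Eulerian walk.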
And from here you can go down as well, but you can't go straight. And you can see this will actually spell out a string of DNA; this one will be GT, TT, TT, GT. The cool thing about bidirected graphs is that they completely let you forget about double-strandedness once you get them into your head. Take a look at this: if you start from here and do this walk this way, you will spell the following string: G, G, C, A, A, T. What if you take the exact same walk and walk it backwards? You will spell A, T, T, G, C, C. And what's the relationship between the two strings? They're the same piece of DNA, the same DNA molecule, just the two strands of it. So the two walks on the bidirected graph, whether you walk it one way or the other, will actually give you the two strands of the DNA.

This is actually a great example of theoretical computer science inventing something and then finding an application for it later. Bidirected graphs, as they relate to computational biology, were introduced by Kececioglu in 1992. Kececioglu in 1992 didn't have Google, which is a great research tool. When we were doing some research on bidirected graphs we decided to Google them, and it turned out that bidirected graphs had been invented by theoretical computer scientists back in the 1960s. The problem was, nobody knew what to do with them, so they were basically forgotten: someone invented the formulation, but who needs it? Nobody used it. And then Kececioglu used it without realizing that they were known back in the 60s.

So this, for example, is how you handle double-strandedness using de Bruijn graphs. Imagine that you had this DNA string. These are all of the k-mers which you would generate from it, and you would build a graph which looks like this if you were dealing with regular strings. Unfortunately, what you really have is, for every single k-mer, its reverse complement as well, and for some of them the reverse complement is the k-mer itself. So the graph would actually look like this once you add all of the reverse-complement k-mers, and what you need now is two walks which are reverse complements of each other. This obviously doesn't seem as straightforward anymore. But by going to a bidirected graph, it's very easy to see that there is a simple walk through these nodes which uses these two twice. So bidirected graphs are really an easier way of thinking about DNA sequences, and almost all of the assembly tools now use them. They're really elegant from a computational perspective.

So what are the downsides of the de Bruijn approach? The first thing it does is take the DNA sequences which you've got, your reads, and split them into k-mers, strings of length k. And there really isn't any huge motivation for doing this. Very often there are actually few disadvantages: when you're dealing with really, really short reads, your k will be almost the size of your read, and then you don't lose much information. But if you're dealing with longer reads, there's obviously room to lose information by splitting your read into subsegments of length k. All of the de Bruijn based approaches will do the split into length k, and then they add really complicated computational steps in order to go back and reconstruct the reads.
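In code, the usual way to "forget about double-strandedness" for k-mers is to keep one canonical representative per k-mer and reverse-complement pair. A small sketch, not how any particular assembler implements it:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """A k-mer and its reverse complement come from the same DNA molecule,
    so keep whichever is lexicographically smaller as the representative."""
    return min(kmer, revcomp(kmer))

print(canonical("GGCAAT"), canonical(revcomp("GGCAAT")))  # the same representative twice
print(canonical("ACGT"))   # a palindromic k-mer is its own reverse complement
```

This is the trick that lets a k-mer and its reverse complement be treated as one node (or one edge) instead of two.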
Because you could theoretically get a path which is not compatible with any read, if the read is longer than k: all that we asked is that you go through every single k-mer. Also, and this is a big argument, the approach is going to be very sensitive to sequencing errors, because as soon as you change one letter, you destroy a whole window of length k and you will get multiple paths for the same thing. The memory question is actually quite controversial. People who believe in the string graph approach will say that de Bruijn graphs are not memory efficient, because you have a node for every single substring of length k-1. People who do the de Bruijn graph approaches will say, no, no, no, we are the ones who are memory efficient. It really is a trade-off: if you have really, really high coverage and really short reads, de Bruijn graphs are more memory efficient; if you have lower coverage and longer reads, they become much less memory efficient.

So the goals of people who believe in string graphs are that there should be one node per read, or better; that there should be a way of reducing the memory used by the graph; that the division into k-mers, since it's really arbitrary, shouldn't be necessary; and that there should be flexibility in the presence of sequencing errors. This has led to work on string graphs, which are based on something called an overlap graph. Here you start with your sequences, your reads, and you do not split them in any way; you directly build the graph from them. The nodes of your graph are the reads, so every single read becomes a node, and the edges are overlaps between them. The weights of the edges are the lengths of the non-overlapping prefixes. So, for example, ACGTAC overlaps TACAT: three letters, TAC, are shared, and the non-overlapping prefix is ACG, so three will be the weight of this edge. Here, ACGTAC versus CATAC: there is an overlap of just one letter, C, so the prefix is five letters, ACGTA.

Question: are indels allowed in these overlaps? It really depends on the actual assembler. For Sanger reads and 454 reads you definitely need to allow indels. For things like Solexa and SOLiD reads you usually don't allow indels, but they could be allowed. Basically, whenever you have an overlap, you have two pieces of DNA which you have actually read from two different reads; if there is a sequencing error which is an indel, you want to be able to model it, and hence you would allow overlaps which are not perfect, including indels or substitutions. Another question: why would you not want to allow indels for Solexa and SOLiD, because of the deep coverage? So, for SOLiD and Solexa, the reason indels are typically not allowed is this: for SOLiD, because of color space, indels are relatively tricky, and the ABI people I've talked to claim that there are no indel errors in their read data. I think that may also be due to the fact that the tools they've developed for working with color space data do not allow one to find indels; the relationship between the two is somewhat murky. For Solexa, the fraction of indel errors is very, very small.
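Here is a small sketch of computing that edge weight, the length of the non-overlapping prefix, for two reads. This assumes plain directed strings, exact matches only (so no indels), and a made-up minimum-overlap parameter:

```python
def prefix_weight(a, b, min_overlap=1):
    """Find the longest suffix of read a that is a prefix of read b.
    The edge weight is the length of a's non-overlapping prefix."""
    for olen in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-olen:] == b[:olen]:
            return len(a) - olen
    return None   # no sufficient overlap, so no edge

print(prefix_weight("ACGTAC", "TACAT"))   # overlap TAC -> weight 3 (prefix "ACG")
print(prefix_weight("ACGTAC", "CATAC"))   # overlap C   -> weight 5 (prefix "ACGTA")
```

A real overlapper would allow mismatches (and, for Sanger or 454 data, indels) and would of course also consider reverse-complement overlaps.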
And as a result, because of the high coverage, you're basically better off discarding a read with such an error and dealing with everything else.

So, okay, I'm drawing this as a directed graph; in reality, this would be a bidirected graph, in exactly the same way that I showed bidirected graphs earlier. So in reality, ACGTAC, the exact same node, is also GTACGT, its reverse complement, and that will overlap with other reads in whatever way. This would be a bidirected graph in practice; I'm just making it directed for simplicity. What you then do in this graph is basically try to find things like cycles and paths, longer paths that are supported by the reads. So for example, this could be a circular walk: ACGTAC, then TACAT, then CATAC, and then back into the first one. That's a circular walk, in case you're assembling a bacterial genome. There is also something called transitively inferable overlaps. For example, ACGTAC overlaps TACAT, and TACAT overlaps CATAC; this walk will exactly spell out the direct overlap of one letter between ACGTAC and CATAC. So that direct overlap is what's called transitively inferable: it can be inferred from these two edges, from this edge and this edge. These are overlaps we'll want to remove from our graph, because they're actually redundant.

Okay, so to build a string graph, you start with an overlap graph and remove these transitively inferable overlaps. You start with something which will look potentially like this, and you remove all of these red edges because they can be spelled out by other edges. You end up with a graph which looks like this, and I apologize: there are also nodes between these, but for some reason they're not showing up. The next thing you do is collapse chains. So from here, there's this read followed by this read, by this read, by this read, and then this read: once you start walking along this path, you have to finish it, you have no way of leaving it, so all of these can be collapsed. Then all the edges get separated into two classes. There are required edges: these are edges between two nodes that had internal vertices before collapsing. So this path had internal vertices, so this is going to be a required edge; this one is going to be a required edge because there was an internal vertex right here. And this is going to be an optional edge: since it doesn't contain a read, you're actually not sure that it's part of any path. So this is basically all there is to string graphs: you build them, you find these required and optional edges, and then you output the required ones as your assembly. The way that Myers formulated string graphs initially, in 2005, the goal is to find the shortest path using all of the required edges and any of the optional ones. He had some suspicion that maybe this could be solved in polynomial time; it turned out to be NP-hard, as I mentioned earlier.

Well, once you're done with this... so, my slides are slightly out of order, I'm going to do them backwards. From paths to contigs: you have reads, and they overlap each other; this is what a path in your string graph may look like. But there could also be a repeat region, in which case you will have multiple reads in here where other reads are coming from a completely different genomic region. So this actually leads to there being an edge going in here, an edge going in here, an edge through, and then the edges split.
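A toy sketch of the transitive-reduction step, using the prefix-length weights from before: an edge is redundant if its weight equals the sum of the weights along a two-edge path between the same reads. This is a simplification of what Myers describes, not his actual algorithm:

```python
def transitive_reduction(edges):
    """edges: dict (u, v) -> weight, where the weight is the length of u's
    non-overlapping prefix. Drop any edge u->t that is explained by a pair
    of edges u->v and v->t whose weights add up to the direct edge's weight."""
    succ = {}
    for (u, v), w in edges.items():
        succ.setdefault(u, []).append((v, w))
    redundant = set()
    for (u, t), w_ut in edges.items():
        for v, w_uv in succ.get(u, []):
            for x, w_vx in succ.get(v, []):
                if x == t and v != t and w_uv + w_vx == w_ut:
                    redundant.add((u, t))
    return {e: w for e, w in edges.items() if e not in redundant}

# toy graph from the slide: A overlaps B, B overlaps C, and A also overlaps C directly
edges = {("ACGTAC", "TACAT"): 3, ("TACAT", "CATAC"): 2, ("ACGTAC", "CATAC"): 5}
print(transitive_reduction(edges))   # the direct ACGTAC -> CATAC edge is removed
```

Walking ACGTAC then TACAT then CATAC spells exactly the same string as taking the direct edge, which is why the direct edge carries no extra information.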
So this is going to be what the shape of the graph looks like. You will typically merge reads up to potential repeat boundaries; this is like collapsing those paths. But what do you do with these? Sorry, the two slides were out of order. Well, the way you can think about this is that you have edges going in, then a repeat region, and then edges going out. In order to figure out the right way of joining these together, you can use mate pair data, and this is why mate pairs are so important for assembly. If you have short reads, all of these edges will be relatively short, because you very often encounter repeats, which cut the paths short. But you can use mate pairs by looking at this. You have the red and green things, and they're somewhere nearby each other; there's some piece between them, but hopefully not too big. They should have mate pairs, right? So this read will be paired with this one, and this read will be paired with this one. And the question you may want to ask is: is there a short path between A prime and B prime? Because if there is, that implies that these two edges should be next to each other, as opposed to these two edges. So you can run a shortest-path-style algorithm to find whether A prime and B prime are close to each other, and if so, join these two edges together and, as a result, resolve the repeat. So you've got a repeat region, a unique edge going in, a unique edge going out, another unique edge going in, another going out, and you want to know which of these unique edges belong with each other. The way to do it is to look at the reads which are only in unique regions near the repeat, look at their mate pairs, which are located some distance away, and then ask: are the mates close to each other? Because if the mates are close to each other, that means these two regions are also close to each other, and hence the right path through here is like this, as opposed to like this.

Question: this only works with mate pairs, right? Exactly. This is why mate-pair or paired-end data is so important: unless you have these paired reads, there's no way to resolve this as soon as you get to any kind of repetitive area. And, oh, this is something I haven't mentioned: when biologists talk about repeats, they usually think of transposons, something like that. When computer scientists, or people who do assembly, talk about a repeat, there is a very simple definition: it's a piece of DNA, longer than your read length, that appears multiple times in the genome. A repeat is defined relative to your read length. The shorter your reads, the more repeats your genome has. If you have reads of length 25, then even short homopolymer stretches are repeats, which makes assembly very, very difficult.

Okay. And the final stage of assembly is consensus calling. You get these reads; they have errors, for example sequencing errors like indels or single-letter changes. You do a multiple alignment of these reads, and at every single position you call the most common letter. Or, alternatively, you can take the best-quality letter: all of the reads have quality values, which are roughly the likelihood that a base is a mistake. I'm actually not going to talk much about quality. Mike, are you going to talk about quality values? Yes? Great, so you'll hear more about quality values.
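A minimal sketch of that last consensus step, from reads that are already laid out in a multiple alignment (gaps as '-'); a real pipeline would weight each column by base qualities rather than take a plain vote:

```python
from collections import Counter

def consensus(aligned_reads):
    """aligned_reads: equal-length strings from a multiple alignment,
    with '-' for gaps. Call the most common non-gap character per column."""
    called = []
    for column in zip(*aligned_reads):
        counts = Counter(c for c in column if c != '-')
        if counts:
            called.append(counts.most_common(1)[0][0])
    return "".join(called)

aligned = ["ACGTTAGC",
           "ACGATAGC",
           "AC-TTAGC"]
print(consensus(aligned))   # -> ACGTTAGC
```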
But you can use the best quality value, or a consensus of some kind by weighted voting, or any other way of predicting the best letter at each position. By now there are many, many assemblers out there. One of the most popular ones, and the one you will see used most, especially for Solexa data, is Velvet. This is a comparison from the Velvet paper, which came out a year and a change ago. They claimed that they were able to do a much better job than all of the previous assemblers, and they were. Now there are other tools which do almost as well as Velvet, but Velvet still tends to be the tool that everybody uses. In bioinformatics, and I think in all of science, there's a huge conservative mindset: once you have a tool which works, you don't care about better tools which come along, for better or worse. If a new tool is 5% better, who cares? If it's 50% better, people will probably switch. So here they claim an error rate of 0.02%, 97% coverage, and the N50.

Whenever people compare assemblers, they talk about something called the N50. The N50 is the contig length such that contigs of that length or longer contain half your genome. This is the typical number which people who compare assemblers work with. You typically don't want to look at your average contig length, because that will be very, very short and not very meaningful: what you end up with is some long pieces and lots and lots of really short ones. So people will typically talk about the N50, where half the genome is in pieces of this length or more. Here it's about 8.5 kb, I think, at least. And the previous assemblers were worse by really quite a lot.

By now there is a huge set of assemblers available for next-generation sequencing. Velvet was the first one which had really spectacular results for short reads. ALLPATHS also showed very good results; on the other hand, I've never run ALLPATHS, and I've never seen anybody outside of the Broad who has run ALLPATHS, so the results reported in the paper are absolutely spectacular, but I have my own theories about that. Euler is the original Eulerian assembler based on de Bruijn graphs, out of Pevzner's group and a series of papers, and there is Euler for ultra-short reads, Euler-USR; this is beginning to sound like airplane names, but they claim results better than Velvet in that paper, and I think the paper title has something like "read length matters" in the name.
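Since the N50 comes up in every one of these comparisons, here is the computation itself, with made-up contig lengths:

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L together contain
    at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2.0
    running = 0
    for L in lengths:
        running += L
        if running >= half:
            return L

print(n50([100, 80, 60, 40, 20, 10]))   # -> 80: the 100 and 80 bp contigs cover half of 310 bp
```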
All of the tools I've talked about work with letter space, and it turns out that assembling in color space is a little bit trickier, so, as Francis promised, I'm now going to tell you a little bit more about color space. I'll explain color space to you again. This is what color space reads look like, and this is what a consensus may look like when you're trying to call consensus. Here I have reads; they start with a letter and then have a whole bunch of numbers, and if you didn't get what they meant the first time, we'll try this one more time. You could try calling consensus based on the actual numbers, but the slide got screwed up and the fonts are completely wrong; this should all be Courier so that you can actually see the columns line up. Sorry about this. I'm not sure why it got garbled in translation between the PC and the Mac; something went wrong. The printout in the binders is fine, though. And that's why we should settle on one kind of technology, either PC or Mac. It could be the PC's fault; it's Microsoft. Okay, I'll just blame Microsoft, that's a lot easier.

Ah, great question: what are these three reads at the bottom? Those are reverse complemented. Color space is tricky here: there is no reverse complement of a color read, and there is no reverse either, because you have to push the known letter through. So what I've actually done is just reverse the whole thing. This is actually great for those of you who are Unix hackers, and if you're not, you will become one by the end of my lab: there's a great command in Unix called rev which will take a string and reverse it, which is exactly what you want for color space. So you just take the string and reverse it, for reads coming from the opposite strand, which is why the known letter is at the end. So this is meant to be a multiple alignment of these reads; you call a consensus and get some string which has a letter and a whole bunch of numbers after it. And then, as Francis pointed out, since you have the first letter, you can translate it through: you apply the translation rules letter by letter.

The question you may wonder about is: what happens if you get one thing wrong? If you have a regular string of DNA letters and you screw up one letter, well, one letter is wrong; it has no effect on the next letter or the previous one. That's how consensus works: you call every single position independently. So what happens in the case of color space? Let's go back and try to explain color space again. It took me probably a month to figure out color space, so if you didn't get it the first time, that's probably fine. This is how I explain color space, in a slightly different way, because that's how I understood it. This is what a color space read looks like: a letter followed by a whole bunch of numbers. The letter is part of the linker which joins the DNA to the slide, so it's actually known; you know what the read will start with. And then you have a whole bunch of numbers. Francis tried to explain it to you using this table, which is what ABI gives you, the table you're supposed to use to figure it out. In case you haven't gotten this from the first part of my talk, I like graphs, so instead of using tables, I prefer to think of graphs. This is what's called in computer science a state automaton: the letters are states, and there are numbers on the edges between them. You're at some letter, you see a number, and that tells you where to go. So you start at the T and you see a zero; that tells you to stay at the T, because of the edges coming out of the T, the zero edge goes back to the T. Then you see a one, and a one goes to G, so the next letter should be a G. G followed by a two gives an A, and so on. This is, I think, a simpler way of thinking about color space: you're at some letter, you see a number, and that determines where you go in this little state diagram.
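Here is a minimal sketch of that little state machine in code. It uses the standard di-base encoding, in which each color is the XOR of the two-bit codes of adjacent bases; the example read is made up for illustration:

```python
BASES = "ACGT"   # A=0, C=1, G=2, T=3 in the usual 2-bit encoding

def decode(primer, colors):
    """Translate a color-space read (known primer base plus a list of colors)
    into nucleotides: each color moves you from the current base to the next,
    exactly like following edges in the little state diagram."""
    seq = []
    state = BASES.index(primer)
    for c in colors:
        state ^= c              # the color is the XOR of the two adjacent base codes
        seq.append(BASES[state])
    return "".join(seq)

print(decode("T", [0, 1, 2, 2, 1, 3, 0, 3]))   # -> TGAGTAAT
```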
Now it also becomes a lot clearer what happens if you get something wrong. What happens if, as you were reading this read, you got one color wrong? Let's say this zero is wrong, and you should have gotten a two. What will happen? We'll start at the T, follow the colors, and at that position we'll call the wrong letter. And what happens with the next one? We keep following the colors from the wrong letter, and if we compare the two strings, they are completely different after the color-space error. So if you were to call a color-space consensus by just taking the consensus of all the colors and then translating through, then if you get one color wrong anywhere, you'll get the rest of the genome wrong. So probably not a very good idea; that will not really work.

Well, actually, there are ways of getting around this problem, and it has to do with these things called translations. Let's forget about the first letter and just think about the colors. What strings of DNA could this sequence of colors represent? Four different strings, and how will they be different? By the four different possible first letters. Let's also think, in parallel, about the string with the error; since there are four possible strings depending on where you start, once you no longer know the start, the two paths will never agree. So for this string of colors there are four possible translations, starting from A, C, G or T. If you start at the T, which is where you're supposed to start because you know the first letter from the linker, you will actually be able to go through and translate correctly up until the error, and you get the right string up to there. But the string after that is going to be completely wrong, as I pointed out. But do you see anything interesting about this set of strings, compared to what we expected to see if there were no sequencing errors? One of them has the right sequence. Which one? The second one: this one is wrong at the front, but this part actually matches the true sequence exactly. So when you have a sequencing error, it's not that the rest of the read is wrong; it's still right, it's just in a different translation. You have to figure out which one of the other translations to use starting at that point. So you can compensate in some way and jump to a different translation.

How would we know when we need to jump to a different translation? You really have an additional piece of information: the first letter of all of the subsequent reads in the alignment. Because you got that letter at some point, it will tell you what the right frame should have been later on, and you can work your way back. This leads to something called a hidden Markov model, and I'm going to skip the details; if you're really interested in hidden Markov models, I'll be glad to talk to you offline. But the key thing is that you shouldn't just call colors, because colors are redundant: if you're calling consensus, a 0, for example, could be both an AA or a TT. What you should do is call both the colors and the letters simultaneously. Let's consider positions on the genome; we don't know what they are, so let's call them X, Y and Z, three positions. Well, X and Y form a pair, so that pair should correspond to a single color.
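Reusing the hypothetical decode function from the sketch above, you can see both effects at once: one miscalled color garbles every letter downstream, but the garbled tail is just the correct sequence in a different translation:

```python
colors     = [0, 1, 2, 2, 1, 3, 0, 3]
bad_colors = [0, 1, 0, 2, 1, 3, 0, 3]    # one color miscalled (2 -> 0)

right = decode("T", colors)
wrong = decode("T", bad_colors)
print(right)   # TGAGTAAT
print(wrong)   # TGGACGGC: every base after the error is different

# ...but decoding the colors after the error from each of the four possible
# starting bases shows that one of the four "translations" still matches:
tail = colors[3:]
print([decode(b, tail) for b in "ACGT"])   # one of these equals right[3:]
```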
So these are all the possible pairs of nucleotides that you could have at that position, and Y,Z is another pair. But the two pairs actually overlap by a single letter, so the middle letter has to be the same. So what we will do is allow a walk through this graph where at every single position you have a set of 16 possible nucleotide pairs, but you're only allowed to follow a transition where the second letter of one pair matches the first letter of the next. So anything which ended with a T can go to a TT, or a TA, or a TC, and so on. If you're familiar with hidden Markov models, this is a hidden Markov model; if you're not, don't worry about it.

Then you have something called emissions. You have color reads, so for every pair there is a color which would be emitted by that pair. But you also have letters: if you have a T followed by a two, you know that the next letter is going to be a C, and this tells you which of the four frames you should be in, that at this point you should really be in the C frame, unless this two was wrong. So in the hidden Markov model you have these emissions. You have color emissions, where an AA is likely to generate a zero with high probability and everything else with low probability; a TT, exact same thing: with high probability it will generate a zero and with low probability anything else. These are basically the probabilities of sequencing error. As far as the letters go, an AA is likely to generate an A, that's its second letter, and everything else with low probability, while a TT with high probability will generate a T and with low probability anything else.

What this lets you do, in the context of a hidden Markov model, is to say, for every single position, what is the likelihood of the state that you are in. So for example you could say: I'm most likely in the state AT, and hence call the letter T at that position. Furthermore, what's really nice is that it allows you to give a confidence value for every single letter. This slide is also completely garbled for a reason I do not understand, but what it shows is the output of a program which we will use tomorrow: you give it a set of color space reads, and you get the genomic sequence, but also, for every single letter, a confidence. A 9 indicates high confidence, a 0 indicates very low confidence in that letter. And at some point the program just says: I have no more confidence to call these letters. So this is the program you guys will play with tomorrow.
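Here is a toy version of that kind of HMM, with a Viterbi decoder over the 16 dinucleotide states. The error rate, the example read, and the positions that carry letter evidence are all made up, and this is only meant to illustrate the idea, not the actual program's model:

```python
import itertools

BASES = "ACGT"                             # A=0, C=1, G=2, T=3
STATES = ["".join(p) for p in itertools.product(BASES, repeat=2)]   # 16 dinucleotides
ERR = 0.05                                 # assumed (made-up) per-observation error rate

def color_of(pair):
    # di-base encoding: the color is the XOR of the two 2-bit base codes
    return BASES.index(pair[0]) ^ BASES.index(pair[1])

def emit_color(state, c):
    return 1 - ERR if color_of(state) == c else ERR / 3

def emit_letter(state, letter):
    if letter is None:                     # no independent letter evidence here
        return 1.0
    return 1 - ERR if state[1] == letter else ERR / 3

def viterbi(first_base, colors, letters):
    """Jointly decode a color read plus optional per-position letter evidence.
    States are dinucleotides; consecutive states must share their middle base."""
    prob = {s: (emit_color(s, colors[0]) * emit_letter(s, letters[0])
                if s[0] == first_base else 0.0) for s in STATES}
    backptrs = []
    for c, l in zip(colors[1:], letters[1:]):
        new, bp = {}, {}
        for s in STATES:
            # the previous state's second base must equal this state's first base
            best = max((p for p in STATES if p[1] == s[0]), key=lambda p: prob[p])
            new[s] = prob[best] * emit_color(s, c) * emit_letter(s, l)
            bp[s] = best
        prob = new
        backptrs.append(bp)
    # trace back the most likely path and spell out the bases
    state = max(STATES, key=lambda s: prob[s])
    path = [state]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    path.reverse()
    return path[0][0] + "".join(s[1] for s in path)

colors  = [0, 1, 0, 2, 1, 3, 0, 3]         # the read with the miscalled third color
letters = [None, "G", "A", None, None, "A", None, None]   # letter evidence from other reads
print(viterbi("T", colors, letters))       # -> TTGAGTAAT: the bad color gets corrected
```

With enough letter evidence from overlapping reads, the most likely path pays one penalty for the miscalled color instead of mangling everything downstream, which is exactly the behaviour described above.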
So, just to recap the key points of what I've been trying to tell you over the course of the past hour. For different types of data you need different assembly approaches. For shorter reads, de Bruijn graphs actually capture most of the information in the read data; Velvet and other similar assemblers will do a great job with short reads. Velvet is a de Bruijn graph based assembler, and so is Euler-USR. For longer reads, string-graph-like approaches are better; the tool we'll use tomorrow, for example, actually uses an overlap, string-graph-based approach, not a de Bruijn based one. For larger-scale assembly, when you're trying to really do the genome from scratch, read length is really key, and in the absence of read length you need mate pairs. And because read length is really key, there are many efforts to combine various types of data. 454 reads are now the long reads; they are almost 500 letters now, which is where Sanger was 10 years ago. So there are many efforts to combine 454 data with either SOLiD or Solexa data, mainly Solexa, because color space is trickier and people are just beginning to figure it out on the computational side. But at the same time, color space is tricky but can be handled with the right computational tools. Basically, if you're a biologist, the goal of the computer scientists in the room and of the bioinformaticians in the room should be to hide color space away from you by making everything look like letters, and the key to this is the right computational approaches. Okay, so thank you, and I'll be glad to take any more questions.