So hopefully there won't be any slide screw-ups; hopefully that's limited to my previous stuff. I'll be talking about genetic variation discovery. To lay it out: I'll introduce the topic briefly, but then I'm going to walk through pretty much the entire pipeline leading up to SNP discovery. The reason I do that is that this is very much a sequential process, and if you screw up at any point before the actual SNP calling, you're going to get very dubious results. After SNP discovery, I'll talk a little about visualization, and then about some of the work I've been doing in the 1000 Genomes Project.

Now, why would you study genetic variation? Well, first of all, when you get a big data set — and I heard a couple of you mention this this morning when you were introducing yourselves: "we just got a whole bunch of data, and we've aligned it, now what?" — this is the killer application for all that data lying around. On one hand, you can use genetic variation discovery to look for inherited diseases if you work with many samples. You can also use it to study phenotypic differences. And you can use a coalescent model to try to ascertain ancestral history.

At the core of it, when I talk about genetic variation I usually mean SNP discovery and also what are called indels, which are insertion and deletion events. Looking at the output of two visualization programs — this one is Consed — here you see a bunch of T alleles where there's supposed to be an A allele in the reference; that's a typical SNP. And over here there's an insertion: a double-A allele has been inserted relative to the reference. But it's not only SNPs and indels that we care about. We're also interested in structural variation, and Michael Brudno is going to discuss some of that tomorrow. And then there are other aspects such as epigenetic variation — earlier today we talked about bisulfite-seq, and ChIP-seq is a very popular thing right now.

So let's start off by thinking about how we can interpret the data. We have an Illumina sequencer here, and it's somehow interpreting the images and producing the actual base calls, and perhaps even base qualities. So the first question is: what is a base quality, and why is it interesting? A base quality describes the probability that any particular base call is erroneous, and we use a log scale to express it. At base quality 10, there's a 10% likelihood that the base is wrong; at 20, 1%; at 30, 0.1%; et cetera. Back in the Sanger capillary days we said any base with a quality of 20 or higher was pretty good. These days we're a little more demanding, and we like to see qualities of about 27 and above. Keep this in mind when you hear base qualities mentioned later in the lecture.

One of the other things we discussed earlier today was the error rates of these technologies. The first question you might have is how you even go about figuring out what the error rate is. Usually it involves aligning a data set against the reference, looking at where you have insertions, deletions, and substitutions, and counting those as errors. But there's a whole score of caveats involved.
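(Before the caveats, a quick aside to make the Phred scale above concrete. This is a minimal sketch in Python — my own illustration, not something from the talk — of the relationship Q = -10 * log10(P).)

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability the base call is wrong."""
    return 10 ** (-q / 10.0)

def error_prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10.0 * math.log10(p)

# Q10 -> 10% error, Q20 -> 1%, Q30 -> 0.1%
for q in (10, 20, 27, 30):
    print(f"Q{q}: {phred_to_error_prob(q):.4%} chance the base is wrong")
```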
First of all, your alignment might be paralogous, so you count something as an error when it actually belongs somewhere else. You might have a local misalignment — this C probably belongs over there, but given the aligner's scoring mechanism, this gets reported as the optimal alignment. And then you have polymorphic test data. If you have a photographic memory and remember the table I just showed you: a SNP rate of 1 in 1,000 bases is the equivalent of base quality 30. That means that when you try to ascertain the error rate of a polymorphic data set, the best apparent error rate you can ever measure is base quality 30. It would be impossible to conclude "this is actually base quality 40" — you don't have that much power. And then the question is: if you're aligning all these reads, what about all the reads that didn't align? (Very unhappy kid there.) That horribly skews how you think of the error rate. If you use very stringent alignment criteria so that only the best reads align, you'll conclude, "wow, we have a 0.1% error rate in this sequencing technology," whereas you probably have 4%.

So those are the caveats; now some results. If we look at 454, the overall error rate is pretty small — less than 0.5%. Given the chemistry, it's technically almost impossible to get a substitution error, though they do occur for a number of reasons. What you do get are homopolymer problems: instead of calling two As in a row, you call three, or one. So in 454 you're much more likely to see insertion and deletion errors than substitution errors. Any questions so far? "Why insertions and deletions in 454?" Oh, because of the homopolymer issue: say a region really had two As, but in one read you got four. When you align that read, it registers as an indel error. "How do you know the error is in the read and not in the reference?" That's a good point. Ultimately, if you have the resources, what you really should do is take a synthesized oligo or a known BAC and do thorough testing to see what the error rate is. If you're doing it quick and dirty, this is the way a lot of people do it, and in the end these statistics don't differ that much — the actual proportions might shift a couple of percent — but for evaluating a normal data set it's pretty solid. The way I interpreted your question, you're almost jumping ahead to the actual SNP discovery phase — if you have a false positive, what was it attributed to? There's a whole bunch of aspects to that; we'll get back to it later, and if we don't, please remind me.

If we look at Illumina, the overall error rate is slightly higher — we mentioned earlier it's about 1% — but here you see almost the exact opposite: about 95% of all the errors are substitutions.
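(Returning to the polymorphic-reference caveat above: a back-of-the-envelope sketch, my own and not from the talk, of why a mismatch rate measured against a reference that differs at 1 in 1,000 sites can never look better than about Q30.)

```python
import math

def apparent_phred(true_error_rate, polymorphism_rate=1e-3):
    """Mismatch rate measured against a reference mixes real sequencing
    errors with true polymorphisms, capping the apparent quality."""
    observed = true_error_rate + polymorphism_rate
    return -10.0 * math.log10(observed)

# Even a perfect Q40 instrument (0.01% errors) looks like roughly Q30
# against a reference that differs at 1 in 1,000 sites.
for q in (20, 30, 40):
    true_rate = 10 ** (-q / 10.0)
    print(f"true Q{q} -> apparent Q{apparent_phred(true_rate):.1f}")
```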
So when you asked Michael Brudno why people usually just ignore indel-type errors in SOLiD and Illumina — the cynical answer was: because that's what their aligners handle. But one of the more factual answers is that there just aren't many indel errors in the actual sequencing chemistry. If we look at the error profile of 36 bp Illumina reads, the majority of reads — about 80% — have zero errors, and it falls off from there. But note these are only the mapped reads; remember that really sad kid on the previous slide. It's also interesting — and we haven't gotten a really good answer from Illumina about this — that in paired-end reads the second mate usually has a slightly higher error rate. You can see that in this graph; nobody's really been able to explain it, but it's a consistent observation. Another thing about Illumina: if we compare different lanes belonging to the same run, the variance between lanes doesn't differ that much. You get most of your variation between runs.

Now, I got a question today about calibrating these base qualities. I have "original" in quotation marks here because this is from the 1000 Genomes Project: early on, they tried to recalibrate all the base qualities for Illumina, SOLiD, and 454, so that when you plot the reported quality score against the actually measured quality score, everything lines up on the diagonal if it's perfect. This pre-calibrated data is far from that. Mark DePristo at the Broad Institute has been working a lot lately on a logistic regression model that uses sequence context to make this much nicer. There's a curious little aberration there, but all in all, you don't get any bias anymore. What's interesting is that this is nothing you need to do beforehand: after you've aligned your data set, you can just run the regression model and it will fix all the base qualities before you run your SNP discovery. I'm pretty excited about this — I haven't used it myself, but I thought I'd mention it.

Well, my core research topic is read alignment, so that will occupy most of the talk — it's the part nearest to my heart, and it probably has the most impact on your SNP calls. (And you're using a rather old version of MOSAIK in the lab materials — we should talk about that. This slide was supposed to be over here; the slides got a bit scrambled.)

We already talked a little earlier today about de novo assembly, so what I'll be talking about here is reference-guided assembly. All these nasty colors here represent a set of reads that you want to align to a reference. What most aligners do is pairwise alignment: each read is independently associated with the reference, and already from that you can roughly figure out where the SNPs, insertions, and deletions might be. But what we typically do with MOSAIK is go one step further and create a multiple sequence alignment. That way you can use the statistical power of high coverage to decide more convincingly: is this actually a deletion, or a SNP? And then we can use our visualization tools to look at these.
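(On the recalibration point above: a minimal sketch, my own and much simpler than the sequence-context regression mentioned in the talk, of empirical base-quality recalibration after alignment.)

```python
import math
from collections import defaultdict

def recalibrate(observations):
    """observations: iterable of (reported_quality, is_mismatch) pairs
    taken from reads aligned to the reference, away from known SNP sites.
    Returns a table mapping reported quality -> empirical quality."""
    counts = defaultdict(lambda: [0, 0])  # q -> [mismatches, total]
    for q, is_mismatch in observations:
        counts[q][0] += int(is_mismatch)
        counts[q][1] += 1
    table = {}
    for q, (mm, total) in counts.items():
        rate = (mm + 1) / (total + 2)  # small pseudocount to avoid log(0)
        table[q] = -10.0 * math.log10(rate)
    return table

# Bases reported as Q30 that actually mismatch 1% of the time
# get recalibrated down toward Q20.
obs = [(30, i < 10) for i in range(1000)]
print(recalibrate(obs)[30])  # ~19.6
```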
The other thing that sets MOSAIK apart from many other aligners is that we try to support all the existing sequencing technologies. In addition to old-style Sanger capillary reads, we support Illumina, 454, SOLiD, and Helicos; and when they finally release a data set to us, we'll also support PacBio.

This shows the basic structure of MOSAIK. There's a program up here called MosaikBuild. MosaikBuild simply takes your reads from various formats — FASTA, FASTQ, the short read format (SRF) that Asim has been instrumental in putting together (if you're curious about that, you can ask him tomorrow), and also the native Illumina Bustard and Gerald formats. Basically, it makes it so MOSAIK can use these formats no matter where they came from. The actual aligner then hashes the reads up — like Michael Brudno was talking about earlier, hashing things into different k-mers — clusters those together, and finally applies a very accurate Smith-Waterman algorithm, producing an aligned-read archive. The assembler then produces a multiple sequence alignment and creates an assembly file: we can create ace files, and today we'll be creating files in the GigaBayes format for our SNP caller.

Yes — exactly. He asked what everyone asks when they see this slide: "Good god, man, is this guy crazy? Why is he doing Smith-Waterman? Isn't that insanely slow?" Yes, it's a very computationally expensive algorithm. However, with the exception of ELAND, we are the fastest aligner out there. So there's a difference between something being a theoretically hard problem — I wouldn't say NP-hard, because it isn't — and being able to do it in practice. For mammalian-sized genomes on an 8-core computer, we usually align about 100,000 reads per second. We'll be going slightly slower today during the lab. But it's a good question — and there are limitations: our big problem is that we use a lot of RAM right now, and we're trying to bring that down.

When you have crazy advisors, sometimes they come to you with a pile of data and say: please, please, analyze this. That happened to me last fall. My advisor came to me with six billion 454 and Illumina reads and said, can you please align these and look for SNPs? So I put together the pipeline, and using MOSAIK and GigaBayes it took nine days to complete on a nine-node cluster. Most of the time you see here is the actual alignment. We also sort the reads by position, which makes life easier for the multiple sequence alignment generator and also makes things a lot nicer when we're doing the SNP calling. And each of those nodes was running multiple processes.

So ultimately, how does it work? As I mentioned, we have these little k-mers. In this example I'm showing a k-mer size of four, which is ridiculously small — we use much larger than that. Obviously, if you were going to align this read to the reference, you would look up all of the read's k-mers and trust the positions they point to. However, just like Michael Brudno said, when you use small k-mers, a lot of things start looking repetitive.
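(A minimal sketch of that hash-and-cluster idea — my own illustration, not MOSAIK's actual implementation: index every k-mer position in the reference, look up each k-mer of the read, and group hits whose diagonal, i.e. reference position minus read offset, agrees.)

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(read, index, k):
    """Cluster k-mer hits by diagonal; each diagonal is a candidate alignment
    start, scored by how many of the read's k-mers support it."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            votes[ref_pos - offset] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])

ref = "ACGTTTCTAAGGTTCTACGT"
idx = build_kmer_index(ref, 4)
print(candidate_positions("AAGGTTCT", idx, 4))
# Best-supported diagonal first; the repeated TTCT also votes for a
# second, weaker diagonal. A full aligner would then run banded
# Smith-Waterman around each candidate region.
```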
So this little TTCT here also matches that TTCT, and even in this toy example, this is what it looks like in the end. If you take this to genomic scale — 3 billion bases — even with a much larger hash size, this is the kind of phenomenon you see. So we have a position-aware clustering algorithm that tries to figure out which of these placements is the most likely.

When it comes to placing reads, there are different schools of thought — I was going to portray them as religious positions, like the hashing-versus-de-Bruijn-graph debate, but it's more subtle than that. What do you do with reads that can go to multiple places? One option is to align only the unique reads: if something is non-unique, you just throw it away. Another is to align it everywhere it can go. And a third is to just pick a random position. I'm not a keen subscriber to that last one, so we support the first two. The random placement is what's used, for example, in the MAQ aligner: it picks a random location.

We also do a lot of platform-specific things; I'll mention two of them. One: when you're aligning 454 reads, a lot of gaps are going to open up, but you don't want to penalize a gap that occurs in a homopolymer as much as you'd penalize one that appears elsewhere. So in this case, we don't penalize this gap as much as gap number one. "What hash size are you using?" In our lab, we've standardized on a k-mer size of 15; that's what we use for most things, since most of our work is on mammalian-sized genomes, and 15 is basically a compromise that improves the speed. For bacterial- or yeast-sized genomes, we'd probably go down to about nine. It's really a sensitivity-versus-performance exercise.

Then for SOLiD: as I mentioned, a lot of aligners, when they first tried to support it, just converted all the 0s, 1s, 2s, and 3s in color space to As, Cs, Gs, and Ts, and said: cool, we're aligning in color space. What they forgot is what we've already mentioned today — in color space you no longer take the reverse complement to figure out what's on the reverse strand; you simply reverse the sequence. So in the early days of color space, some aligners got every single reverse-strand alignment wrong. It's important to keep these things in mind, even though they seem trivial.

One of the other benefits of MOSAIK — and also of the program developed in Michael Brudno's lab — is that a Smith-Waterman-type algorithm is very handy for short indel detection. In the C. elegans paper we collaborated on, we identified 216 putative indels with a very high validation rate, 89%. That's almost unheard of: if you come from the Sanger capillary days, the normal figure would have been around 30% of your indels validating. So that was pretty impressive.
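(Back on the color-space point: a tiny sketch, my own and assuming the standard two-base color encoding, of why the reverse strand is a plain reversal in color space but a reverse complement in base space.)

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_strand_basespace(seq):
    """Base space: the reverse strand is the reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]

def reverse_strand_colorspace(colors):
    """Color space: each color encodes a *transition* between adjacent
    bases, and transitions are symmetric under complementation, so the
    reverse strand is just the reversed color string -- no complementing."""
    return colors[::-1]

print(reverse_strand_basespace("AACGT"))  # ACGTT -- reverse complement
print(reverse_strand_colorspace("0123"))  # 3210  -- plain reversal
```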
Another question that was asked today is: how about combining different technologies? That's something we do particularly well. This was a project on a biofuel-friendly yeast called Pichia stipitis, where we were trying to evaluate different sequencing technologies and see how well we could figure out where one point mutation was. We had reads from Sanger capillary — those were actually our SNP validation reads — 454 FLX reads, the older 454 GS20 reads, and Illumina reads, all together in the same assembly. And just like Michael Brudno said, you can use the different strengths of these technologies to produce a more coherent data set when you're looking for SNPs.

"Do you get rendering problems or more lag as the number of reads scales up?" No, no — when it comes to speed, the only limiting factor is the length of your genome. If you tried to align against the lily genome, it would probably just die, but we can handle mammalian-sized genomes pretty easily. That's why aligning six billion reads didn't take that long.

I skipped over this last slide because I think we've killed the topic of what paired-end reads are, and it was explained better than I would have done. So how do we resolve paired-end reads? What MOSAIK does is align both ends separately and then analyze how they can be arranged. First, you can take all your uniquely mapped pairs — where both ends align uniquely — and use those to build an empirical distribution of your fragment lengths. With a lot of aligners you have to declare, say, "I used a 200 base pair fragment library." That might have been what you intended, but how you actually excised the gel, and how it actually turned out, might be slightly different. That's why we deduce the distribution empirically from the data set you have. Then you just check whether each unique pair conforms to the confidence interval you've created. You can use the same technique when one end is unique but the other is not: if you find exactly one placement of the non-unique end that fits the criteria, you can probably use it; if you find several, you're still uncertain. With the same methodology you can also look at pairs that are non-unique on both ends, but that's a bit dodgier — you'll usually have a higher error rate there. But it's something our software does, which is cool.

We also have a bunch of peripheral programs that we'll talk more about in the lab. One easily produces coverage diagrams, which is great for ChIP-seq-type experiments. You can also use it to find out what's happening over here: the coverage dips in these two regions. Did you notice how big these gaps are? Each of these ticks is about 100 bases, so about 300 base pairs — what do you think that might be? Yes, it very much looks like an Alu. I plotted only the unique contributions here — what you see is the coverage from uniquely mapped reads — so you get dips wherever something is hyper-repetitive. It's a very interesting way of identifying Alus. And I think that may be what the people behind MAQ tried to address by aligning randomly: if you can't figure out exactly where a read goes, placing it randomly will at least normalize the coverage.
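(On the paired-end rescue described above: a minimal sketch, my own with made-up helper names, of estimating the fragment-length interval from unique-unique pairs and rescuing an ambiguous mate only when exactly one placement fits.)

```python
import statistics

def fragment_interval(unique_pairs, z=3.0):
    """unique_pairs: list of (pos1, pos2) for pairs where both ends map
    uniquely. Returns an empirical mean +/- z*sd acceptance interval."""
    lengths = [abs(p2 - p1) for p1, p2 in unique_pairs]
    mu, sd = statistics.mean(lengths), statistics.stdev(lengths)
    return mu - z * sd, mu + z * sd

def rescue_mate(anchor_pos, candidate_positions, interval):
    """Keep a non-unique mate only if exactly one candidate placement
    yields a fragment length inside the interval; otherwise give up."""
    lo, hi = interval
    hits = [p for p in candidate_positions
            if lo <= abs(p - anchor_pos) <= hi]
    return hits[0] if len(hits) == 1 else None

pairs = [(0, 210), (1000, 1195), (5000, 5205), (9000, 9190)]
iv = fragment_interval(pairs)
print(rescue_mate(20000, [18000, 20195, 60000], iv))  # -> 20195
```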
To me, that coverage dip is actually far more desirable than the opposite effect. Some aligners aren't very tolerant of repetitive elements, and what you'll see there is huge peaks wherever something is repetitive — and that just throws everything out the door when you're trying to do SNP or indel discovery. "Neither behavior seems absolutely desirable; I'd like to have some real information there." That's true. But you can use the mode where you align everywhere possible — yes — and one thing you can do then is divide each read's contribution by the number of places it aligns. We actually have that. We'll also get to alignment qualities in a couple of slides.

We also have a little conversion program, because we realize not everybody wants to work in the MOSAIK world; people want to use other programs in their pipelines. MosaikText right now converts into the BED format you've heard about, and into a format supported by BLAT that we really like, called AXT. And then there's a new format from within the 1000 Genomes Project that we're hoping spreads further: the SAM format, which is intended as a universal alignment format, along with its binary equivalent, BAM. "Do you filter at this stage?" This is way before we even bother looking for SNPs, so no, we don't have that right now — but we do have it in our visualization tools. "What's your sense of SAM?" No, it seems like a decent attempt. That format was created within the 1000 Genomes Project because we were all going crazy: everybody was using different aligners, and the people doing structural variation research, or SNP and indel calling, had to write a new parser for every single aligner out there. It was just a way to manage stress levels. It works; I don't think it's the most efficient. Ask my advisor — he loves it to death. It's a different take on things, but yes, we support it.

Now, for the love of ambiguity — here's something where I think we're kind of unique. When you align reads taken from one individual against the current reference genome, you're going to get a bias against whatever SNPs that individual carries relative to that reference. What I mean is: if the individual is homozygous non-reference at some site, every read covering it will show up as a mismatch at that base. And that matters because, whatever aligner you use, you usually place some threshold — allow two mismatches, or four, or some percentage of mismatches. So what we did is support the IUPAC ambiguity codes. You can take a reference sequence that has been masked with ambiguity codes at all the dbSNP and HapMap 3 locations, and when you align a read, if that read carries one of the two or three alleles denoted by the code, it won't incur a mismatch penalty. We've supported this for a while, but we haven't tested it that thoroughly — it's something I'm hoping to look at soon.
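(A minimal sketch of that ambiguity-aware matching — my own illustration: count mismatches against a reference that may contain IUPAC codes, treating a read base as a match if it's any allele the code denotes.)

```python
# Subset of the IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}

def count_mismatches(read, ref):
    """A read base matches an ambiguity code if it is one of the alleles
    that code denotes, so known SNP alleles incur no mismatch penalty."""
    return sum(base not in IUPAC[r] for base, r in zip(read, ref))

# Site 4 is a known A/G SNP, masked as 'R' in the reference.
print(count_mismatches("ACGGT", "ACGAT"))  # 1 -- plain reference
print(count_mismatches("ACGGT", "ACGRT"))  # 0 -- masked reference
```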
Here's a horribly skewed feature chart. It does show, though, that there's wide variation in the design of the different aligners. As Mike and I pointed out before lunch, there are probably 30 aligners out there now, so this chart is kind of dated — it's from about a year ago. But as it shows, we support a lot of sequencing platforms, and together with SHRiMP we use Smith-Waterman, which lets us do gapped alignments. By now, all the aligners listed here support paired-end reads. And being a computer scientist, I don't mind when people download things and compile the code themselves, but a lot of biologists don't know how to compile code, so we put a lot of platform binaries out there. One of the jokes we made about a year ago was building a binary for the iPhone — for a small, BAC-sized data set, you could align as many Illumina reads as you wanted. It was a gimmick, but one sign of how robust your code is, is whether it's platform independent: can you easily get it to work on a different platform? These days in bioinformatics it's all too easy for somebody to use GCC on Linux, expect the code to work everywhere, and then figure out they used something platform-specific. "Have you compared against Novoalign?" No, I haven't — I just have some personal experience there, no comparisons. The guy who wrote Novoalign was originally at — I'll get back to you during the break.

Another question is how well your aligner can classify reads as unique or non-unique. This turns out to be a pretty big issue later on. We do pretty well compared to ELAND. With MAQ you can't actually evaluate this, because it randomly assigns reads and doesn't tell you; you can kind of use its mapping quality as a proxy, but that's a mess.

As far as actual accuracy goes — given a simulated data set, how accurately can you place the reads? The reads containing SNPs are marked in green, and for the most part, on those two data sets, the aligners all do fairly similarly. Here's where the gapped alignment approach really helps: on the reads that have indels in them, MOSAIK clearly outperforms ELAND and MAQ. SOAP actually did surprisingly well here — I don't know if it was a fluke of the data set, since I would have expected it down there as well — but it held up for the short reads, which impressed me.

And this is what I was talking about with alignment qualities. We've had alignment qualities in MOSAIK for roughly a year, but they need some serious tweaking, so for the last couple of weeks I've been looking at this. What does this mountain represent? We found that, at least as far as MOSAIK is concerned, two different criteria seem to determine how confidently an alignment is placed. One is information content — basically an application of Shannon's entropy. The other is the sum of the mismatched base qualities divided by the total sum of the base qualities. Plot those two together and you get this very well-behaved distribution. There's a curious little aberration there, but we're now using this model to investigate how things look with different read lengths and reference lengths, to see whether those matter too. You can think about it this way: if your reference sequence is 35 base pairs long and you have a read that's exactly those 35 base pairs, I would hope you'd place it correctly. However, if the reference is 3 billion bases long, those odds are no longer as high.
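(A minimal sketch of those two ingredients — my own formulation, not MOSAIK's exact one.)

```python
import math
from collections import Counter

def shannon_entropy(read):
    """Information content of the read sequence, in bits per base.
    Low-complexity reads (poly-A runs, simple repeats) score low and
    are harder to place confidently."""
    counts = Counter(read)
    n = len(read)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mismatch_quality_fraction(read, ref, quals):
    """Sum of base qualities at mismatched positions over the total
    base-quality sum; high-quality mismatches weigh more heavily."""
    total = sum(quals)
    mm = sum(q for b, r, q in zip(read, ref, quals) if b != r)
    return mm / total if total else 0.0

read, ref = "ACGTACGT", "ACGAACGT"
quals = [30, 30, 30, 10, 30, 30, 30, 30]
print(shannon_entropy(read))                        # 2.0 bits: high complexity
print(mismatch_quality_fraction(read, ref, quals))  # small: the one mismatch is low quality
```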
Well, this is my supervisor here — he's very happy. And this is Reverend Bayes, responsible for the Bayesian family of algorithms. Being a big fan of Thomas Bayes, my supervisor has named all his programs after Bayes, and the current incarnation is called GigaBayes.

To reiterate: GigaBayes concentrates on finding both SNPs and insertion/deletion polymorphisms. I've marked all the mismatches here in red, and the goal is to be able to sort out normal sequencing errors from actual SNPs. One of the major ways of doing that is by looking very closely at the base qualities, in such a way that a high-base-quality mismatch is more likely to be called a SNP than a low-quality one. Here, both the reference and the read have very low quality, so it's up in the air whether this is really a SNP; whereas you're much more confident here, because this base has only a 0.1% chance of being wrong.

This is the actual equation, and at the end of the lab I expect all of you to have memorized it. In actuality, it's not that complex: you take into account the number of individuals you're trying to genotype and call SNPs in, and there are dependencies on the base qualities and the actual allele calls in the reads. As soon as you start looking at SNP calling, you see there are a lot of caveats involved — for example, calling SNPs in a haploid data set versus a diploid data set. This is very relevant to the 1000 Genomes Project, because you tend to use one program to call all the SNPs, but not many people think to consider that if a sample is male, SNP calling on the X and Y chromosomes should be treated as haploid. There's also an onus on the SNP callers to produce genotype calls, and that's obviously different for the haploid and diploid cases.

One of my main projects in the 1000 Genomes arena is handling trios of individuals. In computational biology there are often new challenges here and there that make your life harder, but trios actually make things a lot nicer, because you have a degree of redundancy within the three individuals you're looking at. That gives you a lot more power to distinguish sequencing-type errors and alignment-type errors from actual polymorphisms. The one thing that counteracts that is that there's always some probability of a de novo mutation in the trio, so you need to take that into account as well.

The easy part of the 1000 Genomes Project has been aligning things and making the SNP calls. What they're facing now is: how do we know these are right? For any particular individual, they're calling around four million SNPs, and it's unlikely you're going to validate more than a small fraction of those with any assay.
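(Here's a toy sketch of the Bayesian idea for a single diploid individual — my own simplification, not the actual GigaBayes model: turn base qualities into per-observation error probabilities, combine them into genotype likelihoods, and weigh by a simple prior.)

```python
import math
from itertools import combinations_with_replacement

def genotype_posteriors(pileup, theta=1e-3):
    """pileup: list of (base, phred_quality) observed at one site.
    Returns P(genotype | data) for all diploid genotypes, assuming a
    crude prior that heterozygous sites occur at rate theta."""
    genotypes = list(combinations_with_replacement("ACGT", 2))
    posts = {}
    for g in genotypes:
        prior = theta if g[0] != g[1] else (1 - theta) / 4
        loglik = 0.0
        for base, q in pileup:
            err = 10 ** (-q / 10.0)
            # P(observed base | true allele), averaged over the two alleles
            p = sum((1 - err) if base == a else err / 3 for a in g) / 2
            loglik += math.log(p)
        posts[g] = math.exp(loglik) * prior
    z = sum(posts.values())
    return {g: p / z for g, p in posts.items()}

# Six reads: four high-quality Ts and two high-quality As -> likely A/T het.
site = [("T", 30)] * 4 + [("A", 30)] * 2
best = max(genotype_posteriors(site).items(), key=lambda kv: kv[1])
print(best)
```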
One of the big things they look at is coverage. In this plot, the red denotes the coverage at HapMap sites in our data set, whereas the blue shows all called sites. What you see is that if you place some sort of threshold based on the coverage at the HapMap sites, you might be able to reduce your false positives quite drastically. There are caveats: the HapMap set consists of fairly high-frequency SNPs, and one of the goals of the 1000 Genomes Project is to go down to 1% and even 0.1% allele frequency, where this may no longer be appropriate.

Also, if you have too many SNPs in one area, that's indicative of some sort of alignment screw-up. So one of the things I plotted here is which SNP calls were in dbSNP, which were in the call set from the Sanger team we collaborate with, and which were in neither. Basically, from about 15 on down — if the distance between two SNPs is less than about 15 bases — a call was more likely to be absent from both of those data sets, and therefore perhaps an error. So that's one way to set a threshold: say, ignore everything where two SNPs are closer together than about 15 base pairs.

Then, from population genetics, you can also look at Hardy-Weinberg equilibrium. If we look at the test statistic for segregating sites at the HapMap locations, we see a fairly even distribution. But if we compare that to all the called sites, these regions here look fairly erroneous, so maybe you could create a filter to screen those out. There are other metrics as well. A lot of the SNP callers produce a probability that a site is polymorphic, and you can place a threshold there. And our friends at the Wellcome Trust Sanger Institute actively tune their SNP calls so that the transition-to-transversion ratio is as close to two as possible. "Why two?" Well, that's supposed to be the genome-wide optimum, and if you look at HapMap sites, you get a transition-to-transversion ratio of almost exactly two. Where things get fuzzy is whether there are hotspots: for example, if you do exon capture and only look at genic regions, do you still expect the same transition-to-transversion ratio, or will it be different? That's something that hasn't been clarified yet.
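(A minimal sketch of a Hardy-Weinberg filter — my own illustration: compare observed genotype counts at a site against the expected p², 2pq, q² proportions with a chi-square statistic, and flag sites that deviate badly.)

```python
def hardy_weinberg_chi2(n_ref_hom, n_het, n_alt_hom):
    """Chi-square statistic (1 df) for departure from Hardy-Weinberg
    proportions; grossly inflated values often indicate alignment
    artifacts such as collapsed paralogs (excess heterozygotes)."""
    n = n_ref_hom + n_het + n_alt_hom
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ref_hom, n_het, n_alt_hom)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

print(hardy_weinberg_chi2(36, 48, 16))  # ~0: consistent with HWE
print(hardy_weinberg_chi2(10, 90, 0))   # large: suspicious excess of hets
```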
So this happy guy is Derek; he's in our lab. If you haven't caught on by now, one of the popular pastimes in our lab is coming up with weird names and attaching logos to them, and he called his visualization tool Gambit. Before Gambit, we pretty much used a trusty old favorite, Consed, which was popular in the Sanger capillary era. But it has its downsides: with a very large data set, it requires an enormous amount of memory and takes forever to load. With the trio data, just to look at chromosome 1, I had to wait an hour and a half for it to load everything — and that was on a fast computer. So now we have Gambit. You'll also notice that in Consed you pretty much have one row per read, whereas most of the modern visualizers compact everything, so the display more closely approximates a coverage diagram. We're using Gambit for data validation and to generate new biological hypotheses; mostly, for us, it's a software development aid — to see, okay, how are my current alignments looking, do I have any artifacts? It uses the BAM support I mentioned.

But what's going to be interesting about Gambit is that it will use Firefox-style plugins. What on earth do I mean by that? One thing we've noticed when we work with other labs, even in our own department, is that next-gen sequencing is still far removed from what the normal molecular biology lab can cope with. If you do some mutational profiling, can you expect a normal molecular biology lab to actually analyze that? Increasingly, the answer is no. So we're trying to make applications that sort that out: the idea is plugins that help do these analyses. For example, one of my colleagues in the lab does a lot of transcriptomics, so one of the plugins lets you look at your technical replicates and your biological replicates and produce graphs from them. In that way, we hope to extend it more and more.

The last part I'll talk about is the 1000 Genomes Project. You've probably heard bits here and there about what it involves and what the goals are. The major goals are to discover genetic variation down to the 1% allele frequency level across the entire genome — and down to 0.1% in the gene regions — and, with the variants discovered, to estimate allele frequencies, identify the haplotype background, and characterize LD. This slide has been updated a little from what's in your printout. It shows the three major pilot projects within the 1000 Genomes Project. Pilot 1 is low coverage: many samples at between 2 and 4x, with 2.7 terabases of data, most of it Illumina, the rest SOLiD and some 454. The project I've been most involved with is pilot 2, the deep-coverage trios: we have a European trio and a Yoruban trio, with 1.1 terabases of data. Pilot 3 is a relative newcomer to the project that involves exon capture. It has an outstanding number of samples — 607 right now — but the target area has been reduced: a total of about 2.2 megabases of target regions distributed over about 8,800 targets, with an average coverage per target of about 10 to 20x.

In my pilot 2 studies, one of the things we do — and one of the things you'll do after the break — is compare the SNP calls we make to HapMap and dbSNP. They answer two different questions. dbSNP contains pretty much everything under the sun — some good SNPs and some bad SNPs — but the idea is that if a lot of your SNP calls are not already in dbSNP, that might be a sign you have a lot of false positives; it's basically a proxy for your false positive rate. HapMap 3, on the other hand, has far fewer sites, but they're actually genotyped in the very samples we're looking at, so whatever you miss there is a proxy for false negatives. Here you can see I missed 4.4% of the HapMap 3 SNPs on chromosome 1, which suggests we have something like a 4-5% false negative rate. And 20% of our calls are not in dbSNP; some proportion of those will be true SNPs that just aren't in dbSNP, but some will be false positives. That's what we'll be looking at this afternoon.
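(A minimal sketch of those two proxies — my own illustration on toy site sets.)

```python
def concordance_proxies(calls, dbsnp, hapmap_genotyped):
    """calls, dbsnp, hapmap_genotyped: sets of variant sites.
    Novel fraction vs dbSNP ~ false positive proxy;
    missed fraction of HapMap sites ~ false negative proxy."""
    novel = len(calls - dbsnp) / len(calls)
    missed = len(hapmap_genotyped - calls) / len(hapmap_genotyped)
    return novel, missed

calls = {("chr1", p) for p in range(0, 1000, 5)}              # 200 calls
dbsnp = {("chr1", p) for p in range(0, 1000, 5) if p % 25}    # most already known
hapmap = {("chr1", p) for p in range(0, 1000, 50)}            # genotyped subset
fp_proxy, fn_proxy = concordance_proxies(calls, dbsnp, hapmap)
print(f"novel vs dbSNP: {fp_proxy:.1%}, missed HapMap: {fn_proxy:.1%}")
```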
Now, if we look at the concordance between dbSNP, our calls, and the Sanger Institute's calls, we see the majority are shared among all three data sets. But then you see things like 6,000 SNPs that were not in the Sanger set but were in ours and in dbSNP, and 15K that were in their set but not in ours. You can have a lot of fun with Venn diagrams.

Another thing I've worked a lot on is making indel calls. This is kind of a weird way of showing it, but when they validated these insertion and deletion events, they categorized them into four categories: homopolymer runs longer than four bases, homopolymer runs longer than two bases, and simple one-base-pair insertions and deletions. I basically summed it all up, so on this scale, a perfect indel caller performing at 100% in every category would score 4.0 — 1.0 per category. The cold reality is that we're at about 1.7: we're not even able to validate 50% of the indels in this huge project. So there's a major effort at the moment to spruce up the indel callers and make them more accurate — that's one of the big competitions in the 1000 Genomes Project right now.

So what have we learned today? First of all: garbage in, garbage out. If you make a mistake already at the base calling stage, it's going to affect your alignments, and then it's going to affect your SNP calling. So it's essential to do everything as cleanly as possible, and to use the highest-accuracy tools available. Even if the difference between two tools is only about 5%, when everything counts, that 5% can make the difference between good and bad calls — and because validation is expensive, you want the truest calls possible. Second: use the right tools. By that I don't mean use our tools; I mean, for whatever problem you have, try to figure out which tool is the proper one. Don't take the lazy approach of using some tool just because you hear everyone else is using it; spend a couple of days researching the options and figure out which one will actually deliver what you're trying to accomplish. Finally, population genetics is looking more and more like the ultimate quality control for SNP calling. It was largely ignored for I don't know how long, but in the last few months of the 1000 Genomes Project, the population geneticists are suddenly getting their due credit and shining in the limelight. So here's to them.

These are just the usual suspects from our lab, and our fearless leader. That's it for now — any questions before we go on break?