So Michael feels that in order to get through the lab exercises, it would be wiser to put the lab 1 and lab 2 exercises together. He's going to go through the lecture for module 2, and then we'll have a combined lab with no break, so that when there are problems we have an extended period of time to address them. My concern was simply that we break for lunch at 12:30, and this next lecture will probably be about 20 to 30 minutes long, which gives us a full hour to work together on labs 1 and 2. For those of you who feel like you want to go ahead, all the instructions are provided, so go ahead. For those of you who want to learn more about what we're actually doing at each step, you can stay in sync with me. I've already started.

Okay, so if you bear with me, there's just one more lecture before lunch, and it's about the flip side: now that you have your wonderful alignments, what do you do with them? Most commonly, you want to identify small variants. I've broken this talk into five main sections: a basic introduction to what variant calling is all about; the variant calling pipeline; the standardized file format for variants, called VCF (just like we had the BAM file format for alignments); how you filter your data to reduce false positives and false negatives, so that what you end up with is the truest data set you can get; and finally, whereas normal variant call filtering is often very linear (you set thresholds for different parameters), there's a newer style people are using called dynamic filtering, so we'll talk about a GATK tool called VQSR.

So what is small variant discovery? In a piece of DNA, you're trying to find single nucleotide polymorphisms in some cases, and indels in others. In this example we have a two-base-pair insertion of A's.

Now, when I started with variant discovery, my master's thesis project was to create a SNP caller. At that time I had some biology friends, but I had a whole bunch of computer science friends, and I was trying to describe my project to them. I said I was trying to look at a bunch of sequences and find out what's different about them compared to a reference genome. And all my computer science friends looked at me and said, God, that's so boring; you basically wrote diff for sequences. (diff is a UNIX command that identifies the differences between two files.) And I said no, no, it's not that; there's added complexity. In DNA sequences you're going to have sporadic mismatches, and those are probably sequencing errors, but you're also going to have potentially true events. In this example, at this location it looks like we have an A/G heterozygous SNP. The trick for a SNP caller is to weed these apart: don't be fooled by the sequencing errors, and find the true events. The very earliest SNP callers did this by looking at the base qualities.
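As a quick reminder of what those base qualities actually mean on the phred scale, here's a minimal sketch. The Q-to-error-probability relationship is the standard one; the example quality string and the ASCII offset of 33 (the common Sanger/Illumina 1.8+ encoding) are just for illustration.

```python
# Phred scale: Q = -10 * log10(P_error), so Q20 means ~1% chance the base call
# is wrong, and Q30 means ~0.1%.

def phred_to_error_prob(q: int) -> float:
    """Convert a phred quality score to the probability the base call is wrong."""
    return 10 ** (-q / 10)

def fastq_char_to_qual(c: str, offset: int = 33) -> int:
    """Decode one FASTQ quality character to its phred score (offset 33 encoding)."""
    return ord(c) - offset

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4g}")
    # e.g. the made-up quality string "II5!" decodes to [40, 40, 20, 0]
    print([fastq_char_to_qual(c) for c in "II5!"])
```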
So if the top side represents the base quality of the base in the reference, and the bottom shows the base qualities of the reads, then in the first case both the reference and the read have a very high base quality. With a discrepancy between two high-quality base calls, you're more likely to believe this might be a SNP, even if all you had was one read. In the second case you have a low-quality situation, where both the read and the genome have a very low base quality, and then it's kind of a crapshoot: from one read you couldn't really make a decision one way or the other about whether this is a potential SNP site.

Now, it's been a while, something like ten years, since we started using Bayesian equations to try to figure out whether a site is polymorphic or not. And there's going to be a test at the end of the lecture on what's going on here. Nah, just kidding. But this Bayesian framework takes into account the base qualities involved, the allele calls that were present, and some prior probabilities for the different genotypes. (There's a small sketch of this idea at the end of this part, before we get to the pipeline.) The most accepted definition of a SNP is a single nucleotide polymorphism with at least a 1% allele frequency in a population, and that's what this term is addressing: the number of people you're looking at when you're doing your SNP discovery. And this part is just normalizing everything that was on the top. So if you want to get right down to the technical distinction: a SNP is what you call it when you're doing variant calling across a population of individuals; the less aesthetically pleasing acronym, SNV, is what you'd call it if you only have one sample, or if you're doing something like tumor/normal variant discovery.

Some of the things you have to take into account when you're doing variant calling are the different ploidies. You'll have haploid and diploid individuals, and in some cases pooled DNA. For example, if you're sequencing the human genome and you end up looking at the mitochondrial alignments, that's effectively pooled; it's not going to behave as haploid or diploid. The Y chromosome, on the other hand, would be a haploid situation.

This slide was taken from the 1000 Genomes pilot paper. I won't go into detail about these data sets, but what we're showing on the Y axis is the percentage of variants that were not discovered, so basically the false negative rate, and on the X axis is how many samples were sequenced, that is, how many individuals from the population. The big take-home message is that for the first 100 samples or so that you do variant calling on, you can reduce your false negative rate quite a lot; in this case, the false negative rate means the chance of not finding the variants carried by a single individual. After a while it plateaus: if you were to add another 300 individuals, you probably wouldn't discover many more variants than you did with the first 300. So that's more about experimental design than about what you'd actually do with a variant caller, but it's equally important.
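Here's the small sketch of the Bayesian idea I mentioned: for a single diploid site, it turns per-base error probabilities into genotype likelihoods, multiplies by a simple prior, and normalizes. The error model and priors are deliberately simplified, so treat it as an illustration of the framework rather than what GATK, samtools, or FreeBayes actually implement; the read data at the bottom is made up.

```python
# A minimal Bayesian genotype caller for one diploid site.
from itertools import combinations_with_replacement

def base_likelihood(observed: str, true_allele: str, q: int) -> float:
    """P(observed base | true allele) under a simple error model."""
    err = 10 ** (-q / 10)
    return 1 - err if observed == true_allele else err / 3  # error spread over the 3 other bases

def genotype_posteriors(pileup, ref: str, alt: str, theta: float = 0.001):
    """pileup: list of (base, phred_quality) tuples covering one site."""
    genotypes = list(combinations_with_replacement(sorted({ref, alt}), 2))  # e.g. AA, AG, GG
    # crude priors: homozygous reference most likely, het ~theta, hom alt ~theta/2
    def prior(gt):
        if gt == (ref, ref):
            return 1 - 1.5 * theta
        return theta if ref in gt else theta / 2
    unnorm = {}
    for gt in genotypes:
        lik = 1.0
        for base, q in pileup:
            # each read samples one of the two chromosomes with probability 1/2
            lik *= 0.5 * base_likelihood(base, gt[0], q) + 0.5 * base_likelihood(base, gt[1], q)
        unnorm[gt] = lik * prior(gt)
    total = sum(unnorm.values())
    return {"".join(gt): p / total for gt, p in unnorm.items()}

if __name__ == "__main__":
    # six reads over a site: four G's (matching the reference) and two A's at decent quality
    reads = [("G", 30), ("G", 28), ("A", 32), ("G", 25), ("A", 30), ("G", 20)]
    print(genotype_posteriors(reads, ref="G", alt="A"))  # the A/G het should come out on top
```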
So let's talk a little bit about the variant calling pipeline. What we have here is from the Nielsen paper from 2011. The first steps are basically what we discussed about an hour ago for alignment: base calling, read mapping, and then duplicate removal, indel realignment, and even base quality recalibration. Then we see two distinct workflows, one for doing variant calling in a population and one for doing variant calling on just one individual. So that's what Francis was just talking about: in some cases you have two different types of variant calling pipelines, depending on your situation. But what's common to both of them at the end is that you're going to refine your data set, you're going to filter your variant calls, and that's what we have down here. They even mention genotype quality score recalibration.

This is an interesting slide from Mark DePristo's group that coincides with what I was saying in the computational resources argument: as you go through these pipelines, you boil the original data down further and further. In this example we're assuming all the alignment has been done, so for a 30x human genome you might have BAM files totaling about 200 gigabytes; these are BAM files that have been recalibrated and had duplicates removed. At that point you run a variant caller. It could be SAMtools or GATK; the lab that Aaron and I used to be at has a variant caller called FreeBayes; and right now in England, Cortex is a pretty popular variant caller as well. Variant calling on that amount of data might take up to about 10 hours. The resulting VCF file is much smaller, so you've gone from 200 gigabytes down to about 1 gigabyte. But at this point you have to either manually curate, which is what Mark DePristo's group calls expert user judgment. That can take days; you had full-time people investigating every single variant call, trying to figure out whether it's believable, and expert user judgment usually involves some round of validation. Alternatively, or in conjunction with that, you can run variant filtration, and that usually takes about 30 minutes for this amount of data. What you end up with is still about a gigabyte of variants, but these have been filtered. You might then do some additional analysis, like segregating the haplotypes and trying to remove machine and alignment artifacts. Any questions on this slide?

All right, the VCF file format. If you thought the SAM file format was hairy, just wait. VCF is a text format, and if we look at each entry, the fields come in this order. The very first field is the chromosome; in this case it was chromosome 20. The next field is the position, and then an ID, which is typically a dbSNP reference ID; this particular variant had this rsID. Then this G represents the base in the reference, so the reference had a G here, and in this case we found an A in our sample. The 29 represents the quality; this is the variant score. Then you have what's called the filter field. Right now it says PASS; if it says PASS, or if it has a dot in it, that means this is a good variant as far as we know. If it says anything else, that means it's been filtered; it might say low quality or any number of things. Then we have a very long string here, which is called the info field.
Now, the info field can hold things like, in this example, how many samples were involved and what the combined depth was; here we had 14x coverage. There's also the allele frequency. Some of the fields in this example are pretty non-standard nowadays; the next slide shows some typical fields that you'll see from GATK. After that we have what's called the genotype format field, which tells you what the remaining per-sample fields actually mean. The first one is GT, the genotype; GQ is the genotype quality; DP is the read depth; and HQ is the haplotype quality. And just as we said there were three samples, three individual calls are listed here. Now, when you look at this genotype, you see a one and a zero, and you think, what the heck is that? I was looking for a G and an A. I can explain that better on the next slide.

This next slide shows the same entry, but describes it in a more vertical fashion. Again we have the position, and then we have the reference allele and the variant allele. The way this works is that all the alleles are numbered in the order they're listed: the reference allele always gets number zero, the first variant allele gets one, and if you have multiple variant alleles they get two, three, et cetera. So later, when you look at the genotypes, if you see a zero slash one, what does that mean? It means we have a G/A heterozygote. What I like about this slide is that it also explains some of the more typical fields you'll get when you're running GATK. In this case we have AC, the number of chromosomes carrying the allele, and AF, the allele frequency; QD is basically the quality score divided by depth; and we have all the descriptions here. As you can guess, there's the potential to put a whole lot of data into these VCF files, and it can get pretty complex to parse them. I didn't actually put it in this presentation, but there's a tool out there that can make your life really easy, called bcftools. If you want to do any sort of analysis or filtering, or look at the intersections between two different data sets, you can download bcftools and that will help you out. Yes, yes, for the stuff that we're doing here. Yes, Pasha? As you command.

Yeah, so that's a very good question. In fact, for the eagle-eyed people in here, there are essentially two variant scores. There's this variant score here, the 29, which is QUAL, but for each of these genotype fields you also have a genotype quality. So what's the difference? The variant score, the first one, answers the question: what is the probability that there's a variant at this site? The genotype quality answers: what is the probability that this particular genotype call, say that it's a G/A het, is correct? So it's a different kind of question.
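To make that record a little more concrete, here's a minimal sketch of pulling apart one VCF data line by hand. The values are the example record from the VCF specification, which is essentially the one on the slide (chromosome 20, a dbSNP rsID, G reference, A alternate, QUAL 29, PASS, three samples); for real work you'd reach for an existing parser rather than rolling your own.

```python
# One VCF data line: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 ...
line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\t"
        "NS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t"
        "0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")

chrom, pos, vid, ref, alt, qual, flt, info, fmt, *samples = line.split("\t")

# INFO is a semicolon-separated list of KEY=VALUE pairs (flags like DB have no '=').
info_dict = dict(kv.split("=", 1) if "=" in kv else (kv, True)
                 for kv in info.split(";"))

# Allele 0 is always the reference; alternate alleles are numbered 1, 2, ...
alleles = [ref] + alt.split(",")

print(f"{chrom}:{pos} {ref}->{alt}  QUAL={qual}  filter={flt}  depth={info_dict['DP']}")
# QUAL is phred-scaled, so 29 corresponds to roughly a 0.13% chance the site is not variant
print(f"QUAL {qual} -> P(no variant here) ~ {10 ** (-float(qual) / 10):.3g}")

for fields in samples:
    call = dict(zip(fmt.split(":"), fields.split(":")))
    gt = call["GT"].replace("|", "/")            # ignore phasing for this sketch
    bases = [alleles[int(i)] for i in gt.split("/")]
    print(f"genotype {call['GT']} -> {'/'.join(bases)}, genotype quality {call['GQ']}")
```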
But here we get into my pet peeve with all variant callers. We talked earlier about the scale we do things on: a base quality of 20 meant there was a 1% error rate, and 30 meant 0.1%. Technically, these variant scores obey the same scale. So when you look at GATK or any other variant caller, you get these obscenely high numbers; I've seen variant scores like 3,000 or 10,000. When you start thinking about how infinitesimally small that error rate would be, you realize you could sequence until the heat death of the universe and never actually reach that kind of precision. So that's one thing I'm hoping I can eventually help correct with the work I'm doing, getting those variant scores onto a more honest scale. But that's a very good question about what those variant scores actually mean.

There's a very good document, and this one I actually do have listed on the wiki page, called the GATK best practices. It's not in the slides; it's on the wiki page. The best practices document has all sorts of advice on how to go about setting up a variant calling experiment, but towards the end it starts talking about which filters get you the best bang for the buck. The info fields it concentrates on are the following. Again, QD is the variant score divided by depth. There's the haplotype score, which measures the consistency of the site with at most two segregating haplotypes. MQRankSum is a rank-sum test that looks at the mapping quality. ReadPosRankSum I think is pretty interesting: for sequencing technologies like Illumina, where you tend to get sequencing errors towards the ends of the reads, this basically down-weights a site if all your evidence for a mutation comes from just the ends of some reads; maybe you don't want to believe it as much as if you had found it towards the middle of the read. FS is the Fisher's exact test for strand bias: if you look at a bunch of alignments, you'd like to see a fairly equal number of reads in the forward direction as in the reverse direction, and if there's any bias it will be reflected here (there's a small sketch of this test at the end of this part). MQ is another way of reporting the mapping quality. They also recommend something called the inbreeding coefficient, but that one is only applicable if you're sequencing ten or more individuals, and in practice I'm usually never looking at that many samples at once, so it becomes uninteresting for me. And finally, DP is the total unfiltered depth across all samples. So the GATK filters can use all of these info fields to weed out the potential false positives.

So let's talk about that some more. We already talked about how you can use trios, and trios are pretty powerful. We used that in the 1000 Genomes project when we released the variant calls for the European trio and the Yoruban trio: every time they detected a potential trio conflict, they just threw that variant call out of the candidate data set. So that's one way of reducing false positives. Then again, there is a certain de novo mutation rate, so do that at your own peril. What's interesting is that males have a seven times higher de novo mutation rate.
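And here's the small sketch of that FS test: build a two-by-two table of reference versus alternate reads on the forward versus reverse strand and run Fisher's exact test. It assumes scipy is installed, the counts are made up for illustration, and the phred-scaling of the p-value follows how GATK reports FS (bigger FS means more strand bias).

```python
# Fisher's exact test for strand bias on one candidate variant site.
import math
from scipy.stats import fisher_exact

#            forward  reverse
table = [[40, 38],   # reads supporting the reference allele
         [21,  2]]   # reads supporting the alternate allele (almost all forward: suspicious)

_, p_value = fisher_exact(table, alternative="two-sided")
fs = -10 * math.log10(p_value) if p_value > 0 else float("inf")  # phred-scaled p-value
print(f"p = {p_value:.3g}, FS = {fs:.1f}")
```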
Another thing that you can do is use imputation. Again, this is taken from the 1000 Genomes data set, and what's important is that in 1000 Genomes we had a low coverage project and a high coverage exome project. When we're talking low coverage, it was between 4x and 6x across the genome. But because they had many more individuals and could then use imputation, what we're showing here, with the number of genotype calls on the x-axis and the number of incorrect calls on the y-axis, is that the error rate is actually astonishingly similar, even though the low coverage project had so little coverage. So that shows the power of imputation: even with all the odds against you, you could basically even the odds by using imputation to figure out what the proper call would be. As an extreme example, imputation could potentially tell you what the variant is in your data set even if you have zero coverage at that location. And that's a pretty powerful thing to see. Another way of looking at imputation: this was a study that looked at using GATK. The blue line represents using GATK on just one sample, and the red line is using it on multiple samples, so you see that with multiple samples GATK gives you a somewhat higher genotype accuracy. But for the top line they paired GATK with Beagle, which is an imputation program, and you see that the genotype call accuracy increased quite a lot.

Another thing a lot of people do when they're looking at variant calls, and this is kind of specific to human, at least in the case of HapMap (dbSNP, however, will be applicable for all the fish lovers and, you know, poplar lovers here): if you look at which variants overlap with HapMap, you can get an idea of your false negative rate. HapMap identified variants every 2 kb or so, but these are very easy-to-find variants, and they're divided according to populations, so if you don't find a HapMap SNP for your population, that might be an indicator that you're not sensitive enough to find those SNPs. On the flip side, dbSNP seems to have everything and the kitchen sink in it, so if you have a large number of variants that are new, that are not in dbSNP, that might be an indicator that a large fraction of your variants are false positives. And dbSNP has SNPs for all sorts of different organisms, so that one's good to use.

We mentioned before that some people look at coverage. In this case what they've done is color-code the coverage across a chromosome: red represents the coverage for all the SNPs that overlapped with HapMap, and blue for all the other SNPs. You see that these well-characterized SNPs seem to have a fairly even baseline coverage, and of course here you have the centromeres. So looking at this, you might say, okay, I'll place a filter here: anything that has less than this much coverage I'll remove, and anything that has more than this much coverage I'll remove.

Same kind of thing: remember that very first slide we showed in the alignment lecture, the indel cleaning slide? It showed that we potentially had a lot of SNPs in one location, but after we did the indel realignment it all cleaned up and became just one deletion. So one of the QC metrics you can look at is the distance between your SNPs, in this case the fraction of SNPs that are really close to each other. And what you can do with that is, if you find clusters of SNPs that are really close together, maybe you want to ignore those. So that becomes a good filter.
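A minimal sketch of that clustered-SNP filter: flag any site that sits in a run of too many SNPs within a small window. The three-SNPs-in-ten-bases numbers and the positions are purely illustrative; you'd tune them for your own data.

```python
# Flag SNPs that fall in suspiciously tight clusters.

def flag_snp_clusters(positions, cluster_size=3, window=10):
    """positions: sorted list of SNP positions on one chromosome.
    Returns the set of positions that fall in a cluster of >= cluster_size SNPs
    spanning fewer than `window` bases."""
    flagged = set()
    for i in range(len(positions) - cluster_size + 1):
        run = positions[i:i + cluster_size]
        if run[-1] - run[0] < window:
            flagged.update(run)
    return flagged

if __name__ == "__main__":
    snps = [1041, 1500, 1504, 1507, 2300, 9872]
    print(sorted(flag_snp_clusters(snps)))   # -> [1500, 1504, 1507]
```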
For the number of people here who have a nice background in population genetics: you can look for Hardy-Weinberg violations and use that to create a filter. Other people just place thresholds: at the end of the day, when you use a variant caller you get a variant score, which is basically the probability that a site is polymorphic, so you could place a threshold and say, I want a variant score of at least this much before I believe it. A lot of people also use the transition-transversion ratio to optimize, or to figure out how well they're doing. This is controversial in my opinion, because what a lot of people will do is tweak their filters until they get a transition-transversion ratio in humans of about 2 or 2.1. Well, where does that number come from? And it doesn't actually apply equally well everywhere: intergenic regions have a different ratio than coding regions. So this is more of a rule of thumb than something you should strive for, but a lot of people do use that ratio.

Yeah? You know, in all of this, you've got so much data that you've got to figure out a way to clean up as much as possible, it's not going to be perfect, but you're essentially running up against the fact that you're eliminating some very cool sites. Exactly, exactly. What I'm presenting here is a vast array of what people have done in the past; it's not a recommendation that this is how you should go about doing it. And of the example info fields we showed, in today's lab we're actually only going to use three of them to filter our SNPs, the ones I've put in bold: QD, FS, and MQ. QD is the variant score divided by depth, FS is the Fisher's exact test for strand bias, and MQ is the root mean square of the mapping quality. Those are the three things we're going to look at today to filter our SNPs. (There's a small sketch of that kind of hard filtering just before the summary.)

All right, so we've talked about a whole bunch of potential filters, but the question remains: how do you combine these different filters in a meaningful way? This slide was also taken from Mark DePristo's group. Here they're tracking three different things at once as they tweak the filters. As they look at the number of SNPs in the data set, they track the dbSNP rate, the transition-transversion ratio, and the number of SNPs that overlap with an Omni chip, basically a SNP chip. Using this, you can decide where the appropriate cutoff is for whatever filters you're using. But the next slide pretty much shows what I'm getting at. If you're putting together a pipeline, you might have a couple of aligners you're interested in. In this case, let's say you had BWA and you wanted to use GATK with it, and let's say you also had Bowtie with SAMtools. In both of these cases, because the aligner is different, you might need different variant calling filters, so that's two different sets of filters you'd need. But it gets more complex: let's say you wanted to use Bowtie with GATK, or BWA with SAMtools. Now you have even more combinations. And let's say you had experiments with different coverage.
Somebody's doing exome sequencing, somebody's doing low coverage, and somebody's just doing normal whole-genome resequencing at 30x. Each of these will need a different set of filters. And even then, the actual run qualities might be different. You might have a new tech come in that day to do your run for you, and you get a run with a very low Q30 percentage, basically a low-quality data set, whereas your experienced lab tech might produce pristine runs for you. In both of those cases you might have to tweak the filters a little. So you can see there's a large number of filter sets you might have to maintain, and if you start adding more aligners and more variant callers, you just get frustration and rage.

And so that's where I want to introduce a GATK tool called the Variant Quality Score Recalibrator, VQSR. What it does is look at the different filters dynamically. It looks at the actual data set you have, and you basically say, hey, I want you to look at these three info fields, maybe you can find the appropriate thresholds for them. So what happens here: on the x-axis we're showing one of those parameters, QD, which we mentioned before, the variant quality divided by depth, and on the y-axis we're showing evidence for strand bias, so that would be FS. Everything that overlaps with HapMap is blue, so these are the potentially good variants when you're looking at this 2D chart, and everything in red doesn't overlap with HapMap, so these are potentially bad. What VQSR does is try to learn where all the good variants are. It uses the expectation-maximization algorithm to figure that out, it does some clustering, and then for each variant it assigns a probability for how well it clustered. It then takes the worst 3% of those variants to figure out where all the worst results are, and clusters those together. What you end up with is something that looks like this: everything in green is accepted as a potentially good variant, and everything in purple, in this case, are the variants that look the most anomalous; those are the ones you want to filter out. What's cool about this is that because you've done all the clustering on this particular data set, it doesn't matter whether you had a low-quality or high-quality, low-coverage or high-coverage data set: it's learning what the appropriate thresholds are for each of those clusters.

Another example: if you look at this in a classical histogram fashion, here we're showing the QD parameter, so varying amounts of variant quality over depth, against the number of SNPs. The SNPs in blue overlap with dbSNP and the ones in red are novel, and you see that maybe if you place a threshold here, that's probably a good filter. So it's dynamic filtering. And if you were to look at this across an entire chromosome, you'd see that all the blue dots are retained variants and all the red ones are screened out. Again, because you might have higher coverage in the centromeric regions, those would be the regions you filter the most heavily, so you see there's a heavy red band over here and some over here as well.
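Before the summary, here's the small sketch of the manual, threshold-style filtering we'll use in the lab, on the three info fields highlighted earlier (QD, FS, and MQ). It assumes each record's INFO values have already been parsed into a dict, and the cutoff values and positions below are illustrative placeholders, not a recommendation.

```python
# Hard-filter candidate SNPs on three INFO fields.

def hard_filter(record, min_qd=2.0, max_fs=60.0, min_mq=40.0):
    """Return 'PASS' or a semicolon-separated list of failed filter names."""
    failures = []
    if record.get("QD", 0.0) < min_qd:      # variant quality normalized by depth
        failures.append("LowQD")
    if record.get("FS", 0.0) > max_fs:      # phred-scaled strand-bias p-value
        failures.append("StrandBias")
    if record.get("MQ", 0.0) < min_mq:      # RMS mapping quality of supporting reads
        failures.append("LowMQ")
    return "PASS" if not failures else ";".join(failures)

if __name__ == "__main__":
    candidates = [
        {"pos": 10143, "QD": 14.2, "FS": 1.3,  "MQ": 59.1},
        {"pos": 10877, "QD": 0.9,  "FS": 75.0, "MQ": 31.0},
    ]
    for rec in candidates:
        print(rec["pos"], hard_filter(rec))   # first passes, second fails all three
```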
So, to summarize: preprocessing is key. The steps we talked about in the previous lecture, duplicate marking, indel realignment, and base quality recalibration, will all contribute to making your variant calls that much more robust. We have a standardized file format; it can be pretty hairy, and it's very intimidating the first time you look at it (actually, it's very intimidating the first ten times you look at it), but eventually it all sinks in. And then, to improve your results, there are a number of fields you can filter on, and you can optionally look at dynamic filtering.

Now, for today's lab we're not actually doing the dynamic filtering, because for the sake of time I've picked a very small region of the human genome, just 300 kb, so that we can actually align in time; that way nobody misses lunch, basically. But because you have so few data points, and the one drawback of this dynamic filtering approach with VQSR is that you need lots and lots of data points, VQSR won't actually work well with the data set we're looking at today. So we'll be applying manual filtering. Any questions on the talk? So, since I'm going to subject you guys to one whole block of lab, I'll be like Michelle: how do you guys feel about a 10-minute break before we actually start the lab? I like that. Yeah, so she is quite correct: feel free to log in when you get back, and we'll grab some refreshments.