Okay, okay. We are going to get started now. Before lunch, we learned about alignment algorithms. The second part of that is going to be really useful here, because now we want to learn how we call variants. The way I've broken down this talk is basically into five different sections. First, what are we trying to accomplish when we do variant calling? Second, we'll look at what goes into a variant calling pipeline: what kind of components would you usually see when you're trying to do variant calling? Then, just like we had the BAM format before, there's a standardized file format here too, called the VCF file format. Then, to get the most power out of your variant calling, a lot of people apply individual filters to the variant calls that have been produced; this helps reduce a lot of the false positives that you might see in the data. And then I'll talk briefly about dynamic filtering.

So variant calling is really all about trying to find, in the sample genome, those locations that are different from your reference genome. You might be looking for a heterozygous SNP on the left, or, on the right, you might be looking for homozygous deletions. You might be looking for more advanced things like structural variants; that's something Erin is going to discuss later on today. What complicates this process is that, if this top line represents the reference and all these reads down here are individual sequencing reads, what you're trying to look for is some sort of pattern. So here you see a location that might be a heterozygous SNP, with A's instead of a G up here. But you also see a lot of other places where you have various differences from the genome. The complexity of this process is actually to ignore those sequencing errors and concentrate on what might be true biological variation.

One of the earliest ways you could tackle variant detection was simply by looking at the base qualities. So again, if the top row represents the reference sequence, you might have a situation where the read has a T with a very high quality here, and this reference location also has a very high quality. Because it's a high-quality discrepancy, you're more likely to believe that this is a SNP. In this other case, however, you have very low base qualities, and so you're not really sure what the actual reference base should be, or what the actual read base should be. So you're less likely to call this position a SNP, simply because you're not as certain about what you're looking at. The way you increase sensitivity is to look not just at one read, but at entire stacks of reads. That's why, traditionally with Illumina technology, people will say you need probably 30x of coverage to do whole genome sequencing at very high quality; some other technologies may require less coverage. A lot of the variant callers use a Bayesian framework for doing the variant calling. Bayesian statistics basically rely more on prior probabilities and observations, as opposed to frequentist statistics.
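To make that Bayesian idea concrete, here is a minimal Python sketch. It is an illustration only, not any particular caller's actual model, and the pileup of bases and qualities is made up. It converts Phred base qualities into error probabilities and computes naive diploid genotype likelihoods for one position from a stack of reads.

    import math

    def phred_to_error(q):
        """Convert a Phred quality (e.g. 30) to an error probability (0.001)."""
        return 10 ** (-q / 10.0)

    def genotype_likelihoods(pileup, ref="G", alt="A"):
        """Naive diploid genotype likelihoods for one site.

        pileup: list of (observed_base, phred_quality) from the reads
        covering this position. Returns log10-likelihoods for the three
        genotypes ref/ref, ref/alt, alt/alt.
        """
        genotypes = {(ref, ref): 0.0, (ref, alt): 0.0, (alt, alt): 0.0}
        for base, q in pileup:
            e = phred_to_error(q)
            for gt in genotypes:
                # Each chromosome copy is picked with probability 1/2;
                # a read reports that copy's allele correctly with
                # probability 1-e, otherwise it is a sequencing error.
                p = 0.0
                for allele in gt:
                    p += 0.5 * ((1 - e) if base == allele else e)
                genotypes[gt] += math.log10(p)
        return genotypes

    # A heterozygous-looking stack: 4 G's and 3 A's, mostly high quality.
    pileup = [("G", 35), ("G", 30), ("A", 33), ("G", 38),
              ("A", 30), ("G", 25), ("A", 12)]
    for gt, ll in sorted(genotype_likelihoods(pileup).items(),
                         key=lambda kv: -kv[1]):
        print(gt, round(ll, 2))

A real caller adds a prior over genotypes, which is where population allele frequencies and the number of samples come in, and normalizes to get a posterior; that is essentially what the formula on the next slide is doing.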
You don't have to memorize this slide, I promise you, but it's interesting to note what goes into these genotype likelihood calculations: the number of individuals in your population, information about the actual base qualities of the alleles that you observed at that site, as well as the actual allele call on each read. And then this is all normalized by this term here.

So, who knows the classical definition of a SNP? The classical definition is that you're observing some sort of new allele in at least one percent of the population. They always leave "population" as something very nebulous and undefined. And a lot of the time, when we talk about a SNP versus an SNV, even when we're looking for variants in a single individual, people will erroneously call it a SNP. But it's only truly a SNP if you're looking at a population of samples and you find a certain allele occurring there. That's why in literature about tumor/normal assays you'll see them talk about SNVs, because you're really only looking at that one sample.

When you're doing variant calling, besides having lots of reads to give you additional power, having additional sequenced samples also gives you a lot of extra power. This is from the 1000 Genomes Consortium pilot paper. What they're showing here are several different data sets, and the y-axis is basically showing how many variants in a single individual were not discovered. As you can see, when you only have small numbers of samples, that can be quite high, as high as 20 percent. But as you add more and more samples to your population, the chance of not discovering one of those variants in a single individual decreases quite substantially, and it levels off pretty quickly. Questions so far? All right.

So what goes into a variant calling pipeline? This is taken from the Nielsen paper in 2011. The processes at the top we've already talked about: image analysis and base calling, because first we need the reads to do anything, and then read mapping, which is pretty much what we did in the previous module. And then here are the extra steps that we've also done: we've realigned, we've removed duplicates. One of the steps that we haven't done, but a lot of people do, is recalibrating those base quality scores. From there it basically splits into two different types of pipelines, depending on whether you have a single sample or a population of samples. And it becomes sort of an iterative approach: in the calling you'll identify a series of candidate SNPs, and then you might do some additional filtering here at the end, or you might tweak those filters and then go back.

Before the next slide, we have a question. Yeah: on base quality recalibration, it sounds like somebody did a study and found that the quality values the machine produces are not actually accurate, so it seems like an opportunity to feed that back to the base caller. You're exactly right. And our take is that some of those studies, which were originally done by the Broad, were looking at a lot of data and did find a systematic bias there, where the base caller was overconfident on the higher end. But nowadays, when we do our internal tests, we find that base quality recalibration doesn't help at all, so at least at Illumina we skip that step altogether. We still do duplicate removal.
We still do alignment. But base quality recalibration, we see that as optional. However, if you check the GATK best practices webpage, you'll still see it listed there, so it's really up to your discretion. For other technologies, it really depends on the base caller; if you're using a 454 base caller, for example, it might have other biases. So I think the best route forward is, if you're ever in doubt, it can't hurt to check what those recalibration graphs actually look like on a dataset that you care about, and see if there's a systematic bias. If you see those biases, go right ahead and use the recalibrator; if not, maybe that's a step you can leave out to speed up your analysis. That's exactly what happened with us: we saw that bias, we went back to the base caller and said, okay, how can we improve this?

This last step looks a little bit strange, and it's one of the topics I had the most trouble with when I first got into the world of variant calling. That is: why do I have to spend so much time filtering my SNPs? Why can't you guys just get it right the first time? But it's just one of those sad truths. All the variant callers have their own issues, and when you look at the data you have to sort of weed out the likely false positives from the true positives.

If we look at the pipeline from a file-centric view: for a full human genome, you might have about 200 gigs worth of BAM files that have recalibrated base qualities and the duplicates removed. From there you want to produce a raw VCF file; the VCF file is the one that contains all the SNPs, and we'll talk about it in a moment. So what can I actually use to do my variant calling? SAMtools has a variant caller built in. GATK has one, and that's the one we'll be using today. FreeBayes was created in Gabor Marth's lab, which both Aaron and I came out of. And Cortex (cortex_var) is another variant caller that takes an assembly-based approach. As you can see here, that process could take anywhere up to 10 hours, depending on how you process it. But then comes the next critical step: those raw variants aren't good enough. At the Broad, they said it might take days for an expert to go over the data set and figure out which calls are good. Instead, you could use something like the GATK variant filters, which only take about 30 minutes to run, and that's what we'll be doing today. I'm sure you're happy that you won't have to spend days looking at all the variants. At the end of the day you end up with about one gigabyte of filtered variants if you're doing whole genome sequencing of a human; you expect between three and four million SNPs in a single individual.

Yeah, a question? Going back one slide: when you do multi-sample calling versus calling a single sample, what exactly happens? Do you do that across all the individuals? Yes, some variant callers can actually call variants across multiple samples at the same time. For example with GATK, when you run the command line, you can actually specify multiple BAM files all at the same time, and with several of those samples together it actually gets additional statistical power to call the variants. So you often see better results, if you're trying to analyze an entire population, by specifying all of those BAM files. The alternative is almost what you mentioned in your question.
That is, basically doing the single-sample calling approach and then figuring out the appropriate way to combine and filter those per-sample calls afterwards. So does it make statistical assumptions about the population? Yes, it does make some statistical assumptions, and that's probably why you get additional power. For example, if you were sequencing a population of 100 individuals and in 75 of those individuals you saw that there was a deletion there, the variant caller might look extra hard at the remaining 25 to see if there was any evidence of it occurring there too. Whereas if you only did the variant calling on those individuals one by one, maybe there wouldn't be enough evidence to call it. So if you have enough data, it's preferable to do it this way. And if you can rationalize that it makes sense to call the samples together; certainly if you have samples from different populations, it might be more questionable whether you should do the variant calling together. So you make a judgment about which samples you run together.

Let's say you just want to analyze one single sample, but you have access to thousands of others. Would you just pick some number at random to call along with your one sample? No, it would be more deliberate than that. For example, for 1000 Genomes, they had three main populations in the pilot project: a Chinese/Japanese population, an African population, and a European population. So it would make sense to take, say, the European samples and call those together, and the African ones together. Sure, but that's 300 BAM files; surely you don't want to do that, it's a lot of data. I mean, practically, downloading 300 BAM files to call one BAM file is going to be very challenging; it could take you weeks. But it's often worth it, because the results do get better. And at least when the Broad Institute did their analysis for 1000 Genomes, they would literally use all the samples at once. So yes, it is a lot of data, and the big data problem is a huge concern with next-gen sequencing, particularly in the data analysis. It's still cumbersome while you have the BAM files, but it gets a little bit nicer once you have the VCF files. It's almost like a sieve: at each step here you're getting rid of some data and keeping the stuff that you really want, the derived data.

So we talked about the BAM file format; now there's the VCF file format, which was also created around the time of 1000 Genomes, just because there was no single file format that took care of this. Let's see if I can use the laser pointer. Basically, you again have a number of tab-delimited columns. The first column is the chromosome, and then you have the position. The third column is the ID; that's usually an rs ID, the ones you typically see on the dbSNP website. The next one is the reference allele, so if you look at this position on chromosome 20, you'll see a G. And the alternate allele is an A in this case. Now, that's not enough to tell you exactly what's happening for a particular sample. So if we skip ahead past this, you see these odd-looking fields: zero bar zero, one bar zero, one slash one. What that means is, for a given sample, zero means the reference allele and one means the first alternate allele.
So this first person is homozygous for the reference allele, the second person is heterozygous for the alternate allele, and the third person is homozygous for the alternate allele. Does that make sense? Questions? Is that a typo, one being a bar and one a slash? Ah, good observation. It's not a typo. Usually what you'll see in these VCF files is the slash. The vertical bar means that that genotype has been phased, so you actually know which haplotype the respective alleles lie on. So with a pipe it actually makes a difference whether you have one bar zero or zero bar one, because you've phased each allele onto the appropriate haplotype.

After the alternate allele we have a number here: that's the quality. Just as we have quality scores for bases and for alignments, we also have a quality score associated with the variant. This is where it gets a little bit dicey. We've been saying that 10 means a 10 percent chance of error, 20 means 1 percent, 30 means 0.1 percent; it's logarithmic. The same is supposed to be true here, but we get wildly high quality scores out of variant callers. This is what I consider a problem in the field, and I think people are working on it. In this case you actually see a reasonable 29, but it's not unusual, and you'll see this when we look at our variant scores later on, to see a variant score like 1000 or 3000. If you actually work out the arithmetic, that's an extremely high confidence that you have a SNP there, and almost never are you that confident that you have a variant at a given position. So that's something that really needs to be calibrated.

The next field is the filter field. If it says PASS, it means the variant passed all the filters; if it says anything else besides PASS, it basically tells you which filter it failed. After that, you have the info field, and up here it describes what these mean. NS=3 means we have 3 samples; DP is the combined depth across all 3 samples, so 14x in this case; AF is the allele frequency, 0.5 here. There are also some older info fields: in some VCF files you'll see DB listed if the site is in dbSNP, or H2 if it's in HapMap 2. And the final part, the format column, actually gives you the key to understanding what's going on in the sample columns. GT means the genotype, so that's the 0|0, 1|0, and 1/1. The next one is the genotype quality, GQ; here we see 48, 48, 43. DP is the per-sample depth, so here we see 1, 8, 5. And the last one is HQ, which is the haplotype quality, given as a list.

So your genotype quality is 48 with only 1x depth? Good question. Individually, yes; if you only had 1x coverage, you'd be wondering why you'd call this as a SNP at all. But if you have done combined variant calling, you might see evidence in the other samples that leads you to believe this is a reasonable genotype. Taking it one step further, there is such a thing as haplotype-aware variant calling; GATK, for example, has something called the HaplotypeCaller. When you use that, you actually have the power to call variants in places where you have 0x coverage, because you've seen that this one sample shares variants with the other samples, and even in this one place where we had no coverage, we saw that all of the other 50 samples or so also had this other SNP. So because they're in linkage disequilibrium, you can assume that this sample also has the variant at that location.
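To make those per-sample fields a bit more concrete, here is a short Python sketch that parses a single VCF data line, modeled on the example record discussed above, and decodes each sample's genotype using the FORMAT column. The record string itself is just an illustration.

    # Decode the per-sample genotype fields of a single VCF data line.
    # The record below mirrors the example discussed on the slide;
    # field values are illustrative only.
    line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\t"
            "NS=3;DP=14;AF=0.5\tGT:GQ:DP\t0|0:48:1\t1|0:48:8\t1/1:43:5")

    fields = line.split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    fmt_keys = fields[8].split(":")          # e.g. ["GT", "GQ", "DP"]
    samples = fields[9:]

    alleles = [ref] + alt.split(",")          # index 0 = REF, 1 = first ALT, ...
    for i, sample in enumerate(samples, start=1):
        values = dict(zip(fmt_keys, sample.split(":")))
        gt = values["GT"]
        phased = "|" in gt                    # "|" means the genotype is phased
        a, b = gt.replace("|", "/").split("/")
        called = (alleles[int(a)], alleles[int(b)])
        print(f"sample {i}: {called} phased={phased} "
              f"GQ={values['GQ']} depth={values['DP']}")

Running this prints (G, G) for the first sample, (A, G) for the second, and (A, A) for the third, which is exactly the homozygous-reference, heterozygous, homozygous-alternate pattern described above.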
Here's another representation of the VCF record. It basically just distills the info fields we use the most. So here are some more. QD is a very interesting one: it's basically the quality score divided by the depth, so it's a normalized quality score. MQ is basically the root mean square of the mapping qualities; that's a useful filter. There's also the number of mapping-quality-zero reads at that location, MQ0.

I mentioned there was a best practices document for GATK, and one of the things it talks about is which of these info fields you would actually use if you're trying to recalibrate or filter your variants. What I've shown here are the different info fields used for that. QD is the one I just talked about: the quality score divided by the unfiltered depth. There's also HaplotypeScore, which measures how consistent the site is with at most two segregating haplotypes. MQRankSum. And ReadPosRankSum, that one is interesting. Let's say you have a bunch of reads where the variant only ever appears near the end of the read. Knowing a little bit about the technology, we're less likely to believe that's a true variant than if we see reads supporting the variant at all positions along the read. So ReadPosRankSum basically gives us a score that tells us whether this is likely to be a true variant, depending on where in the read the variant sits. We also have FS, a Fisher's exact test to detect strand bias. MQ we talked about. The inbreeding coefficient is used, but only if you have 10 or more samples, so it's almost never used. And then DP is the total unfiltered depth over all samples.

Yes, unfiltered. All the variant callers have what are almost like pre-filters that run first. For example, Illumina has a variant caller called Starling, and Starling will pretty much ignore all reads that have a mapping quality lower than some threshold; anything with a mapping quality lower than, say, 20 might get ignored. So that's something that's already been filtered before the variant caller does its work. In this case, when we list DP as the total unfiltered depth, that's the depth before the variant caller has done any pre-filtering. And it's different for each of the variant callers; GATK has different metrics.

A question, a little bit related to the variant callers: is there a way to tell the variant calling software that the samples are related? I don't think I've seen anything quite like that. There are pedigree-aware or trio-aware variant callers, where you define the relationship between the individuals in a separate step. The logical place would be, remember how we defined those read groups and said the sample name is such-and-such? It would be wonderful if you could define in that read group, okay, this sample has a father whose sample ID is this, this sample has a mother with this sample ID, or if you're doing some other type of genetics you could define other sorts of relationships. But I haven't seen anything like that outside of those very specific variant callers. Most of the time it actually happens as a post-processing step: when people evaluate the variants, they'll enforce those criteria. Other questions?
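As an illustration of how those INFO annotations get used for hard filtering, here is a minimal Python sketch. The threshold values are placeholders chosen for the example, not the official GATK best-practices cutoffs, and the parsing is deliberately simplistic.

    # Toy hard-filtering pass over VCF records using INFO annotations.
    # Threshold values below are illustrative placeholders, not official
    # GATK best-practices numbers.
    FILTERS = {
        "LowQD":      lambda info: float(info.get("QD", 99)) < 2.0,
        "StrandBias": lambda info: float(info.get("FS", 0)) > 60.0,
        "LowMQ":      lambda info: float(info.get("MQ", 99)) < 40.0,
    }

    def parse_info(info_field):
        """Turn 'QD=12.3;FS=1.2;MQ=59' into a dict (flags get value True)."""
        out = {}
        for item in info_field.split(";"):
            key, _, val = item.partition("=")
            out[key] = val if val else True
        return out

    def apply_filters(vcf_lines):
        for line in vcf_lines:
            if line.startswith("#"):
                yield line
                continue
            cols = line.rstrip("\n").split("\t")
            info = parse_info(cols[7])
            failed = [name for name, test in FILTERS.items() if test(info)]
            cols[6] = ";".join(failed) if failed else "PASS"   # FILTER column
            yield "\t".join(cols)

    example = ["1\t1000\t.\tA\tT\t812\t.\tQD=1.4;FS=75.0;MQ=58\tGT\t0/1"]
    print(next(apply_filters(example)))   # FILTER becomes "LowQD;StrandBias"

The point is simply that each filter name written into the FILTER column records which test a failing variant tripped, which is exactly how a non-PASS value in a real VCF should be read.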
So I've shoved a bunch of info fields down your throat, but now I want to show you why it's relevant to know about them. The first thing we talked about is that, almost as a look-ahead, you can use familial relationships to verify how well your variant calls are doing. So if you did SNP calling individually on the mother, the father, and the child separately, and you found that a site was in conflict between them, you'd be less likely to believe that locus. Now, in the 1000 Genomes project, we did measure the apparent de novo mutation rate in humans, and what's interesting is that in males it's about seven times higher than in females. Just an observation. But yes, you can use those kinds of constraints in your variant calling.

Here's another project where we were using haplotypes, or imputation, to strengthen the calls. In the 1000 Genomes project they had three main sub-projects. One of them had lots and lots of samples that were whole-genome sequenced, but only to between four and six x of coverage. And then they had the exome project, where they had much higher coverage, probably in the range of 30 to 40x; but because those samples were exome sequenced, you couldn't necessarily use imputation on them in the same way. What they found is shown by the performance of the low-coverage project. On the x-axis we have the number of genotype calls, and on the y-axis basically the number of incorrect genotype calls. Because imputation was used on the low-coverage data set, they were able to get results very close to those of the exome project, even though the exome project had vastly higher coverage. So going from about five x to 30 x, six times more coverage, you were able to overcome that difference in coverage simply by exploiting the patterns of linkage disequilibrium. So that's another approach people take: they use imputation.

This slide simply shows the power of a single-sample approach versus a multi-sample approach as far as genotype accuracy goes. You see you get a slight improvement here just looking at the proportion of calls. But the best overall was if you used GATK together with a program that's really good at doing imputation, and then the genotype calling accuracy shot way up.

A lot of people also use different data sets as proxies. The HapMap project genotyped a bunch of locations, about one SNP per 2 kb, but it has lots of individuals from a lot of different populations. That's really useful if you have a human data set and you're trying to figure out whether you have false negatives: you can look at your overlap with the HapMap variants and see, am I detecting the same variants in the same populations? The converse of that is looking at dbSNP. dbSNP has tons of variants; it's a database just packed with variants, and some of them are not even good variant calls. But it's a good reference point. If you do your variant calling and you see that 20% of your calls do not exist in dbSNP, you might want to ask whether you have a lot of false positives, because the belief is that most of the variants that have already been seen in these populations, for the different organisms, are already catalogued in dbSNP. So it's a good proxy for false positives.
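As a concrete version of that dbSNP sanity check, here is a small sketch that computes the fraction of called sites absent from a known-sites VCF. The file names and the simple chrom/pos/ref/alt matching are assumptions made for the illustration.

    # Estimate the "novel rate" of a call set against a known-sites list
    # (e.g. a dbSNP extract). File names and the chrom/pos/ref/alt
    # matching scheme are illustrative assumptions.
    def load_sites(path):
        sites = set()
        with open(path) as fh:
            for line in fh:
                if line.startswith("#"):
                    continue
                chrom, pos, _id, ref, alt = line.split("\t")[:5]
                sites.add((chrom, pos, ref, alt))
        return sites

    def novel_rate(calls_vcf, known_vcf):
        known = load_sites(known_vcf)
        calls = load_sites(calls_vcf)
        novel = len(calls - known)
        return novel / len(calls) if calls else 0.0

    # rate = novel_rate("my_calls.vcf", "dbsnp_subset.vcf")
    # A human call set where far more than the expected fraction of
    # sites is "novel" is a hint that false positives are creeping in.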
Another thing people do to screen these variants is to look at coverage. In this case, a group at Cornell looked at the coverage along this chromosome, and everything in red here is a HapMap site. They saw that the mean coverage should be around here, so when they saw variants whose coverage was way up here or way down here, those were most likely to be false positives. So they devised a filter that tries to define an upper and a lower bound on the coverage at which you'll accept a variant call.

One of the examples we saw this morning: remember we were using the indel realigner? One of the signals for deciding whether you wanted to realign a region was whether you saw a bunch of SNPs clustered in one region. Well, that's also a really good filter for your variant calls. If you suddenly see a lot of variant calls that are within, say, 15 bases of each other, that can be an indication that those are false positives. This was a study where we looked at which of our SNPs were in dbSNP or not, and in a Sanger data set or not, and basically anything closer than about 15 base pairs to another call in that study was most likely a false positive. There are other metrics too. People might just find a good cutoff for the variant quality score. Some people pay close attention to the transition/transversion ratio: a lot of projects monitor it, because the transition/transversion ratio in humans, for example, is really close to 2. If you get something divergent from that, it might be an indication that you have problems in your variant calling.

So I've given you some insight now. There are a lot of different filters you can use, and it's all daunting. I see these puzzled looks on your faces, like, oh my gosh, this is so much. And it is that daunting in real life too. One of the ways of trying to cope with this is dynamic filtering. For example, the Broad used to like looking at three things simultaneously. As they adjusted their variant calling parameters, they would look at their dbSNP rate, what percentage of the variants are already in dbSNP; they would look at that transition/transversion ratio; and they would also look at the overlap with a SNP chip, in this case the chip that was run on these samples. From that they were basically able to figure out at what point to define their cutoffs.

But to illustrate the problem, let's assume you have several pipelines. Today we use BWA, and we're going to use GATK for our variant calling, and we'll apply some filters and then annotate those variants. But somebody else might have a different pipeline that uses Bowtie for the alignment and SAMtools for the variant calling. Those will need different filters, because the variant callers behave differently and the aligners behave differently. But what if you wanted to use BWA with SAMtools, or Bowtie with GATK? Or let's say you have different types of data sets: you can have low-coverage or high-coverage data sets, and even among those you might have good runs and bad runs. And then it gets really confusing when you start adding even more aligners and variant callers to the mix. And so you get to this point where people start banging on their laptops and yelling at their laptops.
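Here is a tiny sketch of what one of those dashboard metrics, the transition/transversion ratio, looks like as you sweep a trial quality cutoff. The records and cutoffs are made-up values, just to show the mechanics.

    # Sweep a quality cutoff and watch the transition/transversion ratio,
    # one of the dashboard metrics described above. Records are
    # (qual, ref, alt) tuples with invented values.
    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    def ts_tv(records, min_qual):
        ts = tv = 0
        for qual, ref, alt in records:
            if qual < min_qual:
                continue
            if (ref, alt) in TRANSITIONS:
                ts += 1
            else:
                tv += 1
        return ts / tv if tv else float("inf")

    records = [(812, "A", "G"), (350, "C", "A"), (240, "C", "T"),
               (18, "G", "T"), (95, "T", "C"), (51, "A", "C")]
    for cutoff in (0, 30, 100):
        print(f"QUAL >= {cutoff}: Ts/Tv = {ts_tv(records, cutoff):.2f}")
        # In whole-genome human data you would hope to see this settle near 2.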
And so the solution to that combinatorial tuning problem is, instead of using hard filters, to use one of the programs that's out there, also from GATK, called Variant Quality Score Recalibration, or VQSR. It's a bit of a misnomer. Basically, it tries to evaluate your data set by comparing it to other data sets; it compares it to HapMap or to dbSNP and tries to adjust the filters in just the right way so that you get the results you're looking for. Well, assuming that you have truth data from somewhere. For example, right now VQSR uses HapMap data and the Illumina Omni chip data as the truth data. If you were working on a different organism, or even doing any sort of strain genetics, as long as you had some sort of truth data that you could use as a reference point, known variants, et cetera, you could use that with VQSR.

So the idea is that you partition the data; this is how VQSR does it, and luckily you don't have to do it on your own. You partition it into sites that overlap HapMap 3.3 and the Omni chip, and sites that don't. Here, on the x-axis, is that QD score, the variant quality divided by depth, and on the y-axis you have evidence of strand bias. Everything in blue is overlapping the truth data; everything in red outside of that is something that might be suspicious, something that might be a false positive. They use an expectation-maximization algorithm to learn what the most probable variants look like based on the truth data, and then they assign a probability to each variant based on how well it compares to that training set. Then they throw out the 3% worst variants from the data and train some more. In the end, you end up with a set of variants that definitely lie outside of here. So instead of filtering everything you saw here that was red, what it has basically taught itself is that only these pinkish variants up here are the ones that are most likely to be erroneous. It does this automatically for you, over several iterations of that expectation-maximization algorithm, and it comes up with filters that you can then apply to your VCF files.

How long does it take, and when it's looking at that HapMap data, is it looking it up online? No, no, you actually have a flat file on your hard drive that has all of that. The only caveat I would give about VQSR is that you need enough overlapping points. The sad thing about today's exercise is that, because of time constraints on how long we could spend aligning all our reads, the little interval of reads that we've aligned to chromosome 1 simply isn't enough to actually utilize VQSR. So in today's exercise we're going to use the manual filters. But moving forward, if you have a whole-genome sequencing data set, and exome data is a bit of a borderline case, but definitely with whole-genome sequencing, you can use VQSR and it does a really good job.

Another way to describe what VQSR is doing: if you were just to look at this 2D filter, internally it's basically looking at, okay, what percentage of the variants overlap the known sites, and which ones are novel, ones you've never seen before. And then you could basically place a threshold at some point.
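Here is a drastically simplified sketch of that recalibration idea: fit a model to one annotation (QD) for the calls that overlap a truth set, then score every call against that model. Real VQSR fits a multidimensional Gaussian mixture over several annotations with expectation-maximization and iterates; the numbers below are invented.

    import math

    # A drastically simplified illustration of the VQSR idea: model an
    # annotation (here QD) for variants that overlap a truth set, then
    # score every call by how well it fits that model.
    def fit_gaussian(values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return mean, max(var, 1e-6)

    def log_density(x, mean, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

    # (site_id, QD, overlaps_truth_set): made-up numbers.
    calls = [("s1", 14.2, True), ("s2", 12.8, True), ("s3", 15.1, True),
             ("s4", 13.5, True), ("s5", 2.1, False), ("s6", 11.9, False),
             ("s7", 1.4, False)]

    mean, var = fit_gaussian([qd for _, qd, truth in calls if truth])
    for site, qd, truth in sorted(calls, key=lambda c: log_density(c[1], mean, var)):
        score = log_density(qd, mean, var)
        print(f"{site}: QD={qd:<5} score={score:8.2f} truth_overlap={truth}")
    # The lowest-scoring sites (s7 and s5 here) are the ones a
    # recalibration step would flag as likely false positives.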
You can also look at it positionally. Across the chromosome you can see which variants are novel and which ones overlap the known sites. In this case, what we see is a bunch of red close to the center here, so you might be more likely to filter the variants that occur close to the center, because those are the novel ones. It's a bit of a double-edged sword, because if you're really trying to find new variants, those are exactly the ones you'd be throwing out; a lot of these filtering approaches rely on the fact that somebody has probably already looked at these variants before. So you have to be a bit careful, depending on what your project is. But for most projects, VQSR and this kind of filtering works really well.

So, to summarize: pre-processing is key. Everything that we did in module 2, the duplicate marking, the indel realignment piece, and the base quality score recalibration, all three of those contribute to reducing your false positives. Just like before, we have a standardized file format, called VCF. It's a bit complicated, but after you've seen it a few times it starts to make sense. And for improving results, we've learned a little bit about using manual filtering techniques and dynamic filtering. Any questions before we begin?

If we use two different callers on the same sample set, what overlap would we expect? We often saw huge differences. Oh, sorry, so the question was: if you have one series of experiments where you've used different variant callers, say two different variant callers, what kind of overlap could you expect between the variants that are called? That can often be a problem. My gut instinct would be, I'd say, 70 to 80 percent overlap, but I've seen less than that. In fact, if you read the 1000 Genomes paper really carefully, to address that issue they ended up with a really weird voting mechanism. It ended up being: if two out of three sequencing centers actually called a variant, then it would make it into the official 1000 Genomes list of variants, just because that overlap was such a huge issue. So, is 70 to 80 percent good? I mean, you would want it to be much higher. Certainly after filtering it should be a lot higher. But before filtering, if you're just looking at the raw VCF files, I think it might only be 70 to 80 percent.

Could you use two variant callers as a form of filtering, keeping only the variants they both agree on? Yeah, that certainly works, assuming that the variant callers use very different algorithms. We also saw that issue in the 1000 Genomes context, where we were looking at how SAMtools was calling variants versus another variant caller, glfMultiples, from the University of Michigan. They both use almost the same equations for variant calling, and so they said, oh look, we have almost the same results, they must be good. As soon as you compared to another variant caller that didn't use the same equations, then you did see differences. So you have to be careful about that. But it is a good approach; I'm a big fan of using orthogonal approaches, and that goes for everything. You might even say, if right now you have an assay that you used Illumina technology for, it would be useful to combine that with PacBio or 454, because they have different error modes, and sometimes you can see, oh, these are most likely errors because I didn't see them on the other platform.
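And here is a minimal sketch of that consensus-voting idea, keeping a site only if at least two of three call sets agree. The call sets are invented (chrom, pos, ref, alt) tuples, not real data.

    # Keep a variant only if at least `min_votes` of the call sets agree,
    # echoing the voting scheme described for the 1000 Genomes call sets.
    from collections import Counter

    caller_a = {("1", 1000, "A", "T"), ("1", 2200, "G", "C"), ("2", 512, "C", "T")}
    caller_b = {("1", 1000, "A", "T"), ("2", 512, "C", "T"), ("3", 77, "T", "G"),
                ("2", 900, "A", "G")}
    caller_c = {("1", 1000, "A", "T"), ("1", 2200, "G", "C"), ("3", 77, "T", "G")}

    def consensus(call_sets, min_votes=2):
        votes = Counter(site for call_set in call_sets for site in call_set)
        return {site for site, n in votes.items() if n >= min_votes}

    kept = consensus([caller_a, caller_b, caller_c])
    for site in sorted(kept):
        print(site)   # only sites seen by at least 2 callers survive

In this toy example the site at position 900, seen by only one caller, is dropped, while everything seen by two or more callers is kept.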
Other questions? All right. So let's go.