Good morning. My two sessions are really about detecting genomic variation. So how do you actually go from the data that you've mapped onto the genome to detecting single nucleotide variants like the ones we see here? We were looking at the data quite a bit in IGV yesterday, but now we're actually going to use algorithms to call these variants. So the objectives of the module are really to go over the steps to call variants, understand the principles of variant calling, know which steps can affect and improve variant calling, know how to filter out bad variants, and annotate variants. I already had some questions yesterday about how you annotate the variants you call.

So far the file formats and files we've looked at were FASTQ files and also the BAM files after mapping. We're going to look a little bit at the VCF format that was mentioned yesterday, the Variant Call Format. These are the files you get that are much, much smaller and that contain the description of all the variants in your data. And then we'll go back to using some of the tools you learned yesterday to visualize these SNPs and play with them in IGV again.

Before I start, I wanted to open a parenthesis. These are a few extra slides that I added because we mentioned yesterday that, even though for this workshop we're using the Amazon cloud, there are other resources. Especially if you're Canadian, you actually have access to what I think is a great resource, which is the large clusters that are part of Compute Canada. At McGill we have a cluster with 25,000 cores, Sherbrooke has another one, and there's HPC4Health, something like this in Toronto. But no matter where you are in Canada, you can request access to these clusters. You just sign up, or your PI signs up, and then you can get an account on these resources. The types of commands and tools that you use are really the same; it's the same process, but instead of using the Amazon cloud, you use these resources. So again, if you're Canadian, that's great. If you're American, well, if Trump gets elected, that's another reason to come to Canada.

We've installed many of the bioinformatics tools on these resources. So a lot of the tools that we're using now you don't even need to install, a bit like on your Amazon instance: the GATK that we're going to be using, BWA for mapping, all of these tools are already installed. I've added some links with information on where you can find out more if you're interested. And the last bit, which is related to the Galaxy you're going to be using this afternoon: we have also set up Galaxy on Compute Canada as part of the GenAP project. You won't be using this particular Galaxy instance, but if you enjoy the section this afternoon where you're using Galaxy and the public Galaxy is too busy, you could potentially use Galaxy from this site. Again, this was just an aside; I think David will talk more about the Galaxy instances we've set up on Compute Canada when he talks about Galaxy.

Okay, but back to the real meat of the module and what we're going to be talking about. Just a little bit of context. Why is it interesting to call variants, and why do we re-sequence? We've sequenced the human genome, so why do we re-sequence?
We re-sequence because we're interested in detecting both single nucleotide variants and larger structural variation across individuals, in the context of disease. We want to know if there are specific variants that are more common in people affected by a particular disease. In cancer genome sequencing (I had some questions yesterday about the difference between somatic and germline mutations) the tumor acquires new mutations, so every tumor has its own set of variants. That's another reason why you would want to be re-sequencing. These are two examples that are a bit more human-centric, and for most of this module we've been using the human genome as a reference. But of course there are parallel applications in many of the other areas you work in, and the principles are the same. So even though we're using the human reference genome as an example, the pipelines apply with some small differences; Jared yesterday talked about some of the differences and challenges with other genomes, but hopefully many of the principles we're talking about here also apply to your model organism.

So this is similar to, but looks different from, the slide that Mathieu showed: a typical workflow for a variant calling analysis. You start with read trimming and removing adapters, trimming for quality and so on; that's what you see here on top. Then there's the alignment step, and then we move on to variant calling. In parallel to that, it's useful to look at various statistics.

So in terms of the workflow, again, we start with quality control, which is very important, especially for variant calling, because if your reads are very noisy and have lots of mistakes, that's obviously going to affect your variant calling. It's good to take a look, especially if you have a project with many, many datasets, to make sure that the quality profile of your reads is similar across different individuals. Otherwise you might end up with some outliers that have lots and lots of variants, and you wonder why; well, it's likely because there was a problem with the sequencing, and the sequencing quality is not as good in some of the samples. So quality control is very important, then preprocessing: this, again, is what you were doing with Mathieu yesterday, trimming the reads, removing adapters, and so on. Then the mapping step, which is quite important, and again that's what you did quite a bit of yesterday. And now we move on to the two other sections: calling small variants, which is what we're going to cover, and then after the coffee break, calling larger variants.

Okay, one more slide on this quality control, but it's really key that you look at your data before you start and at every step of the way. That's why IGV is such a great tool, for instance, after you've mapped. And as you'll see, once you call variants, it's good to look at some of them to see if there are any patterns or if the data looks reasonable. So that's really what we're going to be doing. Then a couple of comments I guess I've made already: were all the samples sequenced at the same time? Are there some samples that maybe are a little bit weird? It's good to look, and that's why many of the pipelines generate statistics throughout; it's good to look at these statistics to make sure that the data is quite uniform.
And of course, a lot of times we get questions on what's good or what's bad. In some ways that's hard; it's more that you want to make sure your dataset is uniform, and then with experience you get more familiar with what values to expect.

Okay, but this particular module: what do we want to do? We want to call single-nucleotide polymorphisms and indels. The goal is really a little bit like what we were doing yesterday, except yesterday we were just scanning visually. So this is an example: imagine that the top is a tumor, and the bottom is a sample from that same individual but from the blood. In the blood you have perfect matching to the reference, except for maybe a few sparse errors; those are probably just sequencing errors. But once you sequence the tumor from that same individual, there's clearly a difference at that position. You can see that visually; the question is, how does the computer know that this is an actual variant position?

So this is now a cartoonish version of the same thing. The sequencers are great by now; Illumina sequencing is 99.9% accurate for the most part, but it still makes errors. I'm not sure I'm ready to do mental math at this stage, but imagine you have 100-base reads and, say, 1% errors, to make it easy. Then you have one error per read on average. But in a sequencing run you have 100 million reads, so you do have on the order of 100 million errors in your dataset. So how do you distinguish the errors, which are going to be more or less random, from an actual SNP? That's why we sequence such that every base is covered multiple times. If you only cover a base once, you won't be able to tell whether it's an error or a SNP. But because we cover the genome multiple times at every position (and that's really one of the requirements to be able to do variant calling) we're going to see a real SNP multiple times, which is what we expect here, so we can distinguish the two.

Mathieu already mentioned that the sequencer itself provides some information about every base it reads. For every base that's read, the sequencer reports its predicted error, which it estimates just from how clean the signal is for that particular base. So depending on whether the signal is very clear or ambiguous, it associates a quality score with every base. There is information that comes from the sequencer that says: this base looks suspicious, we called it an A, but we're not really sure. This is converted into a quality score for every base. Typically we hope or expect a quality score around 30, which means there's a 99.9% chance that the base is correct. So when we do the SNP discovery, we also want to integrate that quality information. Here you have the reference on top and the reads below: even with a single read, there's a difference between a read that says, I see a T here and I'm highly confident it's a T, versus, I think it's a T, but I'm not sure. So this information from the sequencer about the quality of every base is something we also want to integrate.
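To put rough numbers on that, here is a tiny sketch of the Phred quality conversion and the back-of-the-envelope error count from the example above. This is just for illustration, not anything the callers actually run; the values are simply the ones from the example.

```python
import math

# Phred quality: Q = -10 * log10(p_error). A Q30 base has a 1-in-1000 chance of being wrong.
def phred_to_error_prob(q):
    return 10 ** (-q / 10.0)

def error_prob_to_phred(p):
    return -10 * math.log10(p)

print(phred_to_error_prob(30))    # 0.001, i.e. 99.9% chance the base call is correct
print(error_prob_to_phred(0.01))  # 20.0, a 1% per-base error rate corresponds to Q20

# Back-of-the-envelope from the example: 1% error rate, 100-base reads, 100 million reads.
per_base_error = 0.01
read_length = 100
n_reads = 100_000_000

errors_per_read = per_base_error * read_length   # about 1 error per read on average
total_errors = errors_per_read * n_reads          # about 100 million errors in the run
print(errors_per_read, total_errors)
```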
So this is a slide that I'm going to be quizzing you all on at the end. This is the actual underlying calculation, this is for GATK, the underlying algorithm to predict whether a position is a variant and how likely it is to be a variant. The details of the formula are not important; what's important is what goes into it. Given the data at a given position, given all of the reads that you have at that position, you want to calculate the probability of a genotype. You want to calculate the probability that it's a variant, that it's different from the reference, and get a score. That probability is converted into a score that is used downstream.

Part of the reason the formula is a little more complicated is that at every position, especially in the context of the human genome, you have two chromosomes, as we saw yesterday. So in some cases, like in this example, if there's a variant you don't necessarily expect all the bases to be different from the reference, because you might have one chromosome that's different from the reference and one chromosome that's the same as the reference. That kind of profile is expected because you've got the two chromosomes. So the formula also has to take into account that at every given position you can have two alleles; that also goes into the equation. But again, the details of the equations are not important. Just know that the formula takes into account how many reads say there's a difference, and it takes into account the quality of the bases, as reported by the sequencer; you can see that here. So this is just the mathematical representation of what we're doing with our eyes when we look at the data and count how many reads say there's a difference, except it also incorporates the quality scores and the fact that we potentially expect two different alleles at that position.

So, moving on a little bit. Just to re-emphasize some of the things that Mathieu talked about: beyond that formula that computes how likely it is to be a variant given the data, there are a few additional things that are important to do. These are four steps, and we're going to go through them; you've done some of these steps already yesterday.

Local realignment, if you remember. One problem is that in the mapping step, each read is mapped individually. But when you're mapping each read individually, you don't have much information about how best to align the very tip of the read. What happens is that you tend to make systematic errors, where the mapping makes the same mistake multiple times. And by making the same mistake multiple times, it looks very convincing; if you compare it to what I showed before, these look very good as variants. So if you simply take that data and feed it into the magic formula that I showed you, you would probably be calling variants here and here.
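By "magic formula" I mean the genotype calculation from a couple of slides back. Here is a rough sketch of the kind of computation it represents: a simplified diploid genotype likelihood that weighs each read by its base quality and allows for two alleles. This is not the exact GATK implementation (there is also a prior on the genotype, and everything is done in log space), and the pileup below is made up for illustration.

```python
import math
from itertools import combinations_with_replacement

def base_likelihood(observed, allele, qual):
    """P(observed base | true allele), using the Phred quality of the base call."""
    err = 10 ** (-qual / 10.0)
    return 1.0 - err if observed == allele else err / 3.0

def genotype_likelihoods(pileup):
    """pileup: list of (base, quality) pairs covering one position.
    Returns P(data | genotype) for every diploid genotype; each read is assumed to
    come from either chromosome with probability 1/2."""
    likelihoods = {}
    for a1, a2 in combinations_with_replacement("ACGT", 2):   # AA, AC, ..., TT
        lik = 1.0
        for base, qual in pileup:
            lik *= 0.5 * base_likelihood(base, a1, qual) + 0.5 * base_likelihood(base, a2, qual)
        likelihoods[a1 + a2] = lik
    return likelihoods

# Made-up pileup at one position: the reference is A, and half the reads say T.
pileup = [("A", 30), ("A", 32), ("T", 31), ("T", 28), ("A", 33), ("T", 30)]
liks = genotype_likelihoods(pileup)
print(max(liks, key=liks.get))   # the heterozygote "AT" wins over "AA" and "TT"
```

The point is just that the evidence is counted read by read, weighted by the reported base quality, while allowing a heterozygous mix of two alleles; that is why a clean 50/50 split like the one above is not mistaken for noise.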
So this local realignment step just identifies all the regions that have variants and then double-checks whether there's a way of repositioning the reads that explains that region better. Now you're using the information from all the reads at the same time, so you're doing a better job. That's what you see here: if you just realign the reads, you find that there's a better way of doing all of this mapping, and the region can now be explained by just one indel. And it's really around indels that these types of mistakes tend to happen; that's why it's useful to do that realignment step. Okay, so that's local realignment.

Duplicate marking, which you also saw yesterday, is there to recover from potential PCR artifacts. Again, the variant calling formula is going to be looking exactly for these kinds of patterns, multiple repetitions of a change. The problem is that something else can lead to this pattern: if the same identical DNA fragment ends up being PCR-amplified and an error was introduced, you're going to be reproducing that error, so it's going to look like a variant. Hence removing reads that are identical. The pattern that you would want to see, if I go back (oops), if I go back to this, this is the type of pattern you want: all sorts of different reads pointing to that variant. You don't want only identical reads, which might be a PCR artifact, pointing to that variant. So you compress the duplicates; you still keep one, but you compress them down to just one, such that they only count as one piece of evidence for the calling.

The next one, which you also covered yesterday, is base quality recalibration. This matters because the quality score produced by the instrument for every base is used in the formula to calculate whether it's a variant or not. If the sequencing instrument makes systematic mistakes, or systematically mis-evaluates the error rate, that's going to translate into bad variants being called. So readjusting the scores before you feed them into the big formula is also going to help you.

The last one, which I don't think you covered yesterday but is also important, is population structure and imputation. What do I mean by that? This is when you have more than one sample. Suppose you have two haplotypes: that particular piece of chromosome has only two versions in your population, and that's typically true, there are not that many haplotypes in a population. So suppose your population only has either ATG or CGA in that region of the genome, and then you sequence, and all you see are reads of the type below. So can you guess the value? Let's see if you guys are awake. Can you guess the value of N? Yes, what is it? T, okay, correct. Notice that what you did in your head is quite different from the formula I was describing before. Everything I said before was: you're just looking at the evidence at one particular position, and you make the call based on that particular position and the number of reads.
What you ended up doing in your head uses the fact that the human genome is organized such that you inherit whole segments of chromosomes, or haplotypes, so there are correlations between positions. When you have information about neighbouring positions and about the population, like here, where when you have a C here you always have a G there, and when you have an A here you always have a T, you can combine information from your population and from things you've seen in other samples to make a prediction at that position. So this type of variant calling is a little more elaborate: you're actually using the fact that there are correlations between variants when you're calling them. Again, this is a slightly more advanced way of improving variant calling when you have multiple samples: you use the structure of haplotypes to improve it. To do that, you have to call variants using multiple samples at the same time.

This is especially useful if you don't have very high coverage. If you have high coverage at a position, then you don't need haplotype-aware calling as much. But there are projects where all you have is 5x or 3x coverage, so every position in every sample ends up being covered just a few times, and if you've only observed something once, you don't know. This is where asking "did I observe it in other samples?" can help you as well. It's a bit more technical, but it's been shown that using multiple samples to improve variant calling can really help, and that's especially true if you don't have very high coverage. So this plot is just a demonstration that multiple-sample calling can improve your variant calling. Yes?

So you sequence each sample separately, but to call variants you make use of the information in the other samples. You don't call one sample at a time because, assume you only have 5x, so most positions in the genome only have five reads, if you see a variant in one sample at one position but you only have one read that tells you it's different, you won't have enough confidence to distinguish an error from a real change. But if you have a hundred individuals, and in those hundred individuals you often see the same change at that position, you can use that information to give you confidence that it's probably real.

So would you weight the confidence differently if it's a sample sequenced multiple times, I mean re-extracted and re-amplified, compared to just sequenced multiple times? Would you weight it the same or differently? Yes, so this idea of saying "I have multiple samples" to make the call would be used in the algorithm. If you have multiple datasets for the same sample, you can also use that. And actually, suppose you sequence the same individual twice and you only see the variant in one of your batches of reads; that's probably a bad sign. So it's very useful if you can verify that you indeed have reads coming from both read sets.
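Just to make that haplotype idea concrete, here is a toy version of the ATG/CGA example from the slide. Real haplotype-aware callers and imputation tools (Beagle, for instance) do this with proper statistical models over many samples; this sketch is only meant to show how knowing the possible haplotypes pins down an otherwise ambiguous base.

```python
# The only two haplotypes that exist in this little stretch of chromosome in our population.
haplotypes = ["ATG", "CGA"]

def consistent_haplotypes(read, haplotypes):
    """Return the known haplotypes that agree with the observed bases ('N' = unread/unsure)."""
    return [h for h in haplotypes
            if all(obs in ("N", hap_base) for obs, hap_base in zip(read, h))]

# A read where the middle base could not be called confidently.
print(consistent_haplotypes("ANG", haplotypes))   # ['ATG'], so the N can only be a T
print(consistent_haplotypes("CNA", haplotypes))   # ['CGA'], and here it can only be a G
```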
And a little bit like that, one of the things that's nice in IGV, for instance, is that you can double-check that you've got reads going in both directions, coming from both read sets, overlapping and calling that variant. Oh, right, no, absolutely, you can also do that to verify that the calls are reasonable. That's very useful as an annotation on the variants as well: once you've called the variants, if you have a trio, you could annotate the variants and make sure that the inheritance makes sense, which is another way you can actually improve the quality of your variants. And using things like 1000 Genomes. Yeah, we'll get to the annotation, I guess I'm getting into that a little bit. Here we're really just calling, and then there are lots of ways of annotating the variants to further filter or improve them; we can look at the frequency in 1000 Genomes and things like that. This is still not using any data other than the samples you've sequenced: how best can you call the variants just using your own samples?

Yeah. So within GATK, you have two modes in which you can run it. You can use it sample by sample, but you can also feed it multiple samples and use the HaplotypeCaller module. So again, the difference is whether it's just doing the relatively simple formula that I showed, which basically only uses the data from one sample, the number of reads, and the quality, or whether it's incorporating into the score some information that comes from the other samples and the haplotypes. Yeah, sometimes you don't get some. That's right. Well, if you're using the haplotype information, it's using a different formula that incorporates this structure, and this is what this plot was showing: the performance is also better, especially when you don't have a lot of reads. Because if you don't have a lot of reads and that position in that sample is only covered by a couple of reads, sometimes it's going to get it right and sometimes it's going to make a mistake. If instead you look at multiple samples at the same time, the calls in each sample will be a little bit better.

So if you look at the GATK framework, which is really what we're following for our variant calling in this case: you had module two yesterday, which covered the mapping, the realignment, the duplicate marking, and the recalibration; now we're in the context of variant discovery, especially with multiple samples, and this is what we're going to do in the practical. Not the practical of this module, more the structural variant one, but we'll make this distinction of calling variants using just one sample or using multiple samples; that's going to be in the second practical, but you get the idea. After the mapping, we're going to actually do the variant calling.

To give you a sense of the size of the datasets and the time the various steps take: yesterday it was mentioned that the files that come out of the sequencer, if you're sequencing a human genome, are on the order of 200 to sometimes 500 gigabytes if you have high coverage. So the first step, well, this is after the mapping: we get the BAM files, which are roughly this size.
Calling variants across the whole genome actually takes quite a bit of time, whether you're using GATK, which is what we're going to use, or other tools like SAMtools or FreeBayes; there are other approaches with slightly different formulas for how you calculate where the variant positions are. We're focusing on the GATK framework, but again, very similar strategies are used by other tools. If you're processing all of that data across the whole genome to call variants, it takes 10 hours or so. For that reason, we won't be doing it on the whole genome; just like yesterday, we're going to do it only on one particular portion of the genome. What you get after this is really a list of variants, in the VCF format, and we'll go over that format a little bit. These are all the sites that appear to be different from the reference. Those files are much smaller, because now you're not keeping all the data, just the positions that differ from the reference, along with the extra information about the evidence that you have.

So now that we've called the variants, I'll move to the second section of what we're going to do in the module, which is, and we touched on that already, how do you filter and how do you annotate the variants? I've been talking about this VCF format, so let's look a little bit at its structure; there are links on the website with the full description of the format. The VCF file starts with lots of header lines that carry information about what's down below and how things were calculated: what you used as a reference, and descriptions of the annotations, because those are encoded in a way that's a bit cryptic, so the header tells you what's in the file. In the previous files, each line corresponded to a read; here, every line is really one variant. It identifies the position of the variant: which chromosome, the position, whether there's a dbSNP ID (we'll get back to that). Basically it says the reference genome at that position has this sequence and we've observed this variant; then a quality score, a bit like the mapping quality, but now it's the quality of the variant itself; then whether it has passed, so the variant caller reports whether, according to its algorithm, the call qualifies as a good variant and has passed the filters; and then more specific information: the number of samples, the depth of reads covering that position, and so on. We'll look at more of the information later, but that's the rough format: one line per variant, with the reference at that position and the variant we've observed.

Yeah, is the quality score Phred-scaled? It's on the Phred scale, but the way it's calculated means there isn't an interpretation as simple as a probability of error; still, it's the same scale. And I'd say it depends on the variant caller itself; sometimes they actually use a different scale. The mapping quality scores usually use the Phred scale, but depending on the variant caller, the variant quality is not necessarily Phred-scaled.
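To make the format a little more concrete, here is a made-up VCF data line and a minimal sketch of pulling out the fixed columns. The record and its values are invented for illustration, and in practice you would use a proper library (pysam, for example) rather than splitting lines by hand.

```python
# One invented VCF data line: the eight fixed columns, then FORMAT and one sample.
record = "chr20\t123456\trs0000000\tA\tT\t57.3\tPASS\tDP=42;AF=0.5\tGT:DP\t0/1:42"

fields = record.split("\t")
chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

# The INFO column is a semicolon-separated list of KEY=VALUE annotations.
info_dict = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)

print(chrom, pos, ref, ">", alt)          # chr20 123456 A > T
print("quality:", qual, "filter:", filt)  # quality: 57.3 filter: PASS
print("read depth at this site:", info_dict["DP"])
```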
You really have to look specifically at the algorithm to know what its quality score means. I usually use it more as a ranking, because different algorithms use different scales. For the specifics, it's good to look at the manual of the software itself.

All right. So from the variant calling, we get one of these raw VCFs with all the calls. What we probably want to do next is some variant filtering, because typically the raw variant calls contain a lot of false positives. So how can you filter? One way is to filter directly on specific parameters: you can filter based on the quality score or the depth of coverage. You might say, I only trust variants that are above a particular score and that have 10 reads covering them. You can come up with your own rules for what looks like a good variant, and one way is to look at some of the variants in the browser and then decide. But it's still a bit arbitrary, and it's not clear exactly how to define these cutoffs.

By now there are better ways of filtering the data, in particular within GATK, with what's called the variant recalibrator. This builds up a set of rules itself, based on the raw output, the scores and the depth of coverage, and it will reorder and filter variants that look suspicious. The way it does that is by using known variants. So far we were really doing this sample by sample, using only our own data, but there's a way to use known variants to improve the variant scores and the variant ordering. That's called variant quality score recalibration. The idea is that you use known variants, coming for instance from the HapMap project: you can use the very high quality variants from HapMap to check whether you're missing a lot of variants. You can also use dbSNP, which contains a lot of variants, some of which are errors in some cases, that have been detected in other projects and reported as having been observed before. Some of the work done by the team that built GATK was to look at the overall profile of the variant calls: you have a set of very good variants and a set of more suspicious variants, and by comparing their scores you can recalibrate the raw scores that are output. You learn from previous mistakes and from good data, and you recalibrate your scores so that you can filter out suspicious-looking variants. That's one of the steps implemented in that workflow as well: you input what are known to be good variants and bad variants to fine-tune the variant filtering.

So this is the filtering step that's recommended. Again, you can have your own set of filters if you like: keep things above a certain score, above a certain coverage, and so on, and there are reasonable ways of making these choices. What's recommended in the context of the GATK framework is to make use of known real SNPs and known potentially problematic variants to do that filtering automatically. That's one of the steps we're going to cover as well.
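As a sketch of what that first, hand-made kind of filtering looks like (as opposed to the recalibration approach), here is a minimal pass over a VCF that keeps only records above a quality and depth cutoff. The thresholds and file names are arbitrary examples, and in a real pipeline you would use GATK's own filtering tools rather than rolling your own.

```python
def passes_hard_filter(qual, depth, min_qual=30.0, min_depth=10):
    """Keep a variant only if its quality score and read depth clear the chosen cutoffs."""
    return qual >= min_qual and depth >= min_depth

def hard_filter_vcf(in_path, out_path):
    """Copy a VCF, keeping header lines plus only the records that pass the hard filter."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith("#"):              # header lines go through untouched
                fout.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            qual = float(fields[5])               # QUAL column
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            depth = int(info.get("DP", 0))        # depth of coverage from the INFO field
            if passes_hard_filter(qual, depth):
                fout.write(line)

# hard_filter_vcf("raw_variants.vcf", "filtered_variants.vcf")
```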
Now on a slightly different topic, not filtering but annotating. This is a project that I was part of that I wanted to show as a motivating example, to give you a sense of the challenge ahead. In this project we sequenced a hundred whole-genome kidney tumors, and here we're not looking at germline variants but at somatic variants, new mutations in the tumor. Across those hundred whole genomes we detected more than half a million point mutations; every little square here is a thousand mutations. On average each tumor had close to 5,000 or 6,000 mutations. There are mutations everywhere, so one challenge is: where do you start looking?

One obvious annotation is whether these are coding mutations, and you see that only a small subset of the mutations are coding. That's great, and it's one basic annotation you can do. But it's still a little unsatisfying, because you've sequenced the whole genome, you've detected all of these mutations, and there are only very few that you're able to annotate; it's a very small subset that's hitting genes. Those are still the interesting ones and you want to annotate them as such, but you get a lot of mutations and you don't really know where to look. So this annotation step is quite important. There are tons and tons of data, and we've started talking about some of them: 1000 Genomes, dbSNP and so on. Especially if you're sequencing something like the human genome, there's a lot of annotation available. So it's important not just to take the raw VCF or the filtered VCF, but to also add information about whether a variant hits a gene, whether it's coding or non-coding, and if it's non-coding, whether it hits some of the ENCODE elements, and so on. There are different ways this annotation can be done; the one we're going to use in the course is the SnpEff software, developed by Pablo Cingolani, who was at the genome centre for a while. It has the advantage of already regrouping many databases and doing many annotations all at once.

To go back to the overall pipeline and the size of the datasets and the time it takes: by now we're at the stage where we have much smaller files, the raw VCFs with the variant calls. It's good to then use GATK again to filter these variants, and then annotate them, and there are different tools for that, so that you know which ones are coding, which ones are more likely to be impactful, and so on. I think that's all I had in terms of an introduction before going to the actual lab.

So, yeah, kind of a tough question about the tumor sequencing. Tumors are difficult because typically you're dealing with a mixed sample: a bunch of healthy cells and tumor cells. So you're not looking at an even distribution; you might have a mutation that's only present at a small percentage, maybe 5 or 10%. So how much extra sequencing coverage do you need to be able to confidently call it? Not only do you need more sequencing, you also need to account for the fact that the formula I showed made assumptions about the genotype: it basically expected roughly 0%, 50%, or 100% of the reads to support the variant in order to call it.
So you also need to adjust and tell the algorithm: I don't necessarily expect these proportions, and I want you to call variants that are at 10% or 20%, because in a mixed sample you might have a somatic change that's in only 20% of the cells, so you don't expect the usual proportions. You need sufficient coverage, and you're still limited if the fraction is very small. The advantage, though, is that typically, unless you're doing something very fancy, you're looking for something that's probably present in most of the tumor cells; there are other applications, but typically you're not looking for the rarest variants in the tumor. So in that sense you don't necessarily need super high coverage, but you definitely need to adjust the algorithm so it knows not to expect 0, 50, or 100% of the reads to support a variant.

Is that something that you as the bioinformatician have to do? Well, there's specific variant calling software for somatic mutations, something like MuTect. In some cases it's even better if you give it the two samples, the normal and the tumor, and then it really adjusts and changes the parameters; it's a bit caller-specific, yeah.

So the formula at the beginning, you highlighted this part and said it assumes diploid; does GATK expect you to work with a diploid organism? There are settings for that too, so you can specify when that's not the case. Diploid is the default expectation, but you can change it. But yeah, that's definitely something you need to adjust if you're not dealing with a diploid genome. Yeah, from the center still, but it seems to be... It's true that it's definitely tailored for that, but based on the settings, the raw variant calls, for instance, have all of this information initially, so you can still revisit them; it's true, though, that the quality or the filtering, for instance, would not apply if you're actually looking at a population. That's right. There are definitely callers and specific settings, I'm less familiar with them, that allow you to do it in the context of a population. For pooled population data, you have a pool of the population in your data, and for some tools you tell it, I estimate I have this many haplotypes in my data, and it uses that information in the calling. That would, I think, be popular. What's the name of the software? Population. Population. Yes. Put a note on the sticky note. Yeah.

I was going to say, for that one slide, when you're like, okay. Right, so that's really GATK plus Beagle; this is an older slide. GATK HaplotypeCaller is similar to this GATK plus Beagle; that's the HaplotypeCaller module within GATK, I believe. But again, you only get this type of boost and improvement for very low coverage sequencing. This is 1000 Genomes-type sequencing where there was 5x coverage, and then of course it makes a huge difference. Otherwise it makes a difference, but it's not as dramatic.
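To put a rough number on the coverage question for somatic calls in a mixed sample, here is a back-of-the-envelope binomial calculation. This is only for intuition, it is not what MuTect or any real somatic caller does, and the 10% allele fraction and the three-read threshold are arbitrary choices for the sketch.

```python
from math import comb

def prob_at_least(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A heterozygous somatic mutation present in ~20% of the cells of a mixed sample
# shows up in roughly 10% of the reads at that position.
allele_fraction = 0.10
min_supporting_reads = 3   # arbitrary evidence threshold for this sketch

for coverage in (10, 30, 100, 300):
    p = prob_at_least(coverage, allele_fraction, min_supporting_reads)
    print(f"{coverage:>4}x coverage: P(at least {min_supporting_reads} reads show the variant) = {p:.2f}")
```

At 10x you would usually miss such a variant entirely; with more coverage you almost always see at least a few supporting reads, which is the sense in which coverage and the expected allele fraction trade off against each other.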
It does, because even if you have 30x sequencing, there are some regions of the genome where you typically have much lower coverage, and again it's going to help you in those particular regions. But in regions where you have sufficient coverage, it doesn't help you that much.

Is there some bias where we might miss the real variant because of the realignment? So the realignment is based not on the indel calling itself, but just on variants: regions that have lots of variants get realigned because there might be indels there. The indel realigner component is based just on that, because if there's an indel, it's going to lead to lots of other apparent variants around it, and that's what gets realigned. Indel calling itself, I guess I didn't have much on that in the slides, is still not nearly as accurate or as easy, especially for slightly longer indels or in regions that are a little more difficult. The problem is that if there are real indels, the mapping step breaks down. This goes back to what Jared was talking about a little bit: if it's just a SNP, you're going to get accurate mapping of reads in that region; if it's an indel, the mapping step, which is the first step that lets us look into that region, is not as good. So indel calling is definitely lower quality in general. We're not covering that in much detail, but typically, especially if you're not using software specialized specifically for indels, you have to take those calls with a grain of salt. There are lots of other algorithms that specifically try to call indels, but by default the quality of indel calls is not as good as for single nucleotide variants.

Next question: do we need to manually add that information? Will it add what? After annotating with the databases, do we need to add the information in the INFO section ourselves, like the AC field and all the frequency information that's in those databases? Right, right, right. Yes, you'll see that at the end of the practical; much of that will be added automatically, it's part of the SnpEff annotation. Okay, so let's move on, unless there are other questions.