You'll have to follow the presentation along with me. As with all the slides, feel free to reuse and share these. What I'll be talking about in this first module is small variant calling and annotation, which is easier to see on the slides than on the screen. It's really this idea that once we've mapped all the reads onto the reference, one of the key things we're interested in is identifying the places that vary. The reason we're sequencing different genomes is that we want to pick up the sites that are variable, that differ between individuals. This is relevant in different diseases. I don't have a lot of slides on the motivation for why this is important, but one of the main goals of these resequencing experiments is to identify variable sites like this. The objectives cover quite a lot of things. If it's not clear what variant calling is, hopefully it will be clear after this module: understanding the basic principles of how we do variant calling, and knowing what's important. A lot of the things you were doing yesterday, maybe without knowing it, have a big impact on variant calling, and that's exactly what I'll show you in this module. I'll also show you how to filter and annotate variants. All of that is relevant to small variants: single nucleotide variants and short indels. In the practical, starting from a BAM file similar to the one you generated yesterday, you'll call variants, annotate those variants, and learn about the VCF format. Just like yesterday you saw the FASTQ format and the BAM format, today we're going to go over the Variant Call Format, the VCF format, a little bit, and visualize a SNP. So this is a lot to take in. You did a lot of the work yesterday, and now you're going to put it together to hopefully make sense of it. So I like this slide.
This is from a senior bioinformatician at the Genome Center. I like it because it's just an overview of some of the steps when we do variant calling. To begin with, a little bit like you did yesterday, you start by cleaning up the reads: trimming them, and in some cases removing adapters. In module two you did the alignment of the reads, which is really at the center of variant calling. Once you've aligned the reads, you call the variants, annotate the variants, and you can call the structural variants, which is what we're going to be doing today. In parallel to that, of course, you're collecting a lot of statistics. Now, without looking at the slides and without cheating, there's actually something that all of you forgot to do yesterday. What's the first thing you have to do for this pipeline, or for any pipeline like it? I'm sorry? Well, okay, fair enough, after that. What's the first analysis step that you have to do with any of these pipelines? Yes, that's right: we have to look at the data. I work in a genome center. If you get the sequences from us, they're perfect. Not true. I mean, sort of true. But it's really key to look at your data at every step along the way, and especially at the first step. When you get the FASTQ files, you have to look at them, especially if you have multiple samples that were generated over time. If you're getting data from the internet, you have to look at that data before you do any analysis, because otherwise you don't really know what you're going to get. If one of the samples is of really horrible quality, for instance, you might get all sorts of weird things down the line. I don't know if your background is working in the lab, but it's the same thing. You have to check at every single step.
You have to look at what you did and what the data looks like. The first step, and the most important, is this quality control, so look at it. In the lab, actually, we'll go back and look at the quality of the files that you used yesterday as well. The main analysis steps, and this is for the variant calling and structural variant calling that we're going to be doing today, are really sort of general: quality control, then pre-processing, and based on that quality control, what do you do in your trimming? You covered the mapping of the reads extensively yesterday, and then you diverted into two other modules that were also very important, but not directly relevant to the actual pipeline. After you map, you do variant calling, which is what we're going to be covering now, the calling and the annotation, and then after the break we'll go to structural variant calling. But I just wanted to re-emphasize that these initial steps are really quite important. I won't spend much time on this, but I wanted to mention it again. I'll come over here. In terms of quality control, before you start an analysis, it's important that you look at your raw data, because otherwise everything you do downstream may not make sense. Were the samples all sequenced on the same instrument? This is especially relevant if you're getting data from different places. Are there any technical issues affecting some of the samples? It's really important to get a sense of what your data looks like. One tool that's widely used and useful with next-generation sequencing data is FastQC, and there are different variants of that tool. I ran it on the dataset that you were using yesterday. This is part of the output, and it's reassuring: the dataset that you were using yesterday is pretty good.
So, you can download this tool from the web and run it, which is the easiest way. It's also included in Galaxy, so maybe that's part of what you're going to be doing this afternoon with Galaxy. If you run this tool, you get basic statistics. It's hard to see here, but you get how many reads you have and what the read length was, and then you get all the other metrics, like the quality of your reads. This profile is a typical profile for next-generation sequencing data. Here on the y-axis you have the quality scores. I guess this is 30 or 20, I can't read it, it's even smaller here. I think this is 30. You have it on the slide? Yeah, 30, I guess, right? 30 is in green, and 30 means 99.9% accurate. This is a distribution over all of your reads, from the first base down to the last base, and you see that by the time you get to the end of the read, some reads have much lower quality. When you do the trimming step, it actually removes some of these. But overall, this is pretty good, and a lot of the aligners and variant callers can take that into account. So this was the first read of your dataset. The second read is also not bad, but one of the summary reports flags something, because for all the reads, they know what the position of the read was on the slide, on the sequencer. What this shows is that there were a number of regions on the slide where the quality scores were not as good. Again, this is not an extreme case of a bad dataset. It just shows that the second read didn't have quite the same quality, but it's still pretty, pretty good. These are cases where it looks more or less good.
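To make the "30 means 99.9% accurate" statement concrete, here is a minimal Python sketch of the standard Phred relationship, Q = -10 log10(p), between a quality score and the probability that the base call is wrong. The function names are made up for this illustration and are not part of FastQC.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:.4%}, accuracy {1 - p:.4%}")
```

So a drop from Q30 to Q20 at the end of a read means going from one expected error in a thousand bases to one in a hundred.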
But again, I encourage you to run a tool like that on your dataset before doing any analysis downstream, because if you get a lot of skewed values or something like that, you want to know before you get started. Let me just... Another thing is adapters. Again, some of the tools downstream will catch that and won't be affected by the fact that there might be some adapter sequences in your reads, but it's a good thing to watch for and to know that it's in your data. If you get samples back, what fraction of your reads actually have adapter sequences? All of that information also comes out, at some level, from these QC reports. This is just an example, and it's sort of an old Solexa/Illumina illustration, where the read might actually read through the fragment and into the adapter, so your reads would contain the adapter sequence. One way to catch that, and it comes out of these QC reports, is that the tool looks for over-represented sequences in your reads. If you see that you have a lot of over-represented sequences, you might have either duplicates or adapter sequences. This tool in particular was mentioned yesterday by Nathan for de novo assembly. There are a number of tools out there that allow you to trim. After you've done your QC, if you see that there are lots of adapters, or lots of reads of bad quality, then depending on your application you might want to clean up your file. For cutting adapter sequences, you just feed in the adapter: we know what the adapter sequence is for Illumina sequencing, so you just trim the reads that have that adapter sequence. You can cut the bases off at the end of the read. You can drop the reads that don't have sufficient quality, and so on.
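The three clean-up operations just listed (cut the adapter, cut low-quality bases off the end, drop reads that end up too short) can be sketched in a few lines of Python. This is only a toy illustration under simplifying assumptions, not how any particular trimming tool works: it matches the adapter exactly, whereas real trimmers tolerate mismatches and partial adapters at the read end. The function names, the thresholds, and the adapter string used in the test are all choices made for this example.

```python
def trim_adapter(seq: str, qual: str, adapter: str):
    """Cut the read at the first exact occurrence of the adapter."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Remove bases from the 3' end while their Phred score is below min_q.

    `qual` is the FASTQ quality string; `offset` 33 is the common Phred+33 encoding.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clean_read(seq: str, qual: str, adapter: str, min_q: int = 20, min_len: int = 5):
    """Apply both trims; return None if the read ends up shorter than min_len."""
    seq, qual = trim_adapter(seq, qual, adapter)
    seq, qual = trim_low_quality_tail(seq, qual, min_q)
    return (seq, qual) if len(seq) >= min_len else None
```

Whatever tool you use, it's worth counting how many reads go in and how many come out, so that a misunderstood parameter doesn't silently remove most of your data.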
So this is just, I guess, a heads-up that this step is really quite important, especially looking at your data and making sure that your different samples are comparable. They don't need to be perfect. Yeah? Yes, how do we choose the tool. So, I mean, this step is pretty simple, and we selected this one based on usability. You look at the options and at what you want to do. There are no fixed criteria for what you should be removing. We ended up picking this one because of the options of the tool, what it allows us to do, and how we could run it. It was what we needed. There's nothing too fancy in this step: you're really just removing reads that have a given property. It's pretty straightforward. Yes, definitely. That's exactly right. In my mind, every time you do one of these analysis steps, you look again at your reads, or at your mapped reads: how many are there, how many did you remove? Because sometimes you misunderstood the parameters and you ended up cutting 90% of your reads, or something like that. You want to know what you've done after every step. So, any questions on that part? Again, I didn't want to go into too much detail. [inaudible audience question about duplicates] So removing duplicates was one of the steps that we did yesterday. These are PCR duplicates, where the reads have exactly the same start and end. You can do it later, after mapping, which is actually a good thing, but you could also do it here. It seems it's better to remove the duplicates after mapping, but you would already see in the FASTQ that you have lots and lots of sequences that are identical. [inaudible audience exchange]
And again, initially it was very important to do these filtering steps, because the mapping and the variant calling were very much affected by them. But now, to be honest, the trimming step is not strictly necessary, because a lot of the variant callers take the quality scores into account and will down-weight these bases. So you don't really need to do it, but it's still good to get a sense of what your data looks like. If anything, and it's a little bit like what we're going to be doing, you can try with and without those steps and see if you get different results. If you don't trim, how many variants do you get? And if you trim, how many variants do you get? If you get 10 times more without trimming, you should be a little bit cautious. Yes? So the question is whether cleaning up is important for small variant calling or for structural variant calling. By the time we get to structural variants, there are a lot of false positives to begin with, and those false positives for the most part are not caused by the quality of the reads. So I don't think it would affect structural variant calling as much as it affects small variant calling. Small variant calling is actually a pretty robust pipeline. If you have bad data, it might lead to not-so-good results, but in general it should be giving good results. So there, I think, cleaning up your data has a bigger impact than for structural variant calling. For structural variant calling, and we'll get to that in a bit, the problems would be chimeric reads and things like that, but that's not something you would easily be able to clean up in this trimming step. Yeah? [Audience question: the k-mer content always remains flagged after trimming; it becomes orange, but not green.] Good question. We've had lots of these questions.
If you look at FastQC, you don't know whether it's good or not, right? The first time you run it, you'll get some orange and you'll get some red, and is that a problem or not? The k-mers in particular are hard to interpret, from my perspective. I think the best approach is to do it two different ways: clean the data and go forward with your analysis pipeline, or don't clean it, and see if it changes the results. I think that's the best way of doing it. For the most part, it shouldn't be affecting things too much, and it's normal to have some of these orange warnings and things like that. I think that's all fine. What matters more is whether your samples all look the same. That's really important. It matters more whether you have 50% duplicates or something like that. So we don't necessarily have everything in green, no. Again, like here, even this is absolutely fine. If you compare the read one distribution of scores with read two, there are a few more low scores, which are coming from this particular region of the slide. You could remove them, but I don't think it would change very much. You can try it with and without, depending on how much you care about any false positives. Yeah? [Audience question: does the sequencing center provide some information and a summary about the sequencing?] I know that we run FastQC, and as a sequencing center we provide that same type of output. But it's really easy for you to do it on your own as well. So if it's not provided, I encourage you to do it, and then you can ask questions. You do expect this type of profile where the quality decreases. And, I'm not showing it here, but there are also statistics on the length of the reads.
If you end up having shorter reads and longer reads, there might be something that was mixed up, and things like that. So it's really easy, and definitely worth your time, to look at the data that you get before you start the mapping. Okay, moving on to the actual module: SNP and variant calling. Yesterday you mapped, and you explored IGV quite a bit. The goal of this particular module is how we identify sites like this one. This is a tumor sample and a normal sample from the same individual, and you see that there's a variant: all the reads point to a difference in the normal sample here. It looks easy if you look at it this way, but in practice it's a little bit trickier. The main idea is what you can guess from the slide itself. For actual variants, you expect many reads to point out that there's a difference. So we look for sites where many reads support a variant at that position, and that's what distinguishes variants from sequencing errors. 99.9% accurate sounds super accurate, but that means 0.1% errors, and there are so many reads and so many positions that there are actually lots and lots of errors. There are millions and millions of sequencing errors in your dataset. But those errors should be spread out throughout the reads. So we look for positions where multiple reads point to a difference. Yes? So, no, because we're looking for these differences here, right? We could do error correction using the reference, but this is a site that actually has both haplotypes: it has the G and it has the A. In this step, every read is mapped individually to the reference, and afterwards we look for these differences, as opposed to collapsing the reads together, because this is not an assembly problem.
The challenge there is that, because you don't have the reference, you want to know which reads go with which reads and start putting them together, which is really hard. Yep? Well, again, this is basically what we're going to be doing with the variant calling itself. These isolated errors will simply be ignored, so in a way, that is a sort of error correction. This is saying, you know, this is a T. For every position, we'll have an actual call that says, for example, this is very, very unlikely to be a G at this position. That's exactly what the variant calling is going to be. It's going to give us, for every position, the probability of the different variants, I mean of the different genotypes. I'm not going to go into this in much detail. I'm putting it up just to scare you non-mathematicians a little bit. But this is still the way variant calling is implemented. Yeah, that's a good question. I think it might be quality scores, right? Based on quality, you actually have a capital A or a small a. In IGV, it's the shading. I didn't go into that, but for variant calling we do need to cover every position multiple times; typically 30X is the target. The reason is that if we only cover every position one time or two times, we won't be able to tell whether something is a sequencing error or an actual difference in the genome. So typically with whole genome sequencing we have a target of 30X, and with exome sequencing it's even higher, it's 100X. The reason is to get sufficient reads covering every position, such that we can distinguish errors from real variants. I guess this slide is pretty important, because this is really the main idea of how variant calling is implemented. It's going to take into account the number of reads that observe a difference. It's going to take into account the quality score of every position.
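As a back-of-the-envelope check of why "millions of errors" and "30X coverage" go together, here is a small Python sketch. It assumes a genome of roughly 3 billion bases, a uniform 0.1% per-base error rate, and independent errors (a simple binomial model); all three are simplifying assumptions for illustration. Note that it counts any error at a site, not the same wrong base, so it overestimates how often errors agree with each other.

```python
from math import comb


def prob_at_least_k_errors(depth: int, k: int, error_rate: float) -> float:
    """P(at least k of `depth` reads carry an error at one site), binomial model."""
    p_fewer = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )
    return 1 - p_fewer


GENOME_SIZE = 3_000_000_000  # assumed approximate human genome size
DEPTH = 30                   # typical whole-genome target mentioned above
ERROR_RATE = 0.001           # Q30, i.e. 99.9% accurate

expected_errors = GENOME_SIZE * DEPTH * ERROR_RATE

if __name__ == "__main__":
    print(f"Expected erroneous base calls: {expected_errors:,.0f}")
    for k in (1, 2, 3):
        print(f"P(>= {k} erroneous reads at one site): "
              f"{prob_at_least_k_errors(DEPTH, k, ERROR_RATE):.2e}")
```

The point is the contrast: a single erroneous read at a site is common, while several reads agreeing by chance is rare, which is why the caller needs enough depth to tell the two apart.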
So that's what I have on this slide, which again, we don't need to go into in too much detail. Basically, this is going to compute the probability of any particular genotype given the data. Given all of the reads at that position, what is the probability of every genotype? One thing that makes this more complicated is that the human genome is a diploid genome, so there are two haplotypes. Just like we saw in the previous slide, it's not that the position is a G or an A. At that particular position, there's probably both a G and an A. [inaudible audience question] But again, I don't want to spend too much time on this; it's there if you're interested in the detail. To me, the main point is how this is done. It integrates all of the data at a particular position to provide a score, a probability assignment, for every possible genotype at that position. It incorporates information on the number of reads supporting a different genotype, and also the quality scores of those bases, because it takes into account the fact that you are making errors. So it's going to do a little bit like what we were saying: it's going to tell the difference between sequencing errors and actual variants. And this is the place where I said trimming used to make a big difference. It doesn't as much now, because these particular software packages, including the one we're going to be using, which is the GATK pipeline and framework, take into account the quality scores of the bases. So even if you don't do the trimming step, those particular bad reads or bad bases will not count for much in the output score. That's taken into account.
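For those who do want to see the idea in code, here is a minimal Python sketch of a diploid genotype calculation of the kind described: for each of the ten possible genotypes, multiply the probability of every observed base given that genotype, using the base quality as the error probability, then normalize. This is a textbook-style simplification written for this illustration, not GATK's actual model: it assumes independent reads, a flat prior by default, and errors spread evenly over the three other bases.

```python
import math
from itertools import combinations_with_replacement


def genotype_posteriors(pileup, priors=None):
    """pileup: list of (base, phred_quality) observed at one position.

    Returns {genotype: posterior probability} over the 10 unordered
    diploid genotypes (AA, AC, ..., TT).
    """
    genotypes = ["".join(g) for g in combinations_with_replacement("ACGT", 2)]
    if priors is None:
        # Assumption: flat prior over genotypes.
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    log_post = {}
    for g in genotypes:
        total = math.log(priors[g])
        for base, q in pileup:
            # Error probability from the base quality, capped so a Q0 base
            # is treated as uninformative rather than impossible.
            e = min(10 ** (-q / 10), 0.75)
            # Each read comes from one of the two alleles with probability 1/2.
            p = sum(0.5 * ((1 - e) if base == allele else e / 3) for allele in g)
            total += math.log(p)
        log_post[g] = total
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


if __name__ == "__main__":
    pileup = [("A", 30)] * 10 + [("G", 30)] * 10
    post = genotype_posteriors(pileup)
    best = max(post, key=post.get)
    print("most likely genotype:", best)
```

Notice how a low-quality base contributes a factor close to the same value for every genotype, which is the sense in which bad bases are down-weighted without any trimming.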
What we'll be covering more, though, is the following. Yesterday we talked quite a bit about local realignment, and we talked about duplicate marking. Then there's base quality recalibration, which we didn't cover, but which I wanted to talk about a bit, and then population structure and imputation. Even if you have this method and this model that I talked about before, all of these steps, during alignment and post-alignment, are going to make a big difference in variant calling. So I wanted to go over these different steps, starting with local realignment, which you saw yesterday. It was mentioned that around insertions and deletions, reads are frequently not aligned very well, and this leads to these types of patterns. If we were doing variant calling on this type of data, which is alignment without the realignment step, we would probably call these as variants, because there are lots of reads supporting a difference at that position. This is why it's important to do the realignment step. Given that it looks like there's an indel here, and that the original alignment was done individually for every read, the realignment step now takes into account all of the reads at that position and does a much better job of aligning them, based on all of the information around that region. We'll go over that particular step in the lab: what happens if you don't do it, and what happens if you do. So this is one thing that improves variant calling. Duplicate marking is another thing that you did yesterday. Here again, based on what I was saying, the variant calling takes into account the number of observations, the number of reads that see a difference at a given position. The problem with duplicates is that this was basically one read that had one sequencing error, or one error that was then amplified, such that you have many reads, but they're all the same read and they all have the same error.
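The duplicate-marking idea can be sketched in Python as follows: reads that map to the same chromosome, start position, and strand are treated as copies of one original fragment, and all but the best-quality one are flagged. This is a deliberately simplified illustration, assuming single-end reads described by plain dictionaries with made-up field names; real duplicate-marking tools work on paired-end alignments and handle more cases.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """reads: list of dicts with keys name, chrom, start, strand, mean_qual.

    Returns the set of read names flagged as duplicates: within each group of
    reads sharing (chrom, start, strand), every read except the one with the
    highest mean quality is flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["start"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["mean_qual"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates
```

With duplicates flagged, ten identical copies of one erroneous read count as a single observation instead of ten.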
So if we do variant calling using this duplicated data, we're just going to call this a variant, but there's only one piece of evidence for that variant, and that's again wrong. So the variant calling, I'm sorry, the duplicate marking is quite important, because it really collapses that down: we have a suggestion of a difference here, but it's only one read that's suggesting that difference, so it's going to be down-weighted and we're not going to call a variant there. So duplicate marking is another thing that's quite critical, especially for variant calling from DNA data. RNA-seq is a whole other business, and you're going to hear about that in the next few days. Base quality recalibration is another one that we didn't cover yesterday, but that I wanted to point out. So what's that? The sequencers give you information about the quality score, but it was observed that the sequencers make systematic mistakes. They tend to overshoot the quality of their bases, so the quality score that's provided in your FASTQ is off by a little bit. If it was off by a little bit but in a random way, it wouldn't matter. It turns out that it's off in a systematic way. The sequencer makes systematic mistakes relative to the position on the read, and it also makes systematic mistakes based on the dinucleotide. So if it's a G that came just after a C, the quality score tends to be off more often than not. This leads to small mistakes, but because they are systematic, if you have lots and lots of reads making these same mistakes, it tends, again, to lead to some false positive errors. This is also implemented within the GATK framework, where after you've mapped all the reads, you can recalibrate the qualities. It's basically done by looking at all the reads and all the positions, counting which bases are correct or incorrect, and changing the quality scores accordingly.
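Here is a minimal Python sketch of the recalibration idea: group the mapped bases by their reported quality and their dinucleotide context, count how often each group actually mismatches the reference, and convert that observed error rate back into a quality score. It is a simplification written for this illustration, not GATK's implementation, which uses more covariates and skips known variant sites so that real variants are not counted as errors. The function name and input layout are assumptions.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """observations: iterable of (reported_q, dinucleotide_context, is_mismatch).

    Returns {(reported_q, context): empirical Phred quality}, computed from the
    observed mismatch rate in each bin.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for reported_q, context, is_mismatch in observations:
        bin_counts = counts[(reported_q, context)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing keeps the rate away from 0 and 1.
        rate = (mismatches + 1) / (total + 2)
        table[key] = round(-10 * math.log10(rate))
    return table
```

In this toy version, a bin of bases reported as Q30 that mismatches about 1% of the time would come out recalibrated to about Q20.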
And it's just one more step that they've shown, and other people have shown, improves the quality of the outcome. So that's another thing to keep in mind: you can add this base quality recalibration. The last one that I wanted to mention is not always appropriate. Like yesterday, we were just doing one sample, and if you're only doing one sample, it doesn't work. Yeah? So, it is already a kind of quality score, but the quality scores provided by Illumina in the FASTQ are off by a certain amount. So it looks at all the reads that are mapped onto the reference, and it adjusts the scores that came from the original FASTQ. It says: you said the quality was 30, but for that type of base, you always make mistakes, so we readjust the quality of the base to 20. And then it does the variant calling. So it improves the input. That's right. Yes. This is just recalibrating, just tuning the quality scores that you had before. So the last thing that also improves variant calling, and this works in particular when you have multiple individuals, is population structure and imputation. Here's a little quiz to see if you're awake and ready: using haplotypes. As you probably know, we actually inherit whole blocks of our DNA, such that there are not so many differences in the genome, and locally there's typically high correlation between the bases. So suppose that in the population there are only two haplotypes, ATG, and I'm masking the other bases, or CGA. And then when you're observing the reads, these are the reads that you get. Can you guess what the value of the missing base is? T, correct, right? This is just a toy example to show that you don't need to infer the bases only by looking at them one at a time.
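The quiz can be written as a few lines of Python. This is only the toy version from the slide, assuming the population contains exactly the listed haplotypes and that the observed bases are error-free; real imputation methods are statistical and work from large reference panels. The function name and the `?` placeholder for the unobserved base are choices made for this sketch.

```python
def consistent_haplotypes(observed: str, haplotypes):
    """Return the known haplotypes compatible with a partial observation.

    `observed` uses '?' for positions where no base was seen.
    """
    return [
        hap for hap in haplotypes
        if len(hap) == len(observed)
        and all(o in ("?", b) for o, b in zip(observed, hap))
    ]


if __name__ == "__main__":
    population = ["ATG", "CGA"]
    # We saw an A and a G on either side of an uncovered position.
    print(consistent_haplotypes("A?G", population))
```

If only one haplotype is compatible with the flanking bases, the missing base is filled in from it.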
You can use information about the flanking bases, especially if you make assumptions about the population when you have many samples. This, again, is more advanced and more tricky, but it's another way you can improve variant calling: you feed in information about other samples, and you assume that you have this type of correlation structure between the bases. But again, we're getting into the more advanced type of analysis. You don't really need this, especially if you have 30X or 100X. But if you only have 5X or 2X, and that's what they've shown, then even with 5X, with lots of individuals, you can still call variants using this type of imputation. So this is just showing the performance if you only have one sample, versus having low coverage and using this information about multiple samples with imputation. But typically you don't need that if you have sufficient coverage. Most of the tools that we've been using are from this GATK framework, the Genome Analysis Toolkit. Yesterday, starting from the raw reads, what you did in module two was the mapping, the local realignment, and the duplicate marking. We skipped the recalibration step, but you can try it out if you're interested. And then what we're going to be doing in the module later is this step of variant calling. We're going to do it with and without the local realignment, just to see how different it looks and what we get. This part here is the multiple-sample, integrative analysis that we won't cover in much detail. Yesterday's dataset was a small version of a dataset. In a real whole genome dataset, you start with BAM files that are sometimes 200 gigs; that's roughly the size of the files that you would get. And doing the variant calling with GATK, or with alternative tools like SAMtools or FreeBayes, would actually take multiple hours.
And we go from these very large BAM files to much smaller variant call files. That's what we're going to be doing, but on a smaller dataset, in the practical. Before we do the practical, I'll also talk a bit about the annotation. We won't go into as much detail here. So once we've called the variants, how do we do variant filtering and annotation? The file that we're going to get after the variant calling is a VCF format file. It has a header, with all sorts of parameters and information about how the file was generated. But the key thing is that it's one position per line, with information about what the reference is at that position, what was observed as the alternative, and the quality score. And now this is a quality score not of the base, but of the genotype. It takes into account how many reads were supporting a difference here, and so on, and that's converted into an actual quality score for your variant. Then there's lots of additional information in the other columns about the allele frequency, the number of reads that were supporting that base, and so on. If you want to look, there's a link both here and in the wiki for more information. Yeah, I thought you had a question? Yeah. So whether it's comparable depends on the mapping and on the variant caller that you use. Each caller has its own formula for producing a quality. Depending on which variant caller you use, you're going to get a different range and a different interpretation of this quality. So you have to look at what the definition is, but in general, the higher the better. You should look at your distribution of qualities to see how many are high and how many are low. It's going to depend on the caller. There are a lot of them. That's right.
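To show what "one position per line" looks like in practice, here is a short Python sketch that reads the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) from tab-separated lines. It is a minimal reader for illustration, assuming well-formed input and ignoring the per-sample genotype columns; the two records in the example are invented, and for real work a dedicated VCF library is a better choice.

```python
def parse_vcf(lines):
    """Yield one dict per variant record, skipping header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # '##' meta lines and the '#CHROM' column header
        chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
        info_dict = {}
        if info != ".":
            for item in info.split(";"):
                key, _, value = item.partition("=")
                info_dict[key] = value if value else True  # flags have no value
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "id": vid,
            "ref": ref,
            "alt": alt.split(","),
            "qual": None if qual == "." else float(qual),
            "filter": flt,
            "info": info_dict,
        }


if __name__ == "__main__":
    # Invented records, for illustration only.
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t10583\t.\tG\tA\t52.3\t.\tDP=28;AF=0.5",
        "chr1\t13302\t.\tC\tT\t12.1\t.\tDP=6;AF=0.5;DB",
    ]
    for record in parse_vcf(example):
        print(record["chrom"], record["pos"], record["ref"], record["alt"],
              record["qual"], record["info"].get("DP"))
```

Collecting the `qual` values from all records is an easy way to look at the quality distribution for whichever caller you used.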
And again, there's going to be information in these extra fields that explains in more detail what evidence was observed to support the call. Yeah? It is possible to have multiple observations. There are trickier ones too, where it's not just a single-base variant but an indel, where you go from three bases to one, or to an expansion. But again, I recommend you look at the actual distribution of those. I'll push forward, because we want to get to the exercise, even though this section is going to take maybe a little bit longer, and I think the one on structural variants is going to be a bit shorter. So once you get the raw variant calls, they might have a lot of false positives. How do we filter? There are two strategies. Historically, it started with manual filtering based on different parameters: based on quality score, based on depth of coverage. You might say, I only want variants that have a very good score, or very good coverage, and so on. This works well, and it's a good starting point. There are also more advanced ways to filter your data now. You can learn the filters directly from the data itself. GATK has what's called a variant recalibrator, which is a way of automatically deciding what the right filters would be. It sounds like magic. The way it works is that you give it lots of actual variants. You give it the variants that are in dbSNP, or the variants that are in HapMap, and you ask: based on these, which I know are in my data, or which are mostly true positives, what are the right parameters for the variants in this VCF file? So there are really two options, and the first is fine too. You can filter based on quality and depth of coverage, or you can use some of the more advanced machine learning, where you give it actual data on known variants and it tunes its parameters automatically from them.
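The first strategy, manual filtering on quality and depth, is simple enough to sketch in Python. The thresholds below (quality 30, depth 10) and the filter labels are arbitrary placeholders chosen for this illustration, not recommended values; as discussed, the right cut-offs depend on the caller and on your own quality distribution. Variants are represented as plain dictionaries with assumed `qual` and `depth` keys.

```python
def hard_filter(variants, min_qual=30.0, min_depth=10):
    """Split variants into (kept, rejected) using fixed thresholds.

    Each returned variant gets a 'filter' field: 'PASS' or the list of
    reasons it failed, joined with ';'.
    """
    kept, rejected = [], []
    for variant in variants:
        reasons = []
        if variant["qual"] < min_qual:
            reasons.append("LowQual")
        if variant["depth"] < min_depth:
            reasons.append("LowDepth")
        labelled = {**variant, "filter": ";".join(reasons) or "PASS"}
        (rejected if reasons else kept).append(labelled)
    return kept, rejected
```

Keeping the rejected calls with their reasons, rather than deleting them, makes it easy to see how many variants each threshold removes and to revisit the choice later.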
So HapMap, this was done very carefully. It's a proxy for false negatives because these are high-quality SNPs. So you can give this set as a training set and use that to sort of optimize the parameters in your file. dbSNP actually contains lots of real things but also lots of false positives, because people have submitted a lot of things to dbSNP, and a lot of the common mistakes that you might make sometimes make it into dbSNP. So both of these data sets are interesting for different things, but if you annotate your variants with them, and that's what I'm going to get to now, it actually provides quite a bit of information that helps you identify your true variants. So again, I won't go into this too much, but this is the variant recalibration that I was talking about, which again is within the GATK framework. There's a clear pattern: the actual good SNPs, the HapMap SNPs, have patterns that are very different from some of the false positives. So you can use that to calibrate your variant calls and then give an even better score and better ranking of your variants. But again, this is maybe a bit technical. So if you're interested, you can look into that further.
Okay, so now I'm moving on to the last section of the intro, I guess, to all of this, which is the annotation. So this is a project that we were part of, and I don't know why the square's missing, but we sequenced 100 kidney tumor whole genomes. In the end, we actually found 575,000 variants, somatic variants, in those. So where do you start and how do you look? So, well, this is weird. This slide is behaving weirdly. I mean, this shouldn't have changed. I mean, the point I wanted to make here is just that out of all of these variants, only a subset were actually coding variants. And so adding the annotation of the variants is quite important and quite key. So, yeah, I'm sorry. They're hitting gene sequences as opposed to just being non-coding.
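Going back to the idea of annotating your calls with the known variant sets, here is a rough sketch in Python of what that lookup amounts to. Everything in it is invented for illustration: the function name `annotate_known`, the `known_in` field, and the tiny stand-in sets. Real known-variant resources are distributed as large files and would be loaded with proper tooling rather than typed in by hand.

```python
from typing import Dict, List, Set, Tuple

# A variant is identified here by chromosome, position, reference and alternative.
VariantKey = Tuple[str, int, str, str]


def annotate_known(calls: List[Dict], known_sets: Dict[str, Set[VariantKey]]) -> List[Dict]:
    """Return copies of the calls with a 'known_in' list naming each set that contains them."""
    annotated = []
    for call in calls:
        key = (call["chrom"], call["pos"], call["ref"], call["alt"])
        hits = sorted(name for name, keys in known_sets.items() if key in keys)
        annotated.append({**call, "known_in": hits})
    return annotated


# Tiny invented stand-ins for the known sets.
known_sets = {
    "dbSNP": {("chr1", 1000, "A", "G"), ("chr1", 3000, "C", "T")},
    "HapMap": {("chr1", 1000, "A", "G")},
}

calls = [
    {"chrom": "chr1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 3000, "ref": "C", "alt": "T"},
    {"chrom": "chr1", "pos": 4000, "ref": "G", "alt": "A"},
]

annotated = annotate_known(calls, known_sets)
for call in annotated:
    print(call["pos"], call["known_in"])
```

A call found in the carefully curated set is more likely to be real, while a call found in neither set is not necessarily wrong; it may simply be novel.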
So even once you have variants, and even once you have good scores of whether they're good or bad variants, actually having these annotations, annotation like this, that this is a variant that's hitting a gene, is of course gonna be very important to interpret the file. So doing annotation of variants, there's lots of tools, and the one that we're gonna be using in the practical is called SnpEff, and it does that. So, you know, it's gonna annotate your variants to say whether it's a coding or a non-coding variant. If it's a coding variant, it's gonna annotate whether it's a synonymous change, a non-synonymous change, a stop gain, and so on. It's gonna give you some basic prioritization. So, of course, if it's a stop gain in a gene, it's got a high impact, a potential high impact, otherwise moderate, low, and so on. So, having this type of information about what the variant might be doing is obviously very, very relevant. So, what we're gonna be doing in the practical is to go from the BAM files to the raw VCF, using GATK, and then we're gonna use GATK to filter a bit, and SnpEff to annotate the variants. So, time to actually do it.
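The basic prioritization described above can be sketched as a simple lookup in Python. This is a toy mapping that follows the description in the talk (stop gain high, non-synonymous moderate, synonymous low), not the annotation tool's actual implementation. The effect labels, the gene names and the `MODIFIER` tier for non-coding variants are assumptions made for this illustration.

```python
from typing import Dict, List

# Toy effect-to-impact table; labels are illustrative, not the tool's exact terms.
IMPACT_BY_EFFECT: Dict[str, str] = {
    "stop_gained": "HIGH",
    "non_synonymous": "MODERATE",
    "synonymous": "LOW",
    "non_coding": "MODIFIER",   # assumed lowest tier for non-coding variants
}

# Lower rank means look at it first.
IMPACT_RANK: Dict[str, int] = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def impact_of(effect: str) -> str:
    """Look up the impact tier for an effect label."""
    return IMPACT_BY_EFFECT[effect]


def prioritize(variants: List[Dict]) -> List[Dict]:
    """Sort variants so the highest predicted impact comes first (stable sort)."""
    return sorted(variants, key=lambda v: IMPACT_RANK[impact_of(v["effect"])])


# Invented example variants with hypothetical gene names.
variants = [
    {"gene": "GENE_A", "effect": "synonymous"},
    {"gene": None, "effect": "non_coding"},
    {"gene": "GENE_B", "effect": "stop_gained"},
    {"gene": "GENE_C", "effect": "non_synonymous"},
]

ranked = prioritize(variants)
for v in ranked:
    print(impact_of(v["effect"]), v["gene"], v["effect"])
```

With hundreds of thousands of variants, a ranking like this is what lets you start from the small coding, high-impact subset rather than from the full list.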