So, my name is Guillaume Wouk. I work at the Miguel Guillaume Center, which is here in the city. It's my first participation in the workshop, but I've used the slides in the past, and I was telling these guys that I felt it was my duty to contribute at some point, so I'm very happy to be here. This morning I'm doing two modules with you: small variant calling and large (structural) variant calling. We're going to touch on a lot of what was covered yesterday as well, and I'll put that in context. So hopefully, now that you've gone through all of the steps yesterday and slept on it, a lot of that is going to come together in these two modules. I also didn't want you to cheat with the slides, so I only handed them out this morning; you'll have to follow the presentation with me. As usual, feel free to reuse and share these slides.

What I'll be talking about in this first module is small variant calling and annotation. It reads better on the slides than on the screen, but the idea is that once we've mapped all the reads onto the reference, one of the key things we're interested in is identifying the positions that vary. The reason we sequence different genomes is that we want to pick up the sites that are variable, that differ between individuals. This is relevant in many diseases. I don't have a lot of slides on the motivation, but identifying variable sites like this is really one of the main goals of these re-sequencing experiments.

The objectives cover quite a lot. If it's not clear what variant calling is, hopefully it will be after this module: understanding the basic principles of how we call variants, and knowing what matters. A lot of the things you were doing yesterday, maybe without realizing it, have a big impact on variant calling, and that's exactly what I'll show you in this module. I'll also show you how to filter and annotate the variants. All of this concerns small variants: single-nucleotide variants and short insertions and deletions. In the practical, starting from a BAM file similar to the one you generated yesterday, you'll call variants, annotate them, and learn about the VCF format. Just like yesterday you saw the FASTQ format and the SAM/BAM format, we'll go over the variant call format, VCF, a little bit, and visualize a SNP. This pulls a lot together: you did much of the work yesterday, and now you're going to put it together and hopefully make sense of it.

I like this slide, which comes from a senior bioinformatician at the Genome Center, because it gives an overview of the steps when we do variant calling. To begin with, a little like you did yesterday, you start by cleaning up the reads and trimming them, in some cases removing adapters. In module two you did the alignment, aligning the reads; this is really at the center of variant calling. Once you've aligned the reads, you call the variants, annotate the variants, and you can also call structural variants, which is what we're doing today. In parallel, of course, you're collecting a lot of statistics.
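To make that overview concrete, here is a minimal sketch, written as a simple printable checklist in Python, of the step order described above. The example tools named in the strings are typical choices rather than necessarily the exact ones used in the workshop.

```python
# A minimal sketch of the re-sequencing pipeline described above.
# The tools in parentheses are typical examples, not the only options.
pipeline = [
    ("quality control",       "inspect the raw FASTQ files (FastQC)"),
    ("read trimming",         "remove adapters and low-quality ends if needed"),
    ("alignment",             "map reads to the reference (e.g. BWA), producing BAM"),
    ("post-processing",       "local realignment, duplicate marking, base quality recalibration"),
    ("small variant calling", "SNVs and short indels (e.g. GATK), producing VCF"),
    ("variant annotation",    "coding/non-coding, predicted impact (e.g. SnpEff)"),
    ("structural variants",   "larger events, covered in the next module"),
    ("statistics",            "collect metrics at every step along the way"),
]

for step_number, (step, description) in enumerate(pipeline, start=1):
    print(f"{step_number}. {step}: {description}")
```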
But, without looking at the slides and without cheating: there's actually something that all of you forgot to do yesterday. What's the first thing you have to do, for this pipeline or for any pipeline like it? Sorry? Well, okay, fair enough, after that. So, what's the first analysis step you have to do with any of these pipelines? Yes, that's right: we have to look at the data. And that's one thing, because I work in a genome center, and if you get the sequences from us, they're perfect. Not true; well, it's sort of true. But it's really key to look at your data at every step along the way, and especially at the first step. When you get the FASTQ files, you have to look at them. Depending on what you're doing, if you have multiple samples, if they were generated over time, if you're getting data from the internet, you have to look at that data before you do any analysis, because otherwise you don't really know what you're going to get. If one of the samples has really horrible quality, for instance, you might get all sorts of weird things down the line. I don't know whether your background is working in the lab, but it's the same thing: at every single step you have to look at what you did and what the data looks like, and the first step, quality control, is the most important. In the lab we'll actually go back and look at the quality of the files you used yesterday as well.

So the main analysis steps, and this is for the variant calling and structural variant calling we're doing today but it's really quite general, are quality control, pre-processing of the reads based on that quality control, and then mapping, which you covered extensively yesterday. You then diverted into two other modules that were also very important but not directly part of this pipeline, because after the mapping comes variant calling. That's what we're covering now, variant calling and annotation, and then after the break we'll move to structural variant calling. But these initial steps, I just wanted to re-emphasize, are really quite important.

I won't spend much time on this, but I wanted to mention it again. The importance of quality control: before you start an analysis, it's important to look at your raw data, because otherwise everything you do downstream may not make sense. Were all the samples sequenced on the same instrument? This is especially true if you're getting data from different places. Are there any technical issues affecting some of the samples? It's really important to get a sense of what your data looks like. One tool that's especially used and useful with next-generation sequencing data is FastQC, and there are different variants of that tool. I ran it on the data set you were using yesterday, and this is part of the output. It's reassuring: the data set you were using yesterday is pretty good. You can download the tool from the web and run it, it's easy to download and run, and it's also included in Galaxy, which may be part of what you're doing this afternoon.
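To give a feel for what FastQC is summarizing, here is a minimal pure-Python sketch of a per-base quality report. It assumes an uncompressed FASTQ with Phred+33 quality encoding, and the file name is hypothetical; the real FastQC computes many more modules (per-tile quality, adapter content, overrepresented sequences, and so on).

```python
# Minimal sketch of a FastQC-style per-base quality summary.
# Assumes an uncompressed FASTQ with Phred+33 quality encoding.
from collections import defaultdict

def per_base_quality(fastq_path):
    totals = defaultdict(int)   # position -> sum of quality scores
    counts = defaultdict(int)   # position -> number of reads covering that position
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:                        # every 4th line is the quality string
                for pos, ch in enumerate(line.rstrip("\n")):
                    totals[pos] += ord(ch) - 33   # Phred+33 character -> quality score
                    counts[pos] += 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(counts)}

if __name__ == "__main__":
    means = per_base_quality("reads_1.fastq")     # hypothetical file name
    for pos, q in means.items():
        flag = "" if q >= 30 else "  <-- below Q30"
        print(f"base {pos + 1:3d}: mean Q = {q:5.1f}{flag}")
```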
So you run this tool and you get basic statistics, which are hard to see here: how many reads you have, what the read length was, and then the other metrics, like the quality of your reads. This profile is a typical profile for next-generation sequencing data. On the y-axis you have quality scores; I can't quite read it from here, but I think this is 30, and a quality of 30 means 99.9% accuracy. This is the distribution across all of your reads, from the first base down to the last base, and you see that by the time you get to the end of the read, some reads have much lower quality. The trimming step is something that actually removes some of those reads. But overall this is pretty good, and a lot of the aligners and variant callers can take it into account. So that was read 1 of your data set. Read 2 is also not bad; one of the summary reports tells you, because for every read we know what its position was on the flow cell, that there were a number of tiles where the quality scores are not as good. Again, this is not an extreme case of a bad data set; it just shows that read 2 didn't have quite the same quality, but it's still pretty good. These are cases where things look more or less fine, but I encourage you to run a tool like this on your data set before doing any analysis downstream, because if you get a lot of skewed values or something like that, you want to know before you get started.

Another thing, and again some of the downstream tools will catch this and won't be affected by it, is that there might be adapter sequences in your reads. That's a good thing to watch for and to know about: if you get samples back, what fraction of your reads actually contain adapter sequences? That information also comes out, at some level, of these QC reports. This is just an example (it's an older slide showing the library layout) where the fragment is short enough that the read actually reads through into the adapter, so your reads end up containing adapter sequence. One way to catch that in the FastQC report is the check for overrepresented sequences: if you see a lot of overrepresented sequences in your reads, you might have either duplicates or adapter sequences. This tool in particular was mentioned yesterday by Nathan for de novo assembly, and there are a number of tools out there that allow you to trim. So after you've done your QC, if you see there are lots of adapters or lots of low-quality reads, depending on your application you might want to clean up your file: cut adapter sequences (we know what the adapter sequence is for Illumina sequencing, so you just remove it from reads that contain it), cut low-quality bases off the end of the read, drop reads that don't have sufficient quality, and so on.
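As an illustration of those trimming operations, here is a toy sketch: clip at a known adapter, trim low-quality bases from the 3' end, and drop reads that become too short. The adapter string, thresholds, and example data are made up for the illustration, and real trimmers (Trimmomatic, cutadapt, and the like) handle partial adapter matches and paired-end consistency much more carefully.

```python
# Toy sketch of read trimming: adapter clipping, 3'-end quality trimming,
# and dropping reads that become too short. Illustration only.

ADAPTER = "AGATCGGAAGAGC"   # start of a common Illumina adapter (assumption)

def phred_error_prob(q):
    """Q30 -> 0.001, i.e. 99.9% accuracy."""
    return 10 ** (-q / 10)

def trim_read(seq, quals, min_q=20, min_len=36):
    # 1) clip at the first exact adapter occurrence, if any
    idx = seq.find(ADAPTER)
    if idx != -1:
        seq, quals = seq[:idx], quals[:idx]
    # 2) trim low-quality bases off the 3' end
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[:end], quals[:end]
    # 3) drop the read entirely if it became too short
    return (seq, quals) if len(seq) >= min_len else None

# tiny usage example with made-up data: adapter read-through plus a low-quality tail
seq = "ACGT" * 12 + ADAPTER + "TTTT"
quals = [35] * 40 + [10] * (len(seq) - 40)
print(trim_read(seq, quals))
```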
So this is just a heads-up that this step is really quite important, especially looking at your data and making sure your different samples are comparable. It doesn't need to be perfect. Yes, so the question is how we chose the trimming tool and the cutoffs. Personally, I mean, this step is pretty simple, so we selected the tool based on usability: you look at the options and what you want to do, and there are no fixed criteria for what you should be removing. We ended up picking this one because the options of the tool, what it allows us to do, and how we could run it were what we needed. There's nothing too fancy here; you're really just removing reads that have a given property, so it's pretty straightforward. Yes, definitely. And that's exactly right: in my mind, every time you do one of these analysis steps, you look again at your reads, or at your mapped reads, and ask how many there are and how many you removed. Because sometimes you misunderstood the parameters and you ended up cutting 90% of your reads, or something like that. So you want to know what you've done after every step.

Any questions on that part? Again, I didn't want to go into too much detail. There was a question about duplicates: removing duplicates was one of the steps you did yesterday. These are PCR duplicates, reads with exactly the same start. You remove them after mapping, which is actually a good thing, but you could also already see in FastQC that you have lots and lots of identical sequences. And again, initially it was very important to do these filtering steps, because the mapping and the variant calling were very much affected by them. Now, to be honest, the trimming step isn't strictly necessary anymore, because a lot of the variant callers take the quality scores into account and won't give much weight to those bases. So you don't really need to do it, but it's still good to get a sense of what your data looks like, if anything. It's also good, and it's a bit like what we're going to be doing, to try it with and without those steps and see whether you get different results: if you don't trim, how many variants do you get, and if you trim, how many do you get? If you get ten times more variants without trimming, that tells you something.

Yes, that's a good question. So the question is whether cleaning up is more important for small variant calling or for large, structural variant calling. By the time we get to structural variants, there are a lot of false positives to begin with, and those false positives, for the most part, are not caused by read quality. So I don't think it affects structural variant calling as much as it affects small variant calling. Small variant calling is actually a pretty robust pipeline, and even with somewhat bad data it should still give decent results, but cleaning up your data has a bigger impact there than it does for structural variant calling.
For structural variant calling, what could affect it, and we'll get to that in a bit, would be chimeric reads and things like that, and that's not something you can easily clean up with this trimming step. Yes? So the question is about the FastQC modules, like the k-mer content, that stay orange rather than green after cleaning. Good question; we get a lot of these. You look at FastQC and you don't know whether the data is good or not: you get some orange and some red, and is that a problem? The k-mer plots in particular are hard to interpret, from my perspective. I think the best approach is to do it two different ways: clean the data and then go forward with your analysis pipeline, or don't, and see if it changes the results. For the most part it shouldn't affect things too much, and it's normal to have some of these orange warnings; I think that's all fine. What matters more is whether your samples all look the same, and whether you have something like 50% duplicates. You don't necessarily need everything to be green. Even this is fine: if you compare the read 1 distribution of scores with read 2, read 2 has a few more low scores coming from that particular region of the flow cell. You could remove them, but I don't think it would change very much; you can try it with and without, depending on how much you care about false positives. There are also other tools that provide summary information about the run, so you can get more information about the data. As a sequencing center we run FastQC and provide that same type of output when we deliver data, but it's really easy for you to run it yourself as well, so if it's not provided, I encourage you to do it, because then you can ask questions. You do expect this type of profile where the quality decreases along the read, but, and I'm not showing it here, there are also statistics on the read lengths, and if you end up with a mix of shorter and longer reads, something might have been mixed up. So it's really easy and definitely worth your time to look at the data you get before you start the analysis.

Okay, moving on to the actual module: SNP and variant calling. Yesterday you mapped reads and explored IGV quite a bit; the goal of this module is how to identify sites like this one. This is a tumor sample and a normal sample from the same individual, and you can see there's a variant: all the reads point to a difference in the normal sample here. It looks easy when you look at it this way; in practice it's a little trickier. But the main idea is what you can guess from the slide itself: for actual variants, you expect many reads to point out that there's a difference.
So we look for sites where many reads support a variant at that position, and that's what distinguishes real variants from sequencing errors. 99.9% accuracy sounds super accurate, but it means 0.1% errors, and there are so many reads and so many positions that there are actually lots and lots of errors, millions and millions of sequencing errors in your data set. Those errors, though, should be spread out across the reads, so we look for positions where multiple reads point to a difference. Yes? So the question is whether we could do error correction on the reads first. No, because we're looking for these differences: this is a site that actually has both haplotypes, it has the G and it has the A, so you wouldn't want to try to collapse the reads into a single version. In this step, every read is mapped individually against the reference, and afterwards we look for the differences, without any collapsing. It's in the de novo assembly problem, where you don't have a reference, that the challenge is to figure out which reads go with which and start putting them together. And that's basically what the variant calling itself takes care of: the sequencing errors will simply be ignored. In a way that is a kind of error correction, because for every position we'll get an actual call, for example one that says it is very unlikely that there's a G at this position. That's exactly what variant calling does: for every position, it gives us the probability of being different, of being a variant.

I'm not going to go into this in much detail; I'm putting it up mostly to scare the non-mathematicians a little, but this is still the way variant calling is implemented. Yes, that's a good question: I think it's based on the quality scores, so depending on the quality, the base is displayed as a capital A or a lowercase a in this view. And for variant calling, I didn't go into it, but we do need to cover every position multiple times; typically 30X is the target. The reason is that if we only cover a position once or twice, we can't tell whether something is a sequencing error or an actual difference in the genome. So typically for whole-genome sequencing the target is 30X, and for exome sequencing it's even higher, around 100X, exactly so that there are enough reads covering every position to distinguish errors from real variants.

This slide is pretty important, because it's really the main idea of how variant calling is implemented. It takes into account the number of reads that observe a difference, and it takes into account the quality score of every base. That's what I have on this slide, and again we don't need to go into too much detail, but basically it computes the probability of any particular genotype given the data: given all of the reads at that position, what is the probability of each genotype? One thing that makes this more complicated is that the human genome is diploid, so there are two haplotypes. Just as we saw on the previous slide, it's not simply a G or an A: at that particular position there are probably both a G and an A.
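To make that concrete, here is a toy sketch of the kind of genotype-likelihood calculation being described, assuming a diploid site, independent reads, and Phred-scaled base qualities. The pileup data are invented and the error model (errors spread evenly over the three other bases) is a simplification; real callers such as GATK build many refinements on top of this basic idea.

```python
# Toy genotype-likelihood sketch for one diploid position.
# Each observation is (base, phred_quality). For a genotype made of two
# alleles, each read is assumed to come from either allele with probability
# 1/2; a read matches its allele with probability (1 - e) and mismatches
# with probability e (split among the 3 other bases), where e = 10^(-Q/10).
from itertools import combinations_with_replacement
from math import log10

def genotype_log10_likelihoods(pileup, alleles="ACGT"):
    results = {}
    for a1, a2 in combinations_with_replacement(alleles, 2):
        ll = 0.0
        for base, qual in pileup:
            e = 10 ** (-qual / 10)
            p = 0.0
            for allele in (a1, a2):
                p += 0.5 * ((1 - e) if base == allele else e / 3)
            ll += log10(p)
        results[a1 + a2] = ll
    return results

# 10 reads cover the site: 6 say G, 4 say A, and one of the A calls has low quality
pileup = [("G", 30)] * 6 + [("A", 30)] * 3 + [("A", 10)]
lls = genotype_log10_likelihoods(pileup)
for gt, ll in sorted(lls.items(), key=lambda kv: -kv[1])[:3]:
    print(gt, round(ll, 2))
# The heterozygote AG comes out on top, as in the G/A example above.
```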
What makes the X and Y chromosomes tricky is exactly this ploidy question, and a lot of the time they have to be run separately, for a number of reasons, so variant calling on X and Y is often handled on its own. But again, I don't want to spend too much time on this; if you're interested, look into the details. To me the main point is how this is done: it integrates all of the data at a particular position to provide a score, a probability, for every possible genotype at that position, and it incorporates the number of reads supporting each genotype as well as the quality scores, because it takes into account the fact that the sequencer makes errors. So it does a bit of what we were saying: it distinguishes sequencing errors from actual variants. And this is where I said trimming used to make a big difference. It doesn't as much anymore, because these tools, and the one we're going to use is the GATK pipeline and framework, take the base quality scores into account, so even if you don't trim, those bad reads or bad bases won't count for much in the confidence score.

What we'll cover in more detail, though, is what was discussed yesterday: local realignment, duplicate marking, base quality recalibration (which we didn't cover, but I want to talk about a bit), and then population structure and imputation. Even with the method and model I just described, all of these alignment and post-alignment steps make a big difference in variant calling, so I want to go over them.

Starting with local realignment, which you saw yesterday: it was mentioned that around insertions and deletions, reads are frequently not aligned very well, and this leads to these kinds of patterns. If we did variant calling on this kind of data, and this is the alignment without the realignment step, we would probably call these positions as variants, because there are lots of reads supporting a difference. That's why it's important to do the realignment step: given that it looks like there's an indel here, and given that the alignment was originally done individually for every read, the realignment step takes all of the reads at that position into account and does a much better job based on all of the information around that region. We'll go over this step in the lab: what happens if you don't do it, and what happens if you do. So that's one thing that improves variant calling.

Duplicate marking is another thing you did yesterday. Here again, as I was saying, variant calling takes into account the number of observations, the number of reads that see a difference at a given position. The problem with duplicates is that this was really one read, one fragment with one error, that was then amplified, so you end up with many reads that are all the same read carrying the same error. If we did variant calling on this data, we would call it a variant, but there's really only one piece of evidence, and that's again wrong. So duplicate marking is quite important, because it collapses all of that back down to one: we have a suggestion of a difference here, but it's only one distinct read, so it gets down-weighted and we're not going to call a variant there. Duplicate marking is quite critical, especially for variant calling from DNA data; in RNA-seq it's a whole other business, and you'll hear about that in the next few days.
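Here is a toy sketch of the idea behind duplicate marking. The grouping key and scoring are simplified (tools like Picard MarkDuplicates also use the mate position and library information), and the read records are made up for the illustration.

```python
# Toy sketch of duplicate marking: reads that map to exactly the same place
# (same chromosome, start, and strand here) are treated as PCR copies of one
# original fragment, and only the copy with the highest summed base quality
# is kept as evidence.
from collections import defaultdict

def mark_duplicates(reads):
    """reads: list of dicts with keys name, chrom, start, strand, base_quals."""
    groups = defaultdict(list)
    for r in reads:
        groups[(r["chrom"], r["start"], r["strand"])].append(r)
    kept, dupes = [], []
    for group in groups.values():
        group.sort(key=lambda r: sum(r["base_quals"]), reverse=True)
        kept.append(group[0])          # best copy is kept
        dupes.extend(group[1:])        # the rest are flagged as duplicates
    return kept, dupes

reads = [
    {"name": "r1", "chrom": "chr1", "start": 1000, "strand": "+", "base_quals": [30] * 50},
    {"name": "r2", "chrom": "chr1", "start": 1000, "strand": "+", "base_quals": [25] * 50},
    {"name": "r3", "chrom": "chr1", "start": 1000, "strand": "+", "base_quals": [28] * 50},
    {"name": "r4", "chrom": "chr1", "start": 1040, "strand": "+", "base_quals": [30] * 50},
]
kept, dupes = mark_duplicates(reads)
print("kept:", [r["name"] for r in kept], "| duplicates:", [r["name"] for r in dupes])
# The three reads with identical coordinates collapse to a single piece of evidence.
```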
Base quality recalibration is another step we didn't cover yesterday, but that I wanted to point out. So what is it? It turns out that the sequencers give you quality scores, but it has been observed that they make systematic mistakes: they tend to overestimate the quality of the bases. So the quality scores provided with your FASTQ are off by a little bit. If they were off by a little bit at random, it wouldn't matter, but it turns out the mistakes are systematic. The sequencer makes systematic mistakes relative to the position on the read, which is not the end of the world, and also relative to the dinucleotide context: if it's a G that comes just after a C, the quality score tends to be off more often than not. These are small mistakes, but because they're systematic, and because lots and lots of reads make the same systematic mistakes, they tend to lead to false positives. This is also implemented within the GATK framework: after you've mapped all the reads, you can recalibrate the qualities. It's basically done by looking at all the reads and all the positions, seeing which bases are correct or incorrect, and adjusting the quality scores. It's one more step that people have shown improves the quality of the variant calls. So that's another thing to keep in mind, that you can add this base quality recalibration.

And the last one I wanted to mention is not always appropriate; like yesterday, we were just doing one sample, and if you're only doing one sample it doesn't work. But first, yes, a question on the recalibration. So the reads already come with quality scores, but the quality scores provided by the Illumina instrument in the FASTQ are off by a certain amount. The recalibration looks at all the reads that are mapped onto the reference and adjusts the scores from the original FASTQ. It says: you said the quality of this base was 30, but for that type of base you systematically make mistakes, so we readjust the quality down to, say, 20. And that improves the calling. That's right, yes: this is just recalibrating, just tuning the quality scores that you had in the FASTQ.
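To show the flavor of that adjustment, here is a toy sketch: bin the bases by reported quality and the preceding base, count how often they actually mismatch the reference at positions assumed to be invariant, and turn the empirical error rate back into a Phred score. The observation counts are invented, and the real GATK recalibration uses more covariates, masks known variant sites, and applies proper smoothing.

```python
# Toy sketch of the idea behind base quality score recalibration (BQSR):
# for each (reported quality, preceding base) bin, count how often the base
# actually mismatches the reference, then turn that empirical error rate back
# into a Phred-scaled quality.
from collections import defaultdict
from math import log10

def empirical_quality_table(observations):
    """observations: iterable of (reported_q, prev_base, is_mismatch)."""
    stats = defaultdict(lambda: [0, 0])          # bin -> [mismatches, total]
    for reported_q, prev_base, is_mismatch in observations:
        s = stats[(reported_q, prev_base)]
        s[0] += int(is_mismatch)
        s[1] += 1
    table = {}
    for key, (mismatches, total) in stats.items():
        err = (mismatches + 1) / (total + 2)     # small pseudocount, avoids log(0)
        table[key] = -10 * log10(err)
    return table

# made-up observations: bases reported as Q30 after a 'C' mismatch far too often
obs = (
    [(30, "C", True)] * 30 + [(30, "C", False)] * 970 +    # ~3% error -> drops to about Q15
    [(30, "A", True)] * 10 + [(30, "A", False)] * 9990     # ~0.1% error -> stays close to Q30
)
for key, q in sorted(empirical_quality_table(obs).items()):
    print(f"reported Q{key[0]} after '{key[1]}': empirical Q ~ {q:.1f}")
```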
So the last thing that also improves variant calling, and this works in particular when you have multiple individuals, is population structure and imputation. Here's a little quiz to see if you're awake. Using haplotypes: suppose there are only two haplotypes in the population. As you probably know, we inherit whole blocks of our DNA, so there aren't that many differences in the genome, and locally there's high correlation between bases. So suppose that in a population there are only two haplotypes, ATG (I'm masking the other bases) or CGA, and these are the reads you observe. Can you guess what the masked base is?

T, correct. This is just a toy example to show that you don't have to infer the bases by looking at them one at a time: you can use information about the flanking bases, especially if you make assumptions about the population. This is more advanced and more tricky, but it's another way to improve variant calling: you feed in information about other samples and assume this kind of correlation structure between the bases. Again, this is the more advanced stuff, and you don't really need it if you have 30X or 100X coverage, but if you only have 5X or 2X, and that's what has been shown, then even at 5X with lots of individuals you can still call variants using this technique. This plot is just showing the performance if you only have one sample at low coverage versus using this information across multiple samples. Typically you don't need it if you have sufficient coverage.

Most of the tools we've been using are from the GATK framework, the Genome Analysis Toolkit. Yesterday, starting from the raw reads, what you did in module two was the mapping, the local realignment, and the duplicate marking. We skipped the recalibration step, but you can try it out if you're interested. What we're going to do later in this module is the variant calling step, and we'll do it with and without the local realignment, just to see how different it looks and what we get. The multi-sample, integrative analysis is the part we won't cover in much detail. In a real data set, and yesterday's data set was a small version of one, in a real whole-genome data set you start with BAM files that are sometimes 200 gigs; that's roughly the size of the files you would get. Doing the variant calling with GATK, or with alternative tools like samtools or FreeBayes, would actually take multiple hours. We go from these very large BAM files to much smaller variant call files. That's what we're going to do, but on the smaller data set, in the practical.

Before we do the practical, I'll also talk a bit about the annotation, without going into as much detail. Once we've called the variants, how do we do variant filtering and annotation? The file we get after variant calling is in VCF format: a header with all sorts of parameters and information about how the file was generated, and then, and this is the key thing, one variant, one position, per line, with information about what the reference is at that position, what was observed as the alternative, and what the alternative genotype is. There's a quality score, and this is now not the quality of a base but the quality of the genotype call: it takes into account how many reads were supporting a difference and so on, and converts that into an overall quality score for your variant. Then there's lots of additional information in the other columns, about the allele frequency, the number of reads supporting each allele, and so on. If you want to look further, there's a link both here and on the wiki with more information about the format.
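Since we'll be working with VCF files in the practical, here is a minimal sketch of reading the fixed columns. The column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) is part of the VCF specification, but the INFO keys pulled out below (DP for depth, AF for allele frequency) are just common examples and depend on the caller; the file name is hypothetical.

```python
# Minimal sketch of walking through a VCF: skip the header lines, then read
# the fixed columns CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.

def parse_info(info_field):
    out = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            out[key] = value
        else:
            out[item] = True            # flag-style INFO entries
    return out

def read_vcf(path):
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):    # meta-information and the column header
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, _id, ref, alt, qual = fields[:6]
            info = parse_info(fields[7])
            yield {
                "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
                "qual": float(qual) if qual != "." else None,
                "depth": info.get("DP"), "af": info.get("AF"),
            }

if __name__ == "__main__":
    for rec in read_vcf("variants.vcf"):          # hypothetical file name
        print(rec["chrom"], rec["pos"], rec["ref"], ">", rec["alt"],
              "QUAL =", rec["qual"], "DP =", rec["depth"])
```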
I thought you had a question? Yes, so whether that quality is comparable: it depends on the mapping and it depends on the variant caller you've used. They each have their own formula for producing a quality, so depending on which variant caller you've used, you're going to get a different range and a different interpretation of that quality. You have to look at what the definition is, but the higher the better. You can look at your distribution of qualities to decide how many you keep; where to draw the line is going to depend on the caller. That's right. And again, there's going to be information in these extra fields that explains in more detail what evidence was observed to support the call. Yes, it is possible to have multiple alternative alleles, and there are trickier records too where it's not just a single-base variant but an indel, where you go from three bases down to one, or to an expansion. I recommend you look at the actual description of the format for those. I'll push forward because we want to get to the exercise, even though this section may take a little longer; I think the structural variant one will be a bit shorter.

Once you get the raw variant calls, they might contain a lot of false positives, so how do we filter? There are two strategies. Historically it started with manual filtering based on different parameters: based on quality score, based on depth of coverage, you might say you only want variants with a very good score, or very good coverage, and so on. This works well, and it's a good starting point. There are also more advanced approaches, where you learn the filters directly from the data itself. GATK has what's called a variant recalibrator, which is a way of automatically deciding what the right filters would be. It sounds like magic, but the way it works is that you give it lots of known variants: you give it the variants that are in dbSNP, or the variants that are in HapMap, and you say, based on these sites that I know are in my data and are mostly true positives, what are the right parameters for filtering this VCF file? So there are really two options, and the first one is fine too: you can filter based on quality and depth of coverage, or you can use the more advanced machine-learning approach, where you give it actual known variants and it tunes its parameters automatically from that. HapMap was done very carefully, these are high-quality SNPs, so you can give that set as a training set and use it to optimize the parameters for your file; it also works as a proxy for false negatives, because these are variants you would expect to call. dbSNP contains lots of real variants but also lots of false positives, because people have submitted many things to dbSNP and a lot of the common mistakes you might make have ended up in there. So both of these data sets are interesting for different things, and if you annotate your variants with them, and that's what I'm going to get to now, it provides quite a bit of information that helps you identify your true variants. Again, I won't go into this too much, but this is the variant recalibration I was talking about, within the GATK framework: there's a clear pattern where the actual good SNPs, the HapMap SNPs, look very different from some of the false positives.
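As a toy version of the first, manual strategy, here is a sketch that keeps only records passing fixed quality and depth thresholds. The cutoffs and the example records are made up for illustration; in practice you would pick thresholds from your own distributions or use GATK's own filtering tools, and the record dictionaries below follow the same shape as the read_vcf() sketch shown earlier.

```python
# Toy version of manual hard-filtering: keep only variants whose quality and
# depth pass fixed thresholds. The cutoffs are arbitrary illustrations.

MIN_QUAL = 30.0
MIN_DEPTH = 10

def passes_hard_filter(record):
    """record: dict with at least 'qual' and 'depth' fields."""
    if record["qual"] is None or record["qual"] < MIN_QUAL:
        return False
    depth = record.get("depth")
    if depth is None or int(depth) < MIN_DEPTH:
        return False
    return True

# usage with a couple of made-up records
records = [
    {"chrom": "chr1", "pos": 12345, "ref": "G", "alt": "A", "qual": 87.3, "depth": "28"},
    {"chrom": "chr1", "pos": 99999, "ref": "T", "alt": "C", "qual": 11.2, "depth": "3"},
]
for rec in records:
    verdict = "PASS" if passes_hard_filter(rec) else "filtered out"
    print(rec["chrom"], rec["pos"], verdict)
```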
Coming back to the recalibrator: you can use those patterns to recalibrate your variant calls and get even better scores and a better ranking of your variants. But again, this is maybe a bit technical, so if you're interested you can look into it further.

Okay, now I'm moving on to the last section of this intro, which is the annotation. This is a project we were part of, where we sequenced 100 kidney tumors, whole genome, and in the end we found 575,000 somatic variants. Where do you start, and how do you look at those? (This slide is behaving weirdly, sorry; it shouldn't have changed.) The point I wanted to make here is that out of all of these variants, only a subset were actually coding variants, hitting gene sequence as opposed to being non-coding, and so adding annotation to the variants is quite important, quite key. Even once you have variants, and even once you have good scores for whether they're real or not, having annotations like 'this is a variant that hits a gene' is of course going to be very important for interpreting the file. There are lots of tools for annotating variants; the one we're going to use in the practical is called SnpEff, and that's what it does. It annotates your variants to say whether each one is coding or non-coding, and if it's coding, whether it's a synonymous change, a non-synonymous change, a stop gain, and so on. It also gives you some basic prioritization: if it's a stop gain in a gene, it has a potentially high impact; otherwise moderate, low, and so on. Having this kind of information about what the variant might be doing is obviously very relevant.

So what we're going to do in the practical is go from the BAM files to the raw VCF using GATK, then use GATK to filter a bit and SnpEff to annotate the variants. So, time to actually do it.
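As a preview of the kind of post-processing you can do once SnpEff has run, here is a hedged sketch of pulling the effect, impact, and gene name out of the ANN entries in the INFO column. The sub-field layout assumed below (Allele|Annotation|Impact|Gene_Name|...) is the commonly documented one, and the file name is hypothetical; check the header SnpEff writes into your own VCF for the exact field order.

```python
# Hedged sketch: list genes with HIGH-impact annotations (e.g. stop gains)
# from a SnpEff-annotated VCF. The ANN sub-field order is an assumption;
# verify it against the header of your own output.

def snpeff_annotations(info_field):
    """Yield (effect, impact, gene) tuples from a VCF INFO string."""
    for item in info_field.split(";"):
        if item.startswith("ANN="):
            for ann in item[len("ANN="):].split(","):   # one entry per transcript
                parts = ann.split("|")
                if len(parts) >= 4:
                    yield parts[1], parts[2], parts[3]  # effect, impact, gene name

def high_impact_genes(vcf_path):
    genes = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            info = line.rstrip("\n").split("\t")[7]
            for effect, impact, gene in snpeff_annotations(info):
                if impact == "HIGH":
                    genes.add((gene, effect))
    return genes

if __name__ == "__main__":
    for gene, effect in sorted(high_impact_genes("annotated.vcf")):  # hypothetical file
        print(gene, effect)
```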