Okay, so in this presentation we will go a little bit deeper into what variant calling from next-generation sequencing data actually is, and what kind of calculations are done under the hood, because it's a really important part of variant analysis. What we have done up till now in this workflow, yesterday and the first part of today, is align our reads to the reference genome, add the read groups (that step is not in this overview) and mark the duplicates. After the duplicate marking there is typically a base quality score recalibration, more about that later in this presentation, and then comes what I would say is the core: the variant calling itself, and how that works you will learn in this presentation. It takes your alignment and tries to turn it into a VCF. So it tries to say: at this position we really have a homozygous variant, at that position a heterozygous one, and the same holds for insertions and deletions.

In order to call variants, we have to answer three important questions. One very obvious question is: at a certain location, do we see variation, yes or no? Is there variation from the reference according to the alignment? The second question is: if we decide there is something going on at that location, what are the alleles? Of course we have certain assumptions; for example, if an organism is diploid, we expect up to two different alleles at a location, and if your organism is triploid or tetraploid, you will have other assumptions. So what are these alleles, and do we have enough evidence that these alleles actually exist? And the third question: given evidence for certain alleles, what is their genotype? That again depends on our assumptions about the organism, but if we are looking at a diploid, for example a human: is the position homozygous reference, so basically no deviation from the reference, is it heterozygous, or is it homozygous alternative, so different from the reference? We answer this by estimating the allele counts in the sample, where allele counts are the numbers of reads supporting each allele.

HaplotypeCaller does this in four steps, and it does it all at the level of the haplotype. Ten or fifteen years ago you typically worked at the level of the individual variant, but these algorithms have improved in such a way that they take haplotype information into account, and that has definitely improved the variant estimates. First, it identifies active regions: it simply looks for any deviation from the reference. Then it tries to assemble haplotypes, and it gets rid of unlikely haplotypes, haplotypes we do not have to take into account, by pruning. Then, for each read, it calculates a likelihood of whether it supports each haplotype, yes or no, so you basically get counts of reads supporting each haplotype. And based on those counts, a genotype is assigned: homozygous reference, heterozygous, or homozygous alternative.

So, to answer the question "what are the alleles", let's say we have this situation over here. This is the truth: we have two chromosomes and four variant positions.
At position W we have a homozygous alternative, at position X a heterozygous, at position Y a homozygous reference, and at position Z also a heterozygous, but the alternative alleles at X and Z are in repulsion. We can generate reads from this, and in an ideal world those reads would give a perfect representation. What you can do from those reads is create a graph based on the alleles present in them, connecting alleles according to whether they occur in the same read, yes or no. In this case our reads span all four variants, which is a very exceptional situation, but for the sake of explanation it's a nice example. At the third position we always have the alternative, so we start with only the alternative allele, and then we have five reads supporting going from alternative to alternative at the next position, five reads supporting going from alternative to reference, five reads supporting going from reference to reference, and the same counting applies to the other reads. So this is a perfect graph representing all the haplotypes in this alignment. However, what can also happen is, of course, a sequencing error. For example, at position Y one allele might be a sequencing error, and then our graph is extended. But we see that this extension of the graph has very little support: we have quite good support along this path and along this path, but very poor support along this one, because it's supported by only a single read. What GATK then does is prune off those unlikely haplotypes that are supported by only one or a few reads, and it ends up with the most plausible haplotypes.
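To make the pruning idea concrete, here is a toy sketch, not GATK's actual assembly-graph implementation, just an illustration under the simplifying assumption that every read reports an allele (0 = reference, 1 = alternative) at each of the four positions: we count how often each transition between alleles at consecutive positions is seen, and drop transitions with too little support.

```python
from collections import Counter

# toy reads matching the example: 5 reads per true haplotype,
# plus one read carrying a sequencing error (an alt allele at position Y)
reads = [(1, 1, 0, 0)] * 5 + [(1, 0, 0, 1)] * 5 + [(1, 0, 1, 1)]

def haplotype_graph(reads, min_support=2):
    """Count allele transitions between consecutive positions and
    prune edges (transitions) seen fewer than min_support times."""
    edges = Counter()
    for read in reads:
        for pos in range(len(read) - 1):
            edges[(pos, read[pos], read[pos + 1])] += 1
    return {edge: n for edge, n in edges.items() if n >= min_support}

# the two edges created by the error at Y disappear; the well-supported
# transitions of the two real haplotypes remain
print(haplotype_graph(reads))
```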
Another important part of answering the question "what are the alleles" is the indels, and that is not a very easy problem, because aligners are very fast software. You saw in the exercises that we had about 150,000 reads and performed an alignment against a still pretty large reference, the entire chromosome 20, in, what was it, like 30 seconds or so. Aligners are that efficient because they make some compromises, and one of those is that they do not do very precise insertion/deletion alignment: how an indel is aligned often depends very much on where it ends up in the read. So in a normal alignment, an insertion like this one typically looks very, very messy. But in the process of calling variants an indel realignment takes place, basically at this haplotype assembly step, where all the indels are nicely realigned and can be called as an actual variant.

So now we have information about the active sites and, by building this graph, about the possible alleles. Then we have to estimate what the actual genotype is: heterozygous or homozygous? That is also quite a relevant process for understanding what variant calling is about. Let's take the example that at a site we count nine bases, so we have a coverage of nine there: we have sequenced that site nine times, and these are the aligned reads. What we see is that out of those nine reads we count four bases to be C, which in this case is the alternative allele, and five to be A. What you can do is calculate a probability: if you sample nine times from the DNA of your organism, under the assumption that the chance of finding a C or an A is 0.5, which would be the case at a heterozygous site, you can calculate the chance of this outcome. So say we assume this individual is heterozygous at that site: then there is a 50% chance that we draw an A and a 50% chance that we draw a C, so p is 0.5 for drawing the alternative allele. You can simply use a binomial distribution, and the probability of drawing four C's out of nine is then about 0.25. If we turn that around into a hypothesis, p equals 0.5, meaning this organism is heterozygous at this site, it becomes a likelihood: the likelihood of the hypothesis p = 0.5, given four C's out of nine, is 0.25.

Now say we have a different site where we draw nine alternative alleles, nine C's. Typically you would say this is very likely to be a homozygous alternative, because C is the alternative. So we take the hypothesis that p equals 1, meaning that every draw from this organism yields the alternative allele, and again we can calculate a likelihood. We can first test the heterozygous hypothesis: it's still possible to draw the alternative nine times, so even under that relatively unlikely scenario the likelihood isn't 0, just very low, because it is possible but improbable. If we instead take the hypothesis of a homozygous alternative, we get a relatively high likelihood. So if we have to choose between a heterozygous and a homozygous alternative at this site, the homozygous alternative has a much higher likelihood, and we can compare the two, for example with a likelihood ratio, and see how much higher it is. This is how you can estimate a genotype: we can test all three hypotheses for each situation. We can even take the hypothesis p = 0, only reference alleles in the individual; its likelihood here will be 0, because the jar contains no red marbles, no alternative alleles at all, so it's impossible to draw nine alternative alleles from it. Is this all clear? You can use a simple binomial distribution to estimate the genotype, and in these situations it is already quite obvious that the first site is most likely heterozygous and the second most likely homozygous alternative.
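In Python this is a one-line binomial probability mass function; a minimal sketch reproducing the numbers above (the ~0.25 heterozygous likelihood and the impossible cases under p = 0 and p = 1):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k alternative alleles in n draws, given
    alt-allele probability p (0 = hom ref, 0.5 = het, 1 = hom alt)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(round(binom_pmf(4, 9, 0.5), 3))  # 0.246 ~ 0.25: het fits 4 C's out of 9
print(round(binom_pmf(9, 9, 0.5), 4))  # 0.002: nine C's under het is unlikely
print(binom_pmf(9, 9, 1.0))            # 1.0: certain under hom alt
print(binom_pmf(9, 9, 0.0))            # 0.0: impossible under hom ref
```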
However, there are also situations where it's less obvious, and I have a question with such an example. The question to you is: what do you think the genotype should be? Let me share my browser screen. You can go to vevox.app and enter this number, or scan the QR code. So, we are again sequencing a certain site nine times, but instead of roughly 50-50 we have eight bases reference and one base alternative, so eight A's and one C. What would you consider the most likely genotype at that site? I think most of you have answered, so I will stop it. Almost 50-50, good. So most of you said it's probably homozygous reference, because by far most of the bases are reference, and some of you said it's heterozygous, because having one base supporting an alternative suggests that there is an alternative. I would agree with both of you, especially if we assume this binomial distribution, because the binomial distribution doesn't take error into account at all.

So let's have a look at that situation. This is a visual representation of the question we just had: we sequence the site nine times and get eight reference and one alternative. In principle, just by looking at it, you would say the C is probably an error, so we test the hypothesis that we are dealing with a homozygous reference, the hypothesis p = 0. But we find one alternative, and under a strict binomial distribution that likelihood is exactly 0, because it is impossible to draw a red marble out of a jar of only green marbles. So the answer that seemed most likely to most of you actually gets a likelihood of 0 from the binomial distribution. The highest likelihood goes to the case p = 0.5, because there the observation is at least possible; the likelihood is relatively small, but it is the highest, since p = 1 is also completely impossible (under p = 1 there are only red marbles, only C's, in the jar). So we cannot work with this strict binomial distribution; we need something that takes error into account. And that's why base quality is such an important parameter in variant calling.

Base qualities are very high nowadays: typically, if you sequence an Illumina library, more than 90% of the bases have a base quality over 30, which means really, really high accuracy, more than 99.9%. But it still means we can expect errors in our data set. For example, a base quality of 20, which is also reasonably high, means 99% accuracy: if you take a random base, you can be 99% sure it's the correct base, but 1% of those bases are still incorrect. So if you have 100 samples and each has 40X coverage at a certain site, then at that particular site alone you already expect 40 errors. These errors occur quite frequently, even with very high base qualities. So even though we have been sequencing very accurately, we have to take error into account, and we can do that with a simplified model, at least, the model that was used at first, when people really started working on variant analysis of next-generation sequencing data. Basically it takes into account the genotype under test, where zero means homozygous reference, one heterozygous, and two homozygous alternative; and the ploidy, which is two if you're looking at human.
And then, an important one: it takes the base error into account, and it does so for the individual base in every read. And then, of course, the number of bases at the site and the number of bases that equal the reference, so basically the counts. From that information you calculate a likelihood that actually takes the base error, the base quality, into account.

Martin, you have a question? Yes, sorry, I don't understand the genotype zero, one or two; how can it be zero? OK, so genotype zero means homozygous reference. So it's a code or something like that? Yes, and it's actually very common to use this notation, so it's a relevant one to keep in mind: zero often means homozygous reference. We're looking at one site at a time here, right? So at one site, in a particular active region, if you count zero alternative alleles, it's homozygous reference; one is heterozygous, and two is homozygous alternative. It's basically counting the alternative alleles. And could you adjust this if you have a triploid organism or something like that? Yes, it works the same way, counting alternative alleles. Say you have a tetraploid: the ploidy is then four, and zero would again be homozygous reference, one would be simplex, two duplex, three triplex, and four quadruplex.

OK, so we can apply this model to the example we just had. We still have eight reference alleles and one alternative allele, and let's make it a bit easier by saying all bases have an equal base quality of 20. Then we can calculate likelihoods for the different hypotheses: one for homozygous reference, genotype zero, one for heterozygous, and one for homozygous alternative. The homozygous alternative is very unlikely: it would mean we made eight mistakes in the bases we read as reference and only the single alternative base is correct. So it's probably between the other two, homozygous reference and heterozygous. Most of you said homozygous reference, and indeed, taking the error into account, with a base quality of 20, an error rate of 0.01, the likelihood of homozygous reference comes out at about 0.009, and the likelihood of the heterozygous hypothesis at 0.002.

Like many probabilities in bioinformatics, these can be converted to Phred-scaled scores, just like mapping quality and base quality. A high Phred-scaled number means a low probability. In this case, homozygous reference gets a Phred-scaled likelihood of 20, and heterozygous gets a Phred-scaled likelihood of 27. We are looking for the highest likelihood, which means the lowest Phred-scaled likelihood. So if we have to select the most likely genotype, it is the one with the highest likelihood and therefore the lowest Phred-scaled likelihood. Based on an error probability of 0.01 per observation, with eight reference alleles and one alternative allele, it is most likely that we have a homozygous reference at this site, this one.
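A minimal sketch of this error-aware model, assuming a single shared base quality for all nine bases and, matching the numbers above, no binomial coefficient (the likelihood is just the product of per-base probabilities); it reproduces the likelihoods of about 0.009 and 0.002 and the Phred-scaled values 20 and 27:

```python
import math

def genotype_likelihoods(n_ref, n_alt, base_q, ploidy=2):
    """Likelihood of each genotype (0..ploidy alternative alleles)
    given ref/alt base counts and one shared base quality."""
    eps = 10 ** (-base_q / 10)  # Phred quality -> per-base error probability
    liks = []
    for g in range(ploidy + 1):
        # chance of reading an alt base if the genotype carries g alt alleles
        p_alt = (g / ploidy) * (1 - eps) + (1 - g / ploidy) * eps
        liks.append(p_alt**n_alt * (1 - p_alt) ** n_ref)
    return liks

liks = genotype_likelihoods(n_ref=8, n_alt=1, base_q=20)
print([f"{l:.4f}" for l in liks])                  # ['0.0092', '0.0020', '0.0000']
print([round(-10 * math.log10(l)) for l in liks])  # [20, 27, 160]
```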
And we can actually get a genotype quality out of these numbers. A genotype quality is basically how sure you are that the genotype you are calling is the actual genotype, and it is the difference between the most likely genotype and the second most likely genotype. That's very often how you work with likelihoods: you compare the most likely one to the second most likely one to get a measure of confidence in your estimate. It's a very simple calculation: you take the second lowest Phred-scaled likelihood, which here is the heterozygous one, and subtract the lowest Phred-scaled likelihood, the homozygous reference one: 27 minus 20 equals 7, and that is the genotype quality. It is an estimate of the probability that you are actually calling the wrong genotype, and since it is again a Phred-scaled score, we can convert it back: the probability of a genotype error here is about 0.2, so a 20% chance that we are estimating this genotype incorrectly. So with this, what I've shown is that we take the error into account, we estimate the most likely genotype, and we estimate the probability that we are estimating the wrong genotype. Ideally, the difference between homozygous reference and heterozygous would be as large as possible, to give as high a genotype quality as possible.

A question for you, just to check whether the message came across. For a variant we find the Phred-scaled likelihoods 80, 57, and 300 for the genotypes homozygous reference, heterozygous, and homozygous alternative, respectively. What is the most likely genotype, and what is the genotype quality? OK, I think most of you have answered by now; don't worry too much, I will explain it. I will stop. Very nice, most of you were correct, well done. So, remembering the last slide: with Phred-scaled likelihoods we are looking for the highest likelihood, and the highest likelihood corresponds to the lowest Phred-scaled likelihood. The lowest here is 57, which corresponds to the heterozygous genotype, so we estimate that this individual is heterozygous at our site. And the genotype quality is the second lowest Phred-scaled likelihood minus the lowest: the second lowest is homozygous reference with 80, the lowest is the heterozygous one with 57, and 80 minus 57 is 23. So we call a heterozygous genotype with a genotype quality of 23.
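The same quiz answer as a short sketch, picking the genotype with the lowest Phred-scaled likelihood and computing the genotype quality from the two lowest values:

```python
def call_genotype(pls):
    """Return (index of most likely genotype, genotype quality)
    from a list of Phred-scaled genotype likelihoods."""
    ordered = sorted(range(len(pls)), key=lambda i: pls[i])
    best, second = ordered[0], ordered[1]
    return best, pls[second] - pls[best]

# PLs for hom ref, het, hom alt from the quiz
print(call_genotype([80, 57, 300]))  # (1, 23): heterozygous, GQ 23
```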
You can also put this in a graph, where you see quite nicely that at allele fractions where you yourself would start to doubt, this model also doubts. On the x-axis we have the number of alternative alleles out of nine, and on the y-axis the genotype quality. Say we count four alternative alleles: the estimated genotype, on the other axis, is one, heterozygous, and the genotype quality is high. If you are in a region that would make you doubt, for example eight out of nine alternative, you get a low genotype quality, and in that case the estimate is a homozygous alternative genotype.

So these base qualities are important: we rely on them heavily to estimate the genotype, and also to estimate the haplotypes. However, the software provided by Illumina looks at each base individually, while the base quality estimate can depend very much on the context. Typically, in a homopolymer, meaning for example four A's in a row, the base quality is overestimated, so you have to correct the base qualities if you have homopolymers in your region. The error rate estimated by the Illumina software is not the real error rate, and therefore it helps to do base quality score recalibration on the existing base quality scores. What that does is this: say you have a set of sequenced bases with a certain reported base quality score. The empirical quality score, which you can actually calculate if you are sequencing a known sequence, is in many cases actually higher, so the reported base quality score is often underestimated. Base quality score recalibration builds a model based on the context of those bases and corrects them, so that the corrected base quality and the empirical base quality correspond to each other. This model depends very much on, for example, the lane, the sample, and the library, and that is why you have to specify these at the individual lane level in the read groups. Is this all clear regarding genotype estimation and base quality? If so, that's great.
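A minimal sketch of the idea behind an empirical quality score, under the assumption that we can count true mismatches at known, invariant sites (the +1/+2 smoothing is my addition so that zero observed mismatches doesn't give an infinite quality):

```python
import math

def empirical_quality(n_mismatches, n_bases):
    """Empirical Phred quality from an observed mismatch rate."""
    err = (n_mismatches + 1) / (n_bases + 2)
    return -10 * math.log10(err)

# hypothetical: bases reported as Q20 (1% expected error) show only
# 5 mismatches in 10,000 observations -> empirically about Q32,
# so the reported Q20 underestimated the true quality
print(round(empirical_quality(5, 10_000)))  # 32
```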
So after we run HaplotypeCaller, we get an output that is of course very important for variant analysis: the VCF. A VCF, like many file formats in bioinformatics, has a header, and after the header it is just a tab-delimited file. As you remember from, for example, a BAM file, there the header lines all started with @; in a VCF they start with a hash, or actually two hashes, and the header line of the tab-delimited part starts with a single hash, as you can see over here, starting with CHROM. Then we have the tab-delimited part, in this case showing three variant sites with variant information: the position and the chromosome, some information about the reference and alternative alleles, more about that later, and the genotype information for the individual samples, if you have more than one sample in your VCF; in this case I have included the father and the mother.

Janko, I have a question. Yes? Since we are manipulating VCF files a lot, I was wondering, from your experience, what would you recommend as a VCF validator tool? It's a little bit out of scope for what we're looking at now, but it just popped into my mind, so I thought I'd ask; we have some, but I'm not quite sure whether they are the best. So what you mean by validation is whether a VCF adheres to the VCF specifications? No, I wouldn't know; I haven't done that very often either, so I can't really help you with that, sorry, you'd have to Google it. No worries; if somebody has a hint from actual bench work, I would be grateful. If somebody knows, you can also write it on Slack, with a link or so, so it's easy to find.

Okay, so this VCF format is a very specific format, and it's not really made for humans to read, because you have to go back and forth in the file a lot. For computers, though, it is a very efficient way of storing variant information, because it is a super flexible file format: the header describes what is and what can be stored in the tab-delimited part of the VCF. Let me first explain the first columns of the VCF. As I said before, we have the position, so the chromosome and the base position of the variant specified on a certain line. We can specify an identifier for that specific variant, a dbSNP identifier, for example. Then the reference allele is specified, the allele you will find in your reference genome, and then the alternative allele. There can also be multiple alternative alleles, but in this case there is only one, so we have a biallelic variant, a biallelic SNP. Then there is a quality score for the entire variant, so how sure the variant caller was that there was or wasn't a variant at this position. Then we have the filter column. Over here the filter column is empty, we see only dots, but we can specify whether we filtered the variant or not. It can be anything, but in this example there are two ways of filtering: either it's flagged as low quality, so low variant quality, or it passes and PASS is specified. Here we haven't applied any filter, so there's nothing in it, but if you do apply filters, it's specified there, and that's something that's used very frequently when you work with VCF files.

Then the INFO column gives us general information about the variant as a whole: not the individual genotypes of the samples, but the variant as a whole in the entire cohort. It is a semicolon-separated set of fields, and what the abbreviations in those fields mean is specified in the header. For example, we have AC over here, and if we look it up we see: allele count in genotypes, for each alternative allele, in the same order as listed. So apparently we have an allele count of one for the alternative allele. Then we have AF, meaning allele frequency, and AN, which we also have to look up: total number of alleles in called genotypes. So how many alleles did we call? It says six here, and that is because the son was also in this VCF, but I left him out for simplicity; so we actually have six in here. And DP means the approximate read depth, so the total depth. So this is the general information about the variant.

Then the FORMAT field describes the sample-specific information, the individual genotypes. It very often starts with GT, and what does GT mean? You can look it up in the header again: GT means genotype. That is the information about whether an individual is homozygous reference, heterozygous, or homozygous alternative, and it is specified in a rather special way: it gives the first allele and the second allele, where a 1 is the alternative allele and a 0 is the reference allele. So 0/1 in this case means heterozygous, and 0/0 means homozygous reference. Then we have AD.
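As a sketch of how thin the parsing layer really is, here is a reconstruction of this record in Python; the line itself is illustrative (position, QUAL and the mother's PL values are invented), but the column layout and the INFO fields follow what we just walked through:

```python
def parse_info(info_field):
    """Split a semicolon-separated INFO field into a dict;
    flags without '=' become True."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True
    return info

# columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT father mother
record = ("20\t10004\t.\tA\tC\t50.1\t.\tAC=1;AF=0.167;AN=6;DP=15\t"
          "GT:AD:DP:GQ:PL\t0/1:3,5:8:58:143,0,58\t0/0:2,0:2:6:0,6,49")
fields = record.split("\t")
print(fields[0], fields[1], fields[3], fields[4])  # 20 10004 A C
print(parse_info(fields[7]))  # {'AC': '1', 'AF': '0.167', 'AN': '6', 'DP': '15'}
```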
AD is the allelic depth of the reference and the alternative allele, and that can also be very nice to look at. In this case, this individual apparently has a depth of eight, with three reference and five alternative bases, and that's why it is estimated to be heterozygous. In this other individual we have a very low coverage of only two, and we find only two reference alleles, so it's estimated to be homozygous reference.

Korra has a question. Sorry, in the INFO field I see 15 for the depth, and here it's eight; what is the difference between these? So that one is the total depth, because it's for the entire variant, for the entire cohort: how many reads are actually supporting the existence of the variant in the entire cohort. In the father we apparently have eight, which you can also get from this DP over here: a depth of eight, three plus five. The mother has a depth of two, so together that is ten. And as I said, there was also a column with the son, but I didn't include it here because otherwise the font would have been too small. OK, so the son would be here, and the son would have the rest. Indeed, that's exactly how it works.

OK, then, DP we have already seen, the total depth in that particular individual. Then we have GQ, standing for genotype quality; you can see it over here, and that's what we just estimated, right? The second lowest Phred-scaled likelihood minus the lowest. Apparently it is 58 here, because the second lowest Phred-scaled likelihood is 58 and the lowest is zero, so 58 minus zero gives a genotype quality of 58. Those Phred-scaled likelihoods, I should have said, are specified in the last field, PL in this case: 143 for homozygous reference, zero for heterozygous, and 58 for homozygous alternative. So the most likely genotype is heterozygous, and the second most likely is homozygous alternative. We can also have a look at the mother, who has very low coverage: there we only have a genotype quality of six, which is very low, just because it is also quite possible that we are looking at a heterozygous position.

Okay, that is the VCF. Is that all clear to you? So that is how you can manually read a VCF: you can check what kind of information is stored in INFO, general information, and in the FORMAT field, information about specific individuals, specific samples. The VCF format is super flexible, meaning there are actually no fixed rules about what can be stored in the INFO, FORMAT, or FILTER fields; you can store any kind of information in both INFO and FORMAT. Typically, though, tools like FreeBayes and GATK use very similar abbreviations meaning the same thing, so there is some consensus, but in principle you can store any kind of information about the individual variant and the genotypes of the samples in the INFO and FORMAT fields.
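The per-sample part works the same way: FORMAT gives the keys and each sample column gives the values, colon-separated. A small sketch, reusing the father's values from above:

```python
def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys with one sample's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

father = parse_sample("GT:AD:DP:GQ:PL", "0/1:3,5:8:58:143,0,58")
ref_depth, alt_depth = map(int, father["AD"].split(","))
pls = sorted(int(p) for p in father["PL"].split(","))
print(father["GT"], ref_depth, alt_depth)  # 0/1 3 5
print(pls[1] - pls[0])                     # 58, matching the GQ field
```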
So let's look at some more extensive examples of how things are stored in a VCF, for example insertions and deletions, indels. Say the reference allele here is GTC, and we have two alternatives: either it is a G, so we lose the T and the C, or it is GTCT, so we gain a T. If we then look at the genotypes of the individuals, both are heterozygous, but the first individual is heterozygous with GTC as the reference allele and G as the alternative, while the second one also has GTC as the first allele but GTCT as the alternative, so a different alternative allele. That's how multi-allelic variants can be stored in a VCF, with basically whatever number of alternative alleles is supported by your cohort.

Then there is other information that can be stored in a VCF, and that's haplotype information. You have seen the slashes that separate the different alleles, but you can also have pipe symbols separating them, and a pipe symbol specifies that the alleles are phased with respect to each other. So we can also store haplotype information in the VCF. That does mean you need some additional information about which variants are phased with each other, but say all four of these variants are phased together: then in individual one we have one haplotype of 0-0-1-0 and another of 0-0-2-0, and two different haplotypes again in the second individual. So with slashes there is no information about phasing, you do not know to which haplotype the individual alleles belong, but with pipes you do have that information.
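A tiny sketch of how that distinction is read in practice; the only thing that changes between unphased and phased genotypes is the separator:

```python
def read_genotype(gt):
    """Split a VCF GT value into allele indices plus a phased flag:
    '|' means phased, '/' means unphased."""
    phased = "|" in gt
    alleles = [int(a) for a in gt.replace("|", "/").split("/")]
    return alleles, phased

print(read_genotype("0/1"))  # ([0, 1], False): unphased heterozygous
print(read_genotype("1|0"))  # ([1, 0], True): alt allele on the first haplotype
print(read_genotype("1|2"))  # ([1, 2], True): phased, two different alt alleles
```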
A question for you: now that we have learned about VCF files, which information cannot be stored in a VCF? We learned that it's very flexible and you can store a lot of different information in there, but one of these cannot be stored. Okay, I'll stop. Very nice, all of you have a pretty good idea of what can be stored in a VCF. So haplotype and phasing information can be stored, that's what we just discussed with these pipe symbols, but the sequence surrounding the variants cannot. Only the variant information is stored: the sequence of the reference allele and the sequence of the alternative alleles, and for the rest a dot, nothing, no information about the sequence surrounding the variants. So typically, if you do downstream analysis with VCFs, you also have to provide the reference genome in order to get the sequence surrounding the different variants.

Next question. Yes, I'm not really very clear on the phasing: are these alternatives for a particular haplotype, or something like that? No, these are the haplotypes; phasing is actually the construction of haplotypes. Okay, all right. I will have more about that in a coming slide.

My question: can you maybe go back to the other VCF you were showing? Yeah, here. So on that slide the reference is GTC, which can have as alternative either G or GTCT; that's technically a small indel, do I understand correctly? Yeah, indeed, this is really how an indel is represented in a VCF: we have one alternative allele with a deletion compared to the reference and one allele with an insertion compared to the reference. Perfect.

Martin. Yeah, just following up: you mentioned that the VCF file is super flexible, so in principle you can annotate things as you like, put in your gene names and whatnot, I guess. So in principle you could also add the sequence which surrounds a particular SNP. Yeah, yes, you can. Is this considered bad practice, I guess? Well, I would say so, because the VCF is not the place to store the reference sequence, so it's probably best not to store reference information if possible. However, sometimes you actually do want to store information about non-variant regions in your VCF, and more about that in the coming slides; that is a nice bridge.

Yes, I was also wondering, I didn't try this myself yet, but genome browsers, like the ones at EBI, sometimes allow you to upload files, VCF files I'm sure of it, and then they probably need to fulfill the standard, which we were already discussing a little bit, I guess. No, not entirely; usually these are quite flexible. What we will do in the exercise on visualization is actually load a VCF in IGV, and there you will see what kind of information IGV uses to visualize these variants. Okay, cheers.

Okay, so let's say we have a certain situation with two samples. This is the truth, which we do not know beforehand, because we are going to sequence those two samples. We have nicely phased information again, three variants and, in total, four different haplotypes: haplotype one over here, haplotype two over here, the third haplotype over here and the fourth over here. So we have sequenced them and we did variant calling, and the variant calling went very well, because we have a good representation of what is present in, for example, sample one. We have the positions, and we see that at position X we have a homozygous alternative in sample one, at position Y a heterozygous, and we have nicely phased them, right? You can see the orange haplotype over here corresponding to this one in the VCF, and the 1|0 haplotype corresponding with the purple one in the VCF. We didn't do anything with position Z, because it's homozygous reference, and if there is no variant there, we are not going to call one when we do the variant calling on the individual sample. Sample two looks like this: again we have called the genotypes correctly and we also have phasing information, so we have phased the orange haplotype, 0-1-1, as well as the purple haplotype, 1-0-1; the third and fourth haplotypes were also identified. That's how it would look.

Then at some point we want to combine those VCFs into a cohort, maybe because we want to use that VCF for, for example, a genome-wide association study, and then we find ourselves with an issue: in sample one we have no information about site Z, because it was not different from the reference, so we never called a variant there. We're missing information, while we do know there is a variant at that position, because we called one in sample two. What should we do there? That's my question for you: what do you think would be the solution for this missing genotype problem? We have a missing genotype in sample one simply because it was homozygous reference. There's no right or wrong here, by the way; it's just what you think the best solution is. Okay, I think most of you have answered, so we'll stop. We have almost a tie. Yeah, cool; this of course also relates very much to Martin's question. All three solutions could work, or are valid, I would say.
I think the third solution, filling in missing values there, is a bit of a pity, because we do have that information, right? We have sequenced that position in sample one. If we had no sequence there, that would of course be a missing value, but we have sequenced it, so we're pretty sure it really is homozygous reference; putting a missing value there would be a waste. So I would go with one of the other two.

We could do a variant call on all samples in one go; that is definitely a valid way to look at it. However, if we have a large cohort, and nowadays cohorts sometimes have more than 10,000 samples, then to do the variant analysis in one go we need massive computational resources: we have huge, huge BAM files and we have to run that variant call over all of them at once. And if we decide to add samples to our cohort, say another hundred samples, we would have to redo the variant calling entirely, with all the samples together; again a huge computational effort.

Another way is to store information on non-variant regions in the VCF: not necessarily the sequence of those regions, but their quality, so for example what the coverage was and what kind of base qualities we got there, because then we know: we have sequenced this region and we didn't find a variant, so it should be homozygous reference. The first solution, calling all samples in one go, is what most variant callers that have been around for a while do; the developers of FreeBayes, for example, would suggest that as the best solution. Nowadays things move more and more in the direction of the solution proposed by the GATK developers, which is doing a variant call on the individual samples, but storing non-variant information in the VCF.

So this is just a summary of what I said: most variant callers call all samples in one go, but that becomes computationally intensive, and if you have a new sample you have to redo the entire variant call. GATK, and more and more variant callers nowadays, support gVCFs, and a gVCF is a specific type of VCF that can store information about non-variant regions. We will have a look at these gVCFs in the exercises, and you will also see a bit of how a gVCF stores information about regions and their quality, basically their coverage and their average base quality.
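To give an impression of what that looks like, here is an illustrative, made-up gVCF reference block in the GATK style: the symbolic <NON_REF> allele plus an END tag cover a whole non-variant stretch in one record, together with its depth and genotype quality.

```
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele">
#CHROM  POS    ID  REF  ALT        QUAL  FILTER  INFO       FORMAT              sample1
20      10001  .   T    <NON_REF>  .     .       END=10050  GT:DP:GQ:MIN_DP:PL  0/0:34:99:30:0,90,1350
```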
Then some information about other software. In this course we are using GATK a lot, but GATK is of course not the only software for variant calling, and there is also additional software you can use, for example, to work with VCFs; here are just a few examples. FreeBayes is, I think, the most frequently used alternative to GATK; a big advantage of FreeBayes is that it's relatively straightforward to use, basically a single command, with a lot of options of course, but relatively straightforward. Nowadays it can also produce gVCFs. I would say it's a good alternative; especially when variant calling, it can help to use multiple variant callers, and then people typically use FreeBayes as the alternative to GATK. If you are working with VCFs in general, combining them, filtering them and so on, there are many things you can do with a VCF, and there are two tools that are frequently used for that: BCFtools, developed by the same people that developed SAMtools, and VCFtools, with similar functionality, as an alternative. If you want to do real haplotype assembly, the haplotype information provided by GATK is relatively limited, so if you really want to assemble haplotypes from short or long reads, I would suggest you have a look at WhatsHap: it has a lot of nice functionality and it's basically the most frequently used software for haplotype assembly. Another alternative to GATK and FreeBayes that is becoming more and more popular is DeepVariant. DeepVariant is developed by Google, it is based on deep learning, and it performs variant calling on both short and long reads; if you were calling variants from Oxford Nanopore Technologies data, for example, DeepVariant would be one variant caller to look into. And of course this list is much longer; this is just a selection of tools that I have been using quite often, but there are many more.

So, to show you where we are in the whole process of the exercises: yesterday and part of today we managed to prepare the BAM files for base quality score recalibration, so we have the duplicates marked and the read groups added. In the next exercise you will actually perform the base quality score recalibration. For base quality score recalibration it's important to realize that you always need a reference set of variants, because the recalibration has to ignore sites that are truly variable, and that's quite important. For human data it's usually not really an issue to get existing variant data, existing variant sites; for other organisms that can be a bigger challenge. For dog and cow, for example, there are resources available, but for many other mammals there aren't, and for a lot of plants this information isn't available either. What you typically can do then is first do a variant call without base quality score recalibration, take that variant call and use it as input for base quality score recalibration, and then continue with the variant calling again. In the exercise we're working with human data, so in that sense we're lucky to have existing variant information to use as input for base quality score recalibration. After base quality score recalibration we will do the variant calling, and then for each sample we will have an individual VCF, a gVCF, that also stores information about non-variant regions. Then we make a database out of these gVCFs, and out of that we extract a regular VCF with the entire cohort, with all the individual samples; in our case that's only three, but you can imagine those cohorts can be pretty large.
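As a rough sketch of how those exercise steps map onto GATK4 commands (the file names and the interval are placeholders; check the exercise hand-outs for the exact invocations used in this course):

```python
import subprocess

ref, bam, known = "ref.fasta", "sample.bam", "known_sites.vcf.gz"  # placeholders

steps = [
    # model systematic base-quality errors, skipping known variant sites
    ["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
     "--known-sites", known, "-O", "recal.table"],
    # write a new BAM with recalibrated base qualities
    ["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
     "--bqsr-recal-file", "recal.table", "-O", "sample.recal.bam"],
    # per-sample calling in gVCF mode (keeps non-variant blocks)
    ["gatk", "HaplotypeCaller", "-R", ref, "-I", "sample.recal.bam",
     "-ERC", "GVCF", "-O", "sample.g.vcf.gz"],
    # import the per-sample gVCFs into a datastore, then joint-genotype
    ["gatk", "GenomicsDBImport", "-V", "sample.g.vcf.gz",
     "--genomicsdb-workspace-path", "cohort_db", "-L", "chr20"],
    ["gatk", "GenotypeGVCFs", "-R", ref, "-V", "gendb://cohort_db",
     "-O", "cohort.vcf.gz"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)
```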