There we go. So this presentation gives you an overview of some of the most frequently used file types in bioinformatics related to next-generation sequencing. The file types we will discuss are, first, the FASTA file. You've already seen a bit of it. That's just a file to store, for example, genomes, but nucleotide sequences in general, and also amino acid sequences, can be stored in a FASTA file. Then the FASTQ file, which is where sequencing results are stored. Usually you store the actual sequence of each read together with the base qualities in the FASTQ file. Then the SAM/BAM format, which is where you store alignments. We will spend quite a bit of time on that because it's important and it's not super easy to understand at first. There's really a lot of information in there, and you have to understand a little bit how that works. Then the BED file, which is a relatively simple file specifying regions in a genome; the GFF file, which stores annotations; and the VCF file, which stores variants. So a GFF file has, for example, the positions of different genes and transcripts in there, and a VCF file, for example, SNPs and indels. So the FASTA file is one of the simplest and most frequently used file types in bioinformatics. It stores a plain sequence, with the extension .fasta or .fa, and you can store both nucleotides and amino acids in there. It looks like what you see on the bottom right of the slide. The title starts with a greater-than sign, and you can have any kind of characters in there. Then there's a new line, so an enter, and then the sequence, and you can have multiple lines of sequence. Only when a line starts with a new greater-than sign does a new sequence start. So you can have multiple sequences in a FASTA file. A very useful command that can be used on a FASTA file is grep -c. So grep finds a string in any file, and -c means count.
So you count the number of matches, and this little part over here, the caret, means that the line starts with a greater-than sign. So what this command reports is the number of sequences, or at least titles, in a FASTA file. That can be quite convenient to quickly get a bit of an overview of what's in your FASTA file, or at least how many sequences are in it. I think most of you have seen a FASTA file before and have a pretty good idea of what it is. Then the FASTQ file. A FASTQ file consists of records, usually of sequence reads, and each of these records consists of four lines. The first line is the title of the sequence read. In the title you can store quite a bit of information. For Illumina data that is typically, among other things, where the spot was on the flow cell. You also usually store the barcode in there. More on the title in the slide after this. Then the actual called bases, so the nucleotide sequence. Then a usually empty line starting with a plus. That line is probably there for backward compatibility. When the FASTQ format was designed, people thought it would be a good idea to have that line, but nowadays it's not used very frequently anymore. It contains room for an optional description. And then the fourth line, which is also an important line, stores the base quality. What you want to do in a FASTQ file is to store the quality of each base that is called in the nucleotide sequence. But if you remember well, our Phred base quality ranges, for example, from zero to about 40. So sometimes that quality is a single digit, so something between zero and nine, and sometimes it's an integer consisting of two digits, so somewhere between 10 and 41, for example. If you would store these integers together with the nucleotide sequence, that would become a bit of a mess. So people thought of a smarter way to store that, and that's by storing these integers as ASCII characters.
So that means that each ASCII character represents a certain integer. In this case, for example, an exclamation mark represents a 0, a semicolon represents 26, an at sign 31, and a J 41. That's how it's usually stored in a FASTQ file. So that means, for example, that the A over here has a B as its quality character. Maybe somebody can tell us what kind of base quality we have for this A over here. 33. 33. Yes, exactly, 33. So the at sign is 31, then the A would be 32, and the B would be 33. Then we have a slash over here, so the second base has a somewhat lower quality. I would really have to count what it is, but it's somewhere around 12 or so. We have a smaller-than sign, so that would be 27, and so on. So that's how base qualities are stored in a FASTQ file. Then the FASTQ header, which usually contains quite a lot of information that can also be relevant to you. For example, it usually contains a first part that describes the instrument, the run, and the position on the flow cell: the flow cell identifier, the lane, the tile within the lane, and the coordinates within the tile. So that exactly describes the position of the spot, in which run, on which machine, and you can always trace back where a read came from. As a second part, there is information about which read it was in a pair. If it's a 1, it's the forward read; if it's a 2, it's the reverse read. Then whether it is filtered out, yes or no. That's not used very frequently anymore. In principle, you can use this part of the title to store whether you want to filter out a read or not, but it's not used very frequently. The control bits are also not used very frequently. But at the very end there is also the barcode, and that's a very relevant piece of information that you often store in your FASTQ title. That is the identifier of the sample.
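The quality encoding described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the common Phred+33 encoding (ASCII code minus 33); the function name `decode_phred33` is made up for this example and is not part of any library.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string into a list of integer base qualities.

    Assumes the Phred+33 encoding: quality = ASCII code of the character - 33.
    """
    return [ord(char) - 33 for char in quality_string]


# Characters mentioned in the lecture: '!' is 0, ';' is 26, '@' is 31,
# 'A' is 32 and 'B' is 33.
example_qualities = decode_phred33("!;@AB")
print(example_qualities)  # [0, 26, 31, 32, 33]
```

Going the other way, `chr(quality + 33)` gives the character for a given quality, which is why a quality of 31 shows up as an at sign.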
So, the barcode related to the sample that you have sequenced. I have a question for you. I'm going to start a new share and try to find my browser. Here we go. So, question. This is a bit more of a technical question that's also related to how these base qualities are stored in a FASTQ file. What we learned in the previous slide is that with grep -c and then a greater-than sign, you can count the number of sequences in a FASTA file, because you're just counting the lines that start with a greater-than sign, which is typical for FASTA titles. In principle, you could also say: okay, the title in a FASTQ file, with all this information about the position on the sequencing lane, always starts with an at sign. So we could also use grep -c, not with lines starting with a greater-than sign but with an at sign, to calculate the number of reads in a FASTQ file. Why is that not going to work? Okay, so most of you have answered that the at sign can also occur elsewhere in a FASTQ file. And indeed, that's the correct answer. The at sign is not special in a regular expression; you can just use it directly. And it's also not the case that titles start with something else; they always start with an at sign. So if I go back to the presentation and we go to the previous slide, to the example of our FASTQ file: we have our title, which always starts with an at sign. So in principle, you could start counting the number of titles by counting the lines that start with an at sign. However, as you can see over here, the at sign is actually one of the ASCII characters that can be used to specify base quality. So let's say this A would have a base quality of 31; then this line would also start with an at sign. It's not a title, it is the line for the base quality, so you would count extra lines there. Usually you have really millions of reads in a FASTQ file.
So it's very likely that at least one or a few of the reads start with a base with a base quality of 31. You would count them as an individual record, which would not be correct. Therefore, you cannot use that line of code to count the number of reads in a FASTQ file. You've seen yesterday that we have been counting the number of reads in a FASTQ file. How did we do that? We just did that by counting the number of lines, because we know that each record consists of exactly four lines. So we take the number of lines in a FASTQ file, divide it by four, and we have the total number of reads in the FASTQ file. I see something appearing in the chat. Gabriela tried to grep at signs in a FASTA file and got zero as output. That's because it's quite likely that there are no at signs in a FASTA file, of course. They will be in a FASTQ file. All right, then the SAM format. SAM stands for Sequence Alignment/Map format, and the aim is to store alignments, obviously. The SAM file is, like many other files in bioinformatics, just a regular text file that starts with a header, followed by a tab-delimited part. The header already contains quite a bit of information, and it's quite nice to have a bit of an idea of what kind of information is stored in there. For a SAM file, the header lines always start, again, with an at sign. You will see later on that other file types, for example, start with a hash sign, but in SAM headers all the lines start with an at sign. Different types of information are stored there, under different tags. We have @HD; I don't know exactly what HD stands for, but something like header. It gives you the version of the SAM format and information about how the file is sorted. That's already relevant information, because if you do an alignment, then the alignments are sorted according to the order of the FASTQ file that you used as input.
However, usually you are mostly interested not so much in the order of the reads that were used as input, but in their order according to their position on the genome. So if you see over here that it's sorted by coordinate, then the reads are not sorted according to the order in the FASTQ file, but by their position on the genome. You will learn later on how to actually do that. Then there are usually quite a few lines describing the reference genome to which your reads were aligned. Each chromosome, or each individual contig, has its own line in the SAM header. We're now looking at the E. coli reference genome, and the E. coli reference genome in our case has only a single chromosome, so that chromosome has a certain name and a certain length, and that's stored over here. If you have multiple chromosomes, you have multiple chromosome names and multiple chromosome lengths stored there. Then another very typically used part of the SAM header is the @PG tag. That is information about which programs were run on this file in order to get to the SAM file as it is right now. You can have multiple programs doing different calculations on your data. In this case, we only have information about the aligner that was used. So with this @PG tag, we can figure out what kind of options were used for the aligner. What you see over here is the program we used. It's called Bowtie 2, and even the version is stored. Then the original command is actually written over here, and the index and the reads are specified. So the entire command is stored in your SAM file, which is quite nice. If you, for example, receive a SAM file from a colleague and you have no idea where it's from or what its history is, you can usually figure it out by just looking at the @PG tags in the header. Then after the header, the tab-delimited part starts.
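As a rough illustration of the header described above, the sketch below collects the reference names and lengths from the `@SQ` lines and the program records from the `@PG` lines. The function name `parse_sam_header` is invented for this example, and the header lines (including the program version and command line) are hypothetical values, not copied from the slide.

```python
def parse_sam_header(header_lines):
    """Collect reference sequences (@SQ) and program records (@PG)."""
    references = {}
    programs = []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0]
        if record_type == "@CO":
            # Comment lines are free text, not TAG:VALUE pairs.
            continue
        # All other header fields are TAG:VALUE pairs.
        tags = dict(field.split(":", 1) for field in fields[1:])
        if record_type == "@SQ":
            # SN is the reference (contig) name, LN its length.
            references[tags["SN"]] = int(tags["LN"])
        elif record_type == "@PG":
            programs.append(tags)
    return references, programs


# Hypothetical header for a single-chromosome reference.
header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:U00096.3\tLN:4641652",
    "@PG\tID:bowtie2\tPN:bowtie2\tVN:2.4.2\tCL:bowtie2 -x ref -U reads.fastq",
]
references, programs = parse_sam_header(header)
print(references)          # {'U00096.3': 4641652}
print(programs[0]["CL"])   # bowtie2 -x ref -U reads.fastq
```

In practice you would usually let a dedicated tool or library read the header; this only shows that it is plain tab-separated text.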
And the tab-delimited part consists of multiple columns with specific information regarding the alignment. Each line represents a single alignment, usually a single read. The first column specifies the read name. That is exactly the title that it got in the FASTQ file. Then a flag, and the flag represents binary information about the alignment. For example, over here you see an integer, and that integer represents binary information: whether the read aligned properly, yes or no, or whether its mate is mapped, for example. How this binary information is converted into a flag, and how you can convert a flag back to the binary information, you will see later on. Then the reference, so let's say the contig or chromosome to which the read was aligned, the start position of the alignment, and the mapping quality. That is the information about how sure you are that the mapping position is correct. And then the CIGAR string. The CIGAR string specifies where there are matches, insertions, or deletions between your reference genome and your read. You can read this CIGAR string like this. Over here we have a read of, well, I'm not exactly sure how long this read would be; that's always a bit of a calculation. But our read starts with five matches. Then two deletions; I think those are deletions in the query, so deletions in the read. So the aligner deleted that part in order to optimize its alignment. Then again seven matches. And then 3S, and the S means soft clip. So three bases are clipped off the alignment. Based on the start position and the CIGAR string, you can also determine the end position. Daniela has a question. Yeah, sorry, two questions. One, what was the S again? Soft clipped. That means that three bases were soft clipped from the read, so they are ignored for the alignment. Okay. And then the start position: is that the start position in the whole genome or in the chromosome? In the chromosome. Okay.
So this read aligned at position 12,513, so at base 12,513 in contig or chromosome U00096.3. Exactly. And then based on the CIGAR string and the start position, you know the exact alignment of the read. I think Gabriela has a question. Yes. For me, it's not clear how the aligner can figure out whether something is a deletion or a soft clip. In both cases there will be a gap; there is no alignment of the query sequence to the reference genome. So how does it decide to name it a deletion or a soft clip? Maybe it's the soft clip that is not very clear to me: what it looks like, or how the aligner makes it. Yeah. So what an aligner tries to do is to optimize the alignment itself. If you allow for soft clipping, that means that the aligner is allowed to remove a part of the read and ignore it while calculating the alignment penalty. If the aligner would say, okay, I'm not going to soft clip, but I'm going to call this a deletion, for example, that would add to the penalty, and therefore it might not have a significant alignment at that position. So if you tell the aligner, okay, do the soft clipping, so ignore that part completely for the alignment and try to optimize the part that does fit, then the soft-clipped part is just ignored in the penalty calculation. Therefore you might get significant alignments with soft clipping, while you wouldn't if you did not allow for soft clipping, so if you would call that a deletion. Okay. And what is the difference between soft clipping and hard clipping? I forgot; I would have to look it up. Hard clipping is not used very frequently. Okay. So the aligner mostly uses soft clipping. And why is it called soft? Is it because fewer bases are ignored? Soft clipping means that the information of the read itself is retained.
So for example, what you can do with soft-clipped bases is display them in the alignment if you visualize it. Then you can see exactly which parts of the reads are clipped off. Okay. That's the option in IGV where it says "show soft clips"? Yeah. Exactly. There is another question. Yes. I was wondering: in our case, we have a reference of only one chromosome, right? So we know that the starting position is on that precise chromosome. If we would use a reference with multiple chromosomes, would we see which chromosome it was in the reference name in this case, or in the starting position? That would be in the reference name. So this is a bit confusing. You would say, okay, reference means the entire genome, but that's just how it is named in the SAM manual. Reference here means contig, so basically the FASTA title: the thing that occurs after the greater-than sign in a FASTA file. So if you have multiple chromosomes, that would be the chromosome name, and then the position within that chromosome. Great. So if you have paired-end reads, you would of course like to store that as well in your SAM file: which reads are paired with each other. For example, one thing you would like to store is whether the mate, so the other read in your pair, is actually mapped to the same chromosome or the same contig. Usually that's what you would expect, right? If that's not the case, either something went wrong or your reference is just broken up into multiple contigs. So if the mate is mapped to the same reference, you get an equal sign over here. If it's a different reference, or a different contig or chromosome, I should say, you get the chromosome name over here. Then you get information about the start position of the mate, so you know exactly where it starts. And therefore you can also calculate the fragment length. Sorry about that. The fragment length.
So that means the complete length of the fragment that you have sequenced: from, let's say, the first base of read one, so its 5' base, to the 5' base of read two at the other end of your fragment. So the total fragment length is also calculated. Then we get the sequence itself, which is also stored in the SAM file. That's always there, and the base quality string is also stored there. So we have the sequence and the base quality, and then the optional tags. What kind of information is stored there depends very much on the aligner you are using. That's basically extra information related to the alignment. Then a few words about CIGAR strings. The letters that are most frequently used are M, I, D, N, and also S, so the first five. The M specifies a match. The I is an insertion to the reference, so you have something extra inserted in the read. The D is a deletion from the reference, so it is deleted in your read. An N is a region skipped from the reference, and that is used, for example, for specifying spliced alignments: if you skip an entire intron, that's specified by the N. And then the S is used for soft clipping. However, what you can also store, which you see at the end, at numbers seven and eight, is a sequence match and a sequence mismatch. So in principle, you could also store individual SNPs in the CIGAR string. If you have a mismatch between your reference and your read, for example an A in your reference and a G in your read at that position, you could specify that in the CIGAR string with an X. However, this is actually stored as an M in the CIGAR string, so these two are almost never used. And that has a reason. So let's say this is what I just explained: your reference is over here at the bottom and your aligned read is at the top. You can say, okay, we have seven matches.
So that's how it's actually stored. You could also say, okay, I have three sequence matches, one mismatch, and then three sequence matches again. Then we have a longer CIGAR string, but we have stored information about this variant over here. The reason it isn't done that way is that the read itself is also stored: the sequence of the read is stored in the SAM file. You have the read information, so the actual bases, and you also have the reference information. That means that if you would store those individual mismatches in your CIGAR string, you would store the information about a mismatch twice: once through the reference and the actual read, and once in the CIGAR string. And people said, okay, we're not going to store information twice. If there is a single mismatch, meaning for example a SNP, we are going to store it as a match, and you can figure out from the read itself what the actual nucleotides in the read were. So I have a different question for you. I'm sorry, before you ask the question: does that mean that the mismatches are stored as SNPs? The mismatches, so let's say the X's, are just stored in the reads themselves, relative to the reference, meaning that you store the actual sequence in the SAM file. That's what you see over here. The sequence itself is already stored in the SAM file. And because you know the matches, deletions, and insertions from the CIGAR string, you can combine the reference sequence and the read sequence with the CIGAR string to figure out which nucleotides differ between the reference and the read. Yeah, okay, so let's go to the actual question I want to ask you.
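The way a start position and a CIGAR string together give the end position and the read length can be sketched as follows. This is a minimal sketch with made-up function names (`parse_cigar`, `alignment_span`); it uses the example discussed earlier (five matches, two deletions, seven matches, three soft-clipped bases, starting at position 12,513) and the usual rule that M, D, N, = and X advance along the reference while M, I, S, = and X advance along the read.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_READ = set("MIS=X")       # operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def alignment_span(start, cigar):
    """Return (end position on the reference, read length) for an alignment.

    `start` is the 1-based leftmost position from the SAM POS column.
    """
    operations = parse_cigar(cigar)
    reference_length = sum(n for n, op in operations if op in CONSUMES_REFERENCE)
    read_length = sum(n for n, op in operations if op in CONSUMES_READ)
    return start + reference_length - 1, read_length


print(parse_cigar("5M2D7M3S"))            # [(5, 'M'), (2, 'D'), (7, 'M'), (3, 'S')]
print(alignment_span(12513, "5M2D7M3S"))  # (12526, 15)
```

So the read in that example is 15 bases long (5 + 7 + 3) and covers 14 reference bases (5 + 2 + 7), ending at position 12,526.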
So now we have a bit of an idea of what the SAM file looks like and what kind of information is stored in there. Let's say you have only the SAM file. My question is whether you would be able to regenerate the FASTQ file, so the original FASTQ file, or at least the reads that were in the FASTQ file, from the SAM file, and/or whether you can regenerate the reference sequence, so the original FASTA file, from the SAM file. Only the SAM file; you have nothing else. Yeah, most of you have answered. All of you have answered. The most frequent answer is "only the FASTQ file", and that's also the correct one. So indeed, you cannot regenerate the FASTA file from the SAM file, but you can regenerate the FASTQ file. In order to explain that, let's go back to the presentation. If we look at the tab-delimited part of the SAM file again: everything that is in the FASTQ file is stored there. We have the read name, we have the sequence, and we have the base quality, and all of those are what is stored in the FASTQ file. So you could regenerate the FASTQ file from the SAM file, and that is actually often done. So that's pretty good. The next question. But you don't have anything on the instrument information in the SAM file? Yeah, that would be stored in the read name. That's this part. Everything? No, I shortened it here. Oh, the read name is the entire title, the whole line. Yeah. The information about the reference is mainly stored in the header, and that's only the contigs or chromosomes that you have in the reference and their lengths. Other than that, nothing about the reference is stored in the SAM file. So if you want to do, for example, variant analysis, you always require the SAM file and the reference genome in order to find mismatches between your reads and the reference. Okay, then about SAM flags.
So these are the second column in the tab-delimited part of the SAM file, and they are stored as integers. The integer is a sum of different bits that each specify whether a certain characteristic is true or false. So let's say we look at the third one, the bit with value four, "segment unmapped". That means, for example, that your read is not mapped at all. The aligner couldn't find a place for it in the genome, but it is still stored in the SAM file. Then you get a four. And let's say you have another characteristic, for example that your read is paired. I'm pretty sure that's this one, "template having multiple segments in sequencing"; it basically means that your read is paired. So if your read is unmapped and it is paired, you get a four plus one, so you get a five at that position, in that second column of the SAM file. All of these characteristics, how many are there, about 10 or 12, can be stored in the SAM flag. For example, you can also store in the SAM flag whether a read is marked as a PCR duplicate, yes or no, or whether it was the first or the last segment in the template, meaning whether it is the forward or the reverse read, and so on. So how does it work? I've put together a small example. Let's say your read is paired and is properly aligned. The bit that specifies whether the read is paired is the one, and the bit that specifies properly paired is the two. You add them together, so you get a flag of three. Let's say your read is paired, not properly aligned, and the mate is unmapped. Then you get a flag of nine. If your read is paired, it is unmapped, and the mate is also unmapped, then you get a flag of 13. And because we're working with bits, you can always translate these flags back into the individual binary information. Question for you. There is a question first. Yes.
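Translating a flag back into its individual bits, as described above, can be sketched with a bitwise AND. The names `FLAG_BITS` and `decode_flag` are invented for this illustration, and the short descriptions are paraphrased rather than the exact wording of the SAM manual.

```python
# Bit values and their meaning (descriptions paraphrased).
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read on reverse strand",
    32: "mate on reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
    512: "read fails quality checks",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM flag."""
    return [description for bit, description in FLAG_BITS.items() if flag & bit]


print(decode_flag(3))   # ['read paired', 'read mapped in proper pair']
print(decode_flag(9))   # ['read paired', 'mate unmapped']
print(decode_flag(13))  # ['read paired', 'read unmapped', 'mate unmapped']
```

Because every bit is a distinct power of two, each sum can only be decomposed in one way, which is why the flag can always be translated back.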
So this means that the values are actually always the same? Just going back: "read paired" is equal to one, so the value given to that is always the same? Well, for example, you can also have SAM files where you combine paired-end with single-end reads. Then some of the alignments would not have this characteristic, so you would not have the one in there. See what I mean? Yes. I don't understand what the one, two, four, and eight at the bottom are. These are the bits; that's basically this table. So there's one, two, four, and eight. One means "template having multiple segments in sequencing", which, if you translate it, basically means: was your read paired, yes or no. So that's this one. Okay. So these are always the same? This code is always valid for SAM flags, right? It's always the same. Yes. If we have this table here, then we're sure that if I have a total of two, I can go back. Okay. Yes. So the two always means each segment is properly aligned. Yes. The four always means the segment is unmapped, so the read is not aligned. Thanks. All right. Then a few more examples of types of data. We have the BED file, and BED stands for Browser Extensible Data. What it stores are usually regions in a genome. It always stores at least three columns. The first column is the sequence title, so for example the contig or chromosome. The second column is the start position of a specific region, whatever region of interest. And the third column is the end position. Then you can give it a name. For example, you can store information about exons in there; in the example, we have exon one and exon two. You can store a score for whatever reason, for example how sure you are that there is an exon over there. And you can store the strand, so whether it's on the positive or the negative strand relative to your genome.
One thing that is very typical of the BED file is that the numbering of the start position starts at zero, while the numbering of the end position is, let's say, standard; that is, it is counted inclusively. So suppose you would convert this BED file into regions the way we also typically specify them, for example in IGV, starting with the chromosome name, a colon, and a start and an end. There we would start at 1702, while in the BED file it would be 1701. This is a little bit annoying. So this region really starts at base 1702; that's where your region starts. Basically, in a BED file, the region starts after the specified start position. This is a bit confusing, but it is what it is. At some point, people decided, okay, this is what a BED file is, and you just have to deal with that, because every BED file is formatted in that way. Then the GFF file, which is quite similar to a BED file, but with some additional information, for example what kind of feature we are looking at. GFF files are typically used for storing information about genes, for example where transcripts are, where coding sequences are, and where genes are in a genome. There are a few columns. The first column is, again, the chromosome name or contig name. The second one is the source, so where the annotation actually comes from. The third one is the feature: what kind of feature are we looking at? Is it a messenger RNA, an exon, a coding sequence, or whatever? Then we have a start and an end. The start and the end are different from a BED file, if I'm not mistaken: here you really start counting at one, instead of counting from zero as in the BED file. Then we can store a score, so how sure you are that a certain feature is actually there. Whether that kind of information is stored depends very much on the software you're using.
Then the strand, same as in a BED file, minus or plus. Then you can store the frame there, so whether the translation into amino acids actually starts at that position, or whether that's irrelevant. If it's irrelevant, which would be the case for a messenger RNA or an exon, you get a dot. The coding sequence, of course, you can translate into an amino acid sequence, and the frame tells you where to start. If it's a zero, the translation really starts where the coding sequence also starts. And then you get a whole bunch of attributes. Usually that's a pretty long line with, for example, the gene identifier, such as the Ensembl ID, and the gene symbol. And often also, for example, the parent. If you're talking about an exon, the parent would be the transcript. If you're talking about a transcript, the parent would be the gene, and so on. That's also stored in the attributes, so you know which features belong to which. Then the last format I want to discuss... Teba? Sorry, I'm not switching on the video because my Wi-Fi is not good. Sorry about that. Yeah. So my question was about the difference between the information that's stored in a BED file versus a GFF file. Was it that the BED file was only about exons, whereas the real annotation of the genes and transcripts comes from the GFF? A BED file can store any region, for anything you want to annotate on the genome. So it doesn't necessarily specify exons, but pretty much whatever you want. It could also, for example, specify a promoter region, or something that is enriched somewhere. Okay. Then what additional information is a BED file adding apart from what could already be in a GFF? Well, it's more that a BED file is much more flexible, so you can store anything there, and a GFF file is really there to store information about, usually, genes. Okay. Genes and transcripts.
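The coordinate conventions described above (zero-based start in BED, one-based start in GFF and in region strings such as those used in IGV) can be sketched as a small conversion. The function names and the example line are hypothetical; only the 1701 versus 1702 start is taken from the lecture.

```python
def bed_to_region(bed_line):
    """Convert a BED line into a 1-based 'chrom:start-end' region string."""
    chrom, start, end = bed_line.rstrip("\n").split("\t")[:3]
    # BED starts are 0-based, so add 1; the end value stays the same.
    return f"{chrom}:{int(start) + 1}-{int(end)}"


def region_to_bed(region):
    """Convert a 1-based 'chrom:start-end' region string into BED columns."""
    chrom, coordinates = region.rsplit(":", 1)
    start, end = coordinates.split("-")
    return f"{chrom}\t{int(start) - 1}\t{int(end)}"


# Hypothetical BED line: chromosome, start, end, name.
print(bed_to_region("chr1\t1701\t1800\texon1"))  # chr1:1702-1800
print(repr(region_to_bed("chr1:1702-1800")))     # 'chr1\t1701\t1800'
```

The same plus-one shift applies to the start column when converting BED coordinates to GFF, while the end column is unchanged.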
So in principle, the information you can store in a GFF file you can also store in a BED file. You can actually convert a GFF file to a BED file. It would have a few additional columns, but pretty much all information that is in a GFF file can also be stored in a BED file. But not everything that can be stored in a BED file can also be stored in a GFF file, because in addition to genes, you can also store, I don't know, primer positions, for example, in a BED file. Okay. Okay. Thanks. And a second question: GFF and GTF, are they similar or different? That is a bit of a complex story. They store mostly the same information, but there are multiple versions. Just by heart, I think you have GFF, GFF2, and GFF3, and I think GFF2 is exactly the same as GTF. That's by heart; I'm not entirely sure whether that's exactly true, but that is roughly how they relate. So there are GFF versions that are exactly the same as GTF, and for most applications you can use them almost interchangeably, because how things are stored is very similar between the GFF versions and the GTF file. Okay. Okay. Thanks. So then the last file type: that's the VCF. In a VCF, you store information about variants. Like a SAM file, a VCF file also starts with a header. But where the header lines of a SAM file start with at signs, the header lines of a VCF file start with hash signs. After that, you again have the tab-delimited part. And in this tab-delimited part, not alignments but variants are stored. We'll go into what's exactly in there in later slides, if I'm not mistaken. Yes. So we have our reference and we have the aligned reads. At some point, you want to use some software to figure out whether, for example, at this position, where you see sometimes a T and sometimes a C, which is the reference, that is a variant, yes or no. And that is stored in a VCF file.
So what you want in a VCF file is a position, information about the reference allele, and information about the alternative allele. And, for example, also information about the genotype. So the genotype for this position over here, with only Cs, is probably homozygous C. In this case, homozygous alternative, because it's different from the reference. And at this position, it is heterozygous for the alternative, meaning half of the reads, or really half of your genome, is a C and the other half is a T. So if you're looking at a diploid, then of course one allele is a T and one allele is a C. What you also want to be able to store in a VCF file is, for example, insertions and deletions. So that's also possible. Over here, we apparently have an insertion of two bases. And over here, there seems to be a deletion of one base. So probably the T is deleted compared to the reference. So basically what a variant caller does is convert our SAM file into a VCF file, specifying positions and alleles for our variants.
It looks like this. So we have our header, starting with the hash signs. And in this header, there is quite a lot of information. For example, what version we have, the date, often also what software was used to create the VCF file, which is not specified in this example over here. And there is information about what is specified in the tab-delimited part. The tab-delimited part starts with, again, the chromosome name or the contig name, and the position. So very similar, again, to other formats. Then we have an ID, an identifier for the particular variant. If you call new variants, you do not know the ID beforehand, so this is then usually empty. Sorry. Then we have the reference allele at that position. We have an alternative allele at that position. So these are only the positions where we actually call a variant. Then the quality of the variant in general, and whether there's a filter specified over here. So if there's a PASS, it is not filtered out.
If there's a Q10, it is filtered out. There's an INFO field giving information about the variant as a whole. And then there is also information about particular samples. So you can store information on multiple samples in a single VCF file. And what is stored for these particular samples is specified in the FORMAT column. All these different codes we see over here, in both the INFO field and the FORMAT field, are described in the header. So we have this part in the header where the INFO field is described. For example, NS. The header says what type of information is stored in there, and a description is also stored in there, and over here it says, okay, NS means the number of samples with data. So we can check that in the tab-delimited part, and we see number of samples with data equals 3. And we actually see only two sample columns. That is because this VCF file was a little too wide for the slide, so I cut off the third sample, but there are three samples in total with data. We have DP, for example, total depth, a total depth of 14, and so on. So the VCF file is super flexible, which means that the software you're using, or the end user, can decide what kind of information is stored in this INFO field. The same counts for the FORMAT field and the FILTER field, by the way.
So first, let's go to the filter. If the filter is not PASS, then you might specify why a certain variant is filtered out. In this case, you see Q10 over here. And the description in the header is that the quality is below 10. And that's actually what we see over here, that the quality is below 10. So again, the filter can also be anything, very flexible. Then the FORMAT part. The FORMAT part describes what kind of information is stored per sample, so not information about the variant as a whole, but what kind of information is stored per sample. And that is what you see over here.
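The header lookup, the INFO field, and the FILTER logic described above can be sketched in a few lines of Python. This is only an illustration, not how any particular variant caller or library does it. The function names are hypothetical, and the header and data lines are made up, reusing the values mentioned in the talk (NS=3, DP=14, a Q10 filter meaning quality below 10).

```python
import re

# Illustrative header lines in the style of a VCF header (values are made up).
HEADER = [
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##FILTER=<ID=Q10,Description="Quality below 10">',
]

HEADER_PATTERN = re.compile(
    r'##(INFO|FILTER|FORMAT)=<ID=([^,>]+).*Description="([^"]*)"'
)


def parse_header_descriptions(lines):
    """Map (kind, ID) pairs such as ('INFO', 'NS') to their description."""
    descriptions = {}
    for line in lines:
        match = HEADER_PATTERN.match(line)
        if match:
            kind, key, description = match.groups()
            descriptions[(kind, key)] = description
    return descriptions


def parse_info(field):
    """Split an INFO column into a dict; keys without '=' become True flags."""
    info = {}
    if field == ".":
        return info
    for item in field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def is_filtered_out(filter_field):
    """A variant passes if FILTER is PASS; '.' is treated here as 'no filter applied'."""
    return filter_field not in ("PASS", ".")


descriptions = parse_header_descriptions(HEADER)

# Two made-up variant lines: CHROM POS ID REF ALT QUAL FILTER INFO
passing = "20\t14370\t.\tG\tA\t29\tPASS\tNS=3;DP=14".split("\t")
failing = "20\t17330\t.\tT\tA\t3\tQ10\tNS=3;DP=11".split("\t")

for cols in (passing, failing):
    info = parse_info(cols[7])
    print(cols[1], "filtered out:", is_filtered_out(cols[6]), info)
    for key in info:
        print("  ", key, "=", descriptions.get(("INFO", key), "not described"))
```

Values are kept as strings on purpose; a fuller parser would use the `Type` given in the header to convert them.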
So for example, we have the genotype, so GT, that's a very frequent type of information that's stored in a VCF file. And the order in the FORMAT column over here is always the order of the different characteristics that are stored in each individual column of the different samples. They do not have to be exactly the same for each variant, because, for example, for an indel you might want to store different information than for a SNP. So we have the GT over here. So the first part of the column of a certain sample should be the genotype. And the genotype is always specified according to the ploidy level of the organism. In this case, we are looking at human data, so it's diploid. And the numbers in the genotype refer to the reference and the alternative. So over here, we see a zero, then a pipe symbol, and then a zero. That would mean that this specific sample is homozygous reference. So the zero always specifies the reference allele, while the one specifies an alternative allele. Sometimes you even have multi-allelic variants. For example, over here, where we see the reference is an A and the alternatives are a G and a T, you can even have a two specifying an allele. So the two here specifies the T and the one specifies the G. In this case, this individual is heterozygous for G and T, and it doesn't contain the reference allele. And you can also store other information, of course, about the specific samples, for example the genotype quality. So how sure the variant caller was that the genotype it has called is the actual genotype. Again, this is also a Phred-based score. So very similar to mapping quality and base quality. These are the examples; I think that was already clear. That is it in terms of file types. Is everything clear regarding that?
I have a question. I would like to know, for example, this haplotype quality, what does it mean, or what does it explain?
So, VCF files are very flexible.
That also means that you can store any information you would like to store about a certain variant. So haplotype quality can be something that is used by only one very specific type of software. For example, software that does haplotype calling, that uses haplotypes. To understand what haplotypes and everything are, I think that goes a little bit too deep for this course. But you can, for example, link multiple variants together and calculate a score for how sure you are that the called haplotype is actually the haplotype. So that's probably what this would represent. So the haplotype that is called, haplotype information, can also be stored in a VCF. I am not going to explain it now. But this haplotype quality would be depicting how sure the haplotype caller was that the haplotype it reports is true.
In the sample column, which number is the haplotype quality, which is the read depth, and which one is the genotype quality?
Okay, so that is specified in this FORMAT column. So here we see GT, which is genotype, GQ, which is genotype quality, DP, which is read depth, and HQ, which is haplotype quality. And in that order, it is reported for the different samples. So we have the genotype over here, the zero, zero. The genotype quality over here, so 48 for the sample. The depth, so a read depth of one, apparently. And the haplotype quality is 51,51. So there are two haplotype qualities. I have no idea why you would have two haplotype qualities. But apparently haplotype quality is reported with two numbers.
So you can already notice that haplotype quality is not a quality that is reported very often. Because, for example, in the bottom position, there is only the genotype, the genotype quality and the read depth, and then the haplotype quality is not reported, no?
No, indeed. That is because you can report different characteristics for different variants.
So in this case, the variant caller has decided, okay, I'm not going to report haplotype quality over here. And that is because it's not phased. So if you see a pipe symbol over here, then a variant is phased, meaning that it's part of a haplotype. If there is a slash, it is not phased, so it's not part of a haplotype. And because it's not part of a haplotype, haplotype quality is also not reported over here.
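To tie the FORMAT column, the genotype numbers, and the pipe versus slash together, here is a small Python sketch. It is a simplified illustration, not a full VCF parser. The function names are hypothetical, and the sample strings reuse values mentioned in the discussion (for example `0|0:48:1:51,51`), with the rest made up.

```python
def parse_sample(format_field, sample_field):
    """Pair the keys in the FORMAT column with the values of one sample column.

    zip() stops at the shorter list, so a sample that leaves out trailing
    fields (for example no HQ) simply gets fewer entries.
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


def decode_genotype(gt, ref, alts):
    """Translate genotype numbers into alleles and report whether it is phased.

    0 is the reference allele, 1 the first alternative, 2 the second, and so on.
    A pipe means phased, a slash means unphased.
    """
    phased = "|" in gt
    separator = "|" if phased else "/"
    alleles = [ref] + alts
    called = [None if i == "." else alleles[int(i)] for i in gt.split(separator)]
    return called, phased


def describe_zygosity(called, ref):
    """Give a short label for a decoded genotype."""
    if None in called:
        return "missing"
    if len(set(called)) == 1:
        return "homozygous reference" if called[0] == ref else "homozygous alternative"
    return "heterozygous"


# Sample with all four fields, phased (pipe), so HQ is present.
sample = parse_sample("GT:GQ:DP:HQ", "0|0:48:1:51,51")
called, phased = decode_genotype(sample["GT"], "G", ["A"])
print(sample, called, phased, describe_zygosity(called, "G"))

# Multi-allelic site: reference A, alternatives G and T.
called_multi, phased_multi = decode_genotype("1|2", "A", ["G", "T"])
print(called_multi, describe_zygosity(called_multi, "A"))

# Unphased sample (slash) where the trailing HQ is not reported.
unphased = parse_sample("GT:GQ:DP:HQ", "0/0:41:3")
print(unphased, decode_genotype(unphased["GT"], "T", ["A"]))
```

The dictionary for the unphased sample has no `HQ` key, which matches the situation in the last question: the field is listed in the FORMAT column but left out for that sample.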