Start recording. So welcome everyone, also the people who are watching this on Moodle. Today we will be talking about DNA metabarcoding, because it was requested. I made like four or five slides about it. Well, perhaps a few more. The rest of the lecture we will spend talking about literature management and these kinds of things, which is an essential part of science. It's not so much bioinformatics, but there are some bioinformatic aspects there which you have to know and take into account.
So this is the 13th lecture. Next week we will still have a lecture, but it won't really be a lecture. There won't be any new material next week. It will just be a recap of all the other lectures, and there will be a couple of example questions like the ones that will be on the exam. I just took some old exams and picked four or five questions that I asked before, just so that you have an idea of what I will be asking on the exam. I think you can still register for the exam, but you have to register two weeks before. I already said it a couple of times, so I think today or tomorrow will be the last day that you can register. And of course, if you fail to register for this exam, there will be a re-exam. I have the date of the re-exam somewhere in my email and I will send it around on Moodle. Although I hope that we won't need a re-exam. I hope that everyone will just pass the first time around, which makes it easier for me because then I don't have to make two exams.
All right, so like I said, DNA metabarcoding and literature management today. So 12 slides on DNA metabarcoding. Then we will be talking about PubMed, Medline, and things like citations. We will be talking about Web of Science, Google Scholar, ResearchGate, what an H-index is, what an I-index is. I also put in a couple of slides about scientific reference managers. I assume people know how to use a reference manager and that everyone has their favorite.
But I did want to point it out, because when I started off in science, so around 2010, no one ever told me that these things existed. Because I didn't know, I spent a lot of time formatting citations and making sure everything was correct. The nice thing about these reference managers is that you don't have to. They take over most of the work, and you can just create a database and take it everywhere with you.
And finally, since we're talking about literature and these kinds of things, I also wanted to talk about version control. So what is Git, what is GitHub, what are the different version control systems? Because when you are starting to write code, version control can and will save your ass many, many times. So you should use one.
But first and foremost, we will be doing the assignments from last lecture. I think it was a relatively straightforward assignment. I can't show you the assignment itself, so I'll just switch to the R window, because we will be using R soon anyway. When we were talking about standards for analysis, we discussed a whole bunch of file types. On Moodle you could download a zip file with a couple of different file types in there. For each of these files there were a couple of questions, and the first ones were about the different FASTA files. So let me extract the zip file myself. Just 7-Zip, Extract Here, and then they just end up in the folder, which is perfectly fine.
All right, so the first one, let me show you guys in the Notepad++ window. The first file, fasta_01.fa, looks like a FASTA file, but is it really a FASTA file? So I want to ask you guys: is this a properly formatted FASTA file? And if not, what's wrong with it, and why is it not an official FASTA file?
So if you know, just throw it in chat and I will have a couple of sips of coffee while you guys think it over and look at what's wrong with this file. And remember, something like this could be on the exam as well, because it's an easy question to make, right? I just give you a little piece of sequence and ask you what's wrong with the FASTA file, and then you can say what it is.
No one? No ideas, no comments? "It's a perfectly valid FASTA file." "The spaces are wrong." Which spaces do you mean? The spaces here, or here, or the one here? So the first line is perfectly valid, because it's just a comment line. The empty lines between the three different FASTA records? No, because empty lines are also okay in the FASTA format. It doesn't really matter if you do something like this or this. It doesn't change the sequence. Enters are ignored in the FASTA format.
The main thing which is wrong with this file is that before you can start giving a sequence, you have to name it. So "a sample sequence in FASTA format" cannot be written down like this, because this is a comment character, right? Which means that here we have a sequence and the sequence is unnamed.
Then Testesaurus comments that on the second header line there should be only one greater-than symbol. That is not per se part of the definition. It just means that the name of this sequence is ">FASTA01_02". That is a perfectly valid name, because everything after the first greater-than symbol is considered the name, and there's no reason why a name couldn't start with a symbol like this. So having something like this is perfectly valid. It just means that the name of the sequence is ">FASTA01_02". But the main issue here is that the first sequence is not named.
So you have to name it like this, with a greater-than symbol, instead of with a comment character. And of course, the third sequence is also lacking a name, right? It only has the comment symbol. It's okay to comment stuff, but the requirement is that every sequence has a name.
"Maybe the sequence can only be made from A, T, G and C." If it's a DNA sequence, then yes. But in this case the sequences here are proteins. The only thing that I'm not sure of is whether the star is allowed. I don't think the star is officially allowed, although it might be. That's the only thing I'm a little bit unsure of, because I think it needs to be a dash. But the star might mean any amino acid, and then you could just use it in an amino acid sequence.
So if you want to fix this file, then you would say: the first line is perfectly fine as a comment, and then the second line needs to be the name. Of course, you can flip this around as well. You could make it valid by saying, well, I have a name, right? FASTA01 and so on, to make the name match the other names. And then "this is a sample sequence" is just a comment in the file, so everything after a semicolon is ignored.
And then the sequence itself, let me check. This is 64 long. This is 64 long. And this is 70 long. So the length of the lines is still fine, because they can be up to 80 characters. Since this is 70 characters on a single line, this is still allowed. A sequence like this would not be allowed, of course, because now there are more than 80 characters on a single line. There are now 140, and that would break the FASTA format.
So there were a couple of things wrong with this file. But if you really want to make it pretty, or at least consistent, then this would be more or less the best way to fix this FASTA file. All right, so then I think I gave you two more FASTA files.
So again, with the first line there's nothing really wrong. It's just weird that the name starts with the greater-than symbol. The "sample sequence in FASTA format" is perfectly fine. And here we see that the line is just way too long, because maximum 80 characters, that's what FASTA defines. "Too long." What's too long, Commando? Yeah, this line is way, way, way too long. "And there's two headers." Yep, that's well noted. So the second sequence has two headers, and two headers is not allowed. One of these has to get a comment character to make it valid again. Right, so this would be perfectly valid now, except for the fact that the lines are too long. If I go to a width of 80 and just break it like that, then this makes it a perfectly valid FASTA file, and the same thing holds for this one. So by putting in some enters, this is now a valid FASTA.
"Wasn't the maximum length 140?" I don't know, I thought it was 80. Let me Google that for you. FASTA sequence definition. I think it was 80, but yeah: no longer than 120. In the original format, the sequence was represented by a series of lines, each of which was no longer than 120 characters and usually did not exceed 80 characters. So 120 characters is more or less the upper limit, and 80 characters is what is advisable. Nowadays it doesn't really make a lot of difference anymore.
And the asterisk is actually allowed. The star is allowed at the end of a sequence. So if we put a star here, this would be perfectly fine, right? Because it means the end of the sequence. But if we put a star here, then this would not be allowed, because this is in the middle of the sequence. Now the reader will read the sequence up to here, see the star, assume that this is the end of the sequence, and then it will complain that this part is a new sequence which doesn't have a name.
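The FASTA rules discussed for the first two files can be collected in a small checker. This is only a sketch of those rules as stated in the lecture (semicolon comments, ignored blank lines, a name before every sequence, no two headers in a row, at most 120 characters per line, a star only at the end). The function name check_fasta and the example lines are made up for illustration, and real FASTA parsers may be stricter or more lenient.

```r
# Hypothetical helper (not from the lecture): list problems in FASTA lines
# according to the rules discussed above.
check_fasta <- function(lines, max_width = 120) {
  problems <- character(0)
  named <- FALSE    # TRUE once a ">" header was seen for the current record
  has_seq <- FALSE  # TRUE once the current record has at least one sequence line
  for (i in seq_along(lines)) {
    line <- trimws(lines[i])
    # Blank lines and ";" comment lines are ignored
    if (line == "" || startsWith(line, ";")) next
    if (startsWith(line, ">")) {
      if (named && !has_seq) {
        problems <- c(problems, sprintf("line %d: two headers in a row", i))
      }
      named <- TRUE
      has_seq <- FALSE
      next
    }
    # Anything else is treated as sequence
    if (!named) {
      problems <- c(problems, sprintf("line %d: sequence without a name", i))
    }
    if (nchar(line) > max_width) {
      problems <- c(problems, sprintf("line %d: longer than %d characters", i, max_width))
    }
    # Drop one trailing star, then look for a star inside the line
    if (grepl("*", sub("\\*$", "", line), fixed = TRUE)) {
      problems <- c(problems, sprintf("line %d: star inside the sequence", i))
    }
    has_seq <- TRUE
  }
  problems
}

# Made-up example: an unnamed first sequence with a star in the middle
example <- c("; a sample sequence in FASTA format", "MKV*LLA",
             ">FASTA01_02", "MKVLLA*")
print(check_fasta(example))
```

The sketch only looks for a star within a single line. A star at the end of a line that is followed by more sequence on the next line is not detected.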
So anyway, that's what's wrong with the second file, and then I think I gave you guys a third one as well. And this one looks, yeah. This one: okay, yes or no? That's a question to you guys. "I think it's fine." I think it's fine as well. I don't see anything which is obviously wrong with it. It looks a little bit weird here, right, in this part of the file, but this is perfectly allowed, because the sequence just continues here. It has a name, then you have the sequence, which is perfectly fine, and then this is just a comment, so this whole line is ignored. And if you ignore this line, then there are just two enters, and you can add enters in a FASTA sequence without it being an issue. So I think this one is perfectly fine as well.
All right, so then the next one was the VCF format. Open the provided VCF file in a text editor. So let me do that for you guys. That's sample.vcf. All right, so then you see the VCF header. The header starts by saying this is VCF version 4.1. So that might have a little bit of issues. Oh yeah, and it's a short piece of the file, just a subset. The whole file that I have is like a gig, and of course you can't really open a gigabyte file in a text editor unless you have a very special text editor. Notepad++ won't like me opening a one gigabyte file. I think the maximum for Notepad++ is like 350 megabytes. So I gave you a little piece.
All right, so question 2A: what is the official format specification of this VCF file, and which species are we dealing with? The official format is of course written down in the first line, so that's the 4.1 format. And then the question is what species we are dealing with. Can we see that anywhere?
Okay, so we can see here in the last line that the reference file being used is the Mus musculus genome. So of course that means that we are dealing with mouse. "Question 2A, genome is seen." How do you mean, genome is seen? "The genome is seen. Sorry, bad English." That doesn't matter, that doesn't matter. You can type in German in the chat as well if you feel more comfortable with that. I can read German. "You can see the genome and Google it when you find it." Yeah, yeah, yeah. In theory people should know that Mus musculus is mouse, but that's just coming from me as a mouse geneticist. But yeah, you can just say, well, the genome is mm10, Mus musculus. And if you had a little bit of Latin in school, then you would know that that is the common house mouse.
All right, so question 2B: when was this file created and by which tool? That's more looking at the header lines. So the question is, is there a date in this file where it says when it was created? Yes, there is, because it was created using the GATK and that's what it says here. So this is the filter which was applied. And this is the format which describes the file. And then here we have the command line, so the line that created this file, the command which was typed in by the guy that made the file. In this case me. So this is the GATK command line. That's something that you kind of have to know, because the command line is just pasted together. GATK is the Genome Analysis Toolkit. It's made by the Broad Institute, and there are many tools that can create VCF files. So it's not just the GATK, it could also be BCFtools or VCFtools or something like that. And of course the file was created in 2014, Monday the 5th of May, at 8:20 in the morning. Damn, I was early at work that day. 8:20 is not a very common time for me to be at work.
Question 2C: how many chromosomes are in the full VCF file?
Well, you can just look at the contigs here, right? In this case they are full chromosomes, but they could also be contigs. And in this case it's 19 chromosomes plus X, Y and MT. So that's 22 chromosomes.
Question 2D: what is the longest chromosome and how many base pairs long is it? The longest chromosome is almost always chromosome one, because that's the way they are ordered. The longest chromosome is just called chromosome one. It's kind of a standard that no one really talks about. The only thing which can generally rival chromosome one is chromosome X. There are some species in which chromosome X is longer than chromosome one. But chromosome one in this case is 195,471,971 base pairs, and that is the longest one.
Question 2E: what is the shortest chromosome and how many base pairs long is it? There are two which kind of qualify for this, because the mitochondrial one, do you really count it as a chromosome? It kind of is a chromosome. So if we assume that the mitochondrial chromosome is the smallest chromosome, then it's only 16,299 base pairs.
Question 2F: how many sequence variants are in the subset of the VCF file? Every line here is a single sequence variant, and you can see from the ID column that some of them are known sequence variants, and the dots are sequence variants which are novel. So how many are there in total? It starts at line 51 and then we go all the way down. So it is 3,000 minus 50, which means 2,950 sequence variants.
Question 2G: on how many individuals were sequence variants called? Then we have to go all the way up again and look at the header. Here in the header line we see that there is only one individual. The first nine columns are more or less standard. They are defined, and then there is a single column for each individual.
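Questions 2C to 2E and 2G can also be answered from the header with a few lines of R instead of reading it by eye. The sketch below works on a made-up, shortened header. It assumes contig lines of the form ##contig=<ID=...,length=...>, which may differ in detail from the provided file. The lengths of chromosome 1 and MT are the ones from the lecture, and the length given for chromosome 19 is a placeholder.

```r
# Made-up, shortened VCF header (assumption: contig lines look like this)
header <- c(
  "##fileformat=VCFv4.1",
  "##contig=<ID=1,length=195471971>",
  "##contig=<ID=19,length=1000000>",   # placeholder length, not the real one
  "##contig=<ID=MT,length=16299>",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t860v2"
)

# Pull the ID and the length out of every ##contig line
contig_lines <- grep("^##contig=", header, value = TRUE)
ids <- sub("^##contig=<ID=([^,>]+).*$", "\\1", contig_lines)
lens <- as.numeric(sub("^.*length=([0-9]+).*$", "\\1", contig_lines))
names(lens) <- ids

n_contigs <- length(lens)           # 2C: number of chromosomes / contigs
longest <- names(which.max(lens))   # 2D: name of the longest one
shortest <- names(which.min(lens))  # 2E: name of the shortest one

# 2G: every column after the nine standard ones is one individual
columns <- strsplit(grep("^#CHROM", header, value = TRUE), "\t", fixed = TRUE)[[1]]
samples <- columns[-(1:9)]

print(n_contigs); print(longest); print(shortest); print(samples)
```

For the real file, the header lines could be read with readLines and filtered on a leading "#" before running the same steps.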
So in this case, since there's only one additional column, it means that there's only one individual in this file. This is the 860v2 individual.
All right, then the next question is: open R, set your working directory and load the file in R using the read.table function. So let's do that. Let's switch to R, and oh, let's switch to R for you guys as well. In Notepad++ I'm just going to create a new file, right? So I'm going to start coding. The first thing which I do when I start coding is write: answers to lecture 12. Who am I? Danny Arends. And when was it first written? February 2021. Just to have a little bit of a header, right? I always do this. Every time that I start programming and open up a new file, I make a header, just to remind myself and force myself to work like a professional. In the end, of course, code doesn't really care if there's a header or not, but you are not working in a vacuum. You're always working with other people. So in theory your code could end up living on a server somewhere. And in case someone has any issues with the code that you wrote like 20 years ago, then it's nice that they can figure out who you were and when you wrote the code, so they can mail you about it and ask you questions. What you normally would do is add your address or email address to it, and that would be enough for people to find you later on. It gives you a little bit of an overview, and it just makes code so much prettier.
So let's save this file somewhere. I'm going to save it here, probably, and I'm just going to call it assignment12.R. Let me see that I'm not overwriting any files there. All right, so the first thing that I do in any R file is set my working directory. So let me copy-paste the location where it is. I've recently uploaded all of the things to the OneDrive, so I can just get it from there.
In R we have to change the backslashes to forward slashes, because a backslash inside an R string starts a character escape. And then I want to do a read.table. What was the file called again? sample.vcf. And then I'm just going to store this in a variable called mdata. All right, so let's go to R, see if everything went okay and if we can load in the file.
All right, so if we look at mdata, just the first 10 lines, then I see directly that something goes wrong, right? Because there's definitely a header in the file, but the read.table function doesn't recognize this when loading it, and it just gives me V1, V2, V3 for the different columns. So let's go back to Notepad++ and fix that. Fixing that means just adding header = TRUE. All right, let's retry this. So let's go back to the R window, copy-paste, look at mdata, and now everything seems to be okay.
Not really, because my individual name is kind of broken. R doesn't like column names which start with a number. So the individual name 860v2 is not a proper column name in R. Of course we can either just accept this and remember it, or we can do something else and say: no, I want to use the name that I gave to the animal. Which in this case I do want to do, right? The people in the animal house are not going to understand it when I change the names of animals. So I'm just going to say check.names = FALSE. So I don't want you to check my names or fix them for me. So let's copy-paste this. All right, so let's go to R, copy-paste, and then show the first 10 lines, and indeed now the name of the individual is correct. So without the X in front of it.
All right, so that was the first question. So 3A: open up R, set your working directory and load the file. Okay, so then question 3B is: how many sequence variants are known?
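The three read.table attempts from question 3A can be reproduced on a tiny inline table. The rows below are made up and the column set is shortened. The sketch also assumes the column header row has no leading "#", which is what the output in the lecture suggests. A standard VCF column header starts with "#CHROM" and would be skipped by read.table's default comment.char = "#".

```r
# Made-up mini table (assumption: header row without a leading "#")
vcf_text <- paste(
  "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\t860v2",
  "19\t3000100\trs0000001\tA\tG\t120.5\tPASS\t0/1",
  "19\t3004200\t.\tC\tT\t25.1\tLowQual\t1/1",
  sep = "\n"
)

# Attempt 1: no header argument, columns are called V1, V2, V3, ...
no_header <- read.table(text = vcf_text)

# Attempt 2: header = TRUE, but R "repairs" names starting with a digit
fixed_names <- read.table(text = vcf_text, header = TRUE)

# Attempt 3: check.names = FALSE keeps the individual name as it is
mdata <- read.table(text = vcf_text, header = TRUE, check.names = FALSE)

print(colnames(no_header)[1])
print(colnames(fixed_names)[8])
print(colnames(mdata)[8])
```

With the real file, the text argument would be replaced by the file name, after setting the working directory with setwd and a path written with forward slashes.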
I already told you that if you find an ID in the third column, then it is a known variant, and if there's a dot, then it's an unknown variant. So the question is how many are known. Let's go back to our Notepad++ window, there you are. What we do is take the third column, and normally I would say we just use 3, right? But it's much cleaner to use the name of the column, which in this case is ID. Because in theory, if at a later point in time someone adds a column in front, then I'm not selecting the right column anymore. So in R I always try to select columns by name. That helps me, because if a next version of the file has an additional column or is missing a column, then things don't start shifting and I'm still working on the right column.
So in this case I want to know all the known variants. I'm just going to say, well, a known variant is not a dot. Which ones are those, right? So which IDs are not missing? And then I'm going to ask for the length of that, and this should give me the number of known variants. So let's switch again to R and copy-paste it in. I'm just going to make my window a little bit bigger. And then it tells me that there are 2065 known variants in here. If I want to know the unknown variants, I can just ask which ones are equal to the dot, and that means that there are 885 unknown genetic variants in this file. So very basically: take the third column, called ID, check how many are missing or not missing, and use the which statement to get which ones those are.
You could do this a little bit quicker, right? Because you could also leave out the which. If I do it like this, then it just gives me a TRUE/FALSE vector back. So for every entry it will say whether this entry matches or does not match the comparison that you did.
So we can use this shortcut and just sum it up, because TRUE has a value of one and FALSE has a value of zero. If I sum it up, then it tells me the same thing. So you can use the sum function as well, because a TRUE/FALSE vector is nothing more than a vector with ones and zeros in there. But to answer the question: there are 2065 known variants and 885 unknown variants.
All right, so then question 3C was: how many sequence variants are of low quality? When we are thinking about low quality, we want to see where the quality is scored, right? Here we have a column called FILTER, and the filter tells me if the quality was deemed too low to be reliable. The GATK tools that we used generated the VCF file, and when calling the SNPs they also gave a quality score. So the quality score is here. The threshold was probably a quality score of 30 or 40, well, probably 40, because this one is not deemed low quality. So now we want to know how many times the keyword LowQual is in this FILTER column.
So when we go back to Notepad++, let me open it for myself as well. We can just paste this, and then we say we want to look at the FILTER column and we want to know where the filter is equal to LowQual. The length of the which will then give us the number of low quality variants in here. So let me go back to R and just copy-paste it in. There are 309 genetic variants called which have a low quality, which means that these should actually be removed from the analysis because they are not reliable.
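Questions 3B and 3C come down to counting matches in a column, either with length(which(...)) or with sum on the TRUE/FALSE vector. Here is a sketch on a made-up data frame. The IDs are invented, and the filter keyword is assumed to be spelled LowQual, so that spelling needs checking against the actual file.

```r
# Made-up stand-in for the loaded VCF table
mdata <- data.frame(
  ID = c("rs1", ".", "rs2", ".", "rs3"),
  FILTER = c("PASS", "LowQual", "PASS", "PASS", "LowQual"),
  stringsAsFactors = FALSE
)

# Question 3B: known variants have an ID, unknown variants have a dot
n_known <- length(which(mdata[, "ID"] != "."))
n_unknown <- length(which(mdata[, "ID"] == "."))

# The shortcut: TRUE counts as 1 and FALSE as 0, so sum gives the same number
n_known_sum <- sum(mdata[, "ID"] != ".")

# Question 3C: variants flagged as low quality (keyword spelling is an assumption)
n_lowqual <- length(which(mdata[, "FILTER"] == "LowQual"))

print(c(known = n_known, unknown = n_unknown, lowqual = n_lowqual))
```

One difference between the two forms: which drops missing values, while sum returns NA if the column contains an NA, unless na.rm = TRUE is added.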
It might be that they do validate in the lab, but there's a high likelihood that when you try to, for example, sequence one of these variants, the sequencing will just not work, because the quality of the original call was too low to be reliable.
All right, so question 3D: how often is A the reference base pair? The reference base pairs are stored in the column REF. So we're just going to do the same thing again. Let's go back to the Notepad++ window. I'm going to duplicate my line and ask how often the reference is A, and that will give us the answer to how many there are. And when you guys are writing these files, I would definitely advise you to add something like this. So say: this is question 3D, I think, right? We were at 3D. Yeah, so this is 3D, and here we have question 3C, and this is the answer to question 3B, and here we have question 3A. Just to make it clear what you're doing. Of course, in this case it's relatively nonsensical, but the longer your scripts become, the better it is to have some kind of comment in the script telling you what you are answering.
All right, so let's run question 3D, because we are curious to see how many times the reference base pair was A. So A was the reference for 646 SNPs. And of course we can also do something tricky, right? It was not in the assignment, but we might want to know how many times A is the reference allele and the filter is not LowQual. In R we can quite easily ask those questions, because we can just combine these statements. So we can say: which filters are not LowQual and the reference is A. I can just combine two or three of these questions in one go.
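Combining two conditions in a single which call could look like the sketch below. The data frame is made up, and the filter keyword LowQual is again an assumed spelling. The comments follow the advice above to label each answer in the script.

```r
# Made-up stand-in for the loaded VCF table
mdata <- data.frame(
  REF = c("A", "A", "C", "A", "G"),
  FILTER = c("PASS", "LowQual", "PASS", "PASS", "PASS"),
  stringsAsFactors = FALSE
)

# Question 3D: how often is A the reference base pair?
n_ref_a <- length(which(mdata[, "REF"] == "A"))

# Extra: reference is A and the call is not low quality.
# The & operator combines the two TRUE/FALSE vectors element by element.
ref_a_hq <- which(mdata[, "FILTER"] != "LowQual" & mdata[, "REF"] == "A")
n_ref_a_hq <- length(ref_a_hq)

print(c(ref_a = n_ref_a, ref_a_high_quality = n_ref_a_hq))
```

More conditions can be chained with further & operators. The element-wise & is the one to use here, not &&, which is meant for single TRUE/FALSE values.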
So now I'm saying: give me all the SNPs which were not low quality where the reference allele is an A. When we go to R, we can then have more confidence, because we now only select the SNPs where the reference is an A and where there's not a low quality call. We see that out of the 646 SNPs which had an A as the reference base pair, there are 594 which are not low quality, so which are reliable and high quality. That's one of the things that I like about R. R is really good at these kinds of stacked questions where you have like five or six filters that you want to apply in one go. And you can of course combine as many as you want.
All right, let's go on. Question 3E: for this subset, what is the average number of sequence variants per 10,000 base pairs? Hint: you can use the diff function to get the distance between SNPs. All right, so we want to make a subset. First let's go back to the Notepad++ window for you guys. I'm just going to go with the question, right? I'm not going to filter for low quality. I'm just going to ask which of these are A. So these are the ii, the indexes that I want to use. You can also call it indexes or any other variable name. So I'm first going to store this in a variable, and then I'm going to use this variable to make a subset. So I'm going to say mdata, indexes, so only select those rows. And this is my sdata, for subset data.
So first let's make the subset. Let's go to R and show you guys how the subset looks. So sdata now looks like this. Let me show you the first 10, right? Now you see here that the reference is always A, and here you see the positions. So now the question is: what is the average number of sequence variants per 10,000 base pairs? There are two ways of answering this, right?
We can just say: I'm going to take the last position and subtract the first position. This will work when they are all on the same chromosome. So let me first check that, because if they are spanning like five or six chromosomes, then this will be a big, big issue. When I print the chromosome for my subset data, then we see that all of these variants are on chromosome 19. So that's not an issue.
Then we can take my subset data, right, let me show the first 10. We have a column called POS, which is the position they are at. So I can say: take the position column and give me the maximum value and give me the minimum value. That means that the first SNP is located at around three megabases and the last SNP is located at around 6.1 megabases. I can then subtract these two to see how big the region is in which these SNPs are located. So I'm going to go back to the Notepad++ window, and I actually closed it for myself. So I'm just going to say: the length of the VCF is the maximum value minus the minimum value. That will show me how long the segment is. So we are looking at a DNA segment which is in total, well, 3.1 million base pairs long.
How many SNPs do we have? Well, that is the number of rows in sdata, right? So there are 646 SNPs in 3.1 million base pairs. That means that there is a SNP every so many base pairs, and every so many base pairs is, of course, the length of the VCF divided by the number of rows of sdata. If we throw this in R, then it will tell us that there's around one SNP every 4,806 base pairs. And then we divide 10,000 by this, so let's go back to the window. Oh, something went wrong. So we can just say 10,000 divided by this thing. And now I have to add brackets, of course, brackets like this.
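The calculation for question 3E, including the chromosome check and the brackets, could be sketched as follows. The positions are made up so the arithmetic is easy to follow, and the variable names are illustrative. The last two lines show the diff route from the hint, as one possible reading of it.

```r
# Made-up subset: five SNPs on one chromosome
sdata <- data.frame(
  CHROM = rep("19", 5),
  POS = c(3000000, 3004000, 3011000, 3015000, 3020000),
  stringsAsFactors = FALSE
)

# Max minus min only makes sense when everything is on a single chromosome
stopifnot(length(unique(sdata[, "CHROM"])) == 1)

# Length of the region spanned by the SNPs
region_length <- max(sdata[, "POS"]) - min(sdata[, "POS"])

# One SNP every so many base pairs
bp_per_snp <- region_length / nrow(sdata)

# SNPs per 10,000 base pairs; the brackets make sure the division
# by the number of rows happens before dividing 10,000 by the result
snps_per_10kb <- 10000 / (region_length / nrow(sdata))

# Alternative using the hint: average distance between neighbouring SNPs
mean_gap <- mean(diff(sdata[, "POS"]))
snps_per_10kb_diff <- 10000 / mean_gap

print(c(bp_per_snp = bp_per_snp, snps_per_10kb = snps_per_10kb,
        snps_per_10kb_diff = snps_per_10kb_diff))
```

The two routes give slightly different numbers, because n SNPs have only n - 1 gaps between them. With hundreds of SNPs the difference is small.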
And then we get the answer that there are 2.08, oh, you guys can't see the R window. So there are 2.08 SNPs per 10,000 base pairs. That is 3E. So let me go back to the window and go here and say: this is question 3E.
All right, so question 3F: what is the average quality as mentioned in the QUAL column? So again, just calculate an average. I think I asked this for the whole data set. So let's just take the whole data, which is stored in mdata. We can just say, question 3F: mdata, take the QUAL column and take the mean value. All right, so let's throw this in R. R will then tell us that the average quality is 103.98, so around 104.
And then the next question, question 3G: what is the average quality when we ignore the LowQual sequence variants? Of course, to ignore the LowQual sequence variants we can use the thing that we did before. So we can say highqual, which is everything that is not LowQual. Oh, you can actually see my window. So I just copy-paste the which statement that we had before, and then I flip the "is equal to" around to "is not equal to". So which mdata FILTER is not equal to LowQual, and I call this highqual, so high quality. And then I'm just going to say: mdata, take the high quality SNPs, take the QUAL column, and take the mean. All right, let's go to R and see if everything worked correctly. When we ignore the low quality SNPs, we can see that the quality goes up a little bit, but not that much. It's 113.7, which is of course logical, because the low quality SNPs have low quality scores, so leaving them out will not improve the average that much.
All right, so those were the answers to question 3. The next questions were about the GTF format. So let's go back to Notepad++ and show you guys the GTF file that I gave you. Let's open this up. So the GTF file looks like this, and the question 4A was: which species does this GTF file belong to?
So does anyone in chat have an idea what an Sscrofa is? You guys are the biologists, I'm the bioinformatician. I'm not supposed to know this. I can just Google it, but you guys should know what it is. Come on, chat participation. You get a participation prize. "No idea," says Commando. "Sorry, what? Can you repeat the question?" So the question is: we see here that the genome build is called Sscrofa10, right? So what kind of an animal is an Sscrofa? Question 4A of the assignment. I'll have a sip of coffee while you guys think about what an Sscrofa is. Some people were really good at Latin last time. Like I had the crayfish and "Godzilla". No, it's not a Godzilla. "No idea."
All right, so Sscrofa stands for Sus scrofa, which is a pig. So in this case, pig data. "Scrofa means hooker." What, in which language does it mean hooker? That doesn't make sense to me. It's probably not in Latin. "Italian." Okay. Are you sure? I'm going to ask my Italian PhD student about that. I'm not thinking that it is, but it could be, it could be. But Sscrofa stands for Sus scrofa. Oh, you Googled it. Yeah, no, Google's always right in that sense. All right, so Sus scrofa, we're looking at pig. So question 4A, which species: pig.
Question 4B is: load the file into R using the read.table function. Okay, so let's do this again. Let's go back to our assignment and say hashtag, question 4B. And there's a key missing there. So again, we're just going to be lazy. We're just going to copy this. I'm always starting off simple, so I'm never using any extra arguments at the beginning. The first thing that I try is just read.table. In this case our file is called sample.gtf, and the variable is not called mdata but gtfdata. All right, so let's go to R and see if it loads in correctly. So we go to R and then it gives me an error, right? It says: line 1 did not have 41 elements. So that means that it scanned line one and then it scanned line two.
And there was a mismatch between the number of elements on line one compared to line two. So let's have a slightly closer look at this file, see what is actually going wrong. So hashtag lines are ignored by the read.table function. So it ignores these things. And then here in line one, it tries to parse this and then it figures out that this line is completely different from this line. And when we look at it in here, then it does seem that the second line is much longer and the third line is also much longer. This is because we have to supply the separator, right? Because when we don't tell read.table which separator to use, it just splits on white space, so on spaces as well as on tabs. And the last column of this file has spaces in it, so R cuts the lines up into different numbers of elements. So let's go back to Notepad++ and just explicitly tell this thing that the separator that we want to use in this case is the tab, right? Just like a VCF file, a GTF file is tab separated. So let's go back to R and see if that fixes our error. And indeed it does. So we look at gtfdata. Let's just look at the first five lines. And then you see that it loaded it in correctly. You see that the ninth column is just a very big column with all kinds of data in there, and this data has another separator.
All right, so then the next question was: how many genes are in this small subset of the GTF file? So you can see that the type of the thing that we are looking at is listed in the third column. The third column in this case is called V3 because R didn't find a header. So in this case, of course, we can't use a column name from the file to select from, right? So here we're just going to say gtfdata, column three. So we want to do something like this. And then of course we can use the same structure as what we did before. But let's do it slightly differently.
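The loading step from question 4B can be sketched as follows. This is a minimal sketch, assuming a few made-up GTF-like lines instead of the real sample.gtf; the gene and transcript identifiers are hypothetical. The `strsplit` part is only a rough illustration of why splitting on white space goes wrong, not a reproduction of what `read.table` does internally, and `quote = ""` is an extra choice made here so that the quotes in the ninth column are kept as plain text.

```r
# Made-up GTF-like lines; the identifiers are hypothetical, not from sample.gtf.
gtf_lines <- c(
  "#!genome-build Sscrofa10.2",
  "1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id \"G1\";",
  "1\tensembl\texon\t100\t300\t.\t+\t.\tgene_id \"G1\"; transcript_id \"T1\"; exon_number \"1\";"
)

# Rough illustration: splitting on any white space gives a different number
# of pieces per line, while splitting on tabs gives nine pieces for each line.
records <- gtf_lines[!startsWith(gtf_lines, "#")]
n_whitespace <- lengths(strsplit(records, "[ \t]+"))
n_tab <- lengths(strsplit(records, "\t", fixed = TRUE))

# Explicit tab separator; lines starting with "#" are skipped as comments.
gtfdata <- read.table(text = gtf_lines, sep = "\t", quote = "",
                      comment.char = "#", stringsAsFactors = FALSE)

print(n_whitespace)
print(n_tab)
print(dim(gtfdata))
```

For the real file, the `text = gtf_lines` argument would be replaced by the file name, for example `read.table("sample.gtf", sep = "\t", ...)`.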
So because we already used length of which a whole bunch of times, we're going to be a little creative in answering this. And we're just going to answer this a little bit differently. So I'm going to take the third column and I'm just going to use the table function, because you might not be interested just in the genes but you might be interested in some other things as well, right? How many exons are there, or something like that? All right, so if you go back to R and we just use the table function, then what the table function does is it goes through the vector that you supplied, so in this case the column that you supplied, and it will count the occurrences of all of the individual things that it finds there. So here you can see that it finds 1160 coding sequences, 1216 exons, there are 106 five prime UTRs and there are 120 genes in here. So the answer here is 120 genes.
All right, so the last question, question 4D, is: from which sources was this file created? So let's go back to the Notepad++ window again and look at the header lines on top, right? So from which sources was this file created? Well, it was created from the Sscrofa10.2 genome build, and the accession of this build is GCA_000003025.4. And this last part here, this is the database identifier. So if you wanna look it up in a database like Ensembl, then you can use this database identifier to find the correct genome build or even download the whole genome sequence if you wanted to.
So that's more or less it, and then there was an additional question in the assignments, which is to make an R package. So I hope everyone tried. It's not hard making an R package. It's just that you have to follow the steps very carefully, but it was explained in the last lecture. So if you just go slide by slide and just do what the slide tells you, then in the end you should end up with a working package. Again, don't forget: if you wanna build a package in R, you have to install Rtools.
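The counting step from question 4C can be sketched like this. It is a minimal sketch on a made-up feature vector standing in for the third column of the GTF table, so the counts are not the ones from the assignment.

```r
# Made-up feature column standing in for gtfdata[, 3] (column V3).
features <- c("gene", "exon", "CDS", "exon", "CDS", "gene", "exon", "five_prime_utr")

# table() counts how often each distinct value occurs in the vector.
counts <- table(features)
print(counts)

# A single count can be pulled out by name.
n_genes <- as.integer(counts["gene"])

# The same number via the length(which(...)) pattern used in earlier questions.
n_genes_which <- length(which(features == "gene"))
```

The advantage of `table` over `length(which(...))` is that one call gives the counts for every feature type at once, so exons or coding sequences come along for free.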
So Rtools is the compiler toolchain which allows you to build and install the R package, compile the C code and all the other things that belong to it.
All right, so back to the overview. We've been streaming for 50 minutes. So let me see if Jan is in chat. Jan, you're here. So I made you some slides about the DNA metabarcoding which you asked for last week. And I think we should just do a break first, right? Because if we start talking about it now, then the recording will go over an hour and the file size will be too big. So I'm thinking that I will do a very quick break, like I will probably be back at around three. So just a quick break and then we will talk about DNA metabarcoding, and I think that there will be some questions from you guys and I hope I can answer them. Good, so I will stop the recording.