All right, so I'm recording now. Welcome everyone; and if you're watching this on Moodle, thank you for joining on Moodle as well. Today we will be doing descriptive statistics. The first lecture was more or less about types and variables. In the second lecture we talked about control structures, for loops and while loops, and how to get a flow going through your program. The third lecture was loading in your data. If you have all of these tools, then you're more or less done as a programmer; of course, then you still need to learn all of the stuff that R gives you. And since R is a programming language for statistical computing, it gives you a lot of things, like descriptive statistics. So we will be going over basic descriptive statistics today. But first, of course, we will be doing the answers to assignment three. I'm again very sorry for not being able to keep the Wednesday appointment or the Tuesday appointment. We had a very short, brief Wednesday meeting with three or four people joining. There were a couple of interesting questions, so I made a recording and I will put it online. I also forgot to put up the recording of the previous Tuesday question hour; I've now put that online, and I will put the other one online as well. The slides should also be online for anyone who wants to take notes on them. All right, so let's move to Notepad++. And that is not what I wanted to show you guys. So these were my answers. Again, I always start with a header; you can see that I last changed it in May 2017, so that means that some parts might not work and I might have to edit them live. Let me pull up the questions as well: assignment three. So the first question was loading in a whole bunch of data files. Question zero explicitly mentioned that you have to unzip the data; I know that R can read zip files directly.
But the idea of this assignment was not to load them directly from the zip file, because there are many different text files in there. So the question was: read in the different text data sets (txt, FASTA) using read.table or read.csv, make sure that the separator is set correctly, and handle all of the other requirements so that the data looks good in R and you can actually use it. So let's just start. The first file that I loaded in was lecture three, data one (.txt). I'm going to open it up for you guys in Notepad++, because that's the thing that I also always do: you have to look at your data. What we see here, if we look closely, is that we have a couple of columns, and we see directly that this file is tab separated. This is one of the things that I really like about Notepad++: you can see the difference between a space and a tab character, because tab characters look like these little arrows. Furthermore, you can see that this file is quoted in some places. How do you see that? There are double quotes surrounding some fields, which will help us read it in. So the way that I answered the first one is that it's just a matrix. The only thing is, if you use the read.csv function, you have to set the separator to tab, because the default separator for read.csv is of course a comma. So I'm just going to copy-paste this into R for you guys so you can see how it looks, and then you can also see how I normally look at the data. So I copy-paste it in, and then I say head. I call the data set "one", so I just define a new variable, I set the separator to tab, and then it looks okay directly: it has figured out that there is a header, the header is there, and the first data line begins with an Ensembl gene ID. The only thing I think you could still do to improve this a little bit is that the Ensembl gene IDs can function as row names.
And as long as there are no duplicate row names, they should be allowed. So a call saying row.names = 1 would also work, and now of course you have row names. So instead of only having the ability to say "give me row number one", you can also say "give me the row with this Ensembl name", right? So you can select by Ensembl ID, and then you just get a single row of the matrix. All right, the second data file; let's switch back to Notepad++. The second data file is a little bit more tricky, because that is a FASTA file. And I put this in because not everything is a matrix, right? But even though it might not be a matrix, you can still use the read.csv function. You see here that I did a little trick, because I set the separator to a newline. First I'm going to open up the file for you guys so that you can see it. You see that the file just has a name, again an Ensembl gene ID, and then it has a long sequence. It doesn't look like a matrix, but it kind of is, right? It's a matrix which is separated by newlines, in a way. And this is of course up for debate; you could use the readLines function as well, or something like that. But since the assignment was to load it in using read.csv... let me get that window away. When I load it in, I just say that the separator is a newline, because it is kind of a single column, right, with newline separation. There is no header, and I'm saying read everything in as characters, because there's nothing numeric in there. So if we load this into R, then we get something called "two", because I'm assigning it to that variable. If I then look at "two" and just print it, you see that it actually already figured out that there are 50 elements. And if you go all the way to the back, you see that it just made one big column out of it: it just says V1, which means column number one.
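That first read can be sketched in a few lines; the filename and the example Ensembl ID are assumptions here, so substitute whatever your copy of the data uses:

```r
# Sketch of the first read; filename and example ID are placeholders.
one <- read.csv("lecture3_data1.txt",
                sep       = "\t",  # file is tab separated, not comma separated
                row.names = 1)     # use the Ensembl gene IDs as row names

head(one)                  # always inspect the first lines after loading
one["ENSG00000000003", ]   # select a single row by a (hypothetical) Ensembl ID
```

Because the IDs are unique, row.names = 1 is safe; read.csv would stop with an error if the first column contained duplicates.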
And you see that it has different rows. Of course, this is not the right format that you want to have your data in. So the thing that I did afterwards, to make it a little bit nicer: because of the structure of the file, having a name, then the sequence, then the name, then the sequence, we're back at the thing that we discussed in the first lecture, in the first assignments, where we talked about even and odd numbers. And that holds here as well, right? The odd lines contain names, while the even lines contain sequences. So we can use what we learned in lecture number one. I'm just going to define a new variable called fasta, and first I'm going to take the DNA sequences out. So I say: from matrix number two, select... and then it's the same as what we had before, a sequence from two to the number of rows, going by two. So this is just saying two, four, six, until we end up at the end of the file. In the next step I say that the names of these sequences are, of course, on the odd lines: one to the number of rows, stepping by two. And if we do this, the file looks a little bit better in R, and it's a little bit more structured, because now the names are coupled to the sequences. So if we do it like this, then if we look at the head, it still looks very much the same, but we see now that it really has names in there. So now I can use a name to select the sequence that I want. I could just say: fasta, give me the sequence of this element, and then it will show me just the DNA code. Of course, this is just for you guys to practice loading in different types of data, but this is the way that I approached it, just because there is some kind of a structure.
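The newline-separator trick plus the odd/even split can be sketched like this; the filename is an assumption, and depending on the file the names may still carry a leading ">" that you would strip:

```r
# Sketch: read a simple FASTA file as one column, then pair names with sequences.
two <- read.csv("lecture3_data2.fasta",
                sep        = "\n",        # treat every line as one field, one column
                header     = FALSE,
                colClasses = "character") # nothing numeric in a sequence file

fasta        <- two[seq(2, nrow(two), by = 2), 1]  # even lines: the sequences
names(fasta) <- two[seq(1, nrow(two), by = 2), 1]  # odd lines: the names

head(fasta)   # a named character vector: select a sequence with fasta["name"]
```

The result is a named vector, so a single bracket with an ID gives back exactly one sequence.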
And of course, you can use this little trick where you say that the separator is a newline to still read it in as a matrix. All right, the third one; let me open it up so that we can look at it quite quickly. This is a VCF file, a very common file format for storing single nucleotide polymorphisms and small insertions and deletions when you do DNA sequencing. And this file has a massive header: the first 53 lines or so are just a description of how this file was created. Because of that, this is of course not a matrix. But if we scroll down further, we see that it has a typical matrix structure: it has a header. The big issue with this header here is that there's actually a comment character in front of it, so R won't load it in, and we have to explicitly give the column names. The format has two hashtags in front of a line when it's a comment, or more or less a description line, and the single-hashtag line is the header; but R treats hashtags as comments, so it won't automatically recognize that these are the column names. The way that I did it is to say: well, it's just a matrix, so I'm just going to use the read.table function. I could have used the read.csv function as well, but read.table has the default separator set to tab, and it's a tab-separated file. Then I load in the file, and then I explicitly assign the column names; I just copy-pasted them from the file and made it explicit that we are going to set column names. So let's quickly go to R and make sure that it loads in. Let me switch to the R window for you guys. Now when we say head of three, we see that it now has the proper chromosome and position, so the column names have now been assigned. And then there's a question: couldn't we use "collapse hashtag"? So how would you load it in then?
Because there's no real way that you're going to convince R, using the read.table function, to load in the hashtag lines. You could actually just replace the comment character, right? Because if you look at the read.table function, it has an argument called comment.char, which is the comment character, so you could set that to something else. But you want to ignore the lines which have the double hashtag, otherwise you would have to use the skip = ... argument. So it won't really load it in directly. But if you have successfully loaded it in in a different way, that's perfectly fine; there's more than one road to Rome. This was the approach that I took, and it's only a couple of columns; if there had been a thousand columns, then of course I would have gone for a different approach. But the "collapse hashtag" suggestion might work; you should have a look. One of the things that you notice here is that the last column name, 5073, is not valid according to R, so you see that R changes the column name to X5073. That has to do with the make.names function: in R, every column name should be a proper variable name, and a variable name cannot start with a number, it has to start with a letter. So X5073 is a valid variable name, but 5073 is not; that's the difference we see here. All right, the next one: lecture three, data four. Let me open it up; my window placement is a little bit annoying. And let me switch you guys to the Notepad++ window. This file, when you look at it, looks quite messy. But the thing that you can directly see is that the semicolon is the separator, right? It's again kind of a matrix format. You also see that the numbers here are all ones and twos, but then you have 1, 2, 3, 4, 5, 6, 7. So I'm betting that this thing here is the row names, that it has explicitly numbered row names.
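The VCF load with explicit column names can be sketched as below. The filename is assumed, the first nine names are the standard VCF header, and the sample column "5073" is taken from the lecture; with this many columns you can type them out, with a thousand you would parse the header line instead:

```r
# Sketch: read.table's default comment.char = "#" silently drops both the
# "##..." description lines and the "#CHROM..." header line, so the column
# names have to be assigned explicitly via col.names.
three <- read.table("lecture3_data3.vcf",
                    sep       = "\t",
                    col.names = c("CHROM", "POS", "ID", "REF", "ALT",
                                  "QUAL", "FILTER", "INFO", "FORMAT", "5073"))

head(three)   # note: R rewrites the invalid name "5073" to "X5073"
```

The rewrite to X5073 happens because col.names is passed through R's name validation, the same mechanism as make.names.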
So the way that I loaded it in was to just use read.csv, set the separator to the semicolon, and in this case set the column classes to numeric, because I want to load it in as a numeric matrix. Let me see how that works; let me switch you guys to R. Because all of the numbers are ones and twos, it's definitely a numeric matrix. And I only say "numeric" once, because R will recycle the colClasses: if every column has the same type, then you only have to mention the type once. You do not have to do something like colClasses = c("numeric", "numeric", "numeric", ...); that's just nonsensical. You can just say "numeric" once, and then for each column it will use that. So let's just look at the first couple of lines: head of four. And now we see that it loaded in the data more or less correctly. It's still a big data set, it has a lot of columns. One of the things that I want to check is here: you see indeed that it figured out that the row names were in the file, and that it didn't shift them into the first data column. That's one of these little things that you have to check, to make sure that your data doesn't shift by one. Of course, we could disable the row names, and then we have an issue, because then everything starts shifting, or it gives you an error saying that there's one more column in the data than there is in the header. So an error like that could occur as well. But again, it's just a matter of figuring out what the separator is, and in this case recognizing that it's a numeric matrix, so we can just use the numeric colClasses. All right, next file. I have to do something about my window placement. The next file again is a VCF file; I don't know why I gave you two VCF files, I'm not entirely sure, but there might be something to it. But again, it's just a matrix, so we can just load it in. Let me show you guys the Notepad++ window.
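The semicolon file can be sketched like this; the filename is an assumption:

```r
# Sketch: a semicolon-separated, all-numeric matrix with numbered row names.
four <- read.csv("lecture3_data4.txt",
                 sep        = ";",        # the file uses ; as separator
                 row.names  = 1,          # the explicit row numbers in column 1
                 colClasses = "numeric")  # recycled: one entry covers all columns

head(four[, 1:5])   # peek at the first rows of the first few columns
```

The recycling of colClasses is the point here: a single "numeric" is reused for every column, so you never have to spell out a vector of identical types.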
Also this one, since it's just a VCF file, we can load in very similarly, as a matrix; we don't even have to specify the separator when we use the read.table function. So you see that I'm switching between the read.table and the read.csv function, and that is because, although they are equal in spirit, they are slightly different: they have slightly different default values. Sometimes read.table works better than read.csv; it just depends. So I always try both, to see which one is the easiest and gives me the shortest call in R. All right, so I'm going to copy-paste it in, and then we're going to look at the head of five. If we look at the head of five, we see that indeed we now don't have column names; again, I should specify the column names to make sure that they're in there. And again, this here is the chromosome. One of the things that we see here is that there's actually something weird: we see these dots. And when we go back to the file, we can see that the file indeed has dots at these positions. These dots mean that the value is missing. So if I really wanted to do this correctly, I would say something like na.strings = c("NA", "."). Of course, "NA" itself is a missing value, but I also want the dot to be interpreted as a missing value. Now when I load it into R, whenever it encounters a dot, it will actually replace it by a missing value. And in this case, the value really is missing, so you probably want to add the dot to the missing-value list. And of course, you still have to set the column names; I didn't do that because I was a little bit lazy. All right, so the next one is a CSV file. Let me open it up for you guys, just so that you can see it. It's a comma-separated file, and here we see that there's something interesting, because it has a header.
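The missing-value fix can be sketched in one call; the filename is assumed:

```r
# Sketch: treat both "NA" and "." as missing while reading the VCF-like file.
five <- read.table("lecture3_data5.vcf",
                   na.strings = c("NA", "."))  # "." also becomes NA

sum(is.na(five))   # quick sanity check: how many missing values came in?
```

Without the extra "." entry, those dots would survive as literal character values and silently turn otherwise-numeric columns into character columns.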
And this header contains all kinds of information about this file. Then there's a question: is it weird that there's a longer sequence in V4? In the other ones it was only two letters. Let me see, V4... yeah, V4; I can tell you something about the VCF format. The VCF format is based on sequencing, right? So let me go back. It stores single nucleotide polymorphisms and small deletions and insertions. You can see here in the header that it says chromosome, position, ID, ref, and alt. Ref means: this is the sequence in the reference genome. Alt means: this is what it is replaced by. So here we see that there's a TG in the reference genome, but in the alternative, so in our sample that we sequenced, or in some of the samples that we sequenced, because we have three samples here, we found a T. That means that this is the coding for a deletion, because the G is deleted: in the reference genome it says TG, but in the genomes that we sequenced we find only a T. Here we see that there's a bigger deletion: the original sequence is G, G, A, A, T, C, A, T, A; that's the part that's written in the reference genome. But in the samples that we use, so in the chicken samples, which you can see, there was a deletion of these base pairs. These base pairs do not occur in the chickens that we're looking at, because the chickens that we're looking at only have a G. And it can be both ways around, right? These are all fairly minor deletions, but some deletions are much bigger, because in this case almost 28 base pairs get deleted: the original sequence in the reference genome has 28 base pairs, and in our chickens we only have one, so that means that 27 base pairs have been deleted in this case. So that's kind of how this file is structured. All right, the sixth data file; this is just a database dump from OMIM. OMIM is the database for Online Mendelian Inheritance in Man.
So it contains information about Mendelian phenotypes. Again, you don't have to know exactly how this works or what's in the file. But when loading it into R, one of the things that we can see is that we have to skip the first couple of lines; the first couple of lines do not match the table. And since it's not a double hashtag for all of those lines, R will not automatically ignore them. So when you just try to load it in using read.table or read.csv, it will complain that the number of columns is different from the number of observations, in a way. Because if it looks at the first line here, it doesn't figure out that this is a comment; well, this one it does, but the next line it doesn't, because it doesn't start with a hashtag. So the way that I loaded this in is to skip 17 lines at the beginning of the file, and then I say that the row names are in column one. Let's just load this into R, see how it looks, and see if we can still improve on it a little bit. When I looked at the file... we can look at the head of six, and what we see is that there is something weird: we again have a very strange missing value, because missing values in this file seem to be coded by three dashes, "---". Because after chromosome 17 we have something from a missing chromosome, or where we don't know the chromosome, and the same thing holds for the physical position start and the physical position end. If we look at the file, we can see it's "---", and "---" is a missing value. So we can add that again, saying na.strings: "NA" itself should always be interpreted as a missing value, but if you see three dashes, interpret those as missing as well. So if we load it in and then look at the head, now we see that indeed the missing values are coded as NA properly. You can see that here something went wrong.
But here, of course, we see that it's "ambiguous position" slash missing value. We can't really solve that; we could treat the whole thing as missing, but then we of course lose the information about ambiguous positions. So this is not a perfect table or a perfect matrix, but it's close enough, and after having it in R like this, it is quite usable. You do see that some of the header names get fixed, because a column name has to be a valid variable name, so spaces are replaced by dots. If you don't want this, we can actually add check.names = FALSE, and then we force R to use the names exactly as they occur in the file. So we now see that the column name still contains the space; it doesn't replace the spaces by dots. Sometimes you want that, sometimes you don't; it depends on what you're going to do with this table and the rest of your code. In this case it doesn't really matter, because we're not using the data later on. All right, back to the Notepad++ window. The last one is a matrix with malformed names. So here we use check.names = FALSE; and in the assignment answers, I actually didn't do that. Well, let's open up the OneDrive; I think these are actually the answers to the previous year anyway. If we look at it, then we again see it's all numeric, except of course for the ID column. So we have an ID, and then we see that the rest of the matrix is all numeric. So we could also add colClasses = "numeric", and that should be okay when we specify that the row names are in column one. But the main thing that is wrong with this one is that the names of the columns are not valid variable names. So here we would say: use check.names = FALSE. I don't know why I didn't add it, but that would be the way that I would load it in. So if I go to R, it would look something like this, and then we just say: give me the head of seven.
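The OMIM-dump read with all three fixes combined can be sketched like this; the filename, the tab separator, and the header flag are assumptions based on how the file was described:

```r
# Sketch: skip the free-text preamble, keep verbatim column names,
# and treat "---" as a missing value.
six <- read.table("lecture3_data6.txt",
                  sep         = "\t",
                  header      = TRUE,
                  skip        = 17,              # jump over the description lines
                  row.names   = 1,
                  na.strings  = c("NA", "---"),  # "---" codes missing here
                  check.names = FALSE)           # don't rewrite spaces to dots

head(six)
```

With check.names = FALSE the column names come through exactly as they appear in the file, spaces and all, which matters if later code matches on those names.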
And then we see that when we set check.names to FALSE, we indeed get the proper column names as they were in the file. If we wouldn't have done this, it would have just put an X in front of all of them. And of course, it depends on how you want to match things together whether this will work or not, but that's the way I solved it: just set check.names to FALSE, and then it will force R to use your names. All right. The next assignment was reading the lines of the text file lorem ipsum line by line, using a file connection and a while loop. Part of the code you could get from the lecture. The way that I solved this is to first make a connection to the file, and make sure that you specify that you want to have the file in reading mode. Because if you open it in writing mode, it will empty the file and destroy your original input data. So you always have to specify explicitly in which mode you want it opened. This file connection is our pointer to the file. Then we just use the magic incantation that says: while the length of the line from readLines, reading one line at a time, is larger than zero, read a line if available. And then, what do we want to do? First I just want to print the line to the screen, because I want to know for sure that it loads everything in correctly. So let's quickly go to R and load in the file; this is the answer to 2a. And it just scrolls down the screen, because I'm printing every time. I could have done something else: instead of printing the line here, I can keep a line number. So I could say line.n = 1, and then inside of the while loop I could say line.n = line.n + 1, adding it to my own structure.
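The read-loop "magic incantation" can be sketched as follows; the filename is an assumption:

```r
# Sketch: read a file line by line over a connection, counting lines.
t.file <- file("lorem_ipsum.txt", "r")  # "r" = reading mode; "w" would empty the file!

line.n <- 1
while (length(line <- readLines(t.file, n = 1)) > 0) {  # one line per pass, stop at EOF
  cat(line.n, "\n")    # print only the line number, not the whole line
  line.n <- line.n + 1
}
close(t.file)          # always release the connection when done
```

The loop condition does double duty: it reads the next line into `line` and stops as soon as readLines returns a zero-length result at the end of the file.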
And now here I could just say: instead of printing the whole line and filling up the screen with words, just print the line number that you're currently reading. If I do something like this, then of course it just prints one, two, three, four, until the end of the file. And if we do this, then hey, it takes a little while, but we can see that there are 151 lines in the text file. All right: copy the code from assignment 2a and adjust it so that for each line in the file, the number of words on the line is counted using the strsplit function; use the cat function to print to the screen "line x contains y words". All right, so let's do this, and let's see how I answered it. So of course, we again have to use a line number, right? And since we're using a while loop to loop through the file, because initially we don't know how many lines there will be in any of these files, we just use a while loop until we are at the end of the file. All right, someone put their answer in the chat; let me actually copy your answer over. Let me see. So this is your answer; I'm just going to format it. So you keep a line number, then you open the file, then you use the magic incantation to go through it. Then you say: amount is lengths of the strsplit. Is lengths actually an existing function in R? Shouldn't it be length instead of lengths? But I'm just going to format it, and then we can close. Okay, so you use the sprintf function to build your line before you cat it. That's okay. But the thing is, what you see here is a very common mistake in programming: you define a variable, and then you use the variable directly, only once.
So if you define a variable and then use it directly, that means it's not really needed: you could just copy-paste this whole expression in here, and now your code becomes one line shorter. So that means one line of code can be saved. But it's very common, and I like the fact that you're being explicit about what you are doing. So you count the amount by taking the length after splitting, with fixed = TRUE; you don't have to use fixed = TRUE when you split by a space, but it's okay to set it. What fixed = TRUE does: normally, when you use strsplit, the split pattern is a regular expression, so you can match things like a dot, which in a regular expression means any character, and a star, which means zero or more repetitions. fixed = TRUE means that you want to match the pattern verbatim, exactly, so that a dot is not interpreted as a wildcard for any character. But no, I think the approach is perfectly fine. I like the sprintf; it's a little bit old-fashioned, because you could just call cat directly. And here you do %i, %i, which is okay, but in R you would not really use sprintf; the function is there for people who come from programming in C, and I generally tend not to use it. And I think the lengths here should be length. But I'm remembering the lengths; did I ask you to remember them? No, I didn't ask you to remember the lengths. Let me actually point this at the proper file path and see if your approach works, just so that it loads in correctly. All right, so let's just do it like this, and then I'm going to throw it into the R window and see what it looks like. Oh, I failed to do the formatting correctly.
So I forgot to format a single line. Interesting: so lengths also works. I never knew that lengths actually was a built-in function: get the lengths of the elements of a list. Yeah, perfectly fine. It worked perfectly fine; you get the exact number of words. It's good, it's good; I have no complaints about it working. All right, so the way that I did it is probably very similar. I actually remembered the lengths; I didn't have to do that, but generally, when you go through a file and you want to compute some statistics, then you also want to remember the statistics, because you want to do something with them later on. Of course, in this case these two statements are completely meaningless. So if I wanted to make my code as short as possible, this would be my approach: I say the line number that I'm on is number one, make a pointer to the file, go through it until we are at the end, do the strsplit of the line by space, and then ask for the length of the first element. And this is because strsplit is a function which allows you to split not only one line, but also five or ten lines in the same go. If you split five lines, it will give you back a list, where the first element of the list is line number one split up, the second element of the list is line number two split up, and so on. I think that's where the lengths function comes in, in your case, to deal with that. And then I do the same thing; I just don't use the sprintf function. I just cat the line number and line.length, which I figured out using the length function, I put a newline at the back, and then I move to the next line. Of course, this should give the exact same answer as yours, but just for completeness' sake, let's run it in R. And you see that it works exactly the same: we get the same number of words on each line.
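The word-count version of the loop can be sketched like this; the filename is again an assumption:

```r
# Sketch: count words per line with strsplit and report with cat.
t.file <- file("lorem_ipsum.txt", "r")

line.n <- 1
while (length(line <- readLines(t.file, n = 1)) > 0) {
  # strsplit returns a list (one element per input string), so take [[1]]
  line.length <- length(strsplit(line, " ", fixed = TRUE)[[1]])
  cat("line", line.n, "contains", line.length, "words\n")
  line.n <- line.n + 1
}
close(t.file)
```

The `[[1]]` is the key detail: because only one line is split at a time, the first (and only) list element holds that line's words, and length() then counts them. The lengths() variant from the chat answer gets to the same number by taking the element lengths of the whole list.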
And then of course, don't forget to close the file at the end. So, anything else? I don't think so. It works, and if it works, it works. All right, so this next part will probably fail, since I updated my R. Let me see if I actually have biomaRt installed. Is it installed, or should I install it? I don't really want to install it live. Okay, good, biomaRt is there. You can see that this package is relatively big; it takes some time to load it in. Not just that, but it needs an internet connection: because biomaRt queries external databases, you have to make sure that you're connected to the internet. So you can't use it when you're in a train in Germany, because trains in Germany generally don't have internet; well, unless you're on an ICE, because those do. It's just a limitation. All right, so the first thing is loading it and then listing all of the available marts, so all of the available data providers. Let's just do that. So we list all of the marts, and then we see that there are four standard marts available. You can go to the Ensembl gene database; there are the mouse strains; there's the Ensembl variation database, which contains single nucleotide polymorphisms and small deletions; and then we have the Ensembl regulation database, which contains regulatory elements, and allows you to query things like: does this gene have a promoter which is able to bind estrogen, or something like that. But for the assignment, it was: connect to the SNP database for mouse. The SNP database for mouse is the variation database, and you can see it in the name, because it's called ENSEMBL_MART_SNP. So first things first, let's just connect to the SNP mart. Oh wait, I have to show you my window. So let's first just connect to the SNP mart, right? Because we don't yet know which data sets there will be.
This SNP mart will have SNPs for mouse, but also for humans and cows and goats and all kinds of different species. So when we go to R, we can connect to this ENSEMBL_MART_SNP. And now, of course, we need to know which data sets there are, so we can say listDatasets on this SNP mart, and then it will give back a list of all available species. So you can see that, hey, you can ask for SNPs in cows and goats and dogs and zebrafish and horses; and we want to have mouse. When we look through the list, mouse should be at the M, and mouse is of course called Mus musculus, so you can use that to connect. Then we can update our statement: instead of just connecting to the SNP database, we connect to the mouse data set within it, mmusculus_snp. Is it still called mmusculus_snp? Yes, still the same. All right, so let's connect to the database and make sure that that works. Let's go to R and just connect to the mouse database. This will take a little bit of time, because it has to communicate. All right, so it connected successfully. And now we can do something like listAttributes, if we want to know which things we can query for, because the attributes are the things that we can request from the database that we are connected to. And then we see that there are literally like 66 different things that we can ask from this database. So we can ask, for example, what the ID of the variant is, where the variant comes from, which chromosome it is located on, what the alleles are. So that's a lot of things that we can get there. All right: load in the SNP IDs.txt file. Let me see; in my case, I just read it in as a table.
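The connection steps can be sketched as below; this needs an internet connection and the Bioconductor biomaRt package, and the dataset name mmusculus_snp is the one used in the lecture, current names may differ:

```r
# Sketch of the biomaRt connection workflow from the lecture.
library(biomaRt)

listMarts()                               # the four standard marts (data providers)
snp.mart <- useMart("ENSEMBL_MART_SNP")   # connect to the variation mart
head(listDatasets(snp.mart))              # one data set per species

# Narrow the connection down to mouse:
mouse.snp <- useMart("ENSEMBL_MART_SNP", dataset = "mmusculus_snp")
head(listAttributes(mouse.snp))           # what we can ask this database for
```

Connecting first without a dataset and then calling listDatasets is the usual way to discover the exact species name before committing to it.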
I take the first column and I force these to be characters, because R tends to load characters in as factors; factors just take less memory in R. Just loading in the file then gives me the list of SNP IDs that I can use. I forgot to switch you guys back to R. So if I look at the SNP ID variable, I see that I now have a single vector containing all of the different SNP IDs. Of course, I could just read in the table, and then you see that the table has a single column. But I don't want a column, I want a vector: a matrix with one column is really just a vector turned on its side. So I select the first column using the matrix selection operator, square brackets comma one, so: give me all the rows, first column. Then I say as.character to make sure they really are characters and not factors, which R has a propensity for turning characters into. In this case I just want them as characters. All right, so now I have my SNP ID vector, and now we have to query biomaRt for the SNP IDs. And what do we want to retrieve? Well, we want to retrieve the ID itself, too. This is because biomaRt does not automatically give things back in the order that you provide them. The database is sharded, which means it lives on different servers, on different continents even. So it will do the query, but it does not guarantee the order. So every time I query biomaRt, I use the SNP filter, but I also ask for the very thing I query on to be given back as one of the columns. Furthermore, I also want the allele, the chromosome name and the chromosome start. Since these are SNPs, the start position is the same as the end position.
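Reading the IDs into a plain character vector might look like this (the filename snpids.txt and the two example rs IDs are hypothetical; the point is the `[, 1]` selection plus as.character()):

```r
# Fabricate a tiny example file (hypothetical IDs) so the sketch is self-contained
writeLines(c("rs3021970", "rs13475701"), "snpids.txt")

# Read the one-column file; stringsAsFactors guards against factor conversion
snp_table <- read.table("snpids.txt", header = FALSE, stringsAsFactors = FALSE)

# All rows, first column, forced to plain characters
snp_ids <- as.character(snp_table[, 1])

str(snp_ids)  # a character vector, one SNP ID per element
```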
If you queried for small insertions or deletions, like we saw in one of the VCF files before, then of course you would also have to get the end position. All right, so we say getBM; then: these are the attributes I want to retrieve; the filter is the SNP filter, so the values I'm giving are SNP IDs; the values themselves are stored in the variable I just loaded in; and of course I specify that I want this from my database connection. When we do this, we end up with something that looks like this. Let me copy paste it in; it will take a little while because it's an online call. Well, that was actually really fast. So when I take the head of the result, the first 10 lines, you can see that I get back the ID, and I get the allele. So this SNP is an A in the reference genome, but some animals have been found that don't have an A here but a C. And the chromosome it's located on is chromosome 1, at around 3.1 megabases. All right, I hope that's clear. So it allows you to query SNPs, but you can also query things like genes, or which promoters or which transcription factor binding sites are in front of a gene. So BioMart is very flexible. And the four default marts are not everything, because look at this connection function again: here we said listMarts, but we don't have to, because the useMart function can also take a host. So we're not limited to connecting to Ensembl; we can connect to other BioMart servers as well. All right, then we have the long-running analysis. Just a question to you guys: was there anyone who finished question number four? And it's not bad if you say no; it's perfectly fine to not have done all the assignments. All right, so this one, no, and that's perfectly fine.
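The query itself might look roughly like this (networked; the attribute names refsnp_id, allele, chr_name and chrom_start and the filter name snp_filter are as they appeared at the time, so check them against listAttributes() and listFilters() for the current release):

```r
library(biomaRt)

snp_mart <- useMart("ENSEMBL_MART_SNP", dataset = "mmusculus_snp")
snp_ids  <- c("rs3021970", "rs13475701")  # normally the vector read from file

# Ask for the ID back as a column, because biomaRt does not preserve input order
res <- getBM(attributes = c("refsnp_id", "allele", "chr_name", "chrom_start"),
             filters    = "snp_filter",
             values     = snp_ids,
             mart       = snp_mart)

head(res, 10)
```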
The idea is just that you have something to practice on. For today, I also uploaded an additional PDF on Moodle with a whole bunch of additional assignments, like 20 or 30 of them. So: nope, nope. Anyone more? Anyone who did? If everyone's saying no, it's better to ask it in a positive way: did anyone finish? Thank you for following; I thought you were already following. Okay, someone writes: "I started, but got stuck." Ben0014: "I think so." How do you mean, you think you finished or you know you finished? There's a difference between those. But yeah, the assignments are there for you guys to practice, and if you get stuck somewhere, that's perfectly fine. This one is one of the hardest: this big data computation that you can resume at any point in time is something that I only learned like eight years into programming. But I wanted to show it to you at least here at the beginning, when we talk about data. "I think I ended up with the right result, but not sure." All right, we'll see how my answer compares to your answer then. But I only learned this six to eight years after starting to program in R: that you can write your answers to a file as you go, so that if you pull the plug on your computer, the results so far are still stuck in that file, and you can resume from however many things are already in there. So I'm going to try to explain what I did. The first thing I do is set my seed, because I want this answer to be repeatable. Then I generate a massive matrix with 10 million random numbers: 10,000 rows by 1,000 columns. So this is a pretty big data set. First, let's run this and see how it looks in R.
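The setup might be sketched like this (the seed value and the variable name bigdata are my own choices for the sketch; any fixed seed makes the run repeatable):

```r
# Fix the random seed so the "analysis" is repeatable
set.seed(42)

# 10 million random numbers: 10,000 rows by 1,000 columns
bigdata <- matrix(rnorm(10000 * 1000), nrow = 10000, ncol = 1000)

dim(bigdata)  # 10000 rows, 1000 columns
```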
So we go to R and generate this big data matrix, and then we inspect it; let me just show you 10 by 10, because there are 1,000 columns. You see that this matrix is filled with random numbers; that's what we want. All right, so then the next step is: make a file where you store your temporary results. Oh, crap, I didn't show you the Notepad++ window. So this is where I want my results to be stored: I cat nothing into a file called temp.txt. Of course, when I rerun the analysis after quitting it halfway through, I'm not going to run this statement; it's a one-time thing. It clears the output: it makes sure that this file exists and that this file is empty. When I want to continue my analysis, I'm not going to empty the file, because then I'd be throwing away the results that are already in it. So the next step is: if this file temp.txt exists, and of course I just created it, so it should exist, but I'm wrapping this in a tryCatch just to be sure, read it into the variable temp. The tryCatch statement makes sure that when there is an error, it does what I want it to do instead. So I read.table the temp.txt file, saying there's no header, the separator is tab, and the row names are in column number one. And if there is an error, I return an empty matrix: a matrix built from a single zero, but with no rows and no columns. This is kind of a shim: just like we sometimes assign an empty vector to a variable, this assigns an empty matrix to the variable when there is an error. All right, let's first do this: go to Notepad++ and paste it in.
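These two steps might be sketched like this (temp.txt is the scratch-file name used in the lecture; the zero-by-zero matrix is the fallback when the file is empty or unreadable):

```r
# One-time step: create an empty results file.
# Comment this line out when resuming a half-finished run!
cat("", file = "temp.txt")

# Load whatever results are already there; on error (e.g. an empty file),
# fall back to a matrix with zero rows and zero columns
temp <- tryCatch(
  read.table("temp.txt", header = FALSE, sep = "\t", row.names = 1),
  error = function(e) matrix(0, nrow = 0, ncol = 0)
)

nrow(temp)  # 0 on a fresh run, otherwise the number of columns already done
```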
So now when I look at temp, it should be a matrix that is completely empty: no rows, no columns. All right, good. So now we can start doing our analysis. I'm going to use this structure: for x in max(1, nrow(temp) + 1) up to the number of columns. If there are no rows in temp, this just starts at one. But imagine I've already done part of the analysis and already wrote, say, 100 results in there; then of course I don't want to start at one, I want to start at 100 plus one, the next computation. In the end I want to do all of the columns of this matrix: the correlation of column one versus all of the other ones, column two versus all of the other ones, and so on. So inside the loop I calculate the correlation of the column of the big data that I'm currently at, column x, against the whole big data set. And I say use = "pair". I always type use = "pair" when I do a correlation; it's for when you have missing values. Of course, we don't have missing values here, so I could have left it out, but I do it automatically, just to make sure that missing values never cause any errors. In the next step, I paste the correlations I just computed onto the end of the temporary file. And then I notify myself that I just did a correlation. All right, let's see how this works. The first time, I can just copy paste the whole thing in. So: set my seed, generate this matrix, empty my output file, load the output file if it exists, return an empty matrix if it doesn't, and then go through all of the columns and compute the correlation of the current column versus all of the other columns. So let's go to R.
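Put together, the resumable loop might look like this, scaled down to 200 by 20 so the sketch runs in seconds; the structure, max(1, nrow(temp) + 1) plus appending one line per column, is what the lecture describes:

```r
set.seed(42)
bigdata <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)

cat("", file = "temp.txt")  # one-time: clear the results file; skip when resuming

temp <- tryCatch(
  read.table("temp.txt", header = FALSE, sep = "\t", row.names = 1),
  error = function(e) matrix(0, nrow = 0, ncol = 0)
)

# Start after the last finished column (or at 1 on a fresh run)
for (x in max(1, nrow(temp) + 1):ncol(bigdata)) {
  # Correlate column x against every column; use = "pair" tolerates NAs
  cors <- cor(bigdata[, x], bigdata, use = "pair")
  # Append one tab-separated line: the column index, then all correlations
  cat(paste(c(x, cors), collapse = "\t"), "\n", file = "temp.txt",
      append = TRUE, sep = "")
  message("Done with column ", x)
}

# Reading the scratch file back shows one saved row per finished column
finished <- read.table("temp.txt", header = FALSE, sep = "\t", row.names = 1)
nrow(finished)
```

If the loop is interrupted, rerunning everything except the `cat("", ...)` line picks up at the first unfinished column.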
And then I'm going to show you how this works. Okay, one of you has a question, so let me switch back. What does this do? This selects column number x; it's just selecting from a matrix. bigdata is a matrix, and the square brackets mean: select from the matrix. Before the comma I'm selecting rows, and there's nothing there, which means all of the rows; and after the comma, x means column number x. Is that clear? It's just selecting from a matrix, like saying matrix one to ten. Okay. All right, so let's copy paste this all in and make sure it starts computing. So I'm calculating my matrix, and now it's computing, and it has to do this a thousand times. You can see that it's running, but it'll take some time. So imagine now that I get a call: someone you know got into a car accident, come to the hospital now. Of course, I could just run out and leave my computer on, but the thing might catch fire. So what I do is just hit stop, and then I can shut down my computer and run out. But then I come back, and now I want to continue. It has already done 204, and I've already waited some time for those, so I just want to continue from 205. So what we're going to do is go to Notepad++ and take the exact same code, but now we're not going to clear the output file: we put a hash in front of that line, because I already computed something. Again, I'm just going to do the same thing: select the whole thing, and if I'm correct and I wrote the code correctly, it should start from 205, because we already did 204 elements. So I'm going to copy paste it in, not all at once, but part by part. Normally, of course, I would have closed R and opened it up again, but I'm just going to leave it like this.
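That selection syntax can be checked on a small matrix:

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

m[, 2]  # nothing before the comma = all rows; 2 after it = column 2
m[1, ]  # the other way around: row 1, all columns
```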
So the first thing I'm doing is generating this matrix again, or loading the matrix: normally you would load your input data from a file, not generate random numbers, but in this case I need to generate my input data again. The next step is to load in the results that I already had, into temp. So now, when I ask for the number of rows of temp, how many did I already do, it says 205. So sneakily, it already did one more than it told me. It didn't inform me here, because when I clicked the stop button, it stopped between writing result 205 to the file and printing that number 205 was done. So it did already write it to the file; it just didn't print that it was done with that line, because that happens a couple of milliseconds later. So we're going to start at number 206, which is perfectly fine. All right. Now I select my loop again, paste it in, and we just continue from column 206. And that's what this nrow(temp) + 1 does: it looks at the number of rows I have already computed and starts at the next one. So when we copy paste this in, it just continues, and we can finish the analysis. That was the idea behind the assignment. And this is a very common structure: it's very commonly used when you have a computation that takes a couple of hours, and there are many things that take hours and hours and hours, and you don't want a single power cut to destroy 24 hours of computation. And of course, I can do the same thing again: I can stop it now, and then we do the same thing again.
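The resume arithmetic can be checked on its own:

```r
# Fresh run: no saved results yet
temp_empty <- matrix(0, nrow = 0, ncol = 0)
max(1, nrow(temp_empty) + 1)    # start at column 1

# Interrupted run: 205 columns already written to the scratch file
temp_partial <- matrix(0, nrow = 205, ncol = 1000)
max(1, nrow(temp_partial) + 1)  # resume at column 206
```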
And then we can continue the analysis again: generate the matrix, load the temp file in, and continue from number 406. All right, a question about the start of the loop: does this also work with for (col in 1:ncol(randommatrix)), or do you need to include the nrow? Well, if you do it like you're proposing, it will always start at one, so it will redo the ones that you already did. The magic happens in the max(1, nrow(temp) + 1): the maximum of one and, if I already have results, the number of rows I already have plus one. If you just wrote for (col in 1:ncol(randommatrix)), the problem is that you would redo the first 400 or so if you were at 400. Okay, good. So this was really, really hard, and trust me, this is something you're not going to use on your own data any time during your master's. But it may happen that during your PhD you have code that does, say, image processing: imagine you're doing observations of fish with an underwater camera, so you have 24 hours of video material, R is processing it for you, and the processing takes half a month. Then you don't want to depend on the computer staying on for half a month; you want to be able to continue at any point in time. All right, good. The next one... we should actually take a break. So I'm just not going to talk through it; I'm just going to show you the image. So I'll show you answer five, and then you can read about it on Moodle. I will put the answer on Moodle, and if you have any questions, just send me an email. I just wanted to show you guys, because I was really proud of finding the Obama thingy. I'm going to copy paste it in to show you how it looks: I read in a BMP file, process it, and then use the image function to redraw the image.
This is actually not Obama; this is the Angry Birds thing. What I'm doing here, very quickly, is loading in the image file, removing the header, and getting the three different color channels by indexing with the seq function again, like the odd/even trick. It's 200 by 200; that's just the size of the picture. And then I make an image from the red channel using topographical colors. So I'm using different colors, but this is just the red channel. All right, good. So those were the assignments for today. Of course, we can do the other channels as well: we can look at the green channel, use the topo colors, and then it looks a little bit different, because now everything that is green in the original image gets a topographical color here. All right, perfect. So I'm going to stop the recording for this.
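A rough sketch of that channel-splitting idea (assuming a 24-bit uncompressed BMP with a 54-byte header and interleaved B,G,R bytes; for 200x200 each row is 600 bytes, a multiple of 4, so there is no row padding, and row order/orientation details are ignored here; the filename is hypothetical, and the sketch fabricates a fake file first so it is self-contained):

```r
w <- 200; h <- 200

# Fabricate a stand-in "BMP": a dummy 54-byte header plus random pixel bytes.
# With a real file you would skip this step and just read angrybird.bmp.
fake <- c(as.raw(rep(0, 54)),
          as.raw(sample(0:255, w * h * 3, replace = TRUE)))
writeBin(fake, "angrybird.bmp")

# Read every byte, then drop the 54-byte header
bytes  <- readBin("angrybird.bmp", what = "raw", n = file.size("angrybird.bmp"))
pixels <- as.integer(bytes[-(1:54)])

# The channels are interleaved B,G,R, so take every third byte for each
blue  <- matrix(pixels[seq(1, length(pixels), by = 3)], nrow = w)
green <- matrix(pixels[seq(2, length(pixels), by = 3)], nrow = w)
red   <- matrix(pixels[seq(3, length(pixels), by = 3)], nrow = w)

# Draw one channel with topographical colors
image(red, col = topo.colors(256))
```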