I think we should be live. Can anyone hear me? If not, then that's a shame. If people can hear me, then just throw it in chat. Just one person saying I can hear you is more than enough. And then we can switch to the video check. So this is your audio check, one, two, three. All right, so that seems to be okay. Let me check my settings. We're on low latency. We have DVR enabled so that people can pause and continue. I'm still getting the weird stream health suggestion. I don't know what's going on there. Like "the audio stream's current bitrate zero is lower than the recommended bitrate". I have no idea what's causing that. All right, so my moderator hears me as well, so then we can switch to the lecture layout so that people can see me as well. So welcome, welcome. I hope everyone is having beautiful weather. Don't set it to sleep. Yeah, so I, whoo, that seems a little bit loud. I just went all the way into the red area on my audio. So yeah, I hope everyone's having beautiful weather. The weather here is great. It has been for the last couple of days. "Nice shirt." Yes, thank you. It's the new collection and I'm very happy with it. It's not as nice as the pink salmon-like shirt, but at least I got some compliments from my colleagues. Like, you got a new shirt. So it's always good that people notice that you spent a whole bunch of money on getting new shirts and new outfits. All right, so we have around three minutes left. So I'm just going to see if I can talk for three minutes and entertain you guys. How were the assignments from last week? Were they doable? Were they too hard? I think the ones where you have to read in the BMP file are difficult because they kind of bring everything together that we had, right? So we did for loops, we did while loops, but also we did things like seq and rep. So I'm hoping that they were okay and that people are actually able to do the assignments.
I got only a handful of questions about the assignment, so that's a little bit of a shame, but that's up to you guys. Like if you do the assignments and you finish them and you have no questions, then there's no need to email me. I got one question that was interesting, so I will kind of hook into that when we do the assignments. And of course we have to decide how we want to do them because we can do them like we did last week. So me just typing them in and showing you guys my thought pattern on why I'm typing in certain stuff or we can just go through them relatively quickly. But then it will be a very short stream. I checked and I have 40 slides, I think. So not a lot of slides this time. I'm just hoping that we can fill enough time, but there's some other cool stuff that I wanted to show you guys as well. Because with the BMP stuff, you can actually do really interesting things. And I hope that I have some time to show you guys because I just like doing weird stuff in R that people generally don't do. But this one is pretty interesting that you can actually make pictures move and these kinds of stuff. So it was not an assignment per se, but it's something that you can do with the assignment. All right, one more minute to go and then we can actually start with the lecture. I'm seeing six viewers, so that means that we lost more than 80% of the people who signed up for the course or people are watching it back later, which is not an issue. It's just more fun when people are here and asking questions. So I'm hoping that some more of the students will show up. If not, then that's their loss, right? Like I'm getting paid anyway, students or not. The nice thing is no students means I get paid exactly the same amount and I can do whatever I want, which is also really, really nice. Good, so with that out of the way, my clock shows it's two, so we can start. So let me switch the slide. I hope everyone likes the drawings. 
Like I spend an insane amount of time making the drawings. It is really hard, it is really hard doing these kinds of drawings. They look very simple, but because I'm doing them in PowerPoint, it's actually pretty difficult to do them properly, because PowerPoint has this weird feature that if you use the same color again, it starts merging things together and tries to be smart. But I hope you guys like it. So the sun is still there. It moves every time; every lecture has a different position. That's just the way it is. All right, so the overview for today. First we'll do the answers to assignment number three. Last week we had a question about the API stuff. There was also the question: don't you have a drawing pad and pen? Yes, I do, but since I'm doing it in PowerPoint and not using a professional program for it, it's difficult. And I'm not the best artist to start off with. And you have to have an idea, right? So this time I did a normal distribution and some other little things, but they do take longer than I had expected. I had expected it to be relatively quick, but it's not as easy as you think. Just drawing a straight line on a drawing board is really difficult because you're not looking at the thing, right? You're looking at the screen to see what's happening. So there's this disconnect, because normally if you would draw, you would draw on paper and you would look at your pen, but that's not possible here. So yeah, about the question from last week: there were some people that said, we don't do biology, so we can't do anything with this Biomart API thing. So I decided to make a small little example of how you can use a different type of API. I decided that I would use the API from the CBS, the Centraal Bureau voor de Statistiek, which is kind of the Dutch equivalent of Destatis, the German agency for statistics. And that contains all kinds of information about murders and deaths and crime and population density and all of these things.
So I thought it would be just fun to look into that data. There's a lot in there, right? Every statistical fact that's being collected by either the Statistisches Bundesamt or by the Dutch equivalent goes into these kinds of databases. And the thing which is a real shame, right, because Biomart is known for harmonizing the interfaces between all of these different databases, is that that doesn't seem to happen in social sciences slash political sciences. Every country has their own database, and for example in Germany it's even worse, because every Bundesland has their own database which then feeds into the central database. But I found an example, and I think that it's going to be interesting for the people that don't do biological sciences but want to use R to do other stuff. So that's it for more or less the first part. I think that we will take around an hour on the assignments. The API thing will take around 10, 15 minutes and then we will just start with the lecture. So I added some little things to the lecture which were not there last year. I hope that that's okay, that we don't run out of time. But I kept the number of slides relatively low, so that should be okay. So we're going to talk about descriptive univariate statistics, and that's the topic for today. We will talk about central tendencies, like how to calculate a mean, a median and a mode, and then we will talk about dispersion measurements. So if I have a distribution, how do I find the range? What are quantiles? What is variance, standard deviation? And I wanted to say some things about outliers, because especially when dealing with biological data, outliers are very common. And then of course other measurements, like the shape. So a normal distribution can have a certain skew and it can also have a certain kurtosis, which is interesting to look at because they influence statistics when you do statistical tests.
But we will not be doing any statistical tests. We will just be talking about the most basic parts of statistics, right? So describing a distribution, calculating all kinds of numbers that pin our distribution down. And I wanted to show you some examples of basic plotting routines, because if you're like me and you're a visual person, then you can't really imagine what a normal distribution looks like which has a mean of 12 and a standard deviation of three. But when you make a plot, then it's very easy to see what's happening. So we will be talking about how to make box plots. I added a slide about violin plots, which I prefer over box plots, and you can make them in R as well, so why not do violin plots as well? And we will of course be talking about things like histograms and the image and the heat map. And we will continue on this next week, because next week we will have "plots, plots, plots", that's the name of the lecture. That whole lecture will just be about creating nice plots, plots that are suitable for publication. All right, so with that out of the way, let's start with the assignments from last week. So, the answers for lecture three. Throw in chat how you want to do it: if you want me to program it live, or if you just want me to show you the answers and go through them one by one. Because both are an option. I programmed the assignments years and years ago, so it will be fun to see, when I read my assignments, if I can come up with the answers myself, which of course should be the case, but of course it might not be the case. Let me switch at least to the R window. So that is clean, that is perfectly fine. Let me switch to Notepad++. Okay, so this is the very fun example that we will do at the end, which is moving Obama in the plot window, and that's what we're going to do afterwards, just to have a little bit of fun with stuff that we can do.
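If you want to play along at home, here is a small sketch of the kind of thing we'll be computing and plotting today. The numbers are just simulated, nothing from the course data:

```r
# Simulate a normal distribution with mean 12 and standard deviation 3,
# then describe it with the central tendency and dispersion measures
# covered in this lecture.
set.seed(42)                       # make the random draws reproducible
x <- rnorm(5000, mean = 12, sd = 3)

mean(x)                            # central tendency
median(x)
range(x)                           # dispersion
quantile(x, c(0.25, 0.5, 0.75))
var(x)
sd(x)

# A histogram and a boxplot make the shape immediately visible
hist(x, breaks = 40, main = "N(12, 3)", xlab = "value")
boxplot(x, horizontal = TRUE)
```

With 5000 draws, the sample mean and standard deviation land very close to the 12 and 3 we asked for, which is exactly the point of plotting: you can check by eye that the simulation matches the description.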
So, no suggestions in chat on whether we want to do it live or just look at the answers. I think I will just start a new script and we will just start doing the answers like this. So, answers, lecture number three. It's not lecture, it's assignment number three. Let me move my window a little bit because it's on top of the chat box now. "I would prefer live." Yeah, I would prefer live as well, right? Then I can do something instead of just copy pasting stuff into R and showing you guys what happens. Welcome by the way, Leonardo. So yeah, first things first, this time we had some data to load in, right? So if you have data to load in, you first have to move somewhere on your hard drive. In my case, I put everything on my OneDrive. So let me just copy that. So this is my working directory, right? And then I have to save it, and I'm going to save this as answers3_live.R, and if I give it the R extension, in Notepad++ you can also see that it will do the code highlighting. So of course, let's add a correct header, right? So it is copyrighted by me, I work at the HU Berlin, and today is 12-5-2022, and that's what we're going to do, right? And this is first written at this time. All right, we don't need to duplicate that, I just want to copy it. So, just very basically moving to where I extracted the zip file, because from Moodle you could get the zip file, and there's a couple of files in there that we want to load in. All right, so we already did zero A, which is just downloading the data file, and zero B is unzipping the data. I hope everyone was able to do that in Windows. It's relatively easy: just right click and say extract here, press next, next, next, finish, and then it will do that. All right, so question number one is: read in the different text data sets (TXT, FASTA) using the read.table or read.csv function. So let me actually show you guys the data files, right?
Because that's the first thing that you need to do: if you want to load in data, then you have to of course look at the file first. So let's open up the first one. So this is data one. What we see here is that it seems to be a matrix, which is good, so we can use the read.table function. We see that we have Ensembl gene ID, chromosome, start positions and stop positions. So very much similar to the data file that we discussed before, right? Before, we discussed that for a data file like this we have to make sure that all of the different columns have the proper types, right? Because of course a chromosome is not a character. It is a factor, statistically speaking. And the same thing holds for the strand, because the strand only has two options, while for example chromosomes have multiple options, but they are options, right? You can't be chromosome 3.5. So we have to make sure that we load in the data in a correct way. All right, so let's go back. This is called lecture three data one, and it's actually in a sub-folder because it is assignment three. So let's just move into the assignment three data folder; that's the folder they're in. Let's make sure that R doesn't complain about it. So we can go to R. Yes, so that makes perfect sense, that all works. So the first file is called lecture three data one dot txt. So that's the file name. And first things first, I'm just going to put a read.csv around it even though it's not a comma separated file, right? Because from Notepad++, if you look at the file, you'll see these little arrows. These little arrows mean that this is a tab character. So we are going to directly say that our separator in this case is backslash t, the tab character. One thing that we see more is that here the first and the second column are quoted and then we see numbers, while in the first row everything is quoted, right? So that means that the first row is our header. So we are also directly going to say that.
So, header equals TRUE. And then I'm going to say data one and just store it in there. All right, we're going to save it, copy it, then we're going to go to R and we're just going to paste in the whole thing. All right, so now I want to look at data one, right? And I always do it like this: I just do one to 10, one to 10 — undefined columns selected — and it seems to be loaded in correctly. The question is, are all of the classes okay? Because we want to, for example, have chromosome name be a factor. So we can check that, right? We can ask the class of the whole object; that should be data frame, because not every column has the same class. So it's not a matrix, it's a data frame. So we are just going to say: well, I want to apply to the columns the class function, and then it should tell me what the class of each column is. So it says the Ensembl gene ID is a character, character, character, character, character. And that's of course not correct, right? Because that would mean that if I would take the third column out, start position, then it would not allow me to do start position plus one or something, because hey, it loaded it in as characters. So does it actually do that? No, it actually read it in as an integer, which is okay. So why does it say character, character, character then? I don't know, might be a little thingy in R. But let's make sure that we actually have every column loaded in correctly. So we go back, right? So we look at our data file, and then we say comma colClasses is, and then we have to specify our column classes, and I'm going to move to the next line. So the first column is the name, which is of course just a character. Then the second column that we had is the chromosome, which is of course a factor. Then the third one is a numeric, and the fourth one is also numeric.
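By the way, the "little thingy" has an explanation: apply first coerces the data frame to a matrix, and a matrix can only hold a single type, so mixed columns all collapse to character. sapply inspects each column without that coercion. A tiny made-up data frame shows the difference:

```r
# apply() coerces a data frame to a matrix first, and a matrix holds
# one type only -- so mixed columns all report "character".
# sapply() checks each column as-is.
df <- data.frame(id    = c("g1", "g2", "g3"),
                 start = c(100L, 250L, 900L),
                 chrom = factor(c("1", "2", "X")))

apply(df, 2, class)    # every column reports "character"
sapply(df, class)      # the real per-column classes
```

So if you want per-column classes of a data frame, sapply (or lapply) is the tool, not apply.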
And of course this is really annoying, but it's better to make sure that we tell R how we want to have our data loaded. Then we have our strand, so strand is again a factor. And then we continue on the next one. So let me just move this a little bit, like so. So I'm going to say colClasses is, and I'm going to put this one on the next line just so that it's okay. And then this is closing the read.csv function. And then the last couple of columns, I think, are all — so this is the MGI ID, which is numeric, or no, a character, another character and another character. So everything until the end is character, so there's three characters. So I'm just going to say rep character, and then we're going to repeat this three times until the end, and we're going to close the bracket of the c, and then I'm going to check that this one closes the read.csv function. So that seems to be all perfectly fine. It's a lot of additional work just to make sure that the types are correct, but it should all be okay. All right, so let's load in the data file. So let's go back to R, issue the command — unused argument. That's true, because colClasses is written with a capital C. Let me fix that in the script quite quickly. And now it should work, right? So the colClasses here has a capital C. All right, so if we look at data one, one to 10, then of course it should still look the same, right? And now at least we should be able to say chromosome name, and now it directly tells me that there are levels. So that means that it indeed made a factor out of it. So that's really good, really good. So that's more or less the first one. All right, let's move on to the second data set. So the second data set is a little bit sneaky, because it's a FASTA data set, right? It has a name of a gene and then the sequence that belongs to this gene. So it's not really a matrix, right? It's more like a vector, right?
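Putting the colClasses idea together in one self-contained sketch — the file here is an invented stand-in for data one, written to a temp file so the example runs anywhere:

```r
# Write a small tab-separated file that mimics the structure of data one
# (the file name and all contents are invented for this example).
tsv <- tempfile(fileext = ".txt")
writeLines(c("gene\tchrom\tstart\tend\tstrand\tdesc1\tdesc2\tdesc3",
             "ENSMUSG01\t1\t100\t500\t+\ta\tb\tc",
             "ENSMUSG02\tX\t900\t1500\t-\td\te\tf"), tsv)

# Spell out every column type; rep() keeps the trailing run of
# character columns short to type.
d <- read.csv(tsv, sep = "\t", header = TRUE,
              colClasses = c("character", "factor", "numeric", "numeric",
                             "factor", rep("character", 3)))

sapply(d, class)      # chrom and strand are now factors
levels(d$chrom)       # the factor levels found in the file
```

Note the capital C in colClasses; with a lowercase c you get exactly the "unused argument" error from the stream.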
Because we have multiple sequences and each of these sequences has a name. So there's two things that we can do. We can just be dumb and use the read.csv function and say: well, we're going to deal with the fact that it's a one column matrix later. Or we can say: no, we're going to use the readLines function. So let's just use readLines, right? Because it's just lines of text. So we're just going to say readLines, and then I'm just going to copy paste this so that I don't have to type it. So this is data two, and this is data two dot fasta. All right, so we're just going to readLines, I'm going to say comma n equals minus one, read it to the end, and put it in data two, right? So it doesn't scroll past the screen. So let me go to R, see if that works. And then we have our data two, and this is how it looks. So there's one thing that I can still do to make it a little bit better, right? Because now the first element is the name, then you have the sequence, the name, the sequence. But of course, when we go back to Notepad++, we can say: well, I have here even and odd numbers, right? The even lines are the sequences and the odd lines are the names. So I can say sequences is the even elements of data two, right? We learned that in lecture number one, lecture number two: to get the even elements, I'm just going to select from data two. So I want to have a sequence from two to the length of data two, and I want to step by two, and these are the even numbers, right? And then using these even numbers, I can select the even lines from data two. So let me see if this works. So go back to R, paste it in, and now we should have sequences, which is just the different sequences. And of course, I can do a little bit better, because I can also add the names to it. So I can say that names of sequences is data two, and then I can just copy this thing and say: well, now we want to start at one, so we want to have the odd numbers.
So start at one, then step by two. So that's one, three, five, and so on. All right, let's copy, paste this into R and then we should be able to have sequences. So we now have our vectors. And now of course, if we want to have the names, we can use the names function to extract the names as well, right? And now they're named sequences. So this is much better because now I can ask for a certain name and then get the sequence back. All right, so that's data number two. All right, let's open up data number three. So data number three is a VCF file. So VCF files are variant call format files and they are used a lot in bioinformatics, right? So they are these, what does the N equals one in read lines do again? So the N is the number of lines that I want to read in and it's not N equals one, it's N equals minus one. So I'm telling R, read everything until the end of the file, right? So don't, I'm just minus one just means everything. So if I would say N equals 10, then it would load in the first 10 lines. If I would say N equals 20, it would load in the first 20 lines. And if I would say minus one, it will just load in the whole file. It's kind of like everything until the end. All right, so data three is VCF file. So first let's look at the VCF file, right? So we start off and it says file format VCF version 4.2. Then it has a whole bunch of information, more information and then all of a sudden we see something which looks like a table, right? So our table starts at line number 55 and the header of the table is in line 54. So there's one little issue with this file format is that it uses this hashtag character, right? So the hashtag character in R is the character for comments. So we have to make sure that we don't run into an issue. Of course we could just say, well, this is the header, let's just delete this, but that's not done, right? You should never touch raw data files. You should never edit them. 
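The whole readLines-plus-even/odd trick for the FASTA file, as one runnable sketch — the two-record FASTA content here is made up:

```r
# A two-record FASTA-like file, written inline so the example runs
# anywhere (names and sequences are invented).
fa <- tempfile(fileext = ".fasta")
writeLines(c(">geneA", "ACGTACGT", ">geneB", "TTGGCCAA"), fa)

data2 <- readLines(fa, n = -1)           # n = -1: read to end of file

# Odd lines are the names, even lines are the sequences
sequences        <- data2[seq(2, length(data2), by = 2)]
names(sequences) <- data2[seq(1, length(data2), by = 2)]

sequences[">geneB"]     # look a sequence up by its name
```

Once the vector is named, indexing by gene name replaces all the manual counting of lines.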
So don't start modifying raw data that you get, because this is data that comes from a program which reads in sequencing data and then, at each point in the genome, tries to determine if animals are reference or if they're alternative. So some animals will have a C at that position in the genome, other animals will have a G, and then all the way in the back we find the genotypes of the animals, but that's not really important. We just want to load in this file. So first things first, we need to skip 53 lines before we start loading in the data, because the first 53 lines are information about the file, like which command was used to create this file and what is the version of the reference genome that was used. So all of this data is useful, but for loading in the data itself it's something that is annoying, right? It shouldn't be there, but we can't just delete it, because then we lose this information. All right, so the thing that we're going to do is just say read.csv, because I like the read.csv function. Again, the separator is tab. We do have a header, right? So we want to say that the header is TRUE, and then we want to add the parameter, and the parameter that we want to add is comma skip 53 lines, right? So skip 53, all right. Then we go all the way here, and this is called data3.vcf. All right, so let's just copy paste it into R, see if R can actually figure it out for us and see if it loads it in correctly. I'm betting that it won't. So data3, one to 10, right? So let's just look at it — oh, data3, sorry. Looks pretty okay, looks pretty okay. You can see here that it made a mistake with the chromosome, right? It says X.CHROM, so that's not correct. We have position, ID, reference, alt. The rest seems to be loaded in correctly, and here of course we have this X5073, and this is because R does not allow column headers or column names to start with a number.
But of course we might want to force R to do that, right? So what we can do, to make sure that R doesn't get in our way, is say check.names, and I will put that to FALSE. So I'm going to say: do not be smart, just load in everything, load in the column and the row names as is. All right, let's go back to R, see if that fixes our little issues. Data3, one to 10. All right, so now we see that it's #CHROM, which is fine-ish, and now we see that indeed the X is gone from 5073, right? Because in the end this animal has the name 5073, it's not called X5073. So if we have multiple data files and the animal is always called the same, then of course we don't want R to be smart and in some cases put an X in front. And of course we want to get rid of this hashtag as well. So let's get rid of the hashtag, and how do we do that? We can just say colnames data3, right? The first one should be called CHROM, and that should fix our little error, all right? So make sure that we set that correctly. So now we have CHROM, POS, ID, REF, ALT. Perfect, so that works. And again, there's no row names, right? If the ID column could be used, then we would probably set the ID column to it. One of the things that I notice now is that actually here you see these dots, right? So these dots are actually missing values. So let's go back and tell R that dots are actually missing values. So I will say comma na.strings is c, and I'm going to say that the dot is a missing value. All right, let's copy paste the code again, go to R, and fix the chromosome, or fix the name. And now indeed we have our file loaded successfully, right? Because now it says the ID is NA, so missing, instead of a dot, because R doesn't know that a dot stands for a missing value until we tell it that we want to interpret dots as being missing values. All right, that's data number three. How much more do we have to go? We still have data four, all right?
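So the full recipe for data three — skip, check.names, na.strings, and the colnames fix — looks roughly like this on a miniature invented VCF (only two header lines here instead of 53):

```r
# A miniature VCF-style file: two '##' header lines to skip, a '#CHROM'
# column header, and '.' used for missing IDs (contents invented).
vcf <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.2",
             "##reference=example",
             "#CHROM\tPOS\tID\tREF\tALT",
             "1\t5073\t.\tA\tG",
             "2\t6001\trs42\tC\tT"), vcf)

d3 <- read.csv(vcf, sep = "\t", header = TRUE,
               skip = 2,                # jump over the ## header lines
               check.names = FALSE,     # keep '#CHROM' exactly as written
               na.strings = c("."))     # '.' means missing in VCF

colnames(d3)[1] <- "CHROM"              # drop the leading '#'
d3
```

With check.names left at its default, the same file would come back with an X.CHROM column, which is exactly the mangling we saw on stream.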
So let's open data four. All right, so this is a little bit of an interesting file, right? Because the separator is different: the separator is a semicolon. We have a header. Here we see one, two, three, four, five, six, seven, nine, 10, so we do have row names. So we want to tell R that the first column in the file contains the row names, and not that PVV4, the first data column, contains one, two, three, four, something, right? Because PVV4 starts one, one, two, one, two, two, instead of one, two, three, four, five, and all of the other columns start the same way as well. All right, so let's do that. So we take data number three, we just copy it, then we say data number four, and this is data four, and this has a txt extension. We know that the separator in this case is the semicolon, the header is TRUE, we don't have to skip because it starts directly, and I want to make sure that row names is one, right? Because the row names are in the first column. All right, so let's see how this looks, see if everything loads in correctly into R. So let's paste it in — and there needs to be a dot there. So let me fix that, it's row dot names. So now when we look at data four, one to 10, then that seems to be correct. So yeah, that's it. And of course we have NAs, and in this case the whole thing is numeric, right? There's only numeric values in these different columns. So in theory we could just make sure that R knows that, so we can say as dot matrix around data four, because I think, if we go to R, that the class of data four is a data frame. It doesn't have to be a data frame, because every column is the same type, so we can say that we want to have it as a matrix. So let's make sure that we do that and force it to be a matrix, right? Then we can also use functions which take a matrix, like principal component analysis. All right, so let's test that, see if that is really true.
So now if we look at data four, one to 10, it will still look the same, but now in theory when I ask the class of data four, it should tell me that it's a matrix or an array. And now of course I can do more with it, right? Because now it knows that the whole thing is numeric, every column is more or less equivalent to the other columns, so that's really good. Is there anything that we need to fix in the names? Let me go back, let me actually check the data file. So it's ARX minus one; if we go to R, the first column is ARX dot one, so we don't want that. We don't want R to eat up the names of the columns, because we might have another file which describes the ARX-1 thing, right? So we want R to not be smart and not eat up our column names. So here we again want to say check dot names is FALSE, just to make sure that it doesn't eat up the row names or the column names. All right, so let's copy paste, go to R, and now data four, let's just say one to 10, one to 10, and now it seems to be correct: ARX minus one, HH dot 335C minus goal. Let me check if that is correct. And yes, HH dot 335, yeah, minus. Good. And now the data in R looks exactly the same as the data that we have in our file. All right, data number five, making good progress. So data number five is a VCF file again, very similar to the previous one. This time it has names, so we could use this column as the row names, were it not for the fact that some of them have missing values, right? So the thing here is, I'm just going to do the same thing that I did before, and that means just copy paste this thing, right? And now say data five dot vcf, here we go, data five. We look at the file, how much nonsense is on top — or it's not nonsense, but it's not useful for loading it into R — so we want to skip 59 lines. So we're just going to say skip is 59.
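The row.names-plus-as.matrix recipe for data four, sketched on an invented semicolon-separated file:

```r
# A small ';'-separated numeric table with row names in the first
# column, mimicking data four (all names and values invented).
f4 <- tempfile(fileext = ".txt")
writeLines(c("s;PVV4;ARX-1;HH.335C-goal",
             "1;0.1;0.2;0.3",
             "2;1.1;1.2;1.3"), f4)

d4 <- read.csv(f4, sep = ";", header = TRUE,
               row.names = 1,          # first column holds the row names
               check.names = FALSE)    # keep 'ARX-1' instead of 'ARX.1'

d4 <- as.matrix(d4)                    # every column is numeric, so a
class(d4)                              # matrix is the natural container
```

Once it is a numeric matrix, matrix-only functions such as prcomp for principal component analysis will accept it directly.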
The na.strings is still the dot, because they're still using a dot as a missing value, and this should then work. And just make sure that we change the column name, because there's still a hashtag, I think, in front — yes, so #CHROM. That's not how we want it to be. So, data number five, and then we go to R, we load it in, and then we look at data five, look at the first 10 elements, and it looks fine, looks perfectly fine to me. Of course we could specify the column classes and then make sure that R knows exactly which class there is, but in this case I'm not going to bother with typing them all in. So that's how you load in data, right? Just look at the file, see what the separator is, see how much you need to skip on top, look at the header. And it's generally a process which takes a couple of tries, right? Because you have to do a first try, see if it works, then you have to do it again. Data number oh six, comma separated file, let me open it up for you guys. So this is how the file looks. So again, there's a header. It tells me that it's from the OMIM database. In this case we might actually have proper row names, right, because all of the row names are there and they seem to be unique. The missing value seems to be minus, minus, minus for some reason — yeah, so "---" seems to be the missing value in this file — and for the rest the separator is comma, as far as I can see. So right, so it has a header, it has row names, and for the rest it's just a very basic comma separated file. So we're going to say that the separator is comma, which we don't really need to do, because the read.csv function already has the separator set to comma. There's a header, we are going to say row.names is one, and we are going to say na.strings is c, and we're going to do minus, minus, minus, because that seems to be the missing value. We want to load in data number oh six, and it's data number oh six dot csv. All right, data six. All right, let's go to R, see if that looks okay.
So we load it in, and then it says: duplicate row names are not allowed. So there is something in this file which has the same row name, right? Because here it looks to be unique, but apparently it's not, right? So somewhere in this file it's not unique, which seems a little bit strange, because looking at it, it does seem to be, but let me check if that's true, right? Because I really wonder. So we say: don't use the row names, right? Because there seem to be duplicate row names. And then we go back to R, load it in without specifying our row names, so data six, one to 10, one to 10 — doesn't have 10 columns, that's perfectly fine. Ah, and now we can see I forgot to skip, right? Because if we go back and we look at the file, then we see indeed that on top there is a bunch of stuff that we want to ignore, right? So we want to skip 17 lines, so let's do that. So let's say comma skip is 17, and now we're going to try again, row names equals one, load in data six, go to R, copy paste, and now we have data six, one to 10, perfect. So now we can actually use this row name to select something, and here we see "ambiguous position // ---". So it didn't do all of them, because this is not a missing — well, it is a missing value, right? But it says "ambiguous position // ---", so it doesn't recognize this as being a real missing value, because it says ambiguous position. But it seems to be okay. We do want to check the names here, right? Because if we go back to Notepad++, then the first columns are called Probe Set ID, Affy SNP ID, right? So the names have spaces. So again, we want to say comma check.names equals FALSE. Right, so don't be smart, don't start modifying my names. So let's go back to R, load it in, data six, one to 10, one to 10. And now it seems to be better, right? So it says Affy SNP ID, dbSNP RS ID, dbSNP loc type, chromosome, strand plus, cytoband. So everything seems to be loaded correctly now and we have the proper column names.
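And the data six recipe — skip the preamble, keep the spaced-out column names, and treat "---" as missing — on a small made-up file:

```r
# A comma-separated file in the style of data six: a few preamble
# lines to skip, '---' as the missing-value marker, and column names
# containing spaces (everything here is invented).
f6 <- tempfile(fileext = ".csv")
writeLines(c("# exported from somewhere",
             "# more preamble",
             "Probe Set ID,Chromosome,Cytoband",
             "p1,1,q21",
             "p2,2,---"), f6)

d6 <- read.csv(f6, header = TRUE,
               skip = 2,                    # jump the preamble
               check.names = FALSE,         # keep the spaces in the names
               na.strings = c("---"))       # '---' means missing

colnames(d6)[1]      # the space survives: "Probe Set ID"
is.na(d6$Cytoband)   # only the '---' entry became NA
```

Note that na.strings only matches a field that is exactly "---"; a value like "ambiguous position // ---" is not replaced, which is the behavior we just saw.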
All right, next file. So file number seven. File number seven again looks very similar to the file we had before. Here we see zero, two, minus one. Minus one might be the missing value, but the file doesn't tell me. So I'm not going to treat minus one as a missing value, because it might be real information in this case. Although it does seem like a missing value in a way, we can't be sure, right? Here we would have to call the people that made the data set and ask them: is minus one a missing value, or is it really a data value? But since we don't know, we are just going to leave it alone. So I'm just going to copy this, right? We're going to say data number seven. data7, this has a .txt extension. Our separator in this case is tab, I think — let me check. Yes, tab-separated file. header is TRUE, yes, we have a header. The first column seems to be the row name, so I'm going to keep row.names = 1, and check.names = FALSE. Yes, because we have numbers as row names, so we don't want R to be smart and put an X in front of them. All right, so this should work. Let's go to R and load in data number seven. If we look at data7, that seems to look okay. So I'm a little bit unsure — minus one might mean missing value, might not — but this is something that you have to check with the people that made the data, right? You just have to ask them: is minus one really a missing value or not? All right, so that was question number one. Are there any remarks or suggestions so far? If not, then we go to question number two. So question number two is about loremipsum.txt, right? We want to read it in line by line, using a file connection and a while loop. Fortunately I gave you guys the code, right? But let me actually clean this up a little bit first, because I don't want the code to look messy.
So the first thing we're going to do is say: hashtag question one. And this is a little bit of an annoyance, because this one can be a little bit shorter: rep("numeric", 2), right? Instead of saying numeric, numeric, factor, factor. So this is about the minimum we can get it down to, but I do like these things to be on one line, so I'm going to do it like this: I open up the brackets, then I say colClasses, and then this one goes like this, right? Then the same for file two, file three, file four, file five, six. It just fits on the line, which is fine. So that's okay. All right, so then: hashtag question two. So, the Lorem Ipsum file. Let me actually grab the slide where we were loading in a file line by line. So I can say line.n, because I want to keep track of which line I am on; initially I'm on line one. Then I want a t.file to hold the pointer, right? Because we have to open up the file. We can do that using the file() command, and then we want to open the Lorem Ipsum text file. We want to read through the file, so I'm going to specify "r". And now we're going to use this magic incantation: while, and then I say line = readLines(t.file, n = 1) — I want to load one line — and then I check if the length of this is larger than zero, and that is going to be my while loop. But you guys had this code on the slide, so you could have just copy-pasted it. And it's not that hard — coming up with this kind of code is like: hey, you want to use readLines, read one line, put it in the variable line, and then check if the variable you just created actually has a length larger than zero, because a length equal to zero means that you're at the end of the file. All right, so then we want to — let me get the questions — read in the text file using a connection and a while loop.
Okay, so we just want to walk through it line by line. So I'm just going to say line.n is line.n plus one, right? So that we keep track of how many lines we've had. In theory, we could also cat the whole thing, right? So we could cat the line with a newline behind it to the screen, and then the whole file would roll across the R window. So let me close this one, this one, this one, this one, this one, and this one, and let me open up the Lorem Ipsum file, which is just generated, right? Lorem Ipsum is the standard for when you want to test or have some — how do you call it — example text, placeholder text, right? So it's a lot of generated pseudo-Latin. All right, so let's go through the file line by line. Let me check: in the end it should print 151, right? Because we have 151 lines in there, and it ends with a specific little bit of Latin. So let me make sure that that is actually what happens. All right, so let's go to R and paste in the R code. And then — I made a mistake, because the Lorem Ipsum file cannot be found. Did I make a typo somewhere? All right, so let me do a dir: loremipsum.txt — okay, see, so the name I typed was wrong. So I'm just going to copy it here, go back to Notepad++, make sure that I update the file name, copy, save, go to R again. All right — okay: supplied argument name 'line' does not match 'x'. Why is that error in length()? Supplied argument name 'line' does not match 'x'. All right, so does it do the readLines? Yeah. And then line is — why would that not work? And then I say line — that's the line that we read in — and the length of this line is one, which is larger than zero. So why would it error out on this command? "Supplied argument name 'line' does not match 'x'." It's an interesting, interesting error. So why is that length — let me go to Notepad++, right? So here we have length, this gives us a value, this is closed properly, the while is closed, larger than zero. This is a conundrum.
Let me just be dumb and copy-paste it again, right? Sometimes that's the issue. So which line.n are we on? We're still on the first line. So the first time it executes this command, it comes up with an error, and the error isn't really saying much, right? "Supplied argument name 'line' does not match 'x'." If we just call the variable x, is that good enough? Probably not. But I'm going to Notepad++, change line to x, and see if that helps. Because from the command, right, you can see that line gets colored purple, like cat does. So this might be a reserved name in R — it might be that it complains about that, that it doesn't see line as a variable name I created, but tries to call a function called line. But no idea what's going on there. So let me see. And then indeed it does run. But then it cats an x which is not the line it was supposed to read, right? I was expecting the Lorem Ipsum file to just roll through, but it prints the same thing while line.n counts up. It did end on 152, which checks out, because line.n started at one. The question is just: why does it not assign x? Good. So the reason it gave this error is that here we are doing an assignment with the equals sign, and then we're directly asking for the length of the assignment. The big issue is that inside a function call, R treats name = value as a named argument, not as an assignment — so it never does the assignment properly; it just hands length() an argument called line (or x). We have to add an additional pair of brackets surrounding the assignment, so that the assignment happens first and then the length is asked of the result, instead of asking for the length of a named argument. All right, so let's go to R and see if that is going to help us. And yes, in this case it did load it correctly. I think in the example on the sheet it actually has the double bracket. And that is just because of how R decides what to evaluate when — its order of operations, in a way.
So an assignment with = inside a call is not treated as something that happens before the function call — R reads it as a named argument, so it will not do the assignment at all; it will just call the length function. That's the issue here: without the extra brackets surrounding the assignment, the assignment never happens, and passing a named argument that length() doesn't know is of course a nonsensical operation. We might actually also get around this by just doing the assignment with the <- arrow instead. But either way: extra brackets, because R wants it that way — we want to force it to first do the assignment of the line to the variable x and then ask the length of it. All right, next question. The next question is: copy the code from assignment 2A. Okay, we can do that. So this is question 2A, we're just going to copy the code — which becomes 2B now. All right: adjust it so that for each line in the file, the number of words on the line is counted using the strsplit function. Use the cat function to print to screen "line X contains Y words", where X and Y are the line number and the number of words on the line, respectively. All right. So now we want to cat. I want to print to the screen "line", then line.n, because that holds our line number. Of course, I want to put that before I increment the line number, right? Because the first line is one and we already set line.n to one. So line.n, and now I need to figure out how many words it contains, right? So I have to say "line X contains ... words". So now I have to figure out how many words there are on the line that I just read in. And I'm just going to call the variable line again, because I think that's a much better name for it. So now I have to figure out how many there are. So let me go to R and first run this piece of code again, just so that we have the line variable defined.
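To tie the whole loop together, here is the corrected read-one-line-at-a-time pattern, with the assignment written as <- (a minimal, self-contained sketch: it creates its own three-line file rather than using the assignment's file).

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

line.n <- 0
t.file <- file(tmp, "r")                 # open a read connection
while (length(line <- readLines(t.file, n = 1)) > 0) {
  # 'line <- ...' really assigns; 'line = ...' inside length() would be
  # parsed as a named argument and trigger the error from the lecture
  line.n <- line.n + 1
}
close(t.file)                            # always close the connection
line.n                                   # 3
```

The same loop works with the extra pair of brackets, `length((line = readLines(t.file, n = 1)))`, which is the form shown on the slide.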
So I'm just going to run our code. It will print everything to the screen, and I will type line, and this is the last line, right? So how do I now find out how many words there are? Fortunately, words are very simple to recognize, because if you have a word, then there's a space and then there's the next word, right? So I can do two things: I can, for example, count the number of spaces in the line, or I can split the line by spaces. So I can say strsplit(line, " ") — split the line by the space character. Now if I do this, you see that it gives me back a list, because here we see the double brackets, [[1]], right? So the first element of the list contains all of the words. In this case, there are 106 individual words in this line. But of course, if I ask for the length of this, it will tell me one. And why will it tell me one? Because strsplit does not care if I give it one line, or 10 lines, or 100 lines — the function itself is generic. If I give it one line, it works; if I give it 10 lines, it will split each of those 10 lines. And that is why it returns a list. Because I'm only giving it one line — if I had given it two lines, then of course the length would have been two, right? If I would say line, and then add another line to it, for example "AAA BBB" — a very short line — then it will say that the length is two. So how does that look if I do this in R? You can see that the first element contains the split of the first line, and the second contains the split of the line that I just added using the c() function. So this is the reason why we can't just ask for the length of the strsplit result. What we have to do in this case, if we want to know what the length is, is go to the first element, right?
Because it's a list with a vector in there. So go into the list, take the first element from the list, and then count how many elements there are in the vector. So in this case, on the last line of the file, there are 106 words. I'm just going to copy this, go to Notepad++, and then — oh, that's nice — I can now define something called n.words, which is the length of the strsplit of the line, but making sure that we look into the list that is returned: take the first element of this list and then count the number of elements in the vector. And then I can just cat n.words, right? So that should be okay. Alrighty, so let's run the code and see — actually, let's also add up the total number of words. So words.total: initially we have not counted any words, and then we can add to it: words.total is words.total plus the number of words on this line. Just to remember the total as well — why not, since we're doing it anyway. All right, so let's go to R. It looks kind of okay, right? But I don't like the fact that there's a space between the line number and the comma; that's not how it's supposed to be. So let's fix that as well. We go back to Notepad++, and here there's a space, and we can just say in this case sep is nothing — sep = "" — but now we have to make sure that we put the spaces in ourselves: "line", space, line number, comma, space, "contains", space, number of words, space, "words". Just to make sure that it's correct English, right? We're professionals, so we want stuff to look good. So we are not going to agree with R making the arbitrary decision for us that after the number there needs to be a space. I don't want that. All right, so now it's proper English: line something, comma, contains this many words. And now I can actually ask for words.total.
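The strsplit detail above, as a tiny sketch:

```r
line <- "one two three"
parts <- strsplit(line, " ")    # a list: one element per input string
length(parts)                   # 1  -- the list length, NOT the word count
n.words <- length(parts[[1]])   # 3  -- step into the first list element
# strsplit is vectorised, so two input strings give a list of length two:
length(strsplit(c(line, "AAA BBB"), " "))   # 2
```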
So in total there are 13,224 words in the whole Lorem Ipsum file, and of course we now know exactly how many there are per line, and there are 151 lines. Good, so that's question number two. All right, then we arrive at biomaRt. So: install the biomaRt package. That's something we can just Google, right? If we go to Firefox and Google "install biomaRt", then here we can see that to install this package, we can just copy-paste this code. All right, so copy-paste the code, go back to Notepad++, and say: hashtag question three. And then this is the install. I don't like that this is on two lines, so I'm just going to put it like this, right? And I've already installed biomaRt, so I'm not going to execute the code, because then it will install it again and it will take a couple of minutes to download the whole thing. So this was question 3A, and then we're going to do question 3B: load the package and list all the marts available. All right, so we say library(biomaRt) — with the capital R — and then listMarts. I think it's listMarts with an s, but let me go into R and make sure that's okay. So there are actually four marts available, right? We have the Ensembl genes mart, we have the Ensembl mouse strains mart for some reason, then we have variation, and then we have regulation. So the genes mart is just the genes on the genome. The mouse strains mart has all the different mouse strains that are available, because of course there are many different types of mice compared to humans — there's only one type of human, but mice come in very different subspecies. The variation database contains SNPs, so single nucleotide polymorphisms, and the regulation database contains information about gene regulation.
But the question was to list all of the marts; we did that, so that's perfectly good. Then we go to question 3C. Let me copy this down, down, down. So now: question 3C — connect to the SNP database for mouse. All right, so the way we do this is to say bio.mart — because we have to assign it to something — is... how did we connect again? We connect using useMart or getMart, one of the two. I think it's useMart. I don't have the previous lecture open as a PDF, but I think it's useMart and not getMart. All right, so we want to connect to the SNP database for mouse. Okay, from R we already know that we have the variation database, and that's ENSEMBL_MART_SNP, right? That's the name of it. So we're just going to use that one. Then I'm going to execute this command and store it in bio.mart, because I have no idea what the mouse dataset is called — no clue. So I'm going to connect to this mart and then say listDatasets(bio.mart), right? List all of the datasets that are available. So we have cows, goats, dogs, zebrafish, horses, cats, chickens, and so forth, and so forth. So where is the mouse? Where is the mouse? Mus musculus, right — you have to know the Latin name in this case. Number 20: it says mouse short variants. All right, good. So mmusculus_snp. We're going to copy this, because this is the name — here it says the dataset name. So we are now going to update our call and say: comma, dataset is — I want to connect to the mouse dataset, right? Not to the goats, not to the humans, not to the others. All right, so let's copy this, go back to R, connect to the BioMart, see if it works — and it worked, perfect. So now we are connected. Okay, now the question is: load in the SNPIDs.txt file.
Let me open it up for you guys in Notepad++. So this is how it looks: it has a whole bunch of SNPs in there which we can query, right? We want more information on these SNPs. So we're going to say read.table — no, we're just going to say readLines. readLines, the file is called SNPIDs.txt, and I'm going to call it — SNP... no, snp.names, right? Let's use good variable names. All right, so let's see if this works. So let's go to R: readLines — and we get a warning message, "incomplete final line found". So let me check. snp.names: there are 96, 97, 98, 99, 100 elements, and going back to the file, the file has 100 in there as well. So it complains about the fact that there's no newline at the end of the file. But we're not touching the file, because that's not done. Data is data, so we just load it in. We do not manipulate data in a text editor, because this is something that we get from a company or from a program, so we don't want to edit it — because then we would have to write that down, right? Research needs to be reproducible. So if we manually start editing our file, then it has to be documented somewhere; but if we don't manipulate it in a text editor and just use R to work around the little errors, then that is better. And in this case there is no real error, right? It reads in all 100 elements; it just gives me a warning. So I'm going to make the active decision to ignore the warning it gives me. All right, next question. So this was 3D actually, so this is 3C slash 3D. And now we're going to do question 3F. All right: query BioMart for the SNP IDs and retrieve refsnp id, allele, chromosome name, chromosome start. Let me actually just copy what we want from the Word document — or the PDF document. So this is what we want, right? I'm just going to combine all of these things into a vector. And in biomaRt we have a mart, right? The mart is the thing that we are querying.
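The "incomplete final line" situation can be reproduced without touching any real data file — a sketch with a made-up three-ID file written without a trailing newline:

```r
tmp <- tempfile(fileext = ".txt")
cat("rs1\nrs2\nrs3", file = tmp)   # note: no newline after the last ID
snp.names <- readLines(tmp)        # warns about the incomplete final line
length(snp.names)                  # 3  -- nothing is actually lost
snp.names[3]                       # "rs3"
```

So the warning is cosmetic: readLines still returns every line, and the file on disk stays untouched.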
We have the attributes — that's the thing we want to retrieve — and we have the filters, and that is the thing we're going to give them, right? So these are our attributes. I was going to say attr, but to.retrieve — that's a better name. So these are the things that I want to retrieve, and biomaRt calls them attributes, right? Just as a reminder to myself. So I want to say getBM, right? I want to use my bio.mart variable to connect. Let me actually go to R and quickly look up the getBM function — so it's a long function. And that's one of the nice things about R: if you want to see which parameters a function has, you can just type in the name of the function without calling it — so no round brackets, just the name — and R will print out the whole function for you. So the first argument is attributes, and the mart is called mart, right? So let's go back and say: the mart that I want to query is this, and attributes is to.retrieve, because that is what I want to do. Then I need to specify the filter, right? Because I have SNP IDs. So let me go to R and ask: how do I tell BioMart that I have SNP IDs? I can say listFilters — listFilters(bio.mart), right? And then these are all of the things that I can use to query SNPs. I can use the chromosome name, the start, the end, the band, the chromosomal region, variation source, snp_filter... snp_filter is "filter by variant name", right? So that's number nine: snp_filter — and we have the names, right? We have the rs IDs. So we can just use the snp_filter, all right? So let's go back. The attributes are this, the filter is this — and of course this is a string. And then, what do we want to retrieve?
Well, we want to hand over our SNP names, and in biomaRt that is called — let me scroll up a little bit — I think it's called values. Attributes, filters, values. Yes, values, okay, very good. So now I'm going to say, here all the way at the bottom: filters is "snp_filter", values is snp.names, right? Those are the things that I have. It starts becoming a little bit too long, but this is my snp.data, right? The data that belongs to it. And I'm just going to align this a little bit, so that the code fits on the screen, and also fits on the screen when viewed in a terminal or something like that. All right, so let's see if this works. Let's go to R: getBM. And now, my snp.data — seems to have worked. So for each SNP it now got the proper allele coding, right? This is the reference, this is the alternative — the reference is relative to the reference genome. This one is located on chromosome one at this position, and all of them are actually located on chromosome one. Good, so this is the biomaRt assignment, all done: loading the IDs in, specifying the filters, specifying what we want to retrieve, and making sure that we get it loaded in. All right, so it's now 3:07. I've been programming for an hour and we still have three assignments — doesn't matter. We're going to do a short break, and after the break we'll just continue doing the assignments, and it's going to be fun. Because the long-running analysis is one which is, well, a long-running analysis — that's something we can just run in the background while we continue with reading binary data, and then the advanced one, which I really like a lot. And then we're going to have some fun, because we're going to have stuff move in R: we're going to take an image and use random sampling to have the image move from left to right. All right, so let me set up the break for you guys.
So we want some music, of course. I selected some new music — the first break is going to be Foxes, I think — and I will be back in around 10 minutes, so we'll continue at like 3:18, 3:20. All right, so enjoy the music, let me turn that on — there it is — and then turn myself off. All right, made it back in time. Cats in dog shape? Yes, cats in dog shape, that is true. They are so sweet. Anyway, let's continue, right? We still have a couple of assignments left, so let's just continue doing them. So the next assignment was the long-running analysis. Okay, so let's start again: hashtag question 4A. All right, so let's do this. Set your random seed to a fixed value, so that you generate the same matrix of random numbers each time, and create a matrix holding random values to calculate correlations on, around 10,000 by 1,000. All right, so I'm going to set my seed to 42, which is my favorite seed — I always do that — and I'm going to generate normally distributed numbers. I want a matrix of 10,000 by 1,000, and that means I need to fill it with 10,000 times 1,000 numbers. All right, we call this bigM, for big matrix. So let's go to R, set our seed, and generate our matrix — takes a little while — and it actually says: warning, closing unused connection to the Lorem Ipsum file. That is because I did not close the Lorem Ipsum connection when we were reading through it, so the warning message has nothing to do with the big matrix. So let's check the big matrix: 1 to 10, comma, 1 to 10. So these are my values. Seems perfectly fine; they're not repeating; it seems to be properly random. And of course, I want to check this again, right? Because I want to generate the same matrix again, look at the numbers, and make sure that it checks out. So the first number is 1.37, and again 1.37. The number here is 0.06, and here it's also 0.06.
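The reproducibility claim can be checked on a small scale — a sketch where a 10 by 5 matrix stands in for the 10,000 by 1,000 one:

```r
set.seed(42)
m1 <- matrix(rnorm(10 * 5), nrow = 10, ncol = 5)
set.seed(42)                       # reset to the same seed...
m2 <- matrix(rnorm(10 * 5), nrow = 10, ncol = 5)
identical(m1, m2)                  # TRUE -- the same "random" numbers again
```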
So indeed, setting the seed means I generate the exact same random numbers again. Alrighty then, so that was question 4A. 4B: decide where to store the results on disk — the location and file name. So I'm just going to be lazy and say I'm going to store my results in something called results. So I'm going to say cat: I'm going to cat nothing into my file, and my file is going to be results.txt — let's call it correlation results, so corresults.txt. This is something that I only want to do once, right? Because once I start computing correlations, we're going to use this file as our temporary file — or rather, our results file. We're just going to append the new correlations that we calculate to this file. This is going to take like an hour in total to calculate all of the correlations, so we might want to have a break in between, or the power might go out. So we want to make sure that we can resume later in case the power fails. So we check if the file is empty using an if statement or the tryCatch function; if the file is not empty, we need to load in the data from the previous computation. All right, so we're going to say if file.exists — and I'm just going to store the file name in a variable, so I make a variable called fname, put in the file name — so: if file.exists(fname), then what do we want to do? Well, we want to load it, right? So read.table — let's just use read.csv. We want to read fname, the separator is going to be a tab, because that's what I'm going to write, and this is called mdata. And else, if the file does not exist, I am going to make it. So I'm just going to put the cat into the else branch, right? If the file exists, I load it; if not, I make it. So let's go and run this in R. When I run this, this file should not exist — at least I don't think it exists.
So once I copy-paste this into R, it should take the else branch and make the file for me, right? So now if I do a dir, it should have created a file called corresults.txt — and indeed it is there. So if we now rerun the code, then it says: error in read.table, no lines available in input. Because the file is completely empty, which means I can't use read.table on it. So I need to write a tryCatch around this to make sure that when there is nothing in the file, I still get back a matrix. Because in the end I want to make a correlation matrix, and this correlation matrix is going to be 10,000 by 10,000, which is way too big to just keep in memory. So I need to do something like tryCatch. What do I want to try? Well, I want to try loading in the data. And if there is an error — look this up, because you guys just had the example on the slide. All right, so: tryCatch. Try to load my data, load this CSV file; if not, then I want to return a matrix. My matrix is going to be filled with NAs, and the matrix is going to be zero by — no, it's going to have 10,000 columns and zero rows, right? Because initially, when I haven't done anything, that is the shape I want to get back. And I'm going to assign the result of the tryCatch to mdata. So the tryCatch is going to try to read the file, and if there is an error, it will return an empty matrix. So I close this one, the if is closed there — this is all perfectly fine. And this comma can actually go onto a single line — no, it just doesn't fit, so let's do it like this. So: this is the command to execute, and if it fails, return empty. All right, so let's see if this works. Let's go to R and see if we typed everything correctly. And now, mdata, right?
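The load-or-start-empty logic, sketched small (a temp file stands in for corresults.txt, and 1,000 columns stands in for the real width):

```r
fname <- tempfile(fileext = ".txt")
cat("", file = fname)                          # fresh, empty results file
mdata <- tryCatch(
  read.csv(fname, sep = "\t", header = FALSE), # fails on an empty file
  error = function(e) matrix(NA, nrow = 0, ncol = 1000)  # nothing done yet
)
dim(mdata)   # 0 1000 -- an empty matrix whose rows we can count
```

The point is that either branch gives us something with a row count, so `nrow(mdata)` always tells us how many correlation rows are already on disk.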
Because there's no error now, mdata should be a matrix which has 10,000 columns and is completely empty. So if I ask for the dimensions, then indeed: zero rows, zero correlations computed. So now we have our empty matrix, which is what we wanted. All right, the next step is to start calculating our correlations. Because we have this mdata, we now know how many correlations we already did. So we can say: for x in 1 to the number of columns of our mdata — no. We don't want to always start at one, because if we had already done more, then we want to continue from there. So we want the maximum of one and the number of rows that we already had in mdata. And then I want to go up to the number of columns of the big matrix. All right, so let me explain this again. I'm going to say for x — where I need to start is either at one, or, if there is already stuff computed, at that position plus one, right? Because if I already did 15 correlation computations, then I want to start at the 16th. And I want to do this for every column of the big matrix. All right, so what do I want to do inside? Well, I want to calculate my correlation. So I'm going to say: take column number x out of bigM, and calculate the correlations of this column to every other column in bigM. These are called cors, right? Just to make sure that it works, I'm going to go to R and manually set x to 1, and do my correlation computation by hand: take the first column from the big matrix and correlate that to all of the other columns in the big matrix — including itself, right? So the expectation here is that the first number is going to be one, because something correlated to itself is always one.
And the other correlation coefficients are going to be more or less around zero — well, they don't have to be, but roughly. So if I would print this, it would calculate, like, 10,000 values, so the length of this vector is going to be 10,000. So let's check that — 1,000? Let me see: dim(bigM) — ah, 10,000 rows, 1,000 columns. Sorry, I made the matrix the wrong way around. But that's fine, we can do 1,000, that's no issue. So it will do 1,000 correlations and give me back the vector, right? So now, I just calculated the first one; what I want to do then is store this in the file. So I'm going to take my correlations and say: paste the correlations together using the tab separator, because I said I was going to save them in the file tab-separated, and that's what I'm going to do. Then I'm going to cat this, adding a newline at the end, and say where I want to write it to: to my fname, which is corresults.txt. And then I want append, right? Because I don't want to overwrite the whole file — otherwise the file would always have just one line in it — so I say append is TRUE: add this to the file that we already had. And then of course I also want to say cat "done", right? Because I want to make sure we have some kind of progress tracking. So: done, x, out of, and then the number of columns, because that's the number of columns we're going to do, and then a "\n" newline, right? So let me open up this corresults.txt file for you guys, and then we can see what happens: initially it's empty, and while the program is running, it should start writing the correlation coefficients out to the file. All right, so let's run everything again: set the seed again, regenerate the matrix, and go to R.
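The compute-and-append step, shrunk to a runnable sketch: a 100 by 10 matrix stands in for the real one, only the first two columns are processed, and a temp file stands in for corresults.txt. Note this sketch already pastes with collapse = "\t", the fix that only comes up later in the lecture.

```r
set.seed(1)
bigM <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)
fname <- tempfile(fileext = ".txt")
for (x in 1:2) {                        # just the first two columns here
  cors <- cor(bigM[, x], bigM)          # column x against every column
  cat(paste(cors, collapse = "\t"), "\n",
      file = fname, append = TRUE, sep = "")   # one tab-separated line per column
  cat("done", x, "of", ncol(bigM), "\n")       # progress to the screen
}
length(readLines(fname))                # 2 -- one result line per column
```

Each loop iteration appends one complete row of correlations, which is exactly what makes the computation resumable: the number of lines in the file equals the number of finished columns.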
So we generate our matrix, and now it says done, done, done, done. While it is running, we go to Notepad++ and it reloads the file, and every time I switch back to it, it reloads again. We've done 50 now, still 50... nope, more: we've now done 111. Now imagine I get a call: shut down your computers, the whole network has been hacked, or there's a power failure. At that point I just press stop. So how do I continue? With the code that we just wrote. What I would normally do is close R, because the data is already stored on disk. If we look at the bottom of the file, we did 215; R said done 214, which is fine, the last line was written to the file but the progress message hadn't printed yet. So we shut down R, reboot the whole computer, and now we want to continue. We do the same thing as before: set the seed, regenerate the matrix, then load in the file, and we first stop there, just to make sure that everything's correct. So now it should have gone into the read.csv and not into the error handler. If I ask for the dimensions of mdata, it tells me there are 240 lines in the file, which is not entirely correct, and if I look at the first ten it says "undefined columns selected", because it did not parse the file the way I intended: it separated by spaces instead of tabs. So that's not correct, and we have to fix how the file is written.
I think this should have been paste with a tab separator: we want to collapse on a tab, because paste's default separator is a space. So let's delete the content of the file, since we messed up our code, because I do want a tab as separator. We run the code again, start the computation, start writing to the output file, and stop after a couple. We've done 21. When I go to Notepad++ it asks if I want to reload, and we see that the file has been written up to line 22: when I stopped it, the progress message hadn't printed yet, but the line was already in the file. So we go to the answers script again and do the same thing: copy in the first part so that it loads the data we have already done, regenerate the matrix, load in the data so far. I ask for dim of mdata: 21 rows? No, this is not correct, because read.csv treats the first line as the header by default, but we did not write a header to the file. So we have to add header = FALSE, and then we run this first part again. Now mdata looks fine: the correlation of column one versus column one is one. We could actually add proper column names here and also write them to the file, but in this case we don't have them. So now it loaded correctly, and I want to start computing correlations from the right position onward, and that's where this magical max(1, nrow(mdata) + 1) comes from. Let me see: we have 22 rows already computed, so this ends up being 22, plus one is 23, and the loop becomes for (x in 23:...) up to the number of columns of the big matrix.
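The reload step in isolation, with the two pitfalls fixed above: read.csv defaults to sep = "," and header = TRUE, so both must be overridden or the first result line gets eaten as a header. A throwaway file stands in for corresults.txt here.

```r
fname <- tempfile(fileext = ".txt")
cat("1\t0.02\t-0.01\n0.02\t1\t0.03\n", file = fname)  # pretend two columns are done

# both defaults overridden: tab separator, no header row in the file
mdata <- tryCatch(read.csv(fname, sep = "\t", header = FALSE),
                  error = function(e) matrix(nrow = 0, ncol = 3))

start <- max(1, nrow(mdata) + 1)  # continue at the first unfinished column
start                             # columns 1 and 2 were written, so resume at 3
```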
So let's continue the computation: go to R, continue, and you see that it starts at 23 again. Then we get a call again, grandma's in the hospital, please come now, shut down your computer. We shut down, and when we look in Notepad++ we see that it kept writing to the file; we're now at about 90 out of a thousand. We come back the next day, copy in all of the code again, continue the analysis at this point, and it starts from 91, and we could quit and restart again and again, every time using this results file as a temporary file. I understand that this is complex, but it's something that happens a lot in programming. A lot of the data you are going to work with in the future is going to be much bigger than the memory available to R, and runtimes are going to be much longer. Windows has an update once every week and your computer needs to reboot, so if you have an analysis that runs for four or five days, you need to make sure that it is continuable: that you can stop it and, once you come back, continue the computation where you left off. These correlations are not too bad; they go relatively fast, we're already a quarter done. But it's just an example: I could make the matrix as big as I wanted, because I'm barely using any memory. My computer is just churning through the correlation computations, but that's the nice thing about doing it this way.
You have a temporary file where you write your results, and if something goes wrong and your computer catches fire, you can just continue from the same point and you don't lose anything. We can quit it, start it, quit it, start it: it writes everything to disk and only loads from disk when required, and in that sense you're not bothered by the fact that R only gives you a couple of gigabytes of RAM. Good, so a very, very difficult question; I hope you spent some time on it. You had most of the code on the slide, so if you started by copy-pasting the slide code for the long-running analysis, it was perfectly doable, because then you only had to fill in the correlation part. And if we go to Notepad++, we see that it has kept computing in the background; we're at 425 already. Not something you'll use directly tomorrow, but something that is going to become more and more important, because data sizes grow every day. Let me check which parts this covers... yes, it's for questions B, C, and D. All right, next topic, just as difficult: we have this BMP file, which we now want to load in as binary data. Let me go to R and check what the file is called; it's image2, so let's use image2 this time. The first thing we need to know about this file, and this is question 5, is how big it is. So I'm going to get the file info into image.info, because when we use the readBin function we can't just say n = -1; we have to specify how many bytes we want to read. I have no idea how big the file is, so I'm just going to use R to get the size of the file so we can load it in.
If I look at image.info, it tells me the file is 120,054 bytes, and those 54 extra bytes are of course there because it's a BMP file. The other fields, whether it's a directory, modification times and so on, we don't really have to worry about; the only thing we care about is the size. So let's use readBin to load in the whole file: read image2.bmp, with n = the size from image.info, read it in as raw, and store it into image.data, a new variable. All right, go to R, load the image data... invalid n argument. Why is that? Because image.info is a little data-frame-y thing, so we wrap the size in as.numeric and then it's fine. I'll align the lines a little better, as.numeric, what = "raw", so that the code looks better too, because in the end we want code that looks good as well. So let's load it in and go to R. Now we have our image.data, and this is just the hexadecimal representation of the bytes in the file. You can see that the first 54 bytes look a little different from the rest, and that's because they are the header: the first bytes tell Windows that this is a real BMP file. So what do we want to do next? Get rid of the header, because we don't need the first 54 bytes. I'm going to say minus c(1:54); I'm just going to make a little vector.
The reason I put this c() around the sequence, even though it looks useless, is that I want all of the numbers to be negative so that they're all thrown away: minus one, minus two, minus three, minus four, minus five. With the c() around it, the minus applies to the whole vector. Then I store the result back into image.data, and just to check, I ask for the length of image.data, because that should now be 54 shorter than the size of the file. So let's run it and go to R. Seems correct: 120,054 there and 120,000 here. And 120,000 is logical, because when you open the file you see that the image is 200 by 200, which is 40,000 pixels, and for each pixel we have three bytes, blue, green, and red. So 40,000 times three is indeed 120,000 color components. All right, so that was the question: load in the provided BMP file, that's question 5a, and don't forget to throw away the first 54 bytes, that's question 5b. Then we have... oh, there's actually a second question labeled 5b, so I made a typo in the assignments: extract the red, green, and blue channels from the BMP image. This is again where we can use our seq function. From image.data we select a sequence from one to the length of the image data, stepping by three, because the first blue value is at position one, the next at four, then seven, and so on.
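The loading step end to end can be sketched like this. A throwaway file stands in for image2.bmp so the snippet runs anywhere; only the 54-byte header size is BMP-specific.

```r
fake <- tempfile(fileext = ".bmp")
writeBin(as.raw(0:255), fake)            # 256 bytes standing in for the image file

image.info <- file.info(fake)            # a data frame: size, isdir, mtime, ...
image.data <- readBin(fake, what = "raw",
                      n = as.numeric(image.info$size))  # n must be a plain number

image.data <- image.data[-c(1:54)]       # strip the 54-byte header
length(image.data)                       # 54 fewer than the file size
```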
So that sequence gives us blue. Then we do the same thing three times; the byte order per pixel is blue, green, red, so green starts at position two and red at position three, every time stepping by three to the end. If we run this and I type blue, I see the first 10 blue color components. Comparing with the raw file: we threw away the first 54 bytes, so the first blue byte should be EB, and the next ones F2, F3, F2, and indeed that's what we get, so it extracted them correctly. Of course, the image is not a vector, it's a matrix. So let's convert the blue component directly to a matrix: say blue is a matrix, give it the values, and make it 200 by 200, which is the dimension of the image. We do that for all three channels. Now I have my blue matrix, and when I look at blue[1:10, 1:10], it's still in hexadecimal coding. So how do I go from hexadecimal to normal numbers? I can just use as.numeric: for R, raw values are just numbers underneath, and I can force them to be numeric. So in Notepad++ I can wrap the whole expression, the image data with the sequence inside, in as.numeric, and do that for all three of them; I just have to close the bracket in the right place, and here you see again that bracket highlighting helps a lot to make sure everything is correct. All right, back to R, look at the blue matrix.
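The channel extraction with seq(), shown on a tiny synthetic pixel stream so it is self-contained; BMP stores the bytes of each pixel in blue-green-red order, and the real image is 200 by 200 (2 by 2 here).

```r
w <- 2; h <- 2
image.data <- as.raw(1:(w * h * 3))   # fake interleaved B,G,R bytes

# every third byte, starting at offsets 1, 2, 3 respectively
blue  <- matrix(as.numeric(image.data[seq(1, length(image.data), by = 3)]), h, w)
green <- matrix(as.numeric(image.data[seq(2, length(image.data), by = 3)]), h, w)
red   <- matrix(as.numeric(image.data[seq(3, length(image.data), by = 3)]), h, w)

blue   # plain numbers now, not hexadecimal raw values
```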
And now it really contains numbers, from zero to 255, for the color components. Good. Next question: create an image of the different color channels using the image function. So we say image(blue), and this is the blue component of our image, the blue part of the Angry Bird. Then we have the red part of the Angry Bird; the Angry Bird itself is red, of course. And we might want to change the color scheme, because with the default palette you would not know which end is low and which is high. So we can pass col = gray.colors(10)... let me open the help for gray.colors to check the arguments. Right, so image(red, col = gray.colors(10)) shows the red component in 10 gray levels. We can add more colors to make it smoother, or use fewer: with three gray levels it kind of flattens the image. You can already see that we could start mixing and matching the different channels together. So that's loading in image data and using it to make a plot in R. I find this really, really fun; I don't use images a lot, but when I can, I do. All right, before the next question, let me show you something else. We now have this matrix, so let's do something interesting with it. Imagine we want this image to move. Moving the image to the left would mean dropping the first column and putting it on the back.
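The plotting calls just described, with a random stand-in matrix for the red channel; gray.colors(n) returns n gray levels.

```r
red <- matrix(runif(200 * 200, 0, 255), 200, 200)  # stand-in channel

image(red)                          # default palette
image(red, col = gray.colors(10))   # 10 gray levels
image(red, col = gray.colors(3))    # only 3 levels flattens the image
```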
So: take off the first column, put it on the back, take off the first column, put it on the back, again and again. That's easy to do. Let's keep using my gray colors, because they are pretty nifty; say five of them. Now I can say for (x in 1:200), because there are 200 columns in my matrix. What do I do each iteration? Make an image of the red component, and cbind the first column onto the back of the red matrix without its first column. So I drop the first column from the red matrix and put it behind: the first column becomes the last column. Since I do this 200 times, every column gets moved to the back once. I store the result back into red, and I do Sys.sleep for a tenth of a second; that should be good enough to see it moving, since it would take 20 seconds to scroll from left to right. So let's go to R and see how this works. Let me make the console a little smaller so the plot window sits next to it. And now you start to see it moving, but it's very flickery, because it redraws too quickly. Hmm, it's actually moving from top to bottom; I would have expected it to move the other way around, since I'm rotating the first column out. Oh, image() transposes the matrix. That's annoying. All right, back to Notepad++; re-extract the red, green, and blue components.
I paste that into R so that the red component is reset, and instead of cbind I now use rbind: throw away the first row and append the first row at the end. So I take the whole matrix, take the first row, put it at the bottom. And I sleep a little longer, so each frame stays on screen long enough to actually see something. It's still quite flickery, but now it moves the right way: if we wait a little, after a couple of rounds it has scrolled a good distance. So that's what I wanted to show you; I made a little script called moving Obama that does the same thing as what we just did: take the file info, read the data, throw away the first 54 bytes, extract red, green, and blue, and then draw in a pointillism style, which is the next part. Question from chat: are 3D movements possible, even though that sounds too complicated? Yes, 3D movement is quite possible in R as well. The only drawback of 3D things in R is that you have to give an x, y, and z component, and the x and y components need to be sorted. But besides that, you can just use the persp function, P-E-R-S-P, for perspective, and that makes a 3D plot out of a matrix. So if you have a matrix with an x, a y, and a z component, you can just use it. We still have this image here, so we can use persp: 1:200 in x, 1:200 in y, the red component as the height, and it will draw the image.
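The scrolling loop, sketched with a stand-in channel. Because image() draws the matrix transposed, shifting rows (rbind) moves the picture sideways on screen, which is why the cbind version moved vertically; the loop is shortened to 5 frames here (1:200 in the lecture, one full wrap-around).

```r
red <- matrix(runif(200 * 200, 0, 255), 200, 200)  # stand-in channel

for (x in 1:5) {
  image(red, col = gray.colors(5))
  red <- rbind(red[-1, ], red[1, ])  # first row moves to the back
  Sys.sleep(0.1)                     # keep each frame visible briefly
}
```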
Of course this looks a little strange, because you're not looking straight down at it, but we can set phi = 90 degrees to rotate it so that you look at it more or less from above, or let's do 80 degrees. Now you can sort of see the Angry Bird, looking down at 80 degrees, slightly rotated. The other angle parameter is theta, I think, so if we say theta = 15, we rotate around a little as well, and there's the Angry Bird. And I think you can even set the color: col = the red component... well, that doesn't work very well, but we can actually mix the channels with rgb: col = rgb(red, green, blue), each divided by 255. Hmm, that doesn't look too good. Oh, because the red component has been shifted by our animation while the other components were not; one of them moved and the others didn't. Let me re-extract the channels, and now we can use the perspective function to make a real Angry Bird again. Still a little weird; red, green, blue, that should be fine. Let me set theta to zero, because that angle just looks wonky. The colors seem off, I think they run in the wrong direction, but you can do this, and of course you can have this 3D thing move as well, because we have phi and theta to play with. So let's take this piece of code... I'll just keep the colors in. I can do a perspective plot and then say for (x in 1:180).
So: turn it through 180 degrees, and instead of a fixed theta I now pass x, followed by Sys.sleep(0.1). This might be too heavy, because making the 3D plots takes more time than the 2D ones. So now you look at the side, it moves a little each step and redraws the whole plot, but it's not fast enough to keep up with the 3D graphics. Still, you can do the same thing with 3D plots as with 2D plots. All right, let's stop this; it redraws the plot every time. So what I wanted to show you is the moving Obama script, the pointillism thing, the same idea as before. I extract the red, green, and blue components, and the only extra thing I do is compute the plotting location for each color value. This is what I like about R: it allows you to be very creative. Oh, and you can set the plot window to not be buffered, so it draws faster, because it won't wait for the image to be finished. So: I make my own plot window of 200 by 200, I glue all the components together into colors, I randomly sample 1,500 pixels from the image, I draw those points with their colors, and every time x is divisible by two, so every even iteration, I move the image: I take a scan line and move it one further. We're doing this on image1, the picture of Obama, and what you see is this: it first builds up the Obama image, but it also shifts the image every second tick.
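A sketch of the persp() experiment, with synthetic channels standing in for the decoded image. phi and theta are persp's two viewing angles; col expects one color per surface facet, hence the [-1, -1] trimming of the channel matrices before building the facet colors, and the rotation loop is coarsened here (by = 1 in the lecture).

```r
n <- 40
red   <- matrix(runif(n * n, 0, 255), n, n)
green <- matrix(runif(n * n, 0, 255), n, n)
blue  <- matrix(runif(n * n, 0, 255), n, n)

# one color per facet: (n-1) x (n-1) values mixed from the three channels
facet_col <- rgb(red[-1, -1] / 255, green[-1, -1] / 255, blue[-1, -1] / 255)
persp(1:n, 1:n, red, phi = 80, theta = 15, col = facet_col)

for (theta in seq(0, 180, by = 60)) {   # rotate the view step by step
  persp(1:n, 1:n, red, phi = 80, theta = theta, col = facet_col)
  Sys.sleep(0.1)
}
```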
And of course it looks a little blurry, because we're not updating all of the pixels, only 1,500 of them per tick, but you can see that it moves from A to B. You can use the same trick to morph one image into another: load both images, first plot the first one, then use random pixels from the second image to overwrite the first, and it slowly, slowly morphs from one into the other. So you can take a picture of Obama and have it morph into a picture of Trump, or the other way around. It runs quite well, but because we're not updating every pixel each tick, you see this slightly blurry line where not everything is updated in the same go. But it's something you can do, and this makes programming fun. And if you do this with bioinformatics data, for example an eQTL plot, people actually think that you're working, because you can say: no, I'm looking for the best way to show my data. In this case it stops after a thousand iterations, and we could do the same thing with the Angry Bird or any image we find online. Good, so I'll put a couple of scripts online, since I've now been talking for almost an hour on the assignments alone. We're spending too much time doing the assignments live, I think, but in the end two hours is about the amount of time I want you to spend on them, and here we are, making 3D images for 15 minutes. But this is the way you can do plotting in R, and you can do very nifty things; you just have to think about what you want and how to achieve it.
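The pointillism idea can be reconstructed roughly as below. This is my reconstruction, not the lecturer's actual moving Obama script: synthetic channels stand in for the decoded BMP, and 10 ticks replace the 1,000 in the lecture.

```r
n <- 200
red   <- matrix(runif(n * n, 0, 255), n, n)
green <- matrix(runif(n * n, 0, 255), n, n)
blue  <- matrix(runif(n * n, 0, 255), n, n)
cols  <- rgb(red / 255, green / 255, blue / 255)  # one color per pixel (column-major)

plot(NULL, xlim = c(1, n), ylim = c(1, n), xlab = "", ylab = "")
for (x in 1:10) {
  idx <- sample(n * n, 1500)                      # redraw 1,500 random pixels
  points((idx - 1) %/% n + 1,                     # column of pixel idx
         (idx - 1) %%  n + 1,                     # row of pixel idx
         col = cols[idx], pch = 15, cex = 0.4)
  if (x %% 2 == 0) {                              # every even tick: scroll one line
    red   <- rbind(red[-1, ],   red[1, ])
    green <- rbind(green[-1, ], green[1, ])
    blue  <- rbind(blue[-1, ],  blue[1, ])
    cols  <- rgb(red / 255, green / 255, blue / 255)
  }
}
```

Because only a subset of pixels is redrawn each tick, the picture builds up and scrolls with the slightly blurry look described above.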
And I love the like moving Obama and of course you can do this with the Angry Bird as well. So let's just do a round of Angry Bird moving as well. So here we see the Angry Bird being drawn by using random pixels. And then every time we use the same system, right? So we drop the first column, put it on the back, drop the first column, put it on the back. And that makes the image move because every time we redraw the image and not the whole image, but we redraw a bunch of them. And if we want to make this a little bit better, right? Then in theory, I could not sample 1500, but I could sample like 5,000 pixels every time. So and then it becomes a little bit more clear, right? It's less blurry, but it moves a little bit slower because it has to update much more. All right, thank you very much. See you next week. Yeah, yeah, sorry, Frog, Frog, that you have to leave before we actually start with the lecture. I'm very sorry. But now you can see it moves slower, but it's more clear. So I took the 1500 because it makes it move in a reasonable speed. So both of them there, right? In the end, programming is fun. Program is creative. And that is what I want to show you guys. So it's not all statistics and data analysis. No, you can make animated GIFs in R as well if you really want to do that. And that's kind of what I want to show you guys. Good, so those were the assignments from last week. I will, of course, upload them to Moodle. I will also upload them to my own website so you can get the code there. And you can get the data there. The data was already there, of course, for the assignments. And I think I already uploaded the presentation. So for now, I'm going to show you guys the overview for today. Let me close my R window because it's actually still rendering the Angry Bird. And I am going to switch to the presentation myself. So lecture time. So I will be back in a couple of minutes, somewhere around like 10 minutes again. And then we will do the lecture. 
The lecture is not too long; like I said, it's only 40 slides, so we should be done in about an hour, around five. With that, let me start the break music... that's really, really loud, sorry, let me turn it down a little. Enjoy the break, and I'll see you in around 10 minutes. ... And I made it back in time. All right, let's start the lecture. Today we're going to talk about descriptive statistics, and I'm going to go a little fast, since we're almost at five and we spent a lot of time on the assignments. The first thing I want to cover is the API question that was asked last week, because people wanted to know whether there are any APIs for the social sciences. After that we'll talk about descriptive univariate statistics, which I think everyone should already be aware of, but it's the basis of the rest of statistics, so that's why we do a lecture about it. So: APIs for social sciences. There are a lot, and for some reason they are not harmonized. For example, if you're interested in data from Germany, you can go to the Federal Statistical Office of Germany, Destatis, and obtain data via the web interface: just go to their website, click around, and download data sets. But they also have an API, both REST and SOAP. I forgot to open it; I actually closed the Firefox window altogether, so let me set that up for you. The link is the GENESIS one... here, let me show you my Firefox window. So this is what it looks like: a structured data API, all computer-readable, not meant for humans, and it tells you what you can get.
You can, for example, get metadata, or time series. And it tells you what you need to provide: for time series data you have to give a username, a password, a name, which is probably the name of the data set you want to query, then the area you want to look at, and the language in which you want the data back, where the default is German. So it has all kinds of endpoints you can query for data. Again, this is not meant for humans; only computers should be looking at this. And the nice thing is that for this federal statistical office in Germany, there is actually an R package to query it, called wiesbaden. The package connects to four different databases: regionalstatistik.de, GENESIS at Destatis, which is the main federal office, the Landesdatenbank of Nordrhein-Westfalen, and the Bildungsmonitoring database, which lets you query all kinds of data about schools, such as the number of people who got their diploma and when they got it. The big issue with the German data is that you need to register for all of these databases. After registering they are free, except for the main Destatis database, which will charge you 500 euros per year to make queries. Since I don't want to pay 500 euros, I thought: if the Germans have this, the Dutch probably have it as well. And one of the nice things about the Dutch government is its big push for open data. So I looked, and the Dutch office for statistics indeed provides open data: the same kind of data as in the German database, in the Dutch language of course, but everything is free.
You don't even have to make an account. Their package is called cbsodataR; you install it with install.packages and then a library call makes all of the functions available. If you call cbs_get_toc, you get the table of contents, and if I ask for the dimensions of that table, it shows there are 5,038 different data sets you can query. I queried one of them, 83765NED, which is a database with crime statistics, or at least part of them. Let me see... I did not open the file yet, so let me close these windows and go to Notepad++, because I did make a small example for you. Let me find where I stored it: APIExample.r. This is the code that is also on the slide. Let me also bring back the R window, which I stupidly closed as well; that's what you get for running out during the break. So here we have our R window back; let me make it a little bigger so it fits nicely on the screen. First let's look at the table of contents, just so you get an idea of what's in there. It takes a little while to load, because it is a pretty big catalogue. If I look at dim of the table of contents and then at its first three columns, we have the date it was updated, the name, and there's also a description, but that doesn't fit on the screen. Let me show you the first 10 data sets: the titles are in Dutch, for example jobs of employees. (Question from chat: is the cbsodataR package a CRAN download? Yes, you can just install it from CRAN.) There are also data sets like people being fired because their company went bankrupt, or the number of jobs that are not filled.
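The CBS calls from this part look roughly as follows, assuming the cbsodataR package is installed; the calls need network access, so treat this as illustrative rather than something that runs offline.

```r
# install.packages("cbsodataR")   # one-time install from CRAN
library(cbsodataR)

toc <- cbs_get_toc()              # table of contents: ~5,000 data sets
dim(toc)
head(toc[, 1:3], 10)              # update date, identifier, title

crime <- cbs_get_data("83765NED") # the neighbourhood/crime table discussed here
```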
So a vacature is a job opening that is not yet filled, so these are open positions, and of course there are many, many more data sets; there are around 5,000 in there. Instead of the first 10, I can show you 10 to 20, and there's all kinds of different data. This one, for example, is the arbeidsvolume, the job volume: it lists how many people are working in certain areas of the economy in Holland. And if you go further down, like 1,000 to 1,010, you can find that there's also information on education: the number of students, what kind of education they're following, which year they are in, whether they have a migratory background, which generation they are, and these kinds of things. So I wanted to show you guys the data set for crime, so let's just download that. It takes a little while to download. Well, it's not that massive, but all of this data is there, it's available royalty free, you can query it, and you don't need to make an account. It's just free data to analyze. And I bet that there are a lot of interesting statistical artifacts that you could find in it, right? So you can write papers on it. The only thing that you have to do when you want to use this data in a publication is credit the CBS. So this is the data set that I just downloaded. You can see that it has a whole bunch of different columns, but the main column that I wanted to look at is, for example, the number of people living in a certain area. The top row is Holland, so the whole country. Then the first subdivision is a gemeente, which is slightly bigger than a Bezirk; anything which has a city hall is called a gemeente in Holland. Then you have a wijk, which is more like a Bezirk, and then you have a buurt, which is like a Viertel, right? So it's different layers.
So the first thing that I did, after downloading the data set, is look only at buurten, so only at Viertel-sized areas, right? Not the whole city, not the whole country. I say grep "Buurt" on the SoortRegio column, because that is the column which separates the different sizes of area. The grep command can be used to search for a pattern, and it returns the numbers of the rows that match. And then I just do a subset, right? I take my data and throw away part of it. So if I run this in R and we now look at sdata[1:10,], we see that we only have buurten left. These are all different little areas; the country is divided into all kinds of sub-areas. You can see, for example, that in this area in Aa en Hunze there are 3,395 people living, and there are many of these little areas, because Aa en Hunze is actually a gemeente, so a big area, subdivided into very little ones. It also tells you how many males and females live there, and which ages they have. So the thing that actually interested me was that they also have violent and sexual crimes in there, right? You might expect that an area which is very dense in population, with a lot of people living very close together, would have more crime, right? So you can use this data to analyze that. What I did is make a plot: on the x-axis you put the population density, and on the y-axis we plot the number of violent and/or sexual crimes, because that's a single category in the data. I had a very strong hypothesis saying that the more people live together in an area, the more crime there's probably going to be, right?
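The grep-and-subset step can be sketched with a toy data frame; the column names here (WijkenEnBuurten, SoortRegio_2, AantalInwoners) are assumptions based on the 83765NED layout, so check names(data) on the real download yourself:

```r
# Toy stand-in for the CBS table; column names are assumed from 83765NED.
data <- data.frame(
  WijkenEnBuurten = c("Nederland", "Aa en Hunze", "Annen", "Eext"),
  SoortRegio_2    = c("Land", "Gemeente", "Buurt", "Buurt"),
  AantalInwoners  = c(17000000, 25000, 3395, 1200)
)
rows  <- grep("Buurt", data$SoortRegio_2)  # row numbers that match "Buurt"
sdata <- data[rows, ]                      # subset: keep only buurt-level rows
sdata$WijkenEnBuurten                      # only the two buurten remain
```

On the real data the same two lines reduce the table from every administrative layer down to the Viertel-sized areas only.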
So when I make the plot, it looks a little bit like this, and it kind of surprised me, right? You see that with more people living in an area, there actually seems to be a kind of negative correlation, in a way. Violent and sexual crimes seem to occur in the less densely populated areas of the Netherlands, for some reason. Of course, you see that the y-axis covers a very wide range, like 200, 400 and 600, so what you would normally do is take the y-axis and take a logarithm of it, because in the end we want a bit more insight into the graph. If we take the log, then we see this: there are a lot of areas with zero crimes, then a couple of areas with one, and now we start seeing that the densest areas never have zero crimes. If there are 35,000 people living on a square mile, there are going to be at least some violent crimes, while among the smaller areas there are areas with no crime at all, but also areas which are relatively crime-ridden, right? So in theory, we could use this data to figure out what the most violent area of the Netherlands is. If I were moving to the Netherlands, I could use this database to decide: should I buy a house in this area, yes or no? Because I want to buy a house in a safe area, and I don't want to live in a very dangerous area. Now, this plot originally looked a little bit weird, because there are these two outliers at the top. Let me actually show you guys: if you don't take the logarithm, you see a couple of relatively big outliers here at the top. So there are two areas in Holland that seem to have almost double the crime of the other really crime-ridden areas, and I was just wondering which two areas these were, right?
So what I did is just say: take my sub-data and show me the areas where there are more than 600 crimes. What are these areas, the most dangerous areas in Holland? We don't need all of the columns. It turns out that the first area is located in Amsterdam, which makes sense: it's the capital of the country, so you'd expect the most violent crime there. The second area is actually in Breda, a relatively small city in Holland, in the south, I think. And the thing that surprised me the most is that when you look at the number of people living there, it is very, very low. There are not a lot of people living in these areas. So if you scroll through the whole data set, all the way at the back you see the number of violent crimes: in the first area there were 740 violent or sexual crimes reported, in the second one 638. But then look at how big the areas are and how many people live there: there are like 40 households in the first area and 25 households in the second, so in total only about 50 people living in the first area and 65 people in the second. These are areas where basically no one lives. When you take the area code, you can actually look up on the map where it is, and if you do that, you figure out that the first area... how is it called? The buurt is called Teleport, for some reason, and it's located in Bedrijventerrein Sloterdijk. A Bedrijventerrein is of course an area where there are companies; it's a business park, not a residential area. In chat: "Breda must be a trailer park." So let's check that out, right? Because that's something you can actually do: you just fill in the number. The buurt is called Emer, and it's a very, very wide area, 242 acres.
The buurt is called Emer, and you can just look it up on a map. So those are the areas where most violent crimes are reported in Holland: one is an area with more or less only industry, and the other one is a massive, massive area, so it's probably a dumping ground where people just dump their bodies, right? And if you find a body somewhere, the police will record the violent crime there, because they have no indication where it could otherwise have happened. But of course, this is a very coarse overview. We can go back and say: take everything with more than 200 crimes, and I only want to know which city it is in. That column in the data table is called Gemeentenaam_1, and then you can see the names of the different gemeenten. Then you can make a table, because R just has the table function, and you gsub away the trailing spaces, replacing them with nothing, because the names have spaces at the end. So here we have the table of all the most violent areas in Holland: six of them are located in Amsterdam, three in Rotterdam, and three in Haarlemmermeer, which is an area near Haarlem, I think. And of course, we can make it a little bit more populated, by asking for more than 100 violent or sexual crimes, but you can see that Amsterdam is winning by a large margin. So living in Amsterdam is relatively dangerous. The next most dangerous place in Holland is, I think, Rotterdam, appearing four times in the list, so four Viertels in Rotterdam are listed with more than 100 crimes. And the thing that I found actually funny is that Vlagtwedde, which is a very, very small area in the north of Holland, very close to where I'm from, actually has an area which is relatively rough.
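The table-plus-gsub trick from the demo, on a toy vector with the same trailing-space problem the real Gemeentenaam column has:

```r
# Gemeente names in the CBS data carry trailing spaces, so strip them
# with gsub before counting; otherwise "Amsterdam " and "Amsterdam  "
# would be counted as two different gemeenten.
gemeenten <- c("Amsterdam ", "Amsterdam  ", "Rotterdam ", "Amsterdam ")
counts <- table(gsub(" +$", "", gemeenten))  # high-crime areas per gemeente
counts
```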
So there's relatively a lot of violent and sexual crime there. Anyway, you can use this database for that, and you can do all of these types of analysis. Another one which I found relatively interesting is, for example, the average income per inhabitant against the relative number of deaths. Would you expect poor people to die more often than rich people? Is there a correlation? You can make the plot, look at it, and start reasoning about it. (I just copy-pasted it wrong; there's an extra closing parenthesis here, sorry.) So if you look here, you indeed see: this is the average income, in thousands of euros, I think. You see that death is relatively rare if you earn more than 100,000 euros per year, and that most of the deaths occur around a modal income. Does this have to be scaled by the number of people? No, because it's already relative. So you can see that the more money you earn, the lower your relative chance of dying. The same thing holds if you don't earn anything, but that is probably because young people tend to have a very low income, while people with a modal or high income are mostly in the middle age groups, so there's an age factor at play here as well. But you can see a pretty clear bump in the middle, and it seems that people earning 40 to 60,000 euros a year have a much lower chance of dying than people who earn 20 to 30,000 euros per year. So a very interesting data set; feel free to play with it. And there are a lot of these data sets: as I showed you guys, there are 5,038 of them in total. A lot of data to play with, all available for free. If you find really interesting correlations, you can publish papers about them.
So you can investigate all kinds of different social factors: income, migratory background, death rate, crime, education; all of these things are in this database. And I made another slide with this analysis: look only at buurten, so at Viertel-sized areas, then plot violent crimes, or relative deaths. It's all in Dutch, which is a drawback, because you have to learn to read some Dutch to use it. Misha says: but modal income is the largest group of the population. Yes, that is true, but this is relative deaths, so it is already relative to the number of people in each group. The only real big effect left in this plot is, of course, the age effect, because people who are young tend to earn less money than people in their 40s. You also have total deaths in the data, so you can scale it yourself. But the data is there, you can fit all kinds of statistical models on it, and you could probably spend your whole scientific career just analyzing this data. Of course, the problem is that it's only for Holland. But hey, you could analyze this data for the rest of your life, write papers, and have a pretty good scientific career just using the open data provided by the Dutch government. And this is very valuable data: collecting it takes a lot of money, and running the infrastructure to store everything takes a lot of money. One of the things that I found interesting as well is... no, I'm just going to let you guys look at it yourselves, because I do want to finish the lecture before five. So: a lot of data, lots of interesting stuff; calculate your correlations, fit your linear models, and these kinds of things, and you will find some very, very interesting results. All right, so first the definition of statistics, because we now saw that we can get a whole bunch of data for free.
So when we talk about data analysis, we talk about statistics. Isn't the whole thing also available in English? I didn't find a way to get it in English, but I have not looked at it in detail, because I just wanted to make a quick example and not spend six hours going through the whole database. The Dutch government is pretty good about this, so you might be able to get it in English as well. I don't know if the R package actually supports querying in English; it doesn't seem to. But it's an R package, there's help, and there's probably a way to turn it all into English. So: statistics. We have data; data is not the problem. There is so much data in the world that all the scientists in the world together can't analyze it. With this data we want to do statistics, because statistics is the study of the collection, analysis, interpretation, and presentation of data. Today we're going to look at univariate analysis, which is the analysis of a single variable: looking into things like age, or body weight, or, well, crime is not one thing, but murder. So univariate analysis is the analysis of a single variable. Once you start analyzing multiple variables together, we call it bivariate or multivariate analysis. So the first question: if you have a bunch of numbers, the first thing that people always calculate is the average, the mean, the Mittelwert in German. And a lot of people don't realize this, but there are several different means; there's not just the standard average, there are other means as well. So normally this would be a question for all the people listening: if you know how many different means there are, throw it in chat. I will take like seven seconds so that people can guess. One, two, three, four, five, six, seven.
And there are actually three. So there are three different ways to calculate a mean, and if someone says to you "the mean value is 15", you have to ask the question: which mean? Do you mean the arithmetic mean, which is what we normally call the average? That's just summing everything up and then dividing by the number of numbers. But you also have something called the geometric mean, which is calculated by multiplying all of the numbers together and then taking the nth root, where n is the number of numbers. So on the same numbers where the arithmetic mean is 42, the geometric mean can come out as 30. And then we also have the harmonic mean, also called the subcontrary mean, where we divide the number of numbers by the sum of the reciprocals of the measurements. So when you write down "I calculated the mean and the mean is 15", you have to tell people in your publication which mean you mean. Make sure the mean you mean is the correct mean. And why are there different ones? The average, of course, is very logical: if we have numbers which follow a roughly normal distribution, then the average is the middle of the distribution; there's more or less an equal amount above it and below it, not exactly, because the mean is influenced by the shape of the distribution itself. But the geometric mean is used when we combine different items that have different ranges. For example, think about a company. A company can have two measurements: an environmental sustainability score from zero to five, and a financial viability score which ranges from zero to 100.
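Only the arithmetic mean has a built-in function in base R, but the other two are one-liners; a minimal sketch on made-up numbers:

```r
x <- c(1, 9, 90)                  # made-up measurements
arith <- mean(x)                  # sum of the numbers / number of numbers
geom  <- prod(x)^(1 / length(x))  # nth root of the product of the numbers
harm  <- length(x) / sum(1 / x)   # n divided by the sum of reciprocals
c(arith, geom, harm)              # arithmetic >= geometric >= harmonic
```

The three always come out in that order (the AM-GM-HM inequality), and they only coincide when all the numbers are equal.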
So of course, if we would just take these two numbers, add them together, divide by two, and do that for multiple companies, then the financial viability would have a much bigger impact, right? Because you can have a score of 90 on your financial viability, and then the environmental sustainability doesn't really matter, because the result is going to be between 90 and 95, divided by two. This is why we use the geometric mean: when we want a single figure of merit, a single number for a company, but we have very diverse measurements whose ranges differ. We multiply the numbers together and take the nth root, and by multiplying the numbers together you don't run into the fallacy that some numbers have a different range. The harmonic mean is used in situations where you have rates and ratios, and there the harmonic mean is the truest average, the one closest to the real value; we're talking about statistics, so nothing is ever definitively true. I always use this example, which I just stole from Wikipedia: you have a car, and this car travels an unknown distance at 60 kilometers per hour, and then it travels the same distance at 40 kilometers per hour. What is the average speed over the whole trip? In this case, the harmonic mean is of course the truest mean, because the car covered the whole trip at around 48 kilometers per hour. If you would calculate the standard average, you would say 50 kilometers per hour, but that's not right, because it took more time for the car to travel distance X at 40 kilometers per hour than at 60, so the 40 gets more weight. And that is because we are talking here about the same distance at each speed, right? So we have a rate here, kilometers per hour, applied over a fixed distance.
So we have a rate, kilometers per hour, applied over an unknown but equal distance. If instead the vehicle travels for a certain amount of time at speed X and then the same amount of time at speed Y, the average speed is the arithmetic mean, because then we multiply the 60 kilometers per hour by the time and the 40 kilometers per hour by the time, and the per-hour part cancels out. So there's a big difference in how to answer these questions, and this is why in so many textbook questions, and in so many movies, a train leaves the station traveling at speed X: it's a very common problem to pose to students, and the students have to figure out that when you talk about fixed distances, the time component doesn't drop out. The average speed goes down, because the lower speed has a much bigger impact on the average than the higher speed. In R, only the arithmetic mean has a built-in function, and it is just called mean. You give it a vector of numbers, and there are two important parameters. The trim parameter is the fraction of observations that is trimmed from the upper end and the lower end before the mean is computed. This is to get rid of outliers, because you can imagine that if everyone earns 30,000 euros and then there's one guy who earns 6 billion, the 6 billion will have a massive influence on the mean. And then you have na.rm, which indicates whether missing values should be stripped before the computation proceeds. The median, closely related to the mean, is the number which separates the higher half of a data sample from the lower half. The median can be found by arranging all values from lowest to highest and picking the middle one.
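The car example above can be checked numerically: the same distance at 60 and at 40 km/h gives a true average speed of 48, not 50:

```r
speeds <- c(60, 40)                             # km/h over two equal distances
harmonic   <- length(speeds) / sum(1 / speeds)  # 48 km/h: the true average speed
arithmetic <- mean(speeds)                      # 50 km/h: wrong for equal distances
c(harmonic, arithmetic)
```

The harmonic mean lands below 50 exactly because the slower leg takes more time and therefore carries more weight.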
In the case of an even number of values, the median is usually defined as the arithmetic mean of the two middle values, and that is because with an even amount of numbers there are two middle values, while with an odd amount there is one. Again, R provides the median function: if you want to calculate the median of a distribution, you just call median with a vector of numbers. It also has the na.rm parameter, but it doesn't have a trim parameter like the mean does, and that is of course because trimming is nonsensical for the median: the median just splits the higher 50% from the lower 50%, so if you throw away 5% at one end and 5% at the other, the median doesn't change. Then another number which we often report for the central tendency of the data is the mode. The mode is the value that appears most often in the data set, and its numerical value is the same as the mean and the median when we talk about a Gaussian distribution. So for a normal distribution, the mean is the same as the median is the same as the mode. Of course, the mode will be very different from the median and the mean when the distribution is highly skewed. And the nice thing is that a sample can have multiple modes: you can have multiple values that occur equally often in a data set. All right, so then dispersion. The mean, the mode, and the median are there to pin down the middle of the distribution, more or less. So in a paper you write: we measured the body weight of mice and found that the average mouse weighs 20 grams, plus or minus, and then you give a dispersion measurement.
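A small sketch of mean with trim/na.rm, median, and the mode on one vector; note that base R has no mode function for data (its mode() reports the storage type), so the table recipe at the end is a common workaround:

```r
x <- c(2, 3, 3, 4, 5, 100, NA)
mean(x, na.rm = TRUE)              # 19.5: dragged up by the outlier 100
mean(x, trim = 0.2, na.rm = TRUE)  # 3.75: one value trimmed from each end first
median(x, na.rm = TRUE)            # 3.5: two middle values averaged, outlier-robust
counts <- table(x)                 # frequency of each value (NA is dropped)
as.numeric(names(counts)[counts == max(counts)])  # 3: the mode (can be several)
```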
So there are multiple dispersion measurements, and the most common one is the range. The range of a data set is the difference between the largest and the smallest value: if the smallest mouse in our data set was 10 centimeters and the biggest one was 50 centimeters, then the range is 40 centimeters. In R you have the range function, which gives you the minimum and the maximum value, but the range itself is defined as a single number, so if you want the range according to the mathematical definition, you have to type diff(range(x)). Furthermore, when we talk about medians, we also often talk about quantiles. The median is more or less the middle cut point, lower 50% and higher 50%, and of course we can do the same thing with other group sizes, for example 25% batches, and then we're talking about quantiles. A q-quantile is when you divide your ordered data into q essentially equally sized subsets: you can have four quantiles, five, six, seven; it just depends on how many groups you make in your data. The quantiles are the data values marking the boundaries between the consecutive subsets. So a 5-quantile means cut points at 20% of your data, then 40%, 60%, 80%, and 100%. To compute quantiles, there are nine different methods, and you don't have to know any of them; just know that quantiles are not a completely solved problem: depending on the computational method, the quantiles can come out slightly different. R provides one method by default. So if I just draw a uniform distribution of a thousand numbers and I ask for the quantiles, the 0% quantile will be at about zero, the 25% quantile at about 0.25, then 0.5, 0.75, and 1. Here we asked for the quartiles, the 4-quantiles.
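A quick sketch of range and quantile on a simulated uniform sample, matching the numbers above:

```r
set.seed(1)                 # make the draw reproducible
x <- runif(1000)            # 1000 uniform numbers between 0 and 1
range(x)                    # smallest and largest value, as a pair
diff(range(x))              # the range as a single number, per the definition
quantile(x)                 # default: the quartiles (0%, 25%, 50%, 75%, 100%)
quantile(x, probs = seq(0, 1, by = 0.2))  # 5-quantiles: cut points in 20% steps
```

The type argument of quantile selects among the nine computation methods; the default is type 7.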
Often when we describe data in papers we also want to give the variance. The variance is a measurement which gives you a single number for the spread of the values relative to the mean of the distribution. A variance of zero means that all of the values are identical, and variance is always non-negative; you can't have a negative variance. In R you can use the var function. Be aware that variance comes in two different forms. They are very similar, but if you have measured the whole population, for example you measured all of the mice in your mouse house, then you use the first formula, dividing by n, because you covered the whole population. If you want to generalize from the mice living in your mouse house to all mice, then you only have a finite sample, because you cannot measure all mice in the world. So what you do then is divide by n minus one instead, and this is called the sample variance. If you measure the whole population, there are no degrees of freedom that you lose, but with a finite sample you lose one degree of freedom, which is denoted by the n minus one, and this minus one comes from the fact that you also had to estimate the mean from the same data. So the formula is: the sum of the squared deviations from the mean, divided by n for the population version, or by n minus one for the sample version. Besides variance, we also have the standard deviation, and this is probably the most common way of reporting your data: my data had a mean of five and a standard deviation of two. It is a measurement used to quantify the amount of variation or dispersion of a set of data values.
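The two flavors differ only in the denominator; R's var() is the sample (n − 1) version, so the population version has to be written out by hand:

```r
x <- c(4, 8, 6, 5, 3)
n <- length(x)
var(x)                     # sample variance: sum((x - mean(x))^2) / (n - 1) -> 3.7
sum((x - mean(x))^2) / n   # population variance: divide by n instead     -> 2.96
var(x) * (n - 1) / n       # the same population value, derived from var()
```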
A standard deviation of zero indicates that all points are equal to the mean, and a small standard deviation means that points tend to be very close to the mean. In R you have the sd function to calculate the standard deviation, or you can just take the square root of the variance, because the standard deviation is defined as the square root of the variance. There are arguably better measurements, like the average absolute deviation, but I don't want to go into that, because the standard deviation is more or less the standard way of doing it. You also have the standard error: the standard error corrects for the number of measurements that you have, which the standard deviation doesn't do. And of course, because the variance comes in two different flavors, the standard deviation also comes in two flavors: a population standard deviation and a sample standard deviation. All right, so when we talk about distributions, we also have to talk about outliers. Outliers are defined as observations that are very distant from the other observations, and there can be real reasons why something looks like an outlier. One of them is genuine variability in the measurements: we can have, for example, very heavy-tailed distributions, which relates to kurtosis. We will come back to kurtosis, but it might be that you have a phenotype which looks relatively normal when you look into the population, yet a lot of individuals exceed several standard deviations at the top or the bottom. Of course, there are also outliers that come from experimental error, for example you pipetted too much into your cup, or too much of a chemical, or made some other experimental mistake; that can cause outliers as well. And there are also recording errors, where you for example misplace the comma.
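Standard deviation and the standard error of the mean, side by side:

```r
x <- c(4, 8, 6, 5, 3)
sd(x)                    # sample standard deviation
sqrt(var(x))             # identical: sd is the square root of the variance
sd(x) / sqrt(length(x))  # standard error of the mean: shrinks as n grows
```

The last line is why the standard error "corrects for the number of measurements": the sqrt(n) in the denominator rewards collecting more data, while sd itself does not.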
So sometimes when you're writing things down in the lab, or typing them into a computer with your gloves on, you put the comma in the wrong place. Instead of recording that the mouse was 30 centimeters, the comma shifts and the mouse becomes three centimeters: you wanted to type 30,0 and you typed 3,00. These are very common errors. So, is an observation an outlier? That is a very subjective question, and it depends a lot on what is causing the outlier. For every outlier in your data, you need to investigate where it was produced: is it a comma error, is it an experimental error, or do I just have a very heavy-tailed distribution, where most individuals look more or less normally distributed but some are very different from the others? And generally, variability in measurements is not considered an outlier; those are real measurement values. If you happen to find a mouse which happens to be 70 centimeters, then it is much, much bigger than the rest. It could be an outlier, but it might not be: if you go back to the mouse house and look at the mouse, and it really is 70 centimeters, the size of a big rat, then it's not a real outlier, because it's really that big. But if you have a real outlier, caused by experimental error or recording error, then generally you want to get rid of the value, and there are two major ways of doing that. One of them is trimming: taking these extreme values and considering them as missing, so just replacing the value with an NA. And then there is Winsorizing, which is one of my favorite terms in statistics, because it just means: I measured a mouse of 70 centimeters, and that's impossible, because mice generally are between 10 and, like, 40 centimeters or something.
Well, not 10 to 40; more like 5 to 17, I would say. They're not that big, right? Mice are like this. But Winsorizing means that you take your numbers and say: this 70-centimeter mouse, I don't believe it, I'm going to write down 7, because I believe it's a comma error. And this is an acceptable thing in science: you can do 100 measurements, look at them, and decide to replace a value you don't trust, because there are no hard rules on exactly how you should Winsorize your data; formally, it usually means replacing the extreme values with the nearest values you still trust, for example clipping everything above the 95th percentile to the 95th-percentile value. This was invented by Charles P. Winsor, who lived at the beginning of the 20th century, and he came up with the idea that in science there are sometimes values which you don't trust, and that changing them is perfectly fine as long as you write it down. As long as you mention in your publication that you Winsorized the values to get rid of outliers, and which values you changed, it is a perfectly valid, if somewhat questionable, statistical procedure. Good, so when we talk about normal distributions, we also want to talk about the shape of normal distributions. I hope this is visible, because it's an image whose colors I had to invert. Skewness is a measurement of the asymmetry of a distribution: we can have a long tail or a fat tail on one side, and skewness can also be undefined. So we can analyze the skew of a distribution, and it can be negative or positive. So it looks fine? Okay, good that people can see it. I think it's a little bit blurry; I don't know what happened with the image.
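Trimming versus Winsorizing on the 70-centimeter mouse, as a hand-rolled sketch (the psych package also ships a winsor() function that clips at chosen percentiles; the 40 cm cutoff here is just an assumed plausibility limit):

```r
x <- c(12, 14, 13, 15, 70)   # the 70 cm mouse: a suspected comma error
# Trimming: declare the implausible value missing.
trimmed <- replace(x, x > 40, NA)
# Winsorizing (one common variant): clip the value to the largest
# measurement we still trust, instead of discarding it.
winsorized <- replace(x, x > 40, max(x[x <= 40]))
trimmed      # 12 14 13 15 NA
winsorized   # 12 14 13 15 15
```

Either way, the choice of cutoff and replacement is yours, which is exactly why it must be reported in the publication.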
But hey, in R you can just load the psych library and then call skew(x), where x is a single numeric vector of measurements, and it will tell you whether there's negative or positive skew. Positive skew means the distribution has a longer tail on the right, above the mean; negative skew means the longer tail is on the left, below the mean. Kurtosis is another thing that can happen to normal distributions. Kurtosis describes the peakedness of your data: your data is still roughly normal, but instead of the perfectly normal shape, more individuals sit close to the mean than you would expect given the number of values you have, giving a very sharp peak — that's positive kurtosis. Negative kurtosis is the opposite: individuals do not surround the mean as tightly as you would expect from a normal distribution, because the shape of a normal distribution is more or less fixed. We have names for this: a Gaussian normal distribution is mesokurtic, with zero excess kurtosis; leptokurtic distributions have positive excess kurtosis; and platykurtic distributions have negative excess kurtosis. Again, the psych library has a function for this, kurtosi, and it gives you a number: positive means the distribution is pulled up into a peak, negative means it's pushed down and flattened. So why would you want to calculate skewness or kurtosis at all? Because many statistical models assume a normal distribution.
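As a sketch of what these numbers mean, here is sample skewness and excess kurtosis computed by hand in base R. The psych functions skew and kurtosi compute essentially these moment-based statistics, up to small differences in the exact estimator; computing them directly keeps the example free of external packages.

```r
set.seed(1)
x <- rexp(1000)     # exponential draws: strongly right-skewed, heavy peak

m <- mean(x)
s <- sd(x)
skewness        <- mean((x - m)^3) / s^3       # > 0: longer tail on the right
excess_kurtosis <- mean((x - m)^4) / s^4 - 3   # > 0: leptokurtic, more peaked than Gaussian

skewness          # positive for this right-skewed sample
excess_kurtosis   # positive: leptokurtic
```

For a normal sample both numbers hover around zero; the exponential's theoretical values are 2 and 6, so with 1,000 draws both come out clearly positive.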
They assume that your data is Gaussian distributed, or that the error term after fitting some factors follows a normal distribution. If this normality assumption holds, we have more statistical power, because we don't have to switch to nonparametric statistics. If you want to test in R whether something is normally distributed, you can use the Shapiro-Wilk test of normality. It is built into base R — no library to load — and it's called shapiro.test. It only works for sample sizes up to 5,000 measurements; above that, shapiro.test refuses and you have to switch to D'Agostino's K-squared test, which is based on sample skewness and sample kurtosis. So if you have a distribution which is kind of normal but kind of skewed, you can use the D'Agostino K-squared test; it is provided by several external packages — the moments package, for example, ships an implementation of D'Agostino's skewness test. One thing to note about the Shapiro test, and I showed a little example here: if I draw 100 uniformly distributed numbers and run shapiro.test on them, I get a significant p-value. If I draw 100 numbers from a normal distribution and run the test, the p-value is not significant. That is the opposite of what you might expect: a significant Shapiro-Wilk test means your data is not Gaussian. This is because the null hypothesis H0 is that the data is normally distributed, while the alternative is that it is not. So interpret the p-value correctly: non-significant means consistent with normality, significant means not normal.
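The Shapiro-Wilk behavior described above can be checked directly. A small sketch — the exact p-values depend on the random draw, hence the seed, and the comments describe the typical outcome rather than a guaranteed one:

```r
set.seed(42)
u <- runif(100)   # uniform draws: not normally distributed
n <- rnorm(100)   # normal draws

shapiro.test(u)$p.value   # typically small: rejects H0 "data is normal"
shapiro.test(n)$p.value   # typically large: no evidence against normality
```

Remember the direction: the small p-value belongs to the data that is *not* normal.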
Good, so plots. We're going to talk much more about plots in the next lecture, but I did want to show you some things that I think are useful, because plots are very useful for visually exploring data: you can look at the distribution, at outliers, and at other weirdness that might be going on in your data. In R there are four or five major plot functions. First, the plot function. The plot function actually does two things: it creates a window and then it calls points on the vector you provide, and points is nothing more than a function that draws a dot/line plot of the data in x. So I call plot once, and after that I can call points as many times as I want, and each call just adds to the existing window. Calling plot again clears the window and starts over; to add more afterwards, use points. Then you have the axis function to control what is written along each side of the plot. In R a plot has four sides: side 1 is the bottom x-axis, side 2 is the left y-axis, side 3 is the top, and side 4 is the right — so you can put axes on the opposite sides as well. I would never do that myself, because it makes plots very hard to interpret, but it can be specified. You give axis the at argument — at which values do you want an axis tick? — and then labels, the things written under each tick. When we talk about plots we also have to talk about legends, so you can use the legend function to add a legend to your plot. Furthermore we have the image function, which is for two-dimensional data: it creates a heat-map-like 2D plot, and here you can use breaks to set the color boundaries.
For example you can say: everything from 0 to 1, make white; everything from 1 to 12, color purple; and from 12 to 16, make it black. You always have to specify one more break than you have colors, because n colors need n + 1 boundaries to cover the whole range of the distribution. So, some examples. The box plot we see here is a box-and-whiskers plot, and you see the notches — the notches are really useful. When I make box plots, I always put in notches. A notch runs from below the median line to above it, and it is based on an approximate confidence interval of the median (roughly the median plus or minus 1.58 times the interquartile range divided by the square root of n). That makes it a very nice visual way of seeing whether there is a significant difference between two groups: if the notches of two boxes overlap, like here, there is no real evidence that the medians differ; but if the notches do not overlap, like there, you can be reasonably confident that the two medians are not the same — that there is a significant difference between these two groups. So it's a very quick way to get an idea of whether there are significant differences in your data: just call the boxplot function with notch = TRUE and you can check group against group visually. We also have the varwidth parameter: if you set varwidth to TRUE, then the horizontal width of each box reflects the number of measurements in that group.
The fatter a box, the more measurements it contains; a very skinny box holds only a few. So it also gives you an idea of how much data is in each of the groups. If you want to show this even better, you can use violin plots. I love violin plots myself, and publications love them too. A violin plot is the same idea as a box plot, but it shows the distribution much better. A box plot is great when your data is normally distributed, but if your data is not normal — like here, a bimodal distribution with two or three normals woven into one — then a violin plot is much more informative. Now we can see that although the average is zero, the mode of this group is around minus five. And for group number one the average is also around zero, slightly higher, but the mode is around plus or minus five — I don't know which of the two is exactly bigger. The point is that the violin shows the shape of the distribution: the wider the area, the more measurements fall there. So there were a lot of measurements around five and only very few around eleven. Here is how you can do it. Violin plots are not standard in R, so you need the vioplot package: install it first, then load it with library. As an example, I generate three normal distributions in one vector, plus a grouping vector of zeros and ones, and then I just say: make the plot — take my values in x, split them into the two groups in y, make the first one orange and the second one purple. If you want to get an even better idea of your distribution, you can use the histogram. I think everyone knows what a histogram is.
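A minimal sketch of the notched, variable-width box plot described above; the group sizes and means are made up, and the violin version is commented out because it needs the external vioplot package.

```r
set.seed(7)
g1 <- rnorm(50,  mean = 0)    # small group
g2 <- rnorm(200, mean = 1.5)  # larger group, shifted mean

# Draw to a file so the example is self-contained
png("boxes.png", width = 600, height = 400)
boxplot(list(A = g1, B = g2),
        notch = TRUE,     # notches: approximate 95% CI of each median
        varwidth = TRUE)  # box width reflects group size
dev.off()

# With the vioplot package installed, the violin version would be:
# library(vioplot)
# vioplot(g1, g2, col = c("orange", "purple"))
```

With these means the notches should not overlap, which is the quick visual cue that the medians differ.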
It shows you the number of measurements — or the density — within different breaks, so you can represent either densities or frequencies. The breaks you can set yourself: you can say the first bar should run from 0 to 5, the second from 5 to 7.5, the next from 7.5 to 10, and so on — you control the width of each individual bar in the histogram. It also has a plot parameter, because sometimes you don't want the histogram drawn at all: hist returns all of its computed data, so you can call it with plot = FALSE just to get things like the breaks, the counts, and the densities of the different bins. If you want to add one of those smooth curves on top, you can draw it over the histogram — for example with the curve function, or with lines on a density estimate — to show the shape of the data distribution. Then images and heat maps. image creates a grid of colored rectangles, and heatmap is similar but also does automatic clustering and lets you add color annotations to the sides of the plot. So if you have different groups, you would generally use heatmap rather than image, although image is much more flexible in what you can do with it. If you want to make plots, there are a lot of things in R you may want to configure, and you use the par function for that: par lets you set or query graphical parameters. Here, for example, I set font.lab = 2 and font.axis = 2 (bold axis labels and bold tick labels), and cex.axis = 1.5 and cex.lab = 1.5 (1.5× magnification for both). You can set all of these plotting options using the par function.
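Going back to the histogram point above, here is a sketch of using hist for computation rather than drawing — the returned object holds the breaks, counts, and densities — and of overlaying a normal curve on the drawn version. The data and bin boundaries are made up.

```r
set.seed(3)
x <- rnorm(500, mean = 10, sd = 2)

# plot = FALSE: compute only, draw nothing
h <- hist(x, breaks = seq(0, 20, by = 1), plot = FALSE)
h$breaks    # the bin boundaries we asked for
h$counts    # number of measurements per bin

# Drawn version (freq = FALSE gives densities) with a normal curve on top
m <- mean(x)
s <- sd(x)
png("hist.png")
hist(x, breaks = seq(0, 20, by = 1), freq = FALSE)
curve(dnorm(x, mean = m, sd = s), add = TRUE)
dev.off()
```

Note that m and s are computed before the curve call on purpose: inside curve, the name x refers to the plotting grid, not to the data.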
So you can set, for example, magnification, the font family, the margins, the number of rows and columns of sub-plots, the size of each dot, and whether the x-axis or the y-axis should be drawn at all. Everything about base R plots is configurable, and you configure it with par before you start plotting. So how do you plot in R — how do you make an image you can use for publication? You start by opening an output device. For example, I want a PNG image, so I call the png function and give it a file name, a width, and a height. Then I set up the parameters: for example font.lab = 2, which means the axis labels are drawn in bold, and font.axis = 2, so the tick labels are bold as well. Then I create an empty plot: in this case a plot running from 0 to 100 on the x-axis and 0 to 1000 on the y-axis, with type = "n" — "n" for none — because I don't want anything drawn in it yet. And then I just use the points function. A lot of people in R run into issues because they always reach for the plot function, but plot does two things at once: it makes a window and adds stuff to it. Generally you want to start with an empty plot: specify the x-range, specify the y-range, and say type = "n". That makes an empty window with axes but nothing plotted, and then you use the standard functions — points, lines, arrows — to add your content yourself. Once you're done with your plot, you call dev.off(), and at that point the plot you built is saved into the PNG you opened as the output device. So plotting is a five-step strategy: open an output device, set the parameters.
So: which font type do I want, should things be bold or italic, which family, and so on. Then make an empty plot window, then add the content yourself — points, lines, arrows — and finally call dev.off() to save the plot to disk, because all of it lives only in memory until dev.off is called. If you want multiple plots in one figure, you can use par to specify the number of rows and columns. Here I say mfrow = c(2, 2): two rows, two columns, so four plots. Then I can call the plot function four times, and each call fills the next slot in the plot window. With mfrow the slots are filled row-wise — plot one here, plot two next to it — and with mfcol they are filled column-wise — plot one here, plot two below it. That's the only difference between the two. If you want more complex arrangements, for example a panel plot, you can use the layout function. You give it a matrix — say a 2-by-2 matrix containing a one, a one, a zero, and a two — and the layout then looks like this: the first plot goes on top and uses the whole top area, the second plot goes in the bottom-right, and the bottom-left cell is set to zero, so nothing can be plotted there. Of course we could have put a three in that cell instead, and then plots one, two, and three would each get their own region.
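Putting the five steps together with a layout of two panels — a sketch of the workflow described above, where the file name and the x^1.5 data are made up for illustration:

```r
x <- 0:100
y <- x^1.5                          # some made-up data, topping out at 1000

# Step 1: open an output device
png("figure1.png", width = 800, height = 600)

# Step 2: set graphical parameters and the panel layout
par(font.lab = 2, cex.axis = 1.5)
layout(matrix(c(1, 1, 0, 2), nrow = 2, byrow = TRUE))  # wide top panel, bottom-right panel

# Steps 3 and 4: empty plot (type = "n"), then add the content ourselves
plot(c(0, 100), c(0, 1000), type = "n", xlab = "x", ylab = "y")
points(x, y)
legend("topleft", legend = "x^1.5", pch = 1)

plot(c(0, 100), c(0, 1000), type = "n", xlab = "x", ylab = "y")
lines(x, y)

# Step 5: write everything to disk
dev.off()
```

The bottom-left cell of the layout matrix is 0, so it stays empty, exactly as in the slide example.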
You just have to play around a little with the layout function, but it's really useful if you want to make multi-panel plots where you have, for example, a heat map, a histogram, and then a box plot or something else showing the data distribution. Now, about the assignments: there will be a lot of coding effort if you do everything with for loops or while loops. In this round of assignments, try to use the apply function instead, because it will save you a whole bunch of typing — you don't want to write a massive pile of for loops. For example, if you want the column means, you can say: apply, to the matrix I loaded, over the columns, the function mean. The apply function is defined like this: apply(X, MARGIN, FUN), where MARGIN 1 means rows, 2 means columns, and c(1, 2) means both — though in practice we always use 1 or 2 — and FUN is the function to apply to each row or column. This will save you a lot of time in the assignments. You can always solve them with a for loop or a while loop, but the code will be much shorter with apply, so reach for apply first and fall back to loops only if you have to — just remember a loop is not the optimal solution for many of today's assignments. And of course we can use apply and subset together. Again, this is for the assignments: if you want means for a certain set of columns, use the subset function — from my matrix, select column one and column two — and then apply the function mean over the columns of that subset directly. And we can also use this to filter.
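The apply and subset patterns just described, sketched on a small made-up matrix — the column names and the temperature threshold of 25 are illustrative, not from the assignment data:

```r
m <- cbind(weight      = c(10, 12, 14, 16),
           length      = c(5, 6, 7, 8),
           temperature = c(20, 30, 22, 28))

# Column means of the whole matrix
apply(m, 2, mean)

# Means of just the first two columns
apply(subset(m, select = c(weight, length)), 2, mean)

# Filter first (temperature below 25), then take column means —
# no intermediate matrix needs to be stored
apply(subset(m, m[, "temperature"] < 25, select = c(weight, length)), 2, mean)
```

For the matrix method of subset, the filter condition is an ordinary logical vector (hence the explicit m[, "temperature"] < 25), while select can use the column names directly.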
We discussed the subset function in the previous lecture, I think. So: subset, from this matrix, everything where the temperature column is below 25, keeping only columns one and two, and then calculate the means. You can use subset directly inside the apply call, so you don't have to store an intermediate matrix, which would be a bit of a hassle in the assignment. Good, that's what I wanted to say about the assignments for today. Next week we will focus on making plots, and I will only show you base R, because I do not use ggplot2. I never learned it — when I started programming in R, there was no ggplot2 yet — so I don't know how to make plots with it. I would love to learn, but I just can't get the way ggplot2 works into my head. We will focus on making plots suitable for publication; that's what next week's lecture is about: beautiful plots you can use in papers and presentations. So definitely join next week, and ask a lot of questions — ask for examples. If you're in the stream and you think, oh, this is difficult, show me an example, then we can do that. Attend the lectures, because that gives you the option to ask questions. I think that's about it. Next week we continue with plots, the different parameters, how they affect your plot — all base R. If you really want to learn ggplot2, drop me an email or leave a comment under the YouTube videos, and if enough people want a lecture on it, we can do that. But in my opinion base R lets you make all the plots you need, and we'll discuss how. Good, so that's it from me for today. I hope it was not too boring.
We lost a few people, but there are still some watching, so that's good. From my side, if there are no questions or requests for me to program something, then this is about it. Today we talked about descriptive statistics: what a median is, what standard deviations are, and why we use these numbers — we use them to tell other people what our data looks like without showing them an image, although showing an image, like a violin plot, is of course much clearer. What is the difference between base plots and ggplot2? It's a whole different philosophy. Base R graphics follow the painter's model, which means you work layer after layer: first you make an empty plot, then you add some points, then more points, and the new points land on top of the old ones. If you draw an arrow, that arrow goes on top of everything below it. So you build a plot layer by layer, the same way you would paint on a canvas. ggplot2 works very differently: it uses a grammar of graphics to specify what the plot should look like. You declare: I want a heat map, it should group by this, it should show that — and it's one call, not subsequent layers; it's a specification, and ggplot2 makes the result fit into the window you have. In base R you sometimes get an error saying that what you're trying to draw does not fit into the plot window, while ggplot2 does the scaling for you — it will adjust the font size, for example, so everything fits nicely and nothing overlaps. In base R, you are the one making the plot, so you are the one responsible for that.
And base plots are totally free: you start with an empty canvas and you can put anything on it. The stuff we did today with moving the Obama picture around is something you could never do in ggplot2, because it relies on painting layer on top of layer on top of layer, and ggplot2 doesn't let you smoothly transition from one state to another like that — it lets you make multiple plots side by side, the way we did, but not that. So there's a difference of philosophy behind it. And ggplot2 just looks more beautiful without you spending much time on it. I always say ggplot2 is for people who are too lazy to learn base R — which is not entirely true, because you need to spend a lot of time learning ggplot2 as well, and if you're good at it you can do a lot with it. It's just that for me it's another package to learn, another hundred pages of help files to read and master, and I don't feel I need that investment, because I can do anything I want in base R — which, in my opinion, also gives you more freedom. All right, we're down to five viewers at the moment, so I'd like to thank all of you for being here and listening to me for, again, almost three and a half hours. We still have to decide whether we want to do the assignments live. I think it makes sense, but only when people ask questions — otherwise I could just copy-paste my answers in. If it's interactive, it makes sense; if it's just two hours of me coding, that might be interesting for some people, but it's more fun for me to show you my solution and then discuss why I chose it. Thank you for the lecture.
Thank you for being here. Like that's why we do it, right? Without you guys, there wouldn't be a lecture. So thank you guys for being here. Thank you for registering and thank you guys for showing up every week. And then I will see you guys next week. And as said, next week we will have plots, plots, plots, and even more plots. So thank you for being here and I wish you a very, very good evening. The weather here is still beautiful. So if you have the opportunity, go outside and get some fresh air. And I will see you all next week.