Start recording, good. All right, so welcome everyone to the fifth lecture, which means that we are one third of the way through Bioinformatics for Plant and Animal Sciences. The stream will start soon. No, we've started streaming. So for today, I made a basic R introduction for you guys, so that next time we have R in the assignments you can be better prepared and, well, enjoy it more, because programming is an enjoyable experience, I think. All right, so the overview of today is using R as a basic calculator, and then the different types and the type system in R. One of the things which will screw you up is the type system in R. R has many different types and it auto-converts from one type to the other, and I've been programming in R for something like 12 years now, since the beginning of my master, but the type system is still difficult. Because of the auto-conversions going on, it will screw you over many, many times, and even if you think that you're capable of handling it, you just get screwed over even more. So we will talk about types, about indexing, about variables, how to make scripts, clean coding, one of my pet peeves, because clean coding is something that really makes coding better. We will talk about control structures and escaping. So it's more or less the first two or three lectures of the normal R course fused into one, so we have to get a move on. But first, I want to go over the previous assignments. So let's go to Notepad++, let me open it up for you guys and show you the thing as well. All right, so these are my answers. I will put them on Moodle, of course. The thing is, the way that I start is I always start doing something like this.
So I have a couple of hashtags, which are comment characters in R. And then I just say: answers to the assignments of lecture four, and the guy who made it is me. Normally you would put in a copyright statement as well, but since these are assignments, there's no real need for one. My R scripts always start with a setwd() at the top, because you need to move to the folder where your files are. So in this case, it is on my D drive, in projects, lectures, bioinformatics and animal breeding, and I'm still loading the old data set from the 2016-2017 course. And then I have a folder called assignments in there. So the first line after that is generally loading in the data, unless we have to load some packages, but loading packages comes later because there wasn't a separate assignment for that. So the first one is to just load in the microarray data. And then I use the dim function, which stands for dimension, to get the dimensions of the microarray. It's always good to ask for the dimensions after loading, so you know that it loaded in the whole data set. Sometimes you load in a data set which you know has like 20,000 rows, so 20,000 lines in there, and when you ask for the dimensions, it only gives you like 400 or 500. So then it stopped loading halfway through. That sometimes happens in R, especially if you have a comment character, like a hashtag, somewhere in your data set; it might break on the hashtag. Or there might be a hidden character like a \0 or some Mac OS X or Linux character in there, which will break the loading of the data set. So the first thing that I do when I load in a data set is always ask for the dimensions. And in this case, the assignment also asks you for the dimensions, right?
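The workflow above — load the data, then immediately check dim() — can be sketched like this; the file name and the simulated matrix are stand-ins, since the real course data set isn't reproduced here:

```r
# Stand-in for the real course file (hypothetical path):
# microarray <- read.table("microarray.txt", header = TRUE, sep = "\t")

# Simulated replacement: 6 samples in the columns, 6000 probes in the rows
set.seed(1)
microarray <- matrix(rexp(6000 * 6, rate = 1 / 1000), nrow = 6000, ncol = 6)

dim(microarray)      # rows first, then columns
dim(microarray)[1]   # number of probes (rows)
dim(microarray)[2]   # number of samples (columns)
```

If dim() reports far fewer rows than you expect, the load was probably cut short by a stray comment character or a corrupted download.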
Because we wanted to know how many samples there were, so how many columns there are in the file, and how many mRNAs have been measured using how many probes. All right, so let's open up the R window for you guys, and not the Notepad++ window. So let's start by setting the working directory and loading in the microarray data. And then it tells me that — so the first number is always the rows and the second is the columns. You can have multidimensional arrays as well in R, so we could have four or five or six dimensions, but in this case we only have two dimensions, so it's really a matrix. And well, the answer to 1a is that there are six columns in our file, so six different samples, and there are 22,283 rows, meaning that on the microarray that we use, there are 22,283 probes. There could be fewer genes than that, because sometimes the same probe is on a microarray a couple of times, and I think this data set is also filtered for the standard control probes. You always have control probes called dark corner and bright corner, which are the negative and positive controls, so they might still be in here, they might be removed. But there are generally a lot of control probes in there. All right, so the next exercise was to make a bit of a boxplot, and then I will make the window a little bit smaller so that we have room for the boxplot. So the code was actually in the assignment, and someone had a problem with that, right? Were you the one that got the invalid argument for the boxplot? I think so, but the boxplot should be relatively straightforward, so this is how it looks when I do it. So I have las, and then I have cex.axis = 0.7, which is the magnification of the axis text, so you can see that here everything fits, and then you have las = 2, and las = 2 rotates the labels. So did you use the exact same code or did you use something else? Did you copy the code from the Word document?
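As a minimal sketch of those two boxplot arguments (simulated data, since the real file isn't included here):

```r
set.seed(1)
mat <- matrix(rexp(600, rate = 1 / 1000), nrow = 100, ncol = 6,
              dimnames = list(NULL, paste0("sample_", 1:6)))
# las = 2 rotates the axis labels 90 degrees;
# cex.axis = 0.7 shrinks the axis text to 70% of the default size
b <- boxplot(mat, las = 2, cex.axis = 0.7)
```

boxplot() also returns its summary statistics invisibly (b$stats has one column per sample), which is handy for checking a plot programmatically.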
Let me try that, let me open up the Word document and copy-paste in the code from the Word document. No, I don't get an invalid argument then either, so I don't know. Do you still have the code that produced it? Because there shouldn't be any errors there. So the next two questions were to figure out what these two parameters do, copying the code from the assignment. I'm copying it from the Word file, you can't see that of course, but when I copy it from the Word file it still works perfectly fine. Wait a second, I'm looking at the Word document, but I gave you guys the PDF. Let me open up the PDF and copy it from there. It might be that there's a hidden PDF character in there. So let me select this, copy it, and paste it in. No, it still looks exactly the same. Are you using Windows or are you using something else? Because copying from PDFs can behave differently on Windows. So let me see, what if I copy an additional bit, like the 2a label? No, I don't get that error. Normally you would get an invalid argument error if you type things in wrong, but even then it would mostly just do nothing rather than give you an error. So I'm really wondering what's going on there. Was it the first boxplot or one of the other boxplots that you had to make? It's really weird. Your dim of the microarray also gave 362? Ah, okay, so then something got messed up when you downloaded it or when you loaded it in. Well, try again, I would say; there should be 22,283 rows, so it might be that when you downloaded it, the microarray file got corrupted, or there was not enough space, so that it stops loading halfway through. But yeah, so if we put the cex.axis to 0.1, right, we can see that the axis text here becomes smaller. So question 2b was: make the boxplot with and without specifying cex.axis — what does cex.axis do?
So the thing that it does is make the axis text magnification bigger or smaller. You can put it to two and then it will become really big, you can put it to one, which is the default, and you can put it to 0.7, like I did in this case, to show the entire thing. And las turns the axis labels. So you can see that when I do it with las = 1, it's like this; when I do it with las = 2, it flips the labels around. So it just rotates them 90 degrees. All right, so then before we start, we need to log-transform the data, because when you look at the data here, you can see that the distribution is really weird, right? You have a lot of values which are very close to zero, which is why the boxplot is squashed down there, and then you have a massive amount of outliers on the top. The outliers on the top go up to 100,000 plus, and that is of course because these are intensity values. They are lumens, the amount of light intensity that you get. And light intensity doesn't really follow a normal distribution; it's more like a Poisson distribution. So before we can do anything with the microarray data, we have to log2-transform it. You do that by using the apply function. So I will show you the answers again; let me show you the Notepad++ window. So that is of course the next assignment. So here's las, the orientation of the axis labels, cex.axis is the size of the axis text, and then we want to log2-transform. So what we do here is we use the apply function. The apply function is kind of a for loop. In this case, we apply, to our microarray data set, over margin 2, which is the columns and not the rows — so one is rows, two is columns — the log2 function. So we're just going to take the data set and log2-transform it. So when we go to R and we do that and we make a new boxplot — let me copy the boxplot code as well — then we go to the R window.
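A small sketch of that apply() call, on simulated data:

```r
set.seed(1)
microarray <- matrix(rexp(120, rate = 1 / 1000), nrow = 20, ncol = 6)
# MARGIN = 2 means "go over the columns"; log2 is applied to each column
microarray_log <- apply(microarray, 2, log2)
dim(microarray_log)  # still 20 rows and 6 columns
```

For a plain element-wise function like log2 you could also just write log2(microarray); apply() really matters once the function summarizes each row or column.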
So when we do that, we now see that we have a different distribution, right? Now we can see that the boxplot looks a lot more normally distributed. You see that the median of most of the arrays is around 10, sometimes a little bit lower, sometimes a little bit higher. And you see that the range has changed. Some values are actually below zero, which is bad. But what you can see here is that the range runs from around 0.5 all the way up to around 16. But if we would look at a single one of these — so instead of a boxplot, we could do a histogram. So we say hist of microarray_log and we just take the first column, the first microarray; then we see that this looks more or less like a normal distribution, a lot more than the raw data does. And of course we want that because most of the statistics that we want to do in the end are parametric statistics. We want to use things like the t-test, and for a t-test I need a normal distribution. All right, so what happens when you do apply(microarray, 1, log2), going over the rows instead? The values are the same, because taking the log2 of a value doesn't depend on the direction; the thing is that it will change the orientation. So this is via the columns: if I ask the dimensions of the microarray_log variable, you see that it's still 22,283 rows and six columns. If I do the same thing but go via the rows and then ask for the dimensions, you see that it flipped it around. Now the thing which used to be in the rows is in the columns, and the stuff that used to be in the columns is in the rows. So if I now try to make a boxplot, that would more or less crash R, because then I'm trying to make 22,000 boxplots, and that will just not fit on the screen. You will get a whole bunch of little boxplots.
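The histogram step from above, sketched on simulated data; [, 1] selects the first column, i.e. the first sample:

```r
set.seed(2)
microarray_log <- log2(matrix(rexp(6000, rate = 1 / 1000), ncol = 6))
# distribution of a single array after log2 transformation
hist(microarray_log[, 1], main = "sample 1", xlab = "log2 intensity")
```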
But the thing is that if you go by the rows or by the columns, the values don't change, because taking the log2 of the value 10,000 will always give the same result, no matter whether you iterate through rows or through columns. But when you do it by the rows, it actually flips the matrix around: the columns become the rows and the rows become the columns. Of course, you can easily fix this, because you can just have the transpose function deal with it. The transpose function comes up again later in the lecture. So if you want to transpose it, it takes the matrix and flips it on its side, and now we have 22,000 rows again and six columns. So that's the difference, Jan: it will flip it around. All right, but now we know that the distribution looks a lot better, right? And of course I just asked you guys to look at the boxplot. Wait, I have to first do it like this so that it's in the right orientation, and then I can just make a boxplot. And now you see that all of the distributions look more or less similar. You see that the first two arrays have more outliers than the rest: some outliers on the top, some outliers on the bottom, and the other arrays don't really seem to have that. You also see that the variance around the median is less in the first two boxplots; the other boxplots have a little bit more variance. So they are slightly differently scaled. All right, so then the next step — because this could just be due to the fact that some of these microarrays have been done on different days. You don't run all the microarrays on the same day — you actually should, but a company that does microarrays could spread them out over several days. So every day has its own little variance, right?
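The margin flip and the t() fix look like this on a small simulated matrix:

```r
set.seed(1)
m <- matrix(rexp(120, rate = 1 / 1000), nrow = 20, ncol = 6)
by_cols <- apply(m, 2, log2)  # 20 x 6: same orientation as m
by_rows <- apply(m, 1, log2)  # 6 x 20: rows and columns swapped
fixed   <- t(by_rows)         # transpose flips it back to 20 x 6
```

The values in by_cols and fixed are identical; only the orientation ever differed.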
So it could be that the temperature is a little bit higher, or the humidity in the air is a little bit different. That means that on a day-to-day basis you have some variance in scanning the same microarray. If you take the same microarray, scan it on day one, then scan it on day 10, the values will be shifted a little bit because of the environmental conditions. So one of the things that we want to do is get rid of that, and for that we can use the preprocessCore package to normalize the quantiles. So I will first load the library; it's just library(preprocessCore). I think I already installed it, so that should not be an issue. The code to install the package was also in the assignment, so I hope that it works for everyone. Sometimes if you have a very old version of R, you can't install this package, but then you just have to update your R version. And then of course we take the microarray log. Let me show you the Notepad++ again. So of course what we then do is we take the microarray_log variable, the log2-transformed data, and we normalize the quantiles. I think there was a little error in the assignment, because the assignment said that you should do it on the raw microarray data, and doing it on the raw data doesn't really work. So I'm sorry for that. I try to redo most of the assignments every year to make sure that all of the errors are out. I made a note — I think someone also mailed me, so thank you for that — and in the next version of the assignments the error will be fixed. So: normalize.quantiles on the microarray_log data. And then in R we have to put back the column names and the row names, because of how the normalize.quantiles function behaves — if we would just look at microarray_log, right, at the first five rows and the first five columns, it looks like this.
So you see here the measurement values. Oh, sorry, you're not seeing the R window. So if you index it — because it's a matrix, you can just use the square brackets — and I ask: show me the first five rows, show me the first five columns, then it gives you this little matrix. You can see that here are the measurement values, here the names of the samples, and here the different probes. The probes have names like 1007_s_at, which is the standard Affymetrix probe set naming. But when we run the normalize.quantiles function, it will not copy the row names and the column names onto the new variable that you assign it to. Some functions do carry the names over, but this one doesn't, so I always put them back straight away: I just take the column names and the row names from the previous variable and put them on microarray_log_qnorm, the quantile-normalized data. All right, so let's put this in R and then make a new boxplot and see what happens. Let me show you the R window so that you're there live. So now what you see is that it made the median of each boxplot the same — every boxplot now has the same median. The difference in variance between the boxplots has also been standardized. And you can see that each one now has the same amount of outliers on the bottom, and there are no outliers on the top. What this function does is say: go through each of the microarrays and make sure that every microarray has more or less the same average value and the same variance. So it removes this day-to-day variance. It doesn't know exactly what it removes; it just takes the microarrays and shifts them so that they all have the same median.
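For illustration, here is a base-R sketch of what quantile normalization does. The course uses preprocessCore::normalize.quantiles, which is implemented in C; this hand-rolled version (with a crude tie rule) just shows the idea, including putting the dimnames back afterwards:

```r
# Hand-rolled quantile normalization (illustrative, not the package code)
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  sorted <- apply(m, 2, sort)
  ref    <- rowMeans(sorted)             # reference distribution per rank
  out    <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(m)           # put row/column names back
  out
}

set.seed(1)
m <- matrix(rexp(60, rate = 1 / 1000), nrow = 10, ncol = 6,
            dimnames = list(paste0("probe_", 1:10), paste0("sample_", 1:6)))
qn <- quantile_normalize(log2(m))
apply(qn, 2, median)   # every sample now has exactly the same median
```

After normalization every column holds exactly the same set of values, just assigned to different probes, which is why all the boxplots line up.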
So what it does is it calculates the overall median and then subtracts the difference between the microarray's median and the overall median from every value. And then it standardizes things by calculating the overall standard deviation and scaling every value. So it just takes the values and shifts them around a little bit. Of course, values which used to be low on microarray one are still low on microarray one. If I would, for example, plot the microarray_log values of the first sample against the microarray_log_qnorm values of the first sample, you would see a relatively straight line, right? It just took the values and moved them a little bit. You can see that values that used to be five in the old array are now a bit lower than five: it took the data and pulled it down a little bit. It's not a perfectly straight line, but you can see that a value which used to be low is still low and a value which used to be high is still high. So the ordering of the probes didn't change. That's kind of what it does. All right, so the answer to "what do we see now" is: we see that the range of the data is similar for all samples. All samples now have the same median value and the same variance, so you can compare microarray to microarray. Does it do it by probe? Kind of: it calculates the overall mean of the data and the overall standard deviation of the data, it does the same thing for each of the arrays individually, and then it takes each array's distribution and applies a transformation to it. So it says: take the difference between the two medians and subtract that from every probe, right?
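The ordering-preservation point from that scatter plot can be checked with a quick monotone remap — any rank-preserving transformation, which is what the normalization amounts to within one array:

```r
set.seed(42)
before <- log2(rexp(200, rate = 1 / 1000))
ref    <- sort(rnorm(200, mean = 10))               # hypothetical target distribution
after  <- ref[rank(before, ties.method = "first")]  # rank-preserving remap
plot(before, after)  # monotone scatter: low stays low, high stays high
cor(before, after, method = "spearman")             # rank correlation is 1
```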
So then every probe gets pulled down or pushed up, and the standard deviation works the same way, though that's a slightly more difficult transformation because every value needs to be adjusted a little bit. But in the end it does it for each of the probes on the array. Every probe on the array gets touched; it gets changed based on how far the current array is away from the global median. So that's kind of what it does. And in the end, of course, you have a picture where you just see: okay, all of the arrays now have the same median. So this will remove effects like, for example, putting too little DNA on one of the arrays. Because if you put less DNA on one of the arrays than on the other, then of course the array with more DNA will just have higher intensity values. And this has nothing to do with biology; it just has to do with pipetting slightly different amounts, and that always happens. You can never pipette exactly the same amount of DNA, no matter how good you are at pipetting. All right, so let me look at the assignment. So that's the normalization. There are many other ways of normalizing — you could normalize it yourself — but this is the easiest way to do it. All right, so what is the major difference between this boxplot and the one from question 3a? Well, this one is normalized: most of the differences in variance are removed, and the differences in the medians of the arrays are removed as well. All right, and then I want to do just one of the additional questions. The additional question was to do a little bit of clustering, and I just gave you the code for that, so I just wanted to show you how the results should look. So what you see here is, when we do a clustering, what we do is we ask it to make a distance matrix. We have to transpose the matrix.
So we go from rows being probes and columns being samples to the other way around, and then the distance matrix is calculated for each row against each other row. And then we do hclust, which is a hierarchical clustering algorithm which then, using the distances, makes a clustering, and then we can plot the different clusters. So what you can see from the data here is that there are four samples which are relatively similar, right? They are relatively close to each other, and then there are two samples here which are also relatively close to each other but different from the other four. So looking at the data here, it seems that there are more or less two major groups in our data: one group containing two microarrays and the other group containing four microarrays. That's what this tree shows you. And then there was the scaled variation: finding the highly variable probes. I think I left that question in, right? So the scaled variation — since we can't really do any t-test, right, because we have a group of two individuals versus a group of four individuals, and for a t-test you need at least three individuals in each group. Three versus three is the minimum sample size that you need for a t-test. But you can still look at the scaled variation. That means that you apply to the microarrays, over margin 1, so over the rows, a function of x, and you look at the variance of x divided by the mean of x, right? So if something has a high variance and a high mean, that is weighed less than if it has a high variance and a low mean. You just look at the ratio between the variance of the data and the mean of the data, so you get the stuff which is highly variable. So if we do that in R, by just copy-pasting in the code that I gave you guys, then you will get this scaled variation, which is just a vector. Let's look at the first 10.
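The clustering step can be sketched as follows; the simulated matrix has two samples shifted away from the other four, mimicking the two-versus-four split in the lecture data:

```r
set.seed(7)
m <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6,
            dimnames = list(NULL, paste0("sample_", 1:6)))
m[, 1:2] <- m[, 1:2] + 5   # make samples 1-2 a clearly separate group
d  <- dist(t(m))           # transpose first: dist() compares rows
hc <- hclust(d)            # hierarchical clustering on the distances
plot(hc)                   # dendrogram: 2 samples vs the other 4
```

cutree(hc, k = 2) will cut the tree into the two groups if you want them as labels.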
So for each of the probes, it will tell you what the ratio between the variance and the mean is. For example, for the first probe it's 0.006, for the next probe it's 0.009, and so on. And of course the variance is generally much smaller than the mean on the array. If we would plot this, it would look like this — I think I can still plot that, it's only 22,000 points. And you see here that across the different probes on the array, we see kind of a wavy pattern. I don't know exactly why, but you see that there are some probes which have a relatively high variance compared to their average value on the array. And we can then take these and say which ones are highly variable. So what we do is we ask R: take the scaled variation vector, look which ones are above one, ask which ones those are, and then just take the names of those. Those are the probes which we call highly variable. So if we then look at highly_variable, it will just contain probe names that are above the one line. We just draw a straight line through the data and we say: everything which is higher than one is an interesting gene, because it is variable in our data set. And then of course we can make a nice heat map to look at them. When we do that, we see that it looks more or less like this. Again, we can see the same structure in the samples on the top: there are two samples which are very similar, and then there are four samples which are very similar to each other but different from the other two. And then here you see the probes, and the expression of the probes across the samples. You see here that there's a group of probes which is high in the two samples here and low in the other four samples.
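The scaled-variation selection can be sketched like this (simulated log2 data; the probe labels are hypothetical):

```r
set.seed(3)
m_log <- matrix(log2(rexp(500 * 6, rate = 1 / 1000)), nrow = 500, ncol = 6,
                dimnames = list(paste0("probe_", 1:500), paste0("sample_", 1:6)))
# variance-to-mean ratio per probe (margin 1 = rows)
scaled_variation <- apply(m_log, 1, function(x) var(x) / mean(x))
highly_variable  <- names(scaled_variation)[scaled_variation > 1]
if (length(highly_variable) >= 2)
  heatmap(m_log[highly_variable, ])  # heat map of the variable probes only
```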
And then here we see the same thing, but it's less pronounced in the second group; still, if you look at samples four to six, it seems that these probes are high in one sample and low in the others. So you can't really do statistics, because it's only six samples and you can't do a t-test because of the way that they group, but you still get a little bit of an idea. So then you could look at which genes these probes are targeting, take the names of the genes, and look to see if there's anything known about why these genes might be very different between the samples. All right, so: create and look at the heat map — what is the thing we can learn from this? My answer is: we observe the groups we saw before; however — there's a typo in there, so let's fix it — now in the second group we see that there are again two subgroups. And we learn which probes and genes on the microarray are different between our samples. So that was the idea: just a little bit of basic R so that you guys could look at it. I think there was also a question about the structure prediction. I didn't do the structure prediction; I hope that everyone was able to do that. If you were not, then let me know. If you want me to do it for you guys, I could do that; it's just that it takes a long time. The RNA molecule which codes for the SARS-CoV-2 spike protein is relatively big, so when I did it at home, I had to wait like five minutes for the website to compute the structure. But I think it's relatively simple, right? It's just using the RNAfold web server and putting in the sequence that you found. But if anyone has issues with it, then we can do that. But I really want to start with the lecture — we're done with the assignments — because the lecture is long and there's a lot of R that we have to go through.
All right, so those were the solutions, or at least my solutions; I will put them online. And do check what went wrong with your data set; if you can't figure it out, send me an email and we can schedule a date to remote-desktop in. What I would advise first is to try to re-download the data set. It might be that when you downloaded it, something went wrong and it got cut off halfway through. It happens sometimes; sending stuff over the internet doesn't always have a lot of error checking. So it could be that that's the problem. All right, so: R as a calculator. I think most people know that you can just use R as a calculator. If you just type in stuff, it will give you the answer. So if I type in one plus four, it will give me five. If I do five divided by 10, it will give me 0.5. There are some special operators: if you want to calculate a power, so five to the power of two, you use the caret symbol (^), or you can use the double multiply operator (**). So 5^2 is five to the power of two, and 5**2 is also five to the power of two. Just remember that the decimal separator in R is always the period, never the comma. That is something that goes wrong in some countries: in the American system, you use the comma for separating thousands and the dot for the decimals; in Germany, it's usually the other way around, so you use the dot (or a space) for thousands and a comma for the decimal separator. But in R, it's always a dot. So 0.5 is a half; 0,5 is just an error. And you will run into that sometimes, because if you get an Excel file from someone from Zimbabwe, for example, it might be that they encoded their numbers with commas. Then you have to recode it using a more sane system.
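The calculator basics above in one block:

```r
1 + 4    # 5
5 / 10   # 0.5
5 ^ 2    # 25: the caret is the power operator
5 ** 2   # 25: ** is an alias for ^
0.5      # the decimal separator is always the period, never the comma
```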
There are some special numerical constants in R, like Inf for infinity, NaN for not a number, and NA for a missing value. So values are allowed to be missing, and missing values propagate: if you calculate the mean of a set of values which contains a missing value, then the mean will also be missing. You can get around that by using na.omit() or na.rm = TRUE, but NAs propagate. And there's a good reason why NAs propagate, because in theory, when you really have a missing value, the mean is undefined. You don't know what the mean is: if you have 10 measurements and one of them is missing, then you don't know what the mean is. So that's why the NAs propagate. About the Euclidean division and the Euclidean division remainder, I always show a slide or two which explain this. Euclidean division and its remainder are very useful in programming, especially when you start writing your own loops and doing multicore programming, to distribute work over different CPU cores if you have a multicore CPU. So the way that it works is similar to normal long division. When I learned how to divide in school, they taught us to do it like this: if I want to divide 100 by 39, I put 100 in the middle and 39 on the side, with these brackets here. And the first thing I do is try to fit this number into that number by blanking out the last digit. So, can I divide 10 by 39? No, I cannot. So then I add a digit back; that teaches me that the answer will have one digit fewer — well, that's not that important. What we see is that 39 fits twice into 100, which gives 78, and then it looks like this. And normally you would then continue the computation.
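NA propagation and the two escape hatches in one small example:

```r
x <- c(4, 8, NA, 6)
mean(x)                # NA: the missing value propagates
mean(x, na.rm = TRUE)  # 6: drop NAs inside the function
mean(na.omit(x))       # 6: drop NAs before calling the function
```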
So you would put a comma there, pull a new zero down, look how often 39 fits into 220, and continue until there was no remainder. But then you would not get an integer number; you would get a floating point number, two point something. For this example, though, the Euclidean quotient is two, because 39 fits twice into 100 wholly, and after that you have 22 remaining — that is the Euclidean division remainder. So in this case, we're not doing the business of adding another zero, pulling it down, and continuing with fractionals. No, we just look at how often 39 fits into 100. That's twice: you can take 39 out of 100 two times, and after you've taken it out two times, there are 22 left. So how does this relate to multicore programming? Well, in multicore programming, this would mean that if I have 100 items that I need to compute and I batch them out in groups of 39 elements, then I can fill two computer cores, each doing 39 elements, and then I need a third core which does the remaining 22 elements. And then I can do all of them in parallel on a multicore computer. This is not that useful yet, but just be aware that R has special operators for this: the Euclidean division, which is %/%, and the Euclidean division remainder, which is %%. It will come up sometimes. All right, of course R has a whole bunch of other things built in. There's a whole bunch of built-in constants, like LETTERS, which is just the 26 uppercase letters of the Roman alphabet — a vector which you can use to select from.
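Going back to the division example for a moment, the two operators and the batching use case look like this in R:

```r
100 %/% 39   # 2: the Euclidean quotient (39 fits twice into 100)
100 %%  39   # 22: the Euclidean remainder

# batching 100 work items into chunks of 39, as in the multicore example
items      <- 100
per_core   <- 39
full_cores <- items %/% per_core   # 2 cores fully loaded
leftover   <- items %%  per_core   # 22 items left for a third core
```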
It also has letters, in lowercase, which are the lowercase letters of the standard Roman alphabet. We have month.abb, which are the three-letter abbreviations of the English month names, and we have month.name, which are the full English names of the months of the year. So if you ever want to do anything with programming and you're wondering what the seventh month is, you can just do month.name, square bracket, seven, square bracket close, and it will tell you the name of the seventh month. Sometimes that's useful, especially when you're dealing with data measured across the year. Another very useful built-in constant is pi, the ratio of the circumference of a circle to its diameter. I think everyone knows how to calculate the circumference of a circle, which is two pi r. So that should be okay. It's just a built-in constant, so you don't have to multiply by 3.1415 and so on. In R, you can just say 2 * pi, and then it uses pi at more or less full precision, which is better than just using 3.14. R also supports imaginary numbers. Everyone who has ever wanted to calculate spring constants, say: imagine a weight suspended from a spring with water below it, and I let go of the weight, which drops into the water and bobs up and down. You get this kind of dampening curve, and that dampening curve you can only compute using imaginary numbers. So R supports imaginary numbers, but not by default: you have to add plus zero i (+ 0i) to make it so. If you ask for the square root of minus one, it will say this is not a number (NaN). But if you ask for the square root of minus one plus a zero imaginary part, which is the same as minus one, then it will tell you that indeed the square root of minus one is i, or minus i, of course. Can someone ban that guy? Yeah, got him, got him.
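A quick tour of those built-in constants, plus the complex-number trick with + 0i:

```r
LETTERS[1:3]    # "A" "B" "C": the uppercase alphabet
letters[26]     # "z": the lowercase alphabet
month.abb[7]    # "Jul": three-letter abbreviation of the seventh month
month.name[7]   # "July": full name of the seventh month
2 * pi          # 6.283185...: pi at full machine precision

sqrt(-1)        # NaN, with a warning: R stays in the real numbers
sqrt(-1 + 0i)   # 0+1i: the zero imaginary part forces complex arithmetic
```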
Very good. I like that, "wanna buy followers". No, we're here to get followers the old-fashioned way. So imaginary numbers are supported in R, though of course we're not going to use them soon; in bioinformatics I think I used imaginary numbers like three times during my entire PhD and after. But sometimes you need them, and then they can be really useful. Hey, yeah, I know, but you were not there, so I already banned the guy. I can swing the ban hammer as well, and not only that, I can also make you VIP or revoke your VIP status and stuff. Anyway, R also supports the basic trigonometry functions: sine, cosine, tangent, arcsine, arccosine, and arctangent, which is really nice when you want to do trigonometry, like working with triangles and circles and these kinds of things. They are named by their usual names, and they are functions, so you have to use round brackets: sin(5), you just switch out log for sin. And log(5) is the natural logarithm of five, which in our countries we would write as ln, the natural logarithm with base e. So the name is poorly chosen: for log(5) I always think base-10 log, but the base-10 logarithm is actually log10, so log10(5) is the base-10 logarithm of five, while log(5) is the natural logarithm, which uses e. And exp() raises e to a power: exp(1) is e to the power of one, exp(2) is e to the power of two, and so on. So e is not a built-in constant; exp(1) is just the function call that gets you the number e, which is 2.718 something. I hope that everyone paid attention in math, but all of the math is just available in R. R is just a big fancy calculator. All right, so R follows the standard operator precedence, so the order of operations in R is not simply left to right, or right to left, or the way that people would naively do it.
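As a sketch, the calculator side of R looks like this:

```r
sin(pi / 2)   # 1: trigonometry works in radians
log(5)        # 1.609438: the natural logarithm, base e
log10(5)      # 0.69897: the base-10 logarithm
exp(1)        # 2.718282: e itself, since there is no built-in constant for it
```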
So if you have a big sum, first you do exponents and roots, then multiplication and division, and then addition and subtraction. You often see these Facebook posts where people ask: what is the answer to 10 minus 3 times 2? Well, go. What is the answer to 10 minus 3 times 2? Come on, people, you can do it. Just put it in chat, don't be scared. It's just one of those Facebook tests that your aunties and uncles always try to look smart on. All right, Testosaurus says four. Are there any other answers, or do we all agree with Testosaurus? Or is everyone already asleep? If everyone's asleep, we're going to have a 60-second ad break right now, but let me see. Is there anyone else in chat besides Testosaurus? Yeah, not too many. Jan is still here; the rest of the people have already left, I think. Four, ah, very good, thank you. Skurrita also thinks it's four. All right, very nice. Sandra also four. So everyone thinks it's four; no one comes up with 14. They will be under the Facebook post, right? Under the Facebook post you would see 14, 28, 37. We all went to school, yeah, sure, sure, but this apparently is very, very difficult. But yeah, you first do the multiplication: three times two is six, and then 10 minus six is four. Thank you, Jan, very good. But look through your Facebook history and you'll see a couple of people that answer something very different. Of course, the real answer is four plus zero i, which, if you are dealing with imaginary numbers, is the only correct answer, because you should never forget the imaginary part of the number. So four plus zero i is actually more correct. Anyway, so this is the order of precedence.
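The Facebook puzzle, settled by R itself, with round brackets showing how to get the "wrong" answer on purpose:

```r
10 - 3 * 2     # 4: multiplication binds tighter than subtraction
(10 - 3) * 2   # 14: parentheses force the subtraction first
2 + 3 ^ 2      # 11: exponentiation comes before addition
```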
To remember that, I think everyone in high school learned "Please Excuse My Dear Aunt Sally", or in Dutch we have a different one, and I bet the Germans have their own "donkey bridge" kind of mnemonic. Anyway, operator precedence is in R, and it will bite you sometimes if you don't take care of it. You can always add round brackets to force precedence. All right, so now we start with the real programming stuff, or more programming stuff. When you start R, you get something which is called a session, and everything which you load into your session lives in RAM memory, the random access memory. That is more or less what they mean when a computer is advertised with 16 gigs of memory: there's a lot more memory in your computer than just the 16 gigs plugged into the main board, but in R, everything goes in there. So if you are filling up a vector, on my machine with 16 gigs, or perhaps even a little bit more, after 16 gigs R will just say: I am out of memory. Be aware that this will happen when you start dealing with big data sets. But you can manage your session. Something that we already saw is the setwd command, which allows you to change your working directory, where you are on your hard drive. You have the getwd command to get the current working directory. When you start up R, you are generally in something like C:/Users/Denny/Documents. But of course, I always want to go somewhere else, because I don't store all of my data there; I store my data on my D drive or my E drive or on a flash drive, and then you have to use setwd. And this goes wrong a lot of the time. Many questions like "oh, I can't read the table" come down to people being in the wrong working directory. That happens a lot.
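A minimal sketch of moving around with setwd() and getwd(); the target here is tempdir() so the snippet runs anywhere, but in practice it would be your project or data folder:

```r
old <- getwd()     # remember where we are
setwd(tempdir())   # move somewhere else; in practice: your data folder
getwd()            # confirm that we really moved
setwd(old)         # and move back
```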
All right, so we can run the dir() command, and dir() shows us the files in the current directory. It will show us what's on your desktop, or imagine that you are on the C drive: if you run dir() there, it will show that there's a Windows folder, a Program Files folder, a Users folder, the standard stuff on a C drive in Windows. For the R session itself, you have a similar command, which is ls(). The ls() command shows you which variables and functions are currently loaded into your R session. So these are not files; these are variables which are defined. So let's go back to the R window; we can do that quite quickly, like this. Here I can run dir(), and I see all of the files that are in my current working directory, where I did my lecture. And if I run ls(), I can see what's currently loaded in my session. I have a variable called clusters, a variable called highly variable, a variable called microarray, and so on. You also see scaled variation here, which is of course the vector with the variation numbers. So it shows you which variables are there. Of course, if I would just type x, I would get "Error: object 'x' not found", because there is no variable called x. And sometimes this can help you debug: if you get an error saying it cannot find an object, you can use the ls() function to see whether that object is actually loaded, or whether you just made a typo, and compare against the list. If I made a typo and wrote microarray without the double r, it would say this thing is not found, but then I could run ls() and figure out that it's actually called microarray with a double r. So it helps. We can also install packages into R. We did this for the preprocessCore package. The preprocessCore package actually does not come from the standard R repository; it comes from the Bioconductor repository.
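Here is that debugging pattern sketched out; microarray and clusters are just stand-in variables for this example, not the real lecture data:

```r
microarray <- matrix(rnorm(40), nrow = 4)   # stand-in for the loaded data set
clusters   <- 1:4                           # another example variable

ls()                   # "clusters" "microarray": what lives in the session
exists("microarray")   # TRUE
exists("microaray")    # FALSE: a typo, so ls() tells you the real name
```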
So the Bioconductor repository is separate from the standard R repository. If you want to install something from the standard R repository, you can do that using the install.packages() function. You give it the name of the package that you want to install, in quotes, because it expects a string, and after it's installed you can load the package using library(). If you want to save an object, for example you loaded a big matrix of data and you want to store it as compactly as possible, then you can save that object as a binary file, which also allows it to be loaded into R very quickly. That's especially relevant for the people whose computers took some time to read the microarray data set: if you don't have an SSD but an older hard drive and loading the microarray data set takes a little while, you can save the microarray object. So you say save(microarray, file = "microarray.Rdata"), and it will write a file called microarray.Rdata to your hard drive, which you can read back with the load command. When you then say load("microarray.Rdata"), it loads in much quicker, because it's only about one-tenth of the size of the original text file. So if you're working with big files, the save and load commands can save you a lot of time, because they use a binary format that R can read directly. If you want to save all of the objects, which I would never advise you to do, but imagine you're working in R, you've defined variables and all of these things, and all of a sudden you have to go: Windows Update is already shouting "I will start in 10 minutes", and hey, you really need to leave and you don't want everything to be lost. Then you can type save.image() and give it a name, always with an .Rdata extension, and this will save all of the variables.
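A sketch of the save/load round trip; I use a temporary file here so it runs anywhere, but in practice the file name would be something like microarray.Rdata, and microarray is a stand-in matrix:

```r
microarray <- matrix(rnorm(1000 * 10), nrow = 1000)   # stand-in data set

f <- tempfile(fileext = ".Rdata")   # in practice: "microarray.Rdata"
save(microarray, file = f)          # write the object as a compact binary file

rm(microarray)                      # pretend we start a fresh session
load(f)                             # restores the object under its old name
dim(microarray)                     # 1000 10
```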
So it will just take your entire ls(), go through each object, and save them into one big binary file. If you want to quit, you can use the q() function and give it the string "no". This will quit the R session without saving the current environment. I never save the current environment; it is a source of many, many bugs. If you close R, it also asks you: do you want to save your current workspace? Say no. You never want to save it, because the next time you start up R, it will load in all the variables that you already had. And when you are programming, you want to start with a clean environment, right? We want reproducible research, so something that you did yesterday should not affect the code that you are running today. So never, ever save your R session. If you quit, just type q("no"), with the "no" in quotes, and it will quit R without saving anything to disk. All right, how long have we been talking? 53 minutes. All right, so we will do a short break. Like I said, I will probably run a 60-second ad break and then you guys can enjoy the old animated GIFs. I hope that's not too bad. I didn't have time: I was talking to a couple of our PhD students the entire morning, and I was planning on finding new GIFs for you guys to enjoy, but you'll just have to do with the old GIFs from last week. I'm sorry about that. I will stop the recording, then start the break, and I will see you guys in like 10 minutes.
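And the whole-session variant, sketched with a temporary file; the q("no") call is commented out here because it would end the session:

```r
f <- tempfile(fileext = ".Rdata")   # in practice a name like "session.Rdata"
save.image(file = f)                # saves every object that ls() reports
file.exists(f)                      # TRUE: the whole workspace is on disk

# q("no")   # quits R without saving the environment (never save it!)
```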