All right, then. I think we should be live, so if you're in chat, let me know if you can hear me. Let's just do a quick audio check. Should be fine, should be fine. A little bit tired, but it's almost the weekend. This is the last thing I'm going to do this week, and then it's the weekend. So I'm again getting errors from YouTube. Okay, good. So people can hear me, but it's so weird: I keep getting this error saying that the audio stream is bad. I hope it's good, and that there's not too much noise in the background, because there are still people walking in and out of my office. People should know that I'm streaming today, but we'll see. All right, so I got my mail, which is also good, so everyone should be notified that we have the lecture today. Loud and clear, perfect. All right, so we are up to three viewers, which includes me and my moderator, so apparently we're not the only two watching. That's good. I'm excited about it. It's a short lecture, I think: I have 47 slides, and I hope that's going to be enough to keep you entertained until five or so. Really curious to see how long it will take. And of course we have the assignments, right? We always have the assignments, so we'll have to do those as well. So I'm up for some programming. I actually programmed this morning; it doesn't happen too often that I still get the opportunity to program at work. Most of my job consists of managing people and talking to people. So, excited about it. Four viewers. All right, welcome, welcome everyone. Data analysis using R. So yeah, we're actually running a little behind schedule, I saw. I don't know what was going on, but for some reason last year this lecture was on the second of June. It might have something to do with the public holidays, but we'll just see. Good. So if the sound check is fine, then we'll switch to the lecture layout so that people can see me as well. All right, so good afternoon. Good afternoon. Algorithms and functions today. I'm excited about this one, because for the exam I need to have questions, right? And I don't like having only programming questions; that makes the exam relatively boring, and it also limits the number of questions I can ask, since programming on paper is really hard. So I always like to have a couple of lectures that give us at least some more knowledge-type questions. We'll see how it goes. I've done this lecture a few times, and I've changed it a little bit from last year, so it's not going to be exactly the same; a lot of the slides are still shared, of course, but we'll just see how it goes. All right, so I said 47 slides; this is slide one. Again, why the sun with the eye patch this time? I don't know, because there's no sun at the moment; it's kind of rainy weather. And then I made a black box, which represents an algorithm, but it's not very well visible that it's a black box, which is just a shame. All right, first things first: let's start doing the answers for lecture number six. As always, we can do this in two ways: I can program them live, or I can just show you my prepared answers. But based on the feedback we got last time, it's better to do them live, because apparently people like seeing me program for some reason.
And I think it's much better when I go through them live, because then you can see how I think about the assignments and what I should do. So let's switch to Notepad++, and we'll start with creating a standard header in R: this is going to be the answers to the assignments of lecture number six, it's made by me, I'm paid by the HU Berlin, and we're doing this today, which is 09-06-2022. All right, let's save this so we get some code highlighting. Let me find a suitable place to put it so that I can find it back; let's just put it on my OneDrive, where I keep all the other stuff as well: Documents, the R course folder, and then this is called answers6-live.R. As soon as I give it the .R extension, you can see that saving it turns on the code highlighting. And of course I'm always going to start with setwd(), setting the working directory, because I need to load in files. Nicholas Marutano — hello, hello, welcome to the lecture. So we're just doing the assignments from last time. Let me actually grab the assignments as well, since I did not open them yet. So this is the R course, and then we had assignments number six. I actually have two versions of the same file, but they should be very similar; if I'm answering a question which was not on your assignments, just remind me in chat that I have the wrong file. All right: "please note, this data is from a current analysis and is not public yet." That's not true anymore; it actually is public now. Did we not do this last time? I don't know, let's just continue. No, I'm confused because I got questions by email, so I already looked at these assignments twice this week. So let's figure it out. First, setwd() to where I saved my files, on my OneDrive. And then I want to use read.table, because we have two files that we need to read in for assignment number six: the arrays file and the array data. I got some questions about how the data is structured, so I'll just load the files in and then explain, because in the assignments I mixed up two terms which, for people from biology, are more or less the same, but people who are not from biology were unfamiliar with the second term I used; I got a question about that as well. Right, so we have two files for this assignment. The first is some microarray data. And then we have the file called arrays.txt, which contains the description of the microarrays: it's the meta information, right? Because the microarrays themselves need a lot of meta description; if you only have a matrix, you only have a column name, and of course there's much more that we know about the columns, because the columns are generally the individuals. All right, so let's paste this into R and see what happens. Let me move this window out of the way so I can see my R window as well. Good. So let's load in the array data; the second file is called arrays, not array. That's a very small file. So first, let's see what we have, right? Let's just ask for the dimensions.
So there are 55,821 entries in the array data, in 17 different columns. And if we look at arrays, we see it has one row per array, so every row is a description of one of the data columns. So let's look at a little bit of the file. We could use the head function, but I generally just index [1:10, 1:10], the first 10 rows and columns. The first thing we see is that something is off when we load in the file, because we get these V1, V2, V3 column names. That means I have to tell read.table that there is a header: so let's switch back to Notepad++ and add header = TRUE, because that's what we have in our file; and this is for arrays. All right, very good. So we copy-paste it back in. So let's go back to R, paste it in — oh, that's wrong; let's do this, right. And now when we look at arrays, we see that the headers are indeed fixed. One thing we could still do: we see here that we have this AtlasID column, and I kind of want that to be the row names. Because these are the file names, the original files that we got from the company, and here we see the AtlasID; Atlas is just the company that did our arrays, and this is the ID they gave our samples, which is more or less the ID that we gave to them. So let's load it in with row.names = 2, because the row names are in the second column. And now when we look, we see that we have one column less, but we now have the row names over here. Perfect. So I'm just going to add row.names = 2 to my file in Notepad++ as well. All right, back to R; let's look at the array data, just the first 10 rows, and we observe this. One of the nice things is that read.table auto-detected that we have a header here, so the header is fine, the row names are fine, and the sequence column is also perfectly fine. It seems to have loaded in correctly, so not a lot of work on loading in the files. That was question zero: load in the provided data sets using either the read.table or the read.csv function. Now, when we have microarray data — and I think this is where the confusion came in with some of the students — what we see here are numbers: these numbers are intensities of spots which are on the microarray, and we also call this the expression. So here we have a probe, and because we measured this probe in a certain sample, this value is called the expression of the gene, which is kind of a misnomer, because we're not measuring the whole gene, only a little piece of it. So, precisely: this is the expression of this probe in this sample. And we have the positive control, which is the bright corner, and the negative control, which is the dark corner, so we have positive and negative controls on each of the arrays. And we can see that it makes a difference: for this sample the positive control shows 19,000, while in that sample the positive control is only 17 and a half thousand. So there can be a big difference in overall intensity between arrays. Good. So that's our array data. The next step is to normalize it, but before we go and normalize it, let's have a look at the data first.
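Roughly, the loading part of the script at this point looks like this. A sketch: the exact name of the data file is an assumption on my part; arrays.txt, header = TRUE and row.names = 2 are as typed on stream, and the defaults of read.table worked here without specifying a separator.

    setwd("~/OneDrive/Documents/Rcourse")                    # wherever you saved the files
    arraydata <- read.table("arraydata.txt", header = TRUE)  # probe IDs auto-detected as row names
    arrays <- read.table("arrays.txt", header = TRUE, row.names = 2)  # row.names = 2: the AtlasID column

    dim(arraydata)  # 55821 probes, 17 columns (sequence + the arrays)
    dim(arrays)     # one description row per array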
We always want to get a bit of a feeling for what kind of data we have and what its structure is. The way we can do this is to make a simple boxplot of what's going on: we can just say boxplot and throw in the array data. The big issue, if we look in R, is that the array data has a first column which is not numeric: it's called sequence, and it contains the sequence of the probe that was used to measure the expression. You can't make a boxplot of that, because boxplots require numeric values. So there are two ways of dealing with this, one relatively quick and dirty, the other relatively elegant. You could just say [, -1]: throw away the first column. That works, but it's not a scalable solution: if we added more microarrays, or if there were multiple information columns about the probe, it would break. The next time you get a similar data set, but now with two information columns, this would not work anymore. So what I always like to do is use the metadata. The metadata is loaded into arrays, and if we look at the file, the row names of the metadata are exactly the column names of the data file. So let's use the row names of arrays to select the proper columns, because that is scalable: if we add an array, we add another line to the metadata, the selection automatically picks it up, and the boxplot automatically includes it. So I would rather write the boxplot as: which columns do I want? The columns specified by the row names of the arrays file. Let's go to R and look at the first little boxplot. This takes a while to draw, because there is so much data. What we see is a massive range of intensities. The positive controls will probably be somewhere near the top and the negative controls all the way at the bottom, but it doesn't even look like a proper boxplot, because the distribution of an individual array is just intensity values, and intensity values generally don't follow a normal distribution. If we make a small histogram for the first array — so instead of a boxplot I do a histogram, and instead of all the row names of arrays I just take the first one — then we see something like this: there are values which are massive, all the way up to 30,000, but most genes on the array are actually not expressed; they have very low intensity values. We can zoom in a little more: if we add breaks = 100, we see that it almost follows a kind of Poisson distribution. So this is not a normal distribution. The first thing we want to do — and I explained this in the lectures — is: when we have microarray data which is intensities, so a single-color microarray, we want to log2-transform the data, because by log2-transforming we already do a kind of normalization: we make the distribution look much more like a normal distribution. So let's do that; let's go to Notepad++.
So I'm just going to delete this line because it's a little bit ugly; this was our first boxplot. And then we're going to say: normalize the data into a more Gaussian distribution. How? We just take the log2. So I'm going to take log2 of arraydata[, rownames(arrays)], because I still only want to select the columns which have values, and I'm going to call the result exprdata, for expression data. So this is just the expression data: I'm making a subset, dropping the sequence column, and directly doing a log2 transform. Let's go to R; we now have exprdata, and we can use it to make our boxplots. Make a boxplot of exprdata, and now we see the distributions start looking better. We can see the same thing when we look at the histogram of the first array — sorry, [, 1], the first column. What we see now is that it's still not a normal distribution, but it already looks a lot more normal. We're still bothered by the fact that most of the genes are not expressed, but that's okay; at least the distribution now looks like a kind of slanted normal distribution. It might be two normal distributions woven into each other: one with a mean of around 10 that goes more or less like this, and another one like this. But it looks better: we don't have massive amounts of outliers, and the range is a lot better as well, because now we're dealing with values between 0 and about 17, where originally we had intensity values from 0 up to around 30,000. Good. So the first thing we did is a log2 transformation; why? Because it improves the shape of our data distribution. The next step is to normalize the data, and I found that some people had an issue there because they couldn't install the preprocessCore package. So if you were unable to install preprocessCore, let me look up how you do that. Let's go quickly to Firefox and search for preprocessCore on Bioconductor; we go here, and it gives us this piece of code to install the package. I'm just going to try it, because I think I have the package installed, but I don't know for sure. So I paste it into R. It says that my Bioconductor is out of date, and that's okay. I'm using a slightly different version of R: some people had problems installing it in the newest version, 4.2; I'm still on 4.1, and I don't want to update a whole bunch of packages right now. All right, I actually get a warning saying it's the same version I already have installed. But if it doesn't install for you, you have to find a way around it: a different way of normalizing. I just wanted to show you how you can easily do that, because in the lecture we had two different normalization techniques. So let's first go back to R and look at the boxplot again: the first thing we notice from this boxplot is that every array has a slightly different mean, and there are also outliers.
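Before moving on: put together, the exploration and transformation steps so far look roughly like this (a sketch of what was typed on stream; the install snippet is the standard Bioconductor recipe shown on the package page):

    boxplot(arraydata[, rownames(arrays)])                # raw intensities: huge range, many outliers
    hist(arraydata[, rownames(arrays)[1]], breaks = 100)  # skewed, almost Poisson-like

    exprdata <- log2(arraydata[, rownames(arrays)])       # drop the sequence column, log2-transform
    boxplot(exprdata)                                     # much closer to a normal distribution

    # if preprocessCore is missing, install it via BiocManager:
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("preprocessCore")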
So how can we get rid of the fact that every array has a different mean? Well, the most basic thing is to just remove the mean from the data. So I could say: take my expression data and apply, over the columns, a function of x; and what does that function do? It takes the values x and subtracts mean(x). If I did that, the overall mean of each array would end up at 0. Actually, let's remove the median instead, because the boxplot shows us the median, so the effect will be more clearly visible. So I'm going to take the values and subtract the median, and I'm going to call the result exprdata2, just so that we can compare the two. So I'm applying a function over the columns, and this function just removes the median from the data. If we do this and go to R, we now have exprdata2. Let's make two plots, one above the other: par(mfrow = c(2, 1)), then boxplot the expression data and boxplot the normalized expression data. Now what we observe is that the median of each of the samples is exactly 0. The only drawback is that some of the values have turned negative, which is not what we want, because we can't have negative gene expression. How do we fix that? Well, we could just add back in the global median: the median across all of the arrays before normalizing is around 5-ish. So if we would just take all of our values — "why are you assigning par to a variable?" Ah, hey Manuel: because I want to get the previous plot settings back. I can now say par(op) and I get the previous settings back. You don't have to, but par is a function, so it returns something, and you want to store that return value. "The plot takes really long to create, is this because of RStudio?" No, the original plot just takes a long time because there are so many outliers: it literally has to plot tens of thousands of expression values as little dots. After the log2 transformation it's not a big issue, because it only has to plot a couple of dots, but on the original data there is a massive number of dots to draw, each dot takes a draw call, and that just takes R a long time. I hope that's clear. That's also why I'm assigning the result of par to op: you can say par(op) afterwards and it restores the settings you had, because the par function returns the previous settings. And that's just good practice: generally, when you write a plotting function, you change global parameters inside the function, and that's fine as long as you reset the state at the end of the function. In theory I should have saved this one as well, so that this is op, the two-plot op; it doesn't matter too much. All right, so let's focus on normalizing our data a little bit more, without using an external package. What we have so far: we just subtract the median from each array. Let me make the plots again, so that we can look at them while we talk about it: exprdata and exprdata2.
So what we see is that now every array has the same median, because we just subtracted the median. The only thing I still want is to make sure that none of the values are negative. I could just take the minimum value of the whole matrix after normalization and add that back in, because then the minimum value would be zero. But a slightly better approach is to add back in the global median. So let's do that. Here we have our little function; I'm going to compute a global median, which is just the median of the whole unnormalized exprdata. Then I have to pass this to my function: I add another parameter called med, and pass med = globalmedian into the function. And then I say: the new values are x minus the median of the array, plus med, the global median. Good. Let's see how this looks; I'm going to save this as exprdata3, just so we can compare it in R. All right, let's paste this in. "object 'globalmedian' not found" — that's true, because I didn't paste that part in. And then it complains that it needs numeric data. Why is that? Because this seems to be numeric data... Let's go to Notepad++ and fix a little bit of errors: I'm going to make sure that R treats this as numeric data. It might be complaining because of some missing values, but it's the R type system: "'list' object cannot be coerced to type 'double'". You think it's a list? It's not a list, it's a matrix — well, a data frame. So I'm unlisting the whole thing: as.numeric(unlist(...)), then the median of that, and that's the global median. All right, back to R; see if we can calculate our global median now. The global median should be a single value, and it's about 4.6. Is that good enough? I can ask for the minimum of exprdata2, and yeah: if we add 4.6 back in, everything should be above zero. All right, let's go back and paste the whole function in. I'm going to start all over and overwrite all the stuff that I did — oh no, because then we lose exprdata2. So let's just calculate exprdata3 and then show you how it looks after this normalization. So now I can say op <- par(mfrow = c(1, 3)), three plots next to each other: a boxplot of exprdata, a boxplot of exprdata2, and a boxplot of exprdata3. And now we can see that we do indeed normalize the median: after adding back in the global median, the median of each array is more or less exactly the same, and we don't have any negative expression values anymore. So this is kind of what we want. It's not perfectly normalized yet; there is one more thing we can do to normalize our data. We see here that we have a difference in variance: every array has its own variance. This one starts high and ends high; this one starts a lot lower and ends a little bit lower than the other arrays. So what we want to do is compute, for example, the standard deviation,
and then divide the standard deviation out, so that every array has the same standard deviation and the same median. How do we do this? We go to Notepad++ and make our function a little more complex: the fourth expression data set that I'm going to create is the values that I have, minus the median of the values, divided by the standard deviation of x. This works because dividing by the standard deviation makes the standard deviation of every array equal to one. I'm going to add a couple of extra brackets just to be sure: take x, remove the median, divide by the standard deviation — so now every array has a median of zero and a standard deviation of one — and then add back in the global median. Let's call this exprdata4. That's perfectly fine. So if we go to R — let me close the window here — I'm going to forget about plotting the second one, because we already had that. I'll make three plots: the original data, the data we normalized by removing the median and adding back the global median, and then what happens after the fourth step. That... actually doesn't look as good as I wanted. It seems to have done almost nothing. Does this mean I did it in the wrong order? I take x, I divide x by the standard... no: I take the median out and divide by the standard deviation. So this should be relatively normalized. Let me actually see what I sent back to the student, because I think I gave them a very similar answer. Where's the email? Where's the email? Sorry, it was some time ago; I'm getting too many emails in a day. No, I can't find the email. So sorry, guys, but this is more or less the best that I can do. Because, in theory, if I look at exprdata4 and calculate the standard deviation for the first array — oh, ah, that's where the... oh! The standard deviation of the first array is one, the second array is one, the third is one, the fourth is one. So it did work. From the plot you can't really see it, but the standard deviations did get harmonized: the standard deviation is one for each of the arrays. So it doesn't look like much, but this distribution is more normalized than the one before. And this is about the maximum we can do without using the external library for the normalization. I did want to show you how it looks when you use the external library as well, because that was actually the original assignment. So here I'm going to say: this is my exprdata.norm — or rather exprdata.norm2 — which is going to be normalize.quantiles of the expression data; that's just a function from the preprocessCore library. And I have to make sure that I load the library, so library(preprocessCore) — I think that's how you write it.
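A sketch of the home-made normalization as built up over the last few steps (variable names as used on stream; apply() passes med through to the anonymous function):

    # step 1: centre every array on median 0 (medians become 0, some values go negative)
    exprdata2 <- apply(exprdata, 2, function(x) x - median(x))

    # step 2: add back the global median so expression values stay positive
    globalmedian <- median(as.numeric(unlist(exprdata)))
    exprdata3 <- apply(exprdata, 2, function(x, med) x - median(x) + med, med = globalmedian)

    # step 3: additionally scale every array to standard deviation 1
    exprdata4 <- apply(exprdata, 2, function(x, med) ((x - median(x)) / sd(x)) + med, med = globalmedian)

    op <- par(mfrow = c(1, 3))
    boxplot(exprdata); boxplot(exprdata3); boxplot(exprdata4)
    par(op)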
So let's go to R, load the library, and then normalize the data. It says that it expects a matrix, which is fine: we can just wrap the data frame in as.matrix and normalize that. So let's do the boxplots again: the raw data, the data we normalized ourselves, and exprdata.norm2, the data normalized by normalize.quantiles. You can see that normalize.quantiles does a better job: it doesn't just normalize the standard deviation, it normalizes all of the quantiles to be exactly the same. In the end it should not matter too much: if we do statistical testing on this distribution or on that distribution, it should not make the biggest difference in the world. But if we did the same statistical tests on the original data, we would make a significant number of mistakes, because if a gene is expressed exactly at the median of the distribution, I would find it differentially expressed from one array to the next even though it is expressed at the same level within each array — simply because there is already a difference in median between the arrays. So you can do the normalization yourself; it's not too difficult. Ours looks unnormalized, but it is definitely normalized: all the standard deviations are the same and all the medians are the same. And the same thing holds for the quantile-normalized one, although — let me check — here the standard deviations are probably not one. Well, yes: normalize.quantiles also harmonizes the standard deviations, it just doesn't normalize them to one; it normalizes them to 3.9 in this case, but that's just the way it is. All right, so we've played a little bit with our data: we started off with unnormalized data, and in the end we have normalized it ourselves, and we have normalized it using the normalize.quantiles function, and both should give us very similar results. Let me run the code again to make sure I have all the variables defined in R — and here there needs to be an as.matrix, otherwise we get an error. All right, normalized as well. Good. So, back to the assignment: do a log2 transformation — we did that; use the normalize.quantiles function to normalize the data and create a boxplot showing the data before and after normalization — we did that too, and we actually show before, after normalizing it ourselves, and after normalize.quantiles. Of course, we should add some titles on top to make it look okay. One thing I still wanted to point out: if we look at exprdata.norm2, the one normalized by normalize.quantiles, you see that it lost the row names and the column names, which is a bit of an annoyance. So when we use normalize.quantiles, we have to put the row names and the column names back in ourselves, because the function does not keep them. So: the row names of exprdata.norm2 are the row names of exprdata, and then I duplicate that line and say colnames is colnames — which is safe, because the function doesn't change the order of rows or columns, it only changes the values.
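In code, the quantile-normalization step looks roughly like this (a sketch of the keep-the-dimnames trick as typed on stream):

    library(preprocessCore)

    # normalize.quantiles() wants a matrix and drops the dimnames, so put them back
    exprdata.norm2 <- normalize.quantiles(as.matrix(exprdata))
    rownames(exprdata.norm2) <- rownames(exprdata)
    colnames(exprdata.norm2) <- colnames(exprdata)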
So let's run this and make sure that it's okay. All right, and now we have our metadata back on the normalized matrix. Next question: we can use plots and correlations on the normalized data to have a quick look at it. Create plots that compare each array against the other arrays using correlation — one plot for array one compared to all the other ones; in total you should get 16 plots. And there's a note: use a for loop. Okay, let's use a for loop and calculate correlations. The nice thing is that we can just say: for x in 1 to the number of columns of the expression data, calculate the correlation of column number x against the whole expression data — and we want the quantile-normalized data here; that's probably the best choice. So: calculate the correlation of the array I'm currently looking at against all the other arrays, call that cors, and then make a plot of cors. That's okay, probably. And of course, since we're making 16 plots, I say op <- par(mfrow = c(4, 4)): four plots across, four down, and then everything fits. I also want the name of the array on each plot, so the main title of the plot is the column name of the expression data at position x, the one we're currently looking at. And I say xaxt = "n": don't draw an x-axis, I'll do the axis myself. So: axis(1, at = 1 to 16 — well, at 1 to ncol of the expression data, to keep it generic — and put the column names of the expression data there. So the names of the arrays end up at the bottom of the plot, just so that everything is comparable and I can compare things by name. And a comment: basic exploration using correlations. What I want to know is whether things which are named similarly behave similarly, because if we look at the names, we have names like HT210 and HT2024, and then we have GF210. So what I'm expecting is that either HT210 is highly correlated to GF210, or the HTs are highly correlated to each other: just based on the naming, you would expect things to be related either by the letters or by the number. Let's go and make our correlation plots; I'll copy-paste this into R. And... we get correlation plots which look weird. Let me see why. Let's look at cors; that seems perfectly fine. So why is it plotting only a single point? Ah, you can see again that R kind of messes this up: cors is a matrix with one row, and the default for plotting a matrix is just plotting the first point, for some reason. So in this case we have to make sure that when we plot cors, we plot it as a numeric vector. And now we see the correlations there. All right, let's fix that in our script quite quickly.
So I'm just going to wrap it in as.numeric: instead of cors being a matrix, I want cors to be a numeric vector. And now we see something like this, and we can start reasoning about our data. Looking at the HT210 sample — let me first flip the axis labels around: I say las = 2 and cex.axis = 0.7, to make the x-axis labels vertical and a bit smaller, like this. And I don't want the word "index" there, so I set xlab to empty. And then I make the plot a bit bigger so that everything has a name. Now we can see that the HT210 sample is highly correlated, above 0.95, to all of the other HT samples. If we look at the second HT sample, it's highly correlated to the HT samples, but lowly correlated to the gonadal fat, the GF samples. If I look at the GF samples, I see the same pattern: GF samples cluster with, or are similar to, GF samples, and HT samples are similar to HT samples. If I look at HT210 and then at GF210, I see that HT210 is very lowly correlated to GF210. So I know that the number is not giving me any information about how things group; it's the first two letters that tell me something about the samples. All right, so this is the code as fixed: in the process I added xlab = "" and las = 2, cex.axis = 0.7, to make the x-axis labels readable. That's question 2A. Next: calculate the correlation using the Spearman method between the arrays and create an image plot showing how the arrays are correlated. What do you learn from looking at the image plot? All right, let's do the image plot. I actually lost the chat window, so let me put that on the side, in case anyone wants to ask questions; then I can see the chat without being bothered. Okay: calculate correlations using Spearman. We could actually fix that in 2A as well, by adding method = "spearman" to the cor call, just to be sure. But now I want to keep all of the correlations. So I'm going to use the same code I already had for 2A, but remember the results. I start with corM — initially NULL, no correlation matrix yet; corM, where the M stands for matrix. And then, every time I've calculated a vector of correlations and made the plot, I say: corM is rbind of the current corM with the cors we just computed. So one by one I calculate 16 correlations, then add another 16, and another 16, and so on — just so that I can keep the different correlations we compute. It takes a little while now, because it needs to do the matrix work as well, but it makes the plots and it's done. So now if I look at corM, I expect a matrix which is 16 by 16. And it indeed is; what we can clearly see is that every array is highly correlated to itself.
We don't have row names yet, so let's fix that and say that the row names of corM are the column names of corM, because the rows and the columns should be equal in a correlation matrix. And let's give it a little bit of space. So now the correlation matrix looks like this, and we can do clustering, for example. We can use the image function and just say: make an image of corM. That's very small; let's make it bigger. What we can see here is that the first four samples are highly correlated to themselves and also highly correlated to samples 9, 10, 11 and 12: the HTs are correlated highly to all of the HTs, and the GFs are correlated highly to all of the GFs. I can make this a little better: I can say the x-axis of the image runs from one to the number of rows of corM, the y-axis from one to the number of columns of corM, and then the values that I want. So now we have the proper numbering at the bottom. Then I say xaxt = "n" and yaxt = "n": don't give me an x-axis, don't give me a y-axis; and I disable the labels as well, xlab and ylab empty. So now I have an empty plot with just the correlations. I paste this into Notepad++ so that we have it, and then the only thing left is, again, axis: the first axis at one to the number of rows of the correlation matrix we just computed, and what do I want at those positions? The row names of corM. And I do the same for the second axis, the y-axis; there I say one to ncol, just to keep it generic, and the column names. I could have used the row names for both, because the row names and the column names are the same here, but this is all perfectly fine. So now, if we go to R, we have a reasonable-looking plot. I still need las = 2 to rotate the individual axis labels; let's do that quickly on both axis calls, go to R again, and now we can read what it says. Right, so now we can really start reasoning about our data. Why do I want a plot like this? Well, I know this data: HT stands for hypothalamus, so those are brain tissue samples, and GF stands for gonadal fat, so those are samples from the gonadal fat pad, around your gonads. If I had mixed up a sample in the lab — taken a piece of brain, put it in a tube and labeled it gonadal fat — I would clearly see that in this picture, because one of the gonadal fat samples would be highly correlated to the hypothalamus samples, or the other way around. But because I see this checker pattern, I know that everything went well: the samples measured as hypothalamus are really hypothalamus samples, and the samples measured as gonadal fat are really gonadal fat samples. There are no sample mix-ups, and that is why we make this plot: just to make sure that none of the samples got mixed up. This works really well with two conditions, and it also works really well with three or four conditions: you look at the checker pattern and you can say, yes, we did a good job — or at least the people in the lab did a good job.
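Put together, the 2A/2B code looks roughly like this (a sketch of what was built on stream; note that in axis() the label-size parameter is spelled cex.axis):

    # 2A: one correlation plot per array; 2B: keep the correlations in a matrix
    op <- par(mfrow = c(4, 4))
    corM <- NULL
    for (x in 1:ncol(exprdata.norm2)) {
      cors <- cor(exprdata.norm2[, x], exprdata.norm2, method = "spearman")
      plot(as.numeric(cors), main = colnames(exprdata.norm2)[x], xaxt = "n", xlab = "")
      axis(1, at = 1:ncol(exprdata.norm2), colnames(exprdata.norm2), las = 2, cex.axis = 0.7)
      corM <- rbind(corM, cors)
    }
    par(op)
    rownames(corM) <- colnames(corM)

    # image plot of the correlation matrix, with readable axes
    image(1:nrow(corM), 1:ncol(corM), corM, xaxt = "n", yaxt = "n", xlab = "", ylab = "")
    axis(1, at = 1:nrow(corM), rownames(corM), las = 2)
    axis(2, at = 1:ncol(corM), colnames(corM), las = 2)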
None of the samples were swapped: the tubes that were assigned GF are really GF, and the tubes that were assigned HT are really HT, and we can show this by the fact that they are highly correlated within a tissue and lowly correlated between different tissues. This is what we expect to see. Good, so we're really happy: we did a good job, the lab did a good job, and we showed it with the correlations. So the answer to "what do you learn from looking at the image plot" is: there were no mix-ups in the lab; every sample is probably the sample that it claims to be. At least it's the right tissue — we can't really be 100% sure that this is HT210, but we can be damn sure that it is actually hypothalamus, because it looks like all the other hypothalamus samples. In theory, of course, they could have labeled everything wrong — all of the hypothalamus as gonadal fat and all of the gonadal fat as hypothalamus — and then we would still get the same picture. To figure that out, we would need a third-party microarray done on gonadal fat, because then we could check whether it's really true. But I assume the lab didn't make a significant mistake like that. All right, next question: basic analysis. Using the arrays.txt file, we can split the data into two groups, HT and GF. We already saw that there are two groups, and we see that the same individual has been measured in both tissues. During this analysis we will not bother adjusting for the strain of the individual; however, a normal analysis does need to take this into account. Right, so first things first, question 3A: split the normalized expression data into two groups using the data from the arrays.txt file. All right, let's do that. We have our arrays metadata; let's look at the file again in R, just to remind ourselves what's in it and what the column names are. We can see there's a column called tissue, so we just take the tissue column out: arrays$tissue. Now we see that there are two types of tissue, and we can ask: is tissue equal to HT? That gives me back a TRUE/FALSE vector, and then I can say: give me the row names where this is true. I like selecting by row names, because row number one in this file might not be row number one in the other file. So these are the HT samples: 210, 2024, 513 and so forth. I copy this into Notepad++ and call it samples.ht; then I duplicate the line, call it samples.gf, and select for gonadal fat instead. Good. Paste this into R: now we have two vectors, one with the names of the hypothalamus samples, the other with the names of the gonadal fat samples. And now I just want to split the expression data. So: expr.ht, the hypothalamus expression, is exprdata.norm2 — the normalized expression data — with the hypothalamus samples selected as columns; and expr.gf is exprdata.norm2 with the gonadal fat columns. I don't know if it's smart to split the matrix into two different matrices, but I'm going to do it this way first, because that's what the question asks: split the normalized data into two groups using the data from the arrays.txt file.
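In code, the split looks roughly like this (a sketch; the exact tissue labels in the file — "HT" and "GF" here — are an assumption on my part, so check the arrays.txt contents):

    # select sample names per tissue from the metadata, then take those columns
    samples.ht <- rownames(arrays)[arrays$tissue == "HT"]
    samples.gf <- rownames(arrays)[arrays$tissue == "GF"]

    expr.ht <- exprdata.norm2[, samples.ht]
    expr.gf <- exprdata.norm2[, samples.gf]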
All right, so then question 3B: for each probe on the array, calculate the mean and the standard deviation in both groups. We can do that relatively easily now that we have our data separated. I'll need something to remember results in — let's call it stats; initially I haven't computed any stats, so it's empty. Then I say: for x in 1 to the number of rows of the hypothalamus expression data, take the x-th row of the hypothalamus expression, calculate the mean, and also calculate the standard deviation; and do the exact same thing for the gonadal fat samples: the mean of the gonadal fat, the standard deviation of the gonadal fat. I'm going to combine these four values into a single vector with c(), and that is then srow, a row of the statistics. Then I say: stats is rbind of the stats we had and srow — just add the four computed values to the ones we have, and build up a little matrix. This might take a little while, because we have to go through all the probes, but I'll go to R and paste it in. Oh, I actually need to define expr.ht first — I forgot to paste in those two lines. So let's do that, and now it starts computing the means and the standard deviations for each probe, for each of the two groups, and it will make one big matrix out of them. It will take some time, so let's already assume it's done: I'm going to say the column names of stats are the mean of HT, the standard deviation of HT, the mean of gonadal fat, and the standard deviation of gonadal fat; and the row names of the statistics are the row names of the expression data. All right, good. So we add the column names, go back to R — and it just finished. Now when we look at the first 10 rows of stats, we see that for the bright corner the mean in hypothalamus was 14 and the standard deviation was 0.13; for gonadal fat the mean of the bright corner was 13.8 and the standard deviation was 0.09. Then we have the dark corner, more or less the same story, and then the individual probes of the array. Good, that's what the question asked: calculate the mean and the standard deviation. Okay, the next step is to perform a t-test between the two groups for each probe. So let's prepare for the t-test. I go back to R, and in R I say: I want to add a column to stats. So: stats is rbind of stats and a new column called pval, populated with NA values — just add an empty column of NAs to our statistics. All right, let's do that in R. Hmm, when we look at the first 10 rows of stats, we don't see it. Why is that? rbind to stats, pval = NA... where did that go? Do we need to do it like this? Huh. Oh — I'm rbinding. I'm so dumb. I want to column-bind: I wanted to add a column, not a row, and now there are two rows at the bottom called pval. It doesn't matter.
We just have to re-run the code, because we messed up our matrix. So I copy-paste the stats loop back in, and here I say cbind, because I want an additional column, not an additional row. Re-running takes a bit of time, but that's fine; we have time and we can continue in Notepad++ while it's computing. So: now we'll have an empty column called pval, containing NA for each of the probes. Then I say for x in — and I just copy the loop from above, because we iterate the same way — but now we do the t-test. At position x in the column pval, what do I want to put? The p-value that I'm about to compute. So: my p-value is a t-test of expr.ht[x, ], the x-th row of the hypothalamus expression, versus expr.gf[x, ], the x-th row of the gonadal fat expression. Let me test this in R first. Looking at stats[1:10, ]: because I used cbind this time, I indeed see a column called pval, which is what I wanted. Then I set x to 1 and run the t-test on the first probe. This should be a significant difference, because it's 14 with a standard deviation of 0.13 versus 13.8 with a standard deviation of 0.09 — and indeed, we can see that the p-value is very significant. Now, the question is how I get just the p-value out, because you see that it gives me back this whole "Welch two-sample t-test" printout, and I only want the p-value. So I store the result in aa so that I can look at it, and I ask str(aa) — show me the structure of the aa object. And it tells me that aa is, sneakily, a list of 10: it has the statistic, the parameters, a p.value, and so on. So aa$p.value should give me the p-value. All right, with that knowledge we can go back to the loop and say: I have something called p, the t-test result, and from p I only keep p$p.value, and I put that into the empty slot of the pval column. Good. So let's run 50,000 t-tests or something like that — that will take a while, so let's already look at the next question.
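Put together, the per-probe statistics and t-tests look roughly like this (a sketch as built on stream; the column names mean.HT, sd.HT, mean.GF, sd.GF are my shorthand, and with ~55,000 probes both loops take a while):

    # per-probe means and standard deviations for both groups
    stats <- NULL
    for (x in 1:nrow(expr.ht)) {
      srow <- c(mean(expr.ht[x, ]), sd(expr.ht[x, ]),
                mean(expr.gf[x, ]), sd(expr.gf[x, ]))
      stats <- rbind(stats, srow)
    }
    colnames(stats) <- c("mean.HT", "sd.HT", "mean.GF", "sd.GF")
    rownames(stats) <- rownames(expr.ht)

    stats <- cbind(stats, pval = NA)   # cbind, not rbind: we want an extra column

    # per-probe t-test, keeping only the p-value (found via str() on the test object)
    for (x in 1:nrow(expr.ht)) {
      p <- t.test(expr.ht[x, ], expr.gf[x, ])
      stats[x, "pval"] <- p$p.value
    }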
Next: store the result in a matrix so that the matrix looks like this, and don't forget to add the probe IDs as the row names. The assignment actually mentioned that you can use the $p.value — but even if you hadn't known that: when you get back a custom R object, you can use the str function to look at the structure of the object, and that helps. All right, so in the end we now have all of our p-values and all of our statistics. Looking at the first 10 rows: we have our raw p-values, we see the means, we see the standard deviations. This is more or less publishable: we could write it out to a table, put the table into Excel, and this would be supplementary file 1, differential gene expression between gonadal fat and hypothalamus. Of course that's not entirely true, because we ignored a few things. We know, for example, that we have HT210 and we have GF210: the same individual measured in HT and in GF, so these two measurements are paired, and we could have done a paired t-test, which gives a little more statistical power. You can keep polishing this, but in the end it doesn't matter that much here, because we're ignoring one of the big variables anyway: if we look at arrays, we see that we don't just have two tissues, we also have different strains — different types of mice. So this would not be a really proper paired t-test in this setup anyway. All right, so the question then becomes: how many genes are significantly differentially expressed between hypothalamus and gonadal fat when correcting for multiple testing, first using Bonferroni? So let's do Bonferroni: we take stats, take the column pval, and apply p.adjust — adjust without an e — with method "bonferroni"; I'm actually writing Bonferroni wrong, let me just copy-paste the spelling. This adjusts our p-values, and these are the adjusted p-values. There are going to be a lot of them, and we only want to know how many are significant, so I ask: how many of these are below 0.05, and which ones are they? It tells me these are the probes which are differentially expressed after Bonferroni correction, and if I ask for the length, I can see that 14,267 genes are differentially expressed between hypothalamus and gonadal fat. Which, of course, is not really true, because we're ignoring the pairing and the fact that we have three different strains of animals, so this number is massively inflated; it's just to show you how this works. And then the next part is to adjust using Benjamini-Hochberg, which is called BH, with capital letters. So I say method = "BH", and now we see 31,000-ish. And this is always the case, because Bonferroni is much more stringent than Benjamini-Hochberg: Bonferroni means we are hardly accepting any false positives, while Benjamini-Hochberg is about not accepting too many false negatives — by allowing a controlled rate of false positives, we reduce the false negatives. So, two different answers: in one case 14,267, in the other case 31,000-ish, and neither is really true, because we're ignoring a whole bunch of other factors that also play a role.
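In code, the multiple-testing part is roughly this sketch (counts in the comments are the ones reported on stream):

    # Bonferroni: very stringent, controls the family-wise error rate
    p.bonf <- p.adjust(stats[, "pval"], method = "bonferroni")
    length(which(p.bonf < 0.05))    # 14,267 on stream

    # Benjamini-Hochberg: controls the false discovery rate instead
    p.bh <- p.adjust(stats[, "pval"], method = "BH")
    length(which(p.bh < 0.05))      # 31,329 on stream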
So let's add these two lines of code to our script, and then I think we're done with the assignments for today. I'll mark them: this first one was the one that did not match, and then here we have the other one, which in the end gave us 31,329. Good, so that was the answer to 3C and 3D. All right, I hope this was clear, and I hope everyone was able to do these assignments, or at least tried — I know some people tried, because some people sent in questions. These are the answers that I did live; I had slightly different answers when I did it beforehand — I think I used the apply function a bit more — but in the end it's very similar, and I think the answers are very similar to the ones we got last time. Good. Then it's time for a little break, and after that we'll do the algorithms and functions lecture. So I'll be back in 10 minutes or so; I don't know exactly how long the music will last, but I've selected some music, and I think the first break is going to be alligators — if not, then the second one is; I forgot which one I put first. All right, so I'll see you back in however long the music lasts; enjoy the alligators, have a nice break, get some coffee, and let's start the music. ... Good, so welcome back, everyone who's still there; I hope you enjoyed the alligators, the Floridian swamp cats. I'm now wondering what the next break is going to be, and whether there is going to be a next break, because, like I said, we have 44 slides left, so in theory we should be able to do it in about an hour. That would mean no second break, but it also means that I can start my weekend a little earlier, and you get a bit of yours back too. Good. So for today, I wanted to start off with a little bit of demotivation. We've probably already lost about half of the students, which is generally the case in a programming course: people sign up, and after two lectures they decide programming is not for them and drop out of the course. But for the guys and girls that are left: a little bit of demotivation to help you get motivated — which sounds strange, but it's the way I like it. And after that we'll have theory. There's going to be a lot of theory today, and not a lot of R code; I'll show a little bit of R code, but it's more about the theory of coding. For example: what is an algorithm, and how can you visualize algorithms? What are design patterns, and how can design patterns help us write better code? Take the earlier question about why I assign the return value of par to a variable: that is a design pattern. When you have a function that does plotting, you may need to change all kinds of global parameters to get the plot to look the way you want, but once you're done with the plot, you want to restore the settings as they were before — you don't want people to call your plot function and find that afterwards everything is suddenly plotted in red instead of black, or has a black background. You don't want these things to happen: you want your functions to have as few side effects as possible — actually, no side effects is best — and this is just a good pattern to design code by.
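A minimal sketch of that pattern — a hypothetical little plotting function of my own, not code from the lecture:

    myplot <- function(x) {
      op <- par(mfrow = c(1, 2), col = "red")  # par() returns the *previous* settings
      plot(x)
      hist(x)
      par(op)                                  # restore them: no side effects leak out
    }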
I also wanted to talk about functions, and in particular about recursive functions, because I love recursive functions. Recursion is one of the oldest ideas in programming, going back to Ada Lovelace, the programmer who wrote code for the Analytical Engine of Charles Babbage, and I just like recursion: it's a beautiful way of programming, because it's very close to the mathematical description of things. And then I also wanted to say a few words about higher-order functions, because we've already seen them, like the apply function and these kinds of things, but I wanted to show you guys how you can write them yourself, and more or less also give you a definition, so that I can ask you on the exam what a higher-order function is and you can answer that. Good. So first off: you want to learn programming, and that means we have to talk about practicing and about doing the assignments. There was a very well-known study in the early 1990s by psychologists, and these psychologists had the hypothesis that if you look at children, and you look at, for example, playing an instrument, then some of the children would be naturally gifted: they would have some innate, inborn ability to program or to play the violin. So what they did is study the practice habits of violin students in childhood, in adolescence, and then in adulthood. All of the people enrolled in the study played the violin, and they all began at around five years of age, and at first they practiced more or less the same amount of time, because they were in a violin school, or they went to music lessons and practiced at home. They saw that when these children became around eight years old, their practice times began to diverge: some started practicing more, because they really liked playing the violin, and others started practicing less, because they went to soccer or hung out with their friends a little bit more. But at eight years there was already a big difference in the amount of time people would invest in practicing. And then, by the age of twenty, they found that elite performers, the people who were really good at playing the violin, averaged more than ten thousand hours of practice, while the people who were not as good only had around four thousand hours of practice in the end. Their hypothesis had been that some of the people who didn't practice as much would also be elite performers, and that some of those who practiced ten thousand hours or more would still be mediocre violin players; that is what you'd expect if you believe in natural-born abilities, if you think that someone has a math bump, so to speak. That is exactly what they did not find. They found that there are no naturally gifted performers; in the end, the only thing that made you good was practice. They concluded it like this: if natural talent had played a role, one would have expected some of these naturals to reach the elite level with fewer practice hours than everyone else, and they did not find that. They found a direct, almost perfect correlation between the hours practiced and the ability to play the violin. That means that there are no shortcuts, and no people who are naturally gifted at playing the violin. And this study has been done in many fields: they have looked at children in things like mathematics,
programming, playing the violin, playing soccer: in the end you just have to put in the work. So there are no shortcuts to becoming a good programmer; you just have to practice, and you can't lean back and say, well, I'm very good at mathematics, so I am naturally gifted, because that does not exist at all. This study has been replicated in multiple fields, and every time people find the same thing: your ability to do something depends on the number of hours you spend practicing your skills. So do the assignments, and for that I have actually uploaded additional assignments. For the people who are doing the assignments and are really interested in the course: you will become a good programmer as long as you continue putting in the time, taking one or two hours every week to program something. I know that the hard part about programming is finding assignments, finding things to program, so for that I have uploaded a couple of additional assignments to Moodle. They follow the same structure: they start off very easy, and some of them are very hard at the end. And if you want to practice more, or if you prefer learning by reading instead of watching videos or going to lectures, then I have also uploaded some of my PDFs. I made these PDFs when I was still working at the University of Groningen; there we had a very similar course to this one, a bioinformatics course, but the people there were more or less master students. They would come into our group to do a three- to six-month master project, and they did not know how to program, so I made this kind of introductory PDF. It has, I think, 25 pages or so: a little bit of text explaining things, and then some assignments so that you can practice what you just read. Besides that, I also uploaded my cheat sheet, which is a small one-page A4 that you can print and pin to the wall somewhere, where most of the basic coding constructs are explained. It's a very, very small sheet which shows you how to write a small for loop, how to write a small while loop, and these kinds of things. Am I still online? Can someone say something in chat if they can still see me, because YouTube is flipping out again, which is weird. Anyway, doesn't matter, I think I'm still online. Okay, good, thanks Misha; yeah, YouTube is giving me some weird feedback, but nothing is wrong. Okay, good. Alright, so there's the R introduction, which is just text: read a little bit, do an assignment. I've also uploaded the additional assignments to Moodle, so you can get them there, and the cheat sheet is there too: a one-page A4 which you can print and put next to your computer, to remind you to practice, but also to cheat a little and look at, say, oh, I have to write a while loop, what does a basic while loop look like, or, I need to make a subset of a matrix, how do I do this? You can also get that from my website. Good, so let's talk about algorithms. An algorithm is defined like this, and I'm just going to read the whole text, which I know you should never do when you're presenting, but I think it is very important, and the main thing is the bold highlighted parts: an algorithm is an effective method, expressed as a finite list of well-defined instructions for calculating a function, starting from an initial state and an initial input (which might be empty). The instructions describe a computation that, when executed, proceeds through a finite number of
well-defined successive states, eventually producing output and terminating at a final ending state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input. To summarize this, because I don't want you guys to memorize this whole definition of an algorithm, that's just too much, no one can learn this piece of text (well, you could in theory): an algorithm is a list of a discrete number of steps that, when followed, will complete a certain task, or it will fail. That is my very short definition of an algorithm. And why is this important? Because of the finite list of well-defined instructions: an algorithm cannot be infinite, it has to finish after a certain number of steps. An algorithm has two very well-defined states: an input state and an output state, or final ending state. This is the core of an algorithm. It's like a cooking recipe, it's like baking a cake: you start off with a well-defined list of ingredients, then these ingredients are used in a series of steps to make your cake, and you end up with a cake, or you fail, which is also a possibility when you bake a cake. When I'm baking a cake, the failure condition is quite big, but for some people the failure condition is quite small. Because of this, we use state diagrams, and we can use state diagrams to represent algorithms and the flow through a system. A state diagram is a diagram used to describe the behavior of a system, and it is a visualization of an algorithm. People always confuse state diagrams with flowcharts, and they are very similar: in A we see a state diagram and in B we see a flowchart, and the only big difference is that in a state diagram, moving from one state to another is done by an action: something that a user does, or some input that is read, or output that is written. So a state diagram and a flowchart are more or less the same thing, except that in a flowchart you can flow from one box to the next without there being an explicit event: flowcharts do not need explicit events to move from one state to another, while a state diagram does need an event. So if we look at a couple of state diagrams, we can see more or less how they look. You have an initial state, which is always represented by a filled dot; this is the state before anything happened. Here we see the state diagram of an engine: the initial state leads into the idle state, and then you have transitions: you can turn the engine on, which means that it's running, and you can turn the engine off, which goes back to idle. And then you have the final state, which is always represented by a black dot with a circle around it. You always start at the initial state and you always end at the final state; states are represented by boxes. Then this is the state diagram of the for-each loop. If we have a very basic for x in A, with A being a vector of elements, what are we going to do? We are going to call a function on each element. We start off at the initial state, the filled dot here; if A is not empty, we move to the next state, which means executing the function, and then we go back for the next element of A, and we go through this loop until A is empty,
and then we reach the final state. So we can describe this: if we, for example, have a for x in 1 to 10, so we do something 10 times, what do we do? Well, we take column x of my matrix, add 1, and put this back into the matrix. The initial state here is that you have something called my matrix, which is filled with numbers; there is a finite number of successive states, in this case 10, because we do something 10 times; the successive states are that the values in column x of the matrix are increased by 1; and the final state is that columns 1 through 10 contain values increased by 1. So an algorithm always has an initial state, a finite number of successive states, and then a final state in which we end up. And the nice thing about for loops and for-each loops, and I think I already said this in one of the earlier lectures, is that they are modeled very closely on natural language: for x in 1 to the number of columns of m, or for x in the vector a, b, c. It is very close to how you would normally describe it, which is really nice, because it means there is a relationship between natural language and the programming language we are using. A while loop has the exact same state diagram; the only difference is that while A is true, we do a certain function. It is less natural, because saying "while something is true, do something" feels less natural than saying "for x in 1 to the number of columns of my matrix". But the nice thing about a while loop is that there is no need to know the number of iterations beforehand: it iterates until the condition is no longer true. The other interesting thing about the while loop is that you can actually make a while loop which is not an algorithm, because an algorithm always needs to end up in the final state, but if the expression in the while loop never becomes false, it will iterate forever and ever. If I just say while true, do fun, then it's not an algorithm. And the funny thing is that a lot of the software you are using is built on this principle: for example, a window manager, or PowerPoint, has a while-true loop in there: while PowerPoint is running, display the PowerPoint slide. That automatically makes it not an algorithm, because an algorithm needs a finite number of steps and should always end up in the final state. But the state diagram is exactly the same: you start at something called A, if A is true you execute the function and go to the next element of A, and as soon as A becomes false, we end up in the final state. So a while loop can produce something which is not an algorithm.
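To make that concrete, here is a minimal sketch of both loops; my_matrix is a made-up 10-by-10 matrix, since the slide only describes it:

```r
# A minimal sketch of the loop from the slide; 'my_matrix' is a
# hypothetical 10-column matrix of numbers (the initial state).
my_matrix <- matrix(rnorm(100), nrow = 10, ncol = 10)

for (x in 1:10) {                         # a finite number of successive states
  my_matrix[, x] <- my_matrix[, x] + 1    # state x: column x increased by 1
}
# Final state: all 10 columns contain values increased by 1

# The same idea as a while loop; here we must manage the counter ourselves
x <- 1
while (x <= ncol(my_matrix)) {
  my_matrix[, x] <- my_matrix[, x] + 1
  x <- x + 1                              # without this line: not an algorithm
}
```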
Here is an example I got from uml-diagrams.org: a bank machine, an ATM, an automated teller machine. An automated teller machine is based on this diagram, and these diagrams help us to code, because by making this diagram beforehand, we know exactly which states we should implement and where things can go wrong. Here in the ATM we have the initial state, which means that the machine is off. When we turn on the machine (an action), it starts doing a self-test. The self-test can succeed, which means that we are now in the idle state, or it can fail, which means we have to show an out-of-service message to the user: don't put your card in, because we will eat it at this point. From the out-of-service state there are several actions that can be performed. One of them is that you can turn off the machine, which leads back to the off state. However, there is also the option to service the machine, which puts it in maintenance mode. Maintenance can fail, and then it goes back to out of service; or maintenance can succeed, and then it does a self-test, and if the self-test succeeds, you go into the idle mode. From the idle mode you can also directly service it: a technician can come, punch in the secret code, and perform maintenance on the machine. And here we see one of the nice things about the state diagram: by making this diagram beforehand, there is a very clear separation of the things that need to be implemented. There needs to be a self-test routine, a maintenance routine, an out-of-service routine, and an idle routine. And from the idle routine, when someone inserts their card, we go into the "serving customer" section of the diagram. Serving customer more or less has an entry state, which is reading the card, and an exit state, which is when the machine gives you back the card. In between there is customer authentication, then after authentication a selection of the transaction, then the transaction itself is performed, and then we give back the card. And of course there are more blocks here: customer authentication can fail, what do we do then? A user can select an invalid transaction, for example he wants to withdraw much more money than is currently in the machine, what do we do then? By making these diagrams, we are forced to think about our code beforehand and to structure it in such a way that we have functions for everything, because generally all of these little boxes end up being a function in the software that we are writing. So again: an algorithm is a finite list of well-defined instructions, going from an initial state through a well-defined number of successive states, eventually producing an output or terminating at the final state. Now, we can classify algorithms in several different ways; there are two main ways of defining different types of algorithms. The first is by implementation, and the second is by design paradigm, the design idea behind the algorithm, and these can be mixed and matched. We will first do implementation. The implementation defines, for example, whether an algorithm is recursive or iterative. We haven't talked about recursion yet, but we have talked about iteration a lot: iteration means a for loop or a while loop, just doing something over and over again until you run out of things to do it on, or until the condition of your while loop becomes false. One of the main examples here is the little game called the Towers of Hanoi. Towers of Hanoi is a very interesting game, because you can implement it in a recursive fashion as well as in an iterative fashion. The game is very simple: you have three pins, and there is a pyramid of rings, called the tower, and this tower needs to be moved from one pin to another. What you can do is take one ring off and put it on another pin, but the rule is that you can only put a ring on top of a bigger ring.
So it always has to stay a tower; it can't become a, what do you call it, a trapezium, which widens as it goes up. This is a very well-known mathematical problem, and it can be solved iteratively, by just going through all the different states, or it can be solved recursively. If you want to read more about it, just google Towers of Hanoi and you can see the different implementations that people made. The implementation can also be logical, which means that it is based on Boolean logic to solve your problem. Furthermore, we can classify algorithms as serial, which means that you do one step after another. Algorithms can also be classified as parallel, which means that some of the steps can be executed at the same time, because there is no dependency between them. And we also define distributed implementations of algorithms, which means that parts of the algorithm can run on several machines at the same time. Deterministic means that the series of steps will always end up in the exact same state: if you have the same input, you will always get the exact same output. Non-deterministic is different: with the same input, you could produce multiple different outputs. We also have exact versus approximate: some algorithms give you the exact answer, other algorithms can only give you an approximation. For example, think about linear regression: a very small linear model can lead to an exact answer, but with, for example, linear mixed models, the answers that the algorithm gives you are generally approximations. It gives you a value of beta, and this value of beta is an estimate, not the exact value, but with 95% certainty it will lie within certain confidence intervals. And nowadays we also talk about quantum algorithms, like Shor's algorithm for factoring numbers into primes; quantum algorithms are algorithms which run on a quantum computer, so they do not work with ones and zeros but with quantum states. If we classify algorithms by design paradigm, then for example we have brute-force algorithms: say I want to hack into someone's account, and I'm just going to try all possible passwords for a certain username. This is called brute force, or also exhaustive search. We have divide-and-conquer algorithms, for example sorting. I show an algorithm here which is a sorting algorithm, and the idea is to sort these numbers in order. How are we going to do that? Well, I'm just going to split the original list into two smaller lists, and I'm going to do this again and again until I have individual numbers, so every time I make my problem 50% smaller. Once I reach the state where I only have loose numbers, I combine the numbers again, but now putting the smallest one first every time. So divide and conquer is a way of sorting numbers, and the idea is that you cannot solve the whole problem in one go, but you can solve subsections of the problem much more easily.
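As a sketch of the divide-and-conquer idea, here is a small merge sort; this is my own toy version, not the exact code from the slide:

```r
# A minimal merge sort sketch (divide and conquer).
merge_sort <- function(v) {
  if (length(v) <= 1) return(v)             # base case: nothing left to split
  mid   <- floor(length(v) / 2)
  left  <- merge_sort(v[1:mid])             # solve each half separately...
  right <- merge_sort(v[(mid + 1):length(v)])
  # ...then merge: repeatedly take the smallest remaining number
  out <- numeric(0)
  while (length(left) > 0 && length(right) > 0) {
    if (left[1] <= right[1]) {
      out  <- c(out, left[1])
      left <- left[-1]
    } else {
      out   <- c(out, right[1])
      right <- right[-1]
    }
  }
  c(out, left, right)                       # one of the two is already empty
}

merge_sort(c(38, 27, 43, 3, 9, 82, 10))     # 3 9 10 27 38 43 82
```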
Then there are search and enumeration algorithms. For example, if you think about chess algorithms: they will just try every possible next move, and then put a weight on each of these moves. How likely is it, if I do this move, that I win: how many winning states are there and how many losing states? If I move this pawn it's 50-30, and for the other pawn it's 30-50. Someone asked: is that the same as dynamic programming? Not quite. Dynamic programming is a design paradigm where you break a problem into overlapping subproblems and store the answers to those subproblems, so that each subproblem only has to be solved once instead of over and over again; unlike plain functional programming, where you just call functions that each do something, dynamic programming is about reusing work you have already done. Search and enumeration is more like chess: if I do this move, how many winning states and how many losing states are there, and you want to optimize, so you do the move which has the most winning states in the end. We can also talk about randomized algorithms, which is another design paradigm: here you have an algorithm that uses random numbers, for example to get an approximation. There are Monte Carlo algorithms, and we will talk about those a little bit more, because I like them a lot and they're used a lot in genetics. Monte Carlo algorithms are simulations: you are doing almost the same thing as in search and enumeration, you say, well, in this situation there are, for example, three different things that can happen at random, and which one is the most likely to happen? And then there are also reduction-of-complexity algorithms; divide and conquer is a reduction-of-complexity algorithm, but there are more ways of reducing complexity than divide and conquer. Just remember: there are two main ways to classify algorithms, one is by design paradigm, and the other is by implementation. How an algorithm is implemented is one side of the equation, and what strategy it uses to find an answer is called the design paradigm. And when we talk about design paradigms, we have to talk about design patterns. A design pattern is a general, reusable solution to a commonly occurring problem within a given context in software design. For example, how to log into a website, or how to do two-factor authentication: these are more or less solved problems, people have figured out the best way to do them. The thing about a design pattern is that it's not a finished design that can be transformed directly into source or machine code; it is a template for how to solve a problem, and it can be used in many different situations. Design patterns are generally represented by state diagrams, like the ATM machine. The ATM is not something that you can take off the shelf and it will just work: the machine can be implemented in many different programming languages, and it can run on very different hardware, but what the ATM does is always more or less the same. It has an initial state where it's off, it gets turned on, there's a self-test, there's a maintenance mode, and all of these things need to be present. So why do we use design patterns? To avoid reinventing the wheel: logging into a website is a problem that people have been dealing with for a long time, there is a best practice, and it is proven to work.
And if you are programming, you should be aware that many of the things you want to do have already been solved before, by much smarter people than you, so you should just take their solutions; in some cases there's even a mathematical proof that it is the optimal way of doing it. You don't want to reinvent the wheel. Some examples that we will look at: the model-view-controller, which is a very common design pattern in things like websites. It's about how to separate the data from the thing that the user sees, and to keep both away from the input that the user gives, because you don't want users to be able to modify all of the data in the database. So model-view-controller is a very common design pattern, used in all types of software. Monte Carlo sampling is also a very common design pattern, to solve problems or to run simulations. Then we have the facade. The facade is a way of separating your code from someone else's code: if you are a software company, for example Twitter, you want to have control over your own software stack, and if you internally decide tomorrow to switch to a completely different programming language, you do not want all of the software out there, for example TweetDeck, or all the other software that interacts with Twitter, to break; they should just keep calling well-defined functions. A facade is also sometimes called an API, an application programming interface. And then there are things like schedulers: how to schedule things on a CPU. There are many different ways to do this, and every scheduler has its own advantages and disadvantages, but schedulers, and how to schedule things, are a solved problem. So if you're writing software where things need to happen within a certain amount of time, don't write your own scheduler, just take an off-the-shelf scheduler and use that. There are also patterns for things like locking, or singletons; we won't really talk about those, but there are many of these design patterns: patterns which are known to work and are proven to be good solutions for your problem. So, the model-view-controller is used a lot in web development and web design. You have your data model, and your data model depends on nothing and knows nothing. The data model is nothing more than, for example, a database or an Excel table containing data, and it cannot do anything; the model cannot modify itself, it's just a storage place for data. And by the model we mean the way that the database is laid out. The data presentation layer is called the view. If I go to Facebook, for example, I see a view that Facebook presents to me; this is not the model, not the data in the back end. Facebook has stored my data in a completely different way from what I am seeing when I visit the site. And then you have the application logic. The view does nothing but present the data: it knows that the model exists and what the model looks like, it just takes the model and presents it to you, and it doesn't provide any interaction. All of the interaction goes through the application logic, which is called the controller, and the controller knows both the view and the model. It knows, for example: you are logged in as a friend, so I should give you the friend view of this profile; or you are the user, so I'm going to give you the user view of this profile; or you are not logged in,
so you should only be able to see the publicly available data, and here is the view that you should get. The controller updates the view, the view provides ways for the user to interact with the controller, the controller then updates the model, and the model notifies the controller that the view needs to be updated. So it separates the data from the logic from the presentation. It is a way of separating out code that makes code reusable, because we can change the view, and changing the view means that we don't have to update any of the controller logic. You see this a lot in web development. In web development, the data is generally stored in a database; then you have, for example, the HTML page, which shows you what should be displayed where, and then you have the CSS, which more or less determines the presentation. In that sense the HTML is kind of the controller, in a way, and the CSS is more or less the view: what are you able to see, and how are you able to see it? Although in this case there is not a 100% clear separation. But if you are doing web development, or if you are building mobile apps, then generally you follow this structure: the data is in the model; the data does not depend on how it is presented, and it does not depend on the user actions or interactions; the logic is all in the controller; everything flows via the controller, and the view can never update the model directly. That allows you to put in things like security: all of the security is in the controller, so the controller decides whether a user action is valid and whether it is allowed to continue and update the model. So why use it? It provides a clear separation between model and view, so you can add new views easily: if Facebook tomorrow decides that besides the friend view of a profile it also wants a friend-of-a-friend view, they can easily add that. And the separation between view and controller means that the user interacts with the view, but the view does not do any computation; the view is just a display. This improves security, because everything runs via the controller: every action the user performs on the view is sent to the controller, and the controller can say, yes, you are allowed to do this, or no, you are not.
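As a toy sketch of the pattern in R (all the names are made up for illustration; a real MVC would of course be a full framework):

```r
# A toy model-view-controller sketch; 'profile', 'view' and
# 'controller' are hypothetical names for illustration only.
model <- new.env()                      # the model: it just stores data
model$profile <- list(name = "Alice", email = "alice@example.org")

view <- function(data) {                # the view: it only presents data
  cat("Name:", data$name, "\n")
  if (!is.null(data$email)) cat("Email:", data$email, "\n")
}

controller <- function(logged_in) {     # the controller: all the logic
  data <- model$profile
  if (!logged_in) data$email <- NULL    # public view: hide private fields
  view(data)                            # the view never touches the model
}

controller(logged_in = FALSE)           # only the public data is shown
controller(logged_in = TRUE)            # the full profile is shown
```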
Another example is Monte Carlo sampling. Monte Carlo sampling is an approach where you map everything that might happen in a complex event, and then determine from that which events are most likely to happen. Monte Carlo sampling is used a lot in genetics, in economics, in agriculture, but also in weather prediction. As an example: the washing machine that you have at home is a Monte Carlo sampling device, because it simulates all of the things that could happen. Imagine that I have a bathtub of Legos, and I want to know which Lego structures are the most common when you put them in a washing machine. What the washing machine does is generate random samples: you put your Legos in, the thing starts turning, and blocks start connecting and disconnecting from each other, more or less automatically, but also in a random way. And of course, because I put in a certain number of blocks of certain types, some structures are more likely to occur than others. There is this really nice paper about this, "Random structures from Lego bricks and analog Monte Carlo procedures", which shows exactly this: these are all the things that could happen, and now we are trying to figure out which is the most likely to happen. It's a really, really interesting paper, and there is actually a big group on the internet, a lot of people throwing Legos into their washing machines and doing more or less Monte Carlo random sampling on Legos, and there are whole competitions to get the most complex structure, or to get 10 bricks on top of each other. It's a fun group of people; you can see that the guy is having fun, sitting there waiting for his Lego bricks to connect to each other, and then he opens up the washing machine and sees: oh, I found this beautiful structure made of Legos in my washing machine. But that is what Monte Carlo sampling is: an algorithm that maps everything that might happen, and then determines which events are the most likely to happen; that is how it relates to Lego bricks and washing machines: you want to know which structures are the most likely to occur. And this of course has a very tight connection with things like genetics. In genetics you have two DNA chromosomes, one from your father, one from your mother. But before you get a chromosome from your mother, there is a random event where the two chromosomes that your mother has break and swap with each other, so what you get from your mother is part of chromosome one from her mother and part of chromosome one from her father, and where this break happens is random. Of course, you can use an algorithm to simulate what might happen and what is the most likely thing to happen, so it relates back to genetics. And not only that, but also to weather prediction: if it is currently raining, what can happen next? Well, it can stop raining, it can start thundering, mist can roll in. There are all these possibilities that could happen when it is raining; you observe what actually happens, that is how you fill in your probabilities, and then when you do a weather prediction you say: the weather is currently raining, so what is the most likely state to happen next? I hope I explained that well, but do read the paper, "Random structures from Lego bricks and analog Monte Carlo procedures"; it is an interesting paper.
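Here is a minimal Monte Carlo sketch, my own toy example rather than the Lego paper: we estimate how likely it is that, in a group of 23 people, at least two share a birthday, simply by simulating the event many times and counting how often it happens:

```r
# A minimal Monte Carlo sketch (toy example, not from the lecture):
# estimate the probability of a shared birthday among 23 people.
set.seed(42)
n_sim <- 100000
hits  <- 0
for (i in 1:n_sim) {
  birthdays <- sample(1:365, size = 23, replace = TRUE)  # one random "event"
  if (any(duplicated(birthdays))) hits <- hits + 1       # did it happen?
}
hits / n_sim   # roughly 0.507: the estimated probability
```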
Alright, the facade. The facade pattern is something like this: a facade is there to provide a stable call interface. When we talked about the facade before, in the overview, I told you guys: if you think about Twitter, Twitter is a massive organization with a huge number of people working there, and there are a lot of different apps, for example an app on my phone, that connect to Twitter. This app on my phone does not want to know, and does not need to know, which programming language Twitter is using on the back end. So how does Twitter have the ability to support literally thousands of different apps, while still being able to update their own software? They do this by providing a facade. The facade is, for example, a function do-something; the client can call do-something, and it does not matter whether do-something is implemented in package 1 or in package 2. A facade is an interface that provides very high-level functions, for example "add a friend": the Twitter API has a function called add friend, to add a friend on Twitter, and the implementation is unknown to the people using the facade. We don't care whether Twitter is using C++, or R, or Java to add a friend to my friends list, but the behavior is fixed: if I call add friend with a certain username, then this username will end up on my friends list. The behavior is defined, and that is how Twitter can support many different apps while internally still being able to update their code, make things more efficient, and change things: the behavior is defined, but it is not specified how it should happen. So the advantages of using a facade are that it provides a stable interface for external tools and applications, and that it is easy to change the implementation behind the facade. It doesn't matter if it's written in Java, as long as calling the function in the end puts the friend on my friends list; that's the goal. The behavior is defined, but the implementation is unknown to me as a person outside of Twitter.
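A minimal sketch of a facade in R; the function names are made up, but the idea is that the caller only ever sees add_friend, and the implementation behind it can change at any time:

```r
# A minimal facade sketch (all names are hypothetical). The caller only
# ever sees add_friend(); how it is implemented can change at any time.
add_friend <- function(username) {      # the facade: stable, high level
  .store_friend_v2(username)            # today we call implementation v2...
}

.store_friend_v2 <- function(username) {
  # ...tomorrow this could be rewritten against a different back end,
  # and no caller of add_friend() would notice
  message("Added ", username, " to your friends list")
  invisible(TRUE)
}

add_friend("nicholas")   # behavior is fixed; implementation is hidden
```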
Then there are the schedulers. A scheduler, again, is a design pattern: a way to schedule certain tasks within a certain amount of time, and these are four of the most frequently used schedulers. The first scheduler is called first in, first out, FIFO, which is the way that a supermarket stacks its products: FIFO means that you queue processes on your CPU in the order in which they arrive. So I click on PowerPoint, the computer starts executing the PowerPoint program, and at a certain point PowerPoint starts up; then I click on Excel, and it starts starting Excel. Of course, this does not work very well when you have things with deadlines. When there are deadlines, people generally use earliest deadline first: when I have a bunch of tasks and I need to decide which one to run, I select the one whose deadline comes up earliest. Earliest deadline first is generally how students learn for exams: imagine you have an exam period coming up and you have to do ten exams in one week; which one are you going to study for first? The one exam that you have on Monday, and after you've done the exam on Monday, you're going to study for the exam that you have on Wednesday. So earliest deadline first is, in my opinion, very close to how students learn for exams: they look at their list, see which deadline is earliest, start learning for that one, and once they've learned everything, they move on to the next exam, and then the next one. Then there is the shortest remaining time scheduler: a strategy where the scheduler arranges the processes with the least estimated processing time remaining to be next in the queue. Back to the student example: the student, faced with having to learn for ten exams, will look through all of the exams that are coming up and take the exam for which he has already read most of the book. So you take the book that you've already read for 90% and read the remaining ten, then you take the book where you've already read 50% and read that one until you're done, and only then do you take the book that you never opened and start learning from that one. And then we have round-robin scheduling, which is actually a pretty good way to learn for an exam: I'm going to study for exam one for an hour, then for exam two for an hour, and then I start with exam one again, an hour, exam two, an hour; you just have a fixed amount of time and you cycle through all of the different things that you have to do. Alright, different schedulers; scheduling is a very, very solved problem, there's not a lot of progress being made in this field, but it's a very interesting pattern. Good, so it's four o'clock, and I still have about 17 slides remaining, so that's good for you guys, because you get another break. I actually remembered that the next break is going to be squirrels, I think, and I don't know if we already did squirrels before; if we did, then I'm very sorry, but I just like squirrels, and I saw one yesterday, so I thought, why not do squirrels for the break? So we're going to have a 5 to 10 minute break again, you guys enjoy the animated GIFs, and of course we will have some music as well; the music is going to be "hard burn". Enjoy the squirrels, and I will be back in 5 to 10 minutes. ...And the song always ends before I think it will end. Alright, so welcome back, or rather: welcome to those who are still here, thank you for still being here. I wanted to talk a little bit more about functions, because they are important. I think I already showed you guys this slide: in my mind, functions are factories. They can contain many variables inside of them, and these variables are not visible from the outside, which allows you to reuse variable names in different parts of the code. That is good, because otherwise you end up with a whole bunch of ever-longer variable names. This is called the scope of a variable: if a variable is defined inside of a function, it is only visible inside of that function, and it doesn't exist outside; the scope is limited to the function. Then there is also the parent scope: those are variables which are defined outside of the function, which can still be accessed from within the function's scope. And I wanted to talk a little bit about escaping your own scope: functions that read variables outside of the function, or that write variables outside of their scope, because generally this is considered very bad practice, since it creates side effects. I think I already showed you guys this slide, where we have a function with a certain input parameter; the input parameter is raised to the power of 2, stored in a variable called intern, and then we return the value of the internal variable to the outside world. When we call this function with the number 5, it will return 25, and if we then type intern, it will say that object intern is not found, because it does
not exist after exiting the function: intern is not visible from the outside, and that is the scope. However, we can access variables in our parent scope: the scope outside of the function is readable. Generally this is bad practice, but sometimes we need to do it, for reasons. For reading from your parent scope there are actually three possible reasons, and the first one is not a good reason: we are just being lazy. Being lazy is not an excuse to read something which is not an input parameter of the function. But in R, the main real reason is that, since function parameters are copied into the function, we may want to save random access memory: if we have a matrix which takes 7 gigs of memory, we don't want to pass this matrix to a function, because at that point, instead of 7 gigs, we will use 14 gigs, since R will make a copy of the whole matrix when you give it to the function. Another reason to read or write your parent scope is when you have plot functions, because plot functions need to read environmental settings: they need to know, for example, which color is the background color, which color is the current plotting color, which symbol is the active plotting symbol. These things it needs to be able to read, and they cannot all be input parameters, because then the plot function would have an enormous number of input parameters that you would all need to specify. So here is a bad example, the just-being-lazy case: we have the same function as before, but now we don't raise to the power of 2, we use a variable called exponent that lives outside the function. This works in R. The problem is that when we call the function with the parameter 5, and exponent was earlier set to 5, we return 3125; if exponent was set to 2, the same function call with the same input parameter would all of a sudden return 25. This is very bad practice, so don't do this. In this case, the thing you should have done is make the exponent to which you want to raise the input parameter an input parameter itself: this function should have had two input parameters, one being the input value and the other being the exponent.
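Roughly like this; the names follow the slide as far as I remember it, so treat them as approximations:

```r
# The lazy version from the slide, roughly: the function silently
# reads 'exponent' from the parent scope.
exponent <- 5
some_function <- function(in_param) {
  intern <- in_param^exponent      # bad: depends on a variable outside the function
  return(intern)
}
some_function(5)                   # 3125 now...
exponent <- 2
some_function(5)                   # ...but 25 after someone changes 'exponent'

# The fix: make the exponent an explicit input parameter
some_function2 <- function(in_param, exponent = 2) {
  in_param^exponent                # no hidden dependencies, no side effects
}
some_function2(5, exponent = 5)    # always 3125, whatever the environment holds
```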
Here is an example of when we do want to read from the parent scope, or in this case, write to the parent scope. Imagine that I have a big matrix: a matrix of 10,000 by 100,000 random numbers, so on the order of 10^9 values, which is the roughly 7 gigs of memory I mentioned. Now, when I want to transform part of this matrix, for example I want to write a function which takes a certain set of rows and applies a certain transformation, say the square root, to those rows, then I can do this efficiently by not passing in the matrix. What you are seeing here is that I just say: read the rows that I want from the big matrix, apply the transformation, and then put the result back into the big matrix at the rows that we read from. And you see this double-headed arrow, which means: assign this not in the internal scope of the function, but in the parent scope, on the outside. This updates the big matrix, but the big matrix is not an input parameter, so it is not copied, so it won't start using 14 gigabytes of memory. Compare this with the inefficient function, where we have M, the matrix that we want to transform, plus the rows and the transformation like before, and we transform M by taking the rows, putting the result back, and then returning the part of the matrix that we modified. That call will more or less duplicate the big matrix: it will copy the whole matrix and use 14 gigs of memory instead of 7. So if you are dealing with massive matrices: massive matrices should not be input parameters to functions, that will slow down your code significantly. Use, in such a case, the efficient version and update things outside of your scope. This is very rare, though, because you don't generally have matrices which take 7 gigs of memory, but in R, because of the way R works, you can save a lot of memory by using the double-headed arrow. And we can also do the opposite: we can reach out of our own scope and create or update variables on the outside. Possible reasons to do this are, again, to not copy but manipulate one object, and to return multiple objects: we could assign a variable that did not exist before to the outside if we have two values that we want to return. But this falls into the being-lazy category again, because although R only allows you to return a single object, if you want to return two objects, instead of using the double-headed arrow to assign one of them to the outside and returning the other, you could just have used a list: in R, a list can contain anything, so you put the two objects in a list and return the list. A legitimate use is, again, the environmental settings for upcoming plots, where you use this double-headed assignment operator to create or update a variable outside of your function. And here is more or less the only time when I use the double-headed arrow myself. I have a matrix and I have a count variable, and I apply a function across the rows of the matrix, so this function gets the first row, the second row, the third row, in order, and I, for example, calculate the mean. Sometimes this might not work: all of the values in x might be missing values, and then you cannot calculate the mean. If there is no result, you want to report at which row of the matrix you detected the problem, because imagine, like the example where we were working with 50,000 probes on a microarray: if I do all of those t-tests, some of them might throw an error because, after removing missing values, there is not enough data left in one of the groups, and you want to know at which line this actually goes wrong. So what I do is keep an additional count variable, which counts the current row of the matrix that I am on, and since this is a function which I apply, it needs to update this count variable on the outside of the function. Because if I would just say count <- count + 1, the assignment would happen inside the function's own scope, so the outer count would never change, and every call would see the original value. So I update the count variable outside, with the double-headed arrow, from within the function that I am applying. This is a very common pattern in R: you update a variable outside of the function to keep track of the row number you are on, because the way the apply function works, you only get the row contents, you do not get the number of the row you are currently at, so you have to keep track of that yourself.
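Here is a small sketch of that pattern; the matrix m and the all-NA row are made up, just to force a failure case:

```r
# A sketch of the row-counter pattern: 'm' is a made-up matrix, and
# row 4 is set to all NA so that one row produces no result.
m <- matrix(rnorm(30), nrow = 10)
m[4, ] <- NA

count <- 0
row_means <- apply(m, 1, function(x) {
  count <<- count + 1                    # '<<-' updates count OUTSIDE the function
  if (all(is.na(x))) {
    cat("No result at row", count, "\n") # now we know where it went wrong
    return(NA)
  }
  mean(x, na.rm = TRUE)
})
```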
Good, so recursion. Recursion is one of my favorite subjects. In essence, recursion is a function that calls itself, and the nice thing about recursion is that there is a simple base case, or several base cases, where the answer is directly clear. If we think about the Fibonacci sequence: the Fibonacci sequence starts with one and one, and then the next number, so the third number of the sequence, is the two preceding numbers added together. So the third Fibonacci number is 2, the fourth is 2 plus 1, which is 3, the fifth is 3 plus 2, which is 5, the sixth is 5 plus 3, which is 8, and so on. But it always hinges on the base case: recursion always starts with defining a base case, the simplest case where we directly know the answer. For example, take this little example, sum up: I'm going to sum up all of the numbers until x. If x is 1, then I know what the sum up is going to be, because 1 is summed up with nothing but itself, so if sum up is called with the number 1, I want to return 1. Of course, I cannot sum up negative numbers (well, I could in theory, but in this case I don't want to deal with x being smaller than or equal to 0), so there I'm just going to throw a stop error. But then, if I want to sum up the number 2, I want to do 2 plus 1; sum up of 3 is 3 plus 2 plus 1; sum up of 5 is 5 plus 4 plus 3 plus 2 plus 1. So how does this work? Well, sum up is going to be a function of x, and I'm going to say: if I have to sum up a number which is higher than 1, I return the current number that I'm looking at, plus the sum up of all the numbers before this number, so sum up of x minus 1. So here you have the function sum up, and it calls itself. x is called the invariant, the recursion invariant, because x, in every recursive call that we do, moves towards the base case. The base case being 1: any number larger than 1, if I subtract 1 from it, gets closer to 1, so x is the invariant. And we have rules that reduce all other cases towards the base case, so we always reduce towards the fixed case that we have. So if I call sum up of 7, how does this work? It works like this: sum up of 7 is defined as 7 plus the sum up of 6; sum up of 6 is defined as 6 plus the sum up of 5; sum up of 5 is 5 plus sum up of 4; and so on, until we reach sum up of 1. This is the base case, and now it will collapse: we end up with a single row of additions. All of these intermediate numbers are kept on the stack; they are in memory, but not as normal R objects: they live on the function call stack. This has to do with how computers work and how stack frames are pushed. But recursion is a very elegant way to compute complex, well, not so much complex numbers, but complex functions, like the Fibonacci sequence. Is this clear?
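Written out in R, the sum up function from the slide looks roughly like this:

```r
# The sum-up function from the slide, roughly as described:
sum_up <- function(x) {
  if (x <= 0) stop("x must be positive")  # refuse input we cannot handle
  if (x == 1) return(1)                   # base case: the answer is known
  x + sum_up(x - 1)                       # x is the invariant, moving towards 1
}
sum_up(7)   # 7 + 6 + 5 + 4 + 3 + 2 + 1 = 28
```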
Because I really want you guys to learn about recursion, because recursion is very nice, and very close to the mathematical definition of things. So let me actually grab one of my functions. We have Fibonacci, right? Let me see whether they have a nice mathematical definition on Wikipedia... no, not really, that's all way too mathematical for what we want. Anyway, let's do a Fibonacci. Since I still have about 30 minutes and only about 10 slides left, let's do a quick Fibonacci example for you guys. So let's go to R and to Notepad, and I'm going to save this as fib.R. So Fibonacci is a sequence like this, and the way that I calculate the next number is by adding the previous two numbers together: the next number is going to be three, because that's one plus two; then the next number is going to be two plus three, which is five; the next number is going to be five plus three, which is eight; then the next one is five plus eight, which is 13; and then the next one is 13 plus eight, which is 21. This is a very well-known sequence; it has to do with the golden ratio. The way you can write a function for this is: Fibonacci... and that's not how you write it, because it's with double N, I think, and double C. So fibonacci is a function, and X here is the position of the Fibonacci number that I want: this is the first one, the second one, the third one, fourth, fifth, sixth, seventh, and eighth, so I'm going to input X as the position of the number that I want to calculate. Now I can say: if X is one, then I know what I need to do, I need to return the number one, because the first number is one. And then the next base case, because Fibonacci has two base cases: if X is two, I need to return a one as well, because the second number is also one. And then, if the number is not one or two, what do I need to do? Well, I'm just going to return the Fibonacci of X minus one plus the Fibonacci of X minus two. Exactly the way you would write down how this sequence of numbers is made. So if I now ask for Fibonacci of seven, this should give me 13. So if we go to R, let me actually put some hashtags in front of this... if we go to R, and I hope I did everything correctly, where's my R window? There's my R window. Oh, Fibonacci... is this not... oh, right, I forgot to close the bracket, I think. Fibonacci of one... oh, could not find function. Let me see, why is that? Let me show you guys Notepad as well while I'm trying to debug why this thing is not working. Let me just remove the return statement and see what it is complaining about. Ah, I already see it: it's the greater-than symbol, I should have made it the assignment arrow, like this. And let's put the statement back. So, just a little debugging; you should never do live demos, but whatever. So the Fibonacci of seven will call the Fibonacci of six plus the Fibonacci of five; the Fibonacci of six will call Fibonacci of five plus Fibonacci of four. So it will expand out in memory.
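For reference, the function as we just built it live, cleaned up:

```r
# The recursive Fibonacci as built live, cleaned up:
fibonacci <- function(x) {
  if (x == 1) return(1)                  # base case 1: the first number is 1
  if (x == 2) return(1)                  # base case 2: the second number is 1
  fibonacci(x - 1) + fibonacci(x - 2)    # exactly the mathematical definition
}
fibonacci(7)   # 13
```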
And this is a really, really nifty way, because you have a lot of these interesting formulas which are defined in terms of: the nth number is the number before it combined with the number before that. That is a perfect example for recursion. And the same goes, for example, for the sum up function that we saw, because sum up of seven can be defined in terms of the number seven plus the sum up of the number before it. Does this also work for a hundred? Let's see. In principle you can run into the recursion limit, because R only has a fixed recursion stack, so the stack will grow. But the nice thing is that it does not start exploding your memory: if I look at my R GUI, the memory use is constant, it does not allocate anything extra, it's just using CPU power at this moment, because the intermediate values live in the function calls. This might actually take a while. But hey, if I look at my task manager at this point, it's not growing in memory, it's just burning CPU, because it keeps the intermediate state in new function calls, and function calls themselves are relatively, how do you call it, relatively short-lived. So we'll let this run and continue with the presentation, and when it finishes I will show you guys what the 100th Fibonacci number is. If it actually finishes, because there is a recursion limit, right? So when we talk about recursion: you have to handle weird and unexpected input. What we did in the sum up function is say: I'm only going to deal with positive numbers. You have to handle weird and unexpected input, because you don't want to create infinite recursion; there's something on the next slide about infinite recursion. You make sure that every time your function calls itself, if your function is defined in terms of x, the recursive calls always either decrease or increase towards your base case: you need to check that your invariant is moving towards the base case in a way that makes sense. There is a recursion limit in R: by default the evaluation can nest about 5,000 calls deep, and you can raise that limit yourself with options(expressions = ...), up to 500,000, before R gives you the infinite recursion message. So that is flexible: you can say, no, I'm allowing more nested calls. Every time we call our function, a new function call is created, and the internal variable, in the case of sum up of 7 this number 7, is stored in that call's own frame on the stack, not as a normal object in your workspace. I'm still watching the Fibonacci run, and it is still using exactly 320 MB of memory, so no additional random access memory is being allocated. The call stack itself also has a fixed size, on the order of a handful of megabytes by default, which is why very deep recursion eventually runs out of stack space rather than RAM, even on my machine with 16 or 32 gigs of random access memory. But the flip side of recursion is that you can write infinite recursion, which is a disadvantage, so you have to avoid that.
So for example, if I take my sumUp function from before and I do not check for negative numbers: sumUp(7) will still work, but if I do sumUp(-7), it will go minus 7, minus 8, minus 9 and just continue on infinitely. At a certain point it has done too many nested calls and R will say: evaluation nested too deeply, did you create an infinite recursion? That is why we want to handle negative input. Negative input in this case needs to either be rejected, or, if we get a negative value, we should sum down instead, towards a base case. It's actually still running the Fibonacci of 100, which is quite interesting. I was thinking about how many calls R is going to do; let me open up a new R window. My first guess was something like 100 to the power of two calls, so 10,000, which would be nothing. But it is actually far worse than that: every call spawns two more calls, so the total number of calls grows exponentially with x. Note that the recursion only nests about 100 levels deep here, so we will never hit the recursion limit; it will simply take a very long time. Let's see if it finishes before we end the lecture. So: handle weird situations. If you have a recursion which depends on the previous numbers, and someone starts you off with a negative number, you just throw a stop() error and say: no, I cannot handle this. You don't want to wait half an hour only for R to tell you that the evaluation is nested too deeply and ask whether you created an infinite recursion. There's also something called indirect recursion, because it's not always clear whether a function you are calling calls another function in turn. Instead of a single function sumUp, I now create a multUp function: sumUp calls multUp, and multUp calls sumUp. Two functions more or less pointing at each other. In this case, if I do sumUp(7), what happens? It does 7 plus multUp(6), and multUp(6) is defined as 6 times sumUp(5), and so on. But now we have to make sure that we handle the input in both functions and that both of them have a base case. Let me see: feedback hub background task, what on earth are you doing? I'm just going to end you, because I'm already using a lot of CPU on the Fibonacci of 100. So indirect recursion is when one function calls another function, which in turn calls the original function again. It's like the Spider-Man meme with the two Spider-Mans pointing at each other, and you have to make sure that you still get a reduction: the sumUp function reduces via multUp, and the multUp function reduces via sumUp. Good. Indirect recursion, a really useful feature as well. And again, it is very similar to how you would write down a mathematical function: when x is 1, return 1; when x is 2, return 1; and otherwise, for every x larger than 2, the value is defined in terms of the same function at x minus 1 and x minus 2, for example.
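To pin that down before we move on, here is a minimal sketch of such an indirectly recursive pair, assuming names along the lines of what I said (sumUp and multUp; the slide version may differ):

# Indirect (mutual) recursion: two functions that call each other.
# Both need a base case, and both should reject input they cannot handle,
# otherwise the pair recurses forever.
sumUp <- function(x) {
  if (x < 1) stop("sumUp() needs a positive number")    # guard against infinite recursion
  if (x == 1) return(1)                                 # base case
  return(x + multUp(x - 1))                             # reduces via the other function
}

multUp <- function(x) {
  if (x < 1) stop("multUp() needs a positive number")
  if (x == 1) return(1)
  return(x * sumUp(x - 1))                              # calls back into sumUp()
}

sumUp(7)      # terminates, because x keeps shrinking towards 1
# sumUp(-7)   # would throw an error immediately instead of recursing forever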
Good, so why is it useful to do recursion? Well, in many programming languages it leads to much cleaner code. If we look at this little piece of code that I wrote for the Fibonacci sequence: if I wanted to find the eighth Fibonacci number with iteration instead, that would be much more code. Off the top of my head I wouldn't even know how to write a for loop which computes the Fibonacci numbers; you would probably go forward and compute all of them in order. So you would say for... yeah, let's see if we can actually figure one out. fibIter, for an iterative Fibonacci, is a function of x. And now we're going to say... so for i in 1... or no, while n is not x, so I'm going to say n equals 1, because we're going to start at 1, and then we would do something like: if n is 1, then I need a variable which will store my Fibonacci numbers... Honestly, I wouldn't even know how to write the iterative one just like that. You probably can write an iterative version; I'm thinking: n is the current Fibonacci number that I'm looking at, and if that's not equal to x... it's going to be a difficult one, a tricky one. I will think about writing an iterative version of Fibonacci during my short, or rather three-day, weekend. Of course, if you guys want to give it a go, you're more than welcome. It's actually still computing the 100th Fibonacci number, by the way. I don't think you could do it much faster with a while loop, and it will definitely be a lot more code, because you have to keep track of state: what is the current position I'm looking at, what is the Fibonacci number I had before, what are the two previous numbers that I have to add together. That's probably the way to do it, to just keep updating those two numbers. I'm really thinking now. I might do it after five: when we finish the lecture, we'll write an iterative Fibonacci computation. I'm intrigued now. So: recursion gives cleaner code. There is less management of state; I don't have to remember the previous two numbers, because R will do that for me. And it's often more in line with how people think about problems. In my mind, when I read recursive functions, I generally grasp much quicker what is going on than when I read iterative functions. And it is sometimes more efficient than iteration, especially when we are dealing with data structures based on trees. For example, if I have a tree structure with a root and branching nodes, which happens a lot in phylogeny, then the dendrapply function is called recursively for every node in the tree we are currently looking at. So it is sometimes more efficient than iteration. I wouldn't say always, but if you are dealing with a tree data structure, like a balanced tree or a red-black tree, recursion is usually the much more natural and efficient fit. All right, so when we talk about functions, we already saw this dot-dot-dot parameter. It allows you to pass arbitrary function parameters, and it provides loose coupling between two functions. For example, imagine that we have a function A which has 10 parameters, and a function B which calls function A and passes parameters along. If function A gains another parameter, function B does not have to know about it when it uses the dot-dot-dot construct. And if the names of function A's input parameters change, function B also doesn't really need to know about it, because the two are loosely coupled through the dot-dot-dot.
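To make that A-and-B coupling concrete, here is a minimal sketch with two hypothetical functions, funA and funB, invented purely for illustration:

# Loose coupling via dot-dot-dot: funB passes anything it receives through to funA.
funA <- function(x, a = 1, b = 2, c = 3) {
  x * a + b ^ c
}

funB <- function(x, ...) {
  funA(x, ...)    # funB does not need to know funA's parameter list
}

funB(5)           # all defaults: 5 * 1 + 2 ^ 3 = 13
funB(5, b = 10)   # b travels through the dots: 5 * 1 + 10 ^ 3 = 1005
# If funA ever gains an extra parameter d, funB keeps working without any change.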
So how does this work? Well, say I have a very complex function: a function with four input parameters which does something with those four parameters, and only one of the input parameters has no default value. I can then define another function, called usingDots, which takes an x and then an additional, unknown number of parameters; those parameters do need to be parameters of the complex function. When I define my own function like this, using the dot-dot-dot, this usingDots function allows four different parameters to be specified: I have to specify x, because it has no default value, but I can also specify a, b, and c. And if the complex function gained a parameter d, then all of a sudden my usingDots function would also accept a variable d being passed in. The dot-dot-dots are used a lot when you write plotting functions, because plotting functions generally have a whole list of parameters that you could provide but usually don't, because the default values are good enough. The dot-dot-dot is a very common way of coupling two functions together in a very loose way via their input parameters. Good, then a few words about higher-order functions. A higher-order function is a function which allows another function to be passed into it. A simple example is do2. Here we have a function which we call do2: it takes a value x and a function, and the only thing it does is call the function we provided with the value x. So now we can say: do2 of this list of values and the function mean, and it will call the mean function on those values. This is very useful, and we can check the type to decide whether what was passed in really is a function: when we write a higher-order function, we can use the class function to check that, and otherwise throw an error. In the slide version there is also a condition: if x is smaller than 10, we apply the function to x, store the result in x, and return x. As for the class of these things: in R, a function is a variable, a special variable whose class is function. So in the R type system there is another type, and this type is called function. And we've already seen a lot of these higher-order functions, because we already saw the apply function and we already saw the lapply function, so they are used a lot. To recap: recursion is a function that calls itself, indirect recursion is a function that calls itself indirectly via another function, and a higher-order function is a function which takes a function as an input parameter. Good. I've been saying the word function a lot, so I hope it's clear. Higher-order functions are really useful when you want to provide code which sometimes needs to calculate the mean, sometimes the median, and sometimes the standard deviation, because you can make the thing you are going to compute an input parameter of your function itself. It doesn't happen that often that you have to write higher-order functions yourself, but they are very useful at times, and they can remove a lot of duplicated code just because they can take a function as an input parameter.
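A minimal sketch of that do2 idea; I'm reconstructing it from my description, so the slide version (with its x-smaller-than-10 check) may look slightly different:

# A higher-order function: takes a value and a function, and applies the function.
do2 <- function(x, fn) {
  if (class(fn) != "function") {    # is.function(fn) would work here as well
    stop("fn must be a function")
  }
  fn(x)
}

do2(c(1, 5, 9), mean)      # 5
do2(c(1, 5, 9), median)    # 5
do2(c(1, 5, 9), sd)        # 4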
All right, that's it. So for today: we did the answers to lecture six, I told you that practicing is the only way to learn how to program, and I showed you guys that there are more exercises on Moodle. We talked a little bit about theory: algorithms as a concept, and how an algorithm relates to... not a flow chart, I forgot the word for a moment. I'm out of sugar, I'm really out of sugar; it's almost five and I've been talking for three hours. Not flow charts: state diagrams. An algorithm can be represented as a state diagram. Besides that, we talked a little bit about design patterns. Design patterns are more or less common answers to common problems: solutions that have been shown to work in practice. A design pattern is not an algorithm. If an algorithm is a cooking recipe, then a design pattern is... how would I call that? That's not a good analogy, I think, but a design pattern is a way to solve a commonly occurring problem which is not an algorithm itself; it sits more or less a level above an algorithm. We talked about recursion: a function which calls itself. And we talked about higher-order functions: functions which take a function as an input parameter, which is very useful when you want your code to behave dynamically, because depending on the function you pass in, your function can do something completely different. It's dynamic in that sense. Good. I really should have had more sugar before we started the lecture, because I'm all out. If there are any questions, ask them now or forever be silent, or something like that. In the meantime, I am going to write that iterative version, because I really am intrigued whether you can do it with iteration. 'You still look sweet though.' Oh, thank you very much. But I am getting really tired; this morning was hectic as hell, and before I started the lecture I was still sending emails and these kinds of things. But are there any questions? Do you guys think I should do this lecture again? It doesn't really fit into our programming language as such. I generally do this lecture so that I have a couple of things I can ask on the exam which are not programming assignments, so that I can ask you things like: explain recursion in your own words, or draw a little state diagram of this for loop with an if statement. I have this whole lecture here partly to give myself some freedom in arranging the exam questions, so that next to the applied questions there are also some more theoretical ones. It's still running, by the way, Leonardo; it's not very happy about the 100th Fibonacci number. So let's see. If I think about Fibonacci with iteration, then that is a function where I want to have the number x, and then I just want to say: pn, the previous number, is 1, and ppn, the number before the previous number, is also 1. Now, if x is 1, we return the previous-previous number, and if x is 2, we return the previous number. And then, while we're not done counting down, what do we want to do? Good question, what do we want to do? We first want a temp variable: tmp is the previous number plus the previous-previous number.
Then everything shifts one position: the previous-previous number becomes the previous number, the previous number becomes the temp number, and then I say x equals x minus 1. And because of the two base cases, if x is 1 and if x is 2, I have to start the countdown correctly, so here we say x equals x minus 2 up front, and then I loop while x is larger than 0; when x reaches 0, the Fibonacci number should be sitting in the previous number, I think. Something like that. 'This lecture could be part of the introduction.' Yes, it could definitely be part of the introduction, but the problem there is that the concepts are relatively advanced. The introduction is more about telling you guys about the history, about why it's good to use computers, and about showing a little bit of the look and feel of R. If I started the introduction by talking about recursion and functions... I think functions are among the hardest parts for people to grasp. There are certain things that are really hard in programming. The first thing which is hard is wrapping your head around the type system: different objects have different types, you have numerics, characters, logicals. The next big hurdle is learning if statements, because that always turns out to be really difficult for people to grasp; but once they grasp it, it's good, and then you can write your own if statements, no problem. The next level of difficulty is more or less writing for loops and while loops. The iteration part is really, really hard to get your head around: doing the same thing over and over again, but every time moving one column or one row in a matrix. And then you have functions, because functions encapsulate for loops and while loops; they have their own scope, they have all of that machinery attached to them. So I do think it could be part of the introduction, but it would be really hard. Let me actually... let's go to 15. So let's do the first 15 Fibonacci numbers: compute them with iteration, duplicate that with the recursive Fibonacci, where we definitely know that it works, store one set in A and the other in B, and then check that A is equal to B. Unfortunately, I now have to open a new R window, which I then also have to put into the display, because I'm just wondering if this will work. The idea here is that you store two numbers: we have pn, which is the previous number, and ppn, which is the number before that, and I just keep moving them along. I have a temp variable which is the sum of the two, and then I shift: the old previous number slides back to become the previous-previous one, the temp becomes the new previous number, and then I count down. I think this should work, but I'm curious to see it run. So I'm going to add a window capture, not this one but the R window, this one, put it on top, and just see if it works, right?
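Here is a cleaned-up sketch of roughly what I had in Notepad at this point (reconstructed, so variable names and the exact countdown may differ slightly from what is on screen):

# Iterative Fibonacci: keep only the previous two numbers and shift them along.
fibIter <- function(x) {
  if (x < 1) stop("x must be a positive whole number")    # guard the input
  pn  <- 1    # previous number
  ppn <- 1    # the number before the previous one
  if (x <= 2) return(1)    # the first two Fibonacci numbers are both 1
  while (x > 2) {
    tmp <- pn + ppn    # the next Fibonacci number
    ppn <- pn          # shift: previous becomes previous-previous
    pn  <- tmp         # the new number becomes the previous one
    x   <- x - 1       # count down
  }
  return(pn)
}

fibIter(7)              # 13
sapply(1:15, fibIter)   # 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610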
So I'm just going to copy-paste in my code, and surprisingly enough, I think this actually works. This wonky piece of code that I wrote just off the top of my mind, not this one but this one, actually seems to be properly calculating the Fibonacci numbers. I'm really surprised; I actually thought there was going to be something weird. But you can see how much more complex the code is when you do it with iteration. If we look at the R window, all of the numbers are equal, so up to 15 it definitely works; that's what we can conclude, we get the correct Fibonacci numbers from 1 to 15. So let me update it to the first 25 Fibonacci numbers and see if it still works. Still seems to work. Now the question is: does the iterative version compute the 100th Fibonacci number much quicker than the other one? You wanted to know the 100th Fibonacci number, and there it is, almost instantly. So in this case iteration is much, much faster than doing it via recursion, because if we look at the other R window, this one, the recursive version is still busy branching and computing; it does a lot of duplicate work. The iterative version is doing just fine. The question, of course, is whether this is really the correct 100th Fibonacci number. What we do know is that up to 25 both methods give exactly the same answers; the recursive version is just a lot cleaner, instead of having to shift things around with a previous number and a previous-previous number and counting down. This x minus 2 business is also a bit horrible; actually, it could be like this, and then we don't have to do that, I think. And since we want to set both variables to one, we can chain the assignment into a single line. Not bad. I think this works as well, so let me run the code again just to make sure that it's still correct. Seems to be correct, and the iteration is much, much quicker, because the other one is still busy, still trying to figure it out. Anyway, I'm happy with myself. So, a little exercise for you guys: calculate Fibonacci with iteration rather than with recursion, and you'll notice the recursive one is just so much clearer; there it's immediately obvious what's going on. I noted the iterative version down for myself just so that I know what's going on, but this was just the first thing I thought of: you keep a previous number and a previous-previous number, so you need two variables, and then you need a temp one to compute the sum. Or maybe you don't need the temp... no, no, you do, because otherwise the addition would overwrite the previous number before you've shifted it. I'm wondering if I can make this one look a little bit slicker, and of course we have to make sure that we check that x is not negative or zero, because that would just break the whole function. But from the basic test of the first 25 numbers it seems to work, and the recursive version could probably go until about 35 before it really starts grinding to a halt; you can see how slow it is already becoming. By the way, you could do a similar trick with the greatest common divisor: finding the largest number by which two numbers are both divisible can also be done recursively, and that is also a really beautiful algorithm.
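Since I only mentioned it in passing: here is a minimal sketch of that recursive greatest-common-divisor computation. This is Euclid's algorithm; the function name is my own:

# Euclid's algorithm: gcd(a, b) = gcd(b, a mod b), with base case gcd(a, 0) = a.
gcd <- function(a, b) {
  if (b == 0) return(a)     # base case: a divides both a and 0
  return(gcd(b, a %% b))    # the second argument shrinks towards 0
}

gcd(48, 36)    # 12
gcd(21, 13)    # 1 (two consecutive Fibonacci numbers are always coprime)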
And now I'm definitely going to have issues, because now I have three processes on my computer trying to calculate Fibonacci numbers. Anyway, if there are no other questions, it's 5:05, so thank you guys so much for being here and for actively participating. I will definitely take the suggestion to make this part of the introduction into consideration, but I do think it's nice to have a lecture which is not just about programming, but also about a little bit of the theory behind programming. Good, then for me that's it. I wish you guys a very, very good weekend, if you don't have to work tomorrow; otherwise, good luck with work tomorrow and have a good weekend after that. I will see you guys next week. Enjoy the weekend. Bye-bye.