So, welcome back everyone, also on YouTube and on Moodle; like, subscribe, favorite, all those kinds of things, of course. In our example for the linear model, where we try to predict the ozone concentration in the air from the temperature, we first have to calculate some values ourselves. We want to get the margin of error, and for that we need two things. We need the standard error, which we get from the lm summary, and we need the critical value, which we can calculate very easily ourselves. For the critical value we need the probability boundary, which is our alpha, so 0.05, and the degrees of freedom, which is n minus 2. Then we can use qt, the quantile function of the t distribution, to get our critical value. Why n minus 2? Well, n is the number of measurements that we have, and we are estimating two parameters: the intercept and the beta regression coefficient for the temperature. So that means we lose two degrees of freedom. Then we just multiply these two values, the standard error and the critical value, together, and we get our margin of error.

So for our example, I take the summary of the lm for temperature and I store that in a new variable called LM sum, sum for summary. Then I take from the lm summary the coefficients, the temperature row, and instead of taking the beta estimate I take the standard error in this case. The coefficients are just a matrix where in the rows we have the different parameters that we estimate, like the intercept and the temperature, and for each of these parameters we can get the standard error, but also the estimate itself. In our case the standard error, which I save in a variable, is 0.233.
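The whole margin-of-error calculation can be sketched in a few lines of R. This is a minimal sketch, assuming the built-in airquality data set; the variable names (lm.temp, lm.sum, se, crit, moe) are made up for illustration and are not necessarily the ones on the slide. One assumption to flag: the degrees of freedom are taken from the fitted model with df.residual(), because lm() silently drops the rows where ozone is missing.

```r
# Simple model: ozone explained by temperature (built-in airquality data)
lm.temp <- lm(Ozone ~ Temp, data = airquality)
lm.sum <- summary(lm.temp)

# The coefficient table is a matrix: rows are parameters, columns are statistics
est <- lm.sum$coefficients["Temp", "Estimate"]
se <- lm.sum$coefficients["Temp", "Std. Error"]

# Critical value: 97.5% quantile of the t distribution (2.5% in each tail).
# df.residual() counts only the rows lm() actually used (rows with NA are dropped).
crit <- qt(0.975, df = df.residual(lm.temp))

# Margin of error and the resulting 95% confidence interval
moe <- crit * se
ci <- c(lower = est - moe, upper = est + moe)
print(round(ci, 2))

# Cross-check against R's built-in helper
print(confint(lm.temp, "Temp", level = 0.95))
```

The last line is only a sanity check: confint() should give the same two borders as the hand calculation.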
I can calculate our critical value by taking our confidence border, which is 0.975, not 0.95: we want to have 5% error in the whole thing, and that means that there's two and a half percent at the top and two and a half percent at the bottom. For the degrees of freedom we can just say: it is the number of rows in the airquality data set, because that's the number of measurements that we have, and we subtract two. (Strictly speaking, lm drops the rows where the ozone value is missing, so the residual degrees of freedom of the model are somewhat lower than that, but with this many observations the critical value hardly changes.) So our critical value is about 1.98. Multiplying the two together we get 0.46, and then the confidence interval for our temperature estimate ranges from 1.97, which is the estimate minus the margin of error, to 2.89, which is the estimate plus the margin of error. So that means that the real regression coefficient can be as low as 1.97 and as high as 2.89; we are 95% confident that the real coefficient lies between these two borders. All right, I hope that's clear.

Of course, we can also calculate the confidence interval around the regression line, using more or less a very similar structure, and there are two ways of doing this. If I want to get the confidence interval, which is the border that you would normally plot around the regression line, then we can use the easy way, which is just using an external library, in this case visreg, so for visualizing regression. This just has a function which you can use: you say, take my model and do a visreg, so a visualization of my regression model, say alpha is 0.05, and I give it a title, and then I can set some parameters to make it look a little bit more beautiful. But of course I can calculate it myself as well. So how do I calculate it myself?
Well, I can use the predict function. The first thing that I have to do is make a data frame, and this data frame will have one column, called temperature, because temperature is the input to our model. What I say is: predict at each of these temperatures. I start at the minimum temperature, I go to the maximum temperature, and every time I increase my temperature by 1 degree; I just use the seq function to make a sequence which goes from the minimum temperature to the maximum temperature. Then I use the predict function with my model and the range which I just defined; new here is the area where I want to predict, so the x-axis, more or less. Then I say int is "c", because I want to have a confidence interval, and I specify my level, which is 0.95. That is more or less the alpha here: it's 1 minus the alpha that visreg wants. And then what can I do?
Well, I can just say: plot the airquality temperature against the ozone, so put the temperature on the x-axis and the ozone on the other one. I don't have to use the squiggly line, the tilde; I can also just plot the two vectors against each other. Then I first add the regression line, using the new temperatures as numeric and, from conf, which is the prediction, the best-fitting line. I could have used the same strategy as before, taking the alpha and the beta coefficient and just plotting it like that. But everything is stored in this conf object as well: the prediction gives us back the upper bound, the lower bound and the best-fitting line, so I'm just taking it all from the same object. So I'm saying: at the new points, at the temperatures that I want to do the prediction at, give me the best-fitting line, then plot the upper part of the confidence interval, and then plot the lower part of the confidence interval, and I make these both blue. So how does this look?
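The steps just described, building the data frame, predicting with a confidence interval and drawing the three lines, can be sketched like this. It is a sketch under assumptions: the names new and conf follow what was said above, lm.temp and the axis labels are made up here, and interval = "confidence" is simply the spelled-out form of int = "c".

```r
# Simple model on the built-in airquality data
lm.temp <- lm(Ozone ~ Temp, data = airquality)

# Temperatures at which to predict: from the minimum to the maximum, in steps of 1 degree
new <- data.frame(Temp = seq(min(airquality$Temp), max(airquality$Temp), by = 1))

# interval = "confidence" returns a matrix with the columns fit, lwr and upr
conf <- predict(lm.temp, newdata = new, interval = "confidence", level = 0.95)

# Plot the two vectors against each other, then add the line and the band
plot(airquality$Temp, airquality$Ozone, xlab = "Temperature (F)", ylab = "Ozone")
lines(as.numeric(new$Temp), conf[, "fit"])                # best-fitting line
lines(as.numeric(new$Temp), conf[, "upr"], col = "blue")  # upper border
lines(as.numeric(new$Temp), conf[, "lwr"], col = "blue")  # lower border
```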
Well, this looks like this. If I use my own plotting code, then it plots it like this, which looks a little bit crappier than when I use the visreg function; the visreg function just does a nice kind of gray shading for the confidence interval, and then it starts looking like a real plot.

Now you might ask: why are 95 percent of the dots not within the confidence interval? That is because the confidence interval is about the regression line, so about where the average ozone at a given temperature is, and not about where the individual measurements fall. And our model only explains 48 percent: we don't have a model which perfectly explains the relationship between temperature and ozone, it catches only the temperature effect on the ozone. We saw from the summary of the model that this explains only about half of the variance, so there is a lot of scatter left around the line, and many of the points end up outside this narrow band. I hope that's clear, but if you have any questions, just let me know; I can show you in R how to do this as well. This is the way that you can draw confidence intervals around your regression line in R.

All right, a little bit of a word about residuals. Residuals are the error term in the model: the variance which is left after fitting your predictors, after fitting all the individual things that you think might influence the thing that you want to predict. I always think about linear regression as taking a phenotype or measurement which has a certain amount of variance, and this variance gets distributed over the variables of which we say that they might have an influence on the thing that we're interested in. So residuals, in a sense, are a measurement of how well your regression line fits the data; they are a measure of the goodness of fit. With regression, what we want to do is minimize the sum of squares of the residuals. And why the sum of squares? Well, that is because residuals can be positive and negative.
We don't want the residuals to cancel each other out, so we take the sum of squares: when you square a negative number it becomes positive, so there's no difference between negative residuals and positive residuals once we square them. For each of the points we calculate how far it is from the best-fitting line, then we square that distance, and we add it up across the whole line. When we minimize this sum of squares, then (assuming normally distributed errors) we have done a maximum likelihood regression model. There are other types of models as well, where you don't minimize the sum of squares but some other function. But in 99% of the cases, when you're doing basic linear regression using one or two predictor variables, you want to minimize the sum of squares, because you want to minimize the residual error. You want to make sure that as much variance as possible is explained by the predictors; everything which is not explained by the predictors goes into this error term, and the values of these error terms are called the residuals.

So how do we visualize the residuals? Well, residuals are most easily visualized on clean data, so we want to get rid of all of the NA values. That's the first thing that I'm doing. If any of the ozone measurements, so the thing that I want to predict, are NA... I'm going to say: if they are not NA, using the exclamation mark, then which are those? These are the ones which have a real value and not a missing value, and I'm only going to take those. So I'm just going to subset my airquality data set using this statement, saying: take all of the values which are not NA. You always read from the inside out.
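To make the least-squares idea concrete, here is a small sketch on the cleaned data. It is only an illustration under assumptions: the helper sse() and the names aq.clean, fit, a and b are made up for this example. It computes the sum of squared residuals for the fitted line and for two slightly shifted lines.

```r
# Keep only the rows where ozone was actually measured (read from the inside out)
aq.clean <- airquality[which(!is.na(airquality$Ozone)), ]

fit <- lm(Ozone ~ Temp, data = aq.clean)

# Sum of squared residuals for an arbitrary line a + b * Temp (hypothetical helper)
sse <- function(a, b) sum((aq.clean$Ozone - (a + b * aq.clean$Temp))^2)

a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["Temp"]]

sse.best <- sse(a, b)
print(sse.best)          # the fitted line
print(sse(a, b + 0.1))   # a slightly steeper line
print(sse(a + 5, b))     # a line shifted upwards
```

Any line other than the fitted one should give a larger sum of squares, which is exactly what minimizing the sum of squares means.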
I think that's clear by now; we have done a couple of these lectures. This is just making sure that there are no NAs in the ozone column, and I call the result airquality clean.

The first thing that I do is a linear model, because otherwise I don't have any residuals. So I do my model of ozone by temperature using the clean data, the rows without the NAs. Then I'm just going to say: predict the model. When I call the predict function without giving it a range, it will use the points that I had, so it will use the temperatures that were measured to do the prediction on. Before, we did the prediction also for values where we did not have a temperature measured, but in this case we don't care. For each temperature we have a measured ozone value, and now for each temperature we also get a predicted value.

So what am I going to do? Well, I'm going to plot the airquality clean temperature versus the ozone, just like we did before, and of course I'm going to add the regression line to the plot. Then I'm going to go through each of the rows of airquality clean and draw the residual for each of these points. How am I going to do that? I'm just going to use the lines function. The line is at position x, which is the temperature: the start of the line is at the temperature where we've done the prediction, and the end of the line is also at this temperature, because it's a straight line up, so the x position of the start is the same as the x position of the end. The y position goes from the ozone concentration, so the real measured value, to the predicted value at i, using a nice blue color. So how does this look?
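A minimal sketch of the residual plot described here could look as follows. The names aq.clean, fit and pred and the axis labels are assumptions made for this example, and abline() is used as one possible way to add the regression line.

```r
# Clean data: only rows with a measured ozone value
aq.clean <- airquality[which(!is.na(airquality$Ozone)), ]

fit <- lm(Ozone ~ Temp, data = aq.clean)

# Without newdata, predict() uses the temperatures that were measured
pred <- predict(fit)

plot(aq.clean$Temp, aq.clean$Ozone, xlab = "Temperature (F)", ylab = "Ozone")
abline(fit)  # regression line

# One vertical line per point: same x at start and end, y from measured to predicted
for (i in seq_len(nrow(aq.clean))) {
  lines(x = c(aq.clean$Temp[i], aq.clean$Temp[i]),
        y = c(aq.clean$Ozone[i], pred[i]),
        col = "blue")
}
```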
Well, here we see the plot. We have the points, which are plotted first, then we draw the regression line, and the blue lines that we add are the lines from the points to the regression line; the length of each line is the residual value. So this is, more or less, how you visualize the residuals. Of course you don't have to visualize the residuals like that; you can also just plot them without having them go to the regression line. But this is a way to visualize your residuals, and it helps you to understand how much variance there still is in your data after doing your prediction: the smaller these lines, the more tightly these dots sit around the regression line, the better your model fits. The best model is a model where there's almost no difference between your best-fit line and your measured data points.

All right, so these are more or less the things that you do when you do single linear regression, because we only have a single predictor variable and a single response variable, the ozone. However, if we look at the airquality data set, we have many more factors that might influence the amount of ozone. It's not just the temperature which affects the ozone; it might also be that the wind affects the ozone concentration, or the solar radiation. If we look at the head of the airquality data set, we see indeed that there are many things which might have an influence. The month in which you are measuring the ozone concentration might also have an influence; there might be some kind of seasonal effect. We can now start to study that using multiple linear regression: instead of using one predictor like temperature, we're going to use multiple predictors. So how does this look in a mathematical model?
So again, here we have the y variable. But now we're not drawing a single straight line; we have a line which is influenced by two factors, so it's not a 2D plot, it's more or less a 3D plot. What do we have? Our ozone concentration is predicted by an intercept, plus our first predictor variable, for example temperature, and a second one, which might be something like wind. And of course we're now estimating alpha, beta 1 and beta 2; this is what the regression model will estimate for us.

If we have more than two variables, we can write this down in a generalized form. That means that we have y, so the ozone, is the intercept, so the ozone value when all predictors are zero, plus the sum from one to n. And n here is actually what was defined as k a couple of slides back: n here is not the number of measurements of ozone that we have, it's the number of unknown coefficients that we want to estimate. Each of the terms follows the same structure: a beta, which is estimated by the regression, times an x i, which is the measurement of temperature, wind, solar radiation or whatever we want to use as the predictor. So when we model ozone as a function of temperature and wind, we get the following model; if we would write it down in a scientific paper, we would say that ozone is alpha plus beta 1 times the temperature plus beta 2 times the wind.

In R, not much changes: multiple linear regression in R is very similar to single linear regression. We just use the plus symbol to add another explanatory variable. So again we do a linear model, ozone is predicted by the temperature plus the wind, we use the data set airquality, and of course I first want to look at the summary. What we see now is that everything starts changing.
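In R this is a one-line change compared to the simple model. The sketch below assumes the column names of the built-in airquality data set (Ozone, Temp, Wind); the variable name lm.tw is made up for illustration.

```r
# Multiple linear regression: the plus symbol adds a second explanatory variable
lm.tw <- lm(Ozone ~ Temp + Wind, data = airquality)

# Full summary: residual quartiles, estimates, standard errors, R-squared
print(summary(lm.tw))

# Just the three estimates: alpha (intercept), beta 1 (Temp) and beta 2 (Wind)
print(coef(lm.tw))
```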
One of the things that we see is that this model explains more variance. The previous model explained 0.48, so 48 percent of the variance; this model already explains 57 percent of the variance in the ozone concentration. Again we get estimates. The intercept is now at zero degrees Fahrenheit with no wind: if the temperature is zero and the wind is zero, then the estimate of the intercept is minus 71. The temperature now has a different estimate; the influence of temperature has changed. Before it was 2.4, and now it is estimated to be 1.84. The wind has a negative effect on ozone: for every one unit of wind increase, the ozone decreases by around three points, while for every temperature increase of one degree Fahrenheit there will be 1.8 points of increase in the ozone. All of these factors, we can see here, are highly significant.

So the question is of course: is this a better model? Well, one of the things that we can see is that the median of the residuals actually became more negative. Previously the median residual was about minus a half, and now we are at minus 2.8. We have to be very careful that the errors are not completely negative or completely positive, because that means that our model is not valid. The median is still not so far from zero, but it's getting further away from zero, meaning that there is something in our model which is not entirely correct. But another thing that we can see is that the distribution of the residuals, which should be more or less normal, so Gaussian, is now more symmetrical, because the first quartile is at minus 13.
The third quartile is at 11. There used to be six points of difference between the two quartiles, and that's only two points now. So the whole thing shifted a little bit more to the negative side, but at least it became more symmetrical, which is of course a requirement for a normal distribution. This is how you can read through this summary.

But of course we're interested in the estimates, because in the end, if we publish, we're going to say something like: using this model, we show that the temperature increases the ozone concentration by 1.8 points for every degree Fahrenheit, while the wind has a negative influence and reduces the ozone concentration by three points for every one-point increase in wind speed. And, like I said, compared to our first model, which explains 48 percent, this model explains 56 percent of the observed variance.

Observe that the estimate for temperature changed: before it was 2.43, now it's 1.84. Is this still within the original confidence interval? We calculated the critical value and the margin of error, so you now see that by including wind the estimate of temperature changed quite a lot: originally we said that the lower bound for temperature was 1.97, so 1.84 is already outside of that. But that is of course what happens when you do multiple linear regression. The direct effect of temperature on ozone changes, because we're now fitting a second explanatory variable, the wind, so the estimate starts changing. That's not bad, because we still need to do a model selection to see which one of these two models is the most fair or most accurate.

Of course, we can again plot the observed versus the estimated values, even when we use two variables. Again we use the predict function with our model, so the linear model of temperature and wind, and now we have to provide a list of values, because we have to say at which temperatures and at which wind speeds we want to have the ozone predicted. Here we see that the observed values are the round circles, and the red triangles are the predicted values. So we first plot the real measurement values, and then we add to the plot, using the points function, in red and using the triangle, the estimated ozone values. And of course we can add a legend on the top left saying that we have observed and predicted values, and we have to specify some of the parameters. You can see that the predicted values already start becoming relatively close to the observed values, so the model seems to be getting better.

We also might have the idea that there is an interaction: it might be that at higher wind speeds the temperature effect is smaller, while at lower wind speeds the temperature effect is bigger. In R we can use the colon for interactions. Mathematically, we have a beta for the direct effect of temperature on the ozone concentration, we have a direct effect of the wind on the ozone concentration, and in a paper you would now write that we have an interaction effect of temperature by wind, and we have to estimate a new beta coefficient for this interaction. But in R we use the colon. So how to write this down in R?
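A minimal sketch of the interaction model, again on the built-in airquality data. The names lm.int and lm.star are made up for illustration; the second call uses R's star shorthand, which expands to both main effects plus the interaction and is shown only as a cross-check.

```r
# Main effects plus the interaction term, written with the colon
lm.int <- lm(Ozone ~ Temp + Wind + Temp:Wind, data = airquality)

# Shorthand: Temp * Wind expands to Temp + Wind + Temp:Wind
lm.star <- lm(Ozone ~ Temp * Wind, data = airquality)

# Four estimates: intercept, Temp, Wind and the new beta for Temp:Wind
print(coef(lm.int))
print(summary(lm.int))
```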
Well, this is the model that we would build if we expect that there is an interaction between the wind and the temperature. What happens now is that we say: give me a linear model where the ozone is dependent on the temperature, the wind, and the interaction between these two factors. Remember that you should not fit an interaction without fitting the main effects. You cannot just say: ozone is determined by temperature colon wind. It allows you to do that, but this is not a valid model. I don't even know if it allows you to do that, by the way, I've never tried it; but if you fit an interaction, you should also fit the main effects, so the direct effects of the two things which are participating in your interaction.

So how does it look now? Well, we look at the summary again. We see that this model again explains more, 0.6261, so it explains 62 percent. We get some new estimates again: the temperature changed, and the wind is now a much, much stronger factor. And we see that there is a slightly negative interaction.
So for increasing wind speeds, the effect of temperature is reduced by 0.22 ozone points for every unit of wind. The interaction of the two things has a slightly negative effect; that means that the stronger the wind, the fewer ozone points should be attributed to the temperature. From this part of the summary, using the t statistic, we can also see that all of these things are highly significantly influencing the ozone concentration. So adding the interaction means that, compared to our first model with 48 percent and the second model with 56 percent, this model again explains more: 62 percent. And of course the estimates for both temperature and wind changed again. It used to be 1.84 points per degree Fahrenheit, but now it seems that for every degree Fahrenheit the ozone increases by four points. So interactions are relatively easy to add to your linear model: just use the colon instead of the multiplication sign in R.

Of course, we can also have another idea if we look at the curve, so if we just plot the values. We might say that, the higher the temperature, the more it seems that there's a kind of curvature in there. If the temperature is relatively low, from about 60 to 80, it seems to be much more flat, and then all of a sudden, above 80, the ozone seems to be rising in a different shape. So we might want to say, well, there's this...

Riskven asks: doesn't adding variables always raise the R-squared? Yes. Well, it might keep the R-squared the same, right?
That is, if there's no effect of your variable. But yes, of course, adding variables always explains more variance, because you just have a new thing to put variance on, so you can always put some of the variance there. We will get to that: we will get to the fact that we need to compensate for adding variables and for the fact that the R-squared changes, but then we're talking about model selection, and we will get to model selection.

First I want to make one more model, because if you look very closely at the data, it seems that there is a kind of curvature in there. And using regression, even basic linear regression, we can also fit that. So what do we want to do? We want to do quadratic regression, because when I look at the plot, I see that when the temperature increases there seems to be this, well, quadratic slope.

A lot of people think that quadratic regression is not linear regression, because with linear regression people always think about a straight line. But that's not what linearity means in the scope of linear models. In the scope of linear models, linear means that the response is a sum of terms, each of which is a regression coefficient times something that we can compute from the predictors: the model is linear in the coefficients, not necessarily in x. A change in one of the variables still always yields a corresponding change in the response variable. So the model here is: the y position of my line is dependent on the intercept plus the regression coefficient times the x values that we use to do the prediction. For example, the ozone is the intercept plus b1 times the temperature. However, of course, for x I can also substitute in x squared.
This is still a linear function of the coefficients. I could even go further and take e, so the base of the natural logarithm, to the power of x, because all of these things are still linear: they all still follow the very same structure that we had before, but instead of using x directly we use x squared or e to the power of x. (If the b1 estimate itself goes into the exponent, so e to the power of b1 times x, then you first have to take the logarithm of the response to get back to this structure.) So quadratic regression is something that we can also do with R, and it's still basic linear regression; we're not doing linear mixed models or other very crazy things. But remember that linear regression does not have to result in a straight line. We can also have a quadratic line, no problem. It's still linear regression, because the model is still a sum of coefficients times terms. Increasing the wind or increasing the temperature will cause a change in ozone, but of course this change does not have to be equal across the whole curve, right?
I could have small increases at the beginning and then very big increases at the end, which fits very well to what we observe. So again, as a mathematical model, this is how you would write it down if you write a mathematics paper. If you would write more or less a biology paper, then you would say something like: we have ozone, which is determined by the intercept, plus beta 1 times the temperature, plus beta 2 times the temperature to the power of two.

In R we need to use the caret for this. I'm not too good with what the accents are called, but we need to use this roof symbol, which on my keyboard is Shift-6, for quadratic regression. But in R you have to remember that when you want to do quadratic regression, or not just quadratic but also regression to the power of three or to the power of four, you have to use I, the identity or "as is" function. Inside a formula the caret has a special meaning, so you use I to tell R that you really mean the temperature to the power of two and that you want to fit the model based on the square. So how do I do a quadratic regression in R?
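A short sketch of why the I() wrapper matters, assuming the built-in airquality data. The names lm.quad and lm.noI are made up for this example; the second model is deliberately written without I() to show what R does with a bare caret in a formula.

```r
# Quadratic regression: I() makes R square the temperature arithmetically
lm.quad <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)

# Without I(), the caret is a formula operator and Temp^2 collapses back to Temp
lm.noI <- lm(Ozone ~ Temp + Temp^2, data = airquality)

print(names(coef(lm.quad)))  # intercept, Temp and I(Temp^2)
print(names(coef(lm.noI)))   # only intercept and Temp: no quadratic term was fitted
```

The second model runs without any warning, which is exactly why forgetting I() is easy to miss.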
Well, imagine that I have the idea that I will model my ozone, and I say that ozone is dependent on the temperature and on the temperature to the power of two. Then you see that I just have to put I around temperature to the power of two, and I fit it in. So now I get a model; this model explains 54 percent of the variance, and you can see the estimates here. There's a massively negative effect of temperature now, and that is because we have this quadratic term: we start at 60 degrees Fahrenheit, and the quadratic term already has a big effect there, because 60 to the power of two is a much bigger number. So here we now say that for every increase in temperature we get a decrease from the temperature estimate. But of course there is an increase from the temperature to the power of two, which compensates for this decreasing line. If we look at the dots in the plot, we know that it always goes up. But the temperature here is estimated negatively, the temperature squared is estimated positively, and these of course cancel each other out, where the positive part from the power of two will be much bigger than the negative part from the temperature.

So this model explains more than the model using only temperature, and the quadratic term is very, very significant. If we write this model down, we would write it as y, so the ozone, is 305.5 minus 9.6 times the temperature plus 0.078 times the temperature to the power of two. Of course, we can plot the regression curve, but now we can't just use the standard lines function. We now have to use the curve function in R. So how do we do that?
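A sketch of the curve plot, assuming the built-in airquality data. Instead of typing the rounded numbers by hand, this version takes the estimates from the fitted model; the names lm.quad and b are made up for illustration.

```r
# Quadratic model and its three estimates (intercept, Temp, Temp squared)
lm.quad <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)
b <- unname(coef(lm.quad))

# Plot the points first, then add the fitted curve to the open plot
plot(Ozone ~ Temp, data = airquality)

# curve() evaluates the expression in x; with add = TRUE it uses the x range of the current plot
curve(b[1] + b[2] * x + b[3] * x^2, add = TRUE, col = "blue")
```

Typing the rounded estimates directly, as on the slide, gives practically the same curve; using coef() just avoids copying errors.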
Well, we can plot the points again. So we say: plot the ozone as dependent on the temperature, using the data airquality. Then we add the curve just by specifying the curve parameters: do a curve of 305.5 minus 9.6 times x plus 0.078 times x to the power of two, and add it to the plot. The nice thing is that the curve function will actually look into the plot that's currently open and take the x range from that plot, so we don't have to specify our x range ourselves. And then we see that this is a line which fits the data really, really well. So it might be that the temperature does have a direct influence, but that a model with this temperature squared is a better model for our situation.

All right, so here's the Latin slide for you guys. We made a lot of models, and, like Riskven already said, and he is completely right, the more variables we add to the model, the better the R-squared, because we just have more things to put the variance on. So which one of these models is true? For that we use Occam's razor. I'm going to read this for you guys, and you can enjoy my crappy Latin: frustra fit per plura quod potest fieri per pauciora. I'm glad that our Italian student is not watching the lecture, because he would just laugh his ass off at me butchering that like a madman. But what Occam's razor says, in a single sentence, is that it is futile to do with more things that which can be done with fewer.

A question: but the R-squared is smaller than before? No, because here we are only adding the temperature, so we have temperature plus temperature squared.
So here we have two variables, while the previous model that we had contains three: temperature, wind and the interaction between the two. We have to compare this fairly. The original model was ozone by temperature, which is 48 percent, and now here we have ozone by temperature plus temperature squared, and it explains 54 percent. So adding this squared term adds about six percent variance explained, from 48 percent to 54. You can't really compare the model with temperature plus wind plus something else to this one on the R-squared. You can compare this model to the ozone by temperature model directly, but you can also compare it to the other ones. The idea behind it is that you would want to prefer the simplest explanation which is consistent with the data at a given time. Why add more variables if it does not really improve your model? So you always want to find a minimal model, with a minimal set of variables to do your explaining.

Of course we can look at the R-squared and we can count the number of things, but there is a very formal approach for it. There are actually three different formal criteria, and I will just introduce you to one of them. We have the Akaike information criterion, also called the AIC, we have the BIC, which is the Bayesian information criterion, and we have the logLik, which is the log likelihood. Model selection is the task of selecting a statistical model from a set of candidate models. We've looked at four or five different models already, and now the question is: which one of these models are we going to use in a publication?
Of course, we can use a formal approach and just say: well, I'm using the AIC to decide which model I prefer. It is a relative comparison of models, because you can never, or well, never is a big word, but generally you cannot do an absolute comparison of models, right? You always compare model one with model two or model three. You can't compare a model to nothing. So you have to do a relative comparison. Remember that the AIC is not a statistical test. It is just a guideline.

So the thing which the AIC does is that it rewards the goodness of fit, right? So the more you minimize the residuals, the better, and it uses the log-likelihood function to do that. But it also includes a penalty for each parameter that you put in your model, so for every new variable that you add you get a penalty. The fit part works kind of like the R-squared: the higher the R-squared, the better the model gets rated. But if you have an R-squared of 55 percent with six parameters, then it gets penalized for six parameters, and if you have the same R-squared with three parameters, then it will prefer the model with three parameters over the one with six. Because the log-likelihood is more or less based on the residuals, if you have more or less the same residual variance, then of course you're going to prefer the model which has the smallest number of predictor variables in there.

So how do you calculate it? Well, the AIC is two times k, where k is the number of estimated parameters in the model, minus two times ln L, so AIC = 2k - 2 ln(L). That's kind of the definition, right? So L is the maximum value of the likelihood function for the model, which is the goodness of fit.
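The definition above can be reproduced by hand. This is a small sketch, assuming the simple temperature model on the built-in `airquality` data; the names `model`, `ll`, `k` and `aic_manual` are illustrative. `logLik()` returns ln(L), and its `df` attribute holds the number of estimated parameters.

```r
model <- lm(Ozone ~ Temp, data = airquality)

# Maximised log-likelihood, ln(L), of the fitted model.
ll <- logLik(model)

# Number of estimated parameters: intercept, slope and the residual variance.
k <- attr(ll, "df")

# AIC = 2k - 2 ln(L)
aic_manual <- 2 * k - 2 * as.numeric(ll)

print(c(k = k, manual = aic_manual, builtin = AIC(model)))
```

Note that k is 3 here, not 2: for a linear model, R counts the residual variance as an estimated parameter next to the intercept and the slope.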
So the idea here is that the preferred model is the one which has the minimum AIC value, right? Because two times k gives you kind of the upper bound, the penalty, and then you subtract the goodness of fit from it. So the lower the AIC, the better a model is. And the general rule of thumb is that it needs to drop at least 10 AIC points. But again, this is just a rule of thumb. You could prefer a model which only drops five AIC points, just because you like it better. But if your AIC increases, you don't take that model: the model with the lowest AIC is always to be preferred.

And in R we can just use the AIC function for this. It's a built-in function, since R is very focused on statistics. So as an example, we use the airquality data set again, and now we just fit the different models that we looked at. We have model one, which is just that the temperature is explaining the ozone, so the ozone is only affected by temperature. We have the second model, where we say that the ozone is predicted by the temperature, the wind and the interaction. And we have our third model, where we say that the ozone is predicted by the temperature plus the temperature to the power of two.

If we then do the AIC, we can just call the AIC function, give it the three models, and then it will compare these models to each other. It will tell you the number of degrees of freedom that is taken up by each model. So the first model takes three degrees of freedom, the second model five, the third model four. I don't know why it's... oh, never mind. It looks like an off-by-one, but it is not an error: each count is one higher than the number of coefficients, because the residual variance is counted as an estimated parameter as well. But the important thing here is the AIC values.
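The three-model comparison described above can be sketched like this. The model names `lm1`, `lm2` and `lm3` follow the numbering used in the lecture, but the exact formulas are an assumption reconstructed from the spoken description (`Temp * Wind` expands to temperature, wind and their interaction).

```r
# Model 1: ozone explained by temperature only.
lm1 <- lm(Ozone ~ Temp, data = airquality)

# Model 2: temperature, wind and their interaction.
lm2 <- lm(Ozone ~ Temp * Wind, data = airquality)

# Model 3: temperature plus temperature squared.
lm3 <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)

# AIC() accepts several fitted models and returns one row per model,
# with the number of estimated parameters (df) and the AIC value.
aic_table <- AIC(lm1, lm2, lm3)
print(aic_table)
```

All three models are fitted on the same rows, since only `Ozone` has missing values among the variables used here, which keeps the AIC values comparable.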
So the second model, according to the AIC, is the best model. The third model is the second best, and then the temperature model is the worst model. So we see that going from temperature to adding temperature to the power of two, we drop 10 points, which is good. So we prefer this model. But if we compare this model with the model where we have temperature, wind and the interaction, then we would prefer the model where we say temperature plus wind plus the interaction between the two, because that is the model that fits the best. All right, good.

So this was it for today. This is what I wanted to tell you about basic linear regression. We discussed basic linear regression and multiple linear regression by including multiple factors, and I showed you how to do quadratic regression. In conclusion, you have to remember that all models are wrong. None of them are true. All models are wrong, it's just that some of the models are useful. And this is generally attributed to George Box.

This is a nice xkcd. The hobby is extrapolating, right? So hey, yesterday she was not married, today she is married. So if you then draw a best-fitting curve, then a month later you will have like four dozen husbands. But of course all models are wrong, not one model is correct or true. There are useful models and there are models which are not useful.

All right, so that was it for today. It's a short lecture, but I hope you guys learned something. And if there are any questions, then I'm more than happy to answer them. I could have made this a three-hour lecture if you had had more questions in the meantime or had asked for more examples, but I wanted to keep it short, right? It's a lot of information.
So just to very quickly summarize: I taught you how to do linear regression in R using the lm function. I taught you how to calculate your own confidence interval, the 95 percent confidence interval of the beta parameter, and how you can plot these. I showed you how you can do multiple models using different factors, as well as using a quadratic term, because linear regression doesn't just mean straight line. It just means that the model is linear in its coefficients, so the response is a weighted sum of the terms that you put in, and of course that also works with an x to the power of two as one of those terms. And then, I kind of always use the AIC when I want to do model selection. Remember that there is also a BIC function, but I've not run into a situation where there's a big difference between AIC and BIC. So for me, that was it for today. Not a very long lecture. I'm gonna be home very soon.

Moderator: do you have a few example exam questions for us, perhaps? Yes, I do, and they will be in the last lecture, because it's already in the last slideshow. So there's a bunch of them. Or do you want them now? I can put them online as well. I don't care, but I generally save them for kind of the last overview lecture. But if you want, I can put them on Moodle and you can look at them there. It's just like three or four very basic questions which I generally ask, or variations of those questions. But you already saw a couple, right? In one of the first lectures we had this "what is the type of". So we had that in like lecture number one, where we discussed the type system. And there will always be a question like that, where I try to trick you guys into naming the wrong type. So I will show you something which looks very much like a logical value, but then it turns out it's a character, or the other way around. "But I would be interested now."
Yeah, well, we can do them now, and then we can do them later again. Let me look up the last lecture. So, pptx. You guys are way, way, way too worried about the exam. It's in my best interest to have all of you pass, right? Because if you pass, then I don't have to do a re-exam, which saves me doing work. But let's go to the R course.

"Why do I need the AIC, can I not only look at the R-squared?" No, and that has to do with Occam's razor, right? Because we prefer models which are simple. So we need to penalize for the number of elements that we put in, because everything that we put in our model catches some variation, but that doesn't mean that it should be in the model. Even if it's significantly influencing your thing of interest, it might still be that including it doesn't provide you enough of an improvement to add it to the model. If we look at all of these models, we see that they use different degrees of freedom. The first model is the simplest model. The third model, lm3, is kind of the next simplest. And the most complex model is lm2. And then of course in this case lm2 wins, because hey, it's just the best model. It also has the most terms, so it does get a penalty for that, but in this case the most complex model is actually the best model.

But for these two models, the temperature model and the temperature plus temperature to the power of two model, you can see that the AIC does drop by like 11 points. Adding this term doesn't provide you with that much improvement, but the improvement is large enough to still prefer this model. That might not be the case for all of them, and that is why you need the AIC. So the AIC penalizes for the number of terms that you put in your model, right?
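To make the point about R-squared versus AIC concrete, here is a hedged sketch that fits polynomials of increasing degree on the built-in `airquality` data. The use of `poly()` and the choice of degrees 1 to 6 are assumptions for illustration, not something shown in the lecture.

```r
# Keep only rows where the response is observed.
aq <- airquality[!is.na(airquality$Ozone), ]

# Fit polynomial models of degree 1 to 6 in temperature.
degrees <- 1:6
fits <- lapply(degrees, function(d) lm(Ozone ~ poly(Temp, d), data = aq))

# R-squared can only stay equal or go up as terms are added;
# the AIC additionally charges 2 points per extra parameter.
r2  <- sapply(fits, function(m) summary(m)$r.squared)
aic <- sapply(fits, AIC)

print(data.frame(degree = degrees, r.squared = round(r2, 4), AIC = round(aic, 1)))
```

Reading the table row by row shows whether each extra term buys enough fit to pay for its penalty, which is the question the R-squared column on its own cannot answer.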
Because otherwise I could just say: well, I can do temperature plus temperature to the power of two plus temperature to the power of three, plus four, plus five, plus six, plus seven, right? And then at a certain point the curve can just go through any point, and in the end you will fit a hundred percent model. But that is not a model which you would prefer, although it is fitting exactly to the observations that you have. This is why we need the AIC function to do model selection, because just looking at the R-squared only tells you part of the story. It doesn't tell you how complex your model is. And that's why a model like the model for gravity, or Einstein's model, are such beautiful models, because they only have very few terms in there. Of course you could describe gravity using hundreds and hundreds and hundreds of terms. You could include the moon and all of the stars and all of the planets. But of course, this is not a very useful model. The model that is most useful is the smallest model which has the best explanatory power.

All right. Let me see, overview lecture. No, wrong one. Overview lecture, so... uh-huh, there it is, and then some example questions. "Good luck on the exam." Oh wow, that is interesting. There are no example questions for the exam in here. Where did I leave them? Where, where, where did I leave them? I might have moved them one or two lectures earlier. Ha, interesting. Do I not have example questions? "Check your desk." Yeah, there's all kinds of examples. There's a whole exam there. No, that's not it. Oh, it might be behind one of the your-own-choice lectures, because they generally come after the overview lecture. Let me look through my hard drive, see if I have them somewhere. I might not have put them on my OneDrive. So I'm just gonna sit here and enjoy... Yeah, I know, the old exam is there.
Yeah, but I might want to recycle some of those questions. I could just show them the entire exam and then everyone's going to pass. But then I'm going to stop the recording first, because that would be a really nice reason to watch Twitch, so that you get all of the exam questions for free. But let me look at the older years, because on my OneDrive I haven't put all of the new ones yet. No, that's the wrong one. So I am looking at lectures. I want to have the R course, and then 2019, exam. No, China. No, I want to have this one: lectures, overview lecture. Interesting. Yeah, here they are, here they are. I knew that I had them. So put this thing into reader mode, and then PowerPoint, I want to switch you to PowerPoint. Lecture number 12, overview. No, that's the wrong one. PowerPoint lecture 13, overview. Here.

All right, so here we have some example questions for the exam. Of course, like I already told you, there will always be one of these "what is the type of" questions, right? So what is the type of a b c with the double quotes around it? This is of course a... throw it in chat, people. First one. What is this? What is the type of this one? Numeric, logical, character, vector, matrix, function? All right, a b c: "character". All right, plus one toco for all. Very good, character. Yeah, it's a character. Good. Second one. Yay. All right, so for the second one: "number". That's wrong. It's called numeric. Numeric. Last one: FALSE. False, false, false. What's the type of FALSE? "Logical". Very, very good. It's that easy. It's that easy, but I'm going to do my best to trick you guys. So there will be things which look like characters but which are actually not, and the other way around. There will always be a question like this, because it's just an easy question to make and it's easy to check as well for me.
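The answers given in chat can be checked at the R prompt. This is a small sketch, assuming that "type" in the exam question corresponds to what `class()` reports; the specific number used for the second question is not stated in the transcript, so `1` is a stand-in.

```r
# A quoted string is a character, whatever it looks like inside.
print(class("abc"))    # "character"

# A plain number is numeric (typeof() would report "double").
print(class(1))        # "numeric"

# The bare keyword FALSE is a logical value.
print(class(FALSE))    # "logical"

# The trick variant: quotes turn the logical-looking value into a character.
print(class("FALSE"))  # "character"
```

The last line is the kind of trap described above: the quotes decide the type, not the content between them.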
And the type system is really hard in R, right? So you guys having a good grasp of the type system is important for me.

Question number two is something that generally also occurs. It's like: write a for loop that prints, using the cat function, the numbers one to a hundred to the screen. Make sure that you also print a new line after each number. So of course, if you would do something like this, then it would be something like... let me get a new one. So then you would say: for x in one to a hundred, then we have to use these braces, then I have to say cat, and then I have to cat x, comma, and then backslash n for the new line. So: for (x in 1:100) { cat(x, "\n") }. That would be the answer to that question, right? And then of course you have to say Q2, question two. Or well, you will write it on paper, because you don't have a notepad window to type in. But this would be the thing that you would have to write down on the piece of paper.

So, a very common example question. Sometimes we go from one to a hundred, sometimes we go from a hundred to one. Sometimes I want you guys to use the seq function, for example to print the numbers two to a thousand stepping by seven, so then you have to use the seq function. Of course the seq function can be combined with rev, the reverse function, if you want to go from a thousand down to two, going down by seven every time. Or you can just use minus seven as the by parameter.

Another question here is: when creating an R package, how should the folder holding the C or C++ source code be named? That's a question that, well, we already had this lecture, so you should be able to answer this.

"Do you give points for partially correct answers?"
Yes, so generally when I have the uh, the first one right where you say True false and and logical numeric things then I generally have like five and then um, You can score one out of five two out of five three out of five four out of five or five out of five points So every question is worth one point and if a question is an a and a b question Then that means that question a is half a point and question b is also half a point um The only time when you don't get any points Is when you fail to adhere to the question So I often have questions like Write down three reasons to use r And during the years that I've been giving this course people still Write down four reasons and then it is completely wrong Because you did not read the question. So if I say write down three reasons You can write down one reason for one out of three points You can write down two reasons if both are correct. You get two out of three point You can write down three reasons But you cannot write down four Because four reasons when I'm asking for three means that you're just not understanding basic mathematics And I'm not going to choose right because three out of four might be correct But then it's up to me to choose which answers. No, so if you so if there's a question which explicitly ask for an n number of answers Right, so write down three reasons why you should use r Then write down at maximum three Four means it's completely wrong if you write down two you maximum can earn two out of three point Just so that we're clear on that that I don't want any any anything behind Fair enough would you give partial points for partially correct codes like in number two of the example question? 
Yes Yes Yeah, so there's the then of course it's up to how wrong I think it is How many questions do we get in total 42 the answer to life universe and everything always Well, actually there's 41 questions plus a birthday question like I told you guys already question number 42 will be a drawing question So you will be drawing something having its birthday or something on fire or I don't know. It's always related to some event that happened on the day. So one of them is Last year the exam for the bioinformatics course was on the day the hindenburg crashed So the example or the the last question was um draw a zeppelin and then people draw zeppelin and then of course like head How beautiful your drawing is is up to my discretion and my decision and the most beautiful zeppelin gets One question from the exam more or less one wrong question erased So I generally try to use this question for people who are very good at drawing But just happen to have like 20 questions instead of 21 Right because you need half the half of the question is correct plus one So there are 41 real questions meaning that 21 questions Or 21 right answers is enough to pass the exam Any more questions about the exam or not about the exam? The exam is 90 minutes. Yes yeah Unless you have dyslexia or some other like Illness that qualifies you for more time And I'm I'm not someone that that will like force you to put down your Like if you're still drawing your nice zeppelin or your firefox or whatever I asked in the past A platypus, um, then if you're still drawing then that's fine. 
You can finish your drawing. I'm not like a time nazi that says: well, it's 90 minutes, pens down now. But yeah, you have 90 minutes to do the exam, and then there's like a 10 minute window to photograph it and send it to me. Because you're writing on paper, you take a photograph of the paper, then you send that to me by email, and the pieces of paper that you wrote on go in an envelope and you send that to me, because I need to have the original answers from you guys. And then I can already start grading the exams that you photographed, and once I get the hard copy, so to speak, I can put the grade on it for sure, because then I know that the photo that you sent me is really the exam that you also sent me.

More questions, remarks? I think you're way, way, way too scared of the exam. Florian, tell them how easy the exam is. Even you passed, and you still struggle with R. Or did you find the exam really hard? I don't know if I ever asked you this. Is the exam that hard? I don't know, right? It's just testing if you were attending the lectures. "Yes, Florian, tell us." Yeah, is he still here? Florian? Or are you still trying to order stuff via the university's new system?

But don't worry too much about the exam. The exam is relatively easy and it's perfectly doable. In the end, the assignments are the thing which matter, right? Because the assignments give you tasks to program, and that is the main thing. You only become a programmer by investing five to ten thousand hours in learning how to program. Just listening to me talking about linear models and how I do them in R will not really help you guys to do that. "Oh, yeah, good luck on the exam." Well, we're not there yet. But yeah, that's kind of it for today. And yeah, you can spam Florian also on Moodle.
Um if you Really really want to want him to tell you some of the questions But I will have to do new questions anyway Because we did it online last year. Um, so that means that a lot of people have A digital copy of the previous exams. So And I know students so I know that people from this year talk to people who followed it last year and Try to get the exam questions and head try to prepare well enough But I'm going to do my best to give you guys a completely fresh polished new exam with New questions and of course, um Slides change also from year to year not not too much, but hey, it's it's it's my job to Not make it too easy on you guys, but it's also not my job to Make it impossible for you guys to To pass the exam So hey in the end like I don't gain anything by flunking all of you or giving you all like force Um, that that also is not fun for me Any more questions No questions Like I know there's just a little delay. So just sit here a little bit and have you guys like think about questions Thank you and goodbye. Uh, thank you Lidia Thank you Lidia for attending like um the questions and Lectures only become fun when they're students that So thank you for attending and yeah, enjoy the rest of the afternoon All right, I will stop the recording then so um people on Moodle and people on youtube see you next week Like subscribe favorite hit the bell icon and like all that other crap. I don't want to do this Anyway, see you guys next week