Hi everyone — even if you're watching this on YouTube five years from now — welcome to the lecture. We're going to talk about regression analysis. The way I planned it is to first talk about regression in general, then show you single linear regression: things like how to compute confidence intervals and make some plots. Then I want to say a couple of words about multiple linear regression and quadratic regression, and finally have some slides about model selection, because that's generally the hard part: putting up a regression model is easy, but deciding what is a good regression model and what is a bad one is definitely more difficult. What I want to stress is that you can ask any question you want. I've been doing linear modeling for almost 12 years now. I'm not an expert — the more you know about a subject, the more you realize what you don't know — but I do have a lot of experience with linear regression. In R you do it using the lm function and the anova function.

Today we are only going to talk about this little red part: general linear models. We're not going to talk about how to deal with repeated measurements, or mixed models where you have random effects and fixed effects. I also don't want to talk about generalized linear models, where the response does not follow a normal distribution — like logistic regression, where you model a zero/one output, the kind of thing you have in a case-control study: someone either gets better or dies. If that is your response, surviving or not surviving, then you cannot use general linear models. General linear models are for variables which follow a Gaussian (normal) distribution. There are many different terms for this: some people call it linear regression, others say analysis of variance, others call it analysis of covariance or multiple linear regression — all of these things are the same thing in my mind, and I hope that in this hour I can convince you that that's the case.

Regression analysis is actually very basic: it is a statistical process for estimating the relationships among variables. Variables can be human height (stature), or for example your food intake — anything you can measure. If you measure multiple variables, you can more or less relate them to each other, and the goal of regression analysis is to find a model which estimates this relationship and which can be used — exploited — in the future. If we know that people with more stature are more likely to give money, then when I'm begging for money on the street I will not ask people who are small, I will only ask people who are big. If I know that people who look very rich give me more money, then I'm going to try to exploit that. So regression analysis is a statistical process to estimate the relationship between two or more variables, and there are many, many techniques for analyzing several-variable models — regression is not the only one; you could use correlation, or covariance, or whatever you want. But in regression the focus is on the relationship between a dependent variable and one or more independent variables. The dependent variable is the thing you are trying to model: in our case, when we look at the Berlin Fat Mouse, the thing we are trying to model is the fatness of the mouse — the amount of fat it has, the body weight it has. That is our dependent variable: the output, or the effect, the thing we are trying to estimate. The independent variables are the input, or the cause: for example the food intake of the mouse, the amount of exercise it gets, and all of these things that we
think might be causing — or affecting — the dependent variable.

If we write this down in a mathematical sense, y is the dependent variable, because we want to predict y, and y is given by some function of x, where x is the thing that we measure, for example the food intake. Is there a function which couples the food intake to the obese phenotype of a mouse? It could also be a more complex function: y is still the dependent variable, the thing we want to predict, but now we have the food intake plus the amount of exercise. Is there a function which, given a certain food intake and a certain amount of exercise, can predict the dependent variable — the body weight of the mouse?

So this is the standard regression model that everyone shows you: y, our dependent variable, is approximately equal to some function of X, the independent variables, and the unknown parameters. The betas are the estimates of the effects of these individual measurements, and X is actually a matrix, because it can be a single variable but it could also be a hundred variables. When you think about it, regression is more or less a matrix setup: imagine we have a hundred mice of which we measured the body weight, so we have a vector of a hundred measurements — that is y, our dependent variable. Then we might have a matrix X with two columns: the first column is the food intake, the second column is the exercise, with a value for each of the hundred mice. And then we have the betas, because we want to estimate the effect that, for example, the food intake has on our dependent variable — which can be positive or negative.

Within this regression model there are two constants which you don't see in the model itself, and which people never mention. n is the number of independent measurements: if we have a hundred mice, then n is a hundred. And k is the number of unknown parameters: in our case that would be two, because we have the food intake and the exercise. The power of a regression model — when the assumptions hold — comes from this: n, the number of independent observations, should be larger than k, the number of independent variables for which you want to estimate a beta. n minus k is called the excess of information; it tells us how much information we have to base our predictions on.

Of course regression comes with a bunch of assumptions, and the ones highlighted in red are the ones most often violated in scientific publications. The first assumption is that the sample we are looking at is representative of the population for the inferred prediction. Very basically, that means that if I want to use a regression model on humans to figure out something that is generally true for the whole human population, I cannot limit myself to studying only people of European descent — then the first assumption is broken, because humans as a group do not consist only of people of European descent; there are also people from Africa and people from Asia. The sample that I take needs to be representative, and this goes wrong 99% of the time: when you look at a paper where people do an association analysis in, for example, a hospital in Spain, they always make it seem like their results can be generalized across the whole world. For example, they have a drug.
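The matrix picture above — a response vector y, an n-by-k matrix X of predictors, and one beta per column — can be sketched in R with simulated mice. All variable names and numbers here are invented for illustration; they are not from the lecture's data:

```r
# Simulated version of the hundred-mice example:
# weight = beta0 + beta1 * food + beta2 * exercise + error
set.seed(1)
n        <- 100                                   # number of independent observations
food     <- rnorm(n, mean = 5, sd = 1)            # hypothetical food intake
exercise <- rnorm(n, mean = 2, sd = 0.5)          # hypothetical exercise
weight   <- 10 + 3 * food - 2 * exercise + rnorm(n)  # dependent variable

fit <- lm(weight ~ food + exercise)
coef(fit)          # intercept plus one beta per column of the X matrix
df.residual(fit)   # n minus the number of estimated parameters: the excess of information
```

With n = 100 and three estimated parameters (intercept plus two betas), `df.residual` reports 97 — the excess of information the lecture talks about.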
They want to look at the effect of the drug; they have five hospitals in Spain in which they either give people the drug or they don't, and in the end they conclude: well, this drug works. But that is wrong — they should have concluded that this drug works for people who live in Spain. That is the first error, the one that always goes wrong.

The next one is the error term. After we fit the model — we have the observations we want to model, the observations we think predict them, and our beta parameters — we are left with an error term, which measures how well the prediction fits the real-world observation. These errors should be random: their distribution should be a normal distribution with a mean of zero. No one ever checks this. I almost never see papers where people show the distribution of the error term, and not showing it means that no one can estimate how valid your model is.

The next assumption is that the independent variables are measured with no error, and this is impossible: you cannot measure the food intake of a mouse without any error; there is always error. So this one is generally considered unfulfillable — it is an assumption, but it is generally not true.

Then: the predictors are linearly independent. This means that the food intake and the exercise should not show any correlation. If food intake is highly correlated with exercise, I cannot estimate the effect of food intake on the body weight independently of the effect of exercise on the body weight. So for linear regression with more than one variable, the assumption is always that the variables we are looking at are not correlated with each other — that they are linearly independent.

The errors should also be uncorrelated — we will get back to that — and something similar holds for the assumption about the variance of the errors, which I'll skip here. But the ones in red are the ones that you should really check if you ever get asked to review a paper — you're master's students, and some of you will do a PhD at some point, so if you are reviewing a paper where people fit a regression model, make sure the assumptions highlighted in red really hold. The first one is the one that goes wrong the most, because people look at their favorite population: they do something in Germany, they have five participating hospitals in Germany, and then they conclude that the drug works and it's all fine. But it's not true: when you are looking at hospitals in Germany, the population represented by your sample is not humans, it is Germans, which is a very distinct subset of the whole population.

We already talked about the excess of information, which is also something you have to check: make sure that when people have measured a hundred humans, they are not estimating a hundred and one different parameters. You have to keep some excess information. As a rule of thumb, with a hundred mice you can fit about the square root of a hundred parameters — so with a hundred mice you can reliably estimate about ten of these effects. You could estimate the exercise, the food intake, and about eight others; but if you then added another ten, that would be impossible, because you run out of degrees of freedom, so to speak.

In R we do regression via the linear model function, lm. For this I first want to introduce the data set that we use today: R's built-in airquality data, which is a very commonly used data set.
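Since airquality ships with base R, you can follow along directly; a minimal look at the data:

```r
data(airquality)       # built-in data set: daily air quality measurements, New York, 1973
head(airquality, 10)   # first ten rows: Ozone, Solar.R, Wind, Temp, Month, Day
nrow(airquality)       # 153 observations; note that Ozone contains missing values (NA)
```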
The airquality data set looks like this when we look at the first ten rows — that is always what I do when I just want to see a piece of the data: show me the first ten rows, and that's it. We see that different things have been measured: the ozone concentration, the solar radiation, the wind, the temperature, the month, and the day. What we might be interested in is ozone: we might want to predict the ozone concentration in the air based on the other measurements that we have — the solar radiation, the wind, the temperature, and so on. So we first have to define what our dependent variable is and what the independent variables are. For this data set the dependent variable is Ozone, because that is the thing we want to predict. We don't want to predict the temperature — that is just something caused by the sun and other factors — and we don't want to predict the wind either. The ozone concentration is the thing we want to predict, so that is our dependent variable; all the others possibly contribute to the ozone concentration, so those are the independent variables.

When we start linear modeling, the first thing we want to do is look at whether there is anything there at all. So we say: we want a linear model, and this linear model should predict Ozone. We might first think that the ozone concentration in the air is caused by the temperature. I also have to fill in data = airquality, because I have to tell R where to find the Ozone column and the Temp column. So I say: do a linear model of Ozone based on the temperature. I press enter, and R echoes the formula I gave it and calculates the coefficients — these are the beta coefficients.

We can also look at this in a plot: we can plot airquality$Ozone versus airquality$Temp. We see the temperature on the x-axis — the thing we think predicts the ozone concentration — and the ozone concentration on the y-axis, so we just see the measurement points against each other. When you see this you might think: yes, there is some kind of relationship, because when the temperature goes up, the ozone concentration also goes up.

From the model we get two coefficients. The first one is for Temp, and it is 2.429. That means that, according to R, every degree of temperature increase adds about 2.4 to the ozone value: if you go from 60 to 61 degrees, the predicted ozone goes up by about 2.4, and from 61 to 62 it goes up by another 2.4. The other coefficient is called the intercept: if the temperature were zero, where would the ozone concentration be? It says that at a temperature of zero the predicted ozone would be minus 146.995 — that is where the regression line crosses the y-axis at x equals zero. This strange number comes about because the temperature here is measured in Fahrenheit, not Celsius, so zero is far below any temperature actually in the data.

Alright, let me switch back to the PowerPoint. As I showed you, we load the data set with data(airquality) and then we put up a linear model; here on the slide I show you the summary. In R we so far only used the linear model itself, but if I do a summary of the linear model, it shows me exactly what I put on the slide. Here again we see the two estimates, one for the intercept and one for Temp, and we also get a multiple R-squared. The multiple R-squared is a measure of how well this model fits the data — R-squared is, roughly, variance explained. This model, which says that the ozone is controlled by the temperature, explains around 48% of the variance that we see in the ozone. We still have 52% of the variance which is not explained by the temperature, and there might be another factor causing that. But this is our first model: we just use the lm function, look at the estimates, and get an idea of how good the model is — in this case we can explain about 48% of the variance.

First things first: we want to plot the regression line, so we want to get the intercept and the temperature coefficient into the plot. We could do it by hand with abline — the function to draw a line — saying that a equals the intercept, which is minus 146.995, and b, the directional coefficient (slope) of the line, is the Temp coefficient, 2.429; abline then adds this line to the plot.
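Put together, the fitting and plotting steps might look like this (lm_temp is just a variable name I chose, as in the lecture):

```r
data(airquality)
lm_temp <- lm(Ozone ~ Temp, data = airquality)  # rows with missing Ozone are dropped automatically
coef(lm_temp)   # (Intercept) about -146.995, Temp slope about 2.429

plot(Ozone ~ Temp, data = airquality, pch = 19, col = "blue")
abline(a = coef(lm_temp)["(Intercept)"],  # a: intercept
       b = coef(lm_temp)["Temp"],         # b: slope (directional coefficient)
       col = "red", lwd = 2)
# shortcut: abline(lm_temp, col = "red", lwd = 2) draws the same line
```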
So now we see this line, and it fits pretty well, in a way: here it is a little too low, here the line is a little too high, and here it is a little too low again. Of course we don't want to hard-code these numbers. We want to build the line and the plot so that, when we fit a different model, we can reuse the same plotting routine. So what you do in R is this: we plot the ozone versus the temperature — I add a point style and make the points blue so it is a bit more visible — and we store the whole result of the linear model in a variable. I called it lm_temp, for "linear model, temperature"; I just came up with that name. From this stored linear model I can ask for the coefficients — with the dollar sign, because the fitted model is a list — take the intercept from that list and call it a, take the Temp coefficient and call it b, and then call abline with a = a, b = b, make it red, with a line width of 2. That is how I plot a regression line for one of these data sets.

Now we want the confidence interval. I'm going to skip through this quickly — the slides are available for you — but calculating the confidence interval means calculating the margin of error. These beta estimates are not fixed: 2.4287 is just the best guess; with 95% confidence the true value lies somewhere within a confidence interval. The beta coefficient might for example be 2.38, or it might be 2.56 — we don't know, but we can calculate the error. For that we need two things: the standard error, which we can get from the summary, and the critical value. The critical value is based on the number of degrees of freedom — the n minus k parameter. We are estimating the intercept and the beta parameter, so we lose two degrees of freedom, which means I take n, the number of observations (strictly, the number of observations actually used — R drops the rows with missing ozone), minus 2. The margin of error is then the critical value times the standard error from the model. In our example: I take the summary and read off the standard error of the Temp estimate, which is 0.23; I calculate my critical value using the number of observations minus 2 degrees of freedom, and 0.975 because it is a two-sided test — 1 minus alpha divided by 2, with our standard alpha of 0.05. In practice no one does this by hand — we always use packages — but for a single linear regression model you can easily calculate your own margin of error this way. If we multiply these two numbers, we get a margin of error of 0.46. That means that in a paper the estimated parameter would run from the estimate minus the margin of error to the estimate plus the margin of error: the true value of the temperature coefficient is somewhere between 1.97 and 2.89.

Generally you can ignore this next part — it is me making a plot of the confidence interval by hand: predicting the data, making the plot, and plotting the predictions. Normally you would just use a library, namely the visreg package ("visualize regression"): you call visreg, give it the linear model, tell it the alpha, give it a nice name, and give it some parameters to make it look pretty. With visreg you see the same thing: the regression line, and in gray, more or less, the confidence band; next to it you see the result of me doing it myself. But I'm not really interested in explaining exactly how to do it by hand — generally, just use the visreg library to visualize your confidence interval.

Good — now the residuals, the error term. This is one of the assumptions you always need to check: is my model valid? The model is only valid when the residuals — the error term — follow a normal distribution. The residuals are the variance left over after fitting your effects — after fitting my temperature effect — and they are a measure of how well the regression line fits the data: the goodness of fit. We aim to minimize the sum of squares of the residuals; that is how these models are fitted. We want the residuals to be as small as possible, because the smaller the residuals, the better our model — the predicted values are then very much in agreement with the observed values.

So how do we visualize the residuals? It is easiest on clean data. First we remove all of the measurements where the ozone was not measured — the NAs. Then we fit our linear model, Ozone by Temp, just as before, and we use the predict function to predict values based on that model. Now we have the observed ozone, and in the cleaned model we have the predicted values. Then I plot the cleaned temperature on the x-axis and the cleaned ozone on the y-axis — so I plot the observed data — I call abline to put the regression line in, and then I go through each row of the cleaned data and draw a line from the observed ozone to the predicted ozone. How does this look? Like this: we see all of the points we had before, and this little piece of code draws the blue lines. A blue line is a residual: the distance from the value we observed to the value we predicted. At this temperature of 80 we predicted the ozone to be around 60, but we have three measurements that are not exactly that. Linear regression is nothing more than the mathematical procedure in which this regression line is wiggled around so that these distances are minimized — the regression line should be as close to all of these points as possible. Those are the residuals.

The code is here, and the presentation will go online, so you can look at the code, type it in yourself, and see what changes. Realize that there are two steps here: first we clean our data set and fit a linear model in which we regress the ozone on the temperature; then, using this model, we make a prediction — for each temperature observed in the data we predict the ozone, which we could also do ourselves, because we know the intercept and we know the beta coefficient. In the end we want the residuals to follow a normal distribution, because if they don't, our model doesn't fit very well, and we aim to minimize the distance of the observed points to the predicted points.
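The clean-fit-predict-draw sequence can be sketched like this; I use segments() instead of the lecture's explicit loop over the rows, and the variable names are my own:

```r
data(airquality)
airq_clean <- airquality[!is.na(airquality$Ozone), ]   # drop rows where Ozone was not measured
lm_temp_clean <- lm(Ozone ~ Temp, data = airq_clean)
predicted <- predict(lm_temp_clean)                    # one prediction per remaining row

plot(airq_clean$Temp, airq_clean$Ozone, pch = 19)
abline(lm_temp_clean, col = "red", lwd = 2)
segments(airq_clean$Temp, airq_clean$Ozone,            # from each observed point...
         airq_clean$Temp, predicted, col = "blue")     # ...to its predicted value: the residual

hist(resid(lm_temp_clean))  # the residuals should look roughly normal with mean zero
```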
So that is single linear regression — that's it, that is the only thing you do. You have measurements of two variables against each other, you draw a straight line through the data, and that straight line tries to minimize the error between the observed points and the predicted points.

Of course many factors might influence the amount of ozone: we don't just have the temperature, we also have the solar radiation, the wind, the month, and the day. So we also want — or need — to add those influences to our model, and when we do that we start talking about multiple linear regression, because now we are not predicting the ozone from a single variable but from two or three variables. Mathematically it looks like this: y is our ozone concentration, alpha is our intercept, beta-1 is the estimated parameter of the temperature, x1 is the temperature itself, and then we add beta-2 times x2, which might for example be the wind. So when we model ozone as a function of temperature and wind, the model is: ozone equals the intercept, plus the temperature times one beta, plus the wind times another beta.

How do we do this in R? Almost nothing changes: we just use the plus symbol to add another explanatory variable. We fit a linear model predicting the ozone by Temp plus Wind from the airquality data, and then we do the summary. Now we get different estimates: the intercept has changed (it used to be about minus 147), the temperature estimate also changed a little — it is now 1.84 instead of 2-point-something — and we now also have a beta parameter for the wind. And we see that the multiple R-squared went up: our model is better, because the original model explained 48% and this model explains 56%. So the estimate for temperature changed from 2.43 to 1.84, and this will happen every time: every time you add a new variable to the model, it will influence the estimates of the other parameters.

Again we can plot the observed values versus the predicted values: the black points are the observed values, and in red we see the predicted values. Now the predicted values are no longer a single straight line, because I am only plotting the temperature here, and for each temperature we may have different wind values. So instead of a single line, multiple linear regression gives multiple predictions for each temperature, depending on the wind — sometimes the wind was low, sometimes it was high. And the same principle holds: we try to minimize the distance from the observed values to the predicted values.

We can also deal with interactions. We might think that the higher the temperature, the bigger the influence of the wind: it might not be that the wind and the temperature each independently drive the ozone concentration; there might be a relationship between the two. Indeed, if we look at the data there seems to be a curve — at higher temperatures the ozone concentration seems higher than we would expect. This is called an interaction, and an interaction is nothing more than saying: we have temperature, we have wind, and now we add a new term, the wind multiplied by the temperature, and we estimate a new beta parameter for this new variable. So we have beta-1 for temperature, beta-2 for wind, and now a third interaction beta for the product of wind and temperature, which is just a mathematical multiplication of the two.

In R, when you want to model interactions, you use the colon. Is there any interaction between wind and temperature? We put up the model: a linear model where Ozone is predicted by Temp plus Wind plus Temp interacting with Wind, from the airquality data. Again all of our parameters change — which is logical, because we are fitting a new model — and again the multiple R-squared went up. We also see the p-values, and we get a very significant effect for the interaction, which gives us the clue that yes, there might be an interaction: when the temperature goes up, the wind might have a bigger (or smaller) effect. So our first model explained 48%, the second model with wind and temperature explained 56%, and this model explains 62% of the observed variance. The temperature coefficient changed again — first it was 1.84, now it is 4.07 — because some of the variance is now attributed to the other factors in the model.

So this is multiple linear regression: a single predictor, two predictors, or two predictors with an interaction — all of this falls under (multiple) linear regression. We can actually go further and say: the effect of the temperature might not be one basic beta, a single number. It might be that below 70 degrees Fahrenheit there is a certain increase, say 1.2, but if we
look at higher temperatures it might be that the increase is much higher right so that we instead of a single linear line we have a curve right so linearity of coefficients means that a change in one of the independent variables yields a corresponding change in the response variables right because we are dealing with a model which looks like this so y equals the intercept plus the beta times the thing that we look at but the following functions are also linear right because I can do y equals a plus b 1 and then I'm taking x square right so x square now now makes it a quadratic function right the same thing holds for when I take the exponent right if I say y equals a a plus e to the power b 1 x that is also a linear function right it's just that instead of having x I'm now having x square or I am having e to the power of beta 1 x right all of these are linear models and this is just called quadratic regression right so we're just using the x square model so how do we fit this x square model so in our case we have ozone which is the intercept plus the first beta times the temperature plus a secondary beta times temperature to the power of 2 right so in our we use the square this this square Keppi thing for the quadratic regression however we need to surround the statement by using the identity function this is just something that we need to do in R so if we want to do this quadratic regression model in R what can we say well we do the LM is the ozone is predicted by the temperature plus the temperature to the power of 2 and we need to surround this by the y right otherwise it would just multiply to the power of 2 and just put that in but so you need to use the identity function here again we get two estimates right so we get now on a negative effect of temperature so temperature seems to be with increase of temperature the temperature directly negatively affects the ozone concentration but there is a slightly positive quadratic term which means that the one will 
This quadratic term will eventually outpace the negative linear term, so the curve bends upward. This model explains more than the model using only temperature: the R squared is now about 0.55, while with only the linear temperature term we were at 0.48, and the quadratic term is very significant. In your paper you would write it down like this: ozone = 305.5 (the intercept) - 9.6 * temperature + 0.078 * temperature^2. And again we can plot the regression: we say plot the ozone by the temperature using the airquality data, which is just the plot we made already, and then I use the curve function to add a line: I take the numbers that I got from my model and add the curve to the plot. And it seems that yes, this model fits a lot better than just a single straight line; here I have my quadratic curve. So that's the way you do it: use the curve function, give it the numbers from the summary, and you see this really nice curve being plotted.
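The plotting step walked through above (scatter plot first, then curve() to overlay the fitted parabola) might look like this; here the coefficients are pulled from the fitted model object rather than typed in by hand from the summary:

```r
# Fit the quadratic model and extract its three coefficients
m_quad <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)
b <- coef(m_quad)  # b[1] intercept, b[2] linear term, b[3] quadratic term

plot(Ozone ~ Temp, data = airquality)            # the raw scatter plot
curve(b[1] + b[2] * x + b[3] * x^2, add = TRUE)  # overlay the fitted quadratic curve
```

Using coef() avoids copy-paste rounding errors from the printed summary.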
Good, so we already made a lot of models, and we now have a whole bunch of hypotheses that we tested: does the temperature influence the ozone, does the wind, does the temperature squared? Which one of these models is really true? Of course, since this is statistics, none of these models is true; the basic rule about statistics is that all models are false, but some are useful. We're just building a model, and we have no idea whether the model is really true, but in the end what we want to find is the model which gives the simplest explanation consistent with the data, and this is called Occam's razor: it is futile to do with more what can be done with fewer. Scientists always prefer the simplest explanation that is consistent with the data.

So how do we deal with this model selection more formally? We can use something like the Akaike information criterion (AIC), which is a relative comparison of the models. It is not a statistical test, it's just a guideline: it tells us which model is to be preferred. It rewards goodness of fit, so the more you minimize the residuals, the better the model's score, but it penalizes for the number of parameters you put into your model. A model which only includes temperature will of course fit worse than one which includes temperature plus wind; the question is whether adding wind improves the model enough to justify the extra burden of estimating those additional betas. The AIC function does this comparison for you: you give it linear models, and for each one it returns a value, and this value means nothing on its own, it is only meaningful relative to the other models. The AIC is defined as 2k - 2 ln(L), where k is the number of independently estimated parameters (going back a couple of slides) and L is the maximum value of the likelihood function for the model, which measures the goodness of fit. You can think of it in terms of the residual sum of squares: when you fit your line through the data you have unexplained variance, and L is a measurement of how well your model fits. And k, the number of estimated parameters, is the same k as in the n - k we saw when we talked about degrees of freedom. The preferred model is the one with the minimum AIC value; however, as a rule of thumb, we want it to drop at least 10 AIC points before we accept the more complex model.
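The definition AIC = 2k - 2 ln(L) can be checked directly against R's AIC() function, since logLik() returns both the maximized log-likelihood and the parameter count k as its "df" attribute (a small sanity check, not something you would normally need to do by hand):

```r
m  <- lm(Ozone ~ Temp, data = airquality)
ll <- logLik(m)       # maximized log-likelihood of the fitted model
k  <- attr(ll, "df")  # number of estimated parameters: intercept + slope + sigma = 3

manual_aic <- 2 * k - 2 * as.numeric(ll)
manual_aic            # matches the value returned by AIC(m)
AIC(m)
```

Note that k counts the residual standard deviation sigma as well, which is why a single-predictor model "takes up" three parameters rather than two.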
If you have two models, one at an AIC of minus one hundred and the other at minus one hundred and five, you would still prefer the simpler model at minus one hundred, because the difference is not ten points: you only accept a model as being better if the AIC drops at least 10 points. So in R we can just use the AIC function. How do we do that? Well, we load our airquality data and build, for example, three different models: the first model uses temperature, the second uses temperature, wind, and the interaction between temperature and wind, and the third uses temperature plus temperature squared. We pose these three linear models and then ask the AIC function which model is to be preferred. You see that the first model takes up three degrees of freedom, the second five, and the third four; this is what the AIC penalizes for, and next to it you see the AIC values themselves. The first model is the worst, so saying that only the temperature controls the ozone is not a very good model. The third model, the one with the temperature-squared term, is better than the first model, with a drop of around 11 points. The best model in this case, the one we would prefer, is the model in the middle, where the ozone is determined by the temperature, the wind, and the interaction between temperature and wind. It does take up the most degrees of freedom, because we're fitting the most parameters: not just temperature but also wind and the interaction term. Still, this model drops at least 10 AIC points below the other model we were interested in, so in our case we would say that we prefer model number two: model number two is the most valid model here.
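The three-model comparison described above can be reproduced like this; AIC() accepts several fitted models at once and returns a small table with the degrees of freedom and the AIC value of each:

```r
m1 <- lm(Ozone ~ Temp, data = airquality)                     # temperature only
m2 <- lm(Ozone ~ Temp + Wind + Temp:Wind, data = airquality)  # plus wind and interaction
m3 <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)         # quadratic in temperature

AIC(m1, m2, m3)  # one row per model; the lowest AIC is preferred, here m2
```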
Good, so in conclusion: all models are wrong, but some are useful, and this quote is attributed to George Box. Also, extrapolating is something you shouldn't do with linear models: linear models are only valid within the measured domain. There is an XKCD comic which explains this nicely: if yesterday you were not married and today you are, then extrapolating, in a month's time you would have around 30 husbands, which makes no sense. In our case the model is valid from around 60 degrees Fahrenheit to around 100 degrees Fahrenheit, because that is the domain over which we measured the temperature; only in this temperature domain is our model any good.

You guys made it, still six people at the end, so I think we did a good job today. Are there any questions about modeling? Do you want me to show you some more modeling, or do you have questions about linear models? I think regression is one of the most powerful tools you can learn for building up your models, and I also think it's important to see that linear models are not just straight lines: something to the power of two is still a linear model, and still very easily fittable with the lm function in R. R is a very powerful language. We talked about how you can manipulate your data, but we didn't even talk about how to load in your data; we didn't discuss the read.table or write.table functions. So if you're interested in learning more about R: during the summer I will be giving a course on how to program in R. The course from last year is also available on YouTube; let me get you the link. I don't know if it's still in the chat, but there you go, here on YouTube I have my R programming course from last year. So if you're interested in learning R, go to my YouTube channel and look at the R programming course; there's like 50 hours' worth of me talking about programming, and it goes much, much slower than we
did today. Good, so today we talked about some basic R and we also did some very advanced R: linear models sit somewhere around lecture number eight or nine of the R course, generally all the way at the back. I hope you guys liked it. I'm going to stop the recording for YouTube, so YouTube people: see you in a couple of days with the next lecture.