This is the second video in a seminar series on the fundamentals of linear modeling. I'm using Python, and more specifically the statsmodels package, which makes working with linear models very easy to do. If you haven't done so already, you'll see a link up here (it'll also be in the description down below) to the first video in this seminar series. That one covers straightforward linear regression with a single predictor variable, and I want you to watch it before this one, because we're going to reuse the terminology and the explanations from that video.

This video is all about analysis of variance. You'll remember that there was an analysis of variance table when we talked about linear regression, and I want you to understand and see clearly how these two things are connected, how one follows from the other.

I'm using Visual Studio Code here; it's a Jupyter notebook inside of Visual Studio Code, and at the top we can see Python 3.9.10. I'm using an environment created with conda, or rather Miniconda.

There's my first cell: the packages used in this notebook. We're going to use pandas (once again you'll see that I don't use any namespace abbreviations), NumPy, and from SciPy we're going to import the stats module, and then also the patsy package, which is going to help us create design matrices. Next up is the section that's all about plotting: I'm going to import the express and graph_objects modules from Plotly, and also the io module so that I can set the default plotting template to plotly_dark, because I'm using a dark theme here in Visual Studio Code and I'd like nice dark plots to match. Then statsmodels, which is going to do all the heavy lifting for us. We're again going to use the OLS (ordinary least squares) function, and we're also going to use the anova_lm function (ANOVA for linear models), pairwise_tukeyhsd, and the multipletests function, all from different modules inside of the statsmodels package.

You'll remember this table: it lists what I would term the four fundamental linear model types, and we're busy with the second one here, one-way analysis of variance, or ANOVA. Once again, it's a model: we have one or more independent variables, and for one-way ANOVA the data type of the independent variable is nominal.
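As a sketch, that imports cell might look something like the following (the exact cell layout is my assumption; the module paths are the standard ones for these packages):

```python
# Packages used in this notebook (no namespace abbreviations)
import pandas
import numpy
from scipy import stats
import patsy

# Plotting: set the default template to match a dark editor theme
from plotly import express, graph_objects, io
io.templates.default = 'plotly_dark'

# statsmodels does the heavy lifting
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests
```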
So we're going to have a categorical variable, and that is going to be our predictor, our independent variable. It's one-way inasmuch as we're only going to have one. Our dependent variable is still interval, a continuous numerical data type. So given certain levels of our independent variable, which we can term a treatment (different levels of the treatment are just the unique values in this categorical variable), we're going to use those unique values to try and predict a value for a continuous numerical variable. And you might imagine the problems that we would have with that.

So let's generate some data. As I said in the first video in the seminar, it's good to generate your own data, because you can control, to some extent, how that data is generated, so you already have some understanding of the type of results that you should see. Sometimes that is more powerful when you're trying to learn how to do this than just importing a CSV file; later in the seminar series we will import data that already exists in a spreadsheet file.

We're going to generate data for 30 observations, 30 subjects, 10 in each of three groups. So I'm going to have a continuous numerical dependent variable and a nominal categorical independent variable, which I'm going to call treatment (there are many synonyms for most things in statistics), and it's going to have three unique values, or three levels.

First I seed the pseudo-random number generator with numpy.random.seed, passing the integer 10; if you do the same, you're going to get the same pseudo-random values. I create a computer variable called treatments, and that is numpy.repeat; what I want to repeat is a Python list object with three strings in it, a, b, and c, each ten times. So it's going to repeat a ten times, then b ten times, then c ten times, so that I have a a a..., then b b b..., then c c c..., all in one NumPy array.

Then I create variables called dep_a, dep_b, and dep_c, and I generate ten pseudo-random values for each; those will be the values for each of the subjects that has an a or a b or a c. I'm using the round function, numpy.round, with 1 as its second argument because I just want one decimal place. The value that I'm rounding comes from stats.norm.rvs; what the rvs function does is draw random variates from a specified distribution, in this instance a normal distribution. Because it's a normal distribution, it depends on two parameters, a mean and a standard deviation, so I have to pass those: loc is the mean, which is 100, and scale is the standard deviation, which is 10. With size=10 I take ten random values from a normal distribution with a mean of 100 and a standard deviation of 10, and the trailing 1 is just the second argument of the round function.
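Here's a sketch of that data-generation cell (same seed, so the same values should come out):

```python
import numpy
from scipy import stats

# Seed the pseudo-random number generator for reproducible values
numpy.random.seed(10)

# 'a' ten times, then 'b' ten times, then 'c' ten times
treatments = numpy.repeat(['a', 'b', 'c'], 10)

# Ten values from a normal distribution with mean 100 and standard
# deviation 10, rounded to one decimal place
dep_a = numpy.round(stats.norm.rvs(loc=100, scale=10, size=10), 1)
```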
For those subjects that have a value of b in their independent variable, I'm also drawing from a normal distribution, also ten values, but this time with a mean of 105 and a standard deviation of 7, again rounded. And lastly, for c, from a mean of 95 and a standard deviation of 5. Let's run that; if you use the same seed, you're going to get the same values.

What we're going to do now is build a DataFrame object. I'm going to call my DataFrame object df, so I'm assigning it to the variable df, and I'm calling the DataFrame function from pandas, pandas.DataFrame. I'm passing a Python dictionary, so I have key-value pairs, and if you pass a dictionary to the DataFrame function, each key becomes a column name and the values go down the rows, one per observation. So for the key 'treatment', the values are treatments, and remember those are the treatment values: a ten times, then b ten times, then c ten times. For the dependent variable I'm passing a NumPy array: I say numpy.array and pass the three ten-value NumPy arrays as a Python list, but then I also call the flatten method, because I want them flattened into one long NumPy array of 30 values, to match the 30 values in treatment, so that my two columns are of equal length, with the same number of rows down each of them. There's my DataFrame, all created.

All I want to do now is state very clearly that the treatment column, which I can refer to with the shorthand notation df.treatment, is a categorical variable. So I'm using the Categorical function in pandas, pandas.Categorical: I say take the df.treatment pandas Series (that returns that column for me), and I state that the categories are a, b, and c. The order here really matters, because the first one is going to be your base case; later on we'll discuss exactly what that means, and specifically when you get to logistic regression it becomes very important. I'm overwriting the df.treatment column with this, so if we do that, pandas now knows that that column contains a categorical variable.
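And a sketch of the b and c draws plus the DataFrame assembly, continuing from the cells above (the variable names dep_b and dep_c follow the pattern described):

```python
# Groups b and c come from their own normal distributions
dep_b = numpy.round(stats.norm.rvs(loc=105, scale=7, size=10), 1)
dep_c = numpy.round(stats.norm.rvs(loc=95, scale=5, size=10), 1)

# Dictionary keys become column names; flatten stacks the three
# ten-value arrays into one array of 30 values
df = pandas.DataFrame({
    'treatment': treatments,
    'dependent': numpy.array([dep_a, dep_b, dep_c]).flatten()
})

# Declare treatment categorical; the first category, 'a', is the base case
df.treatment = pandas.Categorical(df.treatment, categories=['a', 'b', 'c'])
```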
So let's do some exploratory statistics. It's always good, before you do any modeling, to try to understand some summary of your data, either through calculations, as we'll do here, or through data visualization: some form of descriptive statistics and data visualization. I'm going to call my DataFrame, df, and then the groupby method, df.groupby, and what I want my DataFrame to be grouped by is the treatment column. It's going to look at all the unique values, it'll find a, b, and c, and then for each of those take the dependent values and describe them; the describe method gives us a whole range of summary statistics.

So first of all, the groupby worked by finding a, b, and c, and then it found the count, the frequency: there were 10 observations in each of those treatment levels. We also see a mean. Remember, we chose a mean of 100 and a standard deviation of 10 for group a, and using that seed for the pseudo-random number generator we got a sample mean of 100.6. For b it's 106.22, and for c it's 96.78. You also see the standard deviations, the minimums, the quartiles, and the maximums. Very nice.

Let's plot this, though. I'm going to use Plotly Express, the express module, and the box function, express.box. The first argument is my DataFrame object. What do I want on the x-axis? The treatment, so it's going to be a, b, and c. On the y-axis I want the dependent column, and I've given it a title as well. Let's see what that looks like, because this gives us a very clear indication of the information locked away in our data set. Indeed, in the nominal categorical variable treatment it found three unique values, a, b, and c, and for the continuous numerical dependent variable it drew a box-and-whisker plot of each. As I hover over these you can see the quartile values, and the minimum and the maximum as well. You also see there is a suspected outlier, as far as the dependent variable value is concerned, for those with treatment level a.

Now I'm going to plot it slightly differently, as a scatter plot instead of a box plot, because I want you to think back to the tutorial on linear regression. The magic there was to draw the straight line that was the best-fit model, the one that made the least error. Remember? And we have a problem when we look at this scatter plot, because I don't have a continuous numerical variable on my horizontal x-axis, and I can't just draw a straight line through here. I've got to create a different kind of model. You can well imagine why: how close are a, b, and c to each other? There is no constant interval between a and b and between b and c. These are not numbers, and even if you encode them as numbers, such as 1, 2, and 3, or 10, 20, and 30, it doesn't matter what you do: those are not numbers.

What we do with analysis of variance is that the best-fit model is actually easy: it is the mean of the dependent variable for each of these three groups. So you can think of a little red line here, a little red line there, and a little red line there; that is going to be our best-fit model. Given a value for the input variable, the independent variable, our treatment variable, a value which is either a, b, or c, the prediction for the dependent variable is always going to be the mean for that group, as simple as that. What you'll see, though, is that it's not so different from linear regression as you might think.

Since we are dealing with a nominal categorical variable, we have to discuss dummy variables.
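A sketch of those two cells, the grouped summary and the box plot (the scatter version simply swaps express.box for express.scatter):

```python
# Summary statistics of the dependent variable per treatment level
df.groupby('treatment').dependent.describe()

# Box-and-whisker plot per treatment level
express.box(
    df,
    x='treatment',
    y='dependent',
    title='Dependent variable by treatment level'
)
```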
Think of our original variable, the one called treatment. If I look at the first five values, using indexing, I say go up to the fifth; remember, Python is zero-indexed, so it shows me rows zero through four, the fifth being excluded. They're all a, because of the way I used the repeat function in NumPy: a ten times, then b ten times, then c ten times, so we only see a's here. It also tells me it is a category dtype with three categories: a, b, and c.

What we want to do is convert those into dummy variables. You cannot encode them as 0, 1, and 2, or 1, 2, and 3; that does not work. You have to create dummy variables. In other words, instead of just the treatment variable, I now have three variables, a, b, and c; in machine learning we'd call this one-hot encoding. If a subject is an a, there'll be a 1 under a and a 0 under the rest. And if you look at those three dummy variables, they show you clearly: 1 0 0 is an a, 0 1 0 is a b, and 0 0 1 is, of course, a c.

But remember, we started talking about how one of these cases has to be our base case, against which the others are measured; when we get to logistic regression that's going to become very important. It makes sense for us to choose a, and you can also look at it this way: if I have only b and c as my dummy variables, then 0 and 0 still describes an a, because if b is 0 and c is 0, what other option is there than a? If it's 1 0, it has to be b, and if it's 0 1, it has to be c. So there's redundancy in keeping the a column; you can think of it that way, it's really redundant. With just these two dummy variables I can still know whether a subject was in group a, group b, or group c. So those are our dummy variables, and we have, in essence, a comparison against the base case, which is going to be a. This is all I need to describe exactly which group, which level of treatment, each of my subjects was in.
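The transcript doesn't show a cell for this step, but to make the encoding concrete, pandas.get_dummies illustrates it (this is my addition, not a cell from the notebook):

```python
# Full one-hot encoding: columns a, b, and c of zeros and ones
pandas.get_dummies(df.treatment).head()

# Drop the first level: 'a' becomes the implicit base case (b = 0, c = 0)
pandas.get_dummies(df.treatment, drop_first=True).head()
```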
So let's discuss the research question. What I've done here is write it as a mathematical, statistical equation, and once again I refer you to the video lecture on linear regression to understand what is going on here. In words, though, we've got to be careful, because there are two ways to go about this. The way that you might find in a traditional textbook is to compare the three means to each other and talk about the ratio of the between-group variance over the within-group variance. That might make sense if you only ever look at analysis of variance, but I'm trying to build a bigger picture for you here, one of linear modeling, and as you go on to more complex forms of models you'll see that one just builds on top of the other; we're just expanding our knowledge of these linear models. It makes more sense to still describe this, in some sense, as a linear regression model.

So while most textbooks would say that we are comparing the three means, which is exactly what you would use an analysis of variance for, we want to state that our independent variable is a predictor of the dependent variable. We're still keeping it as a model, as we did with linear regression, and we're still going to have (we'll get to them) a null hypothesis and an alternative hypothesis.

What we are saying is that we want an estimate, a predicted value, for our dependent variable; you see the little hat in equation 1. We want a dependent variable value estimated, or predicted, based on these coefficients. I'm still going to have beta-0 hat, and remember why we put the hats on there: this is data from a sample, not the whole population, so we are only estimating what each parameter value is in the population. So there's still an intercept, and still a slope, beta-1 hat, but that now goes with the dummy variable b, which can only take a value of 0 or 1, and this is multiplication: I'm going to multiply beta-1 hat either by 0, which means it disappears, or by 1, which means it's just that value, beta-1 hat. These b and c are not continuous numerical variables; they are either 0 or 1. And the same for beta-2 hat times either 0 or 1 for the c value.

So it's all about the three coefficients here. And if you think about it, because I can only plug in 0 or 1, I have three cases: (0, 0), (1, 0), or (0, 1). It means this estimated dependent variable value can only take three possible values. That's the model for our research question: can I build the right-hand side of the equation to give me an estimate of the dependent variable value? If the case is in group a, we have (0, 0): beta-1 hat times 0 plus beta-2 hat times 0 is just 0 plus 0, so the model will always predict beta-0 hat. If the case is in group b, it's going to be beta-0 hat plus beta-1 hat, because I'm multiplying beta-1 hat by 1, and beta-2 hat by 0, so that disappears. With c you can see a similar thing. So for this research question of mine, predicting an estimate for the dependent variable value given my independent variable values, my model will only ever predict three values. If I had more groups, more levels to my treatment, my independent variable, there would of course be more values that could be predicted.

So, our null hypothesis. Instead of thinking of it in terms of equal means (and you'll see it actually is those equal means, we are actually comparing means, you'll see that come out), I want you to still think about it the way we thought about it when we did linear regression: my null hypothesis is that beta-1 hat and beta-2 hat (both estimates, remember) equal zero. And if they're both zero, think about that: it doesn't matter whether b is 1 or 0, or whether c is 1 or 0, or whether it's 0 0, because those terms fall away; my estimate will always be beta-0 hat. So that's my null hypothesis: the independent variable is not a predictor of the outcome of the dependent variable, because we will always just have beta-0 hat. In essence, what we're saying is that the three means are equal, irrespective of which group my subject is in.

My alternative hypothesis, then, is that beta-1 hat is not equal to zero and/or beta-2 hat is not equal to zero, because then we're not going to get the same prediction out, are we? Now we might have a 0 and a 1 in there and we're going to get different values. So instead of thinking about three equal means, you can see from these coefficients that it makes more sense to think about the coefficients, as we did with linear regression: if beta-1 hat is zero and beta-2 hat is zero, I can only ever get beta-0 hat as the estimate of the dependent variable. So there are my null and alternative hypotheses.
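Written out (my transcription of equation 1 and the hypotheses as described):

```latex
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 b + \hat{\beta}_2 c,
\qquad b, c \in \{0, 1\}

\text{group a: } \hat{y} = \hat{\beta}_0 \qquad
\text{group b: } \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \qquad
\text{group c: } \hat{y} = \hat{\beta}_0 + \hat{\beta}_2

H_0: \hat{\beta}_1 = \hat{\beta}_2 = 0
\qquad
H_1: \hat{\beta}_1 \neq 0 \ \text{and/or}\ \hat{\beta}_2 \neq 0
```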
So let's build a model. We're going to use the ols function, just as we did with linear regression, and you can now start to see why we can just do that. Remember, the ols function can take a formula, which makes this very easy. The formula is a string, so it goes inside quotation marks: the dependent given the treatment, the dependent given the independent, whatever you named them in your DataFrame; those are the terms we used. Then the data comes from the DataFrame, and I also call the fit method, fitting the data to my model, and I assign the result to the computer variable linear_model. And there you go.

We can now check out the results, especially the coefficients, by calling the summary method, linear_model.summary, and there we go: it looks very much like what we saw before with linear regression. If we look at these coefficients, I see an intercept, and now I only see values for b and c. Look at those values: where did we see 100.6 before? That was the mean of the dependent variable for group a. And the two other coefficients: those are the difference between b's mean and a's mean, and between c's mean and a's mean. All my model is going to predict (remember what I said when we looked at that scatter plot) is the mean of each of those groups; that's all it's doing, and these coefficients are just the differences between each group's mean and group a's mean.

You still see the standard errors, so you know we're going to take each coefficient divided by its standard error to get a t statistic, and we can use the degrees of freedom of that specific sampling distribution, which is the parameter of the t distribution, to work out a p value. We're interested here in the p values for treatment b and treatment c, so for beta-1 hat and beta-2 hat, and you can see that for a chosen alpha value of 0.05 we do not reject the null hypothesis for either of them: the p values are 0.09 and 0.249. We also see the 95% confidence intervals around each coefficient. So we're quite familiar with what is going on here.
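That model cell, as a sketch:

```python
# Ordinary least squares via a formula; statsmodels dummy-encodes
# the categorical treatment column, with 'a' as the base case
linear_model = ols('dependent ~ treatment', data=df).fit()

# Coefficients, standard errors, t statistics, p values, and
# 95% confidence intervals
linear_model.summary()
```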
Nothing different from what we saw in linear regression. And there's our ANOVA table. You see we have degrees of freedom, just as before; we have the sum of squares due to the regression and due to the error, they're still there; the mean squares, which are just each sum of squares divided by its degrees of freedom; and if we do the division of the mean square due to the regression over the mean square due to the error, we get an F statistic. The F distribution takes two parameters, which are these two degrees of freedom, 2 and 27, and given that F ratio, that F statistic, we get a p value. Very, very simple; you've seen these things before.

So let's do this somewhat by hand, just so that you know what is going on, and let's start with the coefficients, beta-1 hat and beta-2 hat. My linear model has an attribute called params, and if I call linear_model.params, it gives me back these parameter values, the coefficient values: 100.6, 5.62, negative 3.82.

Now let's do this by hand, which is not really by hand; we're going to let Python do the heavy lifting. Remember, patsy creates design matrices for us, so I'm calling patsy.dmatrices, the dmatrices function, again with the little formula, the same one as in the ols call, and the data comes from df. That returns two objects for me: y, which is the column vector of my dependent variable values, and the design matrix X. Remember, the first column of X is a constant column, all ones, and the remaining columns are the values for my independent variable; but this time, because it's nominal categorical, patsy creates the dummy variables for us. And there we go.

Let's have a look at the first five rows of my design matrix. The first column is all the constants, and then you see 0 0, 0 0, 0 0, because remember, those first ten subjects were all a, so it's always going to be 0 0. You can really start to see that there's very little difference between what we're doing here and what we did with linear regression. Let's look at our dependent variable too; those are the values for the first few subjects.

Now, to do this bit of linear algebra, the least squares method, I convert both to NumPy arrays, as we did before, and overwrite them. And there's our friendly equation for least squares. Remember from the first video: if you understand a little bit of linear algebra, the column space of my design matrix does not span the whole space that it possibly could; I only have a few column vectors in there, and hopefully they are linearly independent (we have to have that). What I need is the orthogonal projection of y onto the column space of my design matrix, and that gives us the best possible values for beta-0 hat, beta-1 hat, and beta-2 hat. There's the equation: I take my design matrix, take its transpose, multiply the transpose by the matrix itself, take the inverse of that product, multiply that by the transpose of the design matrix, and multiply that by y, my column vector of dependent values. That's what I do in the code there; you can have a look at the matmul function, which does matrix multiplication. So we do all of that and we get these values.
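A sketch of that cell, using the normal equation beta-hat = (XᵀX)⁻¹Xᵀy:

```python
# Design matrices from the same formula: y is the column vector of
# dependent values, X has a constant column plus the dummy columns
y, X = patsy.dmatrices('dependent ~ treatment', data=df)
y = numpy.array(y)
X = numpy.array(X)

# Orthogonal projection of y onto the column space of X:
# beta = (X'X)^{-1} X'y
beta = numpy.matmul(
    numpy.matmul(numpy.linalg.inv(numpy.matmul(X.T, X)), X.T),
    y
)
```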
Let's have a look at them. They should be three values, and they are, and they exactly match the params we had before: there are the params, 100.6, 5.62, negative 3.82, and look what we get from least squares: 100.6, 5.62, negative 3.82, exactly the same.

So let's save each of these, just using indexing, assigning them to the computer variables beta_zero, beta_one, and beta_two. And now I'm creating a little Python function, just so you can see what a function looks like. I'm going to call my function research, and it takes one argument. If that argument is the string a, my estimate is beta-0 hat. If that argument is the string b, the estimate is beta-0 hat plus beta-1 hat (remember, beta-1 hat is multiplied by 1, and beta-2 hat is multiplied by 0, so I'm not putting it there). And else (that's if, elif, and else in Python; elif is else-if), so if it's not a and not b, the estimate is beta-0 hat plus beta-2 hat, and beta-1 hat falls away because I'm multiplying it by 0. So if you're interested, that's just a little function there.

If I pass a, it always predicts 100.6. And what happens if we look at the mean? I'm using df.loc, asking for a location: go down the treatment column, select only the a's, take the dependent values, and calculate their mean. What does it give back? 100.6. So given a, my model always predicts 100.6, and that is indeed the mean of the dependent variable for that group. If I pass b, I get 106.22, and that's the 100.6 plus 5.62, as simple as that; and if I look at the mean for the group b's, it's 106.22. And for c, let's call the function: it's 96.78, and indeed the mean for those subjects is 96.78. You can see where all these things come from now.

So you can see that we are comparing three means here: our model is always going to predict these three means, and that's exactly how these coefficients were calculated. But you can also see that from ordinary least squares, from the least squares method using linear algebra, we get exactly the same results. That's why your statistical textbook can use this idea of comparing the means, because that's exactly what we're doing; but I still want you to think about this in terms of the coefficients, and the null and alternative hypotheses in terms of the coefficients.

Remember that we can now do the t statistic. The magic really happens, as far as I'm concerned, with the equations for the standard errors, and you can look up all the different equations for the different model types. There is a bse attribute on the linear model that gives me the three standard errors, one for each of my coefficients, and I'm saving them separately as computer variables, because remember the equation for the t statistic: it's just a coefficient divided by its standard error. There's also an attribute called tvalues, so there are all the t values, and if I take beta-1 hat and divide it by its standard error, I get that exact value: 1.732 versus 1.732. Same for c. I'm saving these, assigning them to computer variables, because I can now calculate the p values. And there is an attribute, pvalues (remember, an attribute doesn't have opening and closing parentheses; it's not a function or a method), and there we go: the three p values that we had before.
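A sketch of those cells (the names beta_zero, beta_one, and beta_two follow the transcript; the rest is standard statsmodels and pandas):

```python
# Save the three estimated coefficients by indexing the result
beta_zero, beta_one, beta_two = beta.flatten()

def research(treatment):
    """Return the model's prediction for a treatment level."""
    if treatment == 'a':
        return beta_zero               # b = 0, c = 0
    elif treatment == 'b':
        return beta_zero + beta_one    # b = 1, c = 0
    else:
        return beta_zero + beta_two    # b = 0, c = 1

# The prediction equals that group's sample mean
research('a'), df.loc[df.treatment == 'a', 'dependent'].mean()

# t statistic = coefficient / standard error
linear_model.params / linear_model.bse    # same as linear_model.tvalues
linear_model.pvalues                      # the three p values
```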
Remember that these come from a t distribution, so I can use the cumulative distribution function, given the correct degrees of freedom, and in this instance that's 27: my sample size, my number of observations, was 30, and I'm subtracting 3 from that, in essence because I have three parameters, beta-0 hat, beta-1 hat, and beta-2 hat. (You can also think about the subtraction from 30 in a different way, so just be careful there.) And because b's t value was a positive 1.73, I'm making it negative, because the cumulative distribution function counts the area under the curve (or, if you think of the pdf, the probability density function, the area under that curve) from negative infinity. I multiply the result by two, because we have a two-tailed hypothesis, and I get 0.0946. And there is the p value: 9.46 times 10 to the power negative 2, which is scientific notation for 0.0946, the same result. We do the same for the coefficient beta-2 hat and again get the matching p value. So in both of these instances, for an alpha value of 0.05, we fail to reject the null hypothesis that beta-1 hat and beta-2 hat are equal to 0.

So, there we go: confidence intervals for each coefficient, as shown in equation 5. That is how we calculate these confidence intervals: you take whichever beta hat you're working with, and for the upper and lower bounds of your confidence interval you add to it or subtract from it, given a confidence level, a critical t value multiplied by the standard error for that coefficient. And by the way, you've got to work out what that critical value, t-crit, is. So let's look at t-crit, the critical t value for the t distribution with 27 degrees of freedom: I pass in 0.975, because remember, I'm talking about an alpha value of 0.05, two-tailed, so 0.025 on either side. That's the critical t value, and I can plug it into the equation, multiply by the standard error, and add it to and subtract it from the coefficient value. There is a conf_int method on my linear model that returns the lower and upper bounds of my confidence intervals, which by default are 95% intervals, and when they're calculated by hand (you can go through the code) we get exactly the same results using that equation.

And there you go: the coefficients. You see now how it all fits in, how it just builds from linear regression, and you see why we can talk about comparing the means, which is actually what we're doing; but it's really about the coefficients. It's just the dummy variables being (0, 0), (0, 1), or (1, 0), and if we had more levels, there'd be more of them. Those coefficients are just talking about the means, but I still want you to think about this in terms of these coefficients.
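Those by-hand checks might look like this (the index label 'treatment[T.b]' is statsmodels' default naming for the dummy term, which I'm assuming here):

```python
# Two-tailed p value from a t distribution with 30 - 3 = 27 degrees
# of freedom; negate the positive t value so the CDF accumulates
# from the left tail
t_b = linear_model.tvalues['treatment[T.b]']
2 * stats.t.cdf(-abs(t_b), 27)            # 0.0946

# Critical t value: alpha = 0.05, two-tailed, so 0.025 per side
t_crit = stats.t.ppf(0.975, 27)

# Confidence interval by hand for beta-1 hat
se_b = linear_model.bse['treatment[T.b]']
(beta_one - t_crit * se_b, beta_one + t_crit * se_b)

# statsmodels' own 95% intervals, for comparison
linear_model.conf_int()
```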
So let's look at the analysis of variance table; this is analysis of variance, after all. It shows the degrees of freedom, the sums of squares, the mean squares, the F statistic, and the p value for that F statistic. Let's use anova_lm: that's what we get, and remember, here we are still talking about the regression and the error.

So let's look at the fitted values first. There's a fittedvalues attribute, and it just calculates the estimated dependent variable value, which in our instance, remember, can only take one of three possible values, the group means; so the first lot are all just 100.6. I'm going to add that as a new column to my DataFrame: I name a new column header, attach it to my DataFrame, and assign these fitted values to it.

I do that because, again, I want to talk about the sum of squares due to the regression. As before (and just be careful, because different textbooks and different lecturers use different terms), the sum of squares due to the regression, for us, is the squared difference between the predicted value, the estimate that my model now makes, and the mean of the dependent variable, summed over all observations. That's exactly what we have here, and remember, this estimated value, y hat, only takes one of those three values, so for each observation it's just that group's estimate. There is an attribute in statsmodels for our model, linear_model.ess, and if you use it you see we get the sum of squares due to the regression: the difference between what my model predicts and this baseline mean model. Again, just as in linear regression, there's still this baseline mean model that says, irrespective of the input, whether it's a, b, or c, we always predict just the mean of the dependent variable, one single overall mean. And that's what we get, 450.96, and if we look up here at our table, 450.968.

So that's the sum of squares due to the regression. If we now look at the sum of squares due to the error, remember, that's the difference between the actual value and the predicted value (we use the E here because regression had the R): the estimate, the predicted value, minus the actual value, squared, summed over all of them. There is an attribute called ssr, which makes things slightly confusing now, because you can see that statsmodels uses the r there for residuals (the error is also the residuals, same thing), and it uses ess for the regression. But I want you to stick with our SSR and SSE so that we don't get confused. It is there; it gives me the sum of squares due to the error, 1420, which is exactly what we saw in the ANOVA table, and if we do it by hand, assigning to the computer variable sse, I get exactly the same result. Remember, this code is exactly just that equation: we take those differences, the new column of estimates minus the actual values, df.dependent, square all of those, and sum over all of them. That's exactly what we get.

So that's our numerator and denominator sorted; we just have to talk about the two degrees of freedom, in the numerator and the denominator, and what we have to consider is the number of observations and, in the simple case, the number of parameters (although, as I said, you can think of that in a slightly different way). If I call the shape attribute of our DataFrame and index the value 0 (shape returns rows, columns, and we just want the rows), I can assign that to the computer variable m, just to save that value: 30, we had 30 observations. And my number of parameters is 3; you can also think of that as the number of levels in your treatment. There we go: k.
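As a sketch of those cells:

```python
# The full ANOVA table: degrees of freedom, sums of squares,
# mean squares, F statistic, and its p value
anova_lm(linear_model)

# Fitted values are the group means; attach them as a new column
df['fitted'] = linear_model.fittedvalues

# Sum of squares due to the regression (statsmodels: ess):
# fitted values versus the single overall mean
ssr = ((df.fitted - df.dependent.mean())**2).sum()   # 450.968

# Sum of squares due to the error (statsmodels: ssr, for residuals):
# fitted values versus actual values
sse = ((df.fitted - df.dependent)**2).sum()          # 1420

m = df.shape[0]   # 30 observations
k = 3             # three parameters / three treatment levels
```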
Now, just to show you that the degrees of freedom are available as attributes: df_model gives me back 2.0, because statsmodels takes those three parameters and subtracts one, for my numerator degrees of freedom, which is just 2, very simple. And there is an attribute called df_resid, which gives me the degrees of freedom of the residuals, and you can think of that as m minus k: there are three parameters in my model, beta-0 hat, beta-1 hat, and beta-2 hat, I subtract that from my sample size, and I get 27. See? Simple as that.

And remember, for a mean square you take that specific sum of squares and divide it by its degrees of freedom. Again, there are attributes for these: mse_model is there, and if we do it by hand we get exactly the same result, 225.48. And then there is an mse_resid, there we go, and if we do that by hand we get the same result as well. But now you know where these values come from, and remember, our F ratio is then just this mean square due to the regression over the mean square due to the error, still exactly the same as we did with linear regression.

And this is where we get this idea of the between-group variance: remember, if we take a sum of squares and divide it by its degrees of freedom, we're talking about a variance, and in the sum of squares due to the regression you can think of subtracting the overall mean from a group's mean, so that is, in some sense, a variance between the groups. In the denominator it's the predicted value minus the actual value, and the predicted value is always going to be the group mean, so that is, in a sense, the variance within the groups. That's where these terms come from: between-group over within-group. But I want you to stick with the explanation that we've used up till now.

There is an fvalue attribute, and if I do it by hand I get exactly the same value, 4.285. There is an f_pvalue attribute, which gives us the overall p value, and remember, I can use the cdf function for the F distribution, stats.f.cdf: I pass my F ratio, my F statistic value, which we've just calculated, and the two degrees of freedom, which we've saved as df1 and df2, 2 and 27. Remember that the area under the curve starts counting from the left-hand side, so we've got to subtract it from 1; we're interested in that F ratio and more extreme. And there's exactly our p value, give or take a small rounding difference. That's really powerful stuff; hopefully you now understand all of what is in that table.

Here's a little plot that I've drawn for you (you can have a look at the code): the pdf of an F distribution with parameters 2 and 27. I've calculated the critical F value for an alpha value of 0.05, and there's our value. Remember, the cdf measures the area under the curve from the left, but we want the area from our value out toward positive infinity, this area under the pdf here, and hence we subtract our value from 1.
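Pulling those steps together as a sketch:

```python
# Degrees of freedom as attributes: k - 1 and m - k
df1 = linear_model.df_model    # 2.0
df2 = linear_model.df_resid    # 27.0

# Mean squares: sum of squares over degrees of freedom
msr = ssr / df1    # matches linear_model.mse_model, 225.48
mse = sse / df2    # matches linear_model.mse_resid

# F statistic: between-group over within-group variance
f = msr / mse      # matches linear_model.fvalue, 4.285

# p value: area from the F statistic out to positive infinity
1 - stats.f.cdf(f, df1, df2)   # matches linear_model.f_pvalue
```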
We can also just look at the coefficient of determination. Remember, that's still this ratio of the sum of squares due to the regression over the total sum of squares, and the total sum of squares is the addition of the sum of squares due to the regression and the sum of squares due to the error. There is an attribute for it, rsquared, and if we were to do it by hand we'd get exactly the same result. We just have to interpret it slightly differently here; the interpretation is not as straightforward, of course, as it is with linear regression.

More importantly, I want to talk to you about post hoc tests. Now, we only do a post hoc test if we find a significant p value; if not, then we don't. In our instance, when we looked at those three means, beta-1 hat and beta-2 hat were not significantly different from zero; we failed to reject those null hypotheses, and therefore we wouldn't do post hoc tests. But in this instance I just want to show you what they are. Post hoc tests means I'm now going to do pairwise comparisons. We had three levels, a, b, and c, so what's the difference between a and b? Between b and c? Between a and c? We've got to do those three.

The problem is that we have family-wise errors: you compound your type I error. That's the alpha-value inflation, the cumulative type I error, and you see a little equation for it, that alpha_fw there: you take our alpha value, the one we've used throughout, 0.05, compute 1 minus 0.05, raise that subtraction to the power of the number of p values that you have, and subtract the result from 1. That gives us this idea of a family-wise error. If we do that in our case, with three pairwise comparisons, our effective alpha value is not 0.05; across all these pairwise comparisons it's actually 0.142. So we've got to correct for that, and there are different ways to do so; I'm going to show you two.

The first one is Tukey's honestly significant difference (HSD) test. There's a pairwise_tukeyhsd function: I pass my two variables, set my alpha value, and get a summary of that. Now you can see the adjusted p values, and you'll see it's between a and b, between a and c, and between b and c.

The other way to go about it (and there are quite a few ways) is to do the pairwise comparisons as t-tests with a Bonferroni correction. What we have to do is save the dependent values for each group as a separate Python list, which is what I do there, and then run and save the individual p values for these t-tests; you see I have my three t-tests there. Now I pass all three p values, as a Python list, to the multipletests function, I set my alpha value, and I can set a method, in this instance the Bonferroni method. And there you can see what happens; this output is about rejection of the null hypothesis. Between a and b we don't reject the null hypothesis, and between a and c we don't reject either, but between b and c we do reject the null hypothesis.
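A sketch of the two post hoc approaches (stats.ttest_ind is my stand-in for however the transcript's individual t-tests were run):

```python
# Tukey's honestly significant difference test on all three pairs
print(pairwise_tukeyhsd(df.dependent, df.treatment, alpha=0.05))

# Pairwise t-tests, then a Bonferroni correction of the p values
a = df.loc[df.treatment == 'a', 'dependent'].tolist()
b = df.loc[df.treatment == 'b', 'dependent'].tolist()
c = df.loc[df.treatment == 'c', 'dependent'].tolist()

p_values = [
    stats.ttest_ind(a, b).pvalue,
    stats.ttest_ind(a, c).pvalue,
    stats.ttest_ind(b, c).pvalue,
]

# Returns reject decisions, adjusted p values, and the corrected
# alpha values for the Sidak and Bonferroni methods
multipletests(p_values, alpha=0.05, method='bonferroni')
```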
You can see the three adjusted p values there, and you can also see the corrected alpha value for the Šidák method and the corrected alpha value for the Bonferroni method, so you can see just how the corrections take place.

So I hope this was enlightening, that this was a good video tutorial for you, the second in the seminar series on linear models. This was analysis of variance, and really, if you understood linear regression, there was almost nothing new here. You can see how we can use analysis of variance to compare the means of groups, but what we are really considering is just those coefficients, and then also our analysis of variance table.