And while the sun shines brightly over here, I want to give you a brief introduction to this next video. My name is Dr. Juan Klopper and I'm a research fellow at the School for Data Science and Computational Thinking at Stellenbosch University. This video is going to be about linear modeling using the F distribution: we're going to do some linear regression, some t-tests, and then analysis of variance. I'm going to give you a brief show on the whiteboard of just what this is all about, and I'm going to assume that you've watched one of my videos so that you know what a sampling distribution is and how we use it to build our intuition for how to calculate a p-value. So let's go to the whiteboard, and then we're going to open a Jupyter notebook and I'll show you how easy the code is to do all of these things.

So in this video I want to talk about linear models. We can use Python, and I'm going to show you a couple of things: we're going to do some linear regression, we're going to do t-tests, and we're going to do analysis of variance, and it's all going to be based on this little beauty here, the F distribution. As always, all our definitions are in green. There is our equation for the F distribution, and it says that x, whatever our value is going to be, depends on two parameters, so it's a parametric distribution, and you remember how these work. You see d1 and d2: those are the two different degrees of freedom that we're going to pop in there, and of course we're going to use some code, so you never have to worry about this equation; I just want to show it to you in all its glory. And the B that you see there, well, that is the mathematical beta function. So depending on these values d1 and d2 you're going to get different sampling distribution curves, their probability density functions.
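To make the equation concrete, here is a minimal sketch of that formula as a Python function, assuming the usual parameterisation of the F distribution and scipy's `special.beta` for the mathematical beta function, checked against scipy's built-in implementation:

```python
import numpy as np
from scipy import stats, special

def f_pdf(x, d1, d2):
    """PDF of the F distribution with d1 and d2 degrees of freedom."""
    numerator = np.sqrt(((d1 * x) ** d1 * d2 ** d2) / (d1 * x + d2) ** (d1 + d2))
    # B(d1/2, d2/2) is the mathematical beta function mentioned on the board
    return numerator / (x * special.beta(d1 / 2, d2 / 2))

# The hand-written function agrees with scipy's stats.f.pdf
for d1, d2 in [(1, 10), (3, 10), (10, 20)]:
    assert np.isclose(f_pdf(2.0, d1, d2), stats.f.pdf(2.0, d1, d2))
```

In practice you would just call `stats.f.pdf` directly; writing it out once is only to show where d1, d2, and the beta function sit in the curve.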
So, given any two specific values of d1 and d2, we can imagine doing a study over and over and over again: we're going to get a sampling distribution of these F statistics, and you see there the probability density function. We're going to do a study, and our F value is going to land somewhere, say for instance right there. Because this is a PDF, the area under the whole curve is one, and what we care about is this area under the curve on this side, and that of course is going to be our p-value. And as always, when we do use code, we start from the left-hand side, because it is the cumulative distribution function that we actually work with; the whole area is one, so one minus this part leaves us with that part. So that, in essence, is the probability density function of our F distribution.

Now we're going to talk about, as I said, linear regression. What we have here is a simple univariable model. Imagine all these dots: each belongs to a single observation, whether they be patients, organisms, whatever the situation is. For any one of our observations in our study, we're going to gather data for two numerical variables, so two columns in a spreadsheet. So for this individual here, for the one variable they're going to have this value, and for the other variable they will have this value here. Every single observation will have these two numerical variables. As I say, if you think about a spreadsheet, there's going to be one variable, say variable one, and there's going to be another, variable two, and we have all our observations here, patients or whatever they may be, and we have values for each one of those; each dot belongs to a separate individual, and those are the two values
there. What we want to do eventually is build this model, this red line, and this red line is what we call a best-fit model. What we do is say: for any given value here of the independent variable, our model predicts that to be the value for the dependent variable. There's this difference there, and we'll get to the difference shortly. So there is this idea that these are every single individual's pairs of observations. Of course, we can have more than one independent variable, in which case we have a multivariable linear regression, but we're going to keep things simple.

So, given this, we want to try and predict this value. Why would we want to do that? Well, that's modeling: this might be very expensive or very difficult to collect, and it might be very easy for us to collect these, and then our model can predict what this value is going to be. Now, the way we go about it with linear regression, remember, is about fitting this line, and remember, these are all linear models, so we're going to see straight lines. If you remember from school, y equals mx plus c (or b, whatever the case might be). y was this value, x were these values, and m is the slope of this line: if the slope is one, we'll get a difference of one there for every difference of one here. And c: if x is zero, anything times zero is zero, so that term falls away and y equals c. So right here, where x is zero, that is where we have the y-intercept, and that's very important. If we have a slope and we have a y-intercept, that's all we need to draw the straight line.

But how is the straight line found? Well, it is done by minimizing the error. You've already seen an error right here: it's the error between this person's or individual's true observed value and what the model predicts. There is the error, and we can just draw it in
here: that would be the error for that individual, and here we have an error for this individual, and for this individual, and for that one, and that one, and that one. So all of those are errors that our model is making, and these are also called, so let's have it out there, residuals. And how do we get this line? Well, there are different methods: there's gradient descent, there's ordinary least squares. We're not going to worry about that; we're going to write some code and that's going to happen for us, using statsmodels. But somewhere we've got to relate all of that back to this.

With linear models we're also going to express something called the R squared, the coefficient of determination, which tells us how well our model does versus a baseline. And this is our baseline: we could always draw a straight line that is just the mean of, I should say, our dependent variable. So we calculate the mean there, and we say: whatever this individual's independent variable is, it predicts the mean. They all predict the mean, and of course that means we're going to have very large residuals. The scale here is a little bit different, so use your imagination. And remember, some of the residuals are on top and some are on the bottom, so if we have this value minus this value that's positive, but this value minus that value is negative; positive, negative, positive, and if you just add all of them up you're going to end up with zero. So what we do is square these residuals, the error terms. We square them, then we add all of them up, and that's called the sum of squared errors. We're going to use that quite a bit, so I'm just telling you these terms now, so that you recognize them when we get to them.
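That cancellation is easy to see with a few made-up numbers (these values are illustrative, not the notebook's data): the signed residuals from the mean always sum to zero, which is exactly why we square before adding.

```python
import numpy as np

values = np.array([10.9, 25.0, 52.3, 61.7, 80.4])  # made-up dependent values
residuals = values - values.mean()

# The positive and negative residuals cancel exactly...
assert np.isclose(residuals.sum(), 0.0)

# ...so we square them first: the sum of squared errors (SSE)
sse = np.sum(residuals ** 2)
assert sse > 0
```

Dividing that SSE by the number of values is the variance, which is where this is headed.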
Now, if we look at the difference between every value and its mean, we're working towards the variance, and of course we're going to use this idea of the variance when we get to R squared. Irrespective of all of this, it's all going to give us some F value that we can work out, which is going to fall somewhere on our probability density function, and we can use the cumulative distribution function to work out this little area right there, and that's going to be a p-value.

Now, we're also going to use the F distribution to do a t-test, and you might say: well, we have Student's t-test and we have a t-distribution, that beautiful curve that looks like a normal distribution. Well, we can also use the F distribution to do this. With a t-test, say for instance Student's t-test, we have our bunch of numerical variable values here, but we have two groups, and remember that the groups come from the sample space elements of a categorical variable, a binary categorical variable. So here's that categorical variable again; let's call it cat one, and it's only going to be one, two, two, one. And remember, it's that binary categorical variable that gives us our groups when we do Student's t-test. So we'll have group one and group two, and each of these individuals has one single numerical variable, and we're going to compare that very same variable between the two groups. Imagine that is the situation: we're going to work out the mean for this group and the mean for that group, and those are what we're going to compare with each other, and that is how we're going to get to an F statistic that again we can plot here. So not a t statistic on the t-distribution; it's going to be on the F distribution.
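As a quick check of that claim, with hypothetical group data (not the notebook's): for two groups, the one-way ANOVA F statistic is exactly the square of Student's t statistic, and the two tests give the same p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(50, 5, size=15)  # hypothetical measurements, group 1
group2 = rng.normal(55, 5, size=15)  # hypothetical measurements, group 2

t_stat, p_t = stats.ttest_ind(group1, group2)  # Student's t-test (equal variances)
f_stat, p_f = stats.f_oneway(group1, group2)   # one-way ANOVA on the same two groups

assert np.isclose(t_stat ** 2, f_stat)  # F = t squared when there are two groups
assert np.isclose(p_t, p_f)             # identical p-values
```

That identity is why a t-test can be seen as a special case of the F-based approach.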
So we're going to have those parameters there as well. And then, of course, analysis of variance: that was a t-test, but we're also going to do analysis of variance, and for that we can of course have more than two groups. So if our categorical variable had three or more sample space elements, then we'll have another group, and we can compare all those means with each other, and again we're going to compare that to putting the whole bunch together and taking its overall mean. I'm going to show you how all of that works and how easy it is to do using Python, so let's open a Jupyter notebook and let me show you how this is done.

Now that you have some idea of what this lecture is all about, let's have a look at the Jupyter notebook that we've opened here. Remember, there will be a link in the description down below where you can get your hands on this notebook, because there's all sorts of information written down here for you. We can see the F distribution there that we talked about on the board, and a lot of explanatory text, but let's start at the beginning: we're going to import some packages. We import numpy as np, so we're just using the namespace abbreviation np for numpy. From the scipy package we're going to import two libraries, stats and special, and from pandas we're also going to import the DataFrame function. Then, to do our plotting, we're going to use plotly, one of my favorite packages for data visualization: we import both the graph_objects and the express libraries from the plotly package, with the namespace abbreviations go and px. We're also going to import plotly.io as pio and immediately use the templates.default setting and set it to plotly_white; since I'm using a white theme here, we're going to have plots with a white background. And then
from patsy we're going to import the dmatrices function. Remember, I do have a video out, also linked down below, that shows you what these design matrices are all about; patsy really makes it easy for us to use our data inside packages such as statsmodels, and, as you can see there, we're going to import statsmodels.api as sm. So let's run all of that.

Now that we've imported all our packages, just a reminder of what the F distribution is. What I've created here, remember, is the equation from the whiteboard, written out for us there, and I'm creating a little Python function just to calculate all of that for us; you can see special.beta there, that is the mathematical beta function. Then I'm just going to generate a couple of these distributions, given different values for our two parameters, and there we go: we see three different distributions given the values for d1 and d2, as you can see there. Once again, we're going to find an F statistic for our test, an F ratio, that's going to be somewhere here, and we're just calculating the area under the curve right there.

So let's do one example. I'm going to use d1 as 1 and d2 as 10, and I'm going to take an F statistic right down here at 3.5. We're actually calculating the area under the curve there, but we use not the probability density function, remember, we use the cumulative distribution function, and as I showed you on the board it's one minus stats.f.cdf of 3.5 (our F ratio, or F statistic) and the two parameters, and we get a p-value for that. So that was all a bit of a reminder; I'm sure you have knowledge of all of this. Let's get going and talk about what this tutorial is all about. We're going to start with simple linear regression, that is, univariable linear regression.
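The p-value calculation described a moment ago, with d1 = 1, d2 = 10, and an F statistic of 3.5, comes down to a one-liner (scipy's survival function `stats.f.sf` is a shortcut for the same one-minus-CDF subtraction):

```python
from scipy import stats

d1, d2 = 1, 10
f_stat = 3.5

# Area under the F(1, 10) curve to the right of 3.5
p_value = 1 - stats.f.cdf(f_stat, d1, d2)

# scipy's survival function does the subtraction for us
assert abs(p_value - stats.f.sf(f_stat, d1, d2)) < 1e-12
assert 0 < p_value < 1
```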
We're going to have a single independent variable predicting our dependent variable, so let's create the two of them. I'm using the random.seed function inside of numpy and setting it to the integer seven; if you do the same, you're going to get the same pseudo-random numbers as me. I'm using numpy's random.uniform, so I'm drawing from a uniform distribution, with a low of 10 and a high of 100, so on that interval from 10 to 100, and we want 20 of those values. I'm wrapping all of that, as first argument, in the numpy round function, with 1 as the second argument, so we're just going to get one decimal value. Then for the dependent variable I'm again wrapping something inside the round function, again with 1 there at the end as the second argument, just so that we have a single decimal value. And what I'm doing is taking every independent value and adding a bit of random noise to it, so we're broadcasting element-wise addition of two arrays: numpy's random.normal, so the noise comes from a normal distribution with a mean of zero and a standard deviation of two, and again 20 of those, so I have 20 in both arrays, and we just do element-wise addition.

So let's visualize in a scatter plot what that looks like; you can look at the code there for the go.Scatter function. As you can see, there's beautiful correlation here: as the value for the independent variable increases, so does the value for the dependent variable. And I remind you that each of these markers is one observation in our data set, one row of data; for that individual there were two numerical variables, and we have a representation of both, so each marker belongs to a single observation. Now I'm just going to convert this into a data frame, and I'm using, as you can see here, curly braces, which denote a Python dictionary, to create this data frame.
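The data-generation recipe just described can be sketched like this (same seed and same call order as described, so the draws should reproduce; the correlation check at the end is just a sanity check, not the notebook's code):

```python
import numpy as np

np.random.seed(7)  # same integer seed, for reproducible pseudo-random numbers

# 20 values on the interval from 10 to 100, rounded to one decimal
independent = np.round(np.random.uniform(low=10, high=100, size=20), 1)

# element-wise: each independent value plus normal noise (mean 0, sd 2)
dependent = np.round(independent + np.random.normal(loc=0, scale=2, size=20), 1)

assert independent.shape == dependent.shape == (20,)
assert independent.min() >= 10 and independent.max() <= 100
assert np.corrcoef(independent, dependent)[0, 1] > 0.9  # strong correlation
```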
We're going to assign that to the computer variable df, then use the DataFrame function, and here's my dictionary. I've got a key-value pair right there, the key being independent in quotation marks, so it's a string, and that's going to be the name of that column, the name of that statistical variable, independent, and I'm assigning to it all 20 values in my computer variable independent. Then dependent gets dependent, and then group: here I'm assigning a list of individual characters, or strings, written out by hand, and you see c and e there. Let's imagine that the c observations are the control group and the e ones are the experimental group. Using very easy, simple indexing, I'm just going to see the first five rows of data, and there's our lovely data frame: we can see these values, 16.9, 18.0, and that observation was in group c.

Now we're going to convert this into a format that's ready for use when we do linear regression. That's very important, and for that we're going to use the dmatrices function inside the patsy package. As first argument, patsy takes these very easy formulas, which remind us of the formulas that we can use in the R language for statistical computing, so it makes it very easy. It uses this little tilde symbol, which is in different places on different keyboards, you'll have to find yours, and it says dependent by independent: I'm trying to predict the first one given the second one. If we had more, if it was multivariable linear regression, we'd just put a plus there and add the name of one of the other columns, and the next one, and so on; please watch that video on patsy if you want to know exactly how to do this. As second argument, we've got to specify that the variables used in this formula come from the df data frame. What this is going to do is create two separate entities for us, so we're going
to have two variables there, a y and an X. y is going to be my dependent variable vector, and X is going to be my feature matrix, the matrix that contains both my intercept and my independent variable; I'll show you what that looks like. Let's have a look at y, and indeed, as you can see there, that is just my dependent variable: 18.0, 80.4, 50.0, we see the values there. It's a design matrix, that's the data type when we use the dmatrices function in patsy. But let's have a look at X. That's my feature matrix, also a design matrix data type from patsy, and we see our list of independent values there, 16.9, 80.2 (you see them up there), 49.5, they're all there, but we've got a new column, called the intercept column, and that's all ones. If you wanted to look at the linear algebra behind the scenes here, you need that column of ones to do your matrix multiplication.

And I remind you, this is what we're trying to achieve here. We have all the values here, 18.0, 80.4 (I hope I jotted them down correctly; if not, forgive me). This first vector, that is my y vector, my dependent variable, and we're trying to predict its values, and for that we need this column of ones and this column of my independent variable, as you can see there. So what we're looking for is these two parameters, beta sub zero and beta sub one, such that, if I look at that 18.0 right up there, I'm saying: if I have the right value for beta sub zero multiplied by one (that first one there), plus the right value for beta sub one times 16.9 (there we go, times 16.9), that should give me very close to 18.0. There'll be a bit of an error, though; it won't be exactly 18.0, but if I choose my values for beta sub zero and beta sub one very nicely, my error value is going to be very small. (The error is in bold there because it's also a whole vector of values, in case you were wondering.) So I just have to get the right beta sub zero and beta sub one.
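A numpy-only sketch of that idea, using the three pairs of values read off above, with an explicit column of ones and `np.linalg.lstsq` standing in for what patsy and statsmodels do behind the scenes (this is an illustration, not the notebook's code):

```python
import numpy as np

independent = np.array([16.9, 80.2, 49.5])  # values read off the design matrix
dependent = np.array([18.0, 80.4, 50.0])

# The feature matrix: a column of ones (the intercept) next to the data
X = np.column_stack([np.ones_like(independent), independent])

# Least squares finds the beta_0 and beta_1 that minimise the squared errors
(b0, b1), *_ = np.linalg.lstsq(X, dependent, rcond=None)

# beta_0 * 1 + beta_1 * 16.9 lands very close to 18.0, with a small error
assert abs(b0 + b1 * 16.9 - 18.0) < 1.0
```

The matrix product `X @ [b0, b1]` gives all the predictions at once, which is exactly the matrix multiplication the column of ones makes possible.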
Beta sub zero is going to be my y-intercept (when x is zero, remember) and beta sub one is going to be my slope, so if I choose those correctly, it'll minimize the error and get us, for each individual value, as close to the dependent variable value as possible. Again, this is what we had on the whiteboard: all our values, and our model is going to be this blue line. For any independent value down here, the model predicts, on the line, this predicted value, but the actual value for that observation was way up there, and this difference is called the error, or the residual. And there we go, there are those residuals, the difference between every value and what the model would predict for that value. As I said, if you add all of those up you're going to get zero, so we square all these values; remember, if you square anything, minus three squared becomes positive nine. So we square all of those, and that's just what this little graphic is trying to say. We want values for beta sub zero, our y-intercept, and beta sub one, our slope, that determine this line, and it'll be the best fit because these squares will be the smallest that they can be; if we add all of them, they will be at their very smallest.

Then there's a way to determine how good our model does. I mean, we can fit a line like this to anything, no problem, but we've got to somehow express how good it is doing, and we do that through what is called the coefficient of determination, or R squared. What R squared does is just this little ratio, and sigma squared, as you can see there, is the variance. So we take the variance of some mean model (we're going to see that), subtract from it the variance of our best model, and divide by the variance of our mean model. It's just this fraction of variance that we're trying to express here, and in the end we'll see that
our model is going to explain a certain fraction of the total variance in the dependent variable. Don't worry about that yet; let's have a look at how to do this. You'll remember on the board we had this mean model, the one that, irrespective of what the independent variable value is, always predicts the mean of the dependent variable, and that's what we're doing here. So I'm just calculating the mean of my dependent variable up there: I'm using numpy's mean function, passing my array of values for the dependent variable, assigning that to the computer variable mean_dependent, and printing it to the screen, and we see it's 52.89. If you also set your pseudo-random number generator with a seed of the integer seven, you're going to get exactly the same values, and you'll also have 52.89.

So let's have a look at a scatter plot of this model. The line colors have changed around, but there's my mean right up there; this is closer to what we had on the board. Whatever the independent value is, it's going to predict 52.89 for each one of these independent values, and you can see the residuals, the errors, are going to be quite large, because I could draw a straight line at an angle, going up towards the top right from the bottom left, that's going to look a lot better, and that's going to have residuals that are much smaller, and we'll see that. But this is the mean model, and it's the basis for R squared, the coefficient of determination: that is how we calculate how good our model is, because it's based on this very bad model, the mean model. And I remind you, if we look at this one down here, it has a dependent value of 10.9 and the model is predicting all the way up here, what was it, 52.89; that's a big residual, and you can see a little calculation there for it. One thing I want you to do, though, is just consider the dependent variable on its own, its mean, and all
its values: so it's just the dependent variable here on the vertical axis, the y-axis. If I square all the residuals, add all of them up, and divide by how many there are, what do I have? I have the variance. Remember what the variance is: it is just the average difference between each value and the mean, but we square all of them so that we can add them up and it remains a positive value, and we divide by how many there are. So it's the sum of squared errors, the difference between the actual value and the mean, squared and summed, divided by how many there are: that's just the variance, as you can see there in equation four.

So let's get a sample size, n: I'm going to assign the length of the dependent variable (that was 20, remember, we had 20 numbers in there) and save that as a value. And let's do this little equation: we say dependent minus the mean of the dependent, and remember Python will do this element-wise, it'll take the first dependent value and subtract 52.89 from it, then the second one, subtract 52.89, et cetera. Then I square each of those, so I've put them in parentheses, and then to the power, remember, in Python two stars and a two means squared. I pass that to the numpy.sum function, so I'm squaring all those differences, then summing all those squares, and dividing by how many there are, and remember, that's nothing other than the variance. In actual fact, I could use the var function in numpy, np.var, pass the dependent to it, and I get exactly the same value, and it's only because we're using the mean that this turns out to be the variance.
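The by-hand calculation just described, with some made-up numbers standing in for the notebook's data, comes down to this identity:

```python
import numpy as np

dependent = np.array([18.0, 80.4, 50.0, 33.5, 71.9, 10.9])  # illustrative values
n = len(dependent)

# Sum of squared differences from the mean, divided by how many there are...
by_hand = np.sum((dependent - np.mean(dependent)) ** 2) / n

# ...is exactly the (population) variance that numpy computes
assert np.isclose(by_hand, np.var(dependent))
```

Note that `np.var` divides by n by default (ddof=0), which is why the two agree exactly here.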
So now let's look at statsmodels and get our best-fit model. statsmodels is a wonderful Python package that can do all sorts of statistical tests; remember, we imported statsmodels.api under the namespace abbreviation sm, and one of its functions is OLS, all uppercase: ordinary least squares. That uses linear algebra (we're projecting onto the span of the vectors we're dealing with, but don't worry about that, that's all linear algebra). I'm passing my vector of dependent values and my feature matrix X, and then I'm saying .fit, using the fit method, and that is going to fit my data so that I get the best possible values for beta sub zero and beta sub one, the intercept and the slope. I'm assigning this to the linear_model computer variable, and there we have it: we have a linear model with the snap of a finger, or the hitting of a key.

Now, this linear model has a bunch of attributes and methods. One is summary2: I'm just calling the summary2 method on my model, and that prints out a beautiful summary of our model. What you can see hiding there, first of all, is an F statistic on the right-hand side, 2418. We see a p-value for that F statistic, abbreviated as something times 10 to the power negative 20; that's just computer truncation, meaning it's basically zero, a tiny, tiny p-value, so a very significant model. And of course you would expect that, because, you know, we only added a little bit of random noise when we created those values. But you also see an R squared value here on the left, 0.993. Now, that coefficient of determination goes from zero to one: zero means the model is as bad as the mean model, and as those residuals get smaller the R squared starts to climb; if it's at 1.0 it's perfect, and then all those dots are in a line with no errors whatsoever. And when we look at the coefficients here, we see the intercept, 0.2150, and the slope, that's the independent variable's coefficient, 0.980. So what we're saying here is that if we take 0.2150 plus this 0.980 times the
independent variable value, that is going to equal the predicted dependent variable value, just as I showed you using those vectors before. So let's have a look at this; you can pause the video and just have a look at the code. And there we have it, the same data, and now we have this line, very different from the mean model, and you can well imagine that these residuals are a lot smaller than when we talked about the mean.

We can do sort of the same thing and work out the variance in all the residuals, but first we can just list them all by using the .resid attribute of our linear model: linear_model (that's the computer variable we used when we created the model) dot resid, for residual, and you see all the residuals there for all the values. As an aside, I can actually use the .predict method and pass my independent values to this model, and it's going to show us the predictions that it makes: all of these values lie on this red line. Given all the independent values that I've passed to it, it shows me what all the values on the red line are going to be, what the model predicts. But what we want to do is calculate the variance in this best model, so you can just say numpy's var function on all the residuals, I just want the variance in those residuals, and we see that there.

And now we can calculate R squared, remembering the equation that we saw before: I take the variance in the mean model, subtract from it the variance in the best model, and divide by the variance in the mean model, and we get the same 0.99; if we round that two up to a three (because there's a six that follows) we get exactly what we got here, R squared, 0.993. So we can really just calculate this by hand, and it gives us a good intuition of what this R squared is. As I said, the interval for R squared is between 0 and 1.
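Putting those pieces together in a numpy-only sketch (the data is regenerated with the same recipe as before, and `np.linalg.lstsq` stands in for statsmodels' OLS; this is an illustration of the ratio, not the notebook's code):

```python
import numpy as np

np.random.seed(7)
independent = np.round(np.random.uniform(10, 100, 20), 1)
dependent = np.round(independent + np.random.normal(0, 2, 20), 1)

# Best-fit model: intercept column plus the independent variable
X = np.column_stack([np.ones_like(independent), independent])
beta, *_ = np.linalg.lstsq(X, dependent, rcond=None)
residuals = dependent - X @ beta

var_mean_model = np.var(dependent)   # the mean model's residual variance
var_best_model = np.var(residuals)   # the best model's residual variance

# R squared: the fraction of variance the best model explains
r_squared = (var_mean_model - var_best_model) / var_mean_model
assert 0.9 < r_squared < 1.0  # the noise is small, so the fit is very good
```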
What we can say is that adding this independent variable to our model explains 99.3 percent of the variance in the dependent variable. That is the interpretation of R squared: what fraction of one (or of 100 percent, if you want to multiply by 100) of the variance in the dependent variable do all the independent variables that we put in explain? And our model obviously explains a lot of the variance, because it's a very good model.

Now, we saw an F statistic, an F ratio, and a p-value there, so let's have a look at how to calculate that, and it depends on these two parameters. It's tiny here on my screen, and in the video you're probably not going to be able to read it, but have a look at the notebook on GitHub and you'll be able to see it for yourself. It's a ratio, so we see a numerator and a denominator, but the numerator itself has a numerator and a denominator, and so does the denominator. In both, the inner numerators stay the same as what we had with R squared, but we're dividing each of them by something else, and those are the d1 and d2 that we saw when we looked at the probability density function of our F distribution. What that says is right in the tiny little bit down at the bottom, in that little piece I've highlighted in blue. The top one says p of the best model minus p of the mean model, and that has nothing to do with a p-value: it's the number of parameters. The number of parameters in our best model, remember, is two, an intercept and a slope, and our mean model only had one parameter, just the mean. So that's two minus one, and two minus one is one: there we have one of our d values. The bottom one says n minus p of the best model, where n is the sample size, and p best, remember, is the number of parameters in the best model, which is two. We had 20 cases in our study, minus two, that's 18. So we have our values for d1 and d2.
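That degrees-of-freedom bookkeeping can be sketched like this (again with `np.linalg.lstsq` in place of statsmodels; because both variances divide by n, the n cancels in the ratio, so using variances here gives the same F as the usual sums-of-squares formula):

```python
import numpy as np
from scipy import stats

np.random.seed(7)
independent = np.round(np.random.uniform(10, 100, 20), 1)
dependent = np.round(independent + np.random.normal(0, 2, 20), 1)

X = np.column_stack([np.ones_like(independent), independent])
beta, *_ = np.linalg.lstsq(X, dependent, rcond=None)
residuals = dependent - X @ beta

n = len(dependent)
p_best, p_mean = 2, 1                  # intercept + slope, versus the mean alone
d1, d2 = p_best - p_mean, n - p_best   # 1 and 18

var_mean_model = np.var(dependent)
var_best_model = np.var(residuals)

f_stat = ((var_mean_model - var_best_model) / d1) / (var_best_model / d2)
p_value = 1 - stats.f.cdf(f_stat, d1, d2)

assert f_stat > 100    # a very large F ratio for this clean data
assert p_value < 1e-6  # hence a p-value indistinguishable from zero
```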
Those two parameters define the probability density function. So let's just save them there, p_best and p_mean, as two and one, and then I've got my F value that I'm calculating here: the variance in the mean model minus the variance in the best model, over p best minus p mean, and then the denominator, which itself is a numerator and a denominator: the variance in the best model over n minus p best. If we do that, we have an F statistic of 2418, and that's exactly what we had; let me go up and prove it to you, there's the F statistic, 2418, right there. So we've calculated it by hand, and we know the easy equation now for our specific F statistic. All that remains is to plug it into one minus the CDF; remember, we want that little bit of area under the curve to the right. We're using f.cdf in the stats library of scipy, and it wants an F ratio, an F statistic, 2418, and then p best minus p mean (that will be our d1) and n minus p best (that will be our d2). If we do that, we get something that's truncated as well, remember, something times 10 to the power minus 20, but that's all just truncation; it's basically a p-value very, very close to zero.

Now that we've looked at the univariable model, with a single independent variable, let's add another independent variable. That makes it a multivariable linear regression. You also get "multivariate", but that refers to how many dependent variables there are; we're just talking about more independent variables. So I'm going to create a data set for us there, you can have a look at the code, and what that gives us is variable one, variable two, and a dependent variable: we're going to try and use variable one and variable two to predict the dependent variable, and we're going to do exactly the same thing. Let's just have a look at one way to visualize it, perhaps not the best way: we have variable one on the x-axis,
variable two on the y axis, and the color represents the dependent variable. We can also do a matrix scatter plot, where we look at pairs of these: of course variable one against variable one gives a very nice correlation, but we see it's not so good for the other pairs. So let's see how this model does. First of all, we create the design matrices using the dmatrices function in the patsy package. There's our formula, and as I said, we just keep adding variables; it's not plus as in one plus one is two, it's just listing all of the variables. Please watch the video on patsy to see how these work. So I have my design matrices, and if we look at X we have our column of ones for the intercept, but we also have variable one and variable two listed there. Let's use ordinary least squares again, very simply: sm.OLS, pass my vector y, which is the design matrix of my dependent variable, and then my feature matrix X, the design matrix with the three columns. Use the .fit method, save the result as a computer variable, and then use the .summary2 method. We can see we still have an F statistic there, a p value for that statistic, and an R squared value; a very poor R squared, look how close it is to zero. Now we do exactly the same thing as before: create a mean model and look at the variance of the dependent variable, so that's df.dependent, the variance of that, and then the variance of the best model. Again I use multi_lin_model.resid, which gives me all the residuals, and I look at the variance in those residuals and save it. My R squared is calculated exactly as before, and we get what the model reported, 0.07, so a very poor model. As far as the number of parameters are concerned, there are three in the best model: remember, the intercept and the coefficients for
variable one and variable two, so that's three, and in my mean model there's still only one, just the mean. And len gives us the number of cases: I'm passing df.dependent, any one of the three columns would do, and taking the length of that for n. Which means we can calculate an F statistic, 3.71, just as we had before, and then a p value of 0.079, just as we had when we used OLS. We get exactly the same values, and I think you now understand how the R squared calculation works and how the F statistic, or F ratio, calculation works, so that we can compute an F statistic and a p value by hand.

That being said, let's now use this machinery instead of Student's t test. I'm going to create, for the same variable, two sets of observations, and I'll call them group one and group two, using Roman numerals. Both come from a normal distribution: a mean of 100 and a standard deviation of 5 for the one group, and for that very same variable, 103 and 8 for the other, with 100 observations and 110 observations respectively. Then I create another NumPy array that simply puts those two together, so I have all 210 values in one array. Let's have a look at that: there we have, for the same variable, the group one and group two individuals, and you can see the difference between them. So let's do Student's t test. It's in the stats module of the SciPy package, and I'm using the function ttest_ind, the independent-samples t test, passing it the two sets of values, and we see a p value of 0.014. Now let's see if we can use the F ratio to get exactly the same answer. Again I start with the mean model, so I group all 210 individuals together, look at that single numerical variable, and calculate its mean. But now I'm looking at the sum of squared errors; I'm not using variance anymore, because I can't divide by the same sample size.
It was easy enough when we did linear regression, because we had pairs of values, or two independent variables and a dependent variable, but they all came from the same observations, so the sample size was always the same. Here we have sample sizes of 100 and 110 for the two groups, so we can't do that anymore. Instead, I put all 210 values together and take each individual value minus the mean of that whole set, square each of those, and sum them: the sum of squared residuals, using residuals with respect to the overall mean, and we save that as ss mean. So we're no longer dividing by n to turn this into a variance. Then I look at my two groups individually: for each value in group one, subtract that group's mean, square, and sum them all up, and the same again for group two. The best model simply adds those two sums of squared errors together. So once again it is only the sum of squared errors: we just take those residuals, if I can use the term residuals when my model only predicts a mean. In my best model I have two parameters, because I have two means, one for group one and one for group two, and for my mean model there's only a single mean. Then I just need the length of all of them, all 210 combined, so n is going to be 210. And with that, this is my F ratio. Now this is not quite the same as with linear regression; with linear regression, I can tell you now, you could also drop the dividing by n and just use sums of squares as well, because all your n's are the same anyway. So here we have the sum of squared errors for the mean model minus that of the best model, divided by d1 at the top, and then the sum of squared errors for the best model divided by the sample size minus the number of parameters, which is d2. We save that as F, and then we calculate a p value for it. And if we go up and look at Student's t test, the p value there was 0.014, exactly the same as
what we calculated now with an F statistic, and that's quite wonderful, isn't it? When it comes to analysis of variance, remember we can use that to compare the means of more than two groups. So I'm going to create a data frame for us; let's have a look at the first 10 observations. What I've created is a single numerical variable and a groups column, a, b, c, a, b, c, et cetera. Now I save them as separate NumPy arrays; you can see the code for that. So we have the three arrays, and we can describe the data per group: we get groups a, b and c, how many observations are in each, a mean for each of the three, and a standard deviation for each, all for the same numerical variable. We can do a box plot of those three, and we can see we're probably not going to find a difference in means between these groups. There is an f_oneway function for one-way analysis of variance in the stats module of the SciPy package, and it gives us the statistic and a p value. And once again, as we did before, we can do it by hand: put all the values together and get an overall mean, then take each group against its own mean, and eventually add everything up to get our sum of squared errors for the best model. This time we have three groups, so the number of parameters in the best model is three, and for the mean model there is only a single mean, and we save the sample size. Then we use that exact same equation for our F statistic and for our p value. It really is as simple as that. So we can use this idea of the F distribution and the sum of squared errors when we're talking about numerical variables in linear regression, and when we compare the same numerical variable between two or more categorical sample space elements. I hope you enjoyed this explanation of the use of the F distribution when it comes to linear modeling.
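The two-group and three-group comparisons above can be sketched together. The data here are simulated (the seed and the three equal ANOVA group sizes are assumptions), so the exact p values will differ from the video's 0.014; the point of the sketch is that the sum-of-squares F ratio reproduces `ttest_ind` and `f_oneway` exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two groups for the same variable, as in the t-test example
group_i = rng.normal(100, 5, 100)    # mean 100, sd 5, 100 observations
group_ii = rng.normal(103, 8, 110)   # mean 103, sd 8, 110 observations
combined = np.concatenate([group_i, group_ii])   # all 210 values together

t_stat, p_t = stats.ttest_ind(group_i, group_ii)   # equal-variance t test

# The same comparison via sums of squared errors and the F distribution
ss_mean = ((combined - combined.mean()) ** 2).sum()        # one overall mean
ss_best = (((group_i - group_i.mean()) ** 2).sum()
           + ((group_ii - group_ii.mean()) ** 2).sum())    # one mean per group

p_best, p_mean, n = 2, 1, combined.size
f_stat = ((ss_mean - ss_best) / (p_best - p_mean)) / (ss_best / (n - p_best))
p_f = 1 - stats.f.cdf(f_stat, p_best - p_mean, n - p_best)
# p_f matches p_t, and f_stat equals t_stat squared

# One-way ANOVA for three groups: scipy's built-in next to the same
# sum-of-squares recipe (three group means vs. one overall mean)
a, b, c = (rng.normal(100, 5, 30) for _ in range(3))
f_builtin, p_builtin = stats.f_oneway(a, b, c)

all_three = np.concatenate([a, b, c])
ss_mean3 = ((all_three - all_three.mean()) ** 2).sum()
ss_best3 = sum(((g - g.mean()) ** 2).sum() for g in (a, b, c))
f_by_hand = (((ss_mean3 - ss_best3) / (3 - 1))
             / (ss_best3 / (all_three.size - 3)))
```

The two-group F test on 1 and n − 2 degrees of freedom is mathematically the square of the pooled t test, which is why the p values agree.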