In this notebook, notebook number 11, we're going to talk about linear modeling. There's a lot to unpack here. We'll introduce a new distribution called the F distribution, and we'll do linear modeling using it, but we're going to stick to our idea of resampling so that we can understand what's going on. We'll talk a little bit about linear regression, we might revisit the t-test, and we'll also cover analysis of variance, ANOVA, something very useful in data analysis and data science. Let's look at the packages we're going to use: numpy and scipy, and from scipy we import the stats module and the special module. Then we have pandas as usual, and our plotting libraries as per usual. Now here's something brand new: the patsy package, from which we import the dmatrices function, and the statsmodels package, from which we import the api module, statsmodels.api.
We import that as sm. Now that we've imported all of this, we'll start off by talking about correlation. Previously we took a categorical variable, used it to split the sample space into groups, and then compared a numerical variable between those groups. Here we're doing something a bit different: we take each observation in a data set, look only at numerical variables, and compare those numerical variables to each other. If there are just two numerical variables, each observation gives us a pair of values, and we compare the values within each pair. So let's start with just two. One of them we'll term the independent variable and the other the dependent variable, and our aim with these linear models is eventually to predict a value for the dependent variable given only the independent variable. That's very useful, because the dependent variable might be difficult or expensive to measure, and we can model what it should be based on the data we do have. We'll start by talking about correlation, and I think most people have an intuitive understanding of what that term means. We're going to create two numerical variables in the following way. We seed the pseudo-random number generator, this time with the integer 7, and I create an array of values that I assign to the computer variable independent. On the outside I have np.round, and right at the end there's a comma one, so I keep just one decimal place. What's inside is more interesting.
I'm using the random.uniform function with a low of 80 and a high of 100. That's my interval, 80 to 100, as a uniform distribution, so every value between 80 and 100 has an equal likelihood of being chosen, and I want 50 of those. Then, to each of those 50 values, I add a little bit of random noise, and that's how I get my dependent variable. So np.round again, one decimal place: take every independent value and add random noise with a mean of 0 and a standard deviation of 5, 50 of them. Because I'm just adding a little noise to each independent value, I think you'll see it's quite easy to predict roughly what the dependent value should be. In case not, let's look at what the data looks like. We have the independent values on the x-axis and the dependent values on the y-axis in a scatter plot built with graph objects: go.Figure, with the data being a go.Scatter, x being the independent values and y the dependent values. Note that I must have an equal number of both: each observation is a pair, with both values present, and I can't have more independent values than dependent ones. You can clearly see the independent values lie between 80 and 100, and the dependent values are just those with a little random jitter up and down, so as the independent value increases, so does the dependent. Now, we all know by now what the variance is: for a numerical variable, we take every value and subtract the mean of that variable.
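The data generation just described can be sketched like this (variable names as I read them from the transcript; the legacy seeding call is an assumption):

```python
import numpy as np

# Seed the legacy pseudo-random number generator with 7, as described
np.random.seed(7)

# 50 uniform values on [80, 100), rounded to one decimal place
independent = np.round(np.random.uniform(low=80, high=100, size=50), 1)

# Dependent variable: the independent values plus normal noise
# (mean 0, standard deviation 5), again rounded to one decimal place
dependent = np.round(independent + np.random.normal(loc=0, scale=5, size=50), 1)
```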
We square each difference so that we have positive values, add up all these squares, and divide by n minus 1, the number of values minus one. Why n minus 1? Because we're only dealing with a sample; if you're dealing with a whole population you divide by the number of subjects in the population, but for a sample it's always n minus 1. That's our variance, as you see in equation 1: the idea of averaging the squared differences between each value and the mean of that variable. Let's calculate that for our case, just for the independent variable, according to equation 1. We take independent minus the mean of independent, with that subtraction inside a set of parentheses, square it, sum over all of those with the np.sum function, and divide by the length (remember, that tells us how many elements are in the set) minus 1. That gives us the sample variance, 27.3082. Of course, why did we do all of that? We can just use the np.var function. So np.var, passing my numpy array of values, but with ddof set to 1, the delta degrees of freedom. That means we subtract 1 from the sample size, n minus 1, and that gives the sample variance; if we don't set it, we get the population variance, which just divides by n. And we see 27.3082, exactly what we had before. So let's look at the variance of the dependent variable.
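As a quick sketch of the variance calculations, both by hand per equation (1) and with np.var (data regenerated here so the snippet stands alone):

```python
import numpy as np

np.random.seed(7)
independent = np.round(np.random.uniform(80, 100, 50), 1)
dependent = np.round(independent + np.random.normal(0, 5, 50), 1)

# Equation (1) by hand: squared deviations from the mean, summed,
# divided by n - 1 because this is a sample
var_manual = np.sum((independent - np.mean(independent))**2) / (len(independent) - 1)

# The shortcut: ddof=1 makes np.var divide by n - 1 instead of n
var_independent = np.var(independent, ddof=1)
var_dependent = np.var(dependent, ddof=1)
```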
That's np.var(dependent, ddof=1), and we get 51.599. Now we're going to build on this idea of the variance of each variable and look at how they co-vary, somehow combining how each of them varies. In equation 2 you see the covariance for two variables. What we do is the following: we take each pair of values (remember, we have an equal number of elements for x and for y); from the one value we subtract its variable's mean, and for its partner in the same subject we do the subtraction again, and instead of squaring we multiply these two differences by each other. Then we sum over all of them and divide by n minus 1. Fortunately there's a function for that: np.cov, covariance. We just pass our two arrays, our independent variable and our dependent variable, and we get a matrix back. Two of the values will be familiar: there's the 27.3082 and the 51.599. Those are the variances of each variable on its own; the 27.3082 was the variance of the independent variable and the 51.599 the variance of the dependent variable, and they sit on the main diagonal of this matrix. Across the other diagonal we see the same value twice, 26.88: that's the covariance, how the two variables vary together. What we have here is two rows and two columns, so we can use indexing. What we're after is either one of those 26.88s; the top one is in row 0, column 1, Python being zero-indexed.
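The covariance, both by hand per equation (2) and with np.cov, can be sketched as:

```python
import numpy as np

np.random.seed(7)
independent = np.round(np.random.uniform(80, 100, 50), 1)
dependent = np.round(independent + np.random.normal(0, 5, 50), 1)

# Equation (2) by hand: multiply each pair of deviations, sum, divide by n - 1
cov_manual = np.sum((independent - independent.mean())
                    * (dependent - dependent.mean())) / (len(independent) - 1)

# np.cov returns a 2x2 matrix: the two variances on the main diagonal,
# the covariance (twice) off the diagonal
cov_matrix = np.cov(independent, dependent)
covariance = cov_matrix[0, 1]  # row 0, column 1
```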
So I just use indexing, [0, 1], and applied to the covariance matrix that gives me the 26.88 I'm after. We now have a solid understanding of what the covariance is, and how does that help us? It helps us express how well these two variables are correlated: does one change as the other changes, or is it just a complete random mess? We can express this correlation, the strength of the linear relationship between the numerical variables, using the covariance, as what we call the Pearson correlation coefficient. You can see it in bold here, Pearson correlation coefficient, and we have a little symbol for it, a lowercase r. We see its equation in equation 3. We take the covariance and divide by the product, in other words the multiplication, of the two standard deviations: the standard deviation of the one times the standard deviation of the other. It's just that ratio of the covariance between them to the product of their standard deviations. Let's do that in a little calculation: np.cov of the dependent and independent variables, but we only want one of the four values we get back, the one in row 0, column 1, and we divide by the standard deviation of the independent variable times the standard deviation of the dependent variable, remembering ddof=1 because it's a sample. That gives us the Pearson correlation coefficient. Fortunately we didn't have to do all of that: there's a function for it in the stats module of scipy, stats.pearsonr. All we have to do is pass the two arrays to it, and it returns two values: r, this correlation coefficient, and a probability, a p-value, for it.
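Both routes, equation (3) by hand and the scipy shortcut, can be sketched as:

```python
import numpy as np
from scipy import stats

np.random.seed(7)
independent = np.round(np.random.uniform(80, 100, 50), 1)
dependent = np.round(independent + np.random.normal(0, 5, 50), 1)

# Equation (3) by hand: covariance over the product of the two
# sample standard deviations (ddof=1 because these are samples)
r_manual = (np.cov(independent, dependent)[0, 1]
            / (np.std(independent, ddof=1) * np.std(dependent, ddof=1)))

# The scipy shortcut returns both r and a p value
r, p = stats.pearsonr(independent, dependent)
```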
So let's do that: r comma p, because we get two values back we can assign two computer variables, and what we print out is the correlation coefficient, 0.716, exactly what our own calculation gave; that's exactly what this function does. But it also provides this p-value, and if you do print it out, you'll see it's tiny: zero point, eight zeros, then a five, about 0.000000005. So it was very unlikely to find this r value by chance. Now, the correlation coefficient lives on an interval: because it's a ratio, it goes from negative one to positive one. Negative one is an absolutely perfect negative correlation, where as one variable increases the other decreases in step; positive one is a perfect positive correlation, where as the independent variable increases so does the dependent, in step, with none of this little jitter, none of this noise in the data. Now we have to talk about uncertainty again: how certain are we about this r value of ours, given that we're only working with a sample? What we're going to do here is state a null hypothesis: that it doesn't matter how these values are paired up, that I can reassign the pairs and there should be no difference. That's one way to go about it, and it's exactly what we're going to do. I start with an empty list, r_vals_0, and then I'm going to do 5,000 shuffles. We're not going to use the shuffle function, because remember that changes the original array in place; instead we do a random choice. So np.random.choice from independent, with size set to 50 and replace=False, which gives us exactly the same values back.
They're just going to be in a different order. What I'm saying is that these pairs no longer belong to the same subjects; I'm shuffling them around, and under the null hypothesis that should make no difference. So let's do that, 5,000 times over, and every time we append the correlation coefficient, using np.corrcoef and taking index [0, 1] at the end so I get just that one value back. And there we go: we've done it 5,000 times, and now we have a distribution of possible r values, correlation coefficients. Let's plot that out together with the one we found. There's the distribution of all the correlation coefficients; you can see them ranging from towards negative one on this side to towards positive one on that side, and the one we found is way out here. So again we can express the idea of how many of them were at our value or more extreme, and that's exactly what we do here. Just to remind us, the r value was 0.716.
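The reshuffling loop and the empirical p value can be sketched as (I've written the loop as a list comprehension; the notebook may append inside a for loop instead):

```python
import numpy as np

np.random.seed(7)
independent = np.round(np.random.uniform(80, 100, 50), 1)
dependent = np.round(independent + np.random.normal(0, 5, 50), 1)
r = np.corrcoef(independent, dependent)[0, 1]

# 5,000 reshuffles: np.random.choice with replace=False returns the same
# 50 values in a new order, breaking the pairing as the null hypothesis allows
r_vals_0 = [np.corrcoef(np.random.choice(independent, size=50, replace=False),
                        dependent)[0, 1]
            for _ in range(5_000)]

# Empirical p value: the fraction of shuffled r values at least as large as ours
p_emp = (np.array(r_vals_0) >= r).mean()
```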
All we're asking here is: in that array (remember, r_vals_0 is a Python list, and we pass it to numpy's array function so it becomes a numpy array), how many of our simulated values were greater than r, the r that we found? We divide by how many simulations we did, 5,000 of them, to get the fraction, the proportion of simulated r values greater than ours. Remember our p value was essentially zero, and sure enough our simulation gives zero: not one of those reshuffles of the values within the pairs gave us the r value that we actually got, so it was very unlikely to have been found by chance, matching that very small p value. Now let's get to the uncertainty in our correlation coefficient, and for that we're going to do bootstrap resampling, of course. We have an r value and we know it was an unlikely one to have been found; now, how certain are we about it? Can we set some confidence intervals around this r value? We're going to use the np.stack function, passing a Python list of the independent variable and the dependent variable, with the axis set to 1, and I assign the result to the variable data. Let me show you what that does: it very neatly creates an array that is a list of lists, where each subject has its independent variable value and its dependent variable value, for this subject, the next one, and the next one. I stack them together so they're in one neat array. Of course, if I had this as a pandas data frame it would be much easier to just extract those two columns, but because they're in two separate arrays I'm using this stack function. Now that I have them together, I can actually do resampling: bootstrap resampling, which means with replacement.
So this is what we're going to do. I'm creating a computer variable r_vals_boot, and I'm using a list comprehension to create 2,000 resamples. Remember, with bootstrap resampling each resample must have the same sample size we had before, and that's why we set the size to 50. The way we do it is with np.random.randint: I take random integers up to the shape of my data. Remember, the data is 50 long and 2 wide, two columns, and I'm asking for the zeroth element of the shape, which is the 50. So I take a random integer from 0 up to (but excluding) 50, in other words 0 to 49, which picks among rows 1 to 50 in Python's zero-indexed speak, and I want 50 of them. For each resample I calculate the correlation coefficient, and we pass rowvar=False to the correlation coefficient function. If you look at the documentation, you can see rowvar is True by default, and that's not what we want: here each column represents a variable and each row is an observation, which is why we say rowvar=False. Remember, with this bootstrapping I'm definitely going to have repeated rows within each of my 2,000 resamples; that's the point of sampling with replacement. So let's run the list comprehension. Now let's set a confidence level of 95 percent; if you want a good reminder, just go to the previous notebook. We're interested in the 2.5th percentile and the 97.5th percentile. As an index, which of the values will those be? Remember how to calculate k: we want the 50th value and the 1,950th value.
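The stacking, the bootstrap loop, and the sorted-index confidence bounds described above can be sketched as:

```python
import numpy as np

np.random.seed(7)
independent = np.round(np.random.uniform(80, 100, 50), 1)
dependent = np.round(independent + np.random.normal(0, 5, 50), 1)

# Stack into a 50 x 2 array: one row per subject, one column per variable
data = np.stack([independent, dependent], axis=1)

# 2,000 bootstrap resamples: 50 random row indices WITH replacement each
# time; rowvar=False because our variables are the columns, not the rows
r_vals_boot = [np.corrcoef(data[np.random.randint(data.shape[0], size=50)],
                           rowvar=False)[0, 1]
               for _ in range(2_000)]

# 95% interval by sorting: with 2,000 values the 2.5th and 97.5th
# percentiles sit at the 50th and 1,950th values, i.e. zero-based
# indices 49 and 1949
r_sorted = np.sort(r_vals_boot)
lower, upper = r_sorted[49], r_sorted[1949]
```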
Zero-indexed, we're actually looking for indices 49 and 1949. Let's just remind ourselves that the correlation coefficient was 0.716. What we do is sort the values (remember, we have to sort them first), and then the value at index 49 is our lower bound for r: we see it's 0.58. For the upper bound we take index 1949 once they're sorted, and there we go: the upper value is 0.82. So we have a lower bound and an upper bound for a 95 percent confidence interval around our correlation coefficient. Let's plot that out. And by the way, we can state it like this: our correlation coefficient was 0.72, 95 percent confidence interval 0.58 to 0.82, and the p value is very small; mostly we wouldn't write zero, we'd just say it's less than 0.01. So there we go: we see our plot, our correlation coefficient, the distribution of possible r values from bootstrap resampling, and our lower and upper bounds for the 95 percent confidence interval. Absolutely great. There are various ways to tackle this problem; this is one of them, and there's a nice paper that explains it, using a bootstrap approach to correlation analysis. In this next section I'm going to show you a brand-new distribution. You know what a normal distribution looks like, that nice bell-shaped curve, and the t distribution, which is very close to it and which we use because we don't know what the population looks like. Now I'm just going to show you the F distribution. We're not going to get into the nitty-gritty of it, but it is a very useful distribution, especially when it comes to linear modeling. I'll show you what it looks like and we'll see one or two equations, but once you know it exists, that's all we need in this course, and we'll start using it. So there we go: we see the F distribution and an equation for it.
It's a very impressive-looking equation that makes use of all sorts of weird symbols; we see the beta function in there. What we're interested in now is the d subscript 1 and d subscript 2, which appear a couple of times in the numerator and in the denominator. Those are two degrees of freedom; there are actually two degrees of freedom in this function, and that's the only thing we really need to be concerned about. What I've done for you here is create my own Python function, f_pdf. It takes the arguments f, d1 (set by default to 1), and d2 (set by default to 19), so if we use the function without passing values for d1 and d2, the defaults are going to be 1 and 19. It uses special.beta, the beta function. The reason I created that little function is just to show you what this distribution looks like, and it really just depends on d1 and d2; those are the important parameters. We see several combinations: a d1 of 1 with a d2 of 10, a d1 of 5 with a d2 of 2, and a d1 of 29 with a d2 of 18, so you can see what they look like. They start off bunched up on the small side with long tails, but they can also start approaching a normal distribution. What we're going to do is calculate an F statistic, and that F statistic we plot somewhere on our graph. Here we have one with a d1 of 1, a d2 of 10, and an F statistic of 3.5. How we calculate the likelihood of having found the result that we have is the area under the curve from this value towards positive infinity, this tiny little area. Of course, the cumulative distribution always counts from the left-hand side, so what we do is 1 minus all of that, which tells us how much area lies beyond.
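A sketch of that hand-rolled density (the exact body in the notebook may differ; this is the standard F density written with scipy's beta function, with the defaults as described), together with the tail area beyond the F statistic of 3.5:

```python
import numpy as np
from scipy import special, stats

# The F probability density written with the beta function;
# defaults d1=1, d2=19 as described in the notebook
def f_pdf(f, d1=1, d2=19):
    num = (d1 * f)**d1 * d2**d2
    den = (d1 * f + d2)**(d1 + d2)
    return np.sqrt(num / den) / (f * special.beta(d1 / 2, d2 / 2))

# Tail area beyond an F statistic of 3.5 on F(1, 10): 1 minus the CDF
p = 1 - stats.f.cdf(3.5, 1, 10)
```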
Remember, when we use our resampling techniques, we just have to ask, across all of our resamples or simulated recounts, how many fell from there outwards under our null hypothesis. So let's see what that would have been; I'm using stats.f.cdf, and don't worry too much about that: we see it would have been a p value of 0.09. So that's the F distribution. You know it exists now, and I suppose that's all that's important for us in this course. We've spoken about correlation; we know now how one variable changes as the other one changes. But now we want to use this idea in modeling. So let's have a look at our data again: I've got my independent variable and my dependent variable, and what I want is to create a model such that, given any independent value, I can use a calculation to give me some dependent variable value. That's what modeling is all about. So let's put this in a data frame; I'm just using a dictionary with two columns, independent and dependent, and I'm adding a third one called group, in which I just repeat C and E, so some rows are going to be C and some E; it starts off with a bunch of Cs and then the Es. That's our data frame: independent, dependent, and then a group variable. Now, as I said, I want to build a model that, given any independent value, predicts what the dependent variable value should be, and for that we need things called design matrices, because we're going to use statsmodels, and these design matrices are just perfect for use inside statsmodels. So here they are: there's the dmatrices function.
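As a sketch, sticking with the column names used here (the group column is left out since the formula doesn't use it):

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

np.random.seed(7)
independent = np.round(np.random.uniform(80, 100, 50), 1)
dependent = np.round(independent + np.random.normal(0, 5, 50), 1)
df = pd.DataFrame({'independent': independent, 'dependent': dependent})

# "dependent ~ independent": y gets the outcome column, X gets an
# Intercept column of ones plus the independent column
y, X = dmatrices('dependent ~ independent', data=df)
```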
Remember, we imported dmatrices from patsy, and it's going to return two values for us. The first we'll just call y, and the second we'll call X. The y is simply our dependent variable, as simple as that, but X is going to be a matrix that contains the independent variable plus a second column as well. So let's unpack this. I use the dmatrices function and pass it a little formula, and there's my formula: I'm saying I want the design matrices such that I'm trying to calculate the dependent variable given the independent variable, so it's dependent tilde independent. Of course, in your own work the columns are going to be named differently, but these names are very illustrative of what we're trying to do with the formula: predict the value of the dependent variable given the independent variable, with the data coming from the df data frame. That gives me the two design matrices. Let's have a look at them. The y is now a DesignMatrix data type, but it's just my column of dependent variable values; you can see them there. And if we look at X, it's also a DesignMatrix, and this is what it looks like: I still have my column of independent values, but it has added an Intercept column of all ones. That's all about using linear algebra behind the scenes to do these calculations for us; I just want to show you that it's really nothing other than the dependent and the independent columns, with dmatrices building the design matrices so that the linear algebra behind the scenes is very easy. So let's just think about how linear regression works. Equation 5 is a beautiful illustration of what's going on. I have all my dependent variable values there: 73, 86.6, 90.7, they're all there.
And I'm saying I can actually calculate each of those values. They're represented here as something we call a column vector; don't worry, it's just shorthand notation rather than writing out 50 of these equations. If I knew two unknowns, which we usually call beta sub 0 and beta sub 1, I could do this calculation. In other words, the 73 is going to equal beta sub 0 times 1 (I can take a scalar, a single value, and multiply it by a vector, which just means multiplying each and every element by that beta 0) plus beta sub 1 times 85.1, plus some error. So beta sub 0 plus beta sub 1 times 85.1 plus some error equals exactly 73. And 86.6 is going to be beta sub 0 times 1 plus beta sub 1 times 95.6 plus error 2, exactly 86.6. That's the way linear regression works. Now, of course, in a model we have to throw that error term away, because all we're trying to do is get values for beta sub 0 and beta sub 1. If I know those two and compute beta sub 0 times 1 plus beta sub 1 times 85.1, I'm going to get very close to 73, not exactly 73 but close to it, and for the next one very close to 86.6. So instead of this vector of actual values, I'm going to have a vector of predicted values; all we're doing is approximation, so the first one is going to be approximately 73. I just need to find these two unknowns, and it's two unknowns because I have one independent variable; if I had two independent variables it would be three unknowns, because there's always this beta sub 0.
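To make equation (5) concrete, here are its first two rows written out with the values read off above (73 and 86.6, with independent values 85.1 and 95.6), together with the compact matrix form:

```latex
% First two rows of equation (5), plus the compact matrix form
\begin{aligned}
73.0 &= \beta_0 \cdot 1 + \beta_1 \cdot 85.1 + \varepsilon_1 \\
86.6 &= \beta_0 \cdot 1 + \beta_1 \cdot 95.6 + \varepsilon_2 \\
\mathbf{y} &= X \boldsymbol{\beta} + \boldsymbol{\varepsilon}
\end{aligned}
```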
Think back to school: we had this idea of y equals mx plus c, just a straight line, a slope plus an intercept, where the intercept is the value when x equals zero. And if we look at this line here, this is what we're trying to build. For any given red dot I know exactly what the actual value is, but I'm going to build this blue line, this model, and it says: given the independent variable, go up to the model, go to the left, and read off the predicted value on the y-axis. That's what this model does. But where do we draw that line? Should it be a bit more slanted? Should it go up or down? We want the best possible one, and the best possible one is the one that makes the least amount of error. The error is exactly this difference between what the model would predict and what the actual value is. If we look at this blue line, it under-predicts some values and over-predicts others; everywhere we go, it's just that little difference between the prediction and the actual value. We call those errors the residuals, and we want to minimize these residuals. The way we do that is to look at least squares. What I'm trying to represent graphically here is that there's an error there, and we square that error, because some errors are going to be less than zero and some more than zero, and if we added all these errors together we'd get close to zero, which is not what we want; squaring makes everything positive. And remember, a square really gives you a square: what we want is for all of these little squares to have the minimum total size.
We want to minimize the error that we make, and that's what linear regression is all about: minimizing those squared residuals. Now, there are various techniques to go about it: we can do gradient descent, or we can use something like ordinary least squares, and this is where the A matrices come in. Those are our matrices, and the matrix has the column of ones in it; it's just the way we're going to get a vector of two elements out of this, a value for beta sub 0 and a value for beta sub 1. Don't worry about that; if you're interested you can go look it up, or we can use, as I said, the method of gradient descent. It doesn't matter which. What we are interested in here is this thing called R squared, with an uppercase R: the coefficient of determination, which tells us how well our model does. That's the one thing we're interested in. Just as we had the lowercase r, Pearson's correlation coefficient, to tell us how good a correlation is, here we're expressing how good our model is, and that's the coefficient of determination. So in this course it's not about the ordinary least squares equation you see in equation 6, or how to use gradient descent; we're just going to use a line of code that gives us our model. But we do need to express how good that model is, and we express that with the coefficient of determination, R squared. And R squared is simply this equation here: the variance of the mean-model residuals minus the variance of the best-model residuals, divided by the variance of the mean-model residuals. It's a fraction, a ratio. So what am I talking about? Let's just go and do this. The way it works is that we want to compare our best model to what we call a null model, the worst possible model.
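For reference only (the course doesn't require it), the ordinary least squares solution behind equation (6), in the A-matrix notation mentioned above, is usually written:

```latex
% Ordinary least squares: the coefficient vector minimizing the sum of
% squared residuals, with A the design matrix (a column of ones plus
% the column of independent values)
\hat{\boldsymbol{\beta}} = (A^{\mathsf{T}} A)^{-1} A^{\mathsf{T}} \mathbf{y}
```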
So let's come up with the worst possible model: a model that does the following. It says, give me any independent variable value, and my prediction is just going to be the mean of the dependent variable. Whatever independent value you give me, the prediction is always just the mean. So let's calculate the mean of the dependent variable: there we have it, 90.46. This model says it doesn't matter what independent variable value you give me, I'm always going to predict the mean of the dependent variable. Let's look at what a model like this looks like: it's a flat line. No matter what value you give me, I'm always going to predict this red line at 90.46. That's going to produce a bunch of errors; it's definitely not the best line, but it's our null model line, our worst line, and we're going to see how the best model we come up with compares, as a ratio, to this one. If we look at the very first value, it was about 89, but this model predicts 90.46, so we get that little error term there, and going across all of the points we get all these error terms. So let's look at the variance of our mean model, because, going back up, we want the variance of the mean-model residuals and of the best-model residuals. Now, because we're working with the mean, the residuals, those errors we make, are just the deviations from the mean, so the variance of my mean model is simply the variance of the dependent variable anyway: each value minus the mean, squared, summed; that's our variance. So there's our variance of the mean model.
It is 50.56. Now we've got to get our best model, and as I said, this course is not about how to calculate it by hand; we're just going to use a line of code. Remember from statsmodels that we imported statsmodels.api as sm. It has a very nice function, OLS, ordinary least squares, and all I have to do is pass y and X. Remember, y was all my dependent variable values and X was the design matrix with my independent variable values plus another column of all ones. So I pass y and X, use the .fit() method, and assign the result to a computer variable linear_model. It's as simple as that: that one line of code gives me the best model, no problems whatsoever. Let's look at a summary of this linear model. There's quite a bit to unpack there, but the interesting part is right where it says coefficients: there I have my two coefficients, my intercept coefficient and my independent-variable coefficient. The top one is my beta sub zero and the bottom one is my beta sub one, the y intercept and the slope; that's all we're interested in here. We can see all sorts of other things too: there's our F statistic, 50.53, and the probability of that F statistic, and that p value, the probability of having found an F statistic this large under the null, is basically zero, so it's a very, very good model. Let's plot this out. Don't worry about this code; I'm just generating some x and y values so that I can plot a line, and that line uses our beta sub zero, the y intercept, and our slope, beta sub one. That is the best model: if I take all the residuals, all these differences, square them and add them all up, that total is at a minimum.
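The notebook's one-liner is statsmodels' `sm.OLS(y, X).fit()`. As a library-free sketch of what that least-squares step does, here is the same fit with plain NumPy on made-up data (the seed and true coefficients are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 3 * x + 80 + rng.normal(0, 5, 50)   # assumed true intercept 80, slope 3

# Design matrix: a column of ones (for the intercept) next to x
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: beta minimizes the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = beta   # b0 plays the role of beta sub zero, b1 of beta sub one
```

`sm.OLS(y, X).fit().params` returns this same two-element vector, which is why the column of ones has to be in the design matrix.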
This is the best model. I can look at all the residuals, the little errors that it makes, by using the resid attribute, so linear_model.resid, and I see all the errors there: some are negative, some are positive. I can also see what the predictions would be with the .predict() method, passing all my x values to it (remember, that's my design matrix), and then I can compare the predictions to the actual values. What we're interested in, though, to calculate R squared, is the variance of these residuals, the variance of these minimal errors that we make. Let's look at their variance: there we go, it's about 24. And remember how to calculate R squared: it's the variance of the mean model minus the variance of the best model, divided by the variance of the mean model. That gives us R squared, 0.513, and that's exactly what the model summary showed us. So we're on an interval from zero to one. (We also get an adjusted R squared; that's something slightly different, which penalizes a model for having too many variables in it.) In essence it goes from zero, meaning no better than the bad mean model, up to 1.0, a perfect model. So our coefficient of determination, R squared, is 0.513; we've calculated it ourselves and we understand what's going on here. And how do we interpret it?
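That residual-variance calculation can be written out as follows. This is a sketch on made-up data (seed and coefficients assumed), so the R squared here will not be the notebook's 0.513:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 3 * x + 80 + rng.normal(0, 5, 50)

b1, b0 = np.polyfit(x, y, 1)             # best-fit slope and intercept
va_mean = np.var(y)                      # variance of the mean-model residuals
va_best = np.var(y - (b0 + b1 * x))      # variance of the best-model residuals

# Coefficient of determination, exactly as in the transcript's formula
r_squared = (va_mean - va_best) / va_mean
```

A handy sanity check: for a one-predictor model, this R squared equals the square of Pearson's lower-case r between x and y.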
Well, we say: our model, given values of the independent variable, explains 51.3 percent of the variance in the dependent variable. We can see that from the little equation we used for R squared: whatever the R squared value comes out to, that is the proportion, so just multiply by 100. So 0.513 times 100 is 51.3 percent, if we round it off. Our model, given values for our independent variable, explains 51.3 percent of the variance in the dependent variable; it's as simple as that. I also just want to show you the equation for the F statistic. I put this in here for interest's sake; it's not what the course is about. The F statistic's numerator is the variance of the mean-model residuals minus the variance of the best-model residuals, divided by p best minus p mean, where p stands for how many parameters went into each model. How many went into the best model? Two, remember: a beta sub zero and a beta sub one. How many went into the mean model? Just one, because the mean model only has a y intercept; there's no slope, it's just a flat line (computing the mean is just summing all the values and dividing by how many there are). So I'm saying p best equals two and p mean equals one. The number of cases, n, is just the length of either one of the variables; that's the sample size, because in the denominator of the F statistic there's another little ratio: the variance of the best-model residuals divided by the sample size minus the number of parameters in the best model. So in case you were wondering, that's how the F statistic is calculated. In our instance, let's save those computer variables first, and then calculate the F statistic. There's our F statistic, exactly what our summary said. Then we calculate a p value.
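Here is that F calculation written out, on the same kind of made-up data (seed and coefficients assumed; `p_best`, `p_mean`, and `n` follow the definitions above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 3 * x + 80 + rng.normal(0, 5, 50)

b1, b0 = np.polyfit(x, y, 1)
va_mean = np.var(y)                    # mean-model residual variance
va_best = np.var(y - (b0 + b1 * x))    # best-model residual variance

p_best, p_mean, n = 2, 1, len(y)
# Numerator: improvement per extra parameter; denominator: leftover
# best-model variance per remaining degree of freedom
f_stat = ((va_mean - va_best) / (p_best - p_mean)) / (va_best / (n - p_best))
p_value = 1 - stats.f.cdf(f_stat, p_best - p_mean, n - p_best)
```

For simple regression this F is algebraically the same as (R²/(1−R²))·(n−2), which ties the two quantities in the summary table together.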
It's 1 minus stats.f.cdf, passing the F value, then p best minus p mean (I've got to have that difference), and then n minus p best — the two divisors, the one that was inside the numerator and the one that was inside the denominator. We pass those and we get a p value, and you can see it's the same: almost zero. So it really is as simple as that, but I want you to have an intuitive understanding of what this linear model is. It's this idea that, given independent variable values, we predict the dependent variable values, and as I said, we use this in circumstances where it's very difficult to get the dependent variable values; it might be much more useful to have the model, because with the model we can look at other scenarios; we can do a lot. And this is just a very simple linear regression, of course. Under the term generalized linear models there are many models we can build, much more sophisticated models, that can really help us out, and of course, in this day and age of artificial intelligence and machine learning, that's another approach. There are so many approaches to building models that predict some outcome, and that becomes very useful, whether it's predicting which movie you would want on your streaming service or guiding a self-driving car; that's all machine learning, but linear models are really the basic element of all of it. So it's good to build a linear model like this, but there are some assumptions, so we just have to look at one or two diagnostics once we've built these models. We've built the straight-line model, so we're accepting that there is some linear relationship between the variables, and that might not always be the case. One way to find out is to save the residuals as a computer variable.
So that's linear_model.resid; we save the result inside this residuals computer variable and have a look at a plot of it. There we go: we see that these residuals are scattered all over the show. This is the residual plotted against the independent variable, with the zero line drawn in, and there's good representation all over the show, so there's no relationship between the independent variable and these residuals, and it's quite important to have that. We can also just look at the correlation coefficient between the independent variable and the residuals; if we work that out, we see it is vanishingly small, on the order of 10 to the power negative 16. It's very important that we don't see any pattern in this kind of plot. Sometimes you see a funnel pattern, a pyramid lying on its side, and that can be indicative of what we call heteroscedasticity. So one of the important things to check is that these residuals are nicely spread. Now we've seen what it looks like with a single independent variable; what if we have more than one? Let's look at an example of a model with two independent variables. We call these models multivariable linear regression. Just pay mind: there's also a term, multivariate linear regression, and that refers to having more than one dependent variable. Here we still have a single dependent variable, but we're going to have more than one independent variable with which to predict it, and we call that multivariable linear regression. So we're going to have two variables; I'm going to call them var one and var two, and they're both going to be drawn with randint. We see them there. And our dependent variable?
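That near-zero correlation is no accident: for OLS with an intercept, the residuals are orthogonal to the predictor by construction, so the correlation is zero up to floating-point error. A sketch of the check, on made-up data (seed and coefficients assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 3 * x + 80 + rng.normal(0, 5, 50)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Residual-vs-predictor correlation: should be ~0 (no leftover pattern)
r = np.corrcoef(x, residuals)[0, 1]
```

The plot, not this number, is what reveals funnel shapes and other heteroscedasticity patterns; the correlation only confirms there is no leftover linear trend.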
It's going to have a nice linear relationship: it's just the one plus the other, plus a bit of random noise. We save that as a data frame and look at the first 10 rows. So I have my variable one and variable two, my two independent variables, and then a dependent variable, and the way we designed this, there is a relationship between them. We can do a bit of a scatter plot, and here's one way to go about it, though perhaps for this data it's not the best, because we're trying to show three numerical variables at once: variable one against variable two, with a heat-map-style color map on the side for the dependent variable. Perhaps a better way to represent the data is a scatter matrix, so px.scatter_matrix. If you have many variables this can become a bit unwieldy, but we only have these. We see variable one against variable one — of course that's going to be perfect correlation — and likewise two against two and dependent against dependent. But look at variable one against variable two: is there any kind of correlation between those two? It certainly doesn't look like it. Variable one against the dependent variable: not so much. And variable two against the dependent variable: there does seem to be a bit of correlation there. So once again, to build a linear model, we create the design matrices. I say dmatrices, and there's our formula, very easy to read: take my dependent variable and use variable one and variable two as predictors. The data comes from the df data frame object, and it returns two sets of values: one is just going to be a vector of my dependent variable values, and the other is going to look slightly different.
So let's have a look at those. Remember the ones go in the first column, and that's exactly what we have here: the ones in the first column, then my variable one and my variable two, both still there. It's just about getting the linear algebra in shape for the OLS function. So there we have sm.OLS, ordinary least squares; I pass y and X, call the .fit() method, and assign the result to the computer variable multi_lin_model. Let's run that model, quick and easy, and look at the summary, using the .summary2() method there. Once again we look at these coefficients: I see my beta sub zero there, 8.9, and I see a beta sub one value and a beta sub two value. Very nice. So let's do this: let's get the variance of the mean model and of the best model once again, because we're interested in that R squared value. By the way, you can see the R squared in the summary, right there: 0.091. And we see our F statistic and the probability of the F statistic right there. So let's create those two variances. Remember, for the mean model it's just the variance of the dependent variable, because I'm just using its mean as my predictor, but for the variance of my best model we take the variance of the residuals. With those saved, we use equation 7 for R squared: the variance of the mean model minus the variance of the best model, divided by the variance of the mean model. As simple as that; our calculation looks exactly the same as in the summary. And just to remind you about the F statistic: my best model now has three parameters in it, a beta sub zero, a beta sub one and a beta sub two. My mean model still has only one.
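A minimal sketch of the same two-predictor fit, using plain NumPy in place of patsy and statsmodels. The group sizes, value ranges, seed, and true coefficients here are my own assumptions, not the notebook's:

```python
import numpy as np

rng = np.random.default_rng(7)
var_1 = rng.integers(0, 10, 100)
var_2 = rng.integers(0, 10, 100)
# Assumed relationship: dependent = var_1 + var_2 plus standard-normal noise
dependent = var_1 + var_2 + rng.normal(0, 1, 100)

# Design matrix in the shape dmatrices builds: ones, var_1, var_2
X = np.column_stack([np.ones(100), var_1, var_2])
beta, *_ = np.linalg.lstsq(X, dependent, rcond=None)
b0, b1, b2 = beta   # intercept and the two slope coefficients
```

The only change from the single-predictor case is one extra column in the design matrix, which is why the formula interface scales so easily to more variables.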
It just has a y intercept. So if I save those, along with the length for n, I can calculate my F statistic, and then one minus the cumulative distribution value — remember from those first graphs, one minus the total on the left-hand side gives me the right-hand side — and we see that p value there. Very, very simple to do when you use Python. Now, for the sake of interest, I'm going to show you how we can use the F distribution instead of the t distribution to compare two means to each other, and then how we can use analysis of variance and the F distribution to compare the means of more than two groups. So it's just a bit of interest in this last section of the lecture: revisiting the t test. What we're going to do here is simulate some values. I'm going to have group one and group two, both taken from a normal distribution, one with a mean of 100 and a standard deviation of five, the second with a mean of 103 and a standard deviation of eight, and I have 100 subjects in my first group and 110 in my second group. Then I'm going to have this NumPy array — I'm going to call it group all — and all I'm going to do is take group one and append group two to it, so I have one very long array of 210 elements. Let's run that and create a box-and-whisker plot, and what we can see here is group one and group two: a continuous numerical variable, the same for both groups, but two different groups, and we can start to ask: is there a difference between these two? Student's t test was very easy: that's stats.ttest_ind, and I just pass the two arrays to it, and it gives me a t statistic and a p value. But instead of that, let's use the F statistic and calculate a p value from it. The way this works is that we have the sum of squared errors for the mean of all the values as one model, and then we have a best model.
So let's look at this sum of squares for the mean and put it together. It's the sum of: take group all, subtract from it the mean of group all (so I've put all the subjects together in one group), and square. Notice we're doing a sum of squares, not dividing by how many there are; it's not the variance. It's just this very simple idea: take the whole lot, throw them together, take the difference between each value and the mean of all the values, square it, and sum all of those up. That's the sum of the squared errors, and that's what we see. Now let's do the same thing individually for each group: I take group one minus the mean of group one, square each of those, and sum them, and that's the sum of squares for group one; exactly the same for group two. Then the idea of best is just the sum of those: the sum for group one plus the sum for group two, and that gives me the sum of squares as far as the best model is concerned. And there's my F statistic once again: the sum of squares of the mean model minus the sum of squares of the best model, divided by the difference in the numbers of parameters, with the best-model term in the denominator as before. So let's look at those parameters: in my best model there are going to be two parameters, one mean for the one group and one for the other; for the mean model, where I just throw everyone together, there's a single mean, so its parameter count is one. And of course my number of samples, the 210, is just the length of group all. So if we save those, we can calculate the F statistic again, and there we go.
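Putting that together: the group sizes, means, and standard deviations below follow the transcript's setup, but the seed, and therefore the exact F and p values, are my own assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_1 = rng.normal(100, 5, 100)
group_2 = rng.normal(103, 8, 110)
group_all = np.append(group_1, group_2)

# One overall mean (mean model) vs one mean per group (best model)
ss_mean = np.sum((group_all - group_all.mean()) ** 2)
ss_best = (np.sum((group_1 - group_1.mean()) ** 2)
           + np.sum((group_2 - group_2.mean()) ** 2))

p_best, p_mean, n = 2, 1, len(group_all)
f_stat = ((ss_mean - ss_best) / (p_best - p_mean)) / (ss_best / (n - p_best))
p_value = 1 - stats.f.cdf(f_stat, p_best - p_mean, n - p_best)
```

With exactly two groups, this F is the square of the pooled-variance t statistic, so the p value matches stats.ttest_ind to floating-point precision, which is the whole point of the comparison.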
We have our F statistic, and once we have it, the p value is one minus that little area on the left-hand side of the F distribution, and we get exactly the same p value here as when we used something that looks completely different: the t test and the t distribution. So that's just a bit of interest in how useful the F distribution is, because we can expand on this and do some analysis of variance. So here's what we're going to have: three groups, which we'll call a, b and c, and we'll create some random values. Let's have a look at what this looks like once we put it inside a data frame: I've got groups c, b, c, c, a, b, and so on, and then I've got this variable column right down the side, and it comes from np.random.randn, so that's the idea of a standard normal distribution, a mean of zero and a standard deviation of one, with values on the negative and the positive side of zero. So let's separate them all out as NumPy arrays; I'm going to call mine group a, group b and group c, and I use df with the conditionals that we know very well by now — df.group == 'a', then 'b', then 'c' — taking the variable column and extracting, or converting, those to NumPy arrays. There we go. Let's look at a little bit of descriptive statistics for these using groupby, so we can see the mean and the standard deviation for each of the three. And let's have a look at a box-and-whisker plot, because looking at these, you should get some idea: where do you think the difference between these three groups is going to be? One thing I want you to be very aware of when we use analysis of variance like this is that I'm not making pairs of these.
I'm not comparing groups a and b, or a and c, or b and c individually. No, it's a comparison of all three of them together, and only if we find a very small p value here will we go on to do what we call post hoc analysis, where we do the pairwise comparisons to see where the difference actually is. When we have more than two groups like this, we've got to analyze them all in one go, and that's what we do with analysis of variance. Once again, it's very easy to do; I'm just going to show you the code. It's stats.f_oneway, a one-way analysis of variance: I pass my three arrays, one, two, three, and I get an F statistic and a p value. Very, very easy. Now, if we want to do this by hand, I'll show you how. Again there's this idea of ss mean, the mean model: df.variable, remember, gives me a pandas Series object with all of the values, and from each one I subtract the mean of all of them and square, this idea of the sum of squares. Then I do it individually for each of the groups: in group a, I subtract the mean of group a, square, and sum over all of those, and the same for b and c, and the sum of squares for my best model just adds all three of them together. Now, my mean model, remember, just has a single mean, so one parameter; my best model has three, and the length n is all of my sample space combined. There's no difference in how to calculate the F statistic, and no difference in how to calculate the p value for it. So this F statistic is very, very important, and I hope you take at least this away from the notebook.
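The by-hand version and scipy's f_oneway can be checked against each other like this. The three groups here are my own made-up standard-normal samples (with group c shifted so there is something to detect), not the notebook's data frame:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(0, 1, 30)
group_b = rng.normal(0, 1, 30)
group_c = rng.normal(1, 1, 30)   # hypothetical shifted group

# scipy's one-way ANOVA in a single call
f_scipy, p_scipy = stats.f_oneway(group_a, group_b, group_c)

# The same F by hand, from sums of squares
group_all = np.concatenate([group_a, group_b, group_c])
ss_mean = np.sum((group_all - group_all.mean()) ** 2)
ss_best = sum(np.sum((g - g.mean()) ** 2) for g in (group_a, group_b, group_c))

p_best, p_mean, n = 3, 1, len(group_all)
f_stat = ((ss_mean - ss_best) / (p_best - p_mean)) / (ss_best / (n - p_best))
p_value = 1 - stats.f.cdf(f_stat, p_best - p_mean, n - p_best)
```

The only things that changed from the two-group case are p best going from two to three and one extra per-group sum of squares, which is exactly why the F machinery generalizes to any number of groups.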
I put in a lot of extra stuff, much more advanced topics, and I want you to be aware of them; I wanted you to be enthused about them and to start learning more about them. What it was about in the beginning, though, is understanding that there is covariance; that from it we can measure the strength of that covariance with the correlation coefficient; that, once again, if we were to simulate that over and over again, we can calculate the proportion of results more extreme than the value that we found — in other words, we can simulate a p value; and that we can also measure our uncertainty by bootstrap resampling. That is the deeper understanding of what really goes on here, and it builds on exactly what we did in the previous notebooks.