Welcome back to our videos on multiple linear regression. In the previous video, we built a visualization that suggested state might be an important explanatory variable for our model. So in this video, we're going to run our first round of multiple linear regression by including that variable, state. This is where we use the statsmodels.formula.api command that we learned about in the linear regression lesson. I'm going to call my model results, and the command is smf.ols, all lowercase letters. If you recall, we've got two different OLS commands: one is all capitals, one is all lowercase. For the formula interface we're using here, we want the all-lowercase command. Then, in quotes, we give our formula: our y variable, nox_rggi, a tilde, and then our x variable. This is where we stopped with simple linear regression, but with multiple linear regression we can add on additional variables. So what we're doing here is adding state as an explanatory variable. I've wrapped it in C(), which tells Python that this is a categorical variable and to treat it as such, and the minus one removes the intercept from the model.

Outside the formula, we tell it that our data is df_merged, and we also set hasconst to True. This can seem a little odd, because we just removed the intercept. But when we drop the intercept while keeping a dummy variable for every state, those dummies together act like the intercept, so hasconst=True tells statsmodels that the model still effectively contains a constant. Otherwise it would treat the model as having no constant and report uncentered statistics, which we don't want; this way the R-squared and related statistics stay centered, while the minus one keeps a separate intercept term out of the coefficient table. Then I've tacked the fit command onto the end.

Then we can print results.summary(), and these are our OLS regression results. There's a lot being shown here. First, we've got our adjusted R-squared, which, as usually happens, is different from our regular R-squared. We still see our F-statistic and its p-value. The p-value is essentially zero, so we reject the null hypothesis in favor of the alternative; in other words, the p-value is low enough for us to say that at least one of the predictors we included is effective. There are some other measures of model performance, and then we have all of these coefficients. Because we included state as a categorical variable, the model has created an explanatory variable for each category within that variable. Some of these are not very significant: p-values of 0.525 for state 45, 0.765 for state 32, and state 29 as well. But others, like state 16, state 34, and state 37, are still really important, so we need to keep the whole categorical variable in the model to make sure we capture those important relationships. That's how we would interpret these results.

Now we can go back and visualize the new fitted model. To do that, we create a new column in df_merged called y_hat and set it to results.fittedvalues.copy(). Essentially, we're pulling out the fitted (predicted) values from the multiple linear regression, making a copy, and storing them as y_hat.
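To make this concrete, here is a rough sketch of what that code might look like. The column names are assumptions on my part: I'm using nox_rggi for the response, x_var as a stand-in for the explanatory variable carried over from the earlier videos, and state for the categorical column, all in a DataFrame called df_merged; substitute your actual names.

```python
import statsmodels.formula.api as smf

# Placeholder names: nox_rggi (response), x_var (explanatory variable from
# the earlier videos), state (categorical), all columns of df_merged.
# C(state) treats state as categorical; -1 removes the intercept;
# hasconst=True tells statsmodels the state dummies jointly act as a constant.
results = smf.ols(
    "nox_rggi ~ x_var + C(state) - 1",
    data=df_merged,
    hasconst=True,
).fit()

print(results.summary())

# Pull out the fitted (predicted) values and store a copy as y_hat
df_merged["y_hat"] = results.fittedvalues.copy()
```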
Then, in our ggplot, I'm going to come up here and grab the plotting code from earlier. We can keep our x and y the same, and we can keep the geom_point the same. What I'm going to change is in stat_smooth: I'm going to give it a new y variable. Normally the aesthetics we set at the top carry through the entire plot, but in this case we want stat_smooth to use y_hat instead, and we also want to give it a color of state. Now we can see that we've got different best-fit lines in different colors. They're all still very linear, but we're starting to see some differentiation, for example between state 37, Pennsylvania, and this gray line, state 45, which shows that the predicted values are different for the different states. It's still not great, though; there are definitely some other variables we're missing that would make this better.

Before we get to those, we need to do the residual analysis. The residuals are how we determine whether our model is statistically valid. Unlike simple linear regression, where you just needed to plot your actual data to check whether the conditions held, with multiple linear regression we examine the residuals, and for that the model needs to have been run already. So it reverses those steps: we run the model, and then we check for statistical validity. To do that, we create a new residuals column, which is just our actual y data, nox_rggi, minus our predicted data, which we called y_hat. Then we add a ggplot command where our data is df_merged, and we make a point plot where the x data is y_hat and the y data is residuals.

We can see that this appears to break at least one major condition: constant variance. The residual plot shows a very distinct wedge or fan shape, so the variance is increasing across the residuals, which means the model is not statistically valid; we've broken that condition. One could argue that a few outliers are what break the condition, but these are also quite large numbers: the residuals run from roughly negative three million up to the one-to-three-million range, which are very large errors. So in addition to breaking some of those conditions, this also doesn't appear to be an ideal model in terms of minimizing the residuals. In the next several videos, we'll get into how we can add additional variables to our analysis and possibly improve this prediction.
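For reference, the two plots described above might be sketched like this with plotnine, again using the placeholder column names (x_var, nox_rggi, y_hat) from the model sketch; the exact aesthetics in your earlier plot may differ.

```python
from plotnine import ggplot, aes, geom_point, stat_smooth

# Fitted model by state: keep the original x/y aesthetics, but point
# stat_smooth at the model's predictions (y_hat) and color by state.
fitted_plot = (
    ggplot(df_merged, aes(x="x_var", y="nox_rggi"))
    + geom_point()
    + stat_smooth(aes(y="y_hat", color="factor(state)"), method="lm")
)
print(fitted_plot)

# Residual analysis: residuals = actual minus predicted, plotted
# against the fitted values to check the constant-variance condition.
df_merged["residuals"] = df_merged["nox_rggi"] - df_merged["y_hat"]
residual_plot = (
    ggplot(df_merged, aes(x="y_hat", y="residuals"))
    + geom_point()
)
print(residual_plot)
```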