Hello and welcome to the videos on multiple linear regression. In this lesson, we're going to extend what we learned about linear regression to include multiple explanatory variables within our model. So with that, let's go ahead and dive in.
In this video, we're mainly going to talk about the data that we're going to use in multiple linear regression, as well as do a preliminary visualization. I've got several libraries here, and I want to point out that we are using the statsmodels.formula.api library. You learned about this in lesson eight. This is the way we did linear regression, where we developed a formula using the tilde. We're going to use that here in the multiple linear regression code.
The data that we're going to work with throughout this lesson is the Pennsylvania Regional Greenhouse Gas Initiative (RGGI) data that we first introduced during the ANOVA lesson. There are several Excel files that we will use, and I've already got the code set up to read them in. We're going to be using the 2030 data, and we're essentially going to be trying to predict the RGGI NOx pollution levels based off of the base case data.
So I'm going to go ahead and run this. As this is going, just a reminder of what this code is doing: it's first reading in the Excel files, specifying that we're working with sheet 2030. Then we're setting the indices to the columns that we don't want to add suffixes to. Then we have the suffixes to differentiate between the pollutant columns from each file, because in this code we're going to combine them all. We want these suffixes so that we know which column is for the base case, which is for RGGI, and which is for no APS. And then finally I reset the index.
If we look at, say, the first 10 rows, we can see that we've got this ID, lat, and lon, and we've got so many columns that the display is just shortening it for us.
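The read-and-combine steps just described can be sketched roughly as follows. This is not the lesson's actual code: the three small data frames are made-up stand-ins for the Excel sheets (in practice each would come from a call such as pd.read_excel with sheet_name="2030"), and the key columns, the pollutant column names, and the suffixes "_base", "_rggi", and "_noaps" are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up stand-ins for the three scenario sheets (values are illustrative only).
# In the lesson each frame would be read from its own Excel file instead.
base = pd.DataFrame({"ID": [1, 2, 3], "lat": [40.1, 40.5, 39.9],
                     "lon": [-77.0, -76.2, -75.3], "state": [37, 37, 24],
                     "NOx": [1.20, 0.80, 0.50]})
rggi = base.assign(NOx=[1.00, 0.70, 0.50])
noaps = base.assign(NOx=[1.30, 0.90, 0.60])

# Columns that identify a row and should NOT receive a suffix (assumed names).
keys = ["ID", "lat", "lon", "state"]


def with_suffix(df, suffix):
    """Move the key columns into the index, then tag every remaining column."""
    return df.set_index(keys).add_suffix(suffix)


# Combine side by side on the shared index, then turn the keys back into columns.
df_merged = pd.concat(
    [with_suffix(base, "_base"), with_suffix(rggi, "_rggi"), with_suffix(noaps, "_noaps")],
    axis=1,
).reset_index()

print(df_merged.head(10))
```

Setting the index before adding suffixes is what keeps the identifying columns unsuffixed, so they appear once in the combined table rather than once per scenario.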
And we can come down here, and we've got all of these different stack columns as well. So we're now going to be working with this data set throughout our multiple linear regression lesson.
But before we get into multiple linear regression, before we develop the models, the first step is always to visualize the data and try to figure out what exactly is going on. That gives us an idea about what we might want to include in our multiple linear regression model.
Before we can actually run the plotting code, I need to convert the state data into a category. So I'm going to say df_merged.state.astype('category'). This is because the states are actually labeled as numbers, and so when we read the Excel file in, Python automatically assumes that they're numeric. But for the sake of plotting, we're going to want them to be categorical.
Then we can use ggplot to do our main plot with our df_merged. I'm going to set the aes statement inside this main command, because our x and y data are going to be the same across all of our plots. So we have x is NOx base, y is NOx RGGI. Then we can add a point plot. We do still need one extra aes here: we need to set the color equal to the state. We can add a stat_smooth. We don't need an aes here; we can just say the method is 'lm'.
And then I'm going to show you a little bit of an extra thing with scale_color_brewer. This is a way that you can very quickly change your color scheme to match one of the preset ColorBrewer color schemes. We need to tell it that this is qualitative data that we're plotting, so categorical data, and then we need to tell it which palette. These all have set names that you can find online if you just Google ColorBrewer, but we're going to be using Set1.
So we can run this. And we can see that when we include just NOx base and NOx RGGI, a lot of the data is going up here, but there's this data down here, state 37. This happens to be Pennsylvania. And so our best fit line for just this simple linear regression isn't too great. It's trying to sort of split the middle between all of these different groups. It looks like the state itself may be an important explanatory variable. So when we do get into our multiple linear regression, this suggests that we should look into including state as well as NOx base as our explanatory variables.
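As a rough sketch of the conversion and the plot walked through above, assuming the ggplot functions come from the plotnine package: the data frame here is a tiny made-up example, and the column names NOx_base and NOx_rggi, the second state code 24, and the helper name make_nox_plot are hypothetical, not taken from the lesson's notebook.

```python
import pandas as pd

# Tiny made-up data set standing in for df_merged (values are illustrative only).
df_merged = pd.DataFrame({
    "state": [37, 37, 37, 24, 24, 24],
    "NOx_base": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "NOx_rggi": [0.2, 0.4, 0.7, 0.9, 2.1, 2.9],
})

# The state codes are read in as integers; store them as categories so the
# plot colors them as discrete groups rather than on a continuous scale.
df_merged["state"] = df_merged["state"].astype("category")


def make_nox_plot(df):
    """Build the scatter plot with one overall linear fit (hypothetical helper)."""
    # Imported here so the data preparation above runs even without plotnine.
    from plotnine import ggplot, aes, geom_point, stat_smooth, scale_color_brewer

    return (
        ggplot(df, aes(x="NOx_base", y="NOx_rggi"))          # shared x and y
        + geom_point(aes(color="state"))                     # color points by state
        + stat_smooth(method="lm")                           # single best-fit line
        + scale_color_brewer(type="qual", palette="Set1")    # qualitative palette
    )
```

Calling make_nox_plot(df_merged) as the last expression in a notebook cell should display the figure. Because color is set only inside geom_point, the smoother fits one line through all states, which is what makes a state that sits apart from the others stand out.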