So in this video, we're going to continue on with our multiple linear regression analysis. At the end of the previous video, we got this residual plot and realized that maybe this isn't the best set of predictors that we could possibly include in our model. So in this video we're going to introduce a way to do correlation-based variable selection. So far we've sort of been throwing this state variable at the analysis because we thought maybe it was important, but we didn't include any other variables. When you have a large data set like this, you can try throwing all of the variables at it and seeing which ones are important, but there are so many that at times it can be useful to take a step back, see which ones are already highly correlated, and then focus on including those variables. So I've already down-selected a few of these variables: we've got our response variable here, and these are just some explanatory variables that I thought might be important. Note that they're all numeric variables. The first step is to actually create the correlation matrix, and I'm going to call it corr_matrix. We take our selected data set and we just say .corr(). This is the same correlation command that we learned in the linear regression lesson; it's just not being applied to any single pair of variables, it's being applied to the whole data frame. That's what happens when we leave it empty. And so we can see that we now have this correlation matrix, where we can see which variables are highly correlated with NOx RGGI and which aren't. What we're going to do is actually plot that, as a little more visual way to show what these variables are, though you could easily go through and make your selection just based off of these numbers. Before we can plot using ggplot, though, we need to melt the data set.
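A minimal sketch of that first step; the column names here are hypothetical stand-ins for the actual down-selected emissions columns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the down-selected numeric columns;
# the real analysis uses columns from the merged emissions data set.
rng = np.random.default_rng(0)
df_selected = pd.DataFrame({
    "NOx_RGGI": rng.normal(size=50),
    "NOx_base": rng.normal(size=50),
    "CO2_base": rng.normal(size=50),
    "SOx_base": rng.normal(size=50),
})

# .corr() with no arguments computes pairwise correlations
# between every numeric column in the frame.
corr_matrix = df_selected.corr()
print(corr_matrix)
```

The diagonal of the result is all 1s (every variable is perfectly correlated with itself), and the matrix is symmetric.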
So ggplot, if you recall, really likes those long-form data frames, and this is a bit too wide. So we're going to melt, and it's going to look a little different from our previous melt. For one, we're not going to tell it which variables to melt, because we want to melt all of them, but we are going to tell it not to ignore the index, because there's currently an index column here and we want it to be included. So ignore_index=False. And then after we melt, we're going to reset_index. So we're not telling it what to melt or what to call any of these variables, just saying melt them all, including the index. To give you an idea of what this looks like — I forgot my .melt() command — so corr_matrix.melt(), then the ignore_index=False, then .reset_index(), and run it that way. Now we can see that our index became one column, and then the variable column picked up all of these column names. So now we can connect the row variable (index) to the column variable to the correlation value, and we can plot that using ggplot. Our data is now corr_matrix_melt, and our geom — this is a new geom that we haven't used before — is geom_tile. Within our aes, we can say x is index, y is variable, and we're going to fill based off of value. These are just the column names up here. Then, just to make it a little bit easier to read, I'm going to add a theme element where I change the x-axis text angle to 90. We can run that, and we can see a correlation matrix. In this plot, dark purple is going to be a strong negative correlation, and yellow is going to be a perfect positive correlation. We can see here that there's a block of yellow and green that are all fairly correlated with each other, we can see some dark (negative) correlations here and here, and then we've got a lot of sort of in-between stuff around here. But what we are interested in is this RGGI row.
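The melt-and-plot step could be sketched like this, assuming plotnine as the Python ggplot implementation (the plotting lines are guarded so the reshape runs even if plotnine isn't installed; the data is a small stand-in):

```python
import numpy as np
import pandas as pd

# Small stand-in correlation matrix; the real one comes from df_selected.corr().
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(30, 3)),
                  columns=["NOx_RGGI", "NOx_base", "CO2_base"])
corr_matrix = df.corr()

# ignore_index=False keeps the row labels during the melt;
# reset_index() then turns those labels into an ordinary "index" column.
corr_matrix_melt = corr_matrix.melt(ignore_index=False).reset_index()
print(corr_matrix_melt.head())  # columns: index, variable, value

try:
    from plotnine import ggplot, aes, geom_tile, theme, element_text

    # One tile per (row variable, column variable) pair, colored by correlation.
    p = (
        ggplot(corr_matrix_melt, aes(x="index", y="variable", fill="value"))
        + geom_tile()
        + theme(axis_text_x=element_text(angle=90))
    )
except ImportError:
    pass  # plotnine not available; the reshape above still works
```

The melted frame has one row per cell of the original matrix: a 3x3 correlation matrix becomes 9 rows of (index, variable, value).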
And we want to go across here and see: okay, CO2 base is pretty highly correlated; less so here; okay, NOx base is good — it's one-to-one, perfectly correlated with NOx RGGI. We've got PM10 and PM2.5; SOx starts to get into some less good correlations here. And then some of the stack variables seem to be decently correlated: stack diameter, stack height, and so forth. So I'm going to go through, write those down, and figure out which ones we actually want to include in our next round of multiple linear regression. I'm going to create a new variable called results2, so that I don't override my original results, and once again use smf.ols. We start with our y data, tilde, x data. So NOx base, which was important, plus CO2 base, which was important, PM10 base, PM2.5 base, SOx base. And we still want to keep our state variable, and I'm also going to add in a second categorical variable, fuel type. And then finally minus one. Then we need to tell it our data is df_merged, we set hasconst equal to True, and tack the .fit() onto that. So then we can print results2.summary(). And so we can compare: if we go back up here, we can see our adjusted R-squared was 0.761, and now it's 0.786, so we have seen quite a large improvement there to bolster that adjusted R-squared. We're still seeing a significant p-value. And then we can come in here — now we've got a lot more output — and we can see that in terms of fuel type, the coal category is very significant for this prediction, which makes sense. Coal plants tend to have quite a bit of NOx emissions, so it makes sense that whether a power plant is coal-based, natural-gas-based, or nuclear does have an impact on its NOx emissions. And we can also see what wasn't significant down here: SOx was actually not very significant, so perhaps we could remove that in a third round.
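The second model fit described above could be sketched as follows; the data frame and its column names are hypothetical stand-ins for the real df_merged:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for df_merged: numeric baseline predictors
# plus two categorical variables (state and fuel_type).
rng = np.random.default_rng(2)
n = 120
df_merged = pd.DataFrame({
    "NOx_RGGI": rng.normal(size=n),
    "NOx_base": rng.normal(size=n),
    "CO2_base": rng.normal(size=n),
    "PM10_base": rng.normal(size=n),
    "PM25_base": rng.normal(size=n),
    "SOx_base": rng.normal(size=n),
    "state": rng.choice(["NY", "MA", "CT"], size=n),
    "fuel_type": rng.choice(["coal", "gas", "nuclear"], size=n),
})

# "- 1" drops the explicit intercept; the state dummies then span the
# constant, so hasconst=True tells statsmodels a constant is implicitly there.
results2 = smf.ols(
    "NOx_RGGI ~ NOx_base + CO2_base + PM10_base + PM25_base + SOx_base"
    " + state + fuel_type - 1",
    data=df_merged,
    hasconst=True,
).fit()
print(results2.summary())
```

With the intercept removed, the first categorical (state) gets a full set of dummy columns while the second (fuel_type) gets the usual reduced coding, which is why each state shows its own coefficient in the summary.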
We could also add more variables. I didn't add any of those stack-based variables — the stack height or the stack diameter — so perhaps adding those could lead to an improvement. There are lots of things that you could continue to do to further increase this adjusted R-squared value.