In this video we're going to talk about transformations and demonstrate how you can use them to avoid breaking some of the linear-regression conditions we talked about earlier. So with that, let's jump in. In this lesson we're using the RGGI data from the ANOVA lesson. I've already got the data loaded and ready to go, so we can jump right into linear regression. As with most things, I want to start with a plot. I'm going to use df_merged, which is what I called that data frame: x is NOx_base, and we compare it to NOx_RGGI (I forgot my y equals at first). So this is what the NOx_base versus NOx_RGGI data looks like. It sort of follows a line, and then there are all of these points over here. Maybe we could classify those as outliers, but I think it's more likely that this is breaking our constant-variance rule, because this is a very obvious fan shape: even though there's nothing really in the middle here, the points trace the outline of a fan. So a plain linear model is probably not the best fit. We can go ahead and find out. We say smf.ols with the formula NOx_RGGI ~ NOx_base, data is df_merged, we specify hasconst=True, and I tack .fit() right onto the end. Then we can print results.summary() and see how the model is actually doing. The R-squared is not terrible: 0.748, which is surprisingly good given all of this out here. Nonetheless, this fan shape clearly breaks our constant-variance rule, so even though the regression looks good in terms of R-squared, technically it's not statistically valid. Perhaps a transformation can make this linear model statistically valid, so we're going to do a log transform in the hopes of remedying that increasing variance. We can say df_merged['log10_NOx_base'].
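The first fit can be sketched in code. This is a minimal sketch, not the lesson's actual notebook: the real merged RGGI data frame isn't reproduced here, so a synthetic df_merged with stand-in NOx_base and NOx_RGGI columns (given a deliberately fan-shaped spread) takes its place.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the merged RGGI data: the noise scale grows with
# NOx_base, mimicking the fan shape seen in the lesson's scatter plot.
rng = np.random.default_rng(0)
NOx_base = rng.uniform(0, 100, 300)
NOx_RGGI = 0.8 * NOx_base + rng.normal(0, 0.2 * NOx_base + 1)
df_merged = pd.DataFrame({"NOx_base": NOx_base, "NOx_RGGI": NOx_RGGI})

# Fit the linear model; the formula API adds the intercept automatically.
results = smf.ols("NOx_RGGI ~ NOx_base", data=df_merged).fit()
print(results.summary())
```

A high R-squared here does not by itself make the model statistically valid: the fan shape still violates the constant-variance assumption.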
That's just np.log10(df_merged['NOx_base']), and then I repeat it, changing base to RGGI. We do get a warning that we encountered a divide by zero, but it didn't stop the cell from running, so we can move on and see if anything else causes an error; it may just have created NaNs or infinities that won't necessarily be bothersome. If we go back to our geom_point plot, this time plotting log10_NOx_base against log10_NOx_RGGI, we can see that the variance is no longer increasing: there's no obvious fan shape. In fact, most of the points are now clustered around an even stronger center line. These points out here can maybe now be considered outliers; we can't be sure until we test it, but so far it looks like our transformation successfully stabilized the variance. Now we just need to make sure we're not breaking another one of those rules: no outliers. To check, we could use our IQR method and quickly calculate it, but I'm just going to use the visual method. The geom_boxplot that we use calculates outliers for us; we just say x = 0. This is the plot where we need to provide that odd x value; it just needs to be any constant, and x = 0 is common. Then y is log10_NOx_base. In the base data there are no outliers; if there were, we would see little stars appear, but we don't have any. So that's a good indication that there are at least no outliers in the base data. I'll come down here, paste the same code, and change the suffix to RGGI, and there don't appear to be any outliers here either. In that case, our newly transformed data is no longer breaking the constant-variance rule, and it's not breaking the no-outliers rule, so we're good to go with a regression. I'm going to call this one results2.
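The transform and the outlier check can be sketched like this. Since the boxplot check is visual, this sketch uses the IQR method the lesson mentions as the alternative; the data frame is again a synthetic stand-in, not the real RGGI data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged RGGI frame (real data isn't shown here).
rng = np.random.default_rng(1)
base = rng.lognormal(mean=2.0, sigma=0.5, size=200)
df_merged = pd.DataFrame({
    "NOx_base": base,
    "NOx_RGGI": base * rng.lognormal(mean=-0.2, sigma=0.1, size=200),
})

# Log-transform both columns; a zero here would become -inf and raise the
# divide-by-zero warning the lesson runs into later.
df_merged["log10_NOx_base"] = np.log10(df_merged["NOx_base"])
df_merged["log10_NOx_RGGI"] = np.log10(df_merged["NOx_RGGI"])

# IQR outlier rule: the numeric version of the stars a boxplot would draw.
def iqr_outliers(s):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(len(iqr_outliers(df_merged["log10_NOx_base"])))
print(len(iqr_outliers(df_merged["log10_NOx_RGGI"])))
```

Any value the function returns is a point a geom_boxplot would flag with a star.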
We're still using the linear regression command, OLS, but now we're regressing the log of the RGGI data on the log of the base data. Again, data is df_merged, hasconst is True, and we do the .fit() at the end. Then we print results2.summary() and run it -- I had the name in the wrong order; it's df_merged. And now we get this error, and all it's telling us is that SVD did not converge. So we're having a lack of convergence. The reason we keep this error in the demonstration is that it's a very strange error, but it's very common with log transforms. The output says SVD did not converge, but what's really causing it is the same thing we got a warning about up here: we ended up with a zero going into log10, which creates negative infinities in our data frame. That doesn't matter for plotting -- the plot will just ignore those values -- but it does matter for the actual regression. So I'm going to leave this in here so that when you see this code you can recognize the error; if you do run into it, go back up and check whether you got that warning saying you divided by zero, and then all you need to do is remove the negative infinities. To do that, I'm going to overwrite df_merged, keeping only the rows of df_merged that satisfy two conditions. The first is that df_merged['log10_NOx_RGGI'] is not equal to -np.inf -- np.inf is how we write infinity in Python, and specifically we want to get rid of the negative versions -- and the second gets rid of those same values in the base column. So this says: keep every row where log10_NOx_RGGI is not negative infinity and where log10_NOx_base is not negative infinity. We run that, and then I'll come up here and grab the same regression line.
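The filtering step might look like this sketch. The column names follow the lesson, but the small data frame is a made-up example with one zero-derived -inf in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where a zero became -inf after log10.
df_merged = pd.DataFrame({
    "log10_NOx_base": [0.5, 1.2, -np.inf, 0.9],
    "log10_NOx_RGGI": [0.4, -np.inf, 1.1, 0.8],
})

# Keep only rows where neither log column is -inf; leaving them in is what
# triggers the "SVD did not converge" error in OLS.
df_merged = df_merged[
    (df_merged["log10_NOx_RGGI"] != -np.inf)
    & (df_merged["log10_NOx_base"] != -np.inf)
]
print(len(df_merged))  # → 2
```

Note the parentheses around each condition: `&` binds tighter than `!=` in pandas boolean indexing, so they are required.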
To avoid overwriting our previous results, I'm just going to change the name to results3. And now we can see that it runs; in fact, we get a really high R-squared. Our original linear regression up here was 0.748, not terrible, but with this log transform we're up to close to 0.9, which is an excellent R-squared: our model is now explaining about 90% of the variance in the data. It's just important to remember that these coefficients apply to the log-transformed data, not the original. If we wanted to make a prediction, for example, we could do that, but to apply it back to the original data we'd need to undo the log transform and figure out what the real value is at that point.
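The refit and the back-transform could look like this sketch. The log-log data here is synthetic, the input value of 50 is an arbitrary example, and the coefficients are stand-ins for what the real results3 would give.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic log-log data standing in for the cleaned RGGI frame.
rng = np.random.default_rng(2)
log_base = rng.uniform(0, 2, 200)
log_rggi = 0.9 * log_base - 0.1 + rng.normal(0, 0.05, 200)
df_merged = pd.DataFrame({
    "log10_NOx_base": log_base,
    "log10_NOx_RGGI": log_rggi,
})

# Refit on the cleaned, log-transformed columns.
results3 = smf.ols("log10_NOx_RGGI ~ log10_NOx_base", data=df_merged).fit()

# Predict for a hypothetical NOx_base of 50: transform the input, predict on
# the log scale, then undo the log10 with 10** to get back to real units.
new = pd.DataFrame({"log10_NOx_base": [np.log10(50)]})
log_pred = results3.predict(new).iloc[0]
real_pred = 10 ** log_pred
print(real_pred)
```

The key point is the last two lines: the model's output lives on the log10 scale, so raising 10 to that power is what recovers a value in the original units.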