In this video, we're going to actually conduct the logistic regression. Much of what we did earlier was prepping the data that we needed. We got this nice plot, and now we're going to use the statsmodels.formula.api library to fit a logistic regression to our data. There's some information here that you can use to read up on what the logistic regression is doing. I'm going to create a variable called logregmod, for logistic regression model. The library is nicknamed smf — you can see up here that I've imported it under that alias — and the command is smf.logit, for logistic regression. This is the formula: success_num ~ np.log10(TotalBaseWaterVolume). This particular model requires the numeric success value, which is why we had to create both the yes/no version and the zero/one version. Our data is merged_df, and I'm going to tack .fit() right onto the end of that. Then we can print the results with print(logregmod.summary()), and we can run that. The summary looks much like what we got earlier when we did linear regression — we can see a lot of information, and it even gives us a pseudo R-squared value. Now, technically, logistic regression isn't linear, so the usual correlation coefficient and R-squared don't apply — we can't use those linear measures on nonlinear functions. However, what logit does is calculate a pseudo R-squared, which is a way to assess goodness of fit for nonlinear models. It tells us that only about 11% of the variation in our data is being explained by this logistic regression model. Not ideal, but still a good exercise. Now that we have the fitted model, we can use it to make predictions. I'm going to create a new variable called log10x, which just uses our np.linspace command: start at 4, end at 8, with as many data points as there are rows in merged_df.
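The fitting step described above can be sketched as follows. The data set itself isn't shown in the video, so this uses a small synthetic stand-in; the names merged_df, success_num, and TotalBaseWaterVolume are assumptions based on what the narration mentions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the video's merged_df (names assumed):
# a water-volume column and a 0/1 success column.
rng = np.random.default_rng(42)
water = 10 ** rng.uniform(4, 8, size=300)            # volumes from 10^4 to 10^8
p_true = 1 / (1 + np.exp(-(np.log10(water) - 6)))    # success likelier at high volume
merged_df = pd.DataFrame({
    "TotalBaseWaterVolume": water,
    "success_num": rng.binomial(1, p_true),
})

# smf.logit fits a logistic regression; the formula applies np.log10
# on the fly, so no separate log-transformed column is needed, and
# .fit() is chained right onto the end.
logregmod = smf.logit("success_num ~ np.log10(TotalBaseWaterVolume)",
                      data=merged_df).fit()
print(logregmod.summary())        # includes the pseudo R-squared

# Evenly spaced log10 values from 4 to 8, one per row of merged_df,
# for drawing the fitted curve later.
log10x = np.linspace(4, 8, len(merged_df))
```

The pseudo R-squared reported in the summary is McFadden's version, which plays the goodness-of-fit role that R-squared plays for linear models.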
Then I'm going to create a predictions variable, which is just logregmod.predict — a slightly different function. It takes an argument called exog, short for exogenous, which is just another way of saying the x (predictor) variable. And it is a little bit funny: the new values need to go in a dictionary, and the key needs to match the original data set, so I'm calling it TotalBaseWaterVolume, which is the name in our original data. That's also why, up here, I wrote np.log10(TotalBaseWaterVolume) in the formula instead of creating a new logTotalBaseWaterVolume column — just so those two names can be the same. Then we can print(predictions), and we can see the predictions for our evenly spaced log10x values. Now we can actually go in and visualize that data, starting with the actual data. So we say ggplot with merged_df and geom_point, where x is log10_water — a variable that was created earlier as the log transform of the water variable — and y is, again, our success_num. Hmm — this is not how the plot is supposed to look, and I believe that's because I forgot my aes() statement, so ggplot didn't know what to do with x and y; those aren't normal arguments outside of aes(). Let's run that again. There we have what we'd actually expect: success on the y axis, log10 water on the x axis. We can see there were some successes over here at the low end, but not as many, and a lot of successes over here at the high end — though there are still a lot of failures at that higher range too. So this is the original data, and now we want to add in our logistic regression curve, which will give us the probability between these values. In order to do that, we first need to put the new values into our data frame, because ggplot doesn't like to work with values that aren't in a data frame.
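The prediction step might look like the sketch below (same assumed names and synthetic data as before). One subtlety worth making explicit: because the formula applies np.log10 to its input, the dictionary should hold un-logged volumes, so the log10x grid is raised back to powers of ten before being handed to predict.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the hypothetical merged_df and fitted model (names assumed).
rng = np.random.default_rng(42)
water = 10 ** rng.uniform(4, 8, size=300)
p_true = 1 / (1 + np.exp(-(np.log10(water) - 6)))
merged_df = pd.DataFrame({"TotalBaseWaterVolume": water,
                          "success_num": rng.binomial(1, p_true)})
logregmod = smf.logit("success_num ~ np.log10(TotalBaseWaterVolume)",
                      data=merged_df).fit(disp=False)

# exog ("exogenous") holds the new predictor values, keyed by the original
# column name so the formula can find them.  The formula takes np.log10 of
# the input itself, so we pass un-logged volumes: 10**log10x.
log10x = np.linspace(4, 8, len(merged_df))
predictions = logregmod.predict(exog={"TotalBaseWaterVolume": 10 ** log10x})
print(predictions)
```

Each prediction is a fitted probability of success, so all values fall between 0 and 1 and rise as the water volume grows.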
So we can create a column log10x, which just includes our log10x values from up here, as well as a predictions column that just contains our predictions. Then I'm going to come up, grab the same plot, and add to it: geom_line with an aes() where x is log10x and y is predictions. I'm going to add a color, blue, and make the line a little bit bigger with size 2. Then I'm going to add a y label. This isn't something we've done too much, but technically what the y axis is showing is the probability of success. So although we're plotting success_num — and now predictions — on the y axis, what it's actually telling us is a probability, which makes this label a little more descriptive. Here we can see the idealized curve as it goes up. This is our actual data, but even though there are some data points over here, what the logistic regression is doing is saying that the data has to fit some sort of S curve. So it's more likely to be a failure on the low end, with those points being outliers, and more likely to be a success on the high end, with some of those being outliers. This is ultimately the result of a logistic regression: we can come in and say that at log10 of 6 — that's of water — sometimes there were failures and sometimes there were successes. But if we come up to the curve and start estimating over here, it's just over 0.25, so maybe 0.3 — there's about a 30% chance of success. That's on the low end, whereas once we get up past this inflection point, we start to see higher chances of success for the higher amounts of water.