In this video, we're going to talk about logistic regression, which is a form of regression that lets us come up with a predictive equation when the outcome has only two options. So think about a situation where you can either succeed or fail, and you still want to know the probability of success or the probability of failure. This is what logistic regression can do and what we're going to demonstrate today.
So we're here in Google Colab. I've got my libraries loaded, and we're going to be working with the Marcellus Wells and the PA Wells Frac datasets that we used in the correlation video earlier in this lesson. I've already got the data loaded and I've already gone through our grouping and merging. If you want more details on this process, go back and check out the correlation video, where it's explained in detail. But let's move on to the logistic regression.
So logistic regression only works with a binary variable: a 0 or a 1, yes or no, success or fail. So we need to come up with a way to form one of those variables from our dataset. In order to do that, we need some constraints. We're essentially going to compare the cost and profit of drilling a new well in the Marcellus shale based off of the total water needed to extract the most gas.
And so we have some constraints. The first is that water costs one cent per gallon. The next is that renting a truck to carry that water costs $200, and that the truck can hold 6,257 gallons. Then, drilling a well costs $850,000 per 1,000 feet, and there's a base cost of $2 million for renting land, paying employees, etc. Now, these numbers are not for anything specific. They're just estimates that we pulled together for this demonstration. And so essentially the cost of our well is the cost of water, plus the cost of trucks, plus the cost of drilling, plus that base cost. And then our profit is based off of a price from a couple of years ago, when gas was selling for $5.20 per 1,000 cubic feet.
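As a rough sketch, the cost and profit rules just described could be written as plain Python functions. The function and argument names here are hypothetical, and the gas volume is assumed to already be in thousands of cubic feet so that it lines up with the $5.20 price. The truck count is left fractional, as in the narration, rather than rounded up to whole trucks.

```python
WATER_COST_PER_GAL = 0.01        # dollars per gallon of water
TRUCK_RENTAL = 200               # dollars per truck
TRUCK_CAPACITY_GAL = 6257        # gallons one truck can carry
DRILL_COST_PER_1000_FT = 850_000 # dollars per 1,000 feet drilled
BASE_COST = 2_000_000            # land, employees, etc.
GAS_PRICE_PER_MCF = 5.20         # dollars per 1,000 cubic feet of gas


def well_cost(water_gal, depth_ft):
    """Cost of one well: water + trucks + drilling + base cost."""
    water = WATER_COST_PER_GAL * water_gal
    trucks = TRUCK_RENTAL * (water_gal / TRUCK_CAPACITY_GAL)  # fractional trucks
    drilling = DRILL_COST_PER_1000_FT * (depth_ft / 1000)
    return water + trucks + drilling + BASE_COST


def gas_profit(gas_mcf):
    """'Profit' as defined in the video: gas price times gas produced."""
    return GAS_PRICE_PER_MCF * gas_mcf


def is_success(water_gal, depth_ft, gas_mcf):
    """A well counts as a success when its cost is less than its profit."""
    return well_cost(water_gal, depth_ft) < gas_profit(gas_mcf)
```

For example, calling is_success(5_000_000, 7000, 2_000_000) compares the cost of a hypothetical 7,000-foot well that uses five million gallons of water against the value of two million thousand-cubic-feet of gas.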
So our profit is just that gas price times the total gas produced. And then we define success as whenever cost is less than profit. So that's what we want. Throughout this we'll be using a new function called np.where, which will allow us to assign a variable based off of a condition.
The first thing we need to do is set up our new variables, cost and profit. So I'm just going to create a variable, cost, and it's just going to follow this formula right here. So 0.01 times our total base water volume, which is already in gallons, so we don't need to do any unit conversions. And then we have 200 per truck, and we need to figure out how many trucks we need, which is the total base water volume divided by 6,257. And then we do a plus, and I'm going to put a backslash here. That backslash allows me to go down to a new line while staying in the same line of code. So if you're doing this on your own, and you're just on a single line, you don't need to put that backslash in. But for me that helps, so that you can keep seeing everything that I'm doing. And this part of the code is the cost of drilling. So that's 850,000 times merge_df's total depth divided by 1,000. And then finally, the two million base cost. So that's our cost. And then we can define our profit, which is just 5.2 times our total gas.
So now we've got our cost and profit, and we need to calculate our successes. So I'm going to create one variable called success. And this is where our np.where comes in. So the command is np.where, and inside this command there are going to be three parts. First you give the condition, comma, what to do if true, comma, what to do if false. And so our condition here is when merge_df's cost is less than merge_df's profit. If this is true, we want to write the word yes into that particular row. Otherwise, we want to write the word no. And then I'm going to create a second column, which I'll call success_num, which is just the same statement. I'm just going to copy it.
But instead of words, I'm going to change it to the number one and the number zero. And this will become important later on, when we need to have numbers to represent our success. And so if we look at the first five rows, we can see that now we've got our cost, our profit, our success, and our success_num columns.
Now, this where statement is essentially performing a mixture of an if statement and a for loop all in one. So alternatively, you could write the loop yourself. That process ends up being a bit longer, but I'm going to go ahead and give a demonstration just for the sake of providing a contrast. So we would have for i in the range of merge_df, so for every row in merge_df. Then we need to test that row to see if cost is less than profit. So we say if merge_df.iloc[i], so the i-th row, in the cost column, is less than merge_df.iloc[i], again the i-th row, but in the profit column. If that is true, we need to say merge_df.iloc[i], so the i-th row in the success column, is equal to yes. And then repeat that for success_num if our cost is less than profit.
Then we need to do the other case, so we can say elif, so else and if combined. I'm going to copy the same basis here because we want to make sure it follows the same process. Instead, we're saying if cost is greater than profit, then our success is no and our success_num is zero. And then two backspaces to end up outside of both the if statement and the loop.
We run this and we can see that it's taking a little bit longer to run. It has to go through every single row and test every single condition. It's also warning me that it doesn't like how I've sliced the data. We can come down here and see the same results as we had before. But instead of having two lines of code, we've got seven lines to do virtually the same thing. And so either way is appropriate, but in this case the where statement does work a little better.
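To make the comparison concrete, here is a minimal sketch of both approaches on a small made-up table. The DataFrame names, the column names, and the numbers are assumptions for illustration, not the actual merged dataset from the video. The loop version here assigns through .loc with a row label and a column name in a single step, which is one way to avoid the slicing warning, and it uses a plain else so that a tie between cost and profit is handled the same way np.where handles it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged well data; values are invented for illustration.
df = pd.DataFrame({
    "cost": [8.0e6, 9.5e6, 7.2e6, 1.1e7],
    "profit": [1.2e7, 6.0e6, 7.5e6, 4.0e6],
})

# Vectorized version: np.where(condition, value_if_true, value_if_false).
df["success"] = np.where(df["cost"] < df["profit"], "yes", "no")
df["success_num"] = np.where(df["cost"] < df["profit"], 1, 0)

# Loop version: test each row and assign one cell at a time.
looped = df[["cost", "profit"]].copy()
looped["success"] = ""
looped["success_num"] = 0
for i in looped.index:
    if looped.loc[i, "cost"] < looped.loc[i, "profit"]:
        looped.loc[i, "success"] = "yes"
        looped.loc[i, "success_num"] = 1
    else:
        looped.loc[i, "success"] = "no"
        looped.loc[i, "success_num"] = 0

# Both approaches should label every row identically.
same_labels = looped["success"].tolist() == df["success"].tolist()
same_numbers = looped["success_num"].tolist() == df["success_num"].tolist()
print(df.head())
print(same_labels, same_numbers)
```

On a table this small the speed difference is invisible, but the loop touches one cell per assignment, which is why it slows down on the full dataset.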
And so the last thing that we're going to do in this video is visualize the variables colored by success. So I'm going to use ggplot with merge_df, and I'm just going to do a point plot. Our x variable is total base water volume, our y variable is total gas, and then inside the aes statement I'm going to specify the color as being success. And then outside of the aes statement, but inside the point, I'm going to say alpha is 0.25, so that our plot points are partially transparent.
And so here we can see our success data, where we have some noes down here and some yeses up here. There are a few outliers, but by and large they do seem to follow this line, so logistic regression might be a good model to use here, which we will get into in the next video.