On this channel, I've made videos explaining some of the intuition behind logistic regression, digging into the math. We covered topics like the sigmoid function, linear decision boundaries, and so much more. But today I want to take a more visual approach: when we say we are training a logistic regression model, what exactly is going on? That's what we're going to look into today, so sit tight.

Logistic regression, as you may know, is typically used for classification in machine learning. Did this passenger survive on the Titanic, or not? Will this person default on their credit card, or not? In each of these cases, we have a binary outcome, and typically in machine learning, we model such outputs as a probability. So instead of having our model predict 1, the person defaults, or 0, the person doesn't default, we instead have it spit out something like: this person has a 60% chance of defaulting. Because of this, such classification models have a fixed output range, between 0% and 100%. We basically need a function that takes in any real number, from negative to positive, and squishes it into a probability. Logistic regression specifically uses a function called the sigmoid to do this squishification. Fair warning, the implications of using the sigmoid can get fairly technical, and I'd be happy to make a separate video on it. For this video, though, let's keep it simple and stick with the intuition that we use the sigmoid because we want to squish any value into a probability ranging from 0 to 1.

So the sigmoid function, what does it look like? Let's start by drawing two coordinate axes. The x-axis will represent independent data points, and the y-axis will represent the values of our sigmoid function. A simple sigmoid function looks something like this s-curve. No matter what value x takes, the sigmoid always has a value between 0 and 1. This fact remains true for any such sigmoid curve: the y-coordinate always ranges from 0 to 1. So this verifies that we can use a sigmoid function to model probability. For example, this point could represent the conditional probability that y belongs to the positive class given a value of x.

I'd be remiss if I didn't give a shout out to Grant Sanderson, aka 3Blue1Brown. The animations in this video were created using his math animation engine, Manim, and I'll leave a link to it down below, so check it out.

We now have a function that spits out a probability. Great. But in the end, we want a binary output: yes, the person will default, or no, the person won't. How do we get this from probability values? We assign it using a threshold. Let's take the standard 50% here. That is, if the sigmoid function for a given value of x is greater than 0.5, we classify the point as 1, and if it is less than 0.5, we classify the point as 0. But note, how you set the threshold depends on the application. In cases where we want fewer false positives, a higher threshold is good. In cases where we want higher recall, a slightly lower threshold is good. It's up to the developer to choose this threshold.

Now, a question: how do decision boundaries fit in? For those not familiar, a decision boundary is, simply enough, the boundary the model uses to make decisions: every point that falls on one side of the boundary is classified as 1, and every point on the other side is classified as 0.
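Before we dig into decision boundaries, here's the sigmoid-plus-threshold idea as a minimal Python sketch. This isn't code from the video (the visuals there are Manim animations); the function names are my own.

```python
import math

def sigmoid(z):
    # Squish any real number z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    # Label the point 1 if its probability clears the threshold, else 0.
    return 1 if sigmoid(z) >= threshold else 0

# Very negative inputs map near 0, very positive inputs map near 1.
print(sigmoid(-5.0))  # ~0.0067
print(sigmoid(0.0))   # 0.5
print(sigmoid(5.0))   # ~0.9933
print(classify(2.0))  # 1
```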
I'm speaking about the binary classification case, but this definition extends to multiple classes and multiple dimensions. What you see here is a two-dimensional plane with axis variables x1 and x2. In a two-dimensional plane, the decision boundary takes the form of a line, or a polyline. If we look at the one-dimensional case with a single axis x, we have a point decision boundary: every point on one side is classified as 0, and every point on the other side is classified as 1. For multi-class classification, we could use multiple points as the decision boundary.

Now that we've visualized decision boundaries, how exactly are they created? Stated another way, given a set of data points, how does logistic regression know where to put the decision boundary that works best? Let's start with the one-dimensional case. We mentioned that a function called the sigmoid is used for determining probability. In this one-dimensional case, the function looks like this. We've been here and done that: it squishes the linear input into a range between 0 and 1. In the context of logistic regression training, we can rewrite this equation, replacing the x with b plus wx. b is an intercept, the bias term. w is the weight of the independent variable x on some response variable y. If we set a threshold of 50%, then this final value is 0.5. Let's simplify this to get an equation. Take logs on both sides, and we get this final form: b plus wx equals 0, which corresponds to the point x equals negative b over w. But what does this represent? This is the point decision boundary in the one-dimensional case for binary classification. It is this point that splits your one-dimensional data points into two parts. If you don't quite see it, pause and think about why this is the case. When x is at the value negative b over w, there is a 50% chance that y equals 1 for a binary classification problem. If x were greater than this value, the probability of y being 1 would be greater than 50%. And if x were less than negative b over w, the probability of y being 1 would be less than 50%.

Great, so now we know the equation of the point decision boundary: x equals negative b over w. But we don't have w or b yet. We find them through the training phase of logistic regression. What does it mean to train a logistic regression classifier? It means finding the w and b that maximize the probability of seeing the training data. I made a gentle explainer video on logistic regression and maximum likelihood estimation before, so I won't get into the math details right here; my objective is to visualize logistic regression and its training. During training, we use a step-by-step technique of changing w and b until they reach their optimal values. This is done with optimization techniques such as gradient descent, a technique that changes the values of w and b ever so slightly to maximize the probability of seeing the training data. I'll get into the mathematical details in my next video, but know that the update equations for the weights and the bias look something like this. We start by initializing w and b to some random values, or just 1 in this case. Over some m iterations, we change the parameters ever so slightly so that w and b converge to their optimal values w sub m and b sub m. Here alpha is the learning rate, aka how fast you want to learn. n is the number of training samples, and xi and yi make up the ith training sample: xi is the features and yi is the binary label. And sigma is the sigmoid function.
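As a rough sketch of what one such update step might look like in Python, assuming the standard maximum-likelihood gradients with a 1/n averaging factor (the on-screen equations may differ slightly, and the names here are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_step(w, b, xs, ys, alpha):
    # One gradient step nudging w and b toward the values that make
    # the training data most likely. xs are features, ys are 0/1 labels.
    n = len(xs)
    grad_w = sum((sigmoid(b + w * x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum((sigmoid(b + w * x) - y) for x, y in zip(xs, ys)) / n
    return w - alpha * grad_w, b - alpha * grad_b

# After any step, the current point decision boundary sits at x = -b / w.
```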
Don't worry if you don't understand this completely, but let's visualize what this math is doing; maybe you'll get a clearer picture. Let's consider the one-dimensional logistic regression case for binary classification. We have some training sample points, labeled as zero or one by color here. Let's take the initial values of w and b to be 1 each, like we stated before. I'll set the learning rate to a typical value, like 0.05 or 0.01, whatever suits your data. From this, we can determine the initial decision boundary: x equals negative b over w, which in this case is negative 1. Now we start the training phase. We apply the update rules for w and b, and because of this update, the decision boundary changes. Continue to the second iteration, and now the third. If we keep applying this for some time, we eventually see w and b converge, so the decision boundary eventually stops changing by much. I'll play this entire training process again, but this time let's also visualize how the sigmoid function changes at every iteration. For every point x, the height of the curve above that point represents the probability of that point belonging to the positive class, that is, y equals 1. Pretty slick, right? During the testing phase, when we are passed some data point and our model is asked to label it, we first determine the probability that this point belongs to the positive class, and then assign its class depending on that probability.

So what we did until now was visualize two major concepts, the sigmoid function and the decision boundary. But both of these are just for the one-dimensional logistic regression case. Let's visualize how they change for two dimensions, when we have two features. We have the original sigmoid function here. For the one-dimensional case, the sigmoid function squishes the linear input of one variable x into the range 0 to 1, so we have this curve. But now we have two features, x1 and x2. The sigmoid should now squish a linear function of features x1 and x2 into the range 0 to 1, so we have a surface with this equation. b is again the bias term, w1 is the weight of the independent variable x1 on some response variable y, and w2 is the weight of x2 on the same binary response variable y. If we set a threshold of 50%, then this entire value becomes 0.5. Let's simplify this to get an equation, the same way we did for the one-dimensional case. Take logarithms on both sides, and we get the final form: b plus w1 x1 plus w2 x2 equals 0. This corresponds to the equation of a line with x1 and x2 as the two axes. So this is the line decision boundary in the two-dimensional case for binary classification. It is this line that splits your two-dimensional data points into two parts. Once again, if you don't quite see it, pause and think about why this is the case. For a data point (x1, x2), when the value of this expression is 0, there is a 50% chance that y equals 1 for the binary classification problem. If the expression is greater than 0, the probability of y being 1 is greater than 50%; if it is less than 0, the probability of y being 1 is less than 50%.
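Here's a small sketch of that two-dimensional surface and its 50% boundary line; the weight values are made up purely for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob_positive(x1, x2, w1, w2, b):
    # Height of the sigmoid surface above (x1, x2): the probability y = 1.
    return sigmoid(b + w1 * x1 + w2 * x2)

# Any point satisfying b + w1*x1 + w2*x2 = 0 lies on the 50% boundary line.
w1, w2, b = 2.0, -1.0, 0.5
x1 = 1.0
x2 = (b + w1 * x1) / -w2  # solve the boundary equation for x2
print(prob_positive(x1, x2, w1, w2, b))  # 0.5, exactly on the boundary
```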
To find this line, we need to know the coefficients b, w1, and w2, and this is done during the training phase of logistic regression using the same gradient descent algorithm. Here was the algorithm for the one-dimensional case, and here is the update for the two-dimensional case. The only difference is that w and x are now bold. This is standard math notation to show that they are vectors; in the two-dimensional case, they are vectors of two dimensions.

For a clearer picture, like we did for the one-dimensional case, let's visualize how this two-dimensional line decision boundary is created. Back to our black screen. Consider a 2D plane with axes x1 and x2. We have some training sample points scattered in some way in this plane. Remember, each point has a label indicated by color. Let's take the initial values of the weights and bias to be 1 each, and I'll set the learning rate to some fixed value. From this, we can determine the initial decision boundary: b plus x1 plus x2 equals 0. To start the training phase, we apply the update rules for the weights and bias to get that slight shift in values. Now the second iteration of gradient descent. Now the third. If we apply this for some time, we eventually see the weights and bias converge. And look what we have here: a decision boundary in the form of a line. And guess what? This line has an equation of the form b plus w1 x1 plus w2 x2 equals 0. Now, where does the sigmoid fit in? I'll play this entire training process again, and this time let's also visualize how the 2D sigmoid function changes at every iteration. For every point in the (x1, x2) plane, the height of the sigmoid surface represents the probability of being in the positive class, that is, y equals 1. So at test time, when you are given a point (x1, x2), you first determine the probability value, then assign the predicted class, positive or negative, based on that probability.

Great, we visualized the sigmoid curve and the creation of a decision boundary with logistic regression for the 1- and 2-dimensional cases. But this raises the question: how do we go beyond this? How can we think about more than two dimensions? I can't create visuals for this myself, because I can't effectively demonstrate 4-dimensional plotting on screen, but we can easily expand our intuition. For the 1-dimensional case, we have this. For the 2-dimensional case, we have this expanded equation with w1 x1 and w2 x2. But we can also rewrite it in vector notation, treating w1 and w2 as a w vector, and x1 and x2 as an x vector. Now, for some arbitrary m dimensions, we can write the equation in the same form, but x and w, instead of being vectors of 2 dimensions, are now vectors of m dimensions. That's the only difference. We can use the same argument for gradient descent: for 1 dimension, we have the update equations that look like this. In the 2-dimensional case, w and x become bold, as they are now vectors of 2 dimensions. And for higher-dimensional gradient descent, all the equations stay the same; all we do is replace the 2 with the number of dimensions m, as in the sketch below. And that's it. I hope this video helped you visualize logistic regression training in 1 and 2 dimensions and gave you an intuition for how to think about higher dimensions.
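To make this concrete, here's a minimal vectorized sketch of the whole training loop, written with NumPy so the same code covers 1, 2, or m dimensions. The all-ones initialization mirrors the video; the learning rate, iteration count, and 1/n averaging are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.05, iters=1000):
    # X: (n, m) matrix of n samples with m features; y: (n,) 0/1 labels.
    n, m = X.shape
    w = np.ones(m)  # initialize the weights to 1, as in the video
    b = 1.0         # initialize the bias to 1
    for _ in range(iters):
        p = sigmoid(X @ w + b)            # predicted P(y = 1) per sample
        w -= alpha * (X.T @ (p - y)) / n  # weight update
        b -= alpha * np.sum(p - y) / n    # bias update
    return w, b

# Toy 2D run: the learned boundary is b + w[0]*x1 + w[1]*x2 = 0.
X = np.array([[0.0, 0.0], [0.5, 1.0], [2.0, 2.0], [3.0, 2.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train(X, y)
print(w, b)
```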
Note, there is a lot more to logistic regression, and I'm going to be deriving all of these equations from scratch in my next video, so that's going to be fun. Some learning resources are down below. Stay subscribed, and I will see you in the next one. Buh-bye!