This is going to be a mathematical video of me deriving the gradient descent update rules from scratch. And even if you do know the derivation, I still suggest watching the video as we take a vectorized approach, which is not really common in other tutorials. So logistic regression. It's a type of generalized linear model used to model classification problems in machine learning. Before getting into the specifics, let's begin by introducing some key variables. Logistic regression is a supervised learning algorithm, so we require some features x and corresponding labels y. Let's say that we have multiple features to represent a sample, in which case x is no longer a scalar quantity; it's represented as a vector of some d dimensions. Although we can use logistic regression for multi-class classification, let's build the intuition for the binary classification case. That means y can only take on two values, 0 or 1. Like this, we can take n such samples and construct a data set represented by some capital D. Logistic regression is a parametric model, that is, the model itself is characterized by a fixed set of parameters. Let's call these parameters collectively theta. In this context, theta is the weights of the features. As such, it is a vector that has the same dimensions as the feature vector xi. Great, so we've laid out some of the groundwork variables. Let's now talk about training. The main goal of training is to find the model parameters, that is, the weights theta. And we want to find the weights that maximize the probability of seeing this training data. In machine learning lingo, this is maximum likelihood estimation. Let this definition sink in, because it is a very important concept in machine learning theory. Let's translate that into math notation. P of D semicolon theta is the probability of seeing our training data, considering some parameters theta. 
Mathematically, the goal of training is to find the value of theta that maximizes this probability, that is, theta hat MLE, the maximum likelihood estimate of theta. This is what argmax does here in math notation: it returns the theta that maximizes the probability of seeing the training data set. In place of the data set D, we can write y given x, as in training we want to maximize the probability of the specific labels y, given the specific features x. Now, this P of y given x semicolon theta here is some unknown probability distribution over the values y can take, given some x and parameterized by theta. This is the same as maximizing the product of the probabilities of seeing every training label yi given its corresponding xi. This big pi symbol is a product; it's similar to the sigma for sums. I hope you're following along, because so far this is the general formulation for maximum likelihood estimation in the training phase of supervised learning. Remember, the goal here is training, and for training we need some loss function to minimize. So let's derive this loss function. From here, the derivation gets specific to logistic regression. I just mentioned that P is an unknown probability distribution of the labels y. For logistic regression, we make the assumption that this distribution can be approximated with another function of a linear combination of features and weights. I'll rewrite it in vector form. Now, whenever I say vector form, basically what I'm doing is making the symbols bold so that you understand it's a vector as opposed to a scalar. So xi and theta over here are two vectors, whereas yi is a scalar. I just hope that notation is clear. This linear combination of weights and features, that is the theta transpose xi here, can be any real number, but the output of the model has to be a probability value that lies between 0 and 1. And so we need a function that maps a real number to a probability value between 0 and 1. 
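To make the big pi symbol concrete, here's a minimal Python sketch of the likelihood as a product over training samples. The function name and the per-sample probabilities are hypothetical placeholders of my own, just to show the mechanics:

```python
def likelihood(probs_of_observed_labels):
    """Likelihood of the data set: the product over all samples of
    P(y_i | x_i; theta) -- the big-pi term in the MLE formulation."""
    result = 1.0
    for p in probs_of_observed_labels:
        result *= p
    return result

# Hypothetical per-sample probabilities of seeing each observed label:
print(likelihood([0.9, 0.8, 0.95]))  # 0.9 * 0.8 * 0.95
```

Maximizing this product over theta is exactly the argmax objective described above.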
Generalized linear models use something called link functions to do this. Logistic regression specifically uses the sigmoid function as its link function for this squishification. For the one-dimensional case, the sigmoid function looks like this, where x is some real number and this small sigma of x represents a probability value, and hence it lies between 0 and 1. In logistic regression, we make the assumption that the distribution of the training labels is approximated using the sigmoid function. Let's take our original likelihood equation and split our training samples by label. The first product is over the set of training samples with the positive label 1, and the second is over the set of training samples with the negative label, which is 0. Since there are only two classes, the probability of seeing a negative label is simply 1 minus the probability of seeing the positive label. Now we can substitute our sigmoid approximation for these probabilities. We want the first set of sigmoids to have values as close to 1 as possible, so it makes sense to maximize their product directly. And we want the second set of sigmoids to have values as close to 0 as possible, so it makes sense to maximize the product of their complements, their complements being 1 minus their probability values. For the binary classification case, we can recombine the two sets of products in this way. For every sample, one of these two terms will have the value 1, and the other will be the probability. Now we want to maximize this value, but a product of many probabilities can become numerically unstable and computationally inefficient to deal with. It makes more sense to convert it to a sum of values, which is a more tractable form. To make this transformation, we start by taking the logarithm of the product. This will still have the same optimal value of theta, because the logarithm is a monotonically increasing function, so the maximizing theta is unchanged. A property of logarithms is that a log of products is the same as a sum of logs. 
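The link function itself is tiny. Here's a sketch of the sigmoid in NumPy (the function name is my own choice):

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The link function squashes the real-valued linear combination
# theta^T x into a valid probability.
print(sigmoid(0.0))   # 0.5, right on the decision boundary
print(sigmoid(10.0))  # close to 1
```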
We use this in our log likelihood formulation to get a sum of logarithms of probabilities, and we apply it to the internal term too. Another property of logarithms is that the log of a number raised to an exponent is the product of the exponent and the log of the base number. We can use this rule to bring the exponents yi and 1 minus yi down in front of the logs. And that's it. This is the equation we want to maximize, but a loss function is something that we want to minimize. Maximizing the log likelihood is equivalent to minimizing the negative log likelihood, so we take this entire equation and put a negative sign in front of it. And so, yep, this is now our loss function that we want to minimize, and we derived it from maximum likelihood estimation. I hope this entire derivation until this point is clear. We derived the loss function. Now, loss functions represent the equation to minimize in training, but how do we actually learn the values of theta that minimize this equation? This is done through optimizers. In a past video, I derived a second-order optimization method called the Newton-Raphson method for logistic regression. This time I'm going to derive a first-order technique called gradient descent. Very popular. Let's go through some differences between first-order and second-order techniques for optimization. First-order techniques only require computing the first-order gradients to learn the weights. Second-order techniques require us to also compute the second-order derivatives of the loss with respect to the weights. Programmatically, the computation cost of the first-order gradient vector, that is the Jacobian, is less than that of the second-order derivative matrix, that is the Hessian. However, by computing a Hessian, we are able to make more accurate jumps in the correct direction. So, clearly, it's a trade-off between speed and accuracy. For now, though, let's direct our attention just to the first-order optimization technique, which is gradient descent. 
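The negative log likelihood we just derived can be written down directly. A minimal vectorized sketch, assuming the sigmoid helper and an illustrative toy data set of my own invention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    """Negative log likelihood:
    -sum_i [ y_i * log(sigma(theta^T x_i))
             + (1 - y_i) * log(1 - sigma(theta^T x_i)) ]"""
    p = sigmoid(X @ theta)  # predicted P(y = 1 | x) for every sample at once
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny hypothetical data set: 3 samples, 2 features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)
print(nll(theta, X, y))  # 3 * log(2), since sigma(0) = 0.5 for every sample
```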
Gradient descent is an optimization technique that allows us to learn the weights theta in an iterative manner. Starting at a random point, the objective is to change the values of theta ever so slightly so that eventually theta takes on a value that minimizes the loss. Here's the algorithm for gradient descent. I'm using vectors instead of scalars for representation here as well. Why is this the case? Well, from a math perspective, vectors generalize much more easily than scalars. And from the implementation perspective, computers are faster at vector operations than at the equivalent loops of scalar operations. So, in this equation, we see delta NLL divided by delta theta transpose. NLL is the negative log likelihood, and theta transpose is the transpose of the weight vector, which is now a row vector. This gradient represents how much the loss changes for a unit change in the weights. It is then multiplied by some factor alpha. Alpha, the learning rate, indicates how fast we should learn. In every iteration, we update the weight parameters, taking a step whose size depends on the learning rate alpha. And after some m iterations, we hope to find that theta has converged to its optimal value, that is, the value that minimizes the loss. Before we determine the derivative of the loss with respect to theta, let's go over some basic rules of differentiation for matrices: matrix calculus. Let's start with the derivative of a scalar a with respect to a column vector x. It becomes a row vector of derivatives of the scalar a with respect to every element in the vector x. The shape of this vector is very important to note, and that is why I use delta NLL by delta theta transpose in the update equation instead of delta NLL by delta theta with no transpose. You can compute the shapes of the left-hand side and the right-hand side yourself to verify that both are d-dimensional column vectors. 
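The iterative update rule, theta gets theta minus alpha times the gradient, can be sketched generically before we derive the actual gradient of the NLL. To keep it self-contained, I demo it on a simple quadratic of my own choosing rather than the logistic loss:

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, m=100):
    """theta <- theta - alpha * dLoss/dtheta, repeated for m iterations."""
    theta = theta0.astype(float)
    for _ in range(m):
        theta = theta - alpha * grad(theta)
    return theta

# Illustrative loss f(theta) = ||theta - 3||^2, whose gradient is 2*(theta - 3).
theta_hat = gradient_descent(lambda t: 2 * (t - 3.0), np.zeros(1), alpha=0.1, m=200)
print(theta_hat)  # converges to [3.], the minimizer of f
```

The same loop works unchanged once `grad` is replaced by the NLL gradient we derive below for logistic regression.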
Here's a second rule that I think is very important for our discussion. The derivative of a column vector y with respect to a column vector x is a Jacobian matrix. Note the orientation of the derivatives here: the denominator vector is expanded along the columns and the numerator vector is expanded along the rows. These are the two basic rules of matrix calculus that we need, but aside from these we also require some basic rules of differentiation. The logarithm rule states that the derivative of the logarithm of a variable is the variable's reciprocal. The exponential rule says that the derivative of e to the power x with respect to x remains the same, that is, e to the power x. We have the chain rule, which is used for differentiating composite functions. And then we have the power rule, the generalization of polynomial differentiation. So we have two rules of matrix calculus and four rules of basic differentiation. Apart from these rules, I'll also derive a useful result that we use in gradient descent. Consider the derivative of the sigmoid function with respect to x. We can write it out in its expanded form, that's 1 over 1 plus e to the negative x. We then convert the reciprocal form to a polynomial form with a negative exponent. This is in the form f of x to the power a, so we can make use of the power rule. Expanding the derivative, the derivative of 1 is 0 because it's a constant, and then we use the exponential rule to get negative e to the power negative x. Let's rewrite it without any negative exponents. We split the denominator into two parts, add and subtract one in the numerator of the first term, and split this up into two terms. Notice how some of these terms can be written in terms of the sigmoid function. We end up with a nice little compact form for the sigmoid derivative: sigma of x times 1 minus sigma of x. Let's call it that, the sigmoid derivative. 
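The compact form sigma of x times 1 minus sigma of x is easy to sanity-check numerically against a finite-difference approximation. A small sketch (function names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """The compact form derived above: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a central finite difference at a few points.
h = 1e-6
for x in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, sigmoid_derivative(x), numeric)
```

At x = 0 the derivative is 0.25, the steepest point of the sigmoid, which matches the S-shaped curve flattening out toward both tails.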
We can now use all the derivative rules we've discussed, that is, the rules of matrix calculus, the basic rules of differentiation, and the derived sigmoid derivative, to compute the gradient for the gradient descent update. Let's start by applying the good old chain rule to both terms. Now use the logarithm rule to get rid of the log terms. The sigmoid terms are not quite in the form of the sigmoid derivative, so let's apply another chain rule to change that. Now we use the result from the sigmoid derivation that we just did. We cancel the common terms in the numerator and the denominator. Now let's get rid of the negative constants, take the xi vector common from both terms, and expand all the inner brackets. The yi times sigmoid of theta transpose xi terms cancel, giving the final form. And now we can use this result in our gradient update rule. Putting it all together, let's see what's going on in the training phase. We iterate over some m iterations, and in each iteration we substitute the values of xi and yi to compute the step, and then update the values of theta. Once m iterations are complete, the final theta holds our learned weights. Now, during the testing phase, when we have a sample point, some x, the goal is to compute the label y. In the binary classification case, we can determine the probability of the positive label. And since this is logistic regression, we assumed that this probability could be computed using the sigmoid function. We already determined the optimal values of the weights theta, so plugging in the values of x gives the probability that y belongs to the positive class. If it's above a certain threshold, like 50%, the point is classified as positive; otherwise, it's classified as a negative sample. This wraps up the math intuition for the training and testing phases. But as a final parting gift, let me throw in a bit of a bonus here. 
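Putting the training and testing phases together in code: the gradient over the whole data set, the sum over i of (sigma of theta transpose xi minus yi) times xi, vectorizes to X transpose times (sigma(X theta) minus y). This is a sketch under my own naming and with a made-up toy data set, not the video's exact implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, m=1000):
    """Gradient descent with the derived gradient:
    dNLL/dtheta = sum_i (sigma(theta^T x_i) - y_i) * x_i = X^T (sigma(X theta) - y)."""
    theta = np.zeros(X.shape[1])
    for _ in range(m):
        theta -= alpha * (X.T @ (sigmoid(X @ theta) - y))
    return theta

def predict(theta, X, threshold=0.5):
    """Testing phase: positive class if P(y = 1 | x) exceeds the threshold."""
    return (sigmoid(X @ theta) >= threshold).astype(int)

# Hypothetical linearly separable toy data: one feature, four samples.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = train(X, y)
print(predict(theta, X))  # [0 0 1 1]
```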
There is an alternative, faster derivation of the likelihood equation. Remember, we have this likelihood equation here. And we know that yi is a Bernoulli random variable, as it can only take on two values, 0 or 1. A Bernoulli variable has one parameter, the probability of success, and in logistic regression this is exactly the sigmoid function, because we make that assumption. A Bernoulli variable has a probability mass function that can be expressed in two ways. The probability mass function, by the way, gives the actual probability value of a given input. From here, we can directly substitute our sigmoid approximation into the probability mass function, and the derivation of our loss function can continue from here by taking logs and doing all of that. I thought it was pretty neat how we can derive our likelihood equations using some probability theory. Now, here's a summary of some takeaways from this video. Logistic regression is a type of generalized linear model where we approximate the probability distribution of the labels with a sigmoid function. We then use maximum likelihood estimation to derive a loss function, and we use the gradient descent optimizer to define how we learn the value of theta, that is, the optimal parameters of logistic regression. The training phase basically involves finding the optimal theta using gradient descent, and the testing phase involves using the theta that we found to make predictions on some input x that comes in. And that's it. I hope this video gave you a really clear picture of how gradient descent works internally with logistic regression. If you want to visualize these equations in action, I suggest watching my video on logistic regression visualized, where we visualize the sigmoid function and decision boundaries for one-dimensional and two-dimensional training. Some references are down in the description below. Subscribe for more content, and thanks for watching. 
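The exponent trick in the Bernoulli probability mass function, p to the y times 1 minus p to the 1 minus y, is worth seeing in miniature, since it's exactly what makes the recombined likelihood work. A tiny sketch (function name is my own):

```python
def bernoulli_pmf(y, p):
    """P(Y = y) for a Bernoulli variable with success probability p,
    written in the single-expression form p^y * (1 - p)^(1 - y)."""
    return p ** y * (1 - p) ** (1 - y)

p = 0.7
print(bernoulli_pmf(1, p))  # 0.7, the success probability itself
print(bernoulli_pmf(0, p))  # 1 - p, the failure probability
```

When y is 1 the second factor becomes 1 and only p survives; when y is 0 the first factor becomes 1 and only 1 minus p survives, which is the "one term is 1, the other is the probability" observation from the main derivation.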
I'll see you in the next one.