Hello everyone! A.J. here. Remember in my last video how I said that I would be recording in different locations? No? Well, that's because you're not subscribed, so hit that subscribe button, you dummy! And hit that notification bell too while you're at it, please. In any case, last video I said that I would be changing locations across my apartment to see what would work best for me, but I don't want to do that because this is so convenient! This is very convenient for me. So yeah. By the way, who cares about location when we have quality content? So let's get right to the video on how a neural network learns. If you've gotten started with deep learning research in any way, you'll have heard of a famous algorithm called backpropagation, a very important phase in the machine learning process. Two things I've noticed while learning this concept are: one, lots of math equations, and two, every single person says the same thing, just in different ways. This can make learning difficult because you can't connect the dots. So I'm going to explain neural networks such that you don't have to memorize any book formula, but I'll also use standard notation to ensure that you can map my explanation to whatever is given in any source you may read on neural networks. To start, let's not just jump into forward and backpropagation. There are hundreds of books out there that can tell you how to do exactly that, and I will get to it in a bit. But for now, I think it's more important to understand how neural networks actually learn and how to define an objective. Just a disclaimer: this video has a lot of technical detail, so I may be throwing some words around without explaining them. If you find a word or two that you just don't get, Google is your friend.
A typical feedforward neural network consists of an input layer of size D, one or more hidden layers, where the number of neurons per layer is a tunable hyperparameter, and an output layer, the size of which depends on the type of problem we are dealing with: classification or regression. Classification and regression are the two broad categories of problems that encompass supervised learning, a statistical learning method where we are given labels Y with training data X. When we say a neural network should learn, we mean that it should determine the weights of the edges between the layers and the biases for every layer. Since neural networks are parametric models, the problem of modeling the given data is reduced to just determining a fixed set of network parameters, that is, the network's weights and biases. So how are they determined? Such values are determined by an estimator, and we consider the maximum likelihood estimator, a core ML concept. I'm illustrating estimation with MLE because it is one of the most fundamental estimators: it drives basic algorithms like logistic regression, naive Bayes classification, and mixture models, and can even drive linear regression. Because of its importance in core machine learning algorithms, I'm thinking of making an entire video dedicated to estimators, or at the very least maximum likelihood estimation, so let me know your thoughts about that in the comments down below. In this video, however, I will derive MLE in the context of neural networks. We have N samples X with corresponding labels Y used to train the neural network. How does maximum likelihood estimation work? In the supervised learning context, MLE determines the parameters that maximize the probability of seeing the data. Here, theta represents the neural network parameters, the weights and biases, and theta hat MLE is the maximum likelihood estimate of these network parameters.
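In standard notation, the estimate just described can be sketched like this (this is my rendering of the on-screen formula, with p denoting the likelihood of the labels given the data and the parameters):

```latex
\hat{\theta}_{\mathrm{MLE}} \;=\; \underset{\theta}{\arg\max}\; p(\mathbf{Y} \mid \mathbf{X};\, \theta)
```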
Assuming the samples are independently and identically distributed, that is, IID, we can multiply the probabilities for the N samples. We're going to evaluate this function in two ways. The first is when the neural network is used for classification, and the second is when it is used for regression. In the classification setting, the labels can be modeled as Bernoulli variables, or more generally categorical variables, as they can only take a small set of specific values: in the flipping of a coin, only two values can be taken, heads or tails, and in the rolling of a die, there are only six values, one to six. Regardless of the value of N, the number of values each sample can take is fixed and is equal to the number of classes. Here, tnk is either zero or one and indicates the target class, that is, the actual value. We get the matrix T by one-hot encoding the given labels. Maximizing the likelihood is the same as maximizing the log likelihood. We take logarithms to convert a product of products into a sum of sums, which is much easier to compute. The negative log likelihood is the error we wish to minimize. For a single sample xn, doesn't this error look like the cross entropy loss? You've probably seen this term used as an objective function in classification problems. I hope you understand now why we use this form. Similarly, let's get the maximum likelihood estimate for regression. We have our likelihood function; the range of yn for regression problems is any real number, so we cannot model it as a Bernoulli variable as we did in classification. However, here let us assume that the conditional distribution of the labels y given the samples x is Gaussian, with the mean being the estimated label y hat and the variance being some sigma squared. We then substitute this into our likelihood equation. The product of exponential terms is annoying to evaluate, so let's consider the log likelihood.
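As a minimal sketch of this (the function and variable names are my own, not the video's), the negative log likelihood for a single sample with one-hot target t and predicted class probabilities y hat is exactly the cross entropy loss:

```python
import numpy as np

def cross_entropy(t, y_hat):
    """Negative log likelihood of a one-hot target t under predicted
    class probabilities y_hat: E_n = -sum_k t_k * log(y_hat_k)."""
    return -np.sum(t * np.log(y_hat))

t = np.array([0.0, 1.0, 0.0])      # one-hot target: the true class is class 1
y_hat = np.array([0.2, 0.7, 0.1])  # predicted class probabilities
loss = cross_entropy(t, y_hat)     # only the true class's term survives: -log(0.7)
```

Because t is one-hot, every term but the true class's drops out of the sum, which is why this cost pushes the predicted probability of the correct class toward one.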
Removing the constant terms with respect to the data, which is every term without the y hat, we then get an equation that resembles the residual sum of squares. Once again, I'm sure you have seen the RSS being used as a measure of cost in regression problems, and now you know why. I'm keeping the half because it cancels out when I take the derivative with respect to y hat; the cost will still converge at the same point, as it has the same minima. The maximum likelihood estimator is consistent and asymptotically unbiased, so with enough data the expected value of the parameters predicted using MLE approaches the true value, that is, theta star. This means that given an infinite amount of training data, theta MLE will converge to theta star. In other words, the estimated values of your parameters will take their optimal values. This is why in machine learning we claim data is king: if you have more data, then the chances of fitting a good model are much higher. I hope you understand now why we use the RSS to measure regression loss and cross entropy to measure classification loss. We now need to find the value of theta that minimizes this cost, and this is where neural network training takes place. We'll use stochastic gradient descent for this training. This involves minimizing the cost every time we look at a random sample, using the gradient of the cost with respect to the weights and biases. Hence, we need to find these gradients. Determining these gradients and updating the weights involves a two-step process: forward propagation and back propagation. Let's first introduce some notation for our neural network. xn is a D-dimensional input. w superscript l subscript ji is the weight on the edge connecting the ith neuron in layer l minus one to the jth neuron in layer l. b superscript l subscript i is the weight connecting the bias neuron to the ith neuron in layer l. And we let n subscript l represent the number of neurons in layer l.
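Here is a corresponding sketch for the regression cost (again, the names are my own): keeping the one-half means the two from differentiation cancels, and the gradient with respect to y hat comes out as simply y hat minus t.

```python
import numpy as np

def half_rss(t, y_hat):
    """Regression cost for one sample: E_n = 0.5 * sum_k (t_k - y_hat_k)^2."""
    return 0.5 * np.sum((t - y_hat) ** 2)

def half_rss_grad(t, y_hat):
    """dE_n/dy_hat: the factor of 2 from differentiation cancels the 1/2."""
    return y_hat - t

t = np.array([1.0, 2.0])
y_hat = np.array([0.0, 0.0])
cost = half_rss(t, y_hat)  # 0.5 * (1 + 4) = 2.5
```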
We can consider u as the inner product between the output activation of the previous layer a and the weight matrix w, plus the bias. And we have an activation a as a function of u, which is typically nonlinear. Throughout this video, the math may look a tad messy with all the superscript and subscript notation. This could be cleaned up through vectorization, that is, by converting these scalars into a vector representation. But in order to do that, I would have to explain Jacobian matrices and ensure that the vector multiplications are in the right order, and I don't want you to be bogged down by mathematical notation; it would take away from the intuition. So I left it out of this video. We have some special cases in our notation. First, a superscript zero is equal to x; that is, the output of the zeroth layer is the input to the network. Second, for the last layer L, the activation for classification is a softmax, but for regression it's just a linear activation. And third, we have the activation output of the last layer equal to the prediction y hat. As a consequence of the second point, in the regression case, since there is a linear activation, we can say that a is equal to u. Now that we have defined the notation for any L-layer neural network, we can begin the neural network training process. The first stage of this is forward propagation. We start with a random initialization of the weights and biases, and then compute u and a for every neuron by applying these two formulae alternately; hence the values propagate forward through the network. This is used to compute the prediction y hat. y hat, along with the actual label t, is used to compute the total cost for this sample, represented as E subscript n. We consider the total cost E as a sum of the costs incurred by each sample, En. This is a reasonable assumption. Now that we have the cost, we need to compute the gradient of the cost with respect to the weights and biases.
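Here is a minimal forward propagation sketch, assuming a sigmoid activation on every layer for simplicity (in the video's setup, the last layer would instead use a softmax for classification or a linear activation for regression; all names are my own):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, weights, biases):
    """Alternately compute u^l = W^l a^(l-1) + b^l and a^l = f(u^l).
    weights[l] has shape (n_l, n_(l-1)); biases[l] has shape (n_l,)."""
    a = x                      # special case: a^0 = x, the network input
    us, activations = [], [a]
    for W, b in zip(weights, biases):
        u = W @ a + b          # pre-activation for layer l
        a = sigmoid(u)         # activation for layer l
        us.append(u)
        activations.append(a)
    return us, activations

# Tiny 2-3-1 network with random initialization.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
us, activations = forward(np.array([0.5, -0.2]), weights, biases)
```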
This gradient computation is done with back propagation. There is a lot of mystique surrounding the back propagation algorithm, specifically that it is too difficult to understand without being a math whiz. But I'll tell you now that you really don't need to know much to know back propagation. In fact, backprop is just the chain rule. Just like how forward propagation gets its name from the fact that the values of u and a propagate in the forward direction through the network, the back propagation algorithm gets its name from the fact that the gradients with respect to u and a propagate in the reverse direction. And so we determine the cost gradients with respect to u and a throughout the network, and these are used to determine the cost gradients with respect to the weights and biases, which is what we set out to do. Starting with the last layer, the cost gradient with respect to the activation simplifies to the difference between the target and predicted value in regression, or the negative of the ratio between the target and predicted value in classification. This can be obtained by simple differentiation. Next, we compute the cost derivative with respect to u for all neurons in the last layer. A well known convention used in textbooks and blog posts is to let this gradient be some delta. Note that you don't really need to know this delta convention; however, it makes the entire notation much easier to read, and it is also very useful in the recursive form that I will describe soon. To compute these delta values for all neurons in the last layer, just apply the chain rule through the activation a of the same neuron; we have both terms, so the deltas for every neuron in the last layer can be computed. So far, we have computed the cost as well as the gradients with respect to a and u for the last layer neurons.
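A sketch of the last-layer deltas (names mine): for a linear output with the half-RSS cost, and likewise for a softmax output with the cross entropy cost, carrying out the chain rule collapses to the same simple form, y hat minus t.

```python
import numpy as np

def output_delta(y_hat, t):
    """delta^L = dE_n/du^L. For linear output + half-RSS, and also for
    softmax output + cross entropy, the chain rule simplifies to y_hat - t."""
    return y_hat - t

delta_L = output_delta(np.array([0.2, 0.7, 0.1]), np.array([0.0, 1.0, 0.0]))
```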
Now let's compute the value of the gradient with respect to the activation a for the neurons in layer L minus one, that is, the layer before the last one. And how do we do this? Using our good friend, the chain rule, through u. So in this form, why am I summing over all k? To understand this, we need to know what the term on the left hand side represents. It is the change in the cost with respect to the change in the activation of the jth neuron in layer L minus one. A change in this neuron's activation affects all neurons in the next layer, all of them. So when the gradients propagate backwards, we can visually see the neurons in layer L converging on this jth neuron in layer L minus one. The overall effect on the gradient is the sum of products of the deltas and the corresponding weights on the edges. This is exactly the information conveyed by the chain rule. Generalizing this for any layer l, it is just the dot product of the deltas and weights of the next layer. Similarly, let us try to express the cost gradient with respect to u for the jth neuron using the gradient with respect to a that we just computed. First, start with the good old chain rule. And that's it. We know the value of the gradient with respect to a subscript j from the previous step, and the second term is just the derivative of the activation function with respect to u. Like this, we compute the cost gradients with respect to u and a for every neuron in the network. And that's pretty simple, right? What you see in many textbooks and other references is the same equations but written in the delta notation. Substituting the first and second terms with their corresponding values, we get the following form. Bring the derivative of the activation f outside the summation, as it is independent of k. Now the left hand side corresponds to delta j for the lth layer. And lookie here, we have a recursion in delta. In the first case, we express the gradient with respect to u at layer l in terms of a at layer l, which in turn is expressed in terms of u at layer l plus one.
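The recursion just derived can be sketched for a whole layer at once (names mine; f_prime_u is the elementwise derivative of the activation evaluated at u for layer l):

```python
import numpy as np

def hidden_delta(W_next, delta_next, f_prime_u):
    """delta^l_j = f'(u^l_j) * sum_k w^(l+1)_(kj) * delta^(l+1)_k:
    the next layer's deltas flow backwards through its weights."""
    return f_prime_u * (W_next.T @ delta_next)

# Tiny example: two neurons in layer l, one neuron in layer l+1.
W_next = np.array([[0.5, -1.0]])  # shape (n_(l+1), n_l)
delta_next = np.array([2.0])      # deltas for layer l+1
f_prime_u = np.array([1.0, 1.0])  # e.g. a linear activation, where f'(u) = 1
delta_l = hidden_delta(W_next, delta_next, f_prime_u)
```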
So that's two separate equations. But in the second case, we express u at layer l directly in terms of u at layer l plus one. As far as complexity goes, they both do the same thing. It's just that the first one is easier to read because it's broken down into two equations, or two phases, while the second one has a nice little recursion. Once we compute all the cost gradients with respect to u and a for every single neuron in the network, the next step is to compute the cost gradients with respect to the weights and biases. The derivative of the cost En with respect to wji represents how much this edge affects the cost. We can apply the good old chain rule through u subscript j and then simplify: substitute the first term with delta, and the second term can easily be computed with differentiation. And what about the bias? We do the same thing: apply the chain rule through u to get the delta form and substitute the values. Once you determine all the weight and bias gradients in the network, perform the stochastic gradient update and get the new weights for the network. This is repeated for every sample in our training set. To summarize the process: the forward propagation algorithm is just determining u and a for every layer by repeatedly applying these two formulae, and the back propagation algorithm is just the chain rule used to determine the cost gradients with respect to u and a. Clearly, you don't need to memorize the formulae to accomplish this. But there are five basic ones that we derived, or only four if we combine two of them. In case you question the authenticity of these formulae, I'll have you know that these are the same formulae given by Michael Nielsen in his Neural Networks and Deep Learning book, but in scalar notation. In the case of a batch update, take the simple mean of the gradients over all m samples in the batch and substitute this value into the weight update. There are other topics of interest, like how we can obtain much better weight and bias estimates through techniques like momentum.
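Putting these last steps together, here is a sketch of the weight and bias gradients and the stochastic gradient descent update (names mine; the learning rate eta and its value are assumptions, not something fixed by the derivation):

```python
import numpy as np

def weight_bias_grads(delta, a_prev):
    """dE_n/dw^l_(ji) = delta^l_j * a^(l-1)_i, an outer product over the
    layer; dE_n/db^l_j is just delta^l_j, since du/db = 1."""
    return np.outer(delta, a_prev), delta

def sgd_step(W, b, dW, db, eta=0.1):
    """One stochastic gradient descent update for a single layer."""
    return W - eta * dW, b - eta * db

delta = np.array([1.0, -2.0])          # deltas for layer l
a_prev = np.array([0.5, 0.0, 1.0])     # activations of layer l-1
dW, db = weight_bias_grads(delta, a_prev)
```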
But I think that's a topic best left for a separate video; I don't want to overload this one. Here's the simple algorithm for neural net learning using stochastic gradient descent: pick a random sample xn; compute u and a in the forward direction; find the cost incurred by the sample, that is, En; determine the gradients with respect to a and u in the backward direction; compute the cost gradients with respect to the weights and biases; and update w and b using the stochastic gradient descent update rule. We repeat this for a number of epochs until convergence. Like I mentioned at the beginning of this video, different authors say the same thing in different ways. If you're going to learn any concept, pick a verifiable source and stick to it. Don't get caught up in the notation. Furthermore, there's nothing you need to memorize here. Everything, from maximum likelihood estimation to determine the right cost to minimize, to the core concepts of back propagation, can be derived entirely by hand, as we just did. And that's all I have for you today. If you liked the video, hit that like button. If you like videos like this, hit that subscribe button. If you want to be notified every time I upload, hit that bell icon. Links to blog posts and papers referenced in this video are down in the description below, so check them out. Links to my equipment and other resources are also in the description. Interested in other hot topics in the field? Click or tap one of the videos right there, and I will see you in the next one.