Hello everyone, this is Alice Gell. In this video, I'm going to talk about the backpropagation algorithm. There's a huge amount of excitement around neural networks right now, and if you haven't yet learned the backpropagation algorithm, this video is for you. So what is the backpropagation algorithm? It is essentially an algorithm for learning the weights in a neural network using the gradient descent optimization algorithm. I will use a two-layer feed-forward neural network as an example.

Let me explain some notation. Let's start from the inputs x sub i. w1 sub ij denotes the weights between the input layer and the hidden layer. a1 sub j denotes the j-th weighted sum of the input units, and z1 sub j denotes the j-th hidden unit; it is the result of applying the activation function g to a1 sub j. The notation for the next layer is very similar. z1 sub j is the value of a hidden unit. w2 sub jk denotes the weights between the hidden layer and the output layer. Then a2 sub k denotes the k-th weighted sum of the hidden units. Finally, z2 sub k denotes the k-th output unit, and it is the result of applying the activation function g to a2 sub k. You can also think of the network structure as a chain, and I've written an example at the bottom of the slide.

Let's discuss the backpropagation algorithm. The most challenging step in the gradient descent algorithm is calculating the gradient for each weight. We could write down an expression for the gradient of each weight and evaluate it directly. Unfortunately, this approach is quite inefficient: neural networks nowadays tend to be quite large, and there could be thousands or more weights that we need to learn. The backpropagation algorithm is an efficient way of calculating the gradients for the weights.

Consider a setting where we have n training examples. For each training example, x consists of the input feature values and y is a label.
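The notation above can be made concrete with a small sketch in code. This is only an illustration, not the video's implementation: the sigmoid activation and the layer sizes are my own choices, and the video leaves g generic.

```python
# A minimal sketch of the two-layer feed-forward notation: inputs x_i,
# weights W1_ij and W2_jk, weighted sums a1_j and a2_k, and unit values
# z1_j and z2_k. The sigmoid g and the layer sizes are illustrative choices.
import numpy as np

def g(a):
    """Activation function (here a sigmoid, as an example)."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2

x = rng.normal(size=n_in)                 # input units x_i
W1 = rng.normal(size=(n_in, n_hidden))    # weights W1_ij, input -> hidden
W2 = rng.normal(size=(n_hidden, n_out))   # weights W2_jk, hidden -> output

a1 = W1.T @ x    # a1_j: weighted sums of the input units
z1 = g(a1)       # z1_j: hidden unit values
a2 = W2.T @ z1   # a2_k: weighted sums of the hidden units
z2 = g(a2)       # z2_k: output unit values

print(z2.shape)  # (2,) -- one value per output unit
```

Reading the chain x → a1 → z1 → a2 → z2 off this code is exactly the chain structure mentioned at the bottom of the slide.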
In other words, x is the input to our neural network and y is the expected output. To evaluate how close the actual output values z2 are to the expected output values y, we will use an error or loss function E. When we execute gradient descent, our goal is to minimize the error or loss E by adjusting the weights in the neural network.

Given the training examples, we will calculate the gradients by performing two passes through the network: a forward pass and a backward pass. The forward pass takes the input values x and the current weights w1 and w2, and calculates the error or loss E between the actual output values z2 and the expected output values y. The backward pass computes the gradients, which are the partial derivatives of the error function with respect to w2 and w1. For our network, or for any other network, the forward pass flows from left to right and the backward pass flows from right to left.

For each training example, we will calculate one gradient for each weight. Then, to update each weight, we add up the gradients for this weight over all the training examples, and we update the weight proportionally to the sum of the gradients.

Let me give you an intuitive description of the algorithm. Our goal is to set the weights in the network to minimize the error or loss. How do we do that? Using each training example, we will compute the gradient for each weight. The gradient tells us how much the error or loss changes if we change the weight by a tiny bit. If the gradient is positive, we should decrease the weight, and vice versa. The gradient guides us in how we should change each weight locally to minimize the error or loss.

Why do we need to add up the gradients for all the training examples? The reason is that different training examples may want to change each weight differently. One example may suggest that we should increase the weight, whereas another example may suggest that we decrease it. We do not want to predict only one example well.
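The update rule just described, sum the per-example gradients and move each weight opposite that sum, can be sketched in isolation. To keep the example self-contained I use a simple one-weight squared error as a stand-in loss; in the video the gradients would instead come from backpropagation through the network, and the learning rate eta is my own choice.

```python
# A sketch of the gradient-descent update described above: for each weight,
# sum the gradients over all training examples, then move the weight in the
# direction opposite the sum. The one-weight squared error here is only a
# stand-in so the example runs on its own.
def per_example_gradient(w, x, y):
    """Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

examples = [(1.0, 2.0), (2.0, 3.8), (3.0, 6.1)]  # (x, y) training pairs
w = 0.0
eta = 0.05  # learning rate (an illustrative choice)

for step in range(200):
    # Sum the gradients over all examples, as the video describes.
    grad_sum = sum(per_example_gradient(w, x, y) for x, y in examples)
    # A positive summed gradient decreases the weight, and vice versa.
    w -= eta * grad_sum

print(round(w, 2))  # -> 1.99
```

Note how each example pulls w toward its own best value, and the summed update settles on a compromise that fits all three examples together.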
We would like to achieve high prediction accuracy over all the examples. Therefore, we need to sum up the gradients over all the examples and use the sum to adjust each weight.

Let me go through the forward pass first. The forward pass starts from the input values on the left. First, we calculate a1 as the weighted sum of the input values using the weights w1. The hidden unit values z1 are the activation function g applied to the weighted sum a1. We calculate the output values in the same way: a2 is the weighted sum of the hidden unit values using the weights w2, and the output values z2 are the result of applying the activation function g to the weighted sum a2. Finally, the error or loss is a function of the actual output values z2 and the expected output values y.

Let's look at the backward pass. This is the core of the backpropagation algorithm. Our goal is to calculate the gradients for the weights: the partial derivatives of E with respect to w1 and w2. We will calculate the gradients by going backwards through the network, from the outputs on the right to the inputs on the left.

Starting with the outputs on the right, we first calculate the gradients for the weights w2. I've written the expression as the product of two terms: the partial derivative of E with respect to a2, and z1. Let me define the first term to be delta2. The second term, z1, is the input going into the edge for the weight w2. At this point, it may seem unnecessary to define delta2. However, the delta values will be extremely useful for understanding the backpropagation process. I will define one set of delta values for each layer. You will see shortly that the delta values for different layers form a recursive relationship, which will allow us to calculate the gradients efficiently.

Next, let's calculate the gradients for the weights w1. This expression is similar to that of w2.
The gradient is a product of two terms: the partial derivative of E with respect to a1, and the second term, x. Similarly, I'll define the first term to be delta1. The second term, x, is the input going into the edge for the weight w1. Note that the gradient for the weights in each layer has a similar expression: each is a product of two terms, the delta value and the input going into the weight.

The remaining question is, how do we calculate the delta values? On this slide, I'm showing you the expressions for delta2 for the output layer and delta1 for the hidden layer. Note that we need the delta2 values to calculate delta1. This gives you a hint of what the recursive relationship looks like.

Let's take a closer look at the recursive relationship of the delta values. This recursive relationship allows us to propagate the error backward through the network and calculate the gradients efficiently. It is also how the backpropagation algorithm got its name. Consider a unit j. This unit may be in the output layer or in any hidden layer. In general, delta sub j is the partial derivative of E with respect to a sub j, where a sub j is the weighted sum of the values from the previous layer. To calculate delta sub j, we need to consider two cases. When j is an output unit, we have the base case, and we calculate delta sub j directly. When j is a hidden unit, we have the recursive case. In the recursive case, calculating delta sub j requires delta sub k, the delta values for the next layer: the layer to the right, on the side closer to the output layer.

Let's think about the recursive case intuitively. The hidden unit j is connected to multiple units k in the next layer. Therefore, unit j is responsible for some fraction of the error delta sub k in each unit k that j connects to. For each unit k, we weigh the error delta sub k by the strength of the connection between j and k: this is the weight w sub jk.
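The backward pass and the delta recursion just described can be sketched in code for the two-layer example. The video leaves g and E generic, so the concrete choices here, a sigmoid activation and a squared error E = 0.5 * ||z2 - y||^2 (for which delta2 = (z2 - y) * g'(a2)), are my own.

```python
# A sketch of the backward pass: compute delta2 directly at the output layer
# (base case), then propagate it back through the weights W2 to get delta1
# (recursive case). Sigmoid g and squared error E are illustrative choices.
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):
    s = g(a)
    return s * (1.0 - s)  # derivative of the sigmoid

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input values
y = np.array([0.0, 1.0])        # expected outputs
W1 = rng.normal(size=(3, 4))    # input -> hidden weights
W2 = rng.normal(size=(4, 2))    # hidden -> output weights

# Forward pass.
a1 = W1.T @ x
z1 = g(a1)
a2 = W2.T @ z1
z2 = g(a2)

# Backward pass.
delta2 = (z2 - y) * g_prime(a2)       # base case: dE/da2 at the output layer
delta1 = g_prime(a1) * (W2 @ delta2)  # recursive case: next layer's deltas,
                                      # weighted by the connections W2_jk

# Each gradient is (delta value) * (input going into the weight's edge).
grad_W2 = np.outer(z1, delta2)   # dE/dW2_jk = delta2_k * z1_j
grad_W1 = np.outer(x, delta1)    # dE/dW1_ij = delta1_j * x_i
```

Note that the two delta lines have the same shape: a derivative of g times an error signal, with the error coming directly from y in the base case and from the next layer's weighted deltas in the recursive case.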
This weighted sum allows us to take the errors delta sub k from the next layer and propagate them back to calculate the error delta sub j in the current layer. Also note that the expressions for the two cases are quite similar. In particular, the second terms are identical: they are both the derivative of the activation function g. Also, since our network only has one hidden layer, we only need to apply the recursive case once. If a network has multiple hidden layers, we will need to apply the recursive case multiple times. Here's a practice question for you: construct a three-layer neural network with two hidden layers and write down the delta values for all three layers.

So far, I've shown you all the formulas you need to execute the backpropagation algorithm. If your goal is to implement the algorithm and run it, this is all you need. However, I hope you are also interested in how the expressions were derived. On the next few pages, I will derive the partial derivatives by applying the chain rule many, many times. Once you understand the derivations, you will realize that there is nothing mysterious about the recursion with the delta values: the backpropagation process emerges directly from the derivation of the gradients.

That's everything for this video. Let me summarize. After watching this video, you should be able to do the following: describe the backpropagation algorithm, and, given a multi-layer feed-forward neural network, calculate the gradients for each weight in the network. Thank you very much for watching. I will see you in the next video. Bye for now.