Hello everyone, this is Alice Gao. In this video, I will derive the expressions for the gradients in the backpropagation algorithm.

First, let's derive the partial derivatives that make up the gradient for W2. To remember what the network looks like, you can look at the picture on the right or the chain on the left. The chain is a simpler illustration of how the values flow through the network from inputs to outputs. In the picture on the right, I've highlighted one path, the one for the weight W2 sub 1, 2.

Before I start the derivations, let's do a super quick review of the chain rule. Deriving the gradients boils down to repeatedly applying the chain rule. Suppose that we have an expression in which we have applied several functions in sequence. For our example, I applied f first and then g second. Taking the derivative of the final expression with respect to x is equivalent to taking the derivatives of the functions in reverse order, from the last one applied to the first one applied. I often think of this process as peeling off the functions one by one. For our example, the derivative is equal to the derivative of g with respect to f multiplied by the derivative of f with respect to x.

We're ready to derive our first gradient expression, the partial derivative of e with respect to W2 sub jk. Our expression basically traces through the highlighted path from right to left. First, e is a function of z2 and y, therefore we have partial e over partial z2 sub k. Next, z2 is g of a2, therefore we have partial z2 sub k over partial a2 sub k. Next, a2 is a function of the weights W2 and the hidden unit values z1. Since we want to take the derivative with respect to the weights, we have partial a2 sub k over partial W2 sub jk. If you got here and understood how the expression was derived, great job! You got through the most difficult part, deriving the most general and abstract expression for this derivative. It only gets easier from here. If you still have questions, no worries, I will show you some more concrete examples right now, and they should help clear up some confusion.

Let's make some of the terms more concrete. What is the middle term? z2 is g of a2, so the middle term is the derivative of the activation function g, or in other words, g prime of a2. How about the last term? Well, a2 is the sum of W2 multiplied by z1, so the last term should be z1 sub j, the input corresponding to the weight W2 sub jk. This expression is equivalent to the previous one, but it's easier on the eyes.

Let's do an example for a particular weight. Consider the weight going from z1 sub 1 to a2 sub 2, the highlighted one in the picture. All we need to do is take the general expression and plug in j equals 1 and k equals 2. Here's the result. If you want to make this expression even more concrete, you'd have to pick the error function e and the activation function g. Here are two examples. For these examples, the error function is the sum of the squared differences between the actual outputs and the expected outputs, and the activation function is the sigmoid function. Interestingly, the derivative of the sigmoid function can be written as an expression involving two copies of the sigmoid function. If you don't believe it, please verify it yourself. I've also included some other equations to help you understand how I simplified the expression.
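If you'd like to see these pieces as numbers rather than symbols, here is a small NumPy sketch of the output-layer gradient. It is not code from the lecture: the function name grad_W2_jk and the example numbers are mine, and I assume the sigmoid activation together with a squared-error loss written with a one-half factor (my convention here; it simply cancels the 2 that comes out of the derivative).

```python
import numpy as np

# Minimal sketch, assuming E = 0.5 * sum_k (z2[k] - y[k])**2 and sigmoid
# activations. Not the lecture's code; numbers below are made up.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_W2_jk(z1, W2, y, j, k):
    """dE/dW2[j,k] = dE/dz2[k] * dz2[k]/da2[k] * da2[k]/dW2[j,k]
                   = (z2[k] - y[k]) * g'(a2[k]) * z1[j]"""
    a2 = z1 @ W2                      # a2[k] = sum_j W2[j,k] * z1[j]
    z2 = sigmoid(a2)                  # network outputs
    dE_dz2 = z2[k] - y[k]             # derivative of 0.5 * (z2[k] - y[k])**2
    dz2_da2 = z2[k] * (1.0 - z2[k])   # g'(a) = g(a) * (1 - g(a))
    da2_dW2 = z1[j]                   # the input attached to W2[j,k]
    return dE_dz2 * dz2_da2 * da2_dW2

# Example: the weight from z1 sub 1 to a2 sub 2 in the video's one-based
# indexing corresponds to j=0, k=1 with zero-based NumPy indexing.
z1 = np.array([0.3, 0.7])             # hypothetical hidden-layer activations
W2 = np.array([[0.1, -0.4],
               [0.5,  0.2]])          # hypothetical hidden-to-output weights
y = np.array([0.0, 1.0])              # expected outputs
print(grad_W2_jk(z1, W2, y, j=0, k=1))
```

Each line of the function mirrors one factor of the chain-rule expression, which is a handy way to check that the chain rule was applied correctly.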
Next, let's derive the expressions for the partial derivative of e with respect to w1. These expressions are more complex but also more interesting.

Let's look at a specific example. Consider w1 sub 1, 2, the weight between x sub 1 and z1 sub 2. I've highlighted the weight in the picture. How does this weight influence the error, or loss? This weight affects a1 sub 2, which affects z1 sub 2. From there, it affects the error e through two paths, one through each of the two output units. Because of this, we will have a summation with two terms in our expression.

Let's derive our expression now. To apply the chain rule, we'll look at the highlighted path and go from right to left. We'll start with a summation over k since we need to consider two paths, one for each output unit z2 sub k. e is a function of both z2 values, so the first term is partial e over partial z2 sub k. Next, z2 is g of a2. Therefore, the next term is partial z2 sub k over partial a2 sub k. After that, a2 is a weighted sum of the z1 values, so the next term is partial a2 sub k over partial z1 sub j. At this point, the two paths merge into one, so we can end our summation. Let's keep following the single path from z1. z1 is g of a1, so the next term is partial z1 sub j over partial a1 sub j. a1 is a weighted sum of x using the weights w1, so our final term is partial a1 sub j over partial w1 sub ij. That's the entire expression.

Similar to the previous derivation, you might find it helpful to make this expression more concrete. If we simplify some terms, we'll get the second expression. Whenever we have a derivative of z with respect to a, the result is the derivative of the activation function g. We have two other similar terms: the derivatives of a with respect to z or w. a2 is a sum of z1 weighted by w2, so partial a2 sub k over partial z1 sub j is equal to w2 sub jk. Similarly, a1 is a sum of x weighted by w1, so partial a1 sub j over partial w1 sub ij is equal to the input value x sub i. This is all the simplification we can do without knowing the expressions for the error function e and the activation function g. You can also derive the specific expression for w1 sub 1, 2. Here it is.

We have derived some complicated derivative expressions so far. Finally, let's connect these to the delta values. Here are the simplified gradient expressions again. Note that the last term in each expression is the input value for that layer. Let's disregard the last term in both expressions, take the rest of the expression, and define it to be the delta value. I've denoted the two delta values as delta 2 sub k and delta 1 sub j. Note that delta 2 sub k appears inside the expression for delta 1 sub j. If we rewrite the delta expressions separately, we get the delta expressions that I showed you on a previous slide.
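To see the delta values in action, here is a second NumPy sketch that computes both gradients. Again, this is not the lecture's code: backprop_gradients is a name I made up, the assumptions are the same as before (sigmoid activations and a squared-error loss with a one-half factor), the dummy/bias units are left out for brevity, and the example numbers are invented.

```python
import numpy as np

# Minimal sketch of both gradient expressions written with the delta values,
# under the same assumptions as the earlier example.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_gradients(x, W1, W2, y):
    """Shapes: x (n_in,), W1 (n_in, n_hid), W2 (n_hid, n_out), y (n_out,)."""
    # Forward pass
    a1 = x @ W1
    z1 = sigmoid(a1)
    a2 = z1 @ W2
    z2 = sigmoid(a2)

    # Delta values: the gradient expressions with the layer's input left off
    delta2 = (z2 - y) * z2 * (1.0 - z2)        # delta2[k] = dE/dz2[k] * g'(a2[k])
    delta1 = (W2 @ delta2) * z1 * (1.0 - z1)   # delta1[j] = (sum_k W2[j,k]*delta2[k]) * g'(a1[j])

    # Multiply each delta by the input value for its layer
    grad_W2 = np.outer(z1, delta2)             # dE/dW2[j,k] = z1[j] * delta2[k]
    grad_W1 = np.outer(x, delta1)              # dE/dW1[i,j] = x[i] * delta1[j]
    return grad_W1, grad_W2

# Hypothetical example with 2 inputs, 2 hidden units, and 2 outputs.
x = np.array([1.0, -0.5])
W1 = np.array([[0.2, -0.3],
               [0.4,  0.1]])
W2 = np.array([[0.7, -0.2],
               [-0.6, 0.5]])
y = np.array([1.0, 0.0])
gW1, gW2 = backprop_gradients(x, W1, W2, y)
```

Notice how delta2 is reused inside delta1, and how each gradient is just the layer's input multiplied by its delta, which is exactly the structure we derived above.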
Here's a final practice question for you. Construct a three-layer network with two hidden layers. Each hidden layer has two real nodes and one dummy node. The output layer has two nodes. Calculate the gradients for all the weights in the three layers: write down the general expressions, and then write down the expressions using the delta values. When using the delta values, be sure to write down the expressions for the delta values for every layer. If you understood everything in this video, you should be able to complete this question.

That's everything on the backpropagation algorithm. Let me summarize. After watching this video, you should be able to do the following: derive the expressions for the gradients in the backpropagation algorithm, and explain how we define the delta values in the gradient expressions.

Thank you very much for watching. I will see you in the next video. Bye for now.