In this video we will go through the derivation of the backpropagation algorithm. We will start with a review of forward propagation. Then we will introduce the training data and training labels for a supervised training framework, where the objective of training is the minimization of a loss function. We will then introduce a local error through the network: we start by defining an output error, and then we generalize the error to a generic layer $l$. We will also review briefly how to compute the Jacobian of a composition of functions. Finally, we will see how to compute the parameter gradients, and we will collect the equations of the backpropagation algorithm, which can be used with stochastic and mini-batch gradient descent to minimize the loss function we have introduced.

We start with an overview of the equations we have already seen so far. The input is the vector $x = (x_1, x_2, \dots, x_n)^\top \in \mathbb{R}^{s_1}$, where $s_1$ is the size of layer 1. The activation of layer $l+1$ is our nonlinear function, in this case the sigmoid, applied to a quantity we call the weighted input $z^{(l+1)}$:

$$a^{(l+1)} = \sigma\big(z^{(l+1)}\big) = \sigma\big(\Theta^{(l)} \hat a^{(l)}\big), \qquad l = 1, 2, \dots, L-1,$$

where $\Theta^{(l)}$ maps the activations of layer $l$ to layer $l+1$, and $\hat a^{(l)}$ is the vector $a^{(l)}$ with the bias entry $1$ on top. $a^{(l+1)}$ has size $s_{l+1}$. Finally, the hypothesis $h_\Theta(x)$, based on the parameters $\Theta$ applied to our input $x$, is the activation of the last layer $L$:

$$h_\Theta(x) = a^{(L)} \in \mathbb{R}^{s_L} = \mathbb{R}^K.$$

$\Theta^{(j)}$ is the mapping from layer $j$ to layer $j+1$; it has as many rows as the size of layer $j+1$ and as many columns as the inputs of the current layer plus one for the bias, so $\Theta^{(j)} \in \mathbb{R}^{s_{j+1} \times (s_j + 1)}$. And $\Theta = \{\Theta^{(1)}, \Theta^{(2)}, \dots, \Theta^{(L-1)}\}$ is the set of all the mappings for all the layers, from the one mapping layer 1 to 2, layer 2 to 3, up to the last one, which maps layer $L-1$ to the final layer $L$.

To train our system, we provide some examples. $X$ is the matrix holding the actual data, organized as row vectors: the first row is the first example $x^{(1)\top}$, then the second example $x^{(2)\top}$, and so on up to the last one, $x^{(M)\top}$. $Y$ is the matrix of labels, which has $M$ rows like $X$: the first label $y^{(1)\top}$, the second label $y^{(2)\top}$, up to $y^{(M)\top}$. The number of columns of $Y$ is $K$, because we saw that $h_\Theta(x) \in \mathbb{R}^K$; the number of columns of $X$ is $n$. Written out in extended form,

$$X = \begin{pmatrix} x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_n \\ x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_n \\ \vdots & & & \vdots \\ x^{(M)}_1 & x^{(M)}_2 & \cdots & x^{(M)}_n \end{pmatrix}, \qquad Y = \begin{pmatrix} y^{(1)}_1 & y^{(1)}_2 & \cdots & y^{(1)}_K \\ y^{(2)}_1 & y^{(2)}_2 & \cdots & y^{(2)}_K \\ \vdots & & & \vdots \\ y^{(M)}_1 & y^{(M)}_2 & \cdots & y^{(M)}_K \end{pmatrix},$$

so each row of $X$ holds the features of one sample and each row of $Y$ holds its labels. If $K = 1$, then the matrix $Y$ is simply the vector $y$.
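As a concrete reference, here is a minimal numpy sketch of the forward pass under the conventions above; the function names, the use of numpy, and the toy sizes in the note below are my own, not from the video.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward pass. thetas[l-1] is Theta^{(l)}, of shape (s_{l+1}, s_l + 1).
    Returns the weighted inputs z^{(2)}..z^{(L)} and the bias-augmented
    activations a_hat^{(1)}..a_hat^{(L)}."""
    a_hat = np.concatenate(([1.0], x))      # a_hat^{(1)}: bias 1 on top of x
    zs, a_hats = [], [a_hat]
    for theta in thetas:
        z = theta @ a_hat                   # z^{(l+1)} = Theta^{(l)} a_hat^{(l)}
        a = sigmoid(z)                      # a^{(l+1)} = sigma(z^{(l+1)})
        a_hat = np.concatenate(([1.0], a))
        zs.append(z)
        a_hats.append(a_hat)
    return zs, a_hats                       # h_Theta(x) = a_hats[-1][1:]
```

For instance, with layer sizes $s_1 = 3$, $s_2 = 4$, $s_3 = K = 2$, `thetas` would hold two matrices of shapes `(4, 4)` and `(2, 5)`, matching $\Theta^{(j)} \in \mathbb{R}^{s_{j+1} \times (s_j + 1)}$.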
We can now define a loss function, which evaluates how far our prediction is from the actual label. Our loss function $\mathcal{L}$, which is a function of the parameters only, can be expressed as an average of the errors over all the given examples:

$$\mathcal{L}(\Theta) = \frac{1}{M} \sum_{i=1}^{M} E\big(h_\Theta(x^{(i)}), y^{(i)}\big).$$

We provide our neural network with all $M$ examples and compute the average, and this is our loss function. From here on we drop the index $i$, because otherwise the notation becomes a little messy; $E$ expresses the output error of the network for one example. The error, which is a function of the output of the network, tells us how far we are from the true labels. A standard example of error function is the squared error,

$$E = \frac{1}{2}\,\big\| y - h_\Theta(x) \big\|^2 = \frac{1}{2} \sum_k \big(y_k - a^{(L)}_k\big)^2,$$

where the factor one-half will be helpful in later steps. This is still computed for example $i$, but we omit that notation, otherwise it becomes really unbearable. We want to keep in mind that $E$ is a function of the output of our network, where $x$ and $y$ are fixed and $\Theta$ can be varied.

Our goal is to find the rate of variation of the error when changing a given parameter $\theta^{(l)}_{ij}$. In this way we can move from the point in parameter space where the error was evaluated towards a point down the slope of the error. So if we have our cost function and we are at some point on it, we can try to move towards a minimum of the cost function.

Let's introduce now the local error. If we take our weighted input $z^{(l)}_i$ and replace it with itself plus a small variation, $z^{(l)}_i + \Delta z^{(l)}_i$, then we expect the error to change from $E$ to

$$E + \frac{\partial E}{\partial z^{(l)}_i}\,\Delta z^{(l)}_i,$$

that is, the rate of variation of the error with respect to this particular weighted input multiplied by the perturbation. Here we can see two different cases. In the first case, $\big|\partial E / \partial z^{(l)}_i\big|$ is significantly greater than zero, so some variation of the weighted input $z^{(l)}_i$ can bring an improvement. Instead, when $\big|\partial E / \partial z^{(l)}_i\big|$ is nearly zero, this neuron is near optimal: variations of this weighted input won't have any relevant influence on the final error $E$. We see that the partial derivative of the error with respect to a specific weighted input can be read as an error term: if it is nonzero, we can still improve our cost function; if it is zero, we can no longer change much. We define

$$\delta^{(l)}_i = \frac{\partial E}{\partial z^{(l)}_i},$$

which is the error of the $i$-th neuron at layer $l$.
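Continuing the same hypothetical numpy sketch, the loss is just the average of the per-example errors; `forward` is the helper from the previous snippet, and the helper names are again my own.

```python
def error(y, a_L):
    """Squared error for one example: E = 1/2 sum_k (y_k - a_k^{(L)})^2."""
    return 0.5 * np.sum((y - a_L) ** 2)

def loss(X, Y, thetas):
    """Loss L(Theta) = (1/M) sum_i E(h_Theta(x^{(i)}), y^{(i)}).
    The rows of X and Y are the M examples and their labels."""
    total = 0.0
    for x, y in zip(X, Y):
        _, a_hats = forward(x, thetas)
        total += error(y, a_hats[-1][1:])   # strip the bias entry of a_hat^{(L)}
    return total / len(X)
```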
Let's compute, therefore, the output error $\delta^{(L)}_i$. By the chain rule we can simply write

$$\delta^{(L)}_i = \frac{\partial E}{\partial a^{(L)}_i}\,\frac{\partial a^{(L)}_i}{\partial z^{(L)}_i} = \frac{\partial E}{\partial a^{(L)}_i}\,\sigma'\big(z^{(L)}_i\big),$$

so the first factor is the partial derivative of our error function with respect to the output of the network, and the second is simply the derivative of our sigmoid at component $i$ of the weighted input of the last layer $L$. Let's see what these two factors are, because we basically already know everything. Consider our error $E = \frac{1}{2}\sum_k \big(y_k - a^{(L)}_k\big)^2$. The one-half cancels the factor two from differentiation, and the summation disappears because all the components vanish except the $i$-th one, so

$$\frac{\partial E}{\partial a^{(L)}_i} = a^{(L)}_i - y_i.$$

If we use the sum of squared errors, then, the first factor is simply $a^{(L)}_i - y_i$. The second factor is the derivative of the sigmoid; let's zoom in on it. If $\sigma(z) = \big(1 + e^{-z}\big)^{-1}$, then bringing the exponent $-1$ down gives

$$\sigma'(z) = -\big(1 + e^{-z}\big)^{-2} \cdot e^{-z} \cdot (-1) = \frac{e^{-z}}{\big(1 + e^{-z}\big)^2} = \big(1 + e^{-z}\big)^{-1} - \big(1 + e^{-z}\big)^{-2},$$

where in the last step we wrote $e^{-z} = (1 + e^{-z}) - 1$ and simplified. We can easily recognize the first term as the sigmoid of $z$ and the second as the sigmoid squared, so

$$\sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big) = a\,(1 - a),$$

since $\sigma(z)$ is simply the activation $a$. So in this case the second factor is simply $a^{(L)}_i \big(1 - a^{(L)}_i\big)$. For the sake of generality, we can keep the more compact form with the partial derivative of the error and the derivative of the nonlinear function, in case you would later like to change the error function or the nonlinearity. Finally, we can write the vectorial form of the error at the last layer:

$$\delta^{(L)} = \nabla_a E \odot \sigma'\big(z^{(L)}\big),$$

the gradient of the error with respect to the output, component-wise multiplied by the derivative of the nonlinear function. Again, in our specific case the first factor would simply be $a^{(L)} - y$ and the second factor $a^{(L)} \odot \big(1 - a^{(L)}\big)$.
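In the same sketch, the output error and the sigmoid derivative we just derived look like this; the helper names are my own, and the squared error is assumed.

```python
def sigmoid_prime(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)), as derived above."""
    s = sigmoid(z)
    return s * (1.0 - s)

def output_delta(y, z_L):
    """delta^{(L)} = grad_a E (*) sigma'(z^{(L)}); for the squared error
    grad_a E = a^{(L)} - y, so this is (a^{(L)} - y) * a^{(L)} * (1 - a^{(L)})."""
    return (sigmoid(z_L) - y) * sigmoid_prime(z_L)
```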
We would now like to compute the error at any layer $l$ of the network. To do so, we need to express the error at layer $l$ as a function of the error at the following layer: seen from layer $l$, the error is the composition of the error at layer $l+1$ with the map from layer $l$ to layer $l+1$, and we have to compute its partial derivatives. Let's refresh how to compute the Jacobian of a composite function. Say we have a function $g: \mathbb{R} \to \mathbb{R}^n$ differentiable in $u_0$, and a function $f: \mathbb{R}^n \to \mathbb{R}$, a scalar field, differentiable in $x_0 = g(u_0)$. Then $h = f \circ g: \mathbb{R} \to \mathbb{R}$ is differentiable in $u_0$, and

$$h'(u_0) = J_f(x_0)\, J_g(u_0),$$

the classical matrix multiplication of the two Jacobians. But more specifically, since $f$ is a scalar field, its Jacobian is simply the row vector $\big(\partial f/\partial x_1(x_0), \dots, \partial f/\partial x_n(x_0)\big)$, multiplied by the component-wise derivative of $g$, the column vector $\big(g_1'(u_0), \dots, g_n'(u_0)\big)^\top$; this equals the scalar product of the gradient of the first function computed in $x_0$ with the derivative of the second one.

So here the current error at layer $l$ is expressed as a function of the error at the following layer. We have $\delta^{(l)}_i = \partial E / \partial z^{(l)}_i$ and $\delta^{(l+1)}_j = \partial E / \partial z^{(l+1)}_j$, and we can write the former as the scalar product of the two Jacobians, a summation over all the components $j$:

$$\delta^{(l)}_i = \frac{\partial E}{\partial z^{(l)}_i} = \sum_j \frac{\partial E}{\partial z^{(l+1)}_j}\,\frac{\partial z^{(l+1)}_j}{\partial z^{(l)}_i} = \sum_j \delta^{(l+1)}_j\,\frac{\partial z^{(l+1)}_j}{\partial z^{(l)}_i}.$$

We already know that the first factor is $\delta^{(l+1)}_j$; we just have to find out what the second factor is, which is not that complicated. It is the partial derivative of the weighted input at layer $l+1$, component $j$, with respect to the weighted input at layer $l$, component $i$. To perform this partial derivative, we simply write the $j$-th component of the weighted input of layer $l+1$ as a function of the current layer:

$$z^{(l+1)}_j = \sum_k \theta^{(l)}_{jk}\,\hat a^{(l)}_k = \sum_k \theta^{(l)}_{jk}\,\sigma\big(z^{(l)}_k\big).$$

If we take the partial derivative with respect to $z^{(l)}_i$, the summation goes away and what remains is the argument of the summation with $k$ set to $i$:

$$\frac{\partial z^{(l+1)}_j}{\partial z^{(l)}_i} = \theta^{(l)}_{ji}\,\sigma'\big(z^{(l)}_i\big).$$

Finally, substituting into the expression from the previous slide, the error component $i$ at layer $l$ is

$$\delta^{(l)}_i = \sum_j \theta^{(l)}_{ji}\,\delta^{(l+1)}_j\,\sigma'\big(z^{(l)}_i\big).$$

Given that component $i$ of the error at layer $l$ is the summation over $j$ of the parameter $\theta^{(l)}_{ji}$ (note $ji$, not $ij$) multiplied by the error component $j$ at layer $l+1$, multiplied by the derivative of the nonlinear function at the same component, this can be written in vectorial form as the matrix $\Theta^{(l)}$ transposed multiplying the error of the following layer, component-wise multiplied by the derivative of the nonlinear function at the weighted input of layer $l$:

$$\delta^{(l)} = \big(\Theta^{(l)}\big)^\top \delta^{(l+1)} \odot \sigma'\big(z^{(l)}\big).$$
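Here is a sketch of this backward recursion, still under the assumed conventions of the earlier snippets. One implementation detail not spelled out in the vector formula: $(\Theta^{(l)})^\top \delta^{(l+1)}$ has length $s_l + 1$, and its first entry corresponds to the bias, which has no weighted input of its own, so the code drops it.

```python
def backward_deltas(delta_L, zs, thetas):
    """Backpropagate delta^{(l)} = ((Theta^{(l)})^T delta^{(l+1)})[1:] (*) sigma'(z^{(l)})
    from l = L-1 down to l = 2. Returns [delta^{(2)}, ..., delta^{(L)}]."""
    deltas = [delta_L]
    # zs = [z^{(2)}, ..., z^{(L)}]; thetas = [Theta^{(1)}, ..., Theta^{(L-1)}]
    for theta, z in zip(reversed(thetas[1:]), reversed(zs[:-1])):
        delta = (theta.T @ deltas[0])[1:] * sigmoid_prime(z)  # drop the bias row
        deltas.insert(0, delta)
    return deltas
```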
Our last effort is to compute the partial derivative of the error with respect to a specific parameter. Now that we have assembled all the tools over these slides, the final equations are pretty easy to write. The derivative goes through the weighted input of the following layer:

$$\frac{\partial E}{\partial \theta^{(l)}_{ij}} = \frac{\partial E}{\partial z^{(l+1)}_i}\,\frac{\partial z^{(l+1)}_i}{\partial \theta^{(l)}_{ij}} = \delta^{(l+1)}_i\,\hat a^{(l)}_j \qquad \text{for every } i \text{ and } j,$$

where the first factor is simply our error component $i$ at layer $l+1$, and the second is simply the $j$-th component of the activation at layer $l$. We computed the activation from the forward pass, and we computed the error from the backward pass. Finally, we can write the last formula, the full partial derivative of the error with respect to the parameters at layer $l$:

$$\nabla_{\Theta^{(l)}} E = \delta^{(l+1)}\,\big(\hat a^{(l)}\big)^\top,$$

where the first factor is a column vector and the second a row vector, so we can easily tell that the product is a matrix with the same dimensionality as $\Theta^{(l)}$.

And finally we have the list of five equations we are required to compute in order to perform backpropagation:

1. The first activation is the input with the extra bias term on top: $\hat a^{(1)} = (1, x)^\top$.
2. Forward pass: compute every other weighted input, $z^{(l+1)} = \Theta^{(l)} \hat a^{(l)}$; the activation is simply the nonlinearity applied to the weighted input, $a^{(l+1)} = \sigma\big(z^{(l+1)}\big)$; and again $\hat a^{(l+1)}$ is simply $a^{(l+1)}$ with the additional bias term $1$.
3. Error at the last layer: $\delta^{(L)} = \nabla_a E \odot \sigma'\big(z^{(L)}\big)$, the gradient of the error with respect to the output of the network, component-wise multiplied by the derivative of the nonlinear function.
4. Backpropagate the error through the network: $\delta^{(l)} = \big(\Theta^{(l)}\big)^\top \delta^{(l+1)} \odot \sigma'\big(z^{(l)}\big)$, the transposed weight matrix multiplying the error at the following layer, component-wise multiplied by the derivative of the activation function.
5. Gradient of the error with respect to any parameter in the network: $\nabla_{\Theta^{(l)}} E = \delta^{(l+1)}\,\big(\hat a^{(l)}\big)^\top$.

Note that we haven't specified any particular error function or activation function here, so based on your choice of error or activation function, you can simply adapt these formulas to your specific training.

Finally, if you would like to perform gradient descent, we can choose stochastic gradient descent, which simply says that the current parameters should be updated with the rule

$$\Theta^{(l)} \leftarrow \Theta^{(l)} - \eta\,\delta^{(l+1)}\,\big(\hat a^{(l)}\big)^\top,$$

with learning rate $\eta$, as we just saw; or, if we instead would like to perform batch or mini-batch gradient descent, the update rule averages the gradients over the examples:

$$\Theta^{(l)} \leftarrow \Theta^{(l)} - \eta\,\frac{1}{M} \sum_{i=1}^{M} \delta^{(l+1),(i)}\,\big(\hat a^{(l),(i)}\big)^\top.$$
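Putting the hypothetical sketches together, the gradients and both update rules might look as follows; `eta` and the averaging loop simply mirror the formulas above, and everything else reuses the helpers defined earlier.

```python
def gradients(x, y, thetas):
    """dE/dTheta^{(l)} = delta^{(l+1)} (a_hat^{(l)})^T for every layer:
    one forward pass and one backward pass per example."""
    zs, a_hats = forward(x, thetas)
    deltas = backward_deltas(output_delta(y, zs[-1]), zs, thetas)
    return [np.outer(d, a_hat) for d, a_hat in zip(deltas, a_hats[:-1])]

def sgd_step(x, y, thetas, eta=0.5):
    """Stochastic gradient descent on one example:
    Theta^{(l)} <- Theta^{(l)} - eta * delta^{(l+1)} (a_hat^{(l)})^T."""
    for theta, grad in zip(thetas, gradients(x, y, thetas)):
        theta -= eta * grad

def minibatch_step(X, Y, thetas, eta=0.5):
    """Mini-batch update: subtract eta times the average gradient."""
    grads = [np.zeros_like(t) for t in thetas]
    for x, y in zip(X, Y):
        for g, gi in zip(grads, gradients(x, y, thetas)):
            g += gi
    for theta, g in zip(thetas, grads):
        theta -= eta * g / len(X)
```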