So, is learning possible with convolutions and max pooling? Well, so far it simply worked, like you just built a system, but let's see why it works and how it works.

Effectively, if we want to backprop through a convolutional layer, we can express the forward pass like this: we have a matrix on the left-hand side with the weights, we have the outputs, and what we calculate on the right-hand side are just the standard equations for convolution. For the first output we use the inputs x11, x12, x21, x22, these four; then here we have these four, here those four, and lastly here those four. Our goal is to compute the local gradients: we want to know the derivatives of the outputs with respect to the weights, and the derivatives of the outputs with respect to the inputs (a worked version of these equations is sketched at the end of this section).

So let's see what we have there. We can apply this to the first weight and we get four terms, we can apply it to the next weight and we get four terms, and so on and so forth. And this is what the gradients look like: x22 is going to be used by all four locations, and that's why the derivative with respect to it has four terms. You see, x22 has an influence on each of the four outputs, while the other inputs influence at most two of them. But this way we can calculate the derivatives.

Importantly, we can also express convolution in general as a matrix operation: we take the input, convert it into a vector, and then convolution is just a multiplication. So we have a linear operation. Convolution is simply a dot product between the kernel and local regions of the input, and we can therefore produce a stretched kernel. What you have here are 16 positions, assuming a four-by-four input; these positions then appear at different places in the matrix. In each row we have nine of them, the three rows of the kernel, and for each input position sometimes four of the outputs are relevant, sometimes fewer. So we can express convolution as a matrix multiplication: we take this stretched-out kernel, we flatten the input into a column matrix, and we just have a regular matrix multiplication between the two (see the im2col sketch below). And for such a matrix multiplication, of course, the gradients are perfectly well defined.

That leaves us with the other part: we need to do backpropagation through a max pooling layer. Locally, max is linear with respect to the maximum value. At any given point in time one value will be larger than the others, unless you're at the place where they're all exactly the same, which will effectively never happen. So what do we have here? For the max of x1 through xn, the derivative with respect to xi is one if xi is the maximum, and zero otherwise. So in addition to storing the maximum value, all autograd has to do is store the index corresponding to that value, and then treat it as if all the other inputs had no influence (see the sketch below).

Now I want you to think a little bit about what's happening there. Will there be zero gradient with respect to some of the weights? And how do things like the stride and the max pool size affect the gradients? Let me highlight why this could be happening: out of the inputs, only the one that is the largest will have any influence, and that means all the others have zero influence. Now, does that mean we should expect that some of the weights will carry a zero gradient here?
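To make the conv-layer example concrete, here is a worked version of the equations described above. The concrete sizes are my assumption, since the slide is not reproduced in the transcript: a 3x3 input X, a 2x2 kernel W, stride 1, and a 2x2 output O.

```latex
% Forward pass: each output is a dot product of the kernel
% with one local region of the input.
\begin{aligned}
o_{11} &= w_{11}x_{11} + w_{12}x_{12} + w_{21}x_{21} + w_{22}x_{22}\\
o_{12} &= w_{11}x_{12} + w_{12}x_{13} + w_{21}x_{22} + w_{22}x_{23}\\
o_{21} &= w_{11}x_{21} + w_{12}x_{22} + w_{21}x_{31} + w_{22}x_{32}\\
o_{22} &= w_{11}x_{22} + w_{12}x_{23} + w_{21}x_{32} + w_{22}x_{33}
\end{aligned}
% Every weight appears in all four outputs, so its derivative
% collects four terms:
\frac{\partial L}{\partial w_{11}}
  = \frac{\partial L}{\partial o_{11}}x_{11}
  + \frac{\partial L}{\partial o_{12}}x_{12}
  + \frac{\partial L}{\partial o_{21}}x_{21}
  + \frac{\partial L}{\partial o_{22}}x_{22}
% The central input x_{22} is used by all four output locations,
% so its derivative also has four terms:
\frac{\partial L}{\partial x_{22}}
  = \frac{\partial L}{\partial o_{11}}w_{22}
  + \frac{\partial L}{\partial o_{12}}w_{21}
  + \frac{\partial L}{\partial o_{21}}w_{12}
  + \frac{\partial L}{\partial o_{22}}w_{11}
```

A corner input such as x11 appears only in o11, and an edge input such as x12 appears only in o11 and o12, which matches the remark above that the other inputs influence at most two of the outputs.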
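And here is a minimal NumPy sketch of the "stretched kernel times flattened input" view (im2col). The function name and the single-channel 4x4-input, 3x3-kernel, stride-1 setup are illustrative assumptions, and "convolution" here means cross-correlation, as is usual in deep learning.

```python
import numpy as np

def im2col(x, kh, kw):
    """Stack every kh-by-kw patch of a 2-D input as one column."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(16, dtype=float).reshape(4, 4)  # 4x4 input
w = np.random.randn(3, 3)                     # 3x3 kernel

cols = im2col(x, 3, 3)                 # shape (9, 4): 9 weights, 4 locations
out = (w.ravel() @ cols).reshape(2, 2) # stretched kernel @ column matrix

# Sanity check against the direct sliding-window definition.
ref = np.array([[(x[i:i + 3, j:j + 3] * w).sum() for j in range(2)]
                for i in range(2)])
assert np.allclose(out, ref)
```

Because the whole layer is now a single matrix multiplication, the backward pass follows from the standard matrix-multiplication gradient rules.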
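Finally, a sketch of the argmax bookkeeping described above for max pooling. The helper names are hypothetical, and a single 1-D pooling window is used to keep it short.

```python
import numpy as np

def maxpool_forward(x):
    """Max over one pooling window; the argmax index is stored for backprop."""
    idx = int(np.argmax(x))
    return x[idx], idx

def maxpool_backward(grad_out, idx, n):
    """Route the upstream gradient only to the input that was the maximum."""
    grad_x = np.zeros(n)
    grad_x[idx] = grad_out  # every other input gets exactly zero gradient
    return grad_x

x = np.array([0.3, 1.7, -0.5, 0.9])
y, idx = maxpool_forward(x)                  # y == 1.7, idx == 1
grad_x = maxpool_backward(1.0, idx, x.size)  # -> [0., 1., 0., 0.]
```

Only the winning input carries gradient back through the window, which is exactly the zero-influence behaviour the closing question asks about.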