This may seem like the long way around, but breaking it out into digestible steps, steps that are straightforward to visualize and to talk about, and then seeing where it works and where it doesn't as I was coding it up, was a great way to develop this algorithm and to build confidence that each step was doing what it was supposed to do. At this point you can see there's plenty of opportunity to take this code, condense it down into many fewer lines, pull out some of the documentation, and probably speed it up too. But for our case, what we want is clear research code, and for now this suits our needs just fine.

Now we're going to go back and repeat this process for calculating the weight gradient and calculating the input gradient. For the weight gradient we're on the backward pass, so our signal is now the output gradient, and we have our inputs as well. Using the relationships that we derived previously, those equations, we're going to calculate the partial derivative of the loss with respect to each weight in each of our kernels. We'll break this out into several steps to keep it nice and clear.

The first step is to go from a whole kernel set of output gradients to a single kernel. We calculate what the size of the whole kernel set's result should be, number of channels by kernel width by number of kernels, the total size of all the weights we have, and we initialize the result. Then we iterate, taking one kernel at a time. We take that whole set of weights and handle it one slice at a time, so now we're dealing with a two-dimensional set of weights instead of a three-dimensional one, a single kernel's worth, and we can stop and calculate a single kernel's weight gradient.

So we again have our full set of inputs, and we have just one kernel's worth of the output gradient, a one-dimensional set of output gradients, and we can step down to a single kernel's weight gradient. Again, we calculate what the size of the output should be, which is now just the number of channels by the width of a kernel, and we iterate through each of those channels. We pull out the inputs associated with that channel, we have our full set of output gradients, and now we're down to where we can do a cross-correlation.

Remember, to get the gradient of the weights we had to first flip the kernel and then do a convolution. But cross-correlation is already convolution after the kernel has been flipped, and that's two flips, so we can leave it unflipped and jump straight to a cross-correlation. That saves us a couple of steps, speeds things up a little bit, and simplifies the code. It also explains why we went to the trouble of separating out our cross-correlation function: so we could get to it directly.

So by cross-correlating our inputs with our output gradient, we get the result for that particular channel, which is the weight gradient for that channel. We fill in the result channel by channel, the previous function fills in those results kernel by kernel, and by the time we're done we have a three-dimensional array of weight gradients, one for each weight.
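To make these steps concrete, here is a minimal NumPy sketch of the structure just described. The names (cross_correlate_valid, calculate_single_kernel_weight_gradient, calculate_weight_gradient) and the array layout, inputs as (n_channels, n_inputs), output gradients as (n_kernels, n_outputs), kernels as (n_kernels, n_channels, kernel_width), are assumptions for illustration; the real code may order its axes differently and, as noted below, decorates these functions for compilation.

```python
import numpy as np


def cross_correlate_valid(signal, kernel):
    """1D cross-correlation in 'valid' mode: the kernel must fully overlap the signal."""
    n_result = signal.size - kernel.size + 1
    result = np.zeros(n_result)
    for i in range(n_result):
        result[i] = np.sum(signal[i: i + kernel.size] * kernel)
    return result


def calculate_single_kernel_weight_gradient(inputs, output_gradient, kernel_width):
    """Weight gradient for one kernel: one cross-correlation per input channel."""
    n_channels = inputs.shape[0]
    weight_gradient = np.zeros((n_channels, kernel_width))
    for i_channel in range(n_channels):
        # Cross-correlating this channel's inputs with the (1D) output gradient
        # gives the gradient for that channel's slice of the kernel.
        weight_gradient[i_channel, :] = cross_correlate_valid(
            inputs[i_channel, :], output_gradient)
    return weight_gradient


def calculate_weight_gradient(inputs, output_gradient, kernel_shape):
    """Weight gradient for the whole kernel set, filled in one kernel at a time."""
    n_kernels, n_channels, kernel_width = kernel_shape
    weight_gradient = np.zeros(kernel_shape)
    for i_kernel in range(n_kernels):
        weight_gradient[i_kernel, :, :] = calculate_single_kernel_weight_gradient(
            inputs, output_gradient[i_kernel, :], kernel_width)
    return weight_gradient
```

Because the forward pass produces n_outputs = n_inputs - kernel_width + 1 values per kernel, the valid cross-correlation of a channel's inputs with one kernel's output gradient comes out exactly kernel_width long, which is what lets each channel's slice drop straight into place.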
We do the same thing now with the input gradients. We pass in all of the output gradients and the entire set of kernels, and we use the relationships that we derived in the previous section. We pre-calculate what the full size of the result should be and initialize it.

Here we have to do an extra step, which is to add some zeros to either end of our output gradient, so that when we do our valid convolutions, where the signal has to completely overlap with the kernel, we get the right number of convolutions to recover our inputs. This is another way of doing convolution in the full mode, which means that whenever there is any overlap at all, even when just the tails of the kernel and the signal touch, we calculate those convolutions. This is how we go from a shorter signal, the output, to a longer signal, the input; it's the reverse operation of a valid convolution.

So we do the same thing here: we break down the whole set of kernels so that we're only operating on one kernel at a time. We calculate the size and shape of the result, which in this case will be the input gradient, and then for each iteration, for each kernel that we handle, we add to that result. We calculate the input gradient due to a single kernel, passing it a copy of the padded output gradient and a copy of the one kernel that we're interested in, and then we add that result to the total.

Then we go to the function calculate single kernel input gradient and handle just one kernel at a time. Again, we calculate the shape of the result we expect, number of channels by number of inputs. We have our padded output gradient now, so we can do our valid cross-correlation, passing it the full padded output gradient and our kernel from one particular channel, one channel at a time, and then fill in the result channel by channel and pass it back; a sketch of this whole path appears at the end of this section.

So you see that in both of these cases we've reduced the complexity of the calculation each time. We've eliminated one dimension of our array at each step until it becomes a nice one-dimensional convolution, or a convolution with a reversed kernel, which is a cross-correlation, and then we fill that in and pass it all the way back up.

All of these functions being decorated with @njit means that the first time they get called they're compiled down to fast machine code, and all of these nested for loops become very quick. They run fast enough that we're happy with them and don't care how big they are. When we get to two-dimensional, and possibly three-dimensional, convolutions it'll be worth looking at this even more carefully, perhaps profiling it to see what the slowest bits are and whether there's any way to speed them up, but for now this is just great and we'll move ahead.

So, stepping back to the top: we now have our conv1d block. We have our forward pass and our backward pass. Our forward pass takes in inputs and calculates the convolution result. Our backward pass takes in the output gradient; internally it calculates the weight gradient and updates the weights and the biases, and it also calculates the input gradient and passes that backward to any previous blocks. So this is now a building block for a convolutional neural network.
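Here is the matching sketch of the input gradient path mentioned above. It assumes the forward pass is a valid cross-correlation, so the input gradient works out to a full convolution of the output gradient with the kernel, implemented here as a valid cross-correlation of the zero-padded output gradient with the reversed kernel; where exactly that flip lives in the actual code may differ. The helper and argument names are again illustrative, and the small cross_correlate_valid helper from the previous sketch is repeated so the block stands on its own.

```python
import numpy as np


def cross_correlate_valid(signal, kernel):
    """Valid-mode 1D cross-correlation (same helper as in the earlier sketch)."""
    return np.array([np.sum(signal[i: i + kernel.size] * kernel)
                     for i in range(signal.size - kernel.size + 1)])


def pad_with_zeros(values, n_pad):
    """Add n_pad zeros to each end so a valid correlation behaves like a full one."""
    padded = np.zeros(values.size + 2 * n_pad)
    padded[n_pad: n_pad + values.size] = values
    return padded


def calculate_single_kernel_input_gradient(padded_output_gradient, kernel):
    """Input gradient contributed by one kernel, handled one channel at a time."""
    n_channels, kernel_width = kernel.shape
    n_inputs = padded_output_gradient.size - kernel_width + 1
    input_gradient = np.zeros((n_channels, n_inputs))
    for i_channel in range(n_channels):
        # Full convolution of the output gradient with this channel's weights,
        # done as a valid cross-correlation against the reversed kernel slice.
        input_gradient[i_channel, :] = cross_correlate_valid(
            padded_output_gradient, np.flip(kernel[i_channel, :]))
    return input_gradient


def calculate_input_gradient(output_gradient, kernels):
    """Total input gradient: accumulate the contribution from each kernel."""
    n_kernels, n_channels, kernel_width = kernels.shape
    n_outputs = output_gradient.shape[1]
    input_gradient = np.zeros((n_channels, n_outputs + kernel_width - 1))
    for i_kernel in range(n_kernels):
        padded = pad_with_zeros(output_gradient[i_kernel, :], kernel_width - 1)
        input_gradient += calculate_single_kernel_input_gradient(
            padded, kernels[i_kernel, :, :])
    return input_gradient
```

Padding each end with kernel_width - 1 zeros is what turns the shorter output gradient back into a signal long enough that the valid correlation produces exactly n_inputs values, one gradient per original input.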
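And to show how these pieces fit into a building block, here is a rough skeleton of what such a conv1d block could look like, reusing the helper functions from the two sketches above. The class name, the plain gradient-descent update, the bias handling, and the initialization are all placeholders for illustration, not the actual block's interface.

```python
import numpy as np


class Conv1D:
    """Illustrative conv1d building block: forward pass plus gradient-propagating backward pass."""

    def __init__(self, n_kernels, n_channels, kernel_width, learning_rate=1e-3):
        # Small random kernels and zero biases; the real block may initialize differently.
        self.kernels = np.random.normal(
            scale=1 / np.sqrt(n_channels * kernel_width),
            size=(n_kernels, n_channels, kernel_width))
        self.biases = np.zeros(n_kernels)
        self.learning_rate = learning_rate

    def forward_pass(self, inputs):
        """inputs: (n_channels, n_inputs) -> outputs: (n_kernels, n_outputs)."""
        self.inputs = inputs
        n_kernels, n_channels, kernel_width = self.kernels.shape
        n_outputs = inputs.shape[1] - kernel_width + 1
        outputs = np.zeros((n_kernels, n_outputs))
        for i_kernel in range(n_kernels):
            for i_channel in range(n_channels):
                outputs[i_kernel, :] += cross_correlate_valid(
                    inputs[i_channel, :], self.kernels[i_kernel, i_channel, :])
            outputs[i_kernel, :] += self.biases[i_kernel]
        return outputs

    def backward_pass(self, output_gradient):
        """output_gradient: (n_kernels, n_outputs) -> input gradient for the previous block."""
        # Gradients with respect to the weights and biases (helpers from the sketches above).
        weight_gradient = calculate_weight_gradient(
            self.inputs, output_gradient, self.kernels.shape)
        bias_gradient = np.sum(output_gradient, axis=1)
        # Gradient with respect to the inputs, to hand back to any previous blocks.
        input_gradient = calculate_input_gradient(output_gradient, self.kernels)
        # A simple gradient-descent step stands in for whatever optimizer the real block uses.
        self.kernels -= self.learning_rate * weight_gradient
        self.biases -= self.learning_rate * bias_gradient
        return input_gradient
```

The point of the skeleton is the shape of the interface: forward_pass caches its inputs and returns the convolution result, and backward_pass consumes the output gradient, updates the block's own parameters, and returns the input gradient so the blocks can be chained.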