You can see we'll jump down here. The very first thing that happens in the forward pass is that we read in the inputs and save them as a member attribute of the class. Then we check whether our weights are still None. If they are, we go initialize them, and now we have our first set of inputs to do that with. We'll use the shape of those inputs to pull out our number of channels, our number of inputs, and our number of outputs.

With that, we can create an array of weights using our initializer. Our initializers were built to create two-dimensional arrays of weights, but here we actually need a three-dimensional array. To reuse them rather than start from scratch, we flatten the three-dimensional array into a two-dimensional one of the right size by making the number of columns the number of channels times the kernel size. We initialize that, then reshape it so that it has the right three-dimensional structure. The order='C' argument makes sure the values get unpacked in the right order to recover that structure.

The other reason we do this, beyond convenience, has to do with how our initializer works. Our LSUV initializer, for instance, is intentionally set up so that, given inputs with zero mean and unit variance, it tends to produce outputs with zero mean and unit variance. Because of the way our convolution works across channels, the outputs from the different channels all get added together. So we don't just want a single kernel to have this property; we want the kernel across all the channels, when their contributions are added together, to produce outputs with zero mean and unit variance. To do that, we put each whole kernel in one row, let LSUV do its work, which we covered in a previous lecture, and then unpack the chunks for the individual channels into separate dimensions to get a three-dimensional array of weights. The values still have the right relationship to each other, so they'll still produce the right distribution of outputs given that same distribution of inputs.

Then we'll also initialize the bias values. The bias is just what gets added to the output of the layer, so it's the number of kernels by the number of outputs. We initialize it to zero. Because there are so many other non-zero things, all the other weights, it's not in danger of getting stuck at zero; there will be plenty of gradient signal to adjust it if it needs adjusting.
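To make the flow concrete, here's a minimal sketch in NumPy of the lazy initialization described above. The names (Conv1D, glorot) are illustrative stand-ins, not the lecture's actual code; I've assumed a stride-1 "valid" convolution and swapped in a simple Glorot initializer for LSUV, since LSUV's data-driven rescaling was covered in a previous lecture.

```python
import numpy as np

def glorot(n_rows, n_cols):
    """Stand-in 2-D initializer. The lecture uses LSUV; Glorot is shown here
    because it needs no sample data, but it aims at the same target: outputs
    with zero mean and roughly unit variance."""
    limit = np.sqrt(6 / (n_rows + n_cols))
    return np.random.uniform(-limit, limit, size=(n_rows, n_cols))

class Conv1D:
    def __init__(self, n_kernels, kernel_size):
        self.n_kernels = n_kernels
        self.kernel_size = kernel_size
        self.weights = None  # deferred: the shape depends on the first input
        self.bias = None

    def forward(self, inputs):
        # Save the inputs as a member attribute for the backward pass.
        self.inputs = inputs
        n_channels, n_inputs = inputs.shape
        # Assumes a stride-1 "valid" convolution.
        n_outputs = n_inputs - self.kernel_size + 1

        # Lazy initialization: the first set of inputs tells us the shapes.
        if self.weights is None:
            # The initializer only builds 2-D arrays, so lay each 3-D kernel
            # out as one row of n_channels * kernel_size columns...
            flat = glorot(self.n_kernels, n_channels * self.kernel_size)
            # ...then unpack it. order="C" varies the last axis fastest, so
            # each row splits cleanly into (n_channels, kernel_size) chunks.
            self.weights = flat.reshape(
                (self.n_kernels, n_channels, self.kernel_size), order="C")
            # Zeros are safe for the bias: with all the non-zero weights
            # around it, there is plenty of gradient signal to move it.
            self.bias = np.zeros((self.n_kernels, n_outputs))

        # Slide each kernel across the input. The channel contributions are
        # summed here, which is why the per-row statistics of the flat
        # initialization are the ones that matter.
        out = np.zeros((self.n_kernels, n_outputs))
        for k in range(self.n_kernels):
            for i in range(n_outputs):
                window = inputs[:, i:i + self.kernel_size]
                out[k, i] = np.sum(window * self.weights[k])
        return out + self.bias

layer = Conv1D(n_kernels=4, kernel_size=3)
x = np.random.normal(size=(2, 16))  # 2 channels, 16 positions
y = layer.forward(x)                # weights get created on this first call
print(y.shape)                      # (4, 14)
```

The key detail is that each row of the flat array holds one complete kernel covering every channel, so whatever per-row output statistics the initializer arranges survive the reshape. order="C" just regroups the same numbers into (channel, position) chunks without reordering them.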