So let's talk some more about pooling layers. What we want to do is reduce the dimensions of the data without introducing additional parameters, and we saw that we can do that with max pooling. We also wanted to introduce some translation invariance; compare that with what we learned about complex cells all the way in the beginning. So we pool over a window of the image. We used max pooling here; we could also use average pooling, but max pooling seems very meaningful in this context.

Now there are two rules of thumb. The first is that we want lots of small max pools and lots of small convolutions. Why? As you saw, when a feature moves from one of these max pool windows to the next, something very different happens: in one case one output will be high, and in the other case the other will be high. The system can learn to minimize these local discontinuities, and if we use lots of small windows, it can do so better. So we want lots of small max pools and convolutions; it's better to have more layers than to do one big convolution or max pool at a time.

The second rule of thumb is that, as we go up the levels, max pooling means we will have a smaller number of spatial locations. In general, as we produce smaller numbers of locations, we will produce larger numbers of features. Why? We want to avoid strong information loss at any single place in the network. So as we go through the network, we will have fewer spatial locations and more features. We do want some information loss; we want to ignore aspects of the image that are not indicative of which class it is. But we don't want to do too much of that at any one step.

Now let's focus a little on the parameters, because that shows just how incredibly useful convnets are. Say we start with a fully connected network with images as input: small, down-sampled images of 256 by 256 pixels, which is roughly 64,000 input units. In layer 1, say we down-sample by a factor of 2 in each direction, to 128 by 128, which is roughly 16,000 units. A fully connected layer between them has roughly 64,000 times 16,000 weights, which is about 1 billion parameters. A billion parameters is a mightily big network, and we'd be using all of them exclusively on the first, rather boring, fully connected layer. That gives you an idea of why fully connected networks are not feasible for a lot of real-world computer vision problems.

Now let's compare a convnet to that. We also want to go from 256 by 256 to 128 by 128. What do we do? We down-sample by a factor of 2 in each direction, so we end up with a factor of 4 fewer spatial locations, and following the rule of thumb we might want a factor of 4 more features, so four kernels. Say we use fairly big convolutions, 5 by 5, so each kernel has 25 parameters: that gives us 4 times 25, or 100 parameters. This example shows you it's 100 parameters versus a billion parameters, so convolutions are massively useful in keeping the number of parameters small.

There's another thing you can see here, which is why it might be useful to do more layers of small convolutions. A single 5 by 5 convolution has 25 parameters. Alternatively, we could do 5 layers of, say, 2 by 2 convolutions, which is 5 times 4, or 20 parameters, so fewer. And so there are many cases where it's useful to take small steps of convolution and max pooling.
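To make the pooling operation concrete, here is a minimal Python sketch of 2 by 2 max pooling (the function name and the NumPy-based implementation are illustrative choices, not anything from the lecture). The point to notice is that it halves each spatial dimension while introducing zero learnable parameters:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (H, W) feature map.

    This layer has no learnable parameters; it only reduces the
    spatial dimensions by a factor of 2 along each axis.
    """
    h, w = x.shape
    # Trim to even dimensions, reshape so each 2x2 window gets its
    # own pair of axes, then take the max over those window axes.
    x = x[: h - h % 2, : w - w % 2]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A 4x4 feature map pools down to 2x2: each output is the max of one window.
fmap = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fmap))
# [[ 5  7]
#  [13 15]]
```

Notice that as long as the maximum stays inside the same 2 by 2 window, the output does not change at all; that is the small amount of translation invariance we wanted.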
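And to check the parameter arithmetic above, here is a short sketch that reproduces the counts from the example (bias terms are omitted for simplicity, and the layer shapes are just the ones used in the lecture example):

```python
# Fully connected: a 256x256 input layer to a 128x128 layer.
fc_params = (256 * 256) * (128 * 128)
print(f"fully connected: {fc_params:,}")  # 1,073,741,824 (~1 billion)

# Convolutional: four 5x5 kernels give the same factor-2 down-sampling.
conv_params = 4 * (5 * 5)
print(f"convolutional:   {conv_params}")  # 100

# Stacking small convolutions: one 5x5 kernel vs. five 2x2 kernels.
print(5 * 5)        # 25 parameters for a single 5x5 convolution
print(5 * (2 * 2))  # 20 parameters for five stacked 2x2 convolutions
```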
So now I want you to really understand how the hyperparameters of convnets, the choices about kernel size, stride, and so forth, affect the number of parameters. We built a little widget for you, and I want you to see in particular how max pooling affects the number of parameters. So go play with that widget a little bit.