So, let's do a little recap of the last tutorial. What did we do? We've seen how brains seem to progressively extract larger features while at the same time becoming more and more invariant to things like location. We thought a little about the problem of what makes a representation useful. Then we've seen how convolutions work, and how they produce equivariance, meaning that if we move the image a little bit, we produce the same activity, just at a different place in the map. And we've seen how max pooling allows a certain amount of invariance, and how, by alternating between these two operations, we can progressively build up invariances. What will we do today? We will start putting things together and train them, first on simple data sets and standard architectures, and then talk about the broad applicability of convnets.

So let's just recapitulate what convnet architectures look like. We might have an image as input, and this image might be 28 by 28 pixels, as in the case of MNIST. Then we have a convolution that, provided we do some amount of padding, gives us six feature maps, also 28 by 28. Then comes a max pool: the same number of features, six, but now it's a 14 by 14 map, which means we did a 2 by 2 max pool with a stride of 2. Then we again do a convolution, and now we have 16 feature maps of 10 by 10. Why did it just get smaller? Well, it's an issue of padding: this time we don't pad, so a 5 by 5 kernel on a 14 by 14 input gives 10 by 10. After this convolution comes another max pool, so at this point we have 16 feature maps, each 5 by 5. Now we can of course flatten this, producing a long vector, and then switch to a dense network, where we go to 120 units, then 84 units, and finally 10 units at the end. So this is what a larger convnet looks like; a code sketch of this architecture follows below.

Another, much more convenient way to visualize this: we take an image, apply a 5 by 5 convolution with padding of 2, then a 2 by 2 average pool with stride 2. Note that here it isn't a max pool but an average pool, another way of doing it, with similar results. Then another 5 by 5 convolution, now with 16 channels, another average pool, and then a dense layer, another dense layer, and another dense layer. Observe another thing here: with each pool we go to a smaller spatial scale by a factor of 2, while the first convolution introduces 6 times as many channels as we started with. So at first we have more channels, if you want, than in the beginning, and after the second average pool we are down to 16 channels of 5 by 5. Basically, when we do the convolutions we make the channel dimension bigger, and the max pool, or the average pool as we use it here, makes the spatial maps smaller again. Then we have multiple dense layers, which always get a little bit smaller. In general, architectures where we compress the information a little, layer after layer, often do really well.

Now let's look a little bit at the output dimensions; this is partly a recap of week one. What will be the output dimensions of a single two-dimensional convolution that has an input of size 300 by 400, a kernel size of 5 by 5, a stride of 1, and a padding of 2? And once we are there, what will be the output dimensions of a subsequent 2D max pool with a 2 by 2 filter, a stride of 2, and no padding? Your turn to calculate; the general formula at the end of this section may help.
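To make the architecture we just walked through concrete, here is a minimal sketch in PyTorch. It follows the numbers above (6 and 16 channels, 5 by 5 kernels, average pooling as in the second visualization, then 120, 84 and 10 units), but the choice of PyTorch and of sigmoid activations is my assumption; the transcript does not specify either.

```python
import torch
from torch import nn

# Minimal sketch of the LeNet-style convnet described above (my reconstruction).
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),  # 1x28x28 -> 6x28x28
    nn.AvgPool2d(kernel_size=2, stride=2),                    # -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),             # no padding: -> 16x10x10
    nn.AvgPool2d(kernel_size=2, stride=2),                     # -> 16x5x5
    nn.Flatten(),                                               # -> 400
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),
)

# Pass a dummy MNIST-sized image through and print the shape after each layer.
X = torch.rand(1, 1, 28, 28)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:', X.shape)
```

Running the shape-printing loop reproduces the sizes mentioned above: 6 maps of 28 by 28, 6 of 14 by 14, 16 of 10 by 10, 16 of 5 by 5, then a flattened vector of 400, and dense layers of 120, 84 and 10.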
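For the exercise above, the standard formula for the output size of a convolution or pooling layer along one spatial dimension may help; the notation (input size $n_{\text{in}}$, kernel size $k$, stride $s$, padding $p$) is mine:

$$
n_{\text{out}} = \left\lfloor \frac{n_{\text{in}} + 2p - k}{s} \right\rfloor + 1
$$

Apply it separately to the height (300) and the width (400), first for the convolution and then for the subsequent max pool.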