In this practical we will go through very famous convolutional neural network architectures that have contributed substantially to the state of the art in image recognition. We will start with LeNet, which was trained on the MNIST (modified NIST) dataset of 60,000 images of size 28 pixels by 28 pixels, representing the digits from 0 to 9. Then we will introduce the ImageNet challenge and how it has incentivized the submission of very powerful convolutional neural network models. We will start with AlexNet, then we will move on to GoogLeNet, which introduces the inception architecture, and we will conclude with ResNet, which introduces the bypass connection.

LeNet-5 was one of the first convolutional neural networks. Its architecture has a convolution, a subsampling, another convolution, again another subsampling, and three fully connected layers. Let's see how to implement this in Torch. As we have seen, we have a spatial convolution that goes from one layer to six layers, with kernels of five by five. Then we have a max pooling with a kernel of two by two and a stride of two by two. Then we have a tanh non-linearity. There follows a spatial convolution going from six layers to sixteen layers, also using a five by five kernel. Then comes a spatial max pooling with a running window of two pixels by two pixels and a stride of two in both directions, and one more tanh non-linearity. Then we have a view that flattens the 3D output into a 1D vector, so that we can apply a linear layer going from 16 times five times five to 120. Then we have a tanh non-linearity, a linear layer going from 120 activations to 84 activations, and the last tanh non-linearity. Finally, the output layer goes from 84 neurons to the last 10 output neurons, and at the end we return the network. Let's see how it looks in the terminal. We can open our interpreter by typing th.
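As a quick sanity check of the sizes quoted above, we can trace the spatial dimensions layer by layer (a small sketch in Python; it assumes the 28 by 28 digits are zero-padded to 32 by 32, as in the original LeNet-5):

```python
def conv_out(size, kernel, stride=1, pad=0):
    # standard output-size formula shared by convolution and pooling layers
    return (size + 2 * pad - kernel) // stride + 1

s = 32                 # 28x28 digit, zero-padded to 32x32 (assumption)
s = conv_out(s, 5)     # spatial convolution 1 -> 6, 5x5 kernel: 28
s = conv_out(s, 2, 2)  # max pooling 2x2, stride 2:              14
s = conv_out(s, 5)     # spatial convolution 6 -> 16, 5x5:       10
s = conv_out(s, 2, 2)  # max pooling 2x2, stride 2:               5
flat = 16 * s * s      # flattened size fed to the first linear layer
print(flat)            # 400, i.e. 16 * 5 * 5
```

This is where the 16 times 5 times 5 = 400 input of the first linear layer comes from.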
We will require the pretty printer, so that it is going to be easier to understand the structure and architecture of these networks, and then require the LeNet script. And here we have the architecture. As we have seen, we have 12 layers: spatial convolution, max pooling, tanh, spatial convolution again, spatial max pooling, tanh. Then we have a view, which gives us a one-dimensional view of the 3D feature map. And then we have a linear layer from 400 to 120, a non-linearity, another linear layer, a non-linearity, and a final output layer.

The ImageNet challenge consists in correctly classifying one out of 1,000 classes. For each class there are 1,000 training images, so overall there are roughly 1 million images on which a convolutional neural network can be trained. The first network was AlexNet. You can see that its top-1 single-model, single-crop accuracy is roughly 55%. That was a huge improvement with respect to traditional computer vision techniques. Then we have batch-normalised AlexNet and batch-normalised Network in Network, and then we see GoogLeNet, named in honour of LeNet, the first network we just saw, which achieves an incredible 68%. Then we have the introduction of ResNet, which goes up to an astounding 77%. At the end we see Inception v3, which gets an even higher accuracy, still for single-crop, single-model evaluation.

Not all these networks have been designed to keep the computation under a specific budget. As we can see from this chart, on the x-axis I represent the number of operations that each network requires. On the y-axis, instead, you can see the top-1 single-crop, single-model accuracy. The size of the blobs represents the size of the networks in terms of millions of parameters. We can see in the bottom left there are AlexNet and the batch-normalised AlexNet.
Way smaller, just above AlexNet, we have batch-normalised Network in Network, which obtains higher accuracy with a similar amount of operations but with way fewer parameters. About 10% above batch-normalised Network in Network, we can see GoogLeNet, which is the smallest network among all these architectures. It has the highest accuracy density: it achieves 68% with roughly 3.5 giga-operations. Then we can see that the networks turn to the right, following a hyperbola. If you check my article, "An Analysis of Deep Neural Network Models for Practical Applications", I explain this figure in greater detail, together with many more relationships among the best networks trained on the ImageNet challenge. Finally, we can see that ResNet-101 requires about 5 giga-operations more than Inception v3, which is instead a better compromise between high accuracy and a still moderate number of operations.

We will now see how the first of these architectures was created. AlexNet's author, Alex Krizhevsky, submitted the first convolutional neural network to the ImageNet challenge. Its architecture is as follows. Let's see how we can implement this in Torch. As we have seen, there are two identical branches. Let's start by building one of these two branches and then clone it. The first branch, fb1, is a sequential module, which has a spatial convolution going from the 3 RGB channels to 48 maps, using a kernel of 11 by 11, a stride of 4 by 4 and a padding of 2 by 2 on each side, which accounts for 4 extra pixels in both directions. Then we have a rectified linear unit. We have a spatial max pooling with a window of 3 pixels by 3 pixels and a stride of 2 pixels by 2 pixels. This means that the size of the window is larger than the stride. This was done in order to prevent loss of information, but it is no longer commonly used.
Instead, the rectified linear unit popularised by Alex Krizhevsky is now extensively and very commonly used, although it has some drawbacks, like duplication of filters in phase opposition. Then we have a spatial convolution going from the 48 maps to 128 maps, using a convolutional kernel of 5 pixels by 5 pixels, a stride of 1 by 1 in both directions and a padding of 2 pixels by 2 pixels. In this way the output size does not change: we start with 27 pixels in input and we also end up with 27 pixels in output. We have again a non-linear function. There follows a spatial max pooling, again with a running window of size 3 by 3 and a stride of 2 by 2. Then another spatial convolution going from 128 maps to 192 maps, using a kernel size of 3 by 3, a stride of 1 by 1 and a padding of 1 by 1, which adds 2 pixels in both directions; therefore the input and the output are the same size, 13 pixels by 13 pixels both in input and in output. Again, a non-linear function. Another spatial convolution going from the 192 maps to another 192, using a 3 by 3 kernel, a stride of 1 by 1 and a padding of 1 by 1, which again preserves the dimensionality, starting with 13 by 13 activations and also outputting activations 13 in height and 13 in width. Another non-linear function. Finally, there is the last spatial convolution going from 192 channels to 128, again using 3 by 3 kernels, a stride of 1 by 1 and a padding of 1 pixel in both directions, which again preserves the dimensionality of the maps. There is a non-linear function and then a spatial max pooling with a running window of size 3 by 3 and a stride of 2 by 2, which halves the dimensionality of the maps, going from 13 pixels square to 6 pixels square. Then we clone our first branch, and we reset the kernel values in order to break symmetry; otherwise the two branches would converge identically to the same solution.
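The spatial sizes quoted for this branch can be checked with the same kind of bookkeeping (a Python sketch; the 224 by 224 input size is an assumption based on the standard AlexNet setup):

```python
def conv_out(size, kernel, stride=1, pad=0):
    # output-size formula shared by convolution and pooling layers
    return (size + 2 * pad - kernel) // stride + 1

s = 224                    # input image side (assumed 224x224 RGB)
s = conv_out(s, 11, 4, 2)  # conv 3 -> 48, 11x11, stride 4, pad 2: 55
s = conv_out(s, 3, 2)      # max pool 3x3, stride 2:               27
s = conv_out(s, 5, 1, 2)   # conv 48 -> 128, 5x5, pad 2:           27
s = conv_out(s, 3, 2)      # max pool 3x3, stride 2:               13
s = conv_out(s, 3, 1, 1)   # conv 128 -> 192, 3x3, pad 1:          13
s = conv_out(s, 3, 1, 1)   # conv 192 -> 192, 3x3, pad 1:          13
s = conv_out(s, 3, 1, 1)   # conv 192 -> 128, 3x3, pad 1:          13
s = conv_out(s, 3, 2)      # max pool 3x3, stride 2:                6
flat = 2 * 128 * s * s     # two branches concatenated: 256 * 6 * 6
print(s, flat)             # 6 9216
```

The flattened 256 times 6 times 6 value is what the classifier's view will consume.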
Then we concatenate the two branches, one next to the other, with the nn.Concat module. Finally, we apply the classifier, which is a fully connected block, as we have seen in the previous lessons. There is a sequential container, which contains a view of the 3D input feature map. There is a dropout layer, which introduces regularization in the network. There is a linear layer going from the view of 256 times 6 times 6 to 4096 activations. Then there is again a rectified linear unit. Another dropout, then another linear layer, which goes from 4096 to 4096 and which therefore has a weight matrix of 4096 by 4096. Again a rectified linear unit, and then the final linear layer, which goes from 4096 neurons to the 1000 output neurons. Finally, there is a LogSoftMax, which belongs with the criterion and which we shouldn't care too much about at this moment. The last instructions simply put in a sequential the features, which are the two concatenated branches, and the classifier.

Let's see how it looks. We can run our interpreter by typing th. We can require the pretty printer, and then we can require AlexNet. And here we have it. We can see there are two identical branches with the convolutional, non-linear and max-pooling layers we have just defined. After the two parallel branches, we have the fully connected classifier.

Now we see GoogLeNet, whose first author is Christian Szegedy. This is the architecture. There are two interesting points. It introduced the inception module, where we can see different convolutional layers with different kernel sizes applied to the same input. Moreover, we have auxiliary classifiers, which are used for improving performance. There are also later versions of the inception architecture: GoogLeNet was v1, and then we have v2, v3 and v4. This architecture is a little bit more complicated with respect to the ones we have already seen. Let's start by defining some shortcuts.
In my case SC means spatial convolution, SMP means spatial max pooling, RLU means ReLU. This way I can save space afterwards. The first part is the function which allows me to create an inception module given the input size and a configuration table. An inception module is a concatenation of several modules in parallel; therefore we use a depth-concatenation module, nn.DepthConcat. conv1 is a sequential module which has a spatial convolution with input size equal to the input size, output size equal to config[1][1] and a kernel of dimensionality 1 by 1. Then there is a ReLU non-linearity. At the end we add conv1 to the depth-concat module. We continue with conv3, which is a sequential as well. It has a 1 by 1 spatial convolution layer and then another spatial convolution with a kernel size of 3 by 3, a stride of 1 by 1 and a padding of 1 by 1, so that dimensionality is preserved. Both layers have a ReLU non-linearity, and conv3 is also added to the depth-concat. Following, we have conv5, which is also a sequential of two spatial convolutions. The first one is performed with a 1 by 1 kernel and the second one has a 5 by 5 kernel, a stride of 1 and a padding of 2 by 2, so that we have 4 extra pixels in both directions, which preserves the dimensionality. Both layers have a ReLU non-linearity, and we add conv5 to the depth-concat. Finally, there is the pool branch, which is a sequential as well, in which we have a spatial max pooling and a spatial convolution, and we add this module too to the depth-concat.

The first layer of the GoogLeNet is a factorized convolution. We start with a sequential module. We enforce contiguity of the tensor we insert, and then we view the RGB colour input image as 3 images of 1 plane each, so that I can use a convolution to go from each of these planes to 8 separate maps, using a spatial convolution that goes from 1 input to 8 outputs, a kernel size of 7 by 7, a stride of 2 by 2 and a padding of 3 pixels by 3 pixels.
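Returning to the inception module for a moment: since its four branches are concatenated along the depth dimension, the module's output channel count is simply the sum of each branch's final layer. A small sketch (the configuration numbers below are illustrative assumptions, not values from this lecture):

```python
def inception_out_channels(cfg):
    # cfg: {'1x1': n, '3x3': (reduce, n), '5x5': (reduce, n), 'pool': n}
    # each branch contributes its final layer's maps to the depth-concat
    return cfg['1x1'] + cfg['3x3'][1] + cfg['5x5'][1] + cfg['pool']

# illustrative configuration (hypothetical numbers)
cfg = {'1x1': 64, '3x3': (96, 128), '5x5': (16, 32), 'pool': 32}
out = inception_out_channels(cfg)
print(out)  # 256 maps leaving the depth-concat
```

Because every branch preserves the spatial size (via the 1 by 1, padded 3 by 3, padded 5 by 5 and pooling settings described above), concatenating along depth is the only dimension that changes.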
Then we have a depth-wise convolution, which is performed with an nn.Parallel module to which we add 3 convolutional layers. Finally, the factorized convolution is a sequential of the depth-wise parallel module, to which we apply a non-linear function and then a spatial convolution going from 8 times 3, that is 24, to 64 maps, using a kernel of 1 by 1.

To build the rest of the network we will build these blocks separately. We will have a main0 block which is sent to both main1 and softmax0. Main1 is sent to both main2 and softmax1, and then main2 goes to softmax2. Main0 is going to be a sequential to which we add the factorized convolution. Then we have a spatial max pooling, a spatial convolution, another spatial convolution, a spatial max pooling, then two inception modules, another spatial max pooling and another inception module. The main1 block instead is made of three inception modules. Main2 is made of one inception module, a spatial max pooling and two more inception modules. The first auxiliary classifier, called softmax0, is a sequential module which has a spatial average pooling, then a spatial convolution, a view, a linear, a dropout, another linear and then a final LogSoftMax. Softmax1 also has an average pooling, a spatial convolution, a view, a linear, a dropout, a linear and a LogSoftMax. And the final softmax2 has again a spatial average pooling, a view, a dropout, a linear and a LogSoftMax. Here we use a spatial average pooling instead of a spatial max pooling because we want to minimise the error introduced by so many layers by averaging it out.

Now we can assemble all these blocks, like Lego blocks, into the final architecture. We will start from the end. Block2 is going to be a sequential which simply cascades main2 and softmax2. Then we build split1: split1 is performed with a concat of block2 (the cascade of main2 and softmax2) with softmax1.
These two parallel branches are in series with main1, so we use a sequential, block1, which has main1 and split1. Block1 is in parallel with softmax0; therefore we use a concat, split0, to which we feed block1 and softmax0. Finally, there is the last series connection, where we see main0 connected to split0. So this is very similar to how someone would draw an electrical circuit with resistors in parallel and in series, using concat in order to make branches in parallel and sequential in order to make branches in series. Finally, we can call the model block0 and we return the model.

Let's run our interpreter, require the pretty printer and require the GoogLeNet script. Let's go fullscreen; it's quite long. Let's try to understand what's going on. At the beginning we have the main sequential, which contains two blocks, main0 and split0. If we go inside every block, we find the first block, where the RGB image is split into three separate planes and a convolution is applied to each separate plane. Then we have a ReLU non-linearity, a spatial convolution and again a ReLU. Then there are the classical layers of the network: spatial max pooling, spatial convolution, ReLU, spatial convolution and again spatial max pooling. At number 8 we have our first inception module, where we can see it's a concat of four modules in parallel. The first one is a spatial convolution with a kernel of 1 by 1. The second one is a series of a spatial convolution with kernel 1 by 1 and a convolution with a 3 by 3 kernel. There is another one with 1 by 1 and 5 by 5. And the final one is the max pooling branch. There follows another inception module, a spatial max pooling, another inception module. Then here we have our concat, which is the branch for the auxiliary classifier.
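The electrical-circuit analogy can be made concrete with plain functions: a sequential composes modules in series, while a concat feeds the same input to every branch in parallel (a toy sketch in Python, not the actual Torch containers):

```python
def sequential(*modules):
    # series connection: each module's output feeds the next
    def run(x):
        for m in modules:
            x = m(x)
        return x
    return run

def concat(*branches):
    # parallel connection: every branch receives the same input
    def run(x):
        return [b(x) for b in branches]
    return run

# toy "modules" standing in for main blocks and auxiliary classifiers
double = lambda x: 2 * x
inc = lambda x: x + 1

split = concat(double, inc)      # two branches in parallel
model = sequential(inc, split)   # a series stage feeding the split
result = model(3)
print(result)  # [8, 5]
```

Nesting these two combinators the same way block0, split0, block1 and split1 are nested reproduces the branching topology of the full network.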
In the first part we see more inception modules, and if we follow the red line on the left we can see our first sequential, which is our first auxiliary classifier. We can go up and move inside the network. Inside the network we have an inception module, another inception module, another inception module, and here output number 2 goes into an inception module on the right; we can go down following the second red line, where we see our second auxiliary classifier. We can climb up, follow the network again with an inception module, a spatial max pooling, another inception module, another inception module, and then we have the final classifier.

The final architecture we're going to encounter today is the ResNet, which comes from the publication "Deep Residual Learning for Image Recognition" by Kaiming He and colleagues at Microsoft Research. This paper uses ideas from highway networks in order to train a neural network for the ImageNet challenge. We see that the main principle here is a set of connections that come out of the network and bypass a couple of layers. We can see here a better view: each couple of layers is bypassed by an identity connection. In this way the learning problem can be simplified, easing gradient descent. Moreover, the gradient signal has a clear path to backpropagate through to the early layers of the network, therefore avoiding gradient degradation.

For this last network I will show you how it looks in Torch, and then you, as an exercise, should try to write down this network by yourself, so that you can be sure you know how it can be generated. So here I have my ResNet-18. I can run th and require the pretty printer. We also require cudnn, which is a CUDA library used for these convolutional neural networks. And then I can torch.load the ResNet-18 model, and here it is. From the beginning we have a sequential with 11 modules. We start with a spatial convolution going from 3 to 64 maps. Then there is a spatial batch normalization.
This layer, similarly to dropout, helps regularization. Then we have a rectified linear unit, followed by a spatial max pooling. Then we have our first block with a bypass connection. We have a sequential of a sequential. In the second sequential we have a concat table, where we can see the first branch: it's a sequential of spatial convolution, spatial batch normalization, ReLU, spatial convolution, spatial batch normalization, which is bypassed in parallel by an nn.Identity. Finally, the two outputs, which are tables since we use a concat table, are added together using a CAddTable, to which a rectified linear unit is appended. There follows, in the sequential, the second of the two sequentials that module number 5 has. It is exactly like the one above: we have a concat table which splits the network in two parts; we have a spatial convolution, spatial batch normalization, rectified linear unit, again a spatial convolution and a spatial batch normalization, which is bypassed in parallel by an nn.Identity. The two outputs, which are contained in a table, are summed together with the nn.CAddTable, and a rectified linear unit follows.

Then we have layer 6, which is simply another sequential. So basically these two modules, number 2 here and number 1, could simply have been in the main branch. I believe this specific architecture was written this way because it generally allows one to add more branches and more features in the case of more complicated networks; it was unnecessary here. So module 5 could simply have been removed, and module 1 of 5 and module 2 of 5 could simply have been in the main branch. Then module 6 is again a sequential which contains a first sequential and a second sequential; as before, these two modules could have been in the main branch. Here the spatial convolution, spatial batch normalization and spatial convolution are bypassed by a spatial convolution which changes the number of maps using a 1 by 1 kernel.
Before, we saw an identity because the input of the block was 64 channels and the output of the last convolution was also 64. In this case the input is 64 channels but the output is 128. Since we have incremented the number of channels, we also have to increment the number of channels in the bypass connection. That's why we use a spatial convolution going from 64 to 128 with a kernel of 1 by 1 and a stride of 2 by 2, because the first convolution here also has a stride of 2 by 2. In this block we see that output number 2 is a spatial convolution, whereas before it was an identity. This is because we go from 64 maps to 128; before we were going from 64 to 64, and the stride was 1, whereas now we have a stride of 2. Since we make the size of the output feature map a quarter of the input, we double the number of channels in order not to lose a high amount of information. There follows another bypassed block: a 128-to-128 convolution with a 3 by 3 kernel, spatial batch normalization and ReLU; a spatial convolution, again 128 to 128 with a 3 by 3 kernel, a stride of 1 and a padding of 1 so as to preserve dimensionality; and a spatial batch normalization, which is bypassed by an identity, since we preserve the same number of channels across these modules. Finally, we sum the two and apply a non-linearity, and so on: there is another sequential with another bypass connection which uses a spatial convolution, because we go from 128 to 256. In this case we also use a kernel of 1 by 1 but a stride of 2 by 2, because the convolution at layer number 1 also has a stride of 2 by 2 and goes from 128 maps to 256 using a 3 by 3 kernel.
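The rule just described, an identity bypass when the shapes match and a 1 by 1 stride-2 convolution when the channel count or stride changes, can be summarised in a short sketch (the helper name is hypothetical, not part of the Torch code):

```python
def shortcut(in_maps, out_maps, stride):
    # identity bypass when the block preserves shape,
    # 1x1 projection convolution otherwise (as in ResNet)
    if in_maps == out_maps and stride == 1:
        return "identity"
    return f"1x1 conv {in_maps}->{out_maps}, stride {stride}"

a = shortcut(64, 64, 1)    # first blocks: shape preserved
b = shortcut(64, 128, 2)   # channel doubling, spatial halving
print(a)  # identity
print(b)  # 1x1 conv 64->128, stride 2
```

The same rule explains every remaining stage of ResNet-18: each time the stride is 2, the maps double and the bypass becomes a projection.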
There follow a summation of the two and a non-linearity, then another sequential where the number of channels of the feature maps is preserved; therefore there is an identity, a summation and a non-linearity. Then another sequential: again we have a spatial convolution with a stride of 2 by 2, and we therefore also increment the number of maps, going from 256 to 512; the bypass connection is hence performed with a convolution with a 1 by 1 kernel going from 256 maps to 512, using a stride of 2 by 2. A summation and a non-linearity follow. In another sequential the stride is 1 by 1, so we find an identity, and again a summation of the two and a non-linearity.

Finally, we have the last spatial average pooling, which, as before with GoogLeNet, improves the robustness against noise by averaging it out. We have a view of 512, and then a single linear layer going from 512 to 1000. Notice that this network has essentially no other linear layers: remember from the previous videos that linear layers are very expensive, especially for AlexNet and the network we have shown in practical 3.0.

Thank you for listening, and stay tuned, because in the next practical I will show you how to train these networks using the optim package from Torch.