In this chapter, we are going to introduce convolutional neural networks, or CNNs, specifically for image understanding. The concepts illustrated here can easily be extended to different kinds of input, such as audio or video; all that is required is a slight variation in the type of convolution that is used. We will start with the rationale and motivation: why are convolutional neural networks needed in order to process images? We will see that one fundamental aspect of a convolutional neural network is sparsity, which exploits locality in the input data; specifically, we will introduce the receptive field and the hierarchical view of deep models. The second important aspect of convolutional neural networks is that they perform parameter sharing, which is based on the assumption of homogeneity of the input data. We will then see how a convolutional layer performs a 3D convolution, in the case of image processing. Next we will talk about the non-linearity layer and the decaying learning speed across the network: we will revisit the logistic sigmoid and introduce the rectified linear unit. Moreover, we will introduce the LP pooling layer, of which two specific cases are average pooling and max pooling. Then we will go back to the rationale and draw some conclusions, showing what the benefits of using convolutional layers and pooling layers are. After that we will switch to the shell and the command line and install the pretty-nn package in order to get colour-coded network architectures, which will ease our understanding of the models and will be fundamental for the next practical. Finally, we will introduce the e-Lab Torch7 profiling tool; it is a tool I wrote around October 2014 that now has more than 200 commits from all the members of e-Lab and others, and with it we will profile a convolutional neural network in Torch and compare the results with what we compute here by hand.

In the field of image understanding, let's find out why convolutional neural networks have had such a major role. Let's start with a neural network which goes from, say, N neurons in input through three hidden layers of size twice the input, so 2N, 2N, 2N, and then to K outputs, and let's compute the number of operations such a network would require if used for image processing. Say we start with a colour image of 256 × 256 pixels with three planes in depth, the RGB colour channels. In this case N is equal to 3 × 256 × 256, which I can write as 3 × 2^8 × 2^8, that is 3 × 2^16. Overall we can count N + 2N + 2N + 2N + K neurons, so 7N + K, which here is 7 × 3 × 2^16, roughly 1.4 million. To go from the first layer to the second layer we use a matrix of height 2N and width N; then we have a square 2N × 2N matrix, another square 2N × 2N matrix, and finally a K × 2N matrix. If I count how many multiplications and additions we need, we have 2N² + 2 × 4N² + 2NK. We can neglect the last term, so overall we have (2 + 4 + 4)N², roughly 10N², which is equal to 10 × 9 × 2^32, roughly 390 billion multiply-accumulate operations, or MACs.
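For reference, here is a minimal sketch of this back-of-the-envelope count in Lua; the layer sizes and the one-MAC-per-clock-cycle assumption are the ones used in this lecture, and K = 1000 is just an example.

```lua
-- Back-of-the-envelope cost of the fully connected network N -> 2N -> 2N -> 2N -> K
local N = 3 * 256 * 256            -- RGB image of 256 x 256 pixels
local K = 1000                     -- e.g. 1000 output classes
local neurons = 7 * N + K          -- N + 2N + 2N + 2N + K
local macs = 2*N^2 + 4*N^2 + 4*N^2 + 2*N*K   -- one MAC per weight
print(neurons)                     -- ~1.4e6 neurons
print(macs / 1e9)                  -- ~387 GMAC
print(macs / 2.2e9)                -- ~176 s per image at one MAC per clock cycle
```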
Let's say we are on my MacBook Pro, which has an Intel Core i7 running at 2.2 GHz, and let's see how much time it takes to process one colour image of 256 × 256 pixels: we have 390 GMAC divided by 2.2 GHz, which gives us about 180 seconds, roughly 3 minutes. So just sending one image forward through this network takes about 3 minutes. Think about training, where we use many images in both directions, forward and back-propagation: it is really impossible to go this way. So what can we do? We can be smart and reduce the number of computations by exploiting the underlying distribution of our data. If we actually think about what we are trying to achieve here, instead of blindly applying a neural network to whatever problem we have, we see that our images have very distinct characteristics: they are natural images, not random noise. The two main points to keep in mind are these. First, images are highly spatially correlated: if two points on this screen are close to each other, it is likely that they have the same colour (most of the time they will be black, in this case), and if you take two nearby points in a natural image, their values will be very close. Second, in images, objects are made of parts. Let's see how to exploit these two particular aspects of images in order to reduce the amount of computation and therefore speed up the whole system.

The first thing that is very important to keep in mind when we speak about convolutional neural networks is that they introduce sparsity. Here we can see a fully connected layer: layer L−1 is connected to layer L with all the connections, as we have seen; moreover, there is the +1 for the bias here on top, connected to the three neurons on the right-hand side, in layer L, and the same again here, where the bias on top is connected to the single neuron of layer L+1. So how can we exploit spatial correlation in order to reduce the number of computations? We can say, for example, that the first neuron here only needs to see a subgroup of the input, because nearby input pixels carry information about the object, or the location, they come from, while it is quite unlikely that pixels far away contribute anything to characterising the specific location the first neuron is looking at. So sparsity basically means that the first neuron of layer L is going to look only at this neuron here, this second neuron over here, and this neuron over here; and of course there will also be the bias, in the same way. The centre neuron will only see these ones, this way, and again we have the +1 for the bias; analogously, we have three connections for the last one. In this specific neural network, we say that this layer has a receptive field Rf of 3: each of its neurons sees just three neurons from the previous layer. Let's move on and also connect layer L to layer L+1: first connection, second connection, third connection. Here too the receptive field is 3, but the receptive field with respect to layer L−1 is 5. Why is that? Because the neuron at the output layer can actually see five neurons from the input space.
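A minimal sketch of this receptive-field arithmetic, with a 1D toy setup of my own choosing: stacking two convolutions of kernel size 3 lets one output neuron see 5 input neurons.

```lua
require 'nn'

-- Two stacked 1D convolutions with kernel width 3 and no padding:
-- each layer has a receptive field of 3, the stack has a receptive field of 5.
local net = nn.Sequential()
net:add(nn.TemporalConvolution(1, 1, 3))   -- layer L
net:add(nn.TemporalConvolution(1, 1, 3))   -- layer L+1
local y = net:forward(torch.randn(5, 1))   -- 5 input neurons, 1 channel each
print(y:size(1))                           -- 1: the single output neuron sees all 5 inputs
```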
So in this case the neurons of the intermediate layer, the red ones, see just a portion of the input space and therefore capture only local notions, such as edges. The neurons after that, as we go higher in the hierarchy, see more pixels and can build a better understanding of what is going on in the input image, because they combine the responses of the neurons in the intermediate level, which builds up into a more abstract, higher-order representation. We can therefore draw, here in this direction, a global-view direction: eventually the last neuron of this network will be able to see the whole input image and will be able to distinguish between the different objects that we present to the network, for example for a classification task. As we can see, the number of computations is greatly reduced; we will see later a numerical example that gives a better sense of the order of magnitude of this reduction. So, once more: sparsity gives us a reduction in computations by exploiting the local correlation of the data.

The second main principle on which convolutional neural networks base their strength is parameter sharing. The first principle, we said, was the reduction of computation due to sparsity; this one instead is parameter sharing, which eases the convergence of the training of the model. How is this done? How does it work? Basically, each of the neurons in the red layer shares its weights, for the same kind of connection, with all the others. We can represent this by drawing, for example, this connection here in purple, this one in blue, and the last one in pink; finally, we do not want to forget our bias, of course, which is also the same shared parameter. In this case, each of these neurons is looking for the same kind of feature in different regions of the input. It is just as when we computed a(1) here: a(1) comes from only this partial input, which we can call x(1) for the moment, projected for neuron one onto the specific feature vector θ, which is shared across all the neurons in the hidden layer. Each of these neurons looks at a different region of the input space, in which it still looks for the same kind of target, the same kind of activation, the same kernel, as we may call it. So what we do, basically, is project portions of the input onto the same specific parameter vector, which is the kind of feature we are going to train our system on, and then apply our non-linearity in order to boost the correct result and attenuate the opposite one. Parameter sharing allows us to reduce the number of parameters, which in turn helps convergence in terms of time and in terms of relative error: less time and less error.
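As a rough numerical illustration of parameter sharing (the sizes are just an example): a 5 × 5 convolution from 3 to 6 maps reuses the same few hundred weights at every spatial location, while a fully connected layer between volumes of the same size would need tens of billions.

```lua
require 'nn'

-- Parameters of one shared 5x5 convolution, 3 -> 6 maps
local conv = nn.SpatialConvolution(3, 6, 5, 5, 1, 1, 2, 2)
print(conv.weight:nElement() + conv.bias:nElement())   -- 456 = 6*3*5*5 weights + 6 biases

-- A fully connected layer between the same volumes (3 x 256 x 256 -> 6 x 256 x 256)
-- would instead need 3*256^2 * 6*256^2, about 7.7e10, weights.
```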
So let's see how the convolutional layer actually works. In the generic case we have an input that I will just call x, which is a 3D tensor of dimensionality N1 × N2 × N3: N1 2D maps, or channels, each of height N2 and width N3. For example, at the actual input, when we are really talking about an image, N1 could be 3, the three RGB channels, and N2 and N3 are the height and the width of the image — in the case we saw before, 256 × 256. Our output y is also a 3D tensor, of dimensionality M1 × M2 × M3: M1 2D maps of dimension M2 × M3. And how do we get these maps? We also have a collection of kernels, which we call k: a 4D tensor made of M1 3D kernels k_i, each of size N1 × P1 × P2. Finally, we have a bias term b, in this case a 1D vector of dimensionality M1. So, to recap: we have a 3D input of N1 maps, or channels, of height N2 and width N3; the output is again a 3D tensor of M1 channels, or maps, of height M2 and width M3; we use a collection of M1 3D kernels k_i, of depth N1 like the input and spatial size P1 × P2, with i going from 1 to M1; and at the end we also have a 1D bias term of the same size as the number of channels of the feature maps y.

Now we can see how to compute these feature maps. Each feature map y_i is equal to x convolved with the corresponding kernel k_i, plus the scalar b_i: y_i = x ∗ k_i + b_i, for i = 1, …, M1, where ∗ is the convolution symbol. Basically, what this equation says is that we project our input x onto different subspaces k_i and then shift each projection by a specific bias term. It is very similar to how we were computing the weighted input z, which was equal to the matrix θ times the activation (or times x, in the first layer). So, as we said, each of our feature maps y_i is the convolution of the current input x — which is basically the activation coming from the previous layer — with each of the M1 kernels k_i of this specific layer, shifted by a specific bias b_i. How does it work? This is a 3D convolution, which can be expressed this way: if we have an image or feature map f of variables (l, m, n) and we convolve it with a kernel g, then (f ∗ g)(l, m, n) = Σ_{u,v,w} f(u, v, w) · g(l − u, m − v, n − w), that is, the summation over all the variables u, v and w of f at (u, v, w) multiplied by the kernel, which is flipped as we sum. If this is still very cryptic, I am going to draw something; perhaps it will be easier to understand afterwards. I have here my x, which we said is N1 here, N2 here, N3 here. Inside we have our kernel, which could be, for example, 3 × 3 spatially and N1 deep. As we see from the equation, it moves in this direction and in this direction: at its first location it outputs my first value, a single 1 × 1 pixel, at position (1, 1). Then it moves across, towards the right and down, so at the end I get something like this, a single slice in the depth dimension, of height M2 and width M3. And this is repeated M1 times, so that we end up with our final output y of M1 maps, of height M2 and width M3.
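Here is a minimal Torch sketch of these shapes; the specific sizes (3 input channels, 6 kernels of 5 × 5, a 32 × 32 input) are only an example.

```lua
require 'nn'

-- One convolutional layer: N1 = 3 input maps, M1 = 6 output maps, P1 = P2 = 5
local conv = nn.SpatialConvolution(3, 6, 5, 5)   -- stride 1, no padding
local x = torch.randn(3, 32, 32)                 -- N1 x N2 x N3 input
local y = conv:forward(x)

print(conv.weight:size())   -- 6 x 3 x 5 x 5: the 4D kernel tensor k
print(conv.bias:size())     -- 6: one scalar b_i per output map
print(y:size())             -- 6 x 28 x 28: M1 x M2 x M3 (a 'valid' convolution)
```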
As you may recall from the classical neural network case, after we compute z from the linear operation, where the weight matrix multiplies the activation from the previous layer, we go forward and compute the new activation as the non-linear function applied to our weighted input. In our case the weighted input is called a feature map and the projection is done with a convolution, and we now need to apply the non-linear function to this feature map, our weighted input. We can start by having a look at the non-linear function we have already seen in the previous lesson. Here we can draw, for example, σ(z) against z: the curve saturates at 1, passes through one half at z = 0, and is essentially flat beyond roughly z = ±5. The problem with the sigmoid appears if I plot σ′(z): on the same axis from −5 to 5 it is a bump whose maximum is only one quarter, reached at z = 0, and which drops towards zero elsewhere — something that is quite annoying. As we can remember, every term δ in our neural network gets multiplied by the derivative of the activation; so every δ, every time we go back layer after layer, is multiplied by something that is at most one quarter when z is close to the origin and otherwise just goes to zero. This creates problems, because training basically slows down across the network: the layers closer to the cost function can change their values pretty much according to what gradient descent establishes, but the lower layers — the ones closer to the input, so the ones with less abstraction — receive gradients that are very, very small, and therefore the gradients with respect to their parameters are also very small, and learning appears to slow down enormously for the early layers. Since the networks we have to train here are quite long — they are called deep neural networks because they are deep, meaning many layers are stacked one after the other — we cannot afford this problem of learning slowing down so much due to this particular effect. So we can try to remove the effect by changing our non-linearity, and we can use this one, the rectified linear unit. By definition this is simply z⁺, the positive part of z, so relu(z) = max(0, z): completely trivial; let's say this is one here and one here (notice that I have changed the x-axis scale between the left and right charts). For the derivative, the function is of course not differentiable at z = 0, but we do not really care too much: it is simply one for z greater than zero, zero for z less than zero, and we can define it to stay at zero at that point. In this case, the problem of the derivative being very small does not happen, and we do not get this slowing down across the network due to the non-linearity. This non-linearity was popularised by the paper of Alex Krizhevsky and Geoffrey Hinton when they submitted their deep neural network to the 2012 ImageNet challenge; in that paper they show that the rectified linear unit helps a lot in speeding up training.
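A quick, minimal check of this difference in Torch (the sample points are arbitrary): the local gradient of the sigmoid never exceeds one quarter, while the ReLU passes gradients through unchanged for positive inputs.

```lua
require 'nn'

-- Compare the local gradients of a sigmoid and a ReLU at a few points
local z = torch.Tensor({-5, 0, 5})
local gradOut = torch.ones(3)            -- pretend the upstream gradient is 1

local sig, relu = nn.Sigmoid(), nn.ReLU()
sig:forward(z);  print(sig:backward(z, gradOut))   -- ~0.007, 0.25, ~0.007: at most 1/4
relu:forward(z); print(relu:backward(z, gradOut))  -- 0, 0, 1: no shrinking for z > 0
```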
The last layer used by convolutional neural networks is the pooling layer. What does this layer do, and how does it work? Let's first see how it works, and then explain what it does. Think of having here a piece of feature map, one of the M1 slices coming out of the convolutional layer, after the point-wise non-linear function. We take, for example, this first region here and compute its LP norm, so that we end up with our first output value here. Then we apply a stride, for example equal to the width of this running window, so our second window ends up in this position here, and we perform the LP norm again, which gives me my second value here, and so on; I can keep going. Just for clarity: the p-norm of a vector x is the p-th root of the sum of its components raised to the power p, ‖x‖_p = (Σ_i |x_i|^p)^(1/p). Moreover, since the p-norm tends to the maximum of the components of x as p tends to +∞, max pooling can simply be seen as a specific case of the more generic LP pooling. What does it do? In terms of dimensionality, if we apply a stride equal to the size of the running window, we obtain an output which is shrunk in size: if we started with m2 in height and m3 in width, we get, for example, m2/2 and m3/2, while the number of channels is still m1. So the pooling operation works across the spatial dimensions and reduces the number of neurons by dropping some of them; if we choose, for example, the max operator, only the maximum activations are carried across the network and used in later computations. Sometimes the L2 norm is used instead, to average out the activations and gain robustness to noise. So there are different cases which call for different norms, different values of p; for this reason I presented the generic LP-norm pooling here, rather than just max pooling, which is, I think, the one most commonly used.
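A minimal Torch sketch of this shrinking effect, with sizes chosen only for illustration: 2 × 2 max pooling with stride 2 halves the spatial dimensions and leaves the channels untouched.

```lua
require 'nn'

-- 2x2 max pooling with stride 2: keep only the maximum of each window
local pool = nn.SpatialMaxPooling(2, 2, 2, 2)
local x = torch.randn(6, 128, 128)    -- m1 x m2 x m3 feature maps
print(pool:forward(x):size())         -- 6 x 64 x 64: channels kept, spatial size halved
```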
Now that we have seen the main building blocks of a convolutional neural network, let's see how the initial example, where I computed the number of operations and the time to process one image, changes if we use convolutions. We start by drawing our input, which was 256 × 256 and 3 in depth, our input image. We said we would like twice as many neurons, so twice as many 'pixels' in the feature maps, if we can call them that — they are not exactly pixels, they are activations: in the input space we call each point of the image a pixel, while in the later layers, in the feature maps, they are called activations. So, 2N: we keep 256 × 256 spatially but go to 6 maps in depth, then again 6 × 256 × 256, and then once more; the last block is simply a vector of K outputs. To go from the first block to the second, let's say our kernels k_i are 5 × 5 spatially; their depth is 3 in the first layer, because the input has 3 channels, and 6 in the other cases. For the first convolution, counting the number of MACs, we have 256 × 256 × 6 activations, each coming from a convolution with a kernel of 3 × 5 × 5 values — call this the first count. For the second convolution it is the same, except that the 3 becomes a 6, so twice as much; and going from there to the third block it is twice as much again. So overall the convolutions cost (1 + 2 + 2) times the first count, which equals 147 million MACs. For the final part we have a matrix of height K and width 256 × 256 × 6, which gives us K × 6 × 256 × 256, that is K times roughly 393 thousand. If we think about ImageNet classification, K equals 1000, so this is roughly 393 million. Overall we have about 540 million MACs, and dividing by the 2.2 GHz of my nice MacBook Pro we get something like 250 milliseconds per image. So we went down from 3 minutes to a quarter of a second, which in my opinion is pretty neat; let's keep these numbers in mind.

Can we do better? Yes we can: we can use the pooling operation, which is going to speed up our system by at least another 15 times. Again we start with our data of 256 × 256 and 3 in depth, and now, for example, we apply a convolution with a stride: instead of applying the kernel at every pixel, moving it horizontally and vertically by one pixel, we move it by, say, two pixels. In this case we have half as many pixels in the x direction and half as many in the y direction, so we end up with 128 pixels here and 128 pixels here; and since we said we double the size in depth, we again have 6 here. Let's see how many computations this configuration takes: 128 × 128 × 6, the number of activations, times my kernel of 5 × 5 × 3, which is roughly 7.3 million MACs. This first layer was the convolution with stride. Then we apply the second convolution: 6 in depth, 128 in x, 128 in y at the input, and again 6 maps of the same spatial size at the output. It stays the same because we may use padding: this is not the 'valid' convolution but the 'same' convolution, to use the MATLAB jargon, so we do not have to subtract 4 from the height and width — with a kernel of 5 the size would have gone down from 128 to 124, but then it gets complicated to keep nice numbers, so we just zero-pad, keep the dimensionality consistent, and do not worry too much. This convolution takes 128 × 128 × 6 times a kernel of 6 × 5 × 5, which is 14.7 million MACs. Then let's say we apply pooling: we start from 6 × 128 × 128 and end up with 6 × 64 × 64; the number of computations is 64 × 64 × 6 times our pooling window, for example 2 × 2 in this case, which is about 98 thousand operations — very small compared to what the convolutional layers take. Then we have the last convolution, from 6 × 64 × 64 to 6 × 64 × 64, so its cost is 64 × 64 × 6 times a kernel of 5 × 5 × 6, which is 3.7 million MACs. And let's even apply one more pooling here, going from 6 × 64 × 64 to 6 × 32 × 32; as an exercise, this last one costs 32 × 32 × 6 × 2 × 2, which is about 25 thousand operations.
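To keep the bookkeeping honest, here is a minimal sketch of this second count (layer sizes as above; bias terms and non-linearities are ignored, and the final linear layer, which we cost out just below, is included with K = 1000):

```lua
-- MAC estimate for the strided + pooled network (biases and non-linearities ignored)
local conv1 = 128*128*6 * (3*5*5)    -- strided 5x5 convolution, 3 -> 6 maps:  ~7.4e6
local conv2 = 128*128*6 * (6*5*5)    -- 5x5 convolution, 6 -> 6 maps:          ~14.7e6
local pool1 =  64* 64*6 * (2*2)      -- 2x2 pooling:                           ~9.8e4
local conv3 =  64* 64*6 * (6*5*5)    -- 5x5 convolution, 6 -> 6 maps:          ~3.7e6
local pool2 =  32* 32*6 * (2*2)      -- 2x2 pooling:                           ~2.5e4
local fc    = 1000 * (32*32*6)       -- linear layer to K = 1000 classes:      ~6.1e6
local total = conv1 + conv2 + pool1 + conv3 + pool2 + fc
print(total / 1e6)           -- ~32 million operations
print(total / 2.2e9 * 1e3)   -- ~14.6 ms at one MAC per clock cycle
```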
At the end we have the last block, which is 6 × 32 × 32, and it goes into a linear layer, because it is going to classify into my K outputs. Here I have not counted the non-linear functions, but their cost is simply the number of activations at their output: in the first case it is 6 × 128 × 128, which is about 98 thousand, so here we have 98 thousand operations for the non-linearity — small enough that we do not care, compared to the millions of operations of the convolutions. For the final part we have 32 × 32 × 6, which is about 6 thousand, and the final matrix is K by that; if K equals 1000, for ImageNet for example, we end up with about 6 million MACs. If we sum up all the convolutions, they add up to 25.7 million; the last part, we said, was 6 million for the linear layer; so overall we end up with 25.7 plus 6, that is 31.7 million MACs, which divided by 2.2 GHz gives 14 milliseconds — roughly 17 times less than the previous number.

We would now like to test and verify the numbers we just computed, to be sure they are reliable, and also to see how long my computer actually takes to process an input image with the networks we have just drawn. Before starting with the networks, let's set up our system with some new packages: with luarocks we install the pretty-nn package, and moreover we git-clone the Torch7-profiling repository from the e-Lab site. There we go. We can cd into Torch7-profiling, and in here we see there is the profile-model.lua script. If we run it through th with the help option, we get a help display that shows how to use the file, together with an example: we call th on the profiling script, pass the name of the file where we define our model, send the specific resolution, and moreover ask it to count MACs rather than additions and multiplications separately. Let's create the model file for our network: we go inside the models folder and create our model file, let's call it 3convolution.lua. Since I don't know how to write this by heart, we will just use the interpreter here below. We require nn at the beginning, and we also require pretty-nn, so that the colour-coded printing works. Then we have net = nn.Sequential(), and we add our first layer: net:add a new layer, a spatial convolution which goes from 3 channels, because we start from an input image with 3 channels, to 6 channels, using a kernel of 5 × 5 with a stride of 1 in both directions, and padding of 2 on every side — left, right, top and bottom — so that the output has the same dimensionality as the input. Then net:add a ReLU. Let's copy these two lines, the convolution and the ReLU: this was my first block. Now the second layer: it goes from 6 channels to 6 channels again, 5 × 5, same stride, and the padding is fine as it is; then we add another ReLU. And then one more block exactly the same, convolution plus ReLU.
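A minimal sketch of this interactive session so far (the layer parameters follow the dictation above; the pretty-nn package name is my reading of the audio, so treat that line as optional):

```lua
require 'nn'
-- require 'pretty-nn'   -- colour-coded printing of the architecture (name as heard above)

net = nn.Sequential()
-- 3 -> 6 maps, 5x5 kernel, stride 1 in both directions, zero padding of 2 on every side
net:add(nn.SpatialConvolution(3, 6, 5, 5, 1, 1, 2, 2))
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(6, 6, 5, 5, 1, 1, 2, 2))  -- 6 -> 6 maps, same kernel and padding
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(6, 6, 5, 5, 1, 1, 2, 2))  -- third, identical block
net:add(nn.ReLU())
```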
Now we need to reshape the cubic feature map, viewing it as simply one long vector; we can do this with the view module, telling it to resize everything into a single chunk. If we print net now, we see it is a Sequential with seven modules. The first one goes from three channels to six channels, with a kernel size of 5 × 5, a stride of one in both directions and a padding of two pixels per boundary; the second module is the non-linearity, the rectified linear unit; then we have another spatial convolution, again from 6 to 6, kernel size 5 × 5, stride one and padding two, so we preserve the dimensionality; again another non-linear function; the last convolution, 6 to 6, 5 × 5, stride 1, 1 and 2, 2 for the padding; the last non-linear function; and then the view, to see the 3D volume as one long chunk. Now I have to add the last linear layer. Since I don't know its input size, instead of computing it by hand I can use a trick: let's say x is my input, a torch Tensor of dimensionality 3, because it is RGB, by 256 by 256, and I simply ask for the size of net:forward(x). Sweet — the dimensionality is 393,216. So we can do net:add(nn.Linear(...)) from this dimensionality to, let's say, 1000 classes, for the ImageNet competition for example. Adding it took some time: you can tell it is quite a big matrix. If I print my whole network here, it is a Sequential with eight layers, and the last one is a Linear. A non-linear function after the last layer is not required; actually, training is even faster without it. So this one is my third convolutional block and this one is my output. Ah — in the model file we actually do need the view: I forgot to add nn.View(-1).

As we said before, we were estimating 147 million MACs for the convolutional part, and here the profiler reports 149 million, so I think that is pretty reasonable. For the fully connected part, the linear layer, we were estimating 393 million, and we have 393 million here, which is also quite reliable. Overall we had estimated 540 million, and here we have 543 million — very close to the computations we made. We were also assuming the CPU performs one MAC per clock cycle, so dividing 540 by 2.2 we were getting 250 milliseconds; here instead the forward pass took about half of that, 129 milliseconds, and we see that this CPU, my computer, performed 4.2 giga-MAC per second — so basically 2 MACs per clock cycle. Otherwise all the numbers are consistent.

Let's now run a test for the other network, the one with the pooling, and see how things change. Let's go back to our model here and save it as 3-conv-pool: this is my first CNN plus pooling. We put a pooling layer after this convolution and another one after here, and we change the first convolution so that instead of a stride of 1 it has a stride of 2 and 2. We then also have to modify the last layer, so let's see how to change it. In the Torch interpreter we can do net = require the models/3-conv-pool file, and run this line — cool. If we print the network, we get this architecture: a convolution that goes from 3 to 6 with a 5 × 5 kernel, stride of 2 and 2 and padding of 2 and 2, in both directions; then the non-linearity and the next convolution, 6 to 6, 5 × 5, stride 1, 1 and padding 2, 2; non-linearity and then the pooling; again one similar chunk with convolution, non-linearity and pooling; and then there is the view.
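For reference, here is a sketch of what the finished model file might look like; the file name and the flattened size 6 × 32 × 32 = 6144 (which we are about to find with the forward trick) are my reconstruction from the recording.

```lua
-- models/3-conv-pool.lua -- strided first convolution plus two max poolings
require 'nn'

local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 6, 5, 5, 2, 2, 2, 2))  -- 3 -> 6 maps, 5x5, stride 2, pad 2
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(6, 6, 5, 5, 1, 1, 2, 2))  -- 6 -> 6 maps, 128 x 128 preserved
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))               -- 128 -> 64
net:add(nn.SpatialConvolution(6, 6, 5, 5, 1, 1, 2, 2))  -- 6 -> 6 maps
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))               -- 64 -> 32
net:add(nn.View(-1))                                    -- flatten 6 x 32 x 32
net:add(nn.Linear(6 * 32 * 32, 1000))                   -- 1000 classes

return net
```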
So we net:forward a torch.rand(3, 256, 256) tensor and ask for the size of the result; we get the new number, and we can go back and replace the old one in the linear layer. This is the 3-conv-pool model. We go back down to the shell, make it full screen, and call the profiler as before, just with 3-conv-pool instead of 3-conv; everything else stays the same. Let's go. Bam — it was quite fast this time, right? So let's compare, let's check some results. At the beginning we were estimating 7.3 million MACs for the first convolution, and here it is 7.5 — pretty close. Then we had 14.7, and here we have 14.8: we actually forgot to take into account the bias terms, which add a little overhead. Then we said the first non-linearity would take 98 thousand operations, and 98 thousand it is; the pooling was also 98 thousand, and 98 thousand it is as well. Next we had another convolutional layer at 3.7 million, and here we have 3.7 million. Finally, the linear layer was taking 6 million, and 6 million it is here. We can even see, in the last column, each layer's percentage of the total computation, which is very important: ideally the operations should be distributed across the whole network, so that the computation, and the processing of the information, are well balanced. If we go down below, we can see that the convolutional part takes 26 million, and we were estimating 25.7, so I guess that is correct. Overall there is a total of 33 million, and we were estimating about 32. Dividing by 2.2 GHz, assuming one MAC per clock cycle, we were estimating 14 milliseconds, and my computer just finished running the profiling in 11 milliseconds. So the estimate we got on paper was, I would say, pretty accurate.

Now we could hypothetically write down the network that uses only linear layers, but we can already tell it is not even going to load into memory. Just as an exercise, though, we can do it: we create, for example, a linear model file, and we have my huge MLP, a multi-layer perceptron, a classical neural network. We write local n = 256 * 256 * 3, and then local net = nn.Sequential(). Then we have the first layer: net:add(nn.Linear) going from n to 2n, followed by net:add of the non-linearity, the ReLU. Then we have the second layer, very similar to the previous one but going from 2n to 2n, and a third layer which is identical; then the output, net:add(nn.Linear) from 2n to the 1000 outputs; and at the end, return net. But this one will not even load into my computer's memory, so you are free to try it yourself; I already crashed my computer with it once, so I will not do it again.

And that's it for today. Stay tuned, because in the next video we will look at some standard architectures, from the first ones like LeNet and AlexNet up to the most recent GoogLeNet and ResNet, and see why some specific architectures have been chosen, what their strong points are, and what we can learn from them.