Namaste. In the previous session, we studied the convolution operation. In a convolution operation, we define a bunch of filters and use each filter to calculate the activation at a particular point in the image. In this session, we will study how to slide each filter across the image by setting strides. The output of a convolution operation is usually smaller than the input image; in order to keep the output the same shape as the input, we use padding, which we will also study in this session. Apart from convolution, there is a second important operation in CNNs called pooling. Pooling aggressively down-samples the output of the convolution. After understanding the pooling and convolution operations, we will use them in a practical setting to classify handwritten digits from the MNIST dataset. Let us begin.

We set strides while sliding the filter across the image. The stride tells us how to calculate the next position of the filter along each axis, starting from the current position. By default, we take a stride of 1, which is the most common choice. We can specify different strides across different axes, but this is done quite rarely. Such a strided convolution tends to down-sample the input by a factor proportional to the stride.

So, let us try to understand how the stride works. This is our 28 cross 28 handwritten digit image, and we have a filter positioned at the initial position of the image. We use a stride of 1, and the stride helps us calculate the next position of the filter. The current position is (1, 1); taking a stride of 1 along the horizontal direction, the filter gets shifted by one column to the right. Once it exhausts all the columns, we shift it down by one row and start again from the left. This is how we slide the filter across the image and try to match the pattern in the image.

So, let us say we have a 28 cross 28 cross 1 image and a filter of size 3 cross 3 with a depth of 1. We will be able to position the filter at 26 possible positions along the width as well as along the height, the final position of the filter being at position 26. So, we get in all 26 possible positions along the width, 26 along the height, and a depth of 1. This is how we get a 26 cross 26 cross 1 output from the convolution.

So far, we looked at how to perform a convolution with a single filter. In a convolution layer, we typically use K different filters. We define all these filters with a Conv2D layer in the tf.keras API. We specify the number of filters, the size of the filter, the activation, and the input shape. In this case, we have 32 filters, each of size 3 cross 3. Then we specify the activation to apply after the linear combination of the filter weights with the pixel values in the image, and finally we specify the shape of the input. The size of the filter and the stride, which is 1 by default, are applicable across all the K filters. After applying the convolution with K filters, we get a 3D tensor with the same number of rows and columns for each filter.
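As a minimal sketch, assuming the standard tf.keras API, such a layer can be declared as follows; the dummy random input is only there to show the resulting output shape:

```python
import tensorflow as tf

# One convolution layer: 32 filters, each 3x3, ReLU activation,
# stride of 1 (the default), operating on a 28x28 grayscale image.
conv = tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                              input_shape=(28, 28, 1))

x = tf.random.normal((1, 28, 28, 1))  # a dummy batch with one image
print(conv(x).shape)                  # (1, 26, 26, 32): 26 positions per axis, 32 channels
```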
So, let us say this is our handwritten digit image, which is 28 cross 28 with a depth of 1, and we apply 32 filters, each of the same size: a 3 cross 3 filter number 1, another 3 cross 3 filter number 2, and so on up to the 32nd filter. When we apply each of these filters to the image, we get a 26 cross 26 cross 1 output, since the filter has the same depth as the input, which here is 1. (The depth is normally not specified in the filter size.) We get another such output from the second filter, and so on up to the 32nd filter. All these outputs are then stacked together, so we get 32 channels, each holding a 26 cross 26 output. Concretely, for our MNIST example, we get a 3D tensor as an output with 26 rows, 26 columns, and 32 channels, one for each filter. The total number of parameters for this layer will be 320, because we have 10 parameters per filter (9 weights plus 1 bias), and 32 cross 10 gives 320 parameters for all the filters.

If we want the convolution output to have the same shape as the input, we use padding. After applying the convolution, the input gets shrunk by some amount. If we do not want this shrinking, we use padding, where we add some dummy rows and columns to the input and then apply the convolution on it. For a convolution with a filter of size 3 cross 3, we add one dummy column to the left and to the right, and one dummy row at the top and at the bottom of the image. This ensures that the shape of the output is the same as the shape of the input.
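Here is a small sketch contrasting the two padding modes in tf.keras ('valid', the default, versus 'same'); the random input is just a stand-in for an image:

```python
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 1))  # dummy 28x28 single-channel input

# Default 'valid' padding: no dummy rows/columns, so the output shrinks.
valid = tf.keras.layers.Conv2D(32, (3, 3), padding='valid')
print(valid(x).shape)  # (1, 26, 26, 32)

# 'same' padding: one dummy row/column on each side for a 3x3 filter,
# so the output keeps the input's height and width.
same = tf.keras.layers.Conv2D(32, (3, 3), padding='same')
print(same(x).shape)   # (1, 28, 28, 32)
```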
Convolutions have a couple of key characteristics. First, the patterns they learn are translation invariant: after learning a certain pattern, a convolutional neural network can recognize it anywhere in the image. Second, they can learn spatial hierarchies of patterns. A first convolution layer learns small local patterns such as edges; a second convolution layer learns larger patterns made from the features of the first layer, and so on. This allows us to efficiently learn increasingly complex and abstract visual concepts. For example, say we have a convolutional neural network that is trying to recognize a human in an image. The first convolution layer might capture simple local patterns like edges. In the second convolution layer, we combine the outputs of the first layer and learn more complex patterns. As we progress, we learn increasingly complex patterns based on the outputs of the previous layers, and finally these patterns help us detect a person in the image.

The second important concept in convolutional neural networks is pooling. Pooling aggressively down-samples the output of the convolution. It is conceptually similar to strided convolution: it consists of extracting windows from the input feature map and computing an output for each window based on the pooling policy. Pooling is usually done with a window of size 2 cross 2 with a stride of 2.

Let us look at pooling with a concrete example. Pooling defines a window of size 2 cross 2. We position this window on the convolution output, apply the pooling policy on the 4 numbers inside it, and select a single number based on that policy. We use either max pooling or average pooling as the pooling policy. A second important point is that we apply the pooling operation on each channel separately. For example, if the output of the convolution has multiple channels, we separate the channels and apply pooling on each channel independently, selecting one number from each patch based on the pooling policy.

We use one of the following two pooling policies: the first is called max pooling and the second is called average pooling. So, let us say these are the 4 numbers in the pooling window. Max pooling will return 6, which is the maximum of these 4 numbers, whereas average pooling will return the average of these 4 numbers, which is 4.

Let us see max pooling in action on the output of a convolution layer. This is the first position of the pooling window; then we stride by 2 and position it on the next patch. This is the second patch where the max pooling window is positioned, this is the third patch, and this is the final patch. Each of these windows returns one number based on the pooling policy, so we get a 2 cross 2 output from the 4 cross 4 input using a 2 cross 2 max pooling window. We get the value 6, which is the largest number in this particular window, 2 over here, 8 over here, and 9 over here. Similarly, if we use average pooling as the policy, we compute the average of the numbers in each patch.

If we apply max pooling on the output of a convolution, say an output of 26 cross 26 cross 1, with a max pooling window of 2 cross 2, we get an output of 13 cross 13 cross 1. So, you can see the down-sampling that happens when we apply max pooling to the output of the convolution. Note that max pooling does not have any parameters.
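As a sketch, with the four patch values invented here so that they match the example above (max 6, average 4), the two policies can be tried out in tf.keras as follows:

```python
import tensorflow as tf

# A single 2x2 patch; the values are made up so that max pooling
# returns 6 and average pooling returns 4, as in the example above.
patch = tf.constant([[6., 2.],
                     [5., 3.]])
x = tf.reshape(patch, (1, 2, 2, 1))  # (batch, height, width, channels)

max_pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2), strides=2)

print(max_pool(x).numpy().squeeze())  # 6.0
print(avg_pool(x).numpy().squeeze())  # 4.0
```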
In practice, we set up a series of convolution and pooling layers in CNNs. The number of convolution and pooling layers is a configurable parameter set by the designer of the network. In the current example, we use two convolution-pooling blocks and one additional convolution layer at the end. We use 32 filters in the first convolution layer and 64 filters each in the second and third layers. Each filter is 3 cross 3 in size and we use a stride of 1. We have not used any padding in any of the convolution layers. We use max pooling for down-sampling, with a window of 2 cross 2 and a stride of 2. It is not necessary that every convolution layer be followed by a pooling layer; sometimes we have pooling only after a few convolution layers.

Now that the model is built, let us look at the model summary and work out the number of parameters for each convolution layer. We will write the shapes of the input and output tensors for each layer and then work out the number of parameters from them. The input to a CNN is a 4D tensor: the first axis corresponds to samples, the second to the number of rows (the height), the third to the width, and the fourth to the number of channels. In the case of MNIST, we have a single channel because we have a grayscale image. So, this is the 4D tensor corresponding to the input image, and we apply the convolution operation on it with 32 filters, each of size 3 cross 3 with a stride of 1.

Because we are applying a 3 cross 3 filter, this input gets transformed into another 4D tensor with height and width of 26 and 32 channels; each channel corresponds to one of the filters used in the convolution operation. We apply max pooling on this output with a window of 2 cross 2 and a stride of 2, which gives us another 4D tensor with height 13, width 13, and 32 channels. We can see the down-sampling from the convolution output once we apply max pooling on it. Then we apply another convolution with 64 filters, each a 3 cross 3 filter with a stride of 1, and we get a 4D tensor with height and width of 11 and the number of channels equal to the number of filters used in the convolution, which is 64. (The first axis shows up as None in the summary because it is the sample axis.) Then we apply max pooling with a 2 cross 2 window and a stride of 2, and we get a 4D tensor with height and width of 5 and 64 channels. We apply another convolution with 64 filters, each of size 3 cross 3 with a stride of 1, and we get a 4D tensor with height and width of 3 and 64 channels. This is the output we get after applying convolution-pooling two times, followed by one more convolution operation.

Let us work out the parameters for each of these layers. In the first convolution, we use 32 filters, each of size 3 cross 3. Because we use 3 by 3 filters, we have 9 weights corresponding to the positions in the filter plus one bias, so there are 10 parameters per filter and 32 filters, giving 320 parameters in all. We learn each of these parameters during training of the network. The max pooling layer does not have any parameters. Let us come to the second convolution layer. The shape of the filter here is 3 comma 3 comma 32: we have 3 by 3 filters with a depth of 32, equal to the number of channels in the input. That gives 288 weights plus 1 bias, so there are 289 parameters per filter, and for 64 filters we get 18,496 parameters. The max pooling layer again does not have any parameters. For the final convolution layer, the shape of the filter is 3 comma 3 comma 64: a 3 cross 3 filter with 64 channels. The number of parameters per filter is 3 into 3 into 64 plus 1 bias, which makes 577, and we have 64 such filters, so we have in all 577 into 64, which comes to 36,928 parameters.

The number of parameters in a convolution layer depends only on the filter size (including its depth) and the number of filters; it does not depend on the height and width of the input. It can also be observed that the width and height dimensions tend to shrink as we go deeper in the network: we started with a height and width of 28 each, and after a couple of convolution-pooling blocks followed by a single convolution operation, we got a height and width of 3. The number of output channels for each Conv2D layer is controlled by its first argument, and the number of filters used in a convolution is normally 32 or 64. Typically, as the width and height shrink, we can afford to add more output channels in each Conv2D layer. At the end of these operations, we have a 4D tensor output with height and width of 3 and 64 channels. What we do next is feed this output into a dense layer to perform the classification.
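Here is a sketch of the convolutional base described so far (the dense head is added in the next step); the comments restate the shapes and parameter counts we just worked out, and model.summary() should reproduce them:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu',
                  input_shape=(28, 28, 1)),       # -> (26, 26, 32), 320 params
    layers.MaxPooling2D((2, 2)),                  # -> (13, 13, 32), no params
    layers.Conv2D(64, (3, 3), activation='relu'), # -> (11, 11, 64), 18,496 params
    layers.MaxPooling2D((2, 2)),                  # -> (5, 5, 64), no params
    layers.Conv2D(64, (3, 3), activation='relu'), # -> (3, 3, 64), 36,928 params
])
model.summary()
```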
So, what is really happening here is: we take an image, we apply a bunch of convolution and pooling operations that give us a representation, and we feed that representation into a feed forward neural network, which gives us the label corresponding to the digit written in the image. Let us compare this with the traditional machine learning flow. In the traditional flow, given an image, we would first perform feature engineering using computer vision libraries, and the features would be fed into a machine learning classifier, which after training would give us the output. You can see that the feature engineering part of traditional machine learning is getting replaced by CNNs, so we can think of CNNs as a way of generating features automatically for a given image. The beauty of this approach is that the right representation is learned during model training, freeing us from the expensive and tedious feature engineering task.

We will use a feed forward neural network for the classification task. Let us see how to set this up. We add a Flatten layer that takes the output of the convolution layer and flattens it into a 1D tensor. We feed the output of the Flatten layer into a dense layer containing 64 units, using ReLU as the activation. MNIST has 10 output classes, so we use a final dense layer with 10 outputs and a softmax activation.

Let us understand this with an illustration, work out the parameter calculations ourselves, and compare them with the parameters shown in model.summary. Here the input has shape 3 comma 3 comma 64. We pass it through the Flatten layer, which gives us 576 numbers; these are fed into a dense layer, whose output is fed into another dense layer. Flatten has no parameters. It outputs 576 numbers, which are input to each of the 64 units in the first dense layer. So, each unit in this dense layer has 576 weights plus 1 bias, making 577 parameters per unit, and we have 64 such units, making 36,928 parameters. This layer produces 64 values, one corresponding to each unit. The final dense layer has 10 units; each unit receives 64 values from the previous layer, and adding one bias parameter makes 65 parameters per unit, so in total we have 650 parameters for the final layer. If we sum across the convolutional layers and the fully connected top layers, we have a total of 93,322 parameters. Let us compare our calculation with model.summary: you can see that we arrived at 93,322 model parameters, which matches its output.

Let us set up the training of the model. We use the sparse categorical cross entropy loss with the Adam optimizer, and we train the model for five epochs with the training images and training labels. After five epochs, the model has achieved more than 99 percent accuracy in recognizing handwritten digits on the training data. Let us check the performance of the model on the test data. We use the evaluate method of the Model class for calculating the accuracy of the model on the test data, passing the test images and their labels as input so that the statistics can be calculated. We observe more than 99 percent accuracy on the test set.
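Putting this together, here is a sketch of the classification head and the training setup, assuming the model and layers names from the earlier sketch and the standard Keras MNIST loader; the exact accuracies will vary from run to run:

```python
# Classification head on top of the convolutional base built above.
model.add(layers.Flatten())                        # (3, 3, 64) -> 576 values
model.add(layers.Dense(64, activation='relu'))     # 577 x 64 = 36,928 params
model.add(layers.Dense(10, activation='softmax'))  # 65 x 10 = 650 params

# Standard Keras MNIST loader; reshape to 4D tensors and scale pixels to [0, 1].
(train_images, train_labels), (test_images, test_labels) = \
    keras.datasets.mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)
test_loss, test_acc = model.evaluate(test_images, test_labels)
```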
It is interesting to compare this architecture with the fully connected feed forward neural network that we used in an earlier exercise. Let us set up the feed forward neural network and compare it against the CNN. We load the MNIST dataset and set up a feed forward neural network with a hidden layer of 128 units. You can see that this feed forward neural network has more than 100k parameters (the hidden layer alone has 784 into 128 plus 128, which is 100,480 parameters), compared with the 93k parameters that our CNN has.

In the case of the CNN, we define patches: we take a patch and perform a convolution operation with the filter. This is the image patch and this is the filter, and we perform a linear combination of each position in the image patch with the corresponding parameter in the filter, followed by a non-linear activation. In the case of the feed forward neural network, we take the entire image and flatten it into a single array; since we have a 28 cross 28 image, we get an array of 784 numbers, which we pass to a hidden layer with 128 units followed by a dense layer of 10 units to get the output. If we come up with an equivalent flattened representation of the convolution, what really happens is that 9 values are connected to one particular node, a neuron or unit in the neural network, which performs a linear combination followed by an activation. So, you can see that the CNN proceeds by capturing local patterns, whereas in the feed forward neural network we learn global patterns involving all the pixels.

Today, we learned about CNNs, which are extensively used in computer vision applications. We discussed the key operations in CNNs and demonstrated their usage in the handwritten digit recognition task. We compared CNNs with feed forward neural networks, and we learned about the CNN's ability to automatically generate features from a given image. In the next few sessions, we will learn about using pre-trained CNN models for image classification and visualizing what CNN layers are learning. See you in the next session. Thank you.