Namaste. So far in this course we have been building machine learning models with feed-forward neural networks. Today we will learn about a new neural network architecture called convolutional neural networks, or CNNs. Let us first set up the Colab environment for CNNs. CNNs train faster with GPUs, so we will enable the free GPU in Colab like this: we open Edit, go to Notebook settings, and under hardware accelerator we select GPU and save the settings. Now the notebook is enabled to run with a GPU runtime. Let us install and import all the necessary libraries in the Colab runtime. CNNs are used extensively in computer vision. They are also used for modelling temporal data. The key idea in CNNs is to capture local patterns in the data through the convolution operation. The resulting output is then downsampled with a pooling layer. CNNs employ a series of convolution and pooling layers. Let us understand CNNs with the MNIST digit recognition example. In this case, the CNN takes an image as input and predicts the corresponding digit as output. The image is encoded as a 3D tensor with axes corresponding to the height, width and depth of the image. In the case of MNIST, we are dealing with grayscale images, so the depth is 1 while the height and width are both 28. We will download and pre-process the MNIST handwritten digit recognition dataset just as before, except for a small change: here we reshape the image data into a 4D tensor. This is done to satisfy the input requirements of CNNs. The reshaping is done with the reshape command that you can see here on your screen. Both the train-images and test-images tensors are reshaped into 4D tensors. Let us run this cell and examine the shapes of the tensors before and after reshaping. You can observe that the reshaping operation converts the 3D tensors into 4D tensors.
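The reshaping step described above can be sketched as follows. This is a minimal sketch assuming NumPy; the lecture downloads the real data with Keras, but here stand-in arrays of the same shapes are used so the reshape itself is visible:

```python
import numpy as np

# Stand-in arrays with the same shapes as the MNIST train/test image tensors.
# (The lecture obtains the real data via keras.datasets.mnist.load_data().)
train_images = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)
test_images = np.random.randint(0, 256, size=(10000, 28, 28), dtype=np.uint8)

print(train_images.shape)  # (60000, 28, 28) -- a 3D tensor

# Reshape each image to height x width x depth (28, 28, 1), giving 4D tensors
# overall, and scale pixel values to [0, 1] as in the earlier lectures.
train_images = train_images.reshape((60000, 28, 28, 1)).astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype("float32") / 255

print(train_images.shape)  # (60000, 28, 28, 1) -- a 4D tensor
print(test_images.shape)   # (10000, 28, 28, 1)
```

The extra trailing axis of size 1 is the depth (channel) axis that the convolution layers expect.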
The shape of each image in the train and test tensors changes from a height of 28 and a width of 28 to a height and width of 28 each and a depth of 1. Now that we have prepared the data for training, let us build our model. We will first create the convolution base. We use a Sequential model for creating the convolution base. Let us add a convolution layer to the model; it is added using the layers.Conv2D command. Let us now understand the convolution operation. The convolution operation involves a filter, which captures a local pattern, and applies it on the image. The filter is a 3D tensor of a specific width, height and depth. Each entry in the filter is a real number, and these entries are learnt during CNN training. The filter slides across the width and height of the input, stopping at all possible positions to extract a local 3D patch of the surrounding features. Let us understand the convolution operation on a handwritten digit. So, this is our handwritten digit encoded as a 28 cross 28 image. Since this is a grayscale image, the depth is 1. The convolution operation defines a filter of a specific height and width. Let us say our filter is of size 3 cross 3. So, this is the filter and this is the original image. What we do is, we take the filter and position it at different places in the image. So, this is one possible positioning of the filter, at the first position. If we slide the filter, we get a new position of the filter. So, let us say this is position 1, this is position 2, and we keep doing this across the length and breadth of the image until we reach the final positioning of the filter. Here, since it was a grayscale image, we had a depth of 1. Let us now try to understand how convolution happens with images that are coloured. In the case of coloured images, we have 3 channels for every image: red, green and blue.
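The model-building step above can be sketched with TensorFlow's Keras API; this is a minimal sketch of a Sequential model with a single Conv2D layer (32 filters of size 3 cross 3 is a common first choice, not something fixed by the lecture up to this point):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolution base built as a Sequential model, as in the lecture.
model = keras.Sequential()
model.add(keras.Input(shape=(28, 28, 1)))  # height, width, depth of one MNIST image
# 32 filters of size 3x3, each followed by a ReLU activation.
model.add(layers.Conv2D(32, (3, 3), activation="relu"))

model.summary()
```

Since a 3 cross 3 filter stops at 26 positions along each axis of a 28 cross 28 image, the layer's output shape is (None, 26, 26, 32), one 26 cross 26 map per filter.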
So, this is the red channel, then there is a green channel, then there is a blue channel. Let us understand how a filter is defined on an image with multiple channels. We will have a filter which also has 3 channels. So, we have, let us say, a 3 cross 3 filter which has a depth of 3. This 3 cross 3 filter is represented as a 3D tensor: this axis corresponds to height, this one to width, and this one to channels or depth. In this case, the depth is 3, one each for red, green and blue, and the positioning happens as follows. We position this filter at the beginning, over here; there will be a corresponding positioning in green and also in blue. From this first position, we slide the filter across the width and height of the coloured image until we get to the last patch. What we do is, we transform every such patch with a convolution filter into a single number. Let us understand what happens when we position the filter at a particular patch of an image. Let us say this is our handwritten digit from the MNIST dataset, which has a height of 28 and a width of 28, a 28 cross 28 grid, and we define a 3 cross 3 filter. The filter has 9 entries and each entry is a weight, let us say w1, w2, w3, w4, w5, w6, w7, w8 and w9. Initially, we take this filter and position it at this particular patch of the image. Let us say the image has the following inputs; the patch of the image also has 9 positions. What we do is, we position this filter on top of the image patch and perform a linear combination of the weights in the filter with the feature values in the image patch. So, in this case we have w1 x1 plus w2 x2: w2 gets multiplied with x2, w3 gets multiplied with x3, w4 gets multiplied with x4, and so on until x9 gets multiplied with w9.
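The linear combination at one filter position can be written out in NumPy. This is a sketch with made-up values for the patch and the weights (the lecturer's slide values are not in the transcript; in a real CNN the weights are learnt during training):

```python
import numpy as np

# A 3x3 image patch with inputs x1..x9 (illustrative values only).
patch = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])

# A 3x3 filter with weights w1..w9 (illustrative; learnt during training).
weights = np.array([[0.5, 0.5, 0.5],
                    [0.5, 0.5, 0.5],
                    [0.5, 0.5, 0.5]])

# Superimpose the filter on the patch and sum the element-wise products:
# w1*x1 + w2*x2 + ... + w9*x9
linear_combination = np.sum(weights * patch)
print(linear_combination)  # 2.5  (five inputs of 1.0, each times 0.5)
```

Element-wise multiplication followed by a sum is exactly the "superimpose and multiply" operation described on the slide.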
So, when we position this filter on this image patch, we multiply w1 by x1, w2 by x2, and so on up to x9, which gets multiplied by w9. Writing this as an equation, we get w1 x1 plus w2 x2 plus w3 x3, and so on up to w9 x9, and we add a bias term to it. You can see that this equation is very similar to the equation we saw in feed forward neural networks. We apply a ReLU activation on this linear combination, which gives us a variable z. z is a scalar for a given positioning of the patch on the image. Let us now try to understand how a 3D patch works. Let us say this is the image and we have the corresponding filter. Remember, the filter has the same number of channels as the input image; since the input image has 3 channels, we also have 3 channels in the filter. So, let us say this is a 3 cross 3 filter with 3 channels. When we position this filter in the image, let us say at the first position, the filter is positioned across the channels. So, we have a red patch with inputs x1 to x9, then a green patch with its own 9 inputs, and likewise a blue patch. In total, the patch has 27 inputs, 9 for each channel. We also have corresponding filters: one for the red channel, one for green and one for blue, with weights w1, w2, w3 and so on. So, we have 27 weights in the filter, 9 corresponding to each of the 3 channels, and one bias unit for this filter. That is, we have 3 cross 3 cross 3, or 27, weights plus one bias, which gives us 28 parameters per filter. What we do here is calculate the linear combination of each weight with the corresponding value in the image patch, add the bias to it, and apply an activation like ReLU to get a scalar number. So, this is how we apply a 3D filter to a 3D image.
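The 3-channel case above can be sketched the same way. This is an illustrative sketch with random values standing in for the patch and the learnt weights; it shows the 27-plus-1 parameter count and the full compute: linear combination, bias, then ReLU:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3x3 patch from a 3-channel (RGB) image: 27 input values in total.
patch = rng.random((3, 3, 3))

# A matching 3x3x3 filter: 27 weights, plus one bias, i.e. 28 parameters.
weights = rng.random((3, 3, 3))
bias = 0.1

print(weights.size + 1)  # 28 parameters per filter

# Linear combination of all 27 weight/input pairs, plus bias, then ReLU,
# giving a single scalar z for this filter position.
z = max(float(np.sum(weights * patch)) + bias, 0.0)
print(z)
```

Note that z is one number per filter position, regardless of how many channels the input has; the depth is summed away.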
This filter generalizes to more than 3 channels in exactly the same manner. The only difference is that the filter will have the same number of channels as the input, which also increases the number of parameters per filter. Let us take a concrete example of a 3 by 3 filter on a grayscale image. Let us say this is a patch extracted from a grayscale image with the values shown, and this is a 3 by 3 filter with 10 parameters: 9 parameters corresponding to the positions and one bias term. Let us say the bias is equal to 1. The linear combination works as follows. We write the values in the image in blue, and each of them is multiplied by the corresponding weight in the filter. So, we superimpose the filter on the patch and multiply each number in the image patch with the corresponding weight in the filter. You can see that the linear combination comes to 4; we add the bias, which gives 5, and if we apply a ReLU on it, the ReLU of a positive number is the number itself. So, we get 5 as the answer from this particular patch of the image. The value 5 signifies the strength of the local pattern represented by the filter at this particular image patch. So far in this class, we studied the convolution operation. We learnt how filters are created and how we evaluate a filter at a particular position in the image. We slide the filter across the image at all possible positions and assess the strength of the local pattern represented by the filter at each position. See you in the next session. Thank you.
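The full sliding procedure summarized above can be sketched in NumPy. This is an illustrative sketch with a random image and filter (not the slide's values): a 3 cross 3 filter over a 28 cross 28 grayscale image stops at 26 cross 26 positions, producing one pattern-strength value per position:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(42)
image = rng.random((28, 28))         # grayscale image, depth 1
weights = rng.random((3, 3)) - 0.5   # a 3x3 filter (illustrative weights)
bias = 1.0

# Slide the filter across every possible 3x3 position of the image and
# evaluate the strength of the local pattern at each position.
out = np.zeros((26, 26))
for i in range(26):
    for j in range(26):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = relu(np.sum(weights * patch) + bias)

print(out.shape)  # (26, 26): one scalar per filter position
```

This double loop is only for understanding; in practice layers.Conv2D performs the same computation, for many filters at once, as optimized tensor operations on the GPU.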