Namaste. In the last lecture, we looked at the concepts behind convolutional neural networks and used them to build an image classifier. We also learned that these models have a large number of parameters, and in order to train them without overfitting, we need a reasonably large amount of data. On the other hand, there exist models that have already been trained on large amounts of data. How do we make use of such models to train custom models when we do not have much data in our possession? For example, say a manufacturing company wants to train a CNN model for recognizing faulty machine parts. How can it take advantage of a model like MobileNet to solve its problem? We will learn the techniques behind solving these kinds of problems in this session, and we will demonstrate them by classifying cats and dogs. A pre-trained model is at the center of our strategy. A pre-trained model is a saved network that was previously trained on a large dataset, typically on a large-scale image classification task. We either use a pre-trained model as it is, or use transfer learning to customize it for a given task. The intuition behind transfer learning is that if a model is trained on a large and general enough dataset, it effectively serves as a general model of the visual world. We can take advantage of these learned feature maps without having to train a large model on a large dataset from scratch. In this exercise, we will try two ways of customizing a pre-trained model. Here is the situation: we have a CNN model that was trained on a large number of images. As we studied earlier, after the CNN we normally have a feed-forward neural network which outputs the label. The CNN here is made up of convolution and pooling layers, and there can be multiple convolution-pooling blocks that together make up the CNN.
For example, we might have two convolution-pooling blocks followed by a convolution layer whose output is then fed into the feed-forward neural network to generate a label. Now, the idea here is to take this CNN model, which was trained on a large dataset, and use it for performing some other task. For example, we have a dataset of machine parts and we want to recognize, let us say, faulty machine parts. So we want to build a CNN followed by a feed-forward neural network to get the label, which in this case is "faulty". We know that a CNN model has a large number of parameters, and if we do not have enough data points about machine parts and their labels, we are likely to overfit the CNN model. So what we want to do here is take advantage of a pre-trained model, namely this CNN model, and reuse it. There are really two ways in which we can achieve this. So let us say this is our CNN model, which of course has a chain of convolution and pooling operations, and its output is served to a feed-forward neural network, which first flattens the output of the CNN and then feeds it to a dense layer, which in turn feeds into the output layer, another dense layer, which finally gives us the label. This entire model, the CNN followed by the feed-forward neural network, is trained on a large dataset to recognize the images in that training set. These two parts together are called a pre-trained model. Now, we can use this pre-trained model in two ways. In the first approach, let us say these are my machine parts, a few images of machine parts. I use the CNN, as you can see over here, exactly the same architecture, and I replace the feed-forward neural network. So what happens? This is called using CNNs for feature generation.
So the whole idea is that we take machine parts, pass them through the CNN, and whatever features come out of it are passed to a feed-forward neural network to get the label. Just to make the point clear: this feed-forward network is the new thing that we build specific to our problem, while this part, the CNN, is reused from the pre-trained model. It is exactly the same CNN that we saw earlier. This is one way in which we can use the CNN. In the second approach, we use the weights from the pre-trained model only for a few layers. So we use the weights learnt from the pre-trained model only in this shaded part, and we learn the weights in the remaining part, along with the feed-forward neural network. So in the feature generation approach we use the complete CNN part of the pre-trained model, whereas here the CNN is partially frozen, meaning we use the pre-trained weights for one part and learn the weights afresh for the other part on the new dataset. This is called fine-tuning. So we have feature generation and fine-tuning. In the case of feature generation, we do not train the entire model: the base convolutional network already contains features that are generally useful for classifying images, and we only train the final feed-forward part. In fine-tuning, we unfreeze the top few layers and jointly train both the newly added classifier layers and the last layers of the base model. Now that you have understood how to use a pre-trained model to build a custom model, let us look at the machine learning workflow involved in this process. In the first step, we examine and understand the data, which is exactly the same as in traditional machine learning. Then we build an input pipeline, and then we compose our model. This model composition step differs from traditional machine learning.
In a traditional machine learning setup we define the model completely, whereas here we take a pre-trained model: we load the pre-trained model and then add a classifier layer on top of it. The classifier layer is defined in terms of a feed-forward neural network. The remaining two steps, training and evaluation of the model, are again very similar to traditional machine learning. So let us start by importing all the necessary libraries. Let us also install TensorFlow 2.0 and import the tensorflow package. Let us load the dataset using the TFDS package. The tfds.load method downloads and caches the data and returns a tf.data.Dataset object. These objects provide powerful and efficient methods for manipulating data and piping it into our model. Since the cats-and-dogs dataset does not define standard splits, we use the subsplit method to divide it into train, validation and test sets, with the split specified by the weighted parameter. Here we use 80 percent of the data for training, 10 percent for test and the remaining 10 percent for validation. So we have an 80-10-10 split across train, validation and test. The resulting dataset objects contain image-label pairs: the images have variable shape and three channels, as you can see here, and the label is a scalar. Let us look at the first two images and their labels from the training set. We iterate over the training set using the take method on the dataset and generate the string form of the label using the int2str method. You can see that there is a picture of a dog with its label displayed at the top, and there is a picture of a cat. The dog picture has a height of 500 and a width of about 350, whereas the cat picture has a height close to 400 and a width of 500. So the pictures are not all of the same size, and the first thing to do is resize the images so that we have the same input size for all of them.
We will use the tf.image module to format the images for this task. We will also rescale the input channels to a range of -1 to +1. We do that through these two lines of code, and once the image is rescaled, we use the resize function to convert it to the desired size. Here the desired size is 160 by 160, and the format_example function returns the image and its label. We apply this function to each item in the dataset using the map method. You can see that format_example is applied to the training set over here, the validation set over here, and the test set in this statement. We get three datasets, train, validation and test, that contain images of the same size along with their labels. So let us run format_example on the raw datasets. Next, let us shuffle and batch the data. We use a batch size of 32, and for shuffling we define a buffer size of 1000. We shuffle only the training data, whereas the validation and test data are not shuffled, and we batch all three datasets with a batch size of 32. Let us inspect a batch of data from the training batches. You can see that in the first training batch we have 32 images, each with a height of 160 and a width of 160 across three channels, so the image batch here is a 4D tensor. Now that we have the data in the desired shape, let us create the model. As we discussed earlier, model creation has two steps: first load the base convolutional model, and then add a classifier layer on top of it. Here we will create the base model from MobileNet version 2, developed at Google. This model was pre-trained on the ImageNet dataset, a large dataset of 1.4 million web images from 1000 classes. ImageNet has categories like jackfruit and syringe, but we will use this ImageNet-trained base here to classify cats versus dogs. First, you need to pick which layer of MobileNet you will use for feature extraction.
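The preprocessing pipeline just described can be sketched as follows. The format_example function mirrors the one in the lecture; the two random images are hypothetical stand-ins for the variable-sized cats-and-dogs pictures, so the sketch runs without downloading the dataset.

```python
import tensorflow as tf

IMG_SIZE = 160  # all images will be resized to 160 x 160

def format_example(image, label):
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1                       # rescale pixels to [-1, +1]
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label

# Hypothetical stand-ins for two variable-sized images (e.g. 500x350 and 400x500).
examples = [(tf.random.uniform((500, 350, 3), maxval=255), 1),
            (tf.random.uniform((400, 500, 3), maxval=255), 0)]

ds = tf.data.Dataset.from_generator(
    lambda: iter(examples),
    output_signature=(tf.TensorSpec(shape=(None, None, 3), dtype=tf.float32),
                      tf.TensorSpec(shape=(), dtype=tf.int32)))

# map, then shuffle (training data only), then batch.
train = ds.map(format_example).shuffle(1000).batch(32)
image_batch, label_batch = next(iter(train))
```

With only two stand-in examples the single batch holds two images, but each is now a uniform 160x160x3 tensor with values in [-1, 1], just as in the lecture.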
Obviously, the very last classification layer is not going to be very useful for this task. Instead, we will follow the common practice of extracting features at the layer just before the flatten operation. This layer is referred to as the bottleneck layer. Bottleneck features retain more generality than those of the final, or top, layer. So let us first instantiate MobileNet with weights pre-trained on ImageNet. We can do that using the tf.keras.applications.MobileNetV2 function. We specify the input shape, we tell the model that we do not want to include the top, or classifier, layer by setting the include_top argument to False, and we specify that we want to use weights from the ImageNet dataset for this base model. Since we set include_top to False, the network does not include the classification layer at the top, which is ideal for feature extraction. The feature extractor converts each 160 x 160 x 3 image into a 5 x 5 x 1280 block of features. Let us look at what this model looks like. When we call summary on the model, we see the complete architecture of MobileNetV2. You can see that it takes a 4D tensor of images of size 160 by 160 across 3 channels, that is, colour images of size 160 by 160, and its final layer produces a 4D tensor of 5 by 5 patches across 1280 channels. We also see the total number of parameters of this model: about 2.2 million. Let us use the base model to generate features. You can see that on the image batch we selected, we computed the features, and for each of the 32 examples in the batch we got a 3D tensor of shape (5, 5, 1280), that is, 5 by 5 patches across 1280 channels. We will freeze the convolutional base created in the previous step and use it as a feature extractor.
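The instantiation just described looks like this. One assumption to note: the lecture passes weights='imagenet' to load the pre-trained weights, while the sketch below uses weights=None only so it can run offline; the architecture, parameter count and output shapes are identical.

```python
import tensorflow as tf

IMG_SHAPE = (160, 160, 3)

# The lecture uses weights='imagenet'; weights=None here skips the download
# but keeps exactly the same architecture.
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights=None)

# Each 160x160x3 image becomes a 5x5x1280 block of bottleneck features.
feature_batch = base_model(tf.zeros((32, 160, 160, 3)))
```

With include_top=False the 1000-way ImageNet classifier is dropped, leaving roughly 2.2 million parameters in the convolutional base, matching the summary shown in the lecture.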
We add a classifier on top of it and train only this top-level classifier. Let us see how to do that in code. We set the base_model.trainable attribute to False. This makes sure that we freeze the convolutional base before we compile and train the model. By freezing, you prevent the weights in a given layer from being updated during training. MobileNet has many layers, so setting the entire model's trainable flag to False freezes all of them. Let us look at the base model summary; we also looked at this summary earlier. I would like you to note down these two numbers over here and compare the trainable parameters after freezing. You can see that after freezing, the number of trainable parameters becomes 0. That means we do not have to train any parameter of this network, and all 2.2 million parameters become non-trainable. In order to generate predictions from the block of features, we average over the spatial 5 by 5 locations using a global average pooling layer, which converts the features to a single vector of 1280 elements per image. So we add the global average pooling layer to the model. Let us look at the shape of the feature batch: each of the 32 images in the batch got converted into a 1D tensor containing 1280 numbers. Finally, we apply a tf.keras.layers.Dense layer to convert these features into a single prediction per image. We do not need an activation function here because this prediction will be treated as a raw logit: a positive number predicts class 1 and a negative number predicts class 0. So let us stack the feature extractor and these two layers using a tf.keras.Sequential model. We define the model to be a tf.keras.Sequential model that contains the base model, which itself is a sequence of convolution and pooling layers according to the MobileNet architecture, followed by the global average pooling layer and then a prediction layer.
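The pooling-and-prediction step can be checked on its own. Here a random tensor stands in for the batch of MobileNet bottleneck features:

```python
import tensorflow as tf

# Stand-in for a batch of MobileNetV2 bottleneck features.
feature_batch = tf.random.normal((32, 5, 5, 1280))

# Average over the 5x5 spatial locations -> one 1280-element vector per image.
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
feature_batch_average = global_average_layer(feature_batch)

# Single raw logit per image: positive -> class 1, negative -> class 0.
prediction_layer = tf.keras.layers.Dense(1)
prediction_batch = prediction_layer(feature_batch_average)
```

The dense layer has 1280 weights plus 1 bias, the 1281 trainable parameters that the next part of the lecture points out in the model summary.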
The prediction layer is a dense layer with a single unit. Let us compile the model before training it. Since there are two classes, we use binary cross-entropy loss, and we use the RMSprop optimizer with a small learning rate. Let us look at the summary of the model. You can see that MobileNetV2 returns a 4D tensor of shape (None, 5, 5, 1280) and has about 2.2 million parameters. The global average pooling returns 1280 numbers per image, and finally we have a dense layer with a single unit. This dense layer receives 1280 inputs, plus 1 bias unit, making 1281 parameters. So the total number of parameters is a little more than 2.2 million, out of which most are non-trainable, and we only have to train the 1281 parameters corresponding to the output layer. Let us train the model for 10 epochs. You can see that after 10 epochs we cross an accuracy of 95 percent. Let us take a look at the learning curves of the training and validation accuracy and loss when using the MobileNet base model as a fixed feature extractor. You can see that the training and validation loss and accuracies are quite close after the 10th epoch. You must be wondering why the validation metrics are better than the training metrics. The main factor is that layers like tf.keras.layers.BatchNormalization and Dropout affect accuracy during training, and they are turned off when calculating the validation loss. To a lesser extent, it is also because training metrics report the average over an epoch, while validation metrics are evaluated after each epoch, so the validation metrics see a model that has trained slightly longer. In our feature extraction experiment, we were only training a few layers on top of the MobileNet base model; the weights of the pre-trained network were not updated during training. The weights were frozen for the base model.
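The compile-and-fit step for the classifier head can be sketched as below. The random features and labels are placeholders for the real pooled bottleneck features, and the exact learning-rate value is an assumption, since the lecture only says "a small learning rate".

```python
import numpy as np
import tensorflow as tf

# The classifier head alone: 1280 features in, one logit out (1281 parameters).
head = tf.keras.Sequential([
    tf.keras.Input(shape=(1280,)),
    tf.keras.layers.Dense(1),
])
head.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),  # assumed value
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])

# Placeholder data standing in for pooled MobileNet features and cat/dog labels.
x = np.random.rand(64, 1280).astype('float32')
y = np.random.randint(0, 2, size=(64,))
history = head.fit(x, y, epochs=2, verbose=0)
```

Note from_logits=True: because the dense layer has no activation, its raw output is a logit, and the loss must be told so.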
One idea to improve the performance is to fine-tune the weights of the top layers of the pre-trained model alongside the classifier layer. So what we are going to do here is unfreeze these layers of the base network and train them along with the classifier layer. Let us see how to specify the unfreezing of these layers in code. We set the trainable property of the base model to True, and then we specify the layer from which we want to unfreeze the network. Here we want to fine-tune from the 100th layer onwards, so we define a fine_tune_at variable: anything after fine_tune_at will be trained, and anything before it stays fixed. So we set the trainable attribute of the layers before the 100th layer to False. There are 155 layers in the base model, out of which we retrain the top 55. It is important to note that as we go deeper into the network, it becomes more specialized to the patterns in its training data. The initial convolution layers learn general, simpler patterns like edges or corners, but as we go deeper, the model becomes specialized to recognizing patterns from its training data. So if we are using a CNN as a pre-trained model for some other task, it is important to cut it somewhere in the middle if the new task is very different from the original training data. For example, if the original training data contains pictures of cats and dogs, whereas the training data for the new task is about machine parts, you might have to start unfreezing quite early in the network, because the original network, trained on cats and dogs, starts recognizing patterns specific to cats and dogs as it goes deeper. So let us use the same settings: binary cross-entropy loss and RMSprop as the optimizer. Let us look at the model summary after unfreezing some layers of the network.
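The unfreeze-from-a-layer idea is just a slicing idiom over model.layers. In this sketch a toy stack of ten dense layers stands in for MobileNetV2's 155 layers, and fine_tune_at = 6 stands in for the lecture's 100; the pattern is the same.

```python
import tensorflow as tf

# Toy stand-in: 10 identical layers instead of MobileNetV2's 155.
base_model = tf.keras.Sequential(
    [tf.keras.Input(shape=(4,))] +
    [tf.keras.layers.Dense(4) for _ in range(10)])

base_model.trainable = True        # first unfreeze everything...
fine_tune_at = 6                   # ...then re-freeze all layers before this index
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False

n_trainable_layers = sum(layer.trainable for layer in base_model.layers)
```

Layers 6 through 9 stay trainable here, exactly as layers 100 through 154 of MobileNetV2 do in the lecture.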
Now you can see that out of the 2.2 million total parameters, the non-trainable parameters have come down to about 395,000, and the rest are trainable. Let us compare this with the model summary from the previous exercise. There we had only 1281 trainable parameters, and all 2.2 million parameters of the base convolutional neural network were non-trainable. But since we have unfrozen some of the layers, the number of trainable parameters has gone up in this case. We should note that we should only attempt fine-tuning after training the top-level classifier with the pre-trained model set to non-trainable. If you add a randomly initialized classifier on top of the pre-trained model and attempt to train all the layers jointly, the magnitude of the gradient updates will be too large and the pre-trained model will forget what it had learned. Let us train the model for 10 more epochs. Here, the parameter to note is initial_epoch. What we do is start the training of the model where we had stopped in the previous run. We had stopped at the 10th epoch, so we initialize training at the 10th epoch and resume from there. You can see that when we run this fit function, training starts from the 11th epoch, and the accuracy has gone up close to 99 percent; it has crossed 98 percent, which is more than 4 percentage points above the previous exercise, where we used MobileNet merely as a feature extraction mechanism. Let us also look at the learning curves. You see a green line over here where we started fine-tuning. To the left of the green line is the performance of the model when we used MobileNet for feature extraction; on the right side, we see the performance after starting the fine-tuning exercise.
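The initial_epoch bookkeeping can be seen with any tiny model; the data below is random and only the resume mechanics are the point. Note that epochs is the *total* epoch count to reach, not the number of additional epochs.

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(16, 4).astype('float32')
y = np.random.randint(0, 2, size=(16,))

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='rmsprop',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# First run: two epochs (indices 0 and 1, displayed as epochs 1-2).
history = model.fit(x, y, epochs=2, verbose=0)

# Resume where we stopped: epochs=4 is the total, so this runs epochs 2 and 3,
# mirroring how the lecture resumes fine-tuning from the 11th epoch.
history_fine = model.fit(x, y, epochs=4, initial_epoch=2, verbose=0)
```

The History objects record which epoch indices actually ran, which is what makes the combined learning curves line up across the two runs.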
You can see that after fine-tuning, the training accuracy has surpassed the validation accuracy, and the training and validation losses have converged to almost the same number. So, in this module we learnt how to use a pre-trained machine learning model as a feature extractor, and we also learnt how to fine-tune a pre-trained model. Hope you had fun learning these concepts. In the next session, we will use TensorFlow Hub to load some pre-trained models and use them for feature extraction as well as fine-tuning. See you in the next session. Thank you.