Namaste, welcome to the next module of our course. In this module, we will understand underfitting and overfitting through code. You know that overfitting happens when our model has excess capacity to memorize the entire training data. If you look at the learning curves, the training error and validation error both reduce to begin with; after a point the training error keeps going down, but the validation error starts climbing up. If we see that kind of learning curve, we infer that the model is suffering from overfitting. On the other hand, if your model is so simple that it does not have enough capacity to learn the patterns in the training data, then the model suffers from underfitting. In case of underfitting, both the training and validation errors are high.

In this lab, we will use the IMDB movie review dataset to demonstrate underfitting and overfitting. We will first build a baseline model, and then a couple of models that underfit and overfit the training data. Later in the lab, we will demonstrate some of the techniques that help us overcome the underfitting and overfitting problems. Let us begin.

Let us first connect to the Colab runtime. We are connected. Let us install TensorFlow 2.0 and print the TF version to ensure that we have the right TensorFlow version installed. We will be using the tf.keras API as in the earlier Colabs, which is why we import keras; we use numpy and matplotlib for data manipulation and plotting respectively. From the print command you can see that we now have TensorFlow 2.0 installed in our Colab runtime.

The next step is to set up the training data. We are using the IMDB dataset for training. The IMDB dataset consists of movie reviews, and each movie review is tagged with a label 0 or 1: 0 meaning the review is negative and 1 meaning the review is positive. So it is the problem of identifying whether a movie review is positive or not. The IMDB dataset is available in keras.datasets, so we do not have to write a lot of code to load it. We will use the top 10,000 words, and we will use a multi-hot encoding where each review is turned into a vector of 0s and 1s in a 10,000-dimensional space: if a word is present in the review, we put a 1 at the position corresponding to its index. Concretely, if a review contains the words with indices 3 and 5, it gets converted into a 10,000-dimensional vector where all the entries are 0 except for indices 3 and 5. In Keras, the IMDB dataset is stored as sequences of word indices, so initially each movie review is a sequence of numbers.
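For reference, here is a minimal sketch of the environment setup just described, assuming a Colab runtime; the exact pip version pin is an assumption and may differ from the one used in the original notebook.

```python
# Environment setup (sketch): install TensorFlow 2.0 and import the libraries
# used throughout this lab. The exact version pin is an assumption.
!pip install -q tensorflow==2.0.0

import tensorflow as tf
from tensorflow import keras      # tf.keras API for building models

import numpy as np                # data manipulation
import matplotlib.pyplot as plt   # plotting

print(tf.__version__)             # confirm that a 2.0.x build is installed
```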
As you can see in the code, we use a multi-hot encoding with a vocabulary of 10,000 words. We use the load_data command to load the training and test data along with their labels, and then we define a function for converting each sequence of word indices into a multi-hot vector. This function takes a sequence and a dimension as arguments and fills a vector of dimension 10,000: for every word index present in the review, it sets a 1 at that position, and everything else stays 0. We then convert the training data and the test data into multi-hot encodings using this multi_hot_sequences function, with the number of dimensions set to 10,000. Let us run this; you can see that the data is downloaded, loaded into memory, and converted into the multi-hot encoding.

Let us look at some of the resulting multi-hot vectors. Here we have plotted the multi-hot vector for the first training example: the indices where a word is present have value 1 and everything else is 0. You can see that many words with small indices are present in this particular example, and fewer words with larger indices.

Now that the data is loaded, let us build a model. We will first build a baseline model, a neural network with two hidden layers of 16 units each. In the baseline model, the multi-hot encoded vector with 10,000 inputs is fed into the first hidden layer, then into the second hidden layer, and then into a single output unit which gives us the probability of the review being positive. Let us see how to set this up in the code. We build a keras.Sequential model where we stack layers one over the other. We use two dense layers, each with 16 units and ReLU activation; for the first layer we have to specify the input shape, which is 10,000. The output layer has one unit with sigmoid activation. We use Adam as the optimizer and binary cross entropy as the loss, because we are solving a binary classification problem, and we will track both accuracy and binary cross entropy during training.
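Putting these steps together, here is a sketch of the data preparation and baseline model along the lines of what the notebook does; names such as NUM_WORDS and baseline_model are assumed for illustration.

```python
NUM_WORDS = 10000  # vocabulary size: keep only the top 10,000 words

# Load IMDB reviews as sequences of word indices, restricted to the top words.
(train_data, train_labels), (test_data, test_labels) = keras.datasets.imdb.load_data(
    num_words=NUM_WORDS)

def multi_hot_sequences(sequences, dimension):
    """Convert each sequence of word indices into a multi-hot vector."""
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        results[i, word_indices] = 1.0  # 1 at every index present in the review
    return results

train_data = multi_hot_sequences(train_data, dimension=NUM_WORDS)
test_data = multi_hot_sequences(test_data, dimension=NUM_WORDS)

plt.plot(train_data[0])  # inspect the multi-hot vector of the first review

# Baseline model: two hidden layers of 16 units each and a sigmoid output unit.
baseline_model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
baseline_model.compile(optimizer='adam',
                       loss='binary_crossentropy',
                       metrics=['accuracy', 'binary_crossentropy'])
baseline_model.summary()
```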
The model summary is a useful way to understand what is going on in the model and to check whether it is set up as per our expectation. You can see that there are three layers: the first hidden layer has 16 units, the second also has 16 units, and the output layer has one unit. Each unit in the first hidden layer receives 10,000 inputs and has a bias, which gives roughly 160 k parameters for that layer. Each unit in the second layer receives 16 inputs, the outputs of the first layer, plus a bias, which gives 272 parameters, and the final layer has 16 inputs plus 1 bias unit, so there are 17 parameters in the last layer.

Let us try to understand this a bit more carefully. We have 10,000 inputs, and each of them is connected to each of the 16 units in the first hidden layer. If we focus on one unit and count its parameters, we can readily derive the count for all of them: a single hidden unit has 10,000 weights, there are 16 such units, and each of these units also has a bias term, so we add 16. That gives 16 × 10,000 + 16 = 160,016 parameters for the first hidden layer. Now consider the second hidden layer, which also has 16 units. Each unit in the second layer receives 16 inputs, one from each hidden unit in the previous layer, so it has 16 weights; there are 16 such units, plus 16 bias terms, which comes to 16 × 16 + 16 = 272. Finally, the last node gets an input from each of the 16 units of the second layer and has its own bias term, so the number of parameters in the last layer is 16 + 1 = 17. If we tally this with the summary, these are exactly the numbers we derived, and summing them up the model has 160,016 + 272 + 17 = 160,305 parameters in total.

After setting up the model, let us fit it on the training data and validate it on the test data. We will train the model for 20 epochs with a batch size of 512, using the Adam optimizer and, just to remind you, binary cross entropy as the loss that will be optimized. Let us train the model.
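Here is a sketch of the training call just described; validating on the test set and the name baseline_history follow the narration, and the comment recaps the parameter arithmetic from above.

```python
# Parameter counts from the summary: 160,016 + 272 + 17 = 160,305 in total.
baseline_history = baseline_model.fit(train_data, train_labels,
                                      epochs=20,
                                      batch_size=512,
                                      validation_data=(test_data, test_labels),
                                      verbose=2)
```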
You can see that as we train the model, the training loss keeps reducing and the validation loss also reduces to begin with, but after the third epoch the validation loss starts going up while the training loss is still going down; the validation binary cross entropy shows the same behaviour, it initially went down and then started climbing. So what does this point to? I would encourage you to stop here and think about it. This is the signature of overfitting.

We also see that after 20 epochs our training accuracy is 1. We got a perfect classifier, time to celebrate. But hold on, the accuracy on the validation set is not so great: on validation we only get about 85% accuracy, while on training we get 100%. Our job is to build a model that works well on future data, not one that merely works well on the training data, so there is a problem which we need to fix.

One way to address overfitting is by reducing the number of parameters in the model. In a neural network it is easy to reduce the complexity: we can either reduce the number of units in each layer or remove a layer altogether, which results in fewer parameters. Another way to reduce overfitting is to get more training data. If more training data is not available, then we have regularization techniques that we can use: L1 or L2 regularization, or dropout regularization, which is used in the context of neural networks. In L1 and L2 regularization, we add penalties that depend on the weights of the model: L1 regularization adds a penalty proportional to the sum of the absolute values of the parameters, while L2 regularization adds a penalty proportional to the sum of squares of the parameters. In dropout, we randomly drop certain nodes from the hidden layers or the input layer of the neural network; we normally set the dropout rate between 20 percent and 50 percent, meaning 20 to 50 percent of the nodes are randomly dropped in each training iteration. Dropout, like the other regularization penalties, is used only while training; at test time we simply apply the test data to the full model.

Following the first strategy, let us create a smaller model. Instead of 16 units per hidden layer, we use 4 units, so the number of parameters goes down. Let us run this and see how many parameters there are. You can see that the parameter count has come down roughly four-fold: we had about 160 k parameters, and just by reducing the number of units in each layer we got it down to about 40,000. I would encourage you to check for yourself that this parameter count is correct, along the lines of the calculation we did earlier.

Let us train this smaller model and see what happens. You can see that even this smaller model seems to overfit: the training accuracy gets close to 1, but the validation accuracy is quite low, which means the model is probably memorizing the training data; it still has enough capacity to do so. Going back through the log, you can see where it started overfitting: the validation loss keeps going down and then, around epoch 7, it starts going up. If you go back and check when our baseline model started overfitting, it was right from the third epoch. So by creating a smaller model we are able to delay the overfitting, and if we stop training around the fifth epoch we should still be fine; we will not see the overfitting that shows up by epoch 20.

Now that we have created a smaller model, let us go to the other extreme, create a bigger model, and see how fast it overfits. How do we create a bigger model in a neural network? We can simply add more units in each layer or add more layers. That way the model has more capacity, that is, more parameters, and with more parameters and not enough training data the model tends to memorize the training data, giving perfect output on the training set but performing poorly on unseen data. Let us compile this model; you can see that we now have far more parameters.
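Sketches of the smaller and bigger models just discussed: the smaller model uses 4 units per hidden layer as in the lecture, while the bigger model's width of 512 units is an assumption chosen to land in the few-million-parameter range mentioned next.

```python
# Smaller model: 4 units per hidden layer (about 40k parameters).
smaller_model = keras.Sequential([
    keras.layers.Dense(4, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(4, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
smaller_model.compile(optimizer='adam',
                      loss='binary_crossentropy',
                      metrics=['accuracy', 'binary_crossentropy'])
smaller_model.summary()
smaller_history = smaller_model.fit(train_data, train_labels,
                                    epochs=20, batch_size=512,
                                    validation_data=(test_data, test_labels),
                                    verbose=2)

# Bigger model: much wider hidden layers, several million parameters.
# The width of 512 is an assumed value, not taken from the lecture.
bigger_model = keras.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
bigger_model.compile(optimizer='adam',
                     loss='binary_crossentropy',
                     metrics=['accuracy', 'binary_crossentropy'])
bigger_model.summary()
```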
There are close to 5.3 million parameters, as against the roughly 160 k parameters of our baseline model. Let us train this model for 20 epochs with the same batch size of 512. You can see that the model starts overfitting right away: the validation loss starts increasing from the second epoch, going from 0.29 after the first epoch up to 0.33 and then continuing upward, while the training loss keeps coming down and the training accuracy keeps going up as the validation accuracy falls. These are the signs of a model with quite a large, excess capacity, one that is prone to overfitting very early in the training cycle. After training for 20 epochs, the training loss is very small, close to 0, but the validation loss is quite high; it never reduced, it kept growing right from the second epoch. So the bigger model overfits very quickly.

Now let us compare how the size of the model affects overfitting. There are 6 lines in this plot: the solid lines show the training loss and the dotted lines show the validation loss. On the x axis we have epochs; one epoch is one full pass over the training set, and since we are using the same batch size, all the models have the same number of iterations per epoch. On the y axis we have the binary cross entropy loss. The green lines correspond to the bigger model, the blue lines to the baseline model, and the orange lines to the smaller model. We have to keep an eye on the dotted lines, because those give us the validation loss, and we infer that there is overfitting when the validation loss starts going up while the training loss keeps going down. For the smaller model, both the validation and training losses go down initially, and around the eighth epoch the validation loss starts climbing, so we can say the smaller model starts overfitting after about 8 epochs. The baseline model starts overfitting around the third epoch: initially both losses go down, but after the third epoch the validation loss just keeps going up. The bigger model overfits almost instantly, right from the first epoch: its training loss comes down, but its validation loss never does. So the smaller the model, the longer it takes to start overfitting.
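A sketch of the comparison just described: train the bigger model the same way as the others and plot training (solid) and validation (dotted) binary cross entropy for all three models. The helper function and history names are assumptions for illustration.

```python
bigger_history = bigger_model.fit(train_data, train_labels,
                                  epochs=20, batch_size=512,
                                  validation_data=(test_data, test_labels),
                                  verbose=2)

def plot_history(histories, key='binary_crossentropy'):
    """Plot training (solid) and validation (dotted) curves for several models."""
    plt.figure(figsize=(16, 10))
    for name, history in histories:
        # dotted line: validation loss
        val = plt.plot(history.epoch, history.history['val_' + key],
                       '--', label=name.title() + ' Val')
        # solid line in the same colour: training loss
        plt.plot(history.epoch, history.history[key],
                 color=val[0].get_color(), label=name.title() + ' Train')
    plt.xlabel('Epochs')
    plt.ylabel(key.replace('_', ' ').title())
    plt.legend()

plot_history([('baseline', baseline_history),
              ('smaller', smaller_history),
              ('bigger', bigger_history)])
```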
Now that we have visualized overfitting, how can we prevent it? If you can get more data, overfitting can be prevented, but getting more data is not always an option: it is costly, and sometimes it is simply not possible. The second resort is to add regularization, a penalty term, to the loss function. There are multiple ways of performing regularization; two of them are based on the L1 and L2 penalties. L1 regularization adds a cost proportional to the sum of the absolute values of the weight coefficients, whereas L2 regularization adds a cost proportional to the sum of squares of the weight coefficients.

Let us see how to add L1 and L2 regularization to the neural network through the tf.keras API. While defining the model, we pass a kernel_regularizer argument to each dense layer; the regularizer can be either L1 or L2, and here we use L2. We also specify the regularization rate: the penalty added to the loss function is the regularization rate times the L2 penalty, and here the rate is 0.001. We use the L2 regularizer in both hidden layers in order to prevent overfitting. That is how you define an L2 regularizer; we can then compile the model and train it. You can add L1 regularization in the same manner, and I would encourage you to try it yourself.

Here we added the regularization to the baseline model. Now let us compare the training of the regularized model with the baseline model by plotting the loss against the epochs. The blue lines correspond to the baseline model and the orange lines correspond to the baseline model with L2 regularization. You can see that the baseline model started overfitting around the third epoch, whereas the regularized model took somewhat longer to start overfitting. The L2-regularized model is more resistant to overfitting than the baseline model, even though both models have the same number of parameters.

The other way of regularizing a neural network is by adding dropout. Dropout randomly drops a number of output features of a layer during training. Dropout is not applied at test time; it is very important to note that at test time we do not drop any units. Instead, the layer's output values are scaled down by a factor equal to the dropout rate. Let us see how to use dropout in the context of neural networks. Keras has a Dropout layer which takes the dropout rate as its parameter, and it applies to the layer specified just before it: the first dropout of 0.5 is applied to the first hidden layer and the next dropout of 0.5 is applied to the second hidden layer. After specifying the dropout we can again train the model and compare its performance with the original baseline model, and also with the L2-regularized model. You can see that adding L2 regularization marginally improved the model's resistance to overfitting, but after applying dropout there is a substantial improvement: the model now overfits only after a few more epochs than the original model.
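Sketches of the two regularized variants described above: the baseline architecture with an L2 penalty (rate 0.001) on each hidden layer's weights, and the baseline with a dropout rate of 0.5 after each hidden layer; the variable names are assumed. Both can then be trained with the same fit arguments as the baseline and compared using the plotting helper shown earlier.

```python
# L2-regularized baseline: each hidden layer adds 0.001 * sum(w^2) to the loss.
l2_model = keras.Sequential([
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
l2_model.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy', 'binary_crossentropy'])

# Dropout variant: each Dropout layer randomly zeroes 50% of the outputs of
# the layer just before it, during training only.
dropout_model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation='sigmoid')
])
dropout_model.compile(optimizer='adam',
                      loss='binary_crossentropy',
                      metrics=['accuracy', 'binary_crossentropy'])
```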
To recap, we built a baseline model on the IMDB dataset and then a very big model that overfit the data. In order to prevent the overfitting we first reduced the capacity of the model by building a smaller model, and we also added regularization: we studied L1 and L2 regularization and then dropout regularization. These are the common things that are done to prevent overfitting in a neural network. The first strategy is to get more training data if that is possible. If more training data is not available, we can reduce the capacity of the model, or add L1 or L2 regularization, or dropout; dropout is in fact the most commonly and most effectively used regularization technique in neural networks. Apart from that, we can also use data augmentation to generate more training data from the available data, and batch normalization, to help prevent overfitting. I hope you understood underfitting and overfitting through this coding exercise and had fun learning these concepts. Thank you.