In this session, we will build a deep learning model for regression. In the last session, we built a deep learning model for classifying fashion articles into one of 10 categories. In this session, we will build a regression model for predicting the fuel efficiency of vehicles. In a regression problem, the output is a real number, such as a price or a fuel efficiency measured in miles per gallon. Contrasting this with a classification problem, where we aim to select a class from a list of classes, a regression problem strives to predict a real number. In this exercise, we will use the classic Auto MPG dataset and build a model to predict the fuel efficiency of 1970s and 80s automobiles. To do this, we will provide the model with descriptions of many automobiles from that time period. These features include the cylinders, displacement, horsepower and weight of the automobile. As usual, before starting the notebook, let us connect to a Colab runtime and install any useful packages that are not already present. First we will install a package called seaborn for plotting a pair plot. Once seaborn is installed, we will import some plotting libraries like matplotlib.pyplot, we will import pandas for manipulating data, we will import seaborn for the pair plot, and we will also install TensorFlow 2.0 and import the tensorflow library. We will also import keras and layers from tensorflow. Let us run this cell. At the end of this cell, we make sure that we have the right version of TensorFlow present in our Colab runtime. We ensure that by printing the TF version, which is 2.0.0-beta1, the desired version for this exercise. The Auto MPG dataset is available from the UCI machine learning repository. Our first job is to get this data. In the last exercise, the dataset was present in TensorFlow Datasets, so it was easy for us to import and load the data into TensorFlow. The Auto MPG dataset is not available there, so we will first download the file, then load the data using pandas, and finally feed the pandas DataFrame into TensorFlow. Let us first understand that process. As the first step, we will download the data using the keras.utils.get_file function. The first argument is where we want to store the data and the second argument is the URL of the file containing the Auto MPG data. Let us run this cell, which will download the data; the data is now present in auto-mpg.data, as printed by the dataset path. Let us import the data using a pandas DataFrame. We will first list out the column names: miles per gallon, cylinders, displacement, horsepower, weight, acceleration, model year and origin. We will use the read_csv function of pandas because this data is in a CSV-like format. This function takes the dataset path and the column names, and we tell it how to handle missing values: question marks in the file are to be treated as NA values. We want to ignore anything after a tab, which we specify using the comment argument. The separator is a space, and we want to skip initial spaces in the file. After specifying all these arguments, we are able to read the contents of the Auto MPG dataset into a pandas DataFrame named raw_dataset.
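A minimal sketch of the setup and data-loading cells described above. The install commands, column names, UCI URL and variable names follow the standard Auto MPG regression tutorial that this session appears to be based on; treat them as assumptions if your notebook differs.

```python
# Install seaborn (and, on the original runtime, the TF 2.0 beta).
!pip install -q seaborn
# !pip install -q tensorflow==2.0.0-beta1   # only needed on the original Colab runtime

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)  # 2.0.0-beta1 in the original exercise

# Download the Auto MPG data file from the UCI repository.
dataset_path = keras.utils.get_file(
    "auto-mpg.data",
    "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")

# Load it into a pandas DataFrame; '?' marks missing values in this file.
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                          na_values="?", comment='\t',
                          sep=" ", skipinitialspace=True)
```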
We will make a copy of raw_dataset into a DataFrame called dataset. Let us run this code cell and examine what is in the dataset. We printed the last 5 rows of the dataset using the dataset.tail() command and you can see features like MPG, cylinders, displacement, horsepower and so on. Most of the features are numeric in nature, whereas origin is a discrete feature: it only takes the values 1, 2 or 3. So this part takes care of downloading the data and loading it into a pandas DataFrame. Once the data is loaded, our next job is to clean the data, remove null values and perform normalization on the dataset. Let us take steps to pre-process the data. First, let us find out whether the dataset has any null values. We essentially use the isna() function followed by the sum() function to identify columns where null values are present. As you can see on the screen, the horsepower column has a few null values. In order to keep this exercise simple, we will simply drop all the rows containing null values using the dropna() function. Next, we will convert the categorical attribute into a one-hot encoding. We just saw that origin is a categorical attribute, and we will use one-hot encoding to convert it into something usable by the machine learning algorithm. Let us pop the origin column from the dataset and replace it with 3 columns, because origin originally has 3 values: USA, Europe and Japan. In the original column, USA is coded as 1, Europe as 2 and Japan as 3. Let us do this step of converting the categorical attribute into one-hot columns: whenever the original value is USA, you will see a 1 in the USA column (and, since Europe and Japan are obviously not present, 0 in those columns); whenever the value is Europe, you will see a 1 in the Europe column; and a 1 in the Japan column corresponds to the third value in the original dataset. This is how we converted our categorical attribute into numeric one-hot columns and obtained a transformed dataset. Now that we have a preprocessed and cleaner dataset, the next job is to split the data into training and test sets. We will simply use the sample function on the DataFrame, where we specify the fraction of examples that we want in the training set, and we set a random seed so that every time we run this Colab we get essentially the same training set. Let us run this command: it selects 80 percent of the examples for training, then anything that is in the training set is dropped and the remaining examples are copied into test_dataset. After running this cell we have two datasets, train_dataset and test_dataset, containing training and test examples respectively. Let us have a quick look at the joint distribution of a few pairs of columns from the training set. We will use sns.pairplot for plotting the joint distribution of a few pairs of columns, and we do that for columns like MPG (miles per gallon), which is the column that we want to predict, cylinders, displacement and weight. Let us run this code cell and look at the pair plot.
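A sketch of the cleaning, one-hot encoding, splitting and pair-plot cells described above, again following the standard Auto MPG tutorial; the column and variable names are assumptions carried over from that notebook.

```python
dataset = raw_dataset.copy()
print(dataset.tail())          # last 5 rows

# Count missing values per column, then drop the affected rows.
print(dataset.isna().sum())
dataset = dataset.dropna()

# One-hot encode the categorical 'Origin' column (1 = USA, 2 = Europe, 3 = Japan).
origin = dataset.pop('Origin')
dataset['USA'] = (origin == 1) * 1.0
dataset['Europe'] = (origin == 2) * 1.0
dataset['Japan'] = (origin == 3) * 1.0

# 80/20 train/test split with a fixed seed so the split is reproducible.
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

# Joint distribution of a few pairs of columns from the training set.
sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]],
             diag_kind="kde")
```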
In this pair plot, on the diagonal you can see the distribution of values in each of the features: the distribution of values in MPG, the distribution of values of the cylinders feature, the distribution of values in displacement, and the distribution of values in weight. The off-diagonal elements show the relationship between the feature on the row and the feature on the column. For example, one plot shows the relationship between cylinders and MPG, another shows the relationship between cylinders and displacement, and another shows the relationship between weight and displacement. In this plot there is hardly any clear relationship visible between cylinders and MPG, so cylinders and MPG seem to be uncorrelated, whereas displacement and weight have some correlation with MPG: as weight goes up, MPG seems to come down, and the same thing happens with displacement, as displacement increases, MPG seems to come down. You can also see that weight and displacement are correlated features, because as the weight increases, displacement also tends to increase. So we can get some useful insights into the features, and into how these features affect the outcome, by looking at this pair plot. Let us look at the overall statistics of the data using the describe command. We first use describe on the training dataset to obtain the training stats, we remove the field for miles per gallon because that is the field we want to predict, and we transpose the stats so that we can display them in a nice tabular fashion. You can see all the features of the model on the rows, and on the columns we have various statistics for each of the features. There are 314 rows in the training set, which is why the count for each feature is 314. Then we can look at the mean of every feature, the standard deviation, and the min, max and a few quantiles in between. As you can see, different features are on different scales: for example, the minimum number of cylinders is 3 and the maximum is 8, while the minimum weight is 1649 and the maximum weight is 5114. So the features are on different scales, and in order to bring them onto the same scale we will later perform a normalization operation. Before doing normalization, we will split the features from the label. Right now, our train_dataset has features as well as labels, so we will use the pop command to remove the column corresponding to the label. We essentially remove the miles per gallon column, which gives us the training labels, and using the pop command on the test data we get the test labels. Let us normalize the data, leaving aside the label column. This normalization is a very important process: we want to make sure that we perform exactly the same normalization on the test data as on the training data, and that we apply the same normalization to any other future data coming to the model for prediction. In order to do that, we use the train statistics we have already calculated and define a normalization function that does z-score normalization, which is defined as the value minus the mean divided by the standard deviation.
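A sketch of the statistics, label separation and z-score normalization cells described above, with the variable names (train_stats, normed_train_data, and so on) assumed from the standard tutorial.

```python
# Overall statistics of the training features.
train_stats = train_dataset.describe()
train_stats.pop("MPG")            # MPG is the target, not a feature
train_stats = train_stats.transpose()

# Split the label (MPG) away from the features.
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

# z-score normalization, using the *training* statistics for both sets.
def norm(x):
    return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
```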
So we perform normalization on the training data as well as the test data, and we will also store these normalization parameters, which are present in train_stats, and apply them for prediction and for any other test instances. Now that we have explored the data and pre-processed it, the next step is to build a model. Here we will build a neural network model to perform the regression task. Let us look at the architecture of the neural network model that we will be building. Before proceeding further, I would like you to pay attention to the fact that we have 1, 2, 3, 4, 5, 6, 7, 8, 9, in all 9 features in our model. We will essentially use a Keras sequential model, a feed-forward neural network with a couple of hidden layers. The first hidden layer is a dense layer with 64 units. The second layer is another dense layer with 64 units. In both layers we use ReLU activation, and finally we have a dense layer with exactly one unit. Let us look at this architecture on the board. Coming back to the neural network, we have to fix its architecture: we need to specify the number of hidden layers, the number of units in each hidden layer, what the input layer looks like, and the output layer. Here we are using 2 hidden layers, each with 64 units, we have 9 input features, and the output layer has just one unit. Let us draw this pictorially. We have an input layer, a first hidden layer containing 64 units, a second hidden layer which also has 64 units, and finally one unit in the output layer. These are the hidden layers, this is the output layer, this is the input layer, and we are using dense layers. In a dense layer, we connect each node from the previous layer to each of the 64 units in the current layer. There is an additional bias on each of the hidden units, which I am not showing here explicitly. So, writing down the number of parameters: there is one parameter on each of the arrows plus one additional bias parameter. Here we have 9 plus 1, that is 9 coming from the input plus 1 coming from the bias, so we have a total of 10 parameters per unit in the first hidden layer. In the second hidden layer we have 64 inputs from the previous layer plus 1 bias parameter, so we have 65 parameters per hidden unit, and in the output layer we have 64 inputs plus 1 bias, which is again 65 parameters. The second thing we have to take care of is what kind of activations we will use: we use ReLU as the activation function in both hidden layers, and in the output layer we do not explicitly specify an activation function, so by default a linear activation function is chosen. So whatever linear combination we saw in the last lectures, essentially the sum of w_i times x_i with i going from 1 to 64 in this case, is output as it is from this node, a linear activation. This is how the model will look. Let us code this model using the tensorflow.keras API. You can see here we wrote a function called build_model, and this part of the code builds the model.
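As a quick check of the parameter counting worked out on the board, here is a small back-of-the-envelope calculation, assuming 9 input features and two hidden layers of 64 units each; these totals should match the model summary shown later.

```python
# One weight per incoming connection plus one bias per unit.
n_features = 9
hidden1 = (n_features + 1) * 64   # 10 parameters per unit -> 640
hidden2 = (64 + 1) * 64           # 65 parameters per unit -> 4160
output  = (64 + 1) * 1            # 65 parameters
print(hidden1, hidden2, output, hidden1 + hidden2 + output)  # 640 4160 65 4865
```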
layers.Dense defines the first layer, where we use 64 dense units with ReLU activation, and the input shape is exactly equal to the number of features in the training dataset. Next we stack another dense layer with 64 units, which also uses the ReLU activation function, and finally we have an output layer containing one dense unit. After specifying the model, let us also specify the optimizer and the loss function. In the case of regression we use mean squared error as the loss function, which is the standard loss function for regression, and we use RMSprop as the optimizer with a learning rate of 0.001, and we track metrics like mean absolute error and mean squared error. Let us look at mean squared error and mean absolute error and what they mean mathematically. In the regression framework, this is our training data, for example, and let us say this is the model that we learnt. We end up making some error on each of the points, where the point on the line is the predicted value and the point denoted by a cross is the actual value: there is some difference here, some difference here, and so on. If we take the ith true value y_i, subtract the predicted value for the ith example, and square the difference, that defines the squared error per example. If we sum it across all the examples and divide by the number of examples, we get the mean squared error. If instead of squaring we just take the absolute value of the difference between the true value and the predicted value and average it out, that is called the mean absolute error. Let us go back to Colab and now build the model. Let us run this code cell and build a model. Now that we have a model, let us see how it looks. This is the model summary: we have a sequential model, a feed-forward neural network, using essentially 3 dense layers, 2 of which are hidden layers; the first dense layer has 64 outputs, and the last layer is the output layer with exactly one output. The output layer has 65 parameters, while the second hidden layer has 65 times 64 parameters. Now that we have built a model, let us try the model on a few examples. Let us take a batch of 10 examples from the training data and call model.predict on it. Mind you, we have not yet trained the model, but we are using randomly initialized weights and checking whether model.predict works. Yes, it seems to be working, and it produces a result of the expected shape and type. We use this to make sure that the model has been set up properly and that all the tensor shapes are as expected. Let us train the model for 1000 epochs and record the training and validation metrics in the history object. We use a PrintDot callback here so that a dot is printed each epoch to show the progress of training. We give the normalized training data and the corresponding labels as input to the model.fit function, and we record everything in the history object returned by model.fit. Let us train the model.
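A sketch of the build_model function, the sanity-check prediction on a small batch, and the 1000-epoch training run described above. The structure follows the standard Keras regression tutorial; the PrintDot callback and variable names are assumed from that notebook.

```python
def build_model():
    model = keras.Sequential([
        layers.Dense(64, activation='relu',
                     input_shape=[len(train_dataset.keys())]),  # 9 features
        layers.Dense(64, activation='relu'),
        layers.Dense(1)                       # linear activation by default
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mse', optimizer=optimizer, metrics=['mae', 'mse'])
    return model

model = build_model()
model.summary()

# Sanity check: predictions on a batch of 10 examples with untrained weights.
example_batch = normed_train_data[:10]
print(model.predict(example_batch))

# Print a dot per epoch to track progress during the long training run.
class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0:
            print('')
        print('.', end='')

EPOCHS = 1000
history = model.fit(normed_train_data, train_labels,
                    epochs=EPOCHS, validation_split=0.2,
                    verbose=0, callbacks=[PrintDot()])
```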
The model is training, as you can see from the progress, and once it is trained we will visualize the history data; it seems training is now complete. Let us look at the last 5 rows of the history data; the history will have 1000 rows, one for every epoch. It is interesting to note that the loss fluctuates quite a bit: it goes up, then comes down, then goes up again. Let us look at what is happening to the validation loss. The validation loss seems to go up, then come down, and here it has gone up again. Let us plot the history and see how it looks. Interesting: in terms of mean absolute error the training error is going down but the validation error is going up, and we observe the same thing in the mean squared error, where the training error is going down but the validation error does not seem to be improving. This points to some kind of overfitting on this dataset. In order to overcome this overfitting problem, we will use early stopping as a means of correcting the issue: we will keep an eye on the validation error and use the early stopping callback to stop training before the model starts overfitting. The early stopping callback is set up in this way: we use keras.callbacks.EarlyStopping, where we monitor the validation loss and we wait for 10 epochs; if the validation loss does not improve within 10 epochs, we decide to stop training. Let us train the model again with the early stopping callback and the PrintDot callback, and plot the history to see how the model does this time. Nice, this model looks much better compared with the earlier one. In the earlier model, the training error kept improving while the validation error became worse or at best stayed the same, which was an overfitting problem. Regularization is one way of addressing overfitting, and early stopping is one such regularization technique. After applying early stopping, we see that both the training loss and the validation loss (or validation error) improve as we keep training for more epochs, and we stopped once the validation loss had not improved for 10 epochs; that is exactly what early stopping does. We can also see that the training and validation errors are now pretty close, both in mean absolute error and in mean squared error. Let us evaluate the model on the test data and obtain numbers like the loss, mean absolute error and mean squared error. We get a mean absolute error of 1.85 miles per gallon. Now that we have a model, we will use it to predict the miles per gallon for the test data. Here we give the normalized test data as input and ask the model to predict miles per gallon. Let us plot the predicted values of miles per gallon against the true values. You can see that in most cases the true value and the predicted value are actually close, and our model seems to be working reasonably well. Let us also plot the error distribution, that is, the distribution of the prediction errors.
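A sketch of the early-stopping run, the test-set evaluation, and the prediction and error plots described above, assuming the build_model function, PrintDot callback and data variables from the earlier sketches.

```python
model = build_model()

# Stop training when the validation loss has not improved for 10 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(normed_train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=0,
                    callbacks=[early_stop, PrintDot()])

# Evaluate on the held-out test set.
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=2)
print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))

# Predict MPG for the test examples and plot predictions against true values.
test_predictions = model.predict(normed_test_data).flatten()
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')

# Histogram of the prediction errors.
error = test_predictions - test_labels
plt.figure()
plt.hist(error, bins=25)
plt.xlabel("Prediction Error [MPG]")
```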
The error distribution is not really Gaussian, but we might expect that, because we have a very small number of samples: after cleaning, the dataset has only a few hundred rows, of which 80 percent (the 314 rows we saw in the training statistics) were used for training and only 20 percent for test. This brings us to the end of the regression exercise. In this exercise we learnt how to use a deep neural network model for the task of regression. We also studied how to prevent overfitting using early stopping as a regularization mechanism. In the coming session we will look at how to use deep neural network models to make predictions on structured data. We will also study how to store and retrieve a model for deployment purposes, and study underfitting and overfitting through some practical examples.