 Namaste, welcome to the next module of the practical machine learning course. In this module we will study techniques for model evaluation. So far we have looked at the other steps in machine learning: data gathering, transformation, defining models, defining loss functions and training the model. The final step in the machine learning pipeline is model evaluation. In model evaluation we are concerned with the question of how well the model will work on unseen data, and we are also interested in diagnosing problems like underfitting and overfitting and taking corrective action. Finally, we will also study a bunch of evaluation measures that are established in practice. Let us begin. So, as I said, we expect the model to work well on unseen data. Now, how do we really test the performance of the model on unseen data? More than unseen data, we really want the model to work well on future data. We train the model on past data, but we want it to work on future data, and we do not have the future data in hand. So, how do we solve that problem? One way is to take the entire available data and partition it into two parts: a training set and a test set. We use a large chunk of the data for training and hold out a small percentage for testing. Usually we keep 80 percent of the data for training and 20 percent for testing. The model is trained only on the training data; the test data is never exposed to the model. If by any chance your model gets to see the test data, then the performance you report on the test set will not stand up to scrutiny, in the sense that you might get an overestimate of the model's performance if the test data is accidentally leaked to the model at training time. Now, how do we measure the performance of the model? There are standard measures of performance.
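As a small sketch of this hold-out idea, here is how one might shuffle and partition a dataset in Python. The function name and the 80/20 default are just illustrative assumptions for this lecture, not a standard API:

```python
import random

def holdout_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and hold out a fraction as an unseen test set."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_fraction)
    test_idx = set(indices[:n_test])   # these examples are never shown to training
    train = [d for i, d in enumerate(data) if i not in test_idx]
    test = [d for i, d in enumerate(data) if i in test_idx]
    return train, test
```

The key point is that the test portion is set aside before any training happens, so it can stand in for future, unseen data.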
So, in case of regression we have mean squared error and mean absolute error. In mean squared error we take the square of the difference, that is, the square of the error between the actual value and the predicted value, and average it over all examples. In mean absolute error we instead take the absolute difference between the actual and predicted values and calculate the mean of that. These are the two measures we commonly use for regression. In case of classification, we first fill up the entries of a confusion matrix, and from it we calculate measures like precision, recall and F1 score to measure the performance of the classifier. We can also use accuracy as a measure, but accuracy is not always reliable: it does not work well when your dataset is imbalanced. When you have way too many examples from one class and too few examples from the other class, accuracy is not a good metric to use. Apart from that, we use ROC curves and PR curves to understand how our classifier behaves across different thresholds. You will go through each of these terms in detail later in this module; for now it is enough to understand that we will be using one of these measures and reporting performance on the test data. Now, why do we believe that a model that worked well on the test data is also going to work well on future data? Here comes a very important caveat in machine learning. We assume that the training data and the test data come from the same distribution. If the training data and test data do not come from the same distribution, we do not have much hope. So, let us understand what we mean by training and test data coming from the same distribution, with the example of a kid preparing for an exam.
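The two regression measures translate directly from their definitions into code. This is a minimal sketch; the function names mirror common usage but are written out here from scratch:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because the errors are squared, mean squared error punishes large mistakes much more heavily than mean absolute error does.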
We tell the kid to come prepared for a maths exam, and the next day we give him questions from the maths syllabus. Since the kid has prepared for maths, he will be able to answer those questions. This is a nice example of training and test data coming from the same distribution. Now, imagine a case where we ask the kid to prepare for maths, but we ask questions in science. The kid would not be able to answer those questions because he never trained on science. This is an example where the training and test distributions are different. So, this is an important caveat to note: machine learning makes the assumption that training and test data come from the same distribution. Now that we have learnt how to train the model and evaluate it, let us try to understand three possible scenarios that can happen to any model after training. Your model will either underfit, be just the right fit, or overfit. Let us try to understand the three concepts through a diagram. Let us say you have a model with a single feature that is predicting a value y, and these are the data points. Suppose my model is a horizontal line, meaning it predicts some constant value; say my model is y = 2. No matter what the input is, it always predicts the value 2. This is an example of underfitting, where the model is not taking the training data into account at all and is going by a very strong bias: no matter what the input, it spits out a fixed number, which is 2 in this case. The second thing that could happen to your model is that it is just the right fit. The points lie roughly around a line, so a linear model might be just the right fit for this data.
So, this is the category of just right fit. The first scenario we looked at was underfitting; the third scenario is overfitting. In overfitting, you end up achieving close to zero error on the training data, but your model may not generalize well to future data, and one of the important things we care about in machine learning is to have a model that generalizes well, that makes good predictions on future data. So, one of the goals in machine learning is to build models that make good predictions on future data rather than on past data, and that is why overfitting is bad for us. If you slightly perturb the input, or pick any point outside the training data, the overfitted model will not be able to predict it well. In overfitting, the model is literally memorizing all the data points, all the input-output pairs. So, underfitting and overfitting are our problems, and we have understood what they are. The next question is: how do we really diagnose underfitting and overfitting? In our underfitting example, we are using a model with a single parameter, the bias, whereas in the overfitting example we are using a degree-7 polynomial to get a perfect fit on the training data. So, the problem is either too few parameters, in the case of underfitting, or too many parameters, in the case of overfitting, and our job is to come up with the right number of parameters for just the right fit. Let us define a measure of model complexity, and one such measure is the number of parameters used in the model. In the case of the underfitted model here, the model complexity is just one parameter, whereas the degree-7 polynomial will have far more parameters than the underfitted model.
Whereas the just right model has two parameters: one for the bias and a second which is the coefficient for the feature x1. So, what we can do is draw a curve. On the x-axis we have model complexity, defined by the number of parameters, starting from 1 and going up to some very large number, and on the y-axis we have a measure of loss, J, or any other evaluation metric you can use here. When the model has too few parameters, it generally has very high loss, and as we go on adding more parameters, the model gets better and better training error; this is the training loss curve. Now, we have to understand whether the model is underfitting or overfitting. We have already divided our data into train and test parts. So, we take the train part and divide it further into two parts: one is still called train, and the second is called the dev set or validation set. We measure the training loss on the train set and we also measure the loss on the validation set. Note here that the validation data is not shown to the training algorithm at training time, so whatever loss we measure on the dev set plays a role similar to the loss we measure on the test set. What will happen is that the dev loss, or validation loss, will reduce initially as we increase the number of parameters in the model, that is, as we increase the model complexity, and after a point it will shoot up. No matter how many more parameters you add, the training error will keep going down, but the validation error keeps going up. So, if we look at this particular graph, the sweet spot is where the validation error is low enough and the training error is also low; beyond this point, the validation error has shot up.
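To make the train/dev comparison concrete, here is a small hypothetical experiment contrasting a one-parameter constant model with a two-parameter linear model fit by closed-form least squares. The function names and toy data are my own illustrative assumptions, not part of the lecture:

```python
def fit_constant(xs, ys):
    """Degree-0 model (1 parameter): always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Degree-1 model (2 parameters) via closed-form least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return lambda x: b + w * x

def mse(model, xs, ys):
    """Mean squared error of a fitted model on any dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Fitting both models on the train split and scoring them on a held-out dev split exposes the underfitting gap: on roughly linear data, the constant model has high loss on both splits, while the linear model drives both down.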
So, this point corresponds to just the right fit. On the right-hand side, we have overfitting. What is happening in the case of overfitting? The validation error initially improved, but after a point it went up. If you see this kind of pattern in your learning curves, or in the model complexity versus loss curve, you can be fairly sure some kind of overfitting is happening. Whereas the part to the left of the just right point is called the underfitting zone. Here we simply do not have enough capacity, enough parameters in the model, to achieve low training and validation error. So, if both your training error and validation error are high, we can assume there is some kind of underfitting happening, whereas if your validation error went down before going up, we can say there is overfitting happening. In the case of just right fit, training and validation error are both low, and we get a model which is probably just right for the data. This is an exploratory exercise: you will have to conduct multiple experiments, starting with a simpler model and adding more and more parameters to it, checking the training loss and the validation loss each time. The training loss is computed on the training set; the dev or validation loss is computed on the dev or validation set; and we observe these numbers to figure out whether we have an overfitting situation or an underfitting situation. Once you understand what the problem with the model really is, you can take corrective actions based on that. So, let us try to understand how we can fix underfitting.
So, once you recognize that underfitting is your problem, that is, your model is not getting lower loss on the validation and training sets and both losses are high, we can infer that the model probably does not have enough capacity, enough parameters, to learn from the training data. What is the solution here? The solution is to increase the complexity of the model, because your model simply does not have the capacity to learn the patterns in the training data. One solution that works well here is to add more parameters to the model. How do we add more parameters? Either we get more features, or we construct features by crossing the existing features or by constructing polynomial features of various degrees. Concretely, if you have two features x1 and x2 in the original model, we can construct degree-2 polynomial features by raising each feature to the second power and by taking the feature cross, which captures the interaction between the two features. When we do this, we automatically increase the capacity of the model; feature crossing is a standard way of increasing the complexity of the model. Now, how do we fix overfitting? Why does overfitting happen? Because we simply have too many parameters. So, one way of fixing overfitting is to straight away get more data: if we get more data, we will have enough data to train the complex model, the model having a large number of parameters. Second, we can reduce the model complexity, and one way of doing this is through regularization. Here the idea is that we will penalize the model for being excessively complex. You must remember the loss function; we have a loss function and we are trying to minimize it.
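The degree-2 expansion described above can be sketched in a couple of lines. The helper name is mine, chosen only for illustration:

```python
def degree2_features(x1, x2):
    """Expand two raw features into degree-2 polynomial features:
    the originals, their squares, and the cross term x1*x2
    that captures the interaction between the two features."""
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]
```

A linear model trained on this expanded list has five weights instead of two, so its capacity, measured by the number of parameters, has grown without collecting any new data.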
So, we add a penalty which is lambda times the model complexity, and our new objective becomes the loss plus the penalty for complexity. Lambda is the factor that controls how much importance to give to the model complexity; it is called the regularization rate. There are various ways of calculating model complexity. One way is called L2 regularization, or ridge regularization. In this case we calculate the second norm of the parameter vector and add it as a penalty. The effect of this is that we cannot have a model that minimizes the joint objective by assigning a very large weight to one of the features; that would not be possible with L2 regularization. The other way is L1 regularization, where instead of the second norm we use the first norm, the absolute values of the parameters. So the L2 penalty simplifies to the sum over i from 1 to m of w_i squared, and the L1 penalty to the sum over i from 1 to m of the absolute value of w_i. By adding these penalty terms we are able to control the model complexity. So, if you have an overfitting problem, it is a good idea to increase the value of the regularization rate lambda. If you increase the regularization rate, we will be giving more importance to the model complexity, that is, applying a larger penalty. So, if you are suffering from overfitting, increase the regularization rate. Sometimes people also combine L1 and L2 regularization together, which is called elastic net regularization; those options are also available to you.
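These penalty terms translate directly into code. A minimal sketch, with illustrative function names of my own:

```python
def l2_penalty(weights):
    """Ridge penalty: sum of squared parameter values (squared L2 norm)."""
    return sum(w * w for w in weights)

def l1_penalty(weights):
    """Lasso penalty: sum of absolute parameter values (L1 norm)."""
    return sum(abs(w) for w in weights)

def regularized_loss(data_loss, weights, lam, penalty=l2_penalty):
    """New objective: the original loss plus lambda times the complexity penalty."""
    return data_loss + lam * penalty(weights)
```

Raising `lam` makes large weights more expensive, which is exactly the knob we turn when fighting overfitting.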
If you are trying to fight underfitting and you have regularization in place, one way of reducing underfitting is to reduce lambda. Once we reduce lambda, we apply a smaller penalty for model complexity, and that gives the model more room, more capacity, to learn the training data; that is how we can fix the underfitting problem. These are the broad ways in which you can fix underfitting and overfitting in your training. Whenever you train a machine learning model you will often come across issues like underfitting and overfitting, and these are some of the strategies with which you can fix those problems. So, we have two problems, underfitting and overfitting, and two levers, more data and more complexity. More data does not help with underfitting, but it definitely helps with overfitting. More complexity, for underfitting, comes from decreasing lambda, while for overfitting we increase lambda, since increasing lambda leads to a less complex model, less complex in terms of the number of parameters that matter. This is a summary of the whole discussion we had on diagnosing problems with the model and fixing them with more data or more complexity. After we perform regularization, we essentially get a model, and we can run that model on the test data and report the performance, which we hope will be similar to the performance we expect on future data. Why do we expect that? Because we make the crucial assumption that training data and test data come from the same distribution. So, before we close this discussion, let us quickly look at some of the evaluation measures so that we have a good understanding of them.
So, let us look at the classification measures like accuracy, precision, recall and so on. The first thing we do is build what is called a confusion matrix; it is the key matrix from which we can derive all the metrics. Say this side is the actual label and this side is the predicted label, each being +1 or -1. If the actual is +1 and the predicted is +1, we have what are called true positives. If the actual is -1 and the predicted is -1, we have true negatives. If the actual is +1 but the predicted is -1, we have a false negative: it was falsely predicted negative. And the remaining cell, actual -1 but predicted +1, is a false positive. Now, we will use this confusion matrix. These measures are mainly calculated for classification: once you have a classifier's predictions, you first fill up the confusion matrix and then, based on that, calculate measures like accuracy. Accuracy is the count of correct predictions, true positives plus true negatives, divided by true positives plus false negatives plus false positives plus true negatives. Essentially it is the sum of the diagonal elements divided by the sum of all the cells in the matrix; that is accuracy for you. Now, accuracy does not work on an imbalanced dataset, where there are too many examples of one class and too few of the other, but it works well on a balanced dataset: if you have roughly equal numbers of positive and negative examples, accuracy works in those cases. So, for imbalanced datasets, or in general, we calculate a measure called precision: precision is the ratio of true positives to the total predicted positives.
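A small sketch of filling the confusion matrix and computing accuracy from it; the function names are illustrative choices of mine:

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four confusion-matrix cells for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1      # actual positive, predicted positive
            else:
                fp += 1      # actual negative, falsely predicted positive
        else:
            if a == positive:
                fn += 1      # actual positive, falsely predicted negative
            else:
                tn += 1      # actual negative, predicted negative
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Diagonal of the confusion matrix divided by the sum of all cells."""
    return (tp + tn) / (tp + fp + tn + fn)
```

Every later metric in this section (precision, recall, F1) can be read off these same four counts.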
So, what are the total predicted positives? They are the true positives plus the false positives. The ratio of true positives to the total predicted positives is precision. Recall, on the other hand, is the ratio of true positives to the total actual positives, which are the true positives plus the false negatives; that is recall. We combine precision and recall into a score called the F1 score, which is the harmonic mean of precision and recall and is calculated as 2 times precision times recall divided by precision plus recall. These are the four measures we mainly use to evaluate models on a classification task. So, this brings us to the end of the model evaluation discussion, and this was the last piece in the machine learning pipeline. Let us summarize, or revisit, the steps in the machine learning pipeline. We start with data; data is the essential component of the machine learning pipeline. We first do data pre-processing; after data pre-processing we build the model. Model building involves specifying the architecture of the model, or the model function itself. Then we train the model, and then we do model selection, which involves selecting a model of the right complexity; finally, we get the model with which we make predictions. We make predictions on new data and get the output. Note that in a single pass we may not be able to get the final model, so there is a loop here: we do training and model selection, and if we do not yet have a good enough model, we go back, recheck the features, see if we can add more features to the model, and repeat the process. We do this until we are satisfied with the model quality. Once we get a model of reasonable quality, we deploy it and use it to get predictions on new data. So, this is the entire machine learning flow.
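The remaining three measures, sketched from the same confusion-matrix counts (again with illustrative names):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is why it is preferred over accuracy on imbalanced datasets.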
So, I would like to make a passing remark about the components of a machine learning model; this is a very important view. If anybody tells you that they have invented a new machine learning algorithm, you should ask them three questions. First, what is your model, that is, what is the mathematical function in your model? Second, what is your loss function: how do you calculate the loss? And third, how do you train the model? These are the three core questions you should ask. In addition to that, you can ask a couple more questions: what is your training data, and how do you evaluate the model, that is, what metrics do you use for evaluation? If we ask these five questions, we will be able to understand the machine learning model pretty well. This is a componentized view of a machine learning model. Let us take a concrete example and see how it goes. In the case of linear regression, the training data is pairs (x_i, y_i), where y_i is a number. The model is b plus the linear combination of the features and their respective weights. For the loss we use the squared loss, with a factor of half: we calculate the loss across all the points by taking the prediction minus the actual value and squaring it, and that is how we get the loss. Then we use gradient descent, either batch or mini-batch, to minimize the loss and find the parameters, and we use mean squared error or mean absolute error as evaluation measures. That was for regression. Let us try to get a similar view for logistic regression, which is a classification algorithm. In the case of logistic regression, our training data is again pairs (x_i, y_i), but y_i comes from a fixed set of values, say {0, 1}: y is either 0 or 1. Our model is essentially to take the linear combination and apply the sigmoid function to it. This is our model.
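As an illustration of this componentized view for linear regression, here is a sketch of batch gradient descent on the squared loss for a single-feature model y ≈ b + w·x. The learning rate and epoch count are arbitrary choices that happen to work for small toy data, not recommendations:

```python
def train_linear_regression(xs, ys, lr=0.05, epochs=5000):
    """Batch gradient descent on mean squared error for y = b + w*x."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum((b + w * x - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((b + w * x - y) for x, y in zip(xs, ys)) * 2 / n
        # Step each parameter against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Every component of the five-question view appears here: the data (xs, ys), the model (b + w·x), the loss (mean squared error), and the training procedure (batch gradient descent); only the evaluation metric is applied afterwards, on held-out data.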
We use the cross-entropy loss, we can use either batch or mini-batch gradient descent, and we can evaluate with accuracy or precision. So, if you come across any new machine learning algorithm, ask these five pointed questions and note down the answers, and you will be able to pretty much understand what is really happening in that algorithm. We take the training data; we understand what type of machine learning problem we are solving, whether it is a regression problem or a classification problem; we select a suitable model; we define a loss function and an optimization algorithm to find the optimal parameters for which the loss is minimized; and finally, we evaluate with a suitable metric to report the performance on the test data, which we hope to also get on the future data. In this process, getting the features for the data and defining loss functions are activities that are tightly linked to the domain. So, I hope you understood the whole end-to-end machine learning flow, got a good understanding of the componentized view of a machine learning model, and also of the steps involved in the machine learning pipeline. We will take these learnings to neural networks in one of the follow-up sessions. After this session, we will go to the neural network playground and understand each of these concepts visually, which I think is a very fun activity. Hope you enjoyed learning this part with us. Thank you.