In this learning dialogue, we will talk about training and testing data. In our last video, we saw the three steps in machine learning: the first step is data collection, the second step is selecting the algorithm, and the third step is training and testing. We talked about data collection last week, and we also saw an introduction to machine learning algorithms in the previous video. In this video, we will talk about training and testing data.

In step one, let us assume you have collected 1000 students' attendance, midterm performance, and final exam scores for a course. Since I mentioned 1000 students, you may not have 1000 students in the current class. So you use historical data, say students across 5 or 10 departments, with their scores across multiple courses, something like that. It is also possible that the same students' data from different courses adds up to 1000. You check the data and pre-process it: no outliers, no errors, everything verified. So you have data for 1000 students with two input variables, attendance and midterm performance, and Y is the final exam score. We want to create a model to predict each student's final exam score.

Step 2 is selecting the algorithm. For the sake of this example, let us say you selected a linear regression algorithm with the two variables, that is, multiple linear regression rather than simple.

Step 3 is training and testing. The question is: which data pairs will be used for training? You want to create a linear regression model, which, if you remember, is y = mx + c. Which data will you use to build that y = mx + c, that is, to train the m value and the c value? And what is the testing data set? Because a testing data set means you need new students' data without their final scores.
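To make the collected data concrete, here is a minimal Python sketch of how such records might be stored. The three students and their scores are made-up illustrations, not data from the lecture.

```python
# Sketch of the collected data: each record is (attendance %, midterm, final).
# These three students are hypothetical examples, not real collected data.
students = [
    (85.0, 72.0, 78.0),
    (60.0, 55.0, 58.0),
    (95.0, 88.0, 91.0),
]

# X holds the two input variables; y holds the final exam score to predict.
X = [(attendance, midterm) for attendance, midterm, _ in students]
y = [final for _, _, final in students]

print(X[0], y[0])
```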
So, what is the testing data set you want to use? You already know all the Y values, because, as I said, it is 1000 students' data and you already have their final scores. So what is the testing data set? Think about those last two points as questions: which data pairs will be used as training data, and what is the testing data set? List down your answers, and after listing them, resume the video to continue.

In ML, we usually split the data into two sets and use one set for training and one set for testing. For example, you have 1000 students' data; let us say you use 700 students' data for training and 300 for testing. So even given the same data, we split it into training and testing and use them separately. It is very common to split the data into two. In the last five years or so, we have also started using a validation set, so a train set, a validation set, and a test set, but let us not go into that; let us keep it simple with only two sets, training and testing.

The training data is the 700 students' historical data with (X_i, Y_i) information: 700 pairs of X_i and Y_i, where X_i can be X_1 and X_2, and Y_i is a single value because performance is a single output variable. So each pair is basically (X_1, X_2, Y), and we have 700 samples like this.

In simple linear regression, the trained model will look like this. I consider only simple linear regression, which means I have only one X, no X_2 or X_3. So let us drop X_2 and keep only X_1. The model looks like Y = mX + c, the slope equation we studied in class 8 or so. Values like 0.76 for the slope or 3.2 for the intercept are computed from the training data: given the 700 training samples, the algorithm computes the values of c and m in Y = c + mX. Now we use this model to predict the performance on the test data.
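The training step above can be sketched in plain Python. This is only an illustration under stated assumptions: the 1000 records are synthetic, generated around the example line y = 3.2 + 0.76x with some noise, and the least-squares formulas stand in for whatever fitting routine you actually use.

```python
import random

# Synthetic stand-in for the collected data: 1000 (x, y) pairs generated
# around the lecture's example line y = 3.2 + 0.76x, with Gaussian noise.
random.seed(0)
xs = [random.uniform(0, 100) for _ in range(1000)]
data = [(x, 3.2 + 0.76 * x + random.gauss(0, 2.0)) for x in xs]

# The 700/300 split from the example: 700 pairs to train, 300 held out.
train, test = data[:700], data[700:]

def fit_line(pairs):
    """Least-squares estimates of slope m and intercept c for y = m*x + c."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
         / sum((x - mean_x) ** 2 for x, _ in pairs))
    c = mean_y - m * mean_x
    return m, c

m, c = fit_line(train)
print(f"trained model: y = {c:.2f} + {m:.2f} * x")  # close to 3.2 and 0.76
```

Because the model is fit only on the 700 training pairs, the 300 held-out pairs remain untouched for testing.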
For the test data, the prediction will be Y_predict = 3.2 + 0.76 X. This X comes from the test data set, which also has X_i and Y_i, but I will use only X_i, apply it here, and get the Y_predict value. So what I am saying is: we split the data into two, training and testing, use the training data set to create the model, and then apply the test data set's input variable to compute Y_predict.

Let us discuss this in detail. Suppose the equation of the model is Y_predict = 0.56 X_i + 31.96. This is not the same equation as before; I am just giving another example. Consider you have training data, say 700 samples, and testing data, say 300 samples. The training data (X_i, Y_i) is given to the simple linear regression algorithm, and it comes up with this particular equation. How it works internally we will discuss in detail later; for now, consider it a black box: you give it this input and it gives you this equation. Values like 31.96 or 0.56 are just examples; they are not actually fitted to any scores, because I do not have 700 data points here, only 3 to show.

Now you want to measure the performance of this particular linear regression model on the test data set. You have 300 test samples, that is, 300 values of X_i; apply each X_i to this equation, and for each X_i you will get a corresponding Y_predict. In general, we use two-thirds of the data for training and one-third for testing; with 1000 samples, that is about 666 samples for training and 334 for testing. So the testing output will be each X_i together with its Y_predict after applying the equation.
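Applying the trained model to the test inputs can be sketched like this. The coefficients 31.96 and 0.56 are the lecture's illustrative values, and the three test inputs are made up for the example.

```python
# Model from training (the lecture's illustrative coefficients, not fitted here).
def predict(x_i):
    return 0.56 * x_i + 31.96

# A few made-up test inputs X_i; their actual Y_i stay aside for evaluation.
test_x = [50.0, 70.0, 90.0]
y_pred = [predict(x) for x in test_x]
print(y_pred)  # one Y_predict per test input
```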
Also, you know what the actual Y_i is, because this is a supervised learning setup and you already collected it. So in the test data set you have both the predicted value Y_predict and the actual value Y_i. By comparing these two, you can evaluate the performance of the model. We will discuss that in detail in the next lecture, but for now just note that you get Y_predict here, and by comparing Y_predict and Y_i you get a measure of the performance of this particular algorithm.

So you saw that we can split the data into train and test. Do you think there is a drawback in simply splitting the data into two-thirds and one-third? Think about what the drawback might be, jot down your answers, and resume the video to continue.

Bias is the main problem with this particular split. Let me explain. You split the data simply into the first 666 and the last 334. But how is the data arranged? You might have arranged it by class, say class 1, class 2, class 3, or by the year the students took the course in your university or college. So when you split the data this way, there is a high chance that the test data set does not belong to, or may not even be related to, the training data set. Splitting the data as simply the first 666 and the last 334 will lead to bias, because the student batches will be totally different. That is one issue. Also, taking data from the same students across different courses is valid, but the data may come from different batches: you may have selected from one department or another, and each department will have different teachers for different courses. It is not only attendance that impacts a student's performance: attendance plus other external factors, like the teacher, the teaching material, the course difficulty level, or how many senior batches the students had, all of this information matters.
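Comparing Y_predict with the actual Y_i can be as simple as the sketch below. Mean absolute error is just one possible measure, chosen here only for illustration since the proper performance measures come in the next lecture, and all the numbers are hypothetical.

```python
# Hypothetical predicted vs. actual final scores for five test students.
y_pred   = [62.0, 71.5, 80.0, 55.0, 90.5]
y_actual = [60.0, 70.0, 83.0, 58.0, 88.0]

# One simple comparison: mean absolute error between prediction and truth.
mae = sum(abs(p - a) for p, a in zip(y_pred, y_actual)) / len(y_actual)
print(f"MAE = {mae:.2f}")  # average miss per student, in marks
```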
So, when you split the data into training and test sets simply by position, say the first 666 and the last 334, it will not help; it will have bias. How can we avoid it? We can select two-thirds of the data randomly for training: instead of taking the first 666 records as training and the last 334 as testing, we can randomly pick 666 records for training and the rest as testing. That is one way to do it.

Still, the algorithm is trained only on that one training set, and its performance is tested only on the one test set you happen to have. How do we avoid that? To avoid that error, we use cross-validation. In cross-validation, we split the data set into n sets, use n - 1 sets for training, and use the left-out set, the one not used for training, for testing. We repeat the same process n times with a different test set each time.

To see this in detail, let us take one example. Consider n = 4, which means 4-fold cross-validation, because you do the testing 4 times, once per fold. I divide the complete data into 4 different sets: set 1 has 25 percent of the data, set 2 has 25 percent, and so on up to set 4 with 25 percent. Why 25 percent? Because with 4 folds, if the total data is 100 records, each set gets 25 records, that is, 25 percent per set, an equal amount of data in all 4 sets. If the sets are not exactly equal, if one set has one or two records more, that is just fine.

What do I do in fold 1? I use sets 1, 2, and 3 for training, and the test set is the left-out set 4, so that data gets tested, and I evaluate the model I trained on the other three sets. In fold 2, I use sets 1, 2, and 4 for training, and set 3 is the test set.
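The 4-fold procedure described above can be sketched as follows. The 100 records are dummy indices standing in for the student records, and shuffling before splitting is one way to address the ordering bias discussed earlier.

```python
import random

# Dummy data: 100 record indices standing in for the student records.
records = list(range(100))
random.seed(1)
random.shuffle(records)  # shuffle first, so folds are not grouped by batch

n_folds = 4
fold_size = len(records) // n_folds
folds = [records[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

# In each fold k, the other 3 sets train and set k is the held-out test set,
# so every record is tested exactly once across the 4 folds.
for k in range(n_folds):
    test_set = folds[k]
    train_set = [r for j, fold in enumerate(folds) if j != k for r in fold]
    print(f"fold {k + 1}: train {len(train_set)}, test {len(test_set)}")
```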
So, that data also gets tested. Similarly, in folds 3 and 4 we use the remaining sets, so by the end all the data has been tested. This gives the performance of the algorithm on the complete data: every one of the n samples has been tested. This is a much stronger and more reliable estimate than a single 66%/33% split, so it is always advisable to use cross-validation.

How many folds should you choose? 4, 5, or 10? It depends on the data set. It is generally good to choose n = 10, that is, 10-fold cross-validation. If you do not have that much data, you can select a smaller number of folds.

There is one more type of cross-validation, called leave-one-out. In the leave-one-out approach, suppose you have n data points, where n is the total number of samples. In our case, we have 1000 students' data, so n is the total number of students, and leave-one-out becomes 1000-fold cross-validation: n - 1 = 999 records are used for training and one for testing, and we repeat this 1000 times. This is strongly recommended in principle, but it takes a lot of computational time; it is computationally costly. So if you have huge data, it is suggested to go for 10-fold cross-validation, or you can choose the number of folds depending on your data and your algorithm.

In this video, we saw what training and testing data are, what a data split is, and we also discussed cross-validation. In the next video, we will talk about some of the performance measures in machine learning. Thank you.