In this video, we will talk about the correlation matrix using the Pearson correlation coefficient. If you remember, we discussed students' marks in subject A and subject B, and we used a stacked bar chart, a bar chart, or a box plot to compare them in descriptive analytics. Let us look at the same data: marks of 60 students in subject A and subject B, the same students in both, so we can plot the relationship between the two subjects. You see that a student who is doing well in subject A is also doing well in subject B. In fact, I created these marks with an exact linear relationship, subject B = 0.93 × subject A; if you compute 0.93 × 65, you get the corresponding mark. Because the relationship is perfectly linear and positive, the correlation coefficient is exactly +1. So a strong relationship between subject A and subject B can be established: a student who scores well in subject A, say mathematics, is very likely to score well in subject B, say science.

Let us look at the other example. We have attendance, midterm marks, and final marks for a semester. Now we have two independent variables, x1 (attendance) and x2 (midterm marks), and one dependent variable, y (final marks). We can compute the correlations among these values: what is the correlation between attendance and final marks, and what is the correlation between midterm marks and final marks? Consider column 1 as attendance, column 2 as midterm marks, and column 3 as final marks. This matrix was created using Excel; the simple tool available in Excel can help you do that. Is there a relationship between column 2 and column 1? Yes, these two are highly correlated, at 0.9. Is there a relationship between column 1 and column 3? Yes, 0.84: that is the correlation coefficient between attendance and final marks.
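The matrix in the slide was built with Excel, but the same computation is easy to reproduce in code. Here is a minimal sketch using NumPy with made-up marks for ten hypothetical students; the numbers are illustrative only, so the coefficients will not match the lecture's 0.9, 0.84, and 0.8 exactly:

```python
import numpy as np

# Made-up marks for 10 students (illustrative only, not the lecture data).
attendance = np.array([95, 88, 70, 60, 92, 55, 80, 75, 85, 65])  # column 1
midterm    = np.array([90, 85, 65, 58, 88, 50, 78, 70, 82, 60])  # column 2
final      = np.array([92, 84, 68, 55, 90, 52, 80, 72, 85, 62])  # column 3

# np.corrcoef treats each row of the input as one variable and returns
# the full Pearson correlation matrix, like Excel's correlation tool.
R = np.corrcoef([attendance, midterm, final])
print(np.round(R, 2))
```

With this data all three pairwise correlations come out strongly positive.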
If a student has higher attendance, they are highly likely to score better than a student with low attendance. Next, the diagonal entries are self-correlations: attendance versus attendance, midterm marks versus midterm marks, final marks versus final marks. Obviously each of these is 1, because a variable is perfectly correlated with itself; that is the diagonal part of this matrix. Now let us look at column 3 versus column 2: what is the correlation coefficient between midterm marks and final marks? It is 0.8, which says that a student who scores well in the midterm exam is very likely to score well in the final exam.

So this is the correlation matrix, which tells you the relationships among x1, x2, and y. What about the correlation between attendance and midterm marks? This pairing may not be entirely fair, because the midterm covers only the first half of the semester, while the attendance data might run until just before the final exam; we do not know whether the attendance was recorded for the first half of the semester or up to one month before the final exams. But under the assumption that attendance is counted only up to midterm time, attendance and midterm marks have a high correlation of 0.9.

Now a question comes up, and it is an interesting one: if I want to create a simple classifier, should I use both x1 and x2 to predict y, or only one of them? Notice the high correlation of 0.9 between attendance and midterm marks: a student with high attendance almost certainly has high midterm marks, and a student with low attendance has low midterm marks. So there is no point in picking both variables when creating a classifier or predictive analytics method; instead, we have to pick one of them.
The general approach is to use the variable that has the highest correlation with the dependent variable, column 3. Since attendance and midterm marks are highly correlated with each other, we can pick just one of them for classification or predictive analytics; which one depends on how strongly each is correlated with the final marks. I would choose attendance, because attendance is more strongly correlated with the final score than the midterm marks are: 0.84 compared to 0.8. Midterm marks are also a good, strong predictor, but 0.84 beats 0.8, so I will use attendance to predict final marks. The idea is to use the independent variable that is best correlated with the dependent variable when creating the classifier. With such a strong linear correlation between attendance and final marks, prediction is easy: a simple if-else condition or filtering criterion can tell you whether the student will pass the exam or not, and a regression can estimate how much the student will score.

Now consider having more than two independent variables, say x1 up to x10: assignment scores, midterm marks, attendance, engagement level in the class, responses in the discussion forum, and other information available from the classroom or some other learning environment. If you have that kind of information, using all of these features for prediction may be a problem when you want to reduce computation time. So you can apply a feature selection algorithm.
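That if-else idea can be sketched in a couple of lines. The 75% attendance cutoff below is a hypothetical threshold chosen purely for illustration; in practice you would tune it from the data:

```python
# Hypothetical pass/fail predictor using attendance alone.
# The 75 threshold is a made-up illustrative cutoff.
def predict_pass(attendance_pct, threshold=75):
    """Predict whether a student passes the final exam."""
    if attendance_pct >= threshold:
        return True   # high attendance: predicted to pass
    else:
        return False  # low attendance: predicted to fail

print(predict_pass(90))  # True
print(predict_pass(50))  # False
```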
One such algorithm computes the correlation of each independent variable with the dependent variable, picks the strongly correlated ones, and removes independent variables that are highly correlated with each other. This technique was used very widely some time ago. However, with today's advanced classifiers, we do not really worry about manually selecting features to predict y; instead, we use other techniques available in the tools, which we might see when using tools that help you pick the right features for prediction. Still, it is very good practice, if you are going to build a predictor, to try the correlation method with a correlation matrix; it might help you. Also, if you have, say, 10 columns, the correlation matrix shows the relationship between every pair of independent variables and gives a sense of which variables could be predicted from which. So if you have more than one independent variable, I recommend computing the correlation matrix to get a sense of the data and to see whether there is any linear relationship between the variables.

You saw the correlation matrix in the previous slide, and I hope you understood it. Let us do a small activity. You are given four plots: (x1, y1), (x2, y2), (x3, y3), and (x4, y4). Can you guess the correlation coefficient of each chart? And once you have guessed it, what inference would you make from that correlation coefficient? Do not do any calculation; just try to guess what the values will be. This data is from a paper, which is cited here. In fact, all four of them have the same correlation coefficient. You might have guessed it correctly.
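A minimal sketch of that selection procedure is below. The function name `select_features`, the 0.3 minimum target correlation, and the 0.85 redundancy cutoff are all illustrative assumptions, not standard values:

```python
import numpy as np

def select_features(X, y, min_target_corr=0.3, redundancy_cutoff=0.85):
    """Greedy correlation-based feature selection (illustrative sketch).

    Ranks features by |Pearson r| with the target y, then keeps a
    feature only if it is not highly correlated with one already kept.
    """
    n_features = X.shape[1]
    target_corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    kept = []
    for j in sorted(range(n_features), key=lambda k: -target_corr[k]):
        if target_corr[j] < min_target_corr:
            break  # remaining features are even more weakly related to y
        if any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > redundancy_cutoff
               for k in kept):
            continue  # redundant with a feature we already kept
        kept.append(j)
    return sorted(kept)

# Synthetic demo: column 1 is nearly a copy of column 0 (like
# attendance vs midterm marks), and column 2 is unrelated noise.
rng = np.random.default_rng(0)
attendance = rng.uniform(50, 100, 200)
midterm = attendance + rng.normal(0, 2, 200)   # redundant with attendance
noise = rng.uniform(0, 100, 200)               # irrelevant to y
y = attendance + rng.normal(0, 5, 200)         # final marks
X = np.column_stack([attendance, midterm, noise])
print("kept feature columns:", select_features(X, y))
```

On this synthetic data the procedure keeps a single column: one of the two redundant predictors, dropping both the duplicate and the noise.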
In the last video, we talked about the drawbacks of Pearson correlation, and this example actually gives you all the details of that. Let us look at it. In the first plot, x1 versus y1, the points are scattered around a linear relationship, and the correlation coefficient is 0.816, which is good. The second plot is not a linear relationship: y has a non-linear relationship with x. Still, the correlation coefficient is 0.816, because a line can still be fitted through these points. Here is one drawback of Pearson correlation: a non-linear relationship is not identified. A good correlation measure should be able to identify non-linear relationships too, but unfortunately there is no single metric that identifies this kind of non-linear relationship very accurately.

In the third plot, if you remove one particular outlier point, the remaining points lie exactly on a line and the correlation would be exactly +1. Because of that single outlier, the value drops to 0.816. So the Pearson correlation coefficient is sensitive to outliers, and this figure explains how sensitive it is. In the fourth plot, if you remove the outlier, there is absolutely no correlation between x and y, because x does not vary at all. Yet because of that one outlier, the line connecting the points matches well and the value comes out as 0.816. So this figure tells you two things: the Pearson correlation coefficient is sensitive to outliers, and it does not capture non-linear relationships. That is the classic figure that explains the Pearson correlation coefficient and its drawbacks.

So, in this video, we saw what a correlation matrix is, and we saw a plot that explains the drawbacks of the Pearson correlation coefficient. Thank you.
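The four plots described above are Anscombe's quartet (Anscombe, 1973). Using the published quartet values, a short check confirms that all four datasets share essentially the same Pearson r, and that deleting the single outlier from the third dataset pushes r to essentially +1:

```python
import numpy as np

# Anscombe's quartet: four datasets with (nearly) identical Pearson r.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
for name, (x, y) in quartet.items():
    print(f"dataset {name}: r = {np.corrcoef(x, y)[0, 1]:.2f}")  # 0.82 for all four

# Dataset III minus its one outlier (the point at x = 13): the
# remaining points lie on a straight line, so r becomes ~1.
x3, y3 = quartet["III"]
x3_clean = [v for i, v in enumerate(x3) if i != 2]
y3_clean = [v for i, v in enumerate(y3) if i != 2]
print(f"dataset III without outlier: r = {np.corrcoef(x3_clean, y3_clean)[0, 1]:.4f}")
```

Note that for the fourth dataset, dropping the outlier leaves x with no variation at all, so a Pearson correlation cannot even be computed there.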