In this course, we will continue with performance metrics, the detailed second part of performance metrics. We saw performance metrics like recall, precision and accuracy in the last video. Let us now look at more detailed performance metrics which come up when we compare different classifiers.

Consider a binary classification problem where the classes are 0, the student will not pass the course, and 1, the student will pass the course. There are only two classes, so it is a binary classification problem. You have a table like this, where y_i is the actual ground-truth value and y_predict is the output of the classifier. We computed the confusion matrix and from it we computed precision, recall and accuracy. The focus was on predicting students who will pass the exam, so the true positive is focused on the students who will pass the exam. If I instead focus on 0, that is, the students who will not pass the exam, what will change? I am not going to discuss what happens if you focus on 0 in this video; take it as extra work for you to think about. Please check Wikipedia, it is a very good resource. How do you find it? Just type precision, recall, accuracy, Wikipedia into Google and you will get the page.

Let us check the next classification problem. Consider a binary classification problem, that is, two classes: predicting which students will get more than 90 marks in the final exam and which students will not. You have n = 1000 samples, that is, data on 1000 students from historical semesters and courses. As you know, very few students get more than 90 marks; they are the ones who get university ranks and so on, so the number will be very small. Consider that only 20 students got more than 90 marks and the other 980 students got less than 90 marks. This dataset is imbalanced: in the true values there are only 20 positive cases and 980 negative cases, that is, students who will not score more than 90 marks. If we have this kind of imbalanced dataset, what will happen?

Let us compute precision, accuracy and recall. Accuracy is 980 correct plus 9 correct, 989 out of 1000, which is 98.9 percent, a very high accuracy. Precision: all 9 predicted positives are correct, so 9 divided by 9 plus 0, which is 100 percent, perfectly precise prediction. Recall is 9 divided by 20, that is, 9 true positives out of 20 actual positives, which is 45 percent. So what do you think about this? Is it good? This value is interesting. Let us see.

Consider you have two classifiers. The results are not the same as what we discussed in the last slide; this is slightly different. Consider two classifiers which use the same 1000-sample dataset and give results like this: both have a very high accuracy of around 98.9 percent, but low precision and recall values, for example 34 percent and 45 percent. For one classifier you might have used a decision tree and for the other a naive Bayes classifier, and the results of classifier 1 and classifier 2 on the same dataset are given. Why do these two classifiers have very high accuracy but very poor precision and recall, and which classifier is better? Please pause the video and take a minute to think about it. After you have listed your answer, you can resume.

The reason for the very high accuracy is the imbalanced dataset. Given that 980 of the 1000 samples are negative, a system which classifies everything as negative will still get about 98 percent accuracy.
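As a quick illustration of the arithmetic above, here is a minimal Python sketch, assuming the confusion-matrix counts implied by the example (9 true positives, 0 false positives, 11 false negatives and 980 true negatives):

```python
# Confusion-matrix counts implied by the >90-marks example above.
tp, fp, fn, tn = 9, 0, 11, 980

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 989 / 1000 = 0.989
precision = tp / (tp + fp)                   # 9 / 9  = 1.0
recall = tp / (tp + fn)                      # 9 / 20 = 0.45

print(f"accuracy:  {accuracy:.1%}")          # 98.9%
print(f"precision: {precision:.1%}")         # 100.0%
print(f"recall:    {recall:.1%}")            # 45.0%

# For comparison, predicting "below 90" for every student would still
# score 980/1000 = 98% accuracy, but with 0% recall.
```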
There is no need to even try to create new logic or new rules; the simple rule can be to classify everything into the majority class. So you will get high accuracy if the dataset is imbalanced.

And which classifier is better? Given the dataset, it is not enough to say which classifier is good, because it depends on the research goal. If my interest is in precision, or my interest is in recall, then based on that we can say which classifier is better. So, in order to decide which classifier is better, or which one is performing well, we need a score, a metric, which combines precision and recall, or some other kind of metric. Let us look at those metrics in this video.

One of those metrics is the F score, or F1 score. It is the harmonic mean of precision and recall. What is the harmonic mean? It is simply one kind of averaging technique; here there are two values, precision and recall. The harmonic mean is 2 divided by (1/precision + 1/recall); if you simplify this, you get F1 = 2 x precision x recall / (precision + recall). So it gives importance to both precision and recall. Is that good? Should we give equal importance to both precision and recall? In the last slide, I mentioned that some research questions need better precision compared to recall, and some research questions need better recall than precision. Can you think of one such research problem? This is not an activity, but you can pause and think about it. We will talk about such research questions later, but please think about it. To address this limitation, that the F1 score gives equal importance to precision and recall, we can use a variation of the F score computation that gives more importance to precision: you add a weight when combining precision and recall. So there are variations of the F score where you weight precision, but we are not discussing them in this video; you can check the Wikipedia page for the formulations.

There is another metric, Cohen's Kappa, developed by Jacob Cohen. Kappa was developed to measure the inter-rater agreement of two raters. What is the inter-rater agreement of two raters? Let us take an example. Two raters, two researchers, are watching students' facial expressions. Say there are 10 students attending a class. The two raters look at the students' facial expressions, body gestures, tone, everything, and classify each student into one of the affective states, say bored, confused or engaged. Now, we cannot have two raters observe everything for our complete research; we may want to use one rater for five students and the other rater for the other five students. But how do we avoid the bias between these raters? That is what inter-rater agreement addresses. Initially, we have to ask the two raters to observe the same set of students and check whether they agree on their classifications. That is, if you have a set of items to classify into categories, say boredom, confusion and engagement, how consistently do both raters assign the same categories? In order to measure whether there is agreement between these two raters, Cohen's kappa is used. The formulation of Cohen's kappa is kappa = (Po - Pe) / (1 - Pe).
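Before working through a kappa example, here is a small Python sketch of the F score computation described a little earlier; the beta-weighted variant is the standard generalization alluded to above (not covered further in this video), and the 1.0 / 0.45 inputs are the precision and recall from the imbalanced example earlier:

```python
# F1: the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Weighted generalization (usually called F-beta): beta < 1 gives more
# importance to precision, beta > 1 gives more importance to recall.
def f_beta_score(precision: float, recall: float, beta: float) -> float:
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f1_score(1.0, 0.45))           # ~0.62, dragged down by the low recall
print(f_beta_score(1.0, 0.45, 0.5))  # ~0.80, precision weighted more heavily
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot get a high F1 score by doing well on only one of precision or recall.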
Let us see what Po and Pe are. Po is the observed accuracy; from the confusion table we can compute the accuracy, as we did in the last class. Pe is the hypothetical probability of chance agreement: what is the probability that the two raters would agree just by chance, the minimum level of agreement you would expect? How do we compute Pe? It is the sum, over the k categories, of the estimated probability that both raters pick that category. Let us see an example of how to compute a kappa score.

Let us take this table and understand it first. Two raters are looking at students' facial expressions. They looked at 40 plus 20 plus 10 plus 30, that is, around 100 instances of facial expressions. I am not saying 100 students; there might be two students, one student, or 50 students, but there are 100 instances of facial observations, and both raters observed all 100 of them. Rater 1 marked frustrated 40 times in agreement with Rater 2, but Rater 1 marked frustrated 20 more times where Rater 2 did not; Rater 2 might have said not frustrated there. So this is a simple confusion table, similar to what we saw in the classification problem: in 40 instances Rater 1 and Rater 2 agree that the student is frustrated, and in 30 instances they agree that the student is not frustrated. The cross cells are the cases where one rater does not mark it as frustration but the other does; these are like the wrongly classified cases in a classification problem. In those cells there is no agreement between the two raters, while in the diagonal cells there is agreement.

What is the accuracy? Simple to compute: 40 plus 30 agreements divided by the total number of observations, that is 70 by 100, so Po = 0.7.

Now let us compute Pe. What is Rater 1's rate of saying yes? Rater 1 says yes 60 percent of the time across all 100 samples, that is 40 plus 20, so 60, or 0.6; Rater 1 says no the remaining 40 percent of the time, that is 0.4. This tells us that Rater 1 has a bias towards saying yes rather than no, which means that even a slight expression on the face might be marked as frustrated; that is Rater 1's bias. Rater 2, in contrast, is evenly split: with 0.5 probability Rater 2 says the student is frustrated, and with 0.5 probability Rater 2 says the student is not frustrated. These come from 50 by 100 and 50 by 100.

So what is the probability that both raters say yes by chance? Simply multiply Rater 1's yes rate by Rater 2's yes rate, 0.6 from here and 0.5 from here, which gives 0.3. Similarly, the probability that both raters say no is 0.4 times 0.5, which is 0.2. So the chance agreement Pe is the yes value plus the no value, 0.3 plus 0.2, that is 0.5. This is the sum of the hypothetical probabilities of chance agreement we defined earlier. If you use Po = 0.7 and Pe = 0.5 in the formula, you get a kappa score equal to 0.4; please apply the formula from the previous slide and compute it yourself.

Is kappa = 0.4 good? That is the question. Think about it; if you want, you can pause, search on the internet, and see whether a kappa score of 0.4 is good.
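Before you look up whether 0.4 is a good value, here is a small Python sketch of the same computation; the 40 / 20 / 10 / 30 counts are the rater table from the example, the "yes"/"no" labels are just stand-ins for frustrated / not frustrated, and the scikit-learn cross-check at the end assumes you have that library installed:

```python
from sklearn.metrics import cohen_kappa_score

# 2x2 rater table from the example: both say "frustrated" 40 times,
# only Rater 1 says it 20 times, only Rater 2 says it 10 times,
# and both say "not frustrated" 30 times.
both_yes, r1_only, r2_only, both_no = 40, 20, 10, 30
n = both_yes + r1_only + r2_only + both_no            # 100 observations

p_o = (both_yes + both_no) / n                        # observed agreement, 0.7
r1_yes = (both_yes + r1_only) / n                     # Rater 1 says yes: 0.6
r2_yes = (both_yes + r2_only) / n                     # Rater 2 says yes: 0.5
p_e = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)   # chance agreement, 0.5

kappa = (p_o - p_e) / (1 - p_e)
print(kappa)                                          # 0.4

# Cross-check with scikit-learn, which computes kappa directly from the
# two raters' label sequences laid out to match the table above.
rater1 = ["yes"] * 60 + ["no"] * 40
rater2 = ["yes"] * 40 + ["no"] * 20 + ["yes"] * 10 + ["no"] * 30
print(cohen_kappa_score(rater1, rater2))              # 0.4
```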
There is no exact answer to this; whether the kappa value is good or bad depends on the domain. In this scenario, an agreement of 0.4 between the two raters is not a very good score for inter-rater reliability.

So how do you compute kappa? Do you need to work it out by hand every time like this? There are simple websites that take a 2 x 2 confusion table; you just enter the values in the table, click calculate, and it will compute the kappa score for you; this is the website. And a lot of tools, scripting languages and machine learning tools, have libraries to compute the kappa score easily.

So, in this video we saw what an imbalanced dataset is, that is, a dataset with too many positive cases or too many negative cases. And in order to pick the better-performing classifier, we have to come up with a new score which combines accuracy, precision and recall. One simple metric to start with is the F score; another is Cohen's Kappa, which is widely used to pick the right classifier. In the next video, we will look at more metrics for picking the right classifier. Thank you.