Welcome back to Introduction to Learning Analytics. In this learning dialogue (LeD), I will demonstrate how to use WEKA. I hope that in the last LxT you installed WEKA and have already played with it. In this LeD, I will show how to use WEKA on a simple data set. When you start WEKA in Explorer mode, you get this screen. The tabs across the top correspond to the stages of analysis. The first is Preprocess, where you can open data and store it. Then there are tabs for Cluster, for Classify, and for Associate, which identifies association rules in the data. There is also a tab called Select attributes: if you have many independent variables, WEKA can select the most useful attributes automatically. Finally, you can visualize the data for analysis. You can load data into WEKA using four different buttons, such as Open URL or Open file. Let us use Open file. I have already loaded a file called test.arff; ARFF is the file format WEKA uses. This data set describes cars. Each car has 7 attributes: 6 features, and a 7th which is the class label we want to predict. There are 518 instances, that is, data from 518 different cars. Let us look at the first feature, buying. There is no missing data here, which is good. Buying is a nominal attribute with 4 different values: very high, high, medium and low. This attribute is compared against the attribute called class, the label we want to predict, and we can also visualize it against the other variables; for example, here I am plotting buying versus safety. Now let us look at maintenance. Maintenance has similar values, like very high and so on. The number of doors in the car is 2, 3, 4 or 5-more, and the number of persons the car can seat is 2, 4 or more.
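To make the ARFF format concrete, here is a minimal sketch of a reader that pulls out the attribute declarations and data rows from a file like test.arff. This is illustrative only; WEKA's own loader implements the full ARFF specification, and the `car` relation shown in the sample string is a hypothetical fragment in the style of the data set described above.

```python
def parse_arff(text):
    """Minimal ARFF reader: extract attribute definitions and data rows.
    A sketch only -- WEKA's own loader handles the full ARFF spec."""
    attributes, rows = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            # e.g. "@attribute buying {vhigh,high,med,low}"
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):           # nominal: explicit value list
                values = [v.strip() for v in spec.strip("{}").split(",")]
            else:                              # numeric, string, date, ...
                values = spec
            attributes.append((name, values))
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:                          # comma-separated instance
            rows.append([v.strip() for v in line.split(",")])
    return attributes, rows

# hypothetical fragment in the style of the car data set
sample = """@relation car
@attribute buying {vhigh,high,med,low}
@attribute safety {low,med,high}
@attribute class {unacc,acc,good,vgood}
@data
vhigh,low,unacc
med,high,acc"""

attrs, data = parse_arff(sample)
print(attrs[0])   # ('buying', ['vhigh', 'high', 'med', 'low'])
print(len(data))  # 2
```

Each `@attribute` line names a variable and, for nominal attributes, lists its allowed values in braces; everything after `@data` is one instance per line.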
Then there is luggage space; safety, whose values are low, medium and high; and the class. We can plot each of these variables, buying, maintenance, doors, persons and luggage space, against safety. Safety has 3 values: there are 160 instances of low, 197 instances of medium and 150 instances of high. You can also edit the data, for example to fix missing values, although this particular data set has none. All the attributes here are nominal, so you can see the possible values for each variable: buying, maintenance, doors, persons, luggage space, safety and class. The class attribute has 4 values: unacceptable, acceptable, good and very good. The aim of this data set is: given the data for these 518 cars, predict whether a car should be classified as unacceptable, acceptable, good or very good. So there are 7 attributes. The first is buying, with values like high and very high; maintenance is similar, from very high down to low; then come the number of doors, persons, luggage space and safety, where safety takes 3 values: low, medium and high. Given these 6 features, we have to predict whether the car is acceptable, unacceptable, good or very good. This data was obtained from 518 cars, and the class labels were assigned manually by experts in the automobile field. The next important tab is Classify; we will not cover the other tabs in this demonstration. In WEKA, the default classifier is ZeroR, the baseline classifier. Here we will instead start with the J48 classifier, which builds decision trees. A decision tree is another algorithm for classifying the given data into the 4 classes.
Now, you have 4 different test options. First, you can test on the training set: use the same 518 instances both to train the model and to test it. If you do that, you will always get an inflated accuracy. Second, you can supply a separate test set: use these 518 instances to build the model, then supply another set of data to test it. Third, you can split the data by percentage: say, use 66 percent, two thirds of the data, to train the model and the remaining one third to test it. However, this split introduces a bias: which two thirds do you take? Do you pick them randomly, or take the first 66 percent of the instances? To avoid these problems, we can use cross-validation. In cross-validation, we split the given data set into a set of folds, or groups. If I want to do 10-fold cross-validation, the data is split into 10 different groups. Let us take a very simple cross-validation example. You have 30 instances and you want to do 3-fold cross-validation. Split the data into three folds of 10 instances each: the first 10 instances form fold 1, the next 10 fold 2, and the last 10 fold 3. First, we train on the 20 instances in folds 1 and 2 to create the model, and test it on fold 3; fold 3 is the test data now. In the next iteration, we select folds 1 and 3 as the training set, create a model from them, and test it on fold 2, so fold 2 has also been tested. In the final iteration, we train on folds 2 and 3 and test on fold 1.
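The fold construction above can be sketched in a few lines. This is a simplified version of the idea: WEKA additionally shuffles (and, for classification, stratifies) the data before folding, whereas this sketch keeps the instances in their original order, exactly as in the 30-instance example.

```python
def cross_validation_splits(n_instances, n_folds):
    """Yield (train_indices, test_indices) for each fold.
    Each fold serves as the held-out test set exactly once."""
    indices = list(range(n_instances))
    fold_size = n_instances // n_folds
    folds = [indices[i * fold_size:(i + 1) * fold_size]
             for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]                       # fold k is the test set
        train = [i for fold in folds[:k] + folds[k + 1:]
                 for i in fold]               # all other folds train the model
        yield train, test

# the 30-instance, 3-fold example from above
splits = list(cross_validation_splits(30, 3))
print(len(splits[0][0]), len(splits[0][1]))  # 20 10
```

Note that across the three iterations every one of the 30 instances appears in a test set exactly once, which is what makes the averaged performance an honest estimate.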
In this way, all three folds have been used to test the model, and the combined performance across the folds is reported as the accuracy of this classifier. So 10-fold, or in general n-fold, cross-validation is very useful for evaluating classification. Please use cross-validation when you apply WEKA in your course project. You can also look at More options: beyond these four test options, you can choose what to output, for example the model itself or only the performance measures, and so on. So we select J48 as the classifier, choose class, the nominal attribute with values unacceptable, acceptable, good and very good, as the attribute to predict, and click Start. Once it finishes, we can see the performance. The first line is correctly classified instances: out of 518 instances, 439 are correctly classified. So 439 divided by 518 is about 84.7 percent; that is the accuracy of this classifier. Correspondingly, 79 instances, about 15 percent, are incorrectly classified. Next is the kappa statistic, 0.66, which is quite a good value. Kappa compares the classifier's performance against the ZeroR baseline classifier, that is, against what would happen if we simply predicted the majority class, unacceptable, for every instance. A kappa of 0.66 means this classifier is substantially better than that baseline. In many settings, such as education, 0.66 is a very good kappa value, although for this particular car-classification example it may not count as an especially strong one. The root mean squared error is 0.2223, and the absolute error is also reported. Now let us look at the table of detailed accuracy by class.
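The accuracy and kappa figures WEKA prints can both be computed directly from a confusion matrix. Here is a small sketch; the 2-class matrix used in the demo call is a hypothetical toy example, not the car data.

```python
def accuracy_and_kappa(confusion):
    """Accuracy and Cohen's kappa from a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # observed agreement: fraction of instances on the diagonal
    observed = sum(confusion[i][i] for i in range(k)) / n
    # chance agreement expected from the row and column totals alone
    expected = sum(
        (sum(confusion[i]) / n) * (sum(row[i] for row in confusion) / n)
        for i in range(k)
    )
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# hypothetical 2-class matrix, not the car data
acc, kappa = accuracy_and_kappa([[40, 10],
                                 [5, 45]])
print(round(acc, 2), round(kappa, 2))  # 0.85 0.7
```

Kappa rescales accuracy so that 0 means "no better than chance agreement" and 1 means perfect agreement, which is why it is a fairer comparison against a baseline like ZeroR than raw accuracy.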
The table shows, for each class, the true positive rate, false positive rate, precision and recall. Below it is the confusion matrix with four classes A, B, C and D: A is unacceptable, B is acceptable, C is good and D is very good. The rows give the actual class and the columns give the predicted class. 335 cars are classified as unacceptable and are actually unacceptable, which is good. But 17 cars are classified as unacceptable when they are actually acceptable (class B): 17 acceptable cars wrongly classified as unacceptable. This is what determines the precision of the classifier. If you look at the precision for the first class, it is 0.952, which is a good value. There are also 32 other cars that are actually unacceptable but were not classified as unacceptable by the system: 30 + 1 + 1 = 32 cars that are actually unacceptable but were predicted as acceptable, good and very good respectively. So the recall is reduced to 0.913. The F-measure is then computed from the precision and the recall, and its values are given in the table. You can look up the confusion matrix on Wikipedia to see precision, recall, the F-measure and the ROC area explained using a similar kind of table. Another interesting entry here is B classified as B: 95 acceptable cars correctly classified, which is good. Now let us look at the poorest performance, class C. There are 13 + 1 + 3 = 17 cars that are really good, but our model predicted only 1 of them as good (C equals 1); the other 13 were predicted as acceptable and 3 as very good. So for this class both the recall and the precision are very poor.
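The per-class figures above follow directly from the confusion-matrix counts: precision divides the true positives by everything predicted as that class (the column total), recall divides them by everything actually in that class (the row total), and the F-measure is their harmonic mean. A short sketch, using the class-A counts read off the matrix above:

```python
def precision_recall_f1(tp, predicted_positive, actual_positive):
    """Per-class precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / predicted_positive  # of those predicted as this class, how many were right
    recall = tp / actual_positive        # of those actually in this class, how many we found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# class A (unacceptable): 335 true positives,
# 335 + 17 = 352 predicted as A, 335 + 32 = 367 actually A
p, r, f = precision_recall_f1(335, 352, 367)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.952 0.913 0.932
```

These reproduce the 0.952 precision and 0.913 recall from the first row of WEKA's detailed-accuracy table.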
If you look at the third line of the table, you can see the very poor recall and precision for class C: 0.059 and 0.125. The fourth class, although the numbers are small, is better: out of 8 + 1 + 5 = 14 very good cars, 8 are correctly classified as very good, so it has a better recall than class C. Based on this, we can see that the classification of good cars is not done well; however, for the other 3 classes this classifier does a really good job, and the kappa value is 0.66. If you train a different classifier, you might be able to improve the performance on this classification task. You can also view the model we created: because it is a decision tree, you can visualize the tree. In the visualized decision tree, safety is the root, the primary attribute. If safety is high, the tree checks the number of persons the car can seat; based on that, it checks the value of buying, then maintenance, and finally classifies the car as acceptable or unacceptable according to the maintenance value. If safety is low, it again looks at persons, then at the luggage space or buying values. So this is the actual model used to classify the given data set. Thank you for listening to this WEKA demo, and I hope you will use WEKA in your course project. We have a data set, which we will explain in the next LeD. Take that data, apply WEKA to it, and try to predict the student dropout rate in a MOOC. Thank you.
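To see how such a tree classifies a car once it is learned, here is a hand-written stand-in for the J48 model. The attribute names match the data set, but these particular rules are hypothetical, chosen only to mirror the shape described above (safety at the root, then persons, then buying and maintenance); they are not the exact tree WEKA produced.

```python
def classify(car):
    """Illustrative only: hypothetical rules mirroring the shape of the
    learned tree, with safety as the root test."""
    if car["safety"] == "low":
        return "unacc"        # root test: safety first
    if car["persons"] == "2":
        return "unacc"        # then the number of persons
    if car["buying"] == "vhigh" and car["maint"] == "vhigh":
        return "unacc"        # then buying and maintenance
    return "acc"

print(classify({"safety": "high", "persons": "4",
                "buying": "med", "maint": "med"}))  # acc
```

A decision tree is just this kind of nested test, applied top-down until a leaf assigns a class label, which is why the visualized tree is directly readable as a set of rules.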