Today, we are going to discuss classification and its issues. At the end of this session, a student will be able to explain what classification is and the issues related to it. Classification is the most popular data mining technique used to mine data.
Classification is commonly used in image and pattern recognition. Here we classify images based upon their content, and based on the patterns and their values we recognize these images and assign them to different classes. In medical diagnosis, an image of a particular organ might be taken, and the diagnosis is done based upon whether that organ is defective or not. For loan approval, classification takes the request from a particular user and finds out whether the loan can be approved or not, thus dividing requests into two classes. We could also have a third class here, which might be a pending class. For fault detection in industrial applications, we would like to find out whether a part is defective or not, so we are again classifying our parts into two classes. For financial market trends, we classify investments and market demands into various classes.
So basically there are two types of classification methodologies that can be used. The first is estimation, wherein we find or guess a particular value and then use that value to decide which class the item can be put into. The second is prediction, where we assign an item to one of a set of classes based on the value of an attribute. Strictly speaking, prediction forecasts a continuous value, whereas classification usually forecasts a discrete value, such as one or zero, indicating whether an item belongs to a particular class or not.
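To make the difference between a continuous estimate and a discrete class concrete, here is a minimal sketch built on the loan approval example. The scoring formula, the two thresholds, and the applicant figures are all hypothetical and chosen only for illustration; they are not a real credit policy.

```python
# Hypothetical thresholds, chosen only for illustration.
APPROVE_AT = 0.7
REJECT_BELOW = 0.4


def estimate_repayment_score(income, existing_debt):
    """Estimation step: guess a continuous value between 0 and 1 (made-up formula)."""
    if income <= 0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - existing_debt / income))


def classify_loan(score):
    """Classification step: map the continuous estimate to one discrete class."""
    if score >= APPROVE_AT:
        return "approved"
    if score < REJECT_BELOW:
        return "rejected"
    return "pending"


# Made-up applicants: name -> (income, existing debt).
applicants = {"A": (50000, 10000), "B": (30000, 15000), "C": (20000, 18000)}

decisions = {
    name: classify_loan(estimate_repayment_score(income, debt))
    for name, (income, debt) in applicants.items()
}
print(decisions)
```

The estimate can take any value in a range, while the final decision is always exactly one of the three classes.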
As an example, we all know that teachers classify students based on performance as weak students, average students, and very good students. So everywhere we see that some data is required for a classification to be defined. Classification is also used in problems like a credit card purchase. As someone who requires a credit card, I would look into what facilities the service provider is going to give me, and then I would choose a card, which might be a MasterCard, a Visa card, or some other card of my interest. Decision trees and neural networks help us classify people, for example according to their height, depending on the problem we need to solve.
The background requirement for classification is that all approaches assume some knowledge about the data. We have to know what that data is. A training set is used to develop the specific parameters required by the technique and to develop a particular model. This training data consists of sample input data as well as the classification assignments for that data. Domain experts may also assist in this process so that classification is done in a proper way.
We can now define classification. We are given a database D = {t1, ..., tn} of tuples (items or records) and a set of classes C = {C1, ..., Cm}. The classification problem is to define a mapping f from D to C where each ti is assigned to one class. A class Cj contains precisely those tuples mapped to it, that is, Cj = {ti | f(ti) = Cj, 1 ≤ i ≤ n, and ti ∈ D}. So there is a mapping from the database to the set of classes. Classes are predefined and non-overlapping, each is unique, and together they partition the entire data set.
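The definition above can be sketched in a few lines, using the example of a teacher classifying students. The marks, the cut-off values inside f, and the student identifiers are hypothetical; the point is only that f sends every tuple of D to exactly one predefined class, so the classes partition D.

```python
# Predefined classes C1..Cm (m = 3 here).
CLASSES = ("weak", "average", "very good")


def f(t):
    """Hypothetical mapping f: D -> C based on a student's marks."""
    _student, marks = t
    if marks < 40:
        return "weak"
    if marks < 75:
        return "average"
    return "very good"


def partition(database, mapping, classes):
    """Build each class Cj as the list of tuples ti with mapping(ti) == Cj."""
    groups = {c: [] for c in classes}
    for t in database:
        groups[mapping(t)].append(t)
    return groups


# Made-up database D of (student, marks) tuples.
D = [("s1", 82), ("s2", 35), ("s3", 60), ("s4", 74)]

groups = partition(D, f, CLASSES)
for label, members in groups.items():
    print(label, members)
```

Because f returns a single class for each tuple, no tuple appears in two classes and none is left out.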
The classes can be seen as equivalence classes, and there are different phases involved in creating them. Now let us think for a while about these phases. We come to the conclusion that classification depends on two phases. The first is to create a specific model by evaluating the training data. Out of the total data, some data is termed training data and reserved to develop the model. So the input is this training data and the output of phase one is the definition of the model. In the second phase this model is used on the remaining data: you apply the model developed in step one, classify the tuples from the target database, and put these tuples into the classes as defined by your model.
There are three basic methods to solve the classification problem. The first is specifying boundaries: we divide the input space into regions and associate each region, which is independent, with one class. These methods are used in naive division and decision trees. The second method uses probability distributions: the probability distribution function for a class is evaluated at a point and, together with the probability of the class, is used to estimate the probability that the point belongs to that class. The third method uses posterior probabilities: given an item, we determine the probability that it is in a class. We use this concept when we use neural networks as classifiers.
However, in classification problems there is always the issue of overfitting, which shows up as bias: a model that fits the training data too closely may satisfy testing on that data and yet not apply to the broader population of data. So the data on the borderlines is of much importance during classification.
Now what are the issues to be dealt with in classification? The first is that of missing data, that is, values which may not be there in your database. This causes problems during the training phase and also during the second phase, the classification process.
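Since both phases come up again in what follows, here is a minimal sketch of them, using the example of classifying people by height. The model is a deliberately simple one, assumed only for illustration: each class is represented by the mean height of its training tuples, and a new tuple gets the class with the closest mean. The heights and class labels are made up.

```python
def build_model(training_data):
    """Phase 1: from (height, class) pairs, compute the mean height of each class."""
    sums, counts = {}, {}
    for height, label in training_data:
        sums[label] = sums.get(label, 0.0) + height
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def apply_model(model, height):
    """Phase 2: assign the class whose mean height is closest to the given height."""
    return min(model, key=lambda label: abs(model[label] - height))


# Training data: sample inputs together with their class assignments.
training = [
    (1.50, "short"), (1.55, "short"),
    (1.70, "medium"), (1.75, "medium"),
    (1.90, "tall"), (1.95, "tall"),
]
# Remaining data: tuples whose class is not yet known.
remaining = [1.52, 1.72, 2.00, 1.60]

model = build_model(training)
assigned = [(h, apply_model(model, h)) for h in remaining]
print(assigned)
```

The training data is used only in build_model, and the remaining tuples are touched only in apply_model, which mirrors the split between the two phases.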
Much of this can be handled in the training data so as to generate accurate results. Otherwise we may miss certain items and the results might not be what we require. Missing values in a tuple to be classified also have to be handled by a proper classification scheme. Now what are the approaches to handle missing data? One is to ignore the missing data totally, right from the beginning of training till the end of the classification. Another is to assume a value for the missing data through prediction and keep it the same from beginning to end. A third is to assume a special value for missing data, all of its own. So for such data we put in a special value and see that this missing data is dealt with till the end of the classification.
Now, to measure the performance of classification, we determine which approach is best in association with the interpretation of the problem by users. The users give us the data, and how well we classify it is how well our classification technique has performed. How the classification results and the classification tools are used also helps in measuring performance. We evaluate the accuracy of classification, that is, whether items are properly classified and not misclassified. Accuracy, space complexity, and time complexity can all be used. Finally, classification accuracy is calculated as the percentage of tuples placed in the correct classes. The cost associated with an incorrect assignment to a wrong class should also be considered.
We use operating characteristic (OC) curves, also called receiver operating characteristic or relative operating characteristic (ROC) curves, which show the relationship between false positives and true positives. We find the false alarm rate, the fallout, and the recall, and the confusion matrix illustrates the accuracy of the solution to a classification problem.
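These measures can be computed with a short sketch based on the fault detection example. The actual and predicted labels below are made up. The matrix is laid out so that entry [i][j] counts the tuples whose correct class is classes[i] and whose assigned class is classes[j], and "defective" is treated as the positive class when computing recall and fallout.

```python
def confusion_matrix(actual, predicted, classes):
    """Entry [i][j] counts tuples with correct class classes[i] assigned to classes[j]."""
    index = {c: k for k, c in enumerate(classes)}
    m = len(classes)
    matrix = [[0] * m for _ in range(m)]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Percentage of tuples placed in the correct class (the diagonal entries)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return 100.0 * correct / total


classes = ["defective", "ok"]
# Made-up labels for eight parts.
actual = ["defective", "defective", "ok", "ok", "ok", "defective", "ok", "ok"]
predicted = ["defective", "ok", "ok", "defective", "ok", "defective", "ok", "ok"]

cm = confusion_matrix(actual, predicted, classes)

# With "defective" as the positive class:
tp, fn = cm[0][0], cm[0][1]  # defective parts caught / missed
fp, tn = cm[1][0], cm[1][1]  # good parts flagged by mistake / passed correctly
recall = tp / (tp + fn)      # true positive rate
fallout = fp / (fp + tn)     # false positive (false alarm) rate

print(cm, accuracy(cm), recall, fallout)
```

A single (fallout, recall) pair like this is one point on an ROC curve; a full curve would need such pairs for a range of decision thresholds.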
What is this confusion matrix? Given m classes, a confusion matrix is an m × m matrix where the entry cij indicates the number of tuples from the database D that were assigned to the class Cj but where the correct class was actually Ci. The best solution will have only zero values outside the diagonal. For our references we have used. Thank you.