Assignments for the students can be divided into two parts: assignments on artificial data sets and assignments on real-life data sets. Let me first talk about assignments on real-life data sets. If you really want to do assignments on real-life data sets, the first question is where such data sets are available. They are available in the UCI archive. There you will find many data sets, at least 50, and 50 is a conservative number. You will get labeled and unlabeled data sets, and you can do classification, clustering, feature selection, all of these things, by downloading them. The data sets are freely downloadable; you do not have to pay anything. So whatever techniques have been taught here, you can apply them to the respective data sets. That is for real-life data sets.

But you will always have a problem there. When you apply these techniques to a real-life data set, some of them may give good results and some may not, and the question is: how do I know which method should be applied to which data set? This is one of the questions people have been bothered about for a long, long time. To give even a partial answer, you need to look at artificial data sets. For artificial data sets we do the modelling ourselves: since we generate the data, we know the model of the data, so we know everything about the data set, and whether a method is working well on it can be found out easily. If some other person generates a data set for some problem and gives it to you, and you claim that some method X works on this data set, that person may ask: "I will give you 10 more observations from the same source, observations that are not present in your data set; do you think your method will still work?" Your answer will be that you do not know. With artificial data sets, since we know the model of the data, we can take many more points and make n as large as we like. Maybe for 1000 points one method gives better results than another, but for 2000 points the one that was better is no longer better and the other one is. You can observe such variations because you can generate as many points as you want; you can make n very, very large to check whether one method is really better than another. With real-life data sets the number of points is limited, so whatever you conclude, you are concluding only on those points. The moment a new point comes in, you still have the problem of generalizing to it: many times you are not in a position to say whether the method that works best on those points will still work better than the other methods if some more points are included.
You will not have these problems with artificial data sets, and it is always necessary to work on artificial data sets to understand the limitations of a method. That is why I am now going to discuss how to do the assignments on artificial, or synthetic, data sets. For this, the first thing is to generate the data set, so let me give one assignment right now.

I will take points in R2. Consider three two-dimensional mean vectors, written as column vectors (prime denotes transpose): mu1 = (0, 0)', mu2 = (0, 1)', mu3 = (1, 0)'. Take three covariance matrices Sigma1, Sigma2, Sigma3, and two cases for the prior probabilities: case 1, p1 = p2 = 0.5, and case 2, p1 = 0.4, p2 = 0.6.

A. Consider N(mu1, Sigma1) and N(mu2, Sigma1) under case 1, that is, with prior probabilities 0.5 and 0.5, and generate n points from the mixture density function. Start with n = 100; the next value is 500, then 1000, then 2000, then 5000. So you first generate a 100-point set, then a 500-point set, then 1000, 2000 and 5000. When you generate a point from the mixture density function, you get the label as well as the point.

B. N(mu1, Sigma1) and N(mu2, Sigma1) under case 2, where the prior probabilities are 0.4 and 0.6; generate n points from the mixture density function.

C. Note that in the first two cases the covariance matrices are the same, so in the third case I take them to be different: N(mu1, Sigma1) and N(mu2, Sigma2) under case 2; generate n points from the mixture density function.

D. N(mu1, Sigma1), N(mu2, Sigma2) and N(mu3, Sigma3) with p1 = 0.3, p2 = 0.4, p3 = 0.3; generate n points from the mixture density function.

So the data sets are generated, and note that when you generate these points you know not only each point but also its class label. After the generation, what is the work to be done? There are 4 cases in total, and in each one n can take any one of the 5 values, so there are 20 different situations altogether. Take the first n/2 points as the training set and the rest as the test set. Then, for each training set and for each class, estimate the mean, the covariance matrix and the prior probability.
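To make these steps concrete, here is a minimal sketch in Python with NumPy of what case A might look like; it is only an illustration. The covariance matrices are placeholders, since their values are not fixed here, and identity matrices are used purely for demonstration. The sketch generates a labeled sample from the mixture, takes the first n/2 points as the training set, and estimates the per-class mean, covariance matrix and prior probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case A (placeholder covariance matrices: their values are left open in the lecture)
mu     = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]   # mu1, mu2
sigma  = [np.eye(2), np.eye(2)]                          # Sigma1 used for both classes
priors = np.array([0.5, 0.5])                            # case 1 prior probabilities

def generate_mixture(n, mu, sigma, priors, rng):
    """Generate n labeled points from the Gaussian mixture: label first, then point."""
    labels = rng.choice(len(priors), size=n, p=priors)
    points = np.array([rng.multivariate_normal(mu[k], sigma[k]) for k in labels])
    return points, labels

n = 100
X, y = generate_mixture(n, mu, sigma, priors, rng)

# First n/2 points form the training set, the rest the test set.
X_train, y_train = X[:n // 2], y[:n // 2]
X_test,  y_test  = X[n // 2:], y[n // 2:]

# For each class: estimated mean, covariance matrix and prior probability.
est_mu    = [X_train[y_train == k].mean(axis=0) for k in range(2)]
est_sigma = [np.cov(X_train[y_train == k], rowvar=False) for k in range(2)]
est_prior = [np.mean(y_train == k) for k in range(2)]
```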
After estimating these three quantities, apply the Bayes decision rule to each point in the test set using the estimated values, that is, the estimated mean, the estimated covariance matrix and the estimated prior probability, under the assumption of normality. Since you are assuming a normal distribution, all you need to know in order to apply the Bayes decision rule are the means, the covariance matrices and the prior probabilities, and you have estimated those. So apply the Bayes decision rule and find the misclassification rate. What is the misclassification rate? Take a point in the test set; you already know its label. Suppose the label says it should go to class 1 but your rule puts it in class 2: then it is misclassified. If a point from class i is put in class i, there is no misclassification; otherwise there is. The misclassification rate is the number of misclassified points divided by the total number of points.

Note that all of this is done with the estimated quantities. Now do the same thing with the actual ones: you already know the actual means, the actual covariance matrices and the actual prior probabilities. Apply the Bayes decision rule to each point in the test set using the actual parameter values, not the estimated ones, again under the assumption of normality, and find the misclassification rate. So you now have two misclassification rates, one obtained with the estimated values and one with the actual values. As n increases, the difference between these two rates should decrease; the two rates should become very close. Then we know that the whole procedure is fine. Whatever theory has been taught, under the assumption of normal distributions, this is one way you can check it.
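To make the rule concrete, here is a small hedged sketch in Python (NumPy and SciPy assumed available): under normality, the Bayes decision rule assigns a point to the class with the largest prior probability times normal density, and the misclassification rate is the fraction of test points whose predicted class differs from the known label. The same functions can be fed either the estimated parameters or the actual ones, giving the two rates that should come closer as n increases.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(X, means, covs, priors):
    """Assign each row of X to the class maximizing prior * normal density
    (equivalently, log prior + log density)."""
    scores = np.column_stack([
        multivariate_normal.logpdf(X, mean=m, cov=S) + np.log(p)
        for m, S, p in zip(means, covs, priors)
    ])
    return scores.argmax(axis=1)

def misclassification_rate(y_true, y_pred):
    """Number of misclassified points divided by the total number of points."""
    return float(np.mean(y_true != y_pred))

# Tiny self-contained demonstration with assumed parameters; in the assignment the
# same calls are made once with the estimated values and once with the actual ones.
rng = np.random.default_rng(1)
means  = [np.zeros(2), np.array([0.0, 1.0])]
covs   = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
y_test = rng.choice(2, size=50, p=priors)
X_test = np.array([rng.multivariate_normal(means[k], covs[k]) for k in y_test])
print(misclassification_rate(y_test, bayes_classify(X_test, means, covs, priors)))
```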
But let me tell you about a small problem that you may face. In all these cases, where I said generate n points from the mixture density function, there is one word that I should have written every time: randomly. This word is extremely important, because a computer cannot generate random numbers; whatever method you use, it can only generate pseudo-random numbers. Keep this in mind: to the extent possible, you should generate the points randomly, and if your method of random number generation is fine, you will see the result described above.

Let me tell you how to generate points from the mixture density function. Generating one point from the mixture has two parts. The first part is generating a random number in the interval 0 to 1; once you have it, you see which interval it falls in. In case 1, if the random number falls in 0 to 0.5, you generate the point from the first distribution; if it falls in 0.5 to 1, you generate it from the second distribution. This is what one needs to do in each of the four cases A, B, C, D. For example, in case D, generate a random number between 0 and 1: if it falls in 0 to 0.3, generate a random vector from the first distribution; if it falls in 0.3 to 0.7, from the second distribution; and if it falls in 0.7 to 1, from the third distribution. So generating a random vector from the mixture distribution has two parts: first decide which of the component distributions to use, and then generate from that distribution. In this way you get the point and you also get its label.

Like that, you generate a 100-point set, then a 500-point set, a 1000-point set, a 2000-point set and a 5000-point set. The first n/2 points form the training set and the rest the test set: if n is 100, the first 50 points are the training set and the remaining 50 the test set; similarly, for n = 5000, the first 2500 are the training set and the next 2500 the test set. Then you use the training set to estimate the mean, the covariance matrix and the prior probability for each class. Using the estimated values, you check for every point in the test set whether it is rightly or wrongly classified and obtain the misclassification rate. Similarly, you obtain the misclassification rate using the original parameter values, which are all known: mu1, mu2, mu3, Sigma1, Sigma2, Sigma3. So you get two error rates, one for the original parameters and one for the estimated ones, and you need to check whether the estimated ones approach the original ones.

Note that, according to the theorems, as n goes to infinity the misclassification rate obtained with the original parameter values should go to the Bayes misclassification probability, that is, the misclassification probability of the Bayes decision rule; since we are applying the Bayes decision rule, that is the quantity to talk about. What I am trying to say is that if you use the estimated parameters, then also, for sufficiently large n, the estimated misclassification rate will go to the Bayes misclassification probability. Since we do not know the Bayes misclassification probability, even when we use the original parameters we are doing some sort of estimation: the rate still depends on what the test set is, and the test set varies. If I take one set of 5000 points I get one test set, and if you take another set you get a different test set; the same is the case with the training set. So how do I know whether the misclassification rates we are getting are really good estimates? For that, in principle, you are supposed to do the same experiment many, many times.
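As a hedged illustration of this two-part procedure, here is how case D might be generated in Python with NumPy: a uniform random number selects the component according to the cumulative priors 0.3, 0.7, 1.0, and the point is then drawn from the selected normal distribution, giving the point together with its label. The covariance matrices are again placeholders, and NumPy's generator is itself a pseudo-random number generator, exactly as cautioned above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Case D with priors 0.3, 0.4, 0.3; the covariance matrices are placeholders.
mu     = [np.array([0.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
sigma  = [np.eye(2), np.eye(2), np.eye(2)]
priors = [0.3, 0.4, 0.3]
cum    = np.cumsum(priors)                  # interval boundaries 0.3, 0.7, 1.0

def sample_one(rng):
    """Part 1: a uniform number in [0, 1] selects the component
    (u <= 0.3 -> first, 0.3 < u <= 0.7 -> second, u > 0.7 -> third).
    Part 2: the point is drawn from that component's normal distribution."""
    u = rng.uniform(0.0, 1.0)
    k = int(np.searchsorted(cum, u))
    return rng.multivariate_normal(mu[k], sigma[k]), k   # point and its label

for trial in range(10):                     # repeating the whole experiment several times
    for n in (100, 500, 1000, 2000, 5000):
        pairs = [sample_one(rng) for _ in range(n)]
        X = np.array([x for x, _ in pairs])
        y = np.array([k for _, k in pairs])
        # first n // 2 points are the training set, the rest the test set;
        # the parameter estimation and the two misclassification rates then
        # follow exactly as in the earlier sketches
```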
Say for n = 100, do it, let us say, 10 times. So you generate ten 100-point sets, and for each set you have a misclassification rate based on the estimated values and a misclassification rate based on the actual values. If you take the average of the rates based on the estimated values, it is supposed to be close to the average of the rates based on the actual values; these two averages should be close. In principle one should do this, so you can take it as a part of the assignment as well. This is on the Bayes decision rule.

So far we have been discussing the Bayes decision rule, but the other classifiers that have been developed for classification can all be used in the same way. You have a training set and a test set: using the training set you develop the classifier, and using the test set you find the misclassification rate. That is it. For example, with the minimum distance classifier you have the class means; you calculate the distance of a point to each mean and put the point in the class whose mean is closest. You can do that with these same training and test sets. Like that, whatever classifiers have been developed, you can apply them to these training and test sets and check their performance.

That is about artificial data sets. About real-life data sets I would like to make a few comments. Those of you who are aware of image processing know that there is the Lena image, on which most people have worked; a great deal of work has been done on the Lena image. Similarly, among real-life data sets there is one specific data set on which a lot of work has been done: Fisher's Iris data. You have 4-dimensional observations, 3 classes, and 50 points from each class, so 50 + 50 + 50 = 150 points in total, and you have to build a classifier. This is one of the most widely used data sets; so many classification methods have been developed, and I am more or less sure that almost all of them have been applied to this data set. If you develop a new scheme and want to say that your method works better than the others, one way is to apply it to Fisher's Iris data and then look at the latest literature: you will find many people working on it and reporting results, and if yours is better than what you find there, that adds a plus point to your work. So Fisher's Iris data is one such standard data set, and like that there are many, many more.
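For Fisher's Iris data and the minimum distance (nearest mean) classifier mentioned above, a small hedged sketch might look as follows. It assumes scikit-learn is installed just for the data loader (the same data set can equally be downloaded from the UCI archive), and the 25-points-per-class training split is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris   # assumes scikit-learn; the UCI archive hosts the same data

# Fisher's Iris data: 150 four-dimensional points, 3 classes with 50 points each.
iris = load_iris()
X, y = iris.data, iris.target

# Illustrative split: within each class, the first 25 points for training, the rest for testing.
train_idx = np.concatenate([np.where(y == k)[0][:25] for k in range(3)])
test_idx  = np.setdiff1d(np.arange(len(y)), train_idx)
X_train, y_train = X[train_idx], y[train_idx]
X_test,  y_test  = X[test_idx],  y[test_idx]

# Minimum distance (nearest mean) classifier: put each point in the class whose mean is closest.
means  = np.array([X_train[y_train == k].mean(axis=0) for k in range(3)])
dists  = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
y_pred = dists.argmin(axis=1)

print("misclassification rate:", np.mean(y_pred != y_test))
```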
Before I conclude, I would like to mention one thing. There are some data sets for which the training sample sets have also been given. You always have the problem of deciding the size of your training sample set, that is one thing; and the second is that, once you choose the size, which points are you going to include in it? For the data sets where the training sample set is given, you can do the comparison very well. But for many data sets there is no given training sample set; there is just labeled data. Since they are standard data sets, you will find many people working on them and many published results. If your results on, let us say, 20 to 25 data sets are better than the existing results, then at least you have a contention; you have something to say to the reviewers, namely that your results are better. It is very difficult for reviewers to reject your paper outright, even if your evidence is only experimental, because these are standard, publicly available data sets and you are showing that your results are better than all the existing results. That is why people like to apply their own methods to real-life data sets on which a great deal of work already exists; such data sets have become benchmark data sets. So I would like you to work on these data sets, and whenever a new method is developed you should apply it to them and find out how good or how bad your method is on the basis of the results you get on real-life data sets. Thank you.