So I had discussed a little bit of the mathematical preliminaries for pattern recognition, where I talked about some amount of statistics and some of the connections between the statistics literature and matrix algebra. Now let us actually start the subject. Basically, in pattern recognition we have some phenomenon on which we have made some measurements. From those measurements we have to look at what features we can think of: from the measurement space we have to go to what is known as the feature space, and from the feature space to what is known as the decision space, where we make decisions about the phenomenon under consideration.

The main tasks are these. One task is feature selection: which features are actually important for the problem under consideration? The next task is that we are supposed to do either classification, which in some of the books is written as supervised classification (nowadays it is simply known as classification), or unsupervised classification, which is slightly old terminology and is nowadays known as clustering. So we need to do either classification or clustering, and the other task is feature selection. Basically, from the measurement space you are supposed to go to the feature space, and from there we go to the decisions.

Now let me give you an example. Suppose the problem that you are dealing with is a character recognition problem. What is a character recognition problem? Say this is the character B and this is the number 8. We know this is B and this is 8; we want the computer to say that this is B and this is 8. So this is recognition: you need to recognize the character. How does one do it? There are several ways of doing it; I am just mentioning a procedure which is easily understandable to you. What you need to do is enclose the character in a rectangle whose sides are parallel to the x and y axes. So it should probably be like this, say,
it is like this, and similarly here the sides are parallel. I wrote it slightly slanted; here this line is actually touching the character, which is why it looks like that. Now, what is the main difference between B and 8? In B, the left portion is a straight vertical stroke, roughly perpendicular to the top side of the rectangle, whereas if you look at 8, this upper portion and this lower portion are the same, but it is basically this left portion where there is a difference. So if you want to differentiate between B and 8, what one needs to do is take some points on the left side of the rectangle and measure the distance from each such point to the character pixel that is closest to it: you draw a line parallel to the top side and see where it actually touches the character. It touches at this place, at this place, at this place; just measure the distance of this boundary pixel to that character pixel, and so on. If these distances are all more or less the same, then you have a B; if they are not the same, you have an 8. For 8, here the distance will be very big, here they will be small, and here it is actually 0.

So these are the features that we are measuring. Earlier we were given these two characters, which we have digitized: we made binary images. Then we enclosed each one in a rectangle, and from there we found these features. Note that from the measurement space we went to the feature space: we calculated these distance values. After we calculated those values, we made a decision: if the values are all very close, we called the character B; if they are not very close, we called it 8. Are you understanding what I am trying to say? This is just for you to understand the idea; there are much better methods. In fact, those methods not only deal with B
and 8; they deal with all the characters of the English language, not only capital letters but both lowercase and uppercase letters, plus all the digits 0 to 9, all possible things. For the English language, really well-developed character recognition software is widely available in the world now. So whatever I am telling you is a rudimentary thing, just to make you understand the meaning of measurement space, feature space, and decision space. The measurement space is the given thing: we are given two characters, which we have digitized into binary images. The feature space means we measured these features. The decision space means we made a decision on the basis of those distance values. So this is just an example of what the measurement space, the feature space, and the decision space are. It is not necessarily true that in every example the measurement space and the feature space are different; sometimes they may be the same, sometimes you will put a portion of the measurement space into the features. It depends on how you look at it.

Then we go in for the decision, and the decisions can be of various types. If we are looking at some sort of dissimilarity measure, then within a cluster the similarity should be high, and between two clusters the dissimilarity should be high; that is clustering. Or we might be interested in doing classification. So the main problems that we are going to encounter here are mostly feature selection, clustering, and classification. Now, in classification there are basically two cases that we are going to
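The distance measurements just described can be sketched in code. This is a minimal illustration under my own assumptions: a binary image stored as a list of rows of 0/1 pixels (1 = ink), the left-edge distance taken row by row, and a hypothetical tolerance `tol` deciding what counts as "roughly constant" (none of these names or values come from the lecture).

```python
def left_profile(img):
    """For each row of a binary image (list of rows, 1 = ink), the
    distance from the left side of the bounding box to the first
    ink pixel in that row."""
    rows = [r for r in img if any(r)]          # drop blank rows
    left = min(r.index(1) for r in rows)       # left side of the bounding box
    return [r.index(1) - left for r in rows]

def classify_B_or_8(img, tol=1):
    """Nearly constant left-edge distances -> 'B' (straight vertical
    stroke); large variation -> '8'.  tol is a hypothetical
    tolerance, not a value from the lecture."""
    d = left_profile(img)
    return 'B' if max(d) - min(d) <= tol else '8'
```

For a crude binary B every row starts at the vertical stroke, so the distances are all equal; for a crude 8 the pinched waist makes the middle distance larger, and the rule answers 8.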
consider in this course. In one case we are going to assume that the class conditional probability density functions and the prior probabilities are known for the classes; in the other case we are going to assume that training sample points are given. Now, what is the meaning of these two things? The meaning is the following. In case one, we will be knowing the number of classes. What is the meaning of knowing the number of classes? Note that in my previous lecture I gave you several applications of pattern recognition, where I talked about land cover types in satellite images; for the classification of pixels, the number of different land cover types is basically the number of classes in that problem. Now, in the problem of digits, you have the 10 digits 0 to 9, and you are given a new digit; this new digit should be put into one of those 10 classes. If that is the problem, then the number of classes is 10, and those classes correspond to digit 0, digit 1, digit 2, digit 3, up to digit 9. If it is the classification of the uppercase letters of the English language, then there are 26 classes, because there are 26 letters, A, B, C, D, up to Z. Then someone writes one of the characters, and you should put that character into one of those classes. Now, one may write the letter A like this, or like this, or like this; you have several such variations, and each one of them you should put into the class of the letter A. So we are going to assume that the number of classes is known.

The second point concerns the conditional probability density function. For the present moment, let me set aside the meaning of the word "conditional". You see, we can only talk about a probability density function once we know the features under consideration. If
the number of features is some small n, then the probability density function is defined over the space Rn. So when we are going to talk about a probability density function, the first question is: how many variables do you have? And that is the same as asking how many features you have. For the sake of convenience, let me again mention an example. The example is the classification of persons belonging to two different communities: one community is South Indians, the other community is Punjabis. So basically you have two classes: one class is Punjabis, the other class is South Indians. Now you are going to measure features. What features are you going to measure? Let us say one feature is height and another feature is weight; you might measure some other features also, but I have just given two.

Then the probability density function comes into the picture. Let us say we are interested in the classification of males; for the present moment let us forget about females, and let us also assume that the males under consideration are at least 20 years of age, so that we are not thinking of small children aged some months or a few years. So our main problem is this: we have some males, each of age greater than or equal to 20; each male is either a Punjabi or a South Indian; and what you need to do is put each male into either the Punjabi class or the South Indian class. That is the problem under consideration. Now, what is the input that is given to you? The input is: for the Punjabi class we know the density function, and for the South Indian class we know the density function. What is the meaning of that? The meaning is the following. Let us say we only have one
feature, height, and let us say we measure it in centimetres. Let us say the maximum height we have is 240 centimetres; 240 centimetres is how many feet? About 8 feet, right? And let us say the minimum height is 120 centimetres. Now let us look at the class of Punjabis, and let us assume that we only have one feature, so that I can draw the density function. I can draw the density function if I have only one feature; if I have two features, then I need to draw it in three-dimensional space, which I cannot do on the board. So, in order to make you understand, I am going to assume for the present moment that I only have one feature, which is height.

Basically, what may happen is that with respect to height, the density function of the Punjabis looks like this; the peak may correspond to something like 5 feet 6 or 5 feet 7, which is around 167, 168 centimetres or so, something like this. Whereas for the South Indians, because South Indians are generally shorter than Punjabis (that is one of the reasons why I took these two communities; am I right? yes), the density function may look like this: it is shifted to this side of the curve; how much to this side I do not know, but it is on this side. So this curve is for, say, the Punjabis, and this curve is for, say, the South Indians. The area under this curve is one, and the area under this curve is also one. You can have such density functions when the number of features is 2, 3, 4, and so on. So you are given the density functions of the classes: in the example I have taken, you have two classes, and for each class the density function is given. Since the density function is given for a class, it is
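The two height density curves drawn on the board can be sketched numerically. Purely for illustration I assume Gaussian shapes (the lecture does not commit to any particular shape), with the Punjabi peak near the 168 cm value mentioned and the South Indian curve shifted to the shorter side; the means and spreads are my own assumptions.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at the point x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Assumed class conditional densities for the single feature "height (cm)".
# The parameters are illustrative, not figures from the lecture:
p_punjabi = lambda h: gaussian_pdf(h, 168.0, 7.0)       # peak near 168 cm
p_south_indian = lambda h: gaussian_pdf(h, 163.0, 7.0)  # shifted shorter
```

A quick numerical check over the 120 to 240 cm range confirms that the area under each curve is (essentially) one, as any density must be.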
known as the class conditional density function. If you look at the books, it is generally represented as p(x | ωi): p represents the probability density function, ωi is the ith class, and for the point x the density value is p(x | ωi). This is the notation that you will find in books. I did not want to write the ωi every time, so in my slides I have followed the shorter notation pi(x).

Now there is another quantity, the prior probability. What is a prior probability? Put all the Punjabis and South Indians together; suppose in the whole mixture 40% are Punjabis and 60% are South Indians. Then the prior probability of the class Punjabi is 0.4 and the prior probability of the class South Indian is 0.6. Are you understanding? Let me repeat: you put the Punjabis and South Indians together; if in the whole mixture 40% are Punjabis and 60% are South Indians, then the prior probability for the class Punjabi is 0.4 and for the class South Indian it is 0.6. So these are the prior probabilities.

Now, suppose the class conditional probability density functions are known and the prior probabilities are also known. If you know these two, then how does one get the rule of classification? (There is a second case, where training sample points are given; I will explain that case to you slightly later.) For now, let us assume that we are given the conditional probability density functions and the prior probabilities. This is where the Bayes decision rule comes in. What is this rule? I have assumed here that you have M classes, and, as many classes as there are, that many density functions you have, so the density functions are p1, p2, up to pM. Now the number of features
is N, since I have assumed that all the points belong to the N-dimensional Euclidean space. And the prior probabilities: as many classes as there are, that many prior probabilities you need, so the prior probabilities are P1, P2, up to PM. These are all known; naturally the sum of these prior probabilities is 1, and each of them lies between 0 and 1, because they are probabilities. Now, what the Bayes decision rule says is: put x in class i if Pi pi(x) ≥ Pj pj(x) for all j ≠ i. This is what is known as the Bayes decision rule. In the remaining part of this lecture and the next lecture, I am going to concentrate solely on this rule.

You see, whenever you are doing any classification, there is generally some misclassification. Let me explain this using a few examples. In the example we have considered, Punjabis and South Indians, let us say we have only two features, height and weight. Now suppose you put a threshold: if the height is greater than this and the weight is such-and-such, then you take the person as a Punjabi, otherwise as a South Indian. Then you can ask me a question: can't we have South Indians whose heights are greater than that threshold? How do we say we can never have one? Well, I cannot say that. Whenever there is a decision rule, it is not necessarily true that every point in one class satisfies that rule and every point in the other class satisfies the complementary rule. In general you have some amount of misclassification.

Let me give another example; this example concerns images. Suppose you are working in a bank, and you would like to see to it that people do not steal money
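The rule just stated, put x in class i if Pi pi(x) ≥ Pj pj(x) for all j ≠ i, is essentially a one-line argmax. Here is a minimal sketch, using the 0.4/0.6 priors from the Punjabi/South Indian example; the Gaussian height densities and their parameters are my own illustrative assumptions, not figures from the lecture.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at the point x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def bayes_decide(x, priors, densities):
    """Bayes decision rule: put x in the class i that maximizes the
    product P_i * p_i(x) of prior and class conditional density.
    Ties go to the first maximizing class, matching the >= in the rule."""
    scores = [P * p(x) for P, p in zip(priors, densities)]
    return max(range(len(scores)), key=scores.__getitem__)

# Two classes, priors 0.4 / 0.6 as in the lecture's example, with
# assumed Gaussian height densities (parameters are illustrative):
priors = [0.4, 0.6]                                   # Punjabi, South Indian
densities = [lambda h: gaussian_pdf(h, 168.0, 7.0),   # class 0: Punjabi
             lambda h: gaussian_pdf(h, 163.0, 7.0)]   # class 1: South Indian
```

With these assumed densities, a tall height such as 180 cm is assigned to class 0 and a short one such as 150 cm to class 1.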
from the bank; that is your aim. What you would like to do is prevent people from carrying rifles, pistols, guns, and so on, say some 10 to 15 such objects. So what you have done is put a video camera at a specific location; whenever someone enters that particular portion of the bank, the camera looks at the person and records, actually taking photographs, and the recording goes to whoever analyzes the video. Naturally, earlier there was a human being sitting next to the monitor to see whether anyone is carrying such a thing, and he would make his own judgment. But it is difficult to expect a human being to be really reliable at judging this at each and every minute of the day: probably for the first few minutes, or the first 20 or 30 minutes, or the first one or two hours, he would be really alert; afterwards, watching the same thing again and again, he would not be alert. And why blame others? Even if you or I sat next to the monitor and started observing the same thing, we would not stay alert; this is a property of all of us human beings. So it is better that a machine does this analysis and comes out with the judgment of whether someone is carrying one of these objects or not.
Now, in making this judgment there are again two kinds of error involved. One: someone is actually carrying such an object and the machine says that he is not carrying it; that is one error. The other: someone is not carrying it and the machine says that he is carrying it. Let me repeat: the machine says this man is carrying it, but actually he is not; that is one error. The machine says he is not carrying it, but actually the man is; that is the other error. Now, which error would you not like to have, and which error do you not mind? The answer is: if you are the manager of the bank, you would say, well, I do not mind a few false alarms; if someone is not carrying any such dangerous object but the machine says he is, I do not mind. But on the other hand, if someone is actually carrying one and the machine says that he is not, then I do mind; I do not want this sort of error to happen.
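One standard way to encode the manager's preference, where the two errors carry different costs, is a minimum-expected-cost variant of the Bayes rule (a textbook extension; the lecture has not introduced it formally at this point). Everything below, the 1-D "suspicion score", its densities, the priors, and the cost figures, is an illustrative assumption of mine.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at the point x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def min_risk_decide(x, priors, densities, cost):
    """Minimum expected cost rule: cost[i][j] is the cost of deciding
    class i when the true class is j; pick the decision whose
    expected cost given x is smallest."""
    post = [P * p(x) for P, p in zip(priors, densities)]   # unnormalized posteriors
    risk = [sum(cost[i][j] * post[j] for j in range(len(post)))
            for i in range(len(post))]
    return min(range(len(risk)), key=risk.__getitem__)

# Hypothetical setup for the bank example (all numbers are assumptions):
priors = [0.01, 0.99]                                  # carrying a weapon is rare
densities = [lambda s: gaussian_pdf(s, 8.0, 1.5),      # class 0: carrying
             lambda s: gaussian_pdf(s, 3.0, 1.5)]      # class 1: not carrying
cost = [[0.0, 1.0],      # alarm on an innocent person: cheap false alarm
        [100.0, 0.0]]    # missing an actual weapon: very expensive
```

With these weights, a moderately suspicious score such as 6.0 already triggers the alarm (decision 0), even though the equal-cost Bayes rule would let the person through; that is exactly the manager's asymmetry.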
So it is not necessarily true that the errors have the same weight; depending on the situation, the errors can have different weights. Now, in this slide, please note the last line: the Bayes decision rule minimizes the probability of misclassification. Yes, it minimizes the probability of misclassification, by assuming that the errors have the same weight. I am repeating: it minimizes the probability of misclassification by assuming that the errors have the same weight. If the errors do not have the same weight, then this rule is not applicable. One can prove that if the errors have the same weight, then this rule minimizes the probability of misclassification; that is why this rule is known as the best rule for classification.

Let me give you another example where the errors would not have the same weight. The example is the following. There is one statement: "India is going to attack Pakistan tomorrow." You have to attribute this statement to one of two persons: one person is C. A. Murthy, the other person is Manmohan Singh. Now, if the statement had actually been made by C. A. Murthy and you attribute it to Manmohan Singh, then you can see that the consequences are enormous. On the other hand, if it was really made by Manmohan Singh and you attributed it to C. A. Murthy, well, say you are a journalist; your boss would say, fine, you have made a mistake, it does not matter that much; nothing would happen to you. It is not really as serious an offense as the first one. So here again the weights are not the same. This example is from speech; the earlier example was from images. In any classification you are going to face this sort of question: an
observation belonging to one class, if you put it in another class, then what are the consequences? First, you may look at the consequences only from the point of view of how much misclassification is going to occur, without looking at the weight of each misclassification. The second consideration is the cost of misclassification: if a misclassification actually happens, how much cost are you going to incur? If the costs are the same, then the weights are the same; if the costs are different, then the weights are different. The cost of misclassification is the thing that, in many situations, actually makes you design your own decision rule for the problem, because the cost of misclassification may vary: you might like to have one set of weights for the misclassifications, while some other person, for the same problem, may have a different set of weights. It depends.

You might know that cancer is a disease which is not easily detectable in the early stages, even now. Say you are a doctor and a patient comes to you; you have to say whether this person has cancer or not. You have a problem. If the patient actually has cancer and you say the patient does not, because the knowledge is not completely available to you, that creates one problem. If the patient does not have cancer and you say the patient does, you are going to give medicines accordingly, and that will create several complications for the patient. On the other hand, if the patient has cancer and you say she does not, that creates another difficulty. Then what are you going to do as a doctor? Because it is not quite clear to you, probably the best option is to directly admit that you are not in a position to judge whether the patient has cancer or not; otherwise,
if you make a wrong judgment, nowadays people can file cases in courts; probably you are aware of this. So the doctor is facing a classification problem: whether the patient has cancer or not. You will see very many classification problems in practically every minute and every second of our lives. When you walk from here to downstairs, how do you decide to take this step at this place, or at that place? It starts from there: every minute and every second we are making decisions. On going out, what path are you going to choose? There is a lot of space; why do you take one such path and not another? If I ask you, do you have an answer? Look at the door there and this place here: do you have an answer to that? Note that you really do not have an answer, because you are not actually optimizing anything; whichever path is reasonable, you are just following it. So for the same problem you have different solutions at different points of time. And every such problem is a pattern recognition problem: though you may not have a unique solution, there are several solutions, and you can take any one of them. I suppose I need to stop here.