So, I have already given you an introduction to pattern recognition, where I mentioned its uses. Before we actually enter the subject, we need to know a little preliminary mathematics, so that we can use it while developing theorems and other results for pattern recognition. The basic mathematics involves some amount of matrix algebra and the basics of probability theory and statistics, and I also assume some knowledge of calculus. In particular, I assume that all of you know what a probability density function is and what a Gaussian distribution is.

Let us just see what a probability density function is. It may be defined over n-dimensional Euclidean space, which I represent by Rn. A point x in Rn is a vector, a column vector, with n components; the underline you will see in all these places represents a column vector. A probability density function p defined over Rn has the following two properties: p(x) ≥ 0 for all x belonging to Rn, and the integral of p(x) dx over the whole of Rn is equal to 1. Here dx means dx1 dx2 ... dxn, where x1, ..., xn are the components of x, and is just represented as dx. Any function satisfying these two properties is known as a probability density function over n-dimensional Euclidean space. In the course of these lectures we are not going into complex spaces; basically we stay with real spaces, so our probability density functions look like this.

Now, there are many such functions. One of them is the density function of the Gaussian distribution, also known as the normal density function. Its definition is

p(x) = 1 / ( (2π)^(n/2) |Σ|^(1/2) ) · exp( −(1/2) (x − μ)′ Σ⁻¹ (x − μ) ),

where the prime denotes transpose and |Σ| is the determinant of Σ. Since you have a background in matrix algebra, you know that the determinant of a matrix need not always be positive; it can be 0 or it can also be negative. But if |Σ| were negative, you could not write |Σ|^(1/2), because that becomes a complex number, and our definition of a probability density function — not only ours, the standard one — says the density must be greater than or equal to 0. Even if |Σ| were 0, the whole expression would not be defined, because |Σ|^(1/2) appears in the denominator. So necessarily the determinant of Σ must be strictly greater than 0. The exponential means e raised to the power of the quantity inside. Here x is an n-dimensional vector, and μ is what is known as the mean vector; it represents the mean, the average, of the distribution, and it is also an n-dimensional column vector. When I write the transpose, (x − μ)′ becomes a row vector, that is, a 1 × n matrix, while (x − μ) itself is n × 1. Now, what is this Σ? There are several names for it; it is
known as the variance-covariance matrix, or the dispersion matrix. Σ is an n × n matrix, so Σ⁻¹ is also n × n, and the whole quadratic form (x − μ)′ Σ⁻¹ (x − μ) is (1 × n)(n × n)(n × 1), that is, 1 × 1 — a scalar. And e raised to any real quantity is positive, so the exponential factor causes no trouble; as I have already mentioned, it is the determinant of Σ that has to be strictly greater than 0. Let me come back to how one ensures that.

First, do you all understand the meaning of the dispersion matrix, the variance-covariance matrix? Let me explain. For this you need to know the meaning of variance, and you need to know the meaning of covariance. If you have n observations x1, x2, ..., xn, which are points in R, that is, n values on the real line, then the mean of these n values is

x̄ = (1/n) Σᵢ₌₁ⁿ xi,

and the variance of these n observations is

(1/n) Σᵢ₌₁ⁿ (xi − x̄)².

That is, from every point you subtract the mean and take the square; you do this for all n points and take the average of the squares. In some books you will find another expression, with 1/(n−1) in place of 1/n. That version is what is known as the unbiased estimate of the variance of the population. This is a slightly complicated concept in statistics, so let us not go into the specifics of n versus n−1. Writing 1/n or 1/(n−1) basically means looking at the concept of variance in two slightly different ways; there will be slight differences in the actual values, but if you just follow one of them consistently it is fine. Advanced students of statistics will use 1/(n−1), the unbiased estimate of the population variance; I am going to follow 1/n, which is what is generally taught at the preliminary level.

So much for variance. Now there is another concept, covariance. In order to explain the meaning of the word covariance, we need two variables; let us write them as x and y. It is something like this: x is, say, height and y is, say, weight, measured on the same individual — the height, say, in centimeters and the weight, say, in kilograms — for n individuals. Then you are going to get observations (x1, y1), (x2, y2), ..., (xn, yn). Now plot these n points. These plots can look very many ways; I am going to give examples of three such plots. In the first plot, as the x values increase, the y values also more or less increase, and we would like to denote this relationship by some quantity greater than 0. In the second plot, as x increases, y decreases; this is a negative relationship, and we would like to represent it by the same quantity, but less than 0. In the third plot, whatever the value of x may be, y stays more or less in the same range, and in this case we would like the quantity to come out very close to 0. So we would like to define a quantity in such a way that it takes a positive value in the first case, a negative value in the second, and something very close to 0 in the third. What is that quantity? Let us see.

For each data set, find the average x̄ of the xi's and the average ȳ of the yi's; that point (x̄, ȳ) will lie somewhere in the middle of the cloud. Now shift the axes so that they pass through (x̄, ȳ). Take a point and look at its coordinates with respect to the new axes: the new x value is xi − x̄ and the new y value is yi − ȳ. Multiply these new x and y values and note which quadrant, according to the new axes, the point falls in. In the first quadrant, both xi − x̄ and yi − ȳ are greater than 0, so the product is greater than 0. In the third quadrant, both are less than 0, so the product is again greater than 0. In the second and fourth quadrants, the product is less than 0. In the first plot, the points where the product is less than 0 are few, whereas the points where the product is greater than 0 are many, and their values are also large. So if you form

(1/n) Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ)

— which is just adding up these products and averaging; again you can use 1/n or 1/(n−1), and I am taking 1/n — this will be greater than 0 in the first case. Now what happens in the second plot? There the points are mostly in the second and fourth quadrants.
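A rough sketch in code of the quantity being built up here — subtract the means, multiply the deviations, average the products — may help. The data below are made up to illustrate the increasing case, and I use the 1/n convention as in the lecture:

```python
# A sketch of the covariance computation described above, with the
# 1/n convention.  The data are made up: y more or less increases
# with x, as in the first plot.

def covariance(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n          # mean of the x values
    y_bar = sum(ys) / n          # mean of the y values
    # average of the products of deviations from the means
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]   # increasing with x

c = covariance(xs, ys)
assert c > 0                     # positive relationship, as argued above
print(c)                         # roughly 1.94
```

Feeding in y values that decrease with x would give a negative value, and scattered y values would give something close to 0, matching the three plots.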
So the products are mostly less than 0, and this quantity will be less than 0. And what happens in the third plot? There the points are more or less equally distributed over all the quadrants, so the summation is likely to be very close to 0. This quantity is known as the covariance between x and y, and it is also written cov(x, y).

If you have two variables, you have one covariance term. I said that we have points in n-dimensional Euclidean space, which means basically that we have n variables, which in pattern recognition language we call features. If you take pairs of the n features, how many pairs do you have? You have nC2 pairs. By the way, what is the covariance of x with x itself? It is (1/n) Σᵢ₌₁ⁿ (xi − x̄)(xi − x̄) = (1/n) Σᵢ₌₁ⁿ (xi − x̄)², which is exactly the variance. So the covariance of x with itself is the variance.

Now, if we are in n-dimensional space, that is, the number of features is n, and the features are x1, x2, ..., xn, then the variance-covariance matrix of these n features, denoted by Σ, is defined as the n × n matrix — n rows and n columns — whose entries are the covariances. The first entry is cov(x1, x1), which is nothing but the variance of x1; then cov(x1, x2), cov(x1, x3), up to cov(x1, xn) along the first row; the second row begins with cov(x2, x1), then cov(x2, x2), and so on. If you look at the definition of covariance, whether you write cov(x1, x2) or cov(x2, x1), they are the same thing; cov(x1, x3) is the same as cov(x3, x1). So this matrix is symmetric. It is a real, symmetric matrix, and it is also what is known as positive definite — more precisely, it can be shown to be always non-negative definite, and in most applications it is positive definite. Let me explain the meaning of positive definite.

Probably all of you know the meaning of the word distance, and probably all of you also know the Euclidean distance. Suppose you have two vectors, x = (x1, x2, ..., xn)′ and y = (y1, y2, ..., yn)′. Then the distance between these two vectors is

d(x, y) = sqrt( Σᵢ₌₁ⁿ (xi − yi)² ).

This is the Euclidean distance, which all of us know. Now let me ask you a question. Let us assume that n = 2, so we only have vectors in R², two-dimensional space, and let the first component represent height and the second represent weight. Say for one person the height is 160 cm and the weight is 70 kg, and for another person the height is 158 cm and the weight is 73 kg. Now you want to measure the distance between these two. Would you like to apply this formula? If you want to apply this formula, please note that you are going to have a difficulty. What is the difficulty? For i = 1 you take the difference of the centimeter values, 160 − 158 = 2, so you get 2², and the next term is 3², giving sqrt(2² + 3²) = sqrt(13). But say I measure height in centimeters and my friend wants to measure height in millimeters. Then what is going to happen? The heights become 1600 and 1580, the difference is 20, and the first term becomes 20² = 400, so the sum is 400 + 9 = 409 — whereas the two human beings are the same. The distance should not change just because I change the units. So the Euclidean distance is not always useful. Are you following?

Then how does one measure the distance? Let us see. Let me write the squared distance as d²(x, y), just removing the square root. I can write it as a matrix product: the row vector (x1 − y1, x2 − y2), then the 2 × 2 identity matrix [[1, 0], [0, 1]], then the column vector (x1 − y1, x2 − y2)′. Do you see that this product is exactly Σᵢ₌₁² (xi − yi)²? Multiplication by the identity matrix does not change anything. Now, I said that the value should not change just because I change the units. So a slight generalization of the distance is this: instead of the identity matrix, we write a diagonal matrix of weights, [[w1, 0], [0, w2]], so that d² = Σ wi (xi − yi)². The wi depend on the units: if the units change, the wi values change, so that the whole thing remains the same. This is a small change in the definition of distance, and you may have something more: instead of a diagonal matrix, take a symmetric matrix [[w11, w12], [w12, w22]] — note that I wrote the same w12 in both off-diagonal positions. This is a much better generalization of the previous one.

Now I want this quantity to give us a distance, so by definition it should be greater than or equal to 0. My question to you is: for what values of the weights will it be greater than or equal to 0? In the diagonal case, if w1 is strictly greater than 0 and w2 is strictly greater than 0, then whatever the values of x and y, the whole thing is greater than or equal to 0, because it is nothing but Σ wi (xi − yi)², a sum of non-negative weights times squares. But for a general symmetric matrix [[w11, w12], [w12, w22]], for which such matrices will this quantity be greater than or equal to 0? In fact, I will say more: it should be strictly greater than 0 whenever x1 ≠ y1 or x2 ≠ y2. Take any x1, y1, x2, y2; if at least one of these two conditions holds, the quantity should be strictly greater than 0 — if both hold, there is no problem either. You can check this at home, and in fact you are going to see many such matrices.

Now I am going to write the definition of a positive definite matrix. An n × n matrix A is said to be positive definite if a′Aa > 0 for all vectors a ≠ 0, where 0 is the zero vector with n rows and one column. Note that a is n × 1, so a′ is 1 × n, A is n × n, and Aa is n × 1; hence a′Aa is a scalar, and I can sensibly write that it is greater than, equal to, or less than 0. There are many such matrices. And there is another, related definition: A is said to be positive semi-definite — in some books this is also written as non-negative definite — if a′Aa ≥ 0 for all a. In matrix algebra you would basically find these definitions, but generally we may not know why they are necessary. One way of looking at them is from the point of view of distances: we wanted the distance to be strictly greater than 0 whenever the two points differ, and equal to 0 when x1 = y1 and x2 = y2 — if two quantities are the same, there is no distance; if there is a difference, there is a distance. That is what we want to incorporate in the definition, and that is why it is written like this: such a matrix is said to be positive definite, whereas if you include equality, it is non-negative definite.
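The definition just given can be checked numerically. Here is a small sketch: the symmetric matrix W below, standing in for the weight matrix [[w11, w12], [w12, w22]] discussed above, is a made-up example, and I use the standard fact that for a real symmetric matrix, a′Wa > 0 for every nonzero a exactly when every eigenvalue of W is strictly positive.

```python
# Checking positive definiteness: a'Wa > 0 for nonzero a.
# W is a made-up symmetric 2x2 example of the weight matrix above.
import numpy as np

W = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # symmetric: w12 appears twice

def quadratic_form(W, a):
    return float(a @ W @ a)          # a'Wa: (1xn)(nxn)(nx1) -> a scalar

# the quadratic form is positive for a few nonzero test vectors ...
for a in ([1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [-3.0, 2.0]):
    assert quadratic_form(W, np.array(a)) > 0

# ... and indeed every eigenvalue of W is strictly greater than 0,
# which for a symmetric matrix is equivalent to positive definiteness
eigenvalues = np.linalg.eigvalsh(W)
assert np.all(eigenvalues > 0)
```

With such a W, the quantity (x − y)′ W (x − y) is strictly positive whenever x ≠ y and zero when x = y, which is exactly the distance property we wanted.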
Now, the variance-covariance matrix can be shown to be non-negative definite; in fact, most of the time it is positive definite, and in the case of the normal distribution we assume that the variance-covariance matrix is positive definite. That is why we can write |Σ|^(1/2) in the denominator. If the variance-covariance matrix is positive definite, some properties follow automatically. What are they? Probably you are aware that the determinant of a matrix is the product of its eigenvalues. Now, if Σ is a variance-covariance matrix, then because it is non-negative definite, all its eigenvalues are greater than or equal to 0; and if it is positive definite, then every eigenvalue is strictly greater than 0, so that the product — the determinant — is also strictly greater than 0, and you can take the square root |Σ|^(1/2). So: the variance-covariance matrix being positive definite implies that every eigenvalue of the matrix is strictly greater than 0. This is one property that has been used extensively in the literature on pattern recognition. Shall we stop here?
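The chain of properties we ended with — Σ positive definite, hence all eigenvalues positive, hence |Σ| > 0, hence |Σ|^(1/2) well defined in the normal density — can be sketched as follows. The μ and Σ below are made-up example values:

```python
# Sketch of the n-dimensional normal density from the lecture,
#   p(x) = (2*pi)^(-n/2) * det(Sigma)^(-1/2)
#          * exp(-0.5 * (x - mu)' Sigma^{-1} (x - mu)),
# together with the fact that det(Sigma) equals the product of the
# eigenvalues.  mu and Sigma are made-up example values.
import numpy as np

def normal_density(x, mu, sigma):
    n = len(mu)
    det = np.linalg.det(sigma)
    assert det > 0                   # needed for the square root
    diff = x - mu
    exponent = -0.5 * diff @ np.linalg.inv(sigma) @ diff   # a scalar
    return float(np.exp(exponent) / np.sqrt((2 * np.pi) ** n * det))

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.5],        # symmetric, positive definite
                  [0.5, 1.0]])

# determinant = product of eigenvalues, and every eigenvalue is > 0
eig = np.linalg.eigvalsh(sigma)
assert np.all(eig > 0)
assert np.isclose(np.prod(eig), np.linalg.det(sigma))

p_at_mean = normal_density(mu, mu, sigma)
assert p_at_mean > 0
```

The density is largest at x = μ and falls off as the quadratic form (x − μ)′ Σ⁻¹ (x − μ) grows, since the exponent becomes more negative.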