Welcome back to the lecture on Normal Distribution and Decision Boundaries, under the course titled Pattern Recognition. In the last class you heard about the Bayes decision rule, which forms the basis of the Bayes classifier, and about the three components in the Bayes expression when you compute the posterior probability for a particular sample: the class-conditional likelihood, the class prior, and the evidence. Now, when you actually want to use the Bayes criterion to perform a pattern classification task, you need to compute the probability densities of the samples for a particular class, and for that you need certain models. One such commonly used model, not only in the field of pattern recognition but in many other scientific and engineering disciplines, is the normal distribution. So we will start with details on normal distributions, and see later how, using the normal distribution and the Bayes decision criterion, we arrive at decision boundaries for a particular classification task. So let us look into the slide and find out what a normal distribution is. The normal or Gaussian distribution is a very commonly occurring function in probability theory, but it also has very wide application in many other fields. Examples include pattern recognition and machine learning, which we are concerned with in this course, artificial neural networks, soft computing, digital signal processing, and other fields such as vibrations, graphics, and any sort of modeling you may need. The normal distribution is a very commonly used function for modeling a distribution. So we will first see the formula of the normal distribution, then certain properties of the distribution, and then proceed towards ways in which this distribution can be used for classification, which involves certain distance measures.
The Gaussian function is also called a bell function or a bell curve, and the formula of the normal distribution is given by this. It is written here as a one-dimensional function, although the distribution also extends to multivariate random variables; we will see that later. For a random variable X you see two terms: the denominator term out in front is basically a normalizing term, and the rest is the exponential expression. You can see two parameters in the distribution. One is mu, which is called the mean or expectation of the distribution; for the normal distribution the mean also coincides with the median and the mode, but we will henceforth use the word mean. The other parameter, sigma, which is actually the more interesting one in the normal distribution, is called the standard deviation, and the square of the standard deviation, sigma squared, is called the variance. So I repeat: sigma is the standard deviation and mu is the mean of that particular function. The function is thus an exponential of this nature divided by a normalizing term. Now we will see some plots of this bell curve which will show you the significance of these two parameters. Of course the mean mu is very simple to interpret; it is like an average value. Whatever the average value of a set of random numbers represented by X, the mean will represent that. How does sigma, the standard deviation, control the shape of the bell function? Let us look at this example in the slide. The reference for this particular plot is Wikipedia, which is cited below, so you can also have a look at it. Observe that there are actually 4 different curves in 4 different colors, and 3 of them have the same mean value of 0.
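Since the formula itself lives on the slide, here is a minimal sketch in Python (my own illustration, not part of the lecture; the name `normal_pdf` is just a convenient label) of the one-dimensional normal density with parameters mu and sigma:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """1-D normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The peak sits at x = mu; doubling sigma halves the peak height.
print(normal_pdf(0.0))             # standard normal peak, about 0.3989
print(normal_pdf(0.0, sigma=2.0))  # half that height, twice the spread
```

Evaluating this function over a grid of x values reproduces the bell curves shown on the slide.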
So the mean value 0 is the point where the 3 zero-mean curves are centered. One of the curves has a nonzero mean, minus 2, as you see in the legend at the top right, and that is the magenta curve. That is how the mean value mu positions the curve. What is the effect of sigma? Well, the 4 curves have 4 different values of sigma squared. Among the first 3 curves, whose mean is 0, the yellowish-gray curve has the smallest variance, 0.2; the standard deviation is the square root of that. What about the one in cyan? Its variance, and hence its standard deviation, is equal to 1; that is the curve we also call the standard normal, when sigma is equal to 1. Then you have the one in blue, where the variance is 5, which is the widest curve. You also have the nonzero-mean Gaussian, with mean equal to minus 2, whose variance is 0.5, as you see here. If you compare the 3 curves centered at 0, you can see that if the variance is large the function has a larger spread or width, a larger set of appreciably nonzero values. The smaller the variance, the more peaky or sharp the function is, and the smaller its extent or span. So what we have just learned is this: if you increase the variance or standard deviation of the Gaussian function, the bell will have a larger span or width; if you take a small value of variance or standard deviation, you get a much sharper, peakier shape. If you take a very small value of sigma, the standard deviation, you tend towards an impulse function of very narrow width. For larger values you get a larger width but a smaller height. What is the normalizing term doing in the expression? Go back to the slide.
Look at the normalizing term. The value of the Gaussian function p of x equals this normalizing term exactly when the exponential factor is equal to 1. Under only one condition is the exponential equal to 1. When? Look back at the expression and think: the exponent is zero, and the exponential is 1, when the value of x equals the mean. So you can see from the curves that when x reaches the mean, the value of p of x is at its maximum, and that maximum is dictated by the term out in front: 1 over the denominator containing sigma. That means for a larger value of sigma, which gives a larger width, you have a smaller height; for a smaller value of sigma, with a smaller span, you have a very large peak, like an impulse. What is actually happening is that the area under the curve is always the same: the normalizing term ensures that the area under the curve is equal to 1. You could have a Gaussian-shaped function without the normalizing term; its nature would be the same, but the peak value would always be 1, it would not have the property of unit area under the curve, and hence the peak would not change with the variance; only the width would. We will keep looking at a few more examples as we go along. Now, here is a nice empirical rule, sometimes casually called the 68-95-99.7 rule (some books write 99.7, some 99.8), which a normal density curve, or Gaussian function, satisfies. We will interchangeably use the terms normal density and Gaussian function in this lecture. What does the rule say? There are 3 parts to it.
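The claim that the normalizing term pins the area at 1 for every sigma, while the peak height 1 over (sigma root 2 pi) changes, can be checked numerically. This is a small sketch of mine using trapezoidal integration, not something from the lecture slides:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(mu, sigma, lo, hi, n=20000):
    """Trapezoidal approximation of the integral of the density over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    total += sum(normal_pdf(lo + i * h, mu, sigma) for i in range(1, n))
    return total * h

# Whatever sigma is, the total area stays (numerically) at 1;
# only the peak height 1/(sigma * sqrt(2*pi)) changes.
for sigma in (0.5, 1.0, 5.0):
    a = area_under_curve(0.0, sigma, -10 * sigma, 10 * sigma)
    print(f"sigma = {sigma}: area = {a:.6f}, peak = {normal_pdf(0.0, 0.0, sigma):.4f}")
```

Integrating out to 10 sigma on each side captures the tails to well beyond machine precision, so the printed areas all come out at 1 to several decimal places.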
It says that 68% of the observations, that is, of the samples generating the density, fall within one standard deviation of the mean, between mu minus sigma and mu plus sigma. Look at the part of the Gaussian colored in yellow, from minus 1 to plus 1, since the mean here is 0 and sigma is 1. You will also find such figures on certain websites if you look up the empirical rule for the Gaussian function. Let us look again: if you sum up the area of this yellow region from mu minus sigma to mu plus sigma, it will be 68% of the total area; that is, 68% of the total area of the normalized Gaussian curve, which is equal to 1. This, by the way, is a Gaussian with sigma equal to 1; I will come to how we can tell in a moment, with another property. Did we have a sigma equal to 1 in the previous slide? Yes, this same curve, the blue one; it has been redrawn here, with the area under the curve between mu minus sigma and mu plus sigma shaded in yellow, and it is 0.68, meaning 68% of the total area of 1. What about something more? If you start from mu minus 2 sigma, and since sigma is equal to 1 we are starting from minus 2 here, and go to mu plus 2 sigma, that is, the range from minus 2 to plus 2 which you see here, you have 95% of the area covered. And if you go one step further, from mu minus 3 sigma to mu plus 3 sigma, you have covered almost the entire area under the curve; a negligible part, just 0.3%, is left beyond this. So 99.7% of the observations fall between mu minus 3 sigma and mu plus 3 sigma. Whatever the value of sigma, this holds good.
The curves which you have seen are for sigma, the standard deviation, equal to 1 only, but this rule is valid for any sigma. Let us look back at the curves again; I repeat: mu plus or minus sigma, to say it in a very casual, simple manner, covers about 68 or 70% of the area; mu plus or minus 2 sigma about 95%; and mu plus or minus 3 sigma almost 100%. That means if you take observations from mu minus 3 sigma to mu plus 3 sigma, you are capturing almost all the samples; much less than 1% lies outside that range. That is why this range of 6 sigma, or in some books even 7 sigma, exists as an empirical rule: you take the interval of plus or minus 3 sigma, or plus or minus 3.5 sigma, and in some books you will find that this contains almost all the energy, information, or observations which you are trying to model with the Gaussian function. Here is another curve from another source, another document. The expression is given at the bottom, and you can see the values 68, 95, and 99.7 (or 99.8) for plus or minus 3 sigma. Again the mean is taken to be 0; you can look at the range plus or minus 1 sigma, then plus or minus 2 sigma, and then minus 3 sigma to 3 sigma, which covers essentially the entire range. Can you guess why the peak value is marked as close to 0.4? It is 0.399, but you can safely approximate it to 0.4. What could be the basis of that? It is not a difficult answer; think for a few moments based on the last 5 minutes of discussion and make a guess: what is that value of 0.4 representing? Of course it is the peak of the Gaussian, but what calculation gives it? That is the simple question. It is 1 over root 2 pi; that is correct, because sigma is equal to 1. I did mention that this is the plot for standard deviation equal to 1.
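The 68-95-99.7 percentages can be computed directly rather than read off a figure: the probability mass within k standard deviations of the mean of a normal density is erf(k / root 2). A short check of mine, using only the standard library:

```python
import math

# Fraction of a normal density's mass within k standard deviations of the mean:
# P(mu - k*sigma <= X <= mu + k*sigma) = erf(k / sqrt(2)), for any mu and sigma.
for k in (1, 2, 3):
    frac = math.erf(k / math.sqrt(2))
    print(f"within {k} sigma: {100 * frac:.1f}%")
```

This prints approximately 68.3%, 95.4%, and 99.7%, confirming the empirical rule independently of the particular sigma used in the slide.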
So I leave it as an exercise for you to use your calculator to check the value of 1 over root 2 pi; you will get a value very close to 0.4, about 0.3989. Good. So what have we just observed? For a normal distribution, almost all the values lie within 3 standard deviations of the mean. A couple more statements before we move on to the properties of the Gaussian distribution. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. If the mean is equal to 0 and the standard deviation (and hence the variance) is equal to 1, the distribution is called the standard normal distribution or unit normal distribution. This is just terminology: if somebody asks what the unit normal or standard normal distribution is, it is a Gaussian function with mean 0 and sigma equal to 1, that is all, and you can use the expression given in the last few slides to compute the curve. You will now see an animation in the slide which shows the Gaussian function plotted with an increasing number of points on the curve. These are not actual observations drawn from the Gaussian; we will come to that in a moment, because this density function is basically a probability density. The point of plotting more samples is important to observe, although you may not appreciate it right now: in many applications of pattern recognition and signal processing you may have very few observations from which to calculate. Statistically, whether you are computing a probability density function or a histogram, the larger the number of points used to make the observation, and the larger the number of discrete sample values you get from the density function, the smoother the result will be.
You can see that at n equal to 5, or even n equal to 8, the Gaussian function does not show up in the smooth way it is expected to. This is just a nice animation to show you that in most examples of probability and statistics we expect the number of samples to be large; the larger, the better. Let us look at some more important properties. One property we have already seen is the effect of the standard deviation: a larger value gives a wider curve with a smaller height, and a smaller value gives a smaller spatial extent with a higher peak. But here are more interesting properties. The mean is mu and the standard deviation sigma is positive. If you go back to the expression, it is mathematically possible to plug in a negative value of sigma; it does not alter the value of the term inside the exponential, because sigma appears there squared, but the density would become negative. So it is a meaningless effort to choose a negative standard deviation: a deviation from a value has to be positive, density functions must be non-negative, and we will soon use them to build distance functions, which also must be positive. So do not worry about negative sigma; we assume sigma is positive. First of all, the function is symmetric around the mean, as we have seen from the curves, and it is unimodal. Its first derivative is positive when x is less than the mean, negative when x is more than the mean, and exactly 0 only at the mean. So the first derivative has a nice property, and connected to it, the second derivative has a very interesting property: the Gaussian function has two inflection points, located one standard deviation away from the mean.
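The point about sample size and smoothness can be demonstrated without the animation. The sketch below (my own, with illustrative parameter choices) builds a density histogram from n standard-normal draws and measures its worst deviation from the true bell curve; a small n gives a ragged histogram, a large n a smooth one:

```python
import math
import random

random.seed(0)

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def max_histogram_error(n, bins=20, lo=-3.0, hi=3.0):
    """Draw n standard-normal samples, build a density histogram, and report
    the worst gap between the histogram and the true density."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        if lo <= x < hi:
            counts[int((x - lo) / width)] += 1
    worst = 0.0
    for i, c in enumerate(counts):
        center = lo + (i + 0.5) * width
        worst = max(worst, abs(c / (n * width) - normal_pdf(center)))
    return worst

print(max_histogram_error(50))     # few samples: large gap from the bell curve
print(max_histogram_error(50000))  # many samples: much closer to the smooth curve
```

The exact numbers depend on the random seed, but the error for 50,000 samples comes out far below the error for 50, which is precisely the "larger, the better" point.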
So at mu minus sigma and mu plus sigma you get the two inflection points. Maybe in some future analysis I will show you the derivatives of the Gaussian, but I leave it as an exercise for you to check the derivatives of the Gaussian function, first, second, and even higher orders, but at least the first and second. They have a good deal of application in many analyses in pattern recognition as well as image processing; for example, people detect edges with the help of derivatives of the Gaussian function, which become edge operators. So the property to understand here is that the first derivative is positive for x below the mean and negative for x above it, zero only at the mean, and that there are two inflection points at mu plus or minus sigma. The Gaussian is also log-concave: a function must satisfy certain criteria to be a concave function, and I am not getting into those details here, but you can find them in material on convex and concave functions; if the log of a function is also concave, the function is called log-concave.
So the Gaussian function has that special property, though we may not use all of these properties. The last and most important reason for its popularity is that it is infinitely differentiable, and indeed supersmooth of order 2. Look at the last sentence: infinitely differentiable. We have mostly talked about first and second derivatives, but I did tell you that you can take derivatives of the Gaussian of as high an order as you can think of, and this gives scientists a lot of maneuvering capability in their analytics, if the data can be modeled by a Gaussian. That is the main reason for its popularity. As with log-concavity, a function must satisfy certain inequalities or criteria to be called smooth, and there is a supersmoothness property of a curve of a certain order beta; in this case the order beta is 2, and the Gaussian satisfies it. These last two are important properties, but they probably will not be used to a great extent in our theory. Infinite differentiability is one property; sometimes higher-order derivatives of the Gaussian are used, but the first and second derivatives are the most important. Here is the expression for its first derivative, where we have taken the mean to be 0 and sigma equal to 1. I leave it as an exercise for you to derive the first and second derivatives of the Gaussian when the mean is nonzero and the standard deviation is not equal to 1; you should be able to write them in terms of the Gaussian function itself, as we have done here. So in this case the mean is 0 and sigma is equal to
1; if they are not, those parameters will appear in the expression. Now look at the second derivative of the function; you can check this yourself from the expression here. We know p of x is always positive, so the polynomial factor in the second derivative is what vanishes, and it is equal to 0 at only 2 values of x. What are those 2 values? If you look back at the expression, this factor is zero at x equal to plus and minus sigma, which here, since sigma equals 1, means x equal to plus and minus 1. Let us go back to the previous slide: remember, we were just talking about inflection points. This is where the second derivative is 0 and changes sign; it is sometimes also called a zero-crossing point, and it occurs at plus and minus sigma. So you can almost blindly replace the factor x squared minus 1 here by x squared minus sigma squared in the general case, is it not? More generally, the nth derivative is given by this particular expression, where H of n is the Hermite polynomial of order n. I am not getting into details, but the Hermite polynomial of order n is a very common function used in many branches of science and engineering, specifically curve fitting and interpolation, for example in computer graphics. So, carrying on the discussion of the Gaussian or normal density function: the expression of the one-dimensional Gaussian density has been given; let us now look at the density function in two dimensions, what is called the bivariate normal density function. You have two variables now: instead of only x you have x and y; instead of just one standard deviation you have sigma x and sigma y, the standard deviations along x and y respectively; and you have the corresponding means as well, mu x, the mean of the x component, and mu y, which
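The inflection-point property just described can be verified numerically. This is a sketch of mine (not from the slides) that estimates the second derivative by central finite differences and checks where it changes sign:

```python
import math

def p(x, sigma=1.0):
    """Zero-mean normal density."""
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def second_derivative(x, sigma=1.0, h=1e-4):
    """Central finite-difference estimate of p''(x)."""
    return (p(x + h, sigma) - 2 * p(x, sigma) + p(x - h, sigma)) / (h * h)

# Closed form: p''(x) = ((x**2 - sigma**2) / sigma**4) * p(x), so p'' vanishes
# exactly at x = +sigma and x = -sigma -- the two inflection points.
for x in (0.0, 0.5, 1.0, 1.5):
    d2 = second_derivative(x)
    shape = "inflection" if abs(d2) < 1e-6 else ("concave" if d2 < 0 else "convex")
    print(f"x = {x}: {shape}")  # x = 1.0 (= sigma) shows up as the inflection point
```

Inside the interval (minus sigma, plus sigma) the curve is concave (the bell's cap); outside it is convex (the tails); the sign change happens exactly one standard deviation from the mean.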
is the mean of the y component. In addition you will also have a correlation coefficient. So mu stands for mean, with subscripts indicating the direction or component; sigma stands for standard deviation, with its corresponding subscripts; and rho x,y is the correlation coefficient between the two variables x and y. There is a relationship between the correlation coefficient, the corresponding standard deviations, and the joint covariance term between the two variables x and y; we will look at that now. The covariance of x and y is defined as the expectation of the product of the two variables with their corresponding means subtracted; that is, x minus mu x represents the values of x centered around 0, and the same with respect to y. The covariance indicates how much x and y vary together; we will see that with some examples after some time. The value of the covariance depends on how much each variable tends to deviate from its mean, and also on the degree of association between x and y. What type of relationship may x and y have? Well, if the value of x rises, will the value of y also rise, or will it fall, or will it remain constant and not change? There are several possibilities, and if y rises or falls, does it do so steeply or gradually? These conditions, along with the rate of change, determine the value measuring the degree of association, which is the covariance, and it has a very strong relationship with the correlation coefficient we saw inside the bivariate normal density function. Look back at the slide: the covariance indicates how much x and y vary together; if you remember this, it is quite sufficient for the time being. It also depends on the degree of association between x and y. The
correlation coefficient between x and y, as a function of the covariance term sigma x y, is given by this expression, and in that case you can rewrite the covariance term using it. It is something like a normalized covariance: sigma x and sigma y appear in the denominator, and the correlation coefficient is a scalar quantity whose value always lies between minus 1 and plus 1, indicating the degree of relationship or association between x and y. This is an important formula to remember: you can rewrite the covariance as the correlation coefficient multiplied by the individual standard deviations. I repeat: the covariance of x and y is the correlation coefficient multiplied by the individual standard deviations of x and y. In some books you will find the expression for the correlation coefficient written using the expectations of x and y separately, or using a joint expectation in this particular form; you can use any of these expressions. The correlation coefficient is quite important because we would like to know how two variables depend on each other. In our case, in the field of pattern recognition, remember that each of these variables is a feature dimension, so it is something like asking how one feature is related to another: are they jointly related, are they heavily correlated or not? The correlation coefficient and the covariance terms hold that information. You may need to estimate these parameters, the correlation coefficient or covariance, from the data itself, and that is why we are looking at this particular formula. Again looking back at the slide: what does the correlation coefficient tell us? It is basically the cosine of the angle between two vectors in d-dimensional space, and you can take d to be 2 when you are talking about only x and y of the samples drawn
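Estimating these quantities from data, as the lecture says you may need to, takes only a few lines. A minimal sketch of mine (the data values are made up for illustration) computing the sample covariance and the correlation coefficient rho = sigma_xy / (sigma_x * sigma_y):

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(x, y):
    """Sample covariance: average product of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def correlation(x, y):
    """rho_xy = sigma_xy / (sigma_x * sigma_y); always lies in [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly y = 2x, so rho should be close to +1
print(round(correlation(x, y), 3))
```

Note the identity from the slide falls out of the definition: multiplying the correlation back by the two standard deviations recovers the covariance.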
from two random variables. So if you have two random variables x and y, we are talking of just two vectors in two-dimensional space, and of course the data must always be normalized, or centered as it is called: shifted by the sample means so as to have an average of 0. This must be done for all analyses in classification and pattern recognition tasks. This figure is a geometrical illustration of correlation coefficient values in 2D. There are three rows of sets of points, instances or samples drawn from particular data, and next to each geometrical arrangement the corresponding correlation coefficient is given. Look at the last row as a typical example: the correlation coefficient is 0, which basically means that in some sense there is no relationship between the two variables; x and y are independent, in some sense. In the other cases you have a value of 1 here, which means a strong positive correlation between x and y; a value of 0 again, meaning no correlation; and negative values of correlation, meaning that when x is increasing, y is decreasing. Look from the center: if x is increasing in this direction, y is decreasing, and this should give you an idea why the correlation is negative. The same here: the value is not equal to minus 1, but still you can see that when x increases, y decreases, and vice versa, so there is a negative correlation; and there is a positive correlation where both increase or decrease together. This should give you an idea why you have positive values of the correlation coefficient here on the left-hand side and negative values here, and why the value here is 0. There are other examples of density functions which exist
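The "cosine of the angle" interpretation just mentioned is easy to see in code. In this sketch of mine (illustrative data), the two samples are mean-centered as the lecture requires, and the cosine between the resulting vectors equals the correlation coefficient:

```python
import math

def centered(v):
    """Shift a sample by its mean so it averages to 0."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 8.0, 6.0, 4.0]   # y falls exactly linearly as x rises

# The correlation coefficient is the cosine of the angle between
# the mean-centered sample vectors; here it should be -1.
print(round(cosine(centered(x), centered(y)), 6))
```

Because y is an exact decreasing linear function of x, the centered vectors point in exactly opposite directions and the cosine, hence the correlation, is minus 1, matching the bottom-left cases in the figure.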
in the field of probability theory and statistics, but I explained beforehand why the Gaussian is so popular in pattern recognition, neural networks, signal processing, and other fields of mathematics and engineering: a couple of its important properties are the smoothness of the function, its derivatives existing up to arbitrarily high order, its symmetry, and so on. Many of these nice properties do not exist for other density functions; those can still be used, but you may not have the advantage of the mathematical manipulations and expressions that the Gaussian allows. Let us have a look at a few of them; some are popular, but not to the extent that the Gaussian is used by mathematicians, scientists, and engineers in signal processing, image processing, pattern recognition, and so forth. This is a Poisson distribution: it is essentially one-sided, and there is just one parameter, lambda. This is a binomial distribution; it has two parameters, n and p. Then you have a Cauchy distribution, which looks very similar to the Gaussian, except that there is no exponential in its expression; you can see its nature, and look at the means here: the first three curves are for mean 0 and the last one for mean minus 2, the curve in magenta, almost the same values as what we used in our Gaussian example. So the Cauchy is very close in shape to the Gaussian, but its expression is different. Then there is the Laplace function, which is very peaky in nature. Let us go back: suppose you do not want the smoothness at the top for a particular density function. Of course you could ask: when do we need such a property? Well, let me tell you: in spite of the Gaussian function being one of the most popular and most commonly used
ones, it is possible that real data sometimes does not actually follow the Gaussian distribution, in spite of the expectations of theorists, scientists, and engineers. Gaussians are even used to model noise, but unfortunately noise does not always follow a Gaussian distribution. For any real-life data on which you want to do forecasting, the weather, the stock market, elections, temperature, ocean currents, whatever the case may be, you can use all sorts of models, but real data usually does not follow a very nice distribution, and is sometimes quite far from Gaussian. So it is good to have some other functions available at our disposal. One such case, which we are discussing now, is when you do not want smoothness at the peak of the function; say you want a peaky shape like this. You can use a Laplace function, which has an exponential without a square term, as you can see. It is similar to the Gaussian, but it has its own scale parameter, and the absence of the square term gives it this peakiness at the mean. This is a double-sided density, similar in expression but again without the square term. I would encourage you to also read about the central limit theorem, the uniform distribution, the geometric distribution, and so on, if you have not gone through them. So, after studying examples of the properties of the Gaussian function in 1D and 2D, let us now look at the case when a Gaussian distribution is used to model data in very high dimension. Remember the lectures where we introduced the concepts of pattern recognition, clustering, and classification: we extract a lot of features from the data, and the number of features extracted from a particular signal could be a few tens, to a few hundred, to a thousand in certain cases.
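The Gaussian-versus-Laplace contrast just described, a smooth top against a sharp peak, with heavier tails for the Laplace, shows up immediately if you evaluate both densities. A short sketch of mine (the parameter names mu, sigma, b are the usual conventions, not the slide's notation):

```python
import math

def gaussian(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def laplace(x, mu=0.0, b=1.0):
    """Laplace density: the exponent uses |x - mu|, not its square,
    which produces the sharp kink at the mean."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

# Near the mean the Gaussian is flat on top while the Laplace has a kink;
# far from the mean the Laplace tail decays more slowly (heavier tails).
for x in (0.0, 0.1, 4.0):
    print(f"x = {x}: gaussian = {gaussian(x):.5f}, laplace = {laplace(x):.5f}")
```

At x = 4, four standard deviations out, the Laplace value is many times larger than the Gaussian one, which is why heavy-tailed real data is sometimes better served by non-Gaussian models.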
So you may need to compute density functions for features of very large dimension, not only 1 and 2, and we need an expression for that. Remember we had expressions for the Gaussian distribution in 1D and 2D. I leave it as an exercise for you to write the expression of the Gaussian distribution in 3 dimensions: we had p of x, you had p of x, y; now write p of x, y, z. One simple extension: instead of mu, and then mu x and mu y, you will have mu x, mu y, and mu z, or you can write mu 1, mu 2, mu 3; some books index the dimensions 1, 2, 3 rather than x, y, z. You will have the individual standard deviations or variances: you had sigma, then sigma x and sigma y, and now you will have sigma x, sigma y, sigma z, or sigma 1, sigma 2, sigma 3. You did not have a correlation coefficient in 1D; you had one correlation coefficient in 2D; in 3D how many do you expect? I repeat: try to extrapolate the idea. In 1D you had no correlation coefficient, because you cannot correlate if you have only one-dimensional data; you have to correlate it with something else, correct?
So in two dimensions, x and y, or directions 1 and 2, you had one coefficient, rho x y, or rho 1 2 if you think of the dimensions as directions 1 and 2. Now you have three dimensions, x, y, and z; you should be able to tell me how many correlations can be established. You have to take pairwise combinations; that is what we did in 2D, where only one pair was available. So here there are three of them. Now the question is: there are three correlation coefficients, three individual variances, and three means, so the expression will be a little complicated, and it gets more and more complicated if you go to four or higher dimensions. So it is better to have one expression which can handle a general, very large dimension D, and then see if it specializes to 1D, 2D, and 3D. Let me tell you, it will not be easy to write the expression in 3D directly from 2D, although we have seen what extra terms and parameters are required for three-dimensional data; you can attempt it, and the attempt becomes more and more difficult for higher dimensions. So let us have a nice, compact, closed-form expression in D dimensions and see if you can recover the 2D and 3D cases, which I leave as an exercise. If you look back at the slide, you have the multivariate case: data in D dimensions, with random samples taken along each dimension; each dimension could be a feature, and we have talked a lot about features in our earlier classes. For D dimensions you have a D-dimensional sample vector and a D-dimensional mean vector; you see the notation used to show that these are vectors, and the mean vector collects the individual scalar means along the corresponding directions.
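The pairwise-combination count asked for above, 0 pairs in 1D, 1 in 2D, 3 in 3D, that is, D(D-1)/2 in general, can be enumerated directly. A tiny sketch of mine:

```python
from itertools import combinations

# One correlation coefficient is needed per unordered pair of dimensions,
# i.e. D*(D-1)/2 of them in D dimensions.
for d in (1, 2, 3, 4):
    pairs = list(combinations(range(d), 2))
    print(f"D = {d}: {len(pairs)} correlation coefficient(s) -> {pairs}")
```

The printed counts 0, 1, 3, 6 match D(D-1)/2 and are exactly the distinct off-diagonal entries of the covariance matrix introduced next.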
good so no problem in 1D you will have just mu 1 in 2D you will have these two mu 1 mu 2 or mu x mu y which we saw three dimension you will have mu x mu y mu z or mu 1 mu 2 mu 3 now remains the most important part the variances and the covariances put together it was nicely nicely put we did not have to worry about this in 1D because you just had one mean term and one variance 2D we wrote an expression okay 3D onwards for larger dimensions it becomes a little complicated but there is a closed form expression and to do that we introduce a matrix called the covariance matrix which will have all the terms which you are talking about it is usually denoted by this symbol sigma but in some books you will find S as a symbol also okay look back into the expression first of all you will see that this is a symmetric both these are matrices are same you can either right using the left or right does not matter this is a symmetric matrix as you can see here the individual variances are along the diagonal you can write it in this term as you like or in this whatever way you feel basically it is a product of two individual standard deviations giving you the corresponding variances and you have a set of off diagonal terms which are duplicated because it is symmetric okay so if this is a D cross D matrix can you tell me how many off diagonal terms will be there not difficult if it is a D cross D matrix how many off diagonal terms will you get totally totally let us start with the total off diagonal terms should that is very simple it should be louder you got it square minus D square minus D are the number of off diagonal terms divided by 2 because it is symmetric matrix so it will be D square minus D by 2 D square minus D overall divided by 2 if D is equal to 1 1 dimension case that value will be 0 correct we do not have any term there okay let us try to correlate D equal to 2 if D equal to 2 D square minus D by 2 how much it will be that is just one of the term we had that single 
rho xy just one term so D equal to 3 3 we talked just sometime back 3 3 of those correlation you can see that this is a generic form of a covariance matrix which can handle all these cases for dimension d equal to 1 2 2 a very large dimension okay and this is the form which you must remember so it is a symmetric matrix and the diagonal terms have a certain significance the off diagonal terms have a certain other significance the diagonal terms contain the individual standard deviation or variances along the corresponding directions 1 2 3 and so forth along the directions and the off diagonal terms are simply having as many possible correlations between corresponding any pairs of between pairs of corresponding so that means ijth term in that matrix will be giving you the correlation between or the covariance to be very specific covariance between the ith and jth directions or dimensions and from that you can relate it to the correlation coefficient rho ij or sigma ij whatever you want the i and j values are given here as you can see this is a d cross d matrix here so using that covariance matrix this is the d dimensional normal density that function is given by this expression I was telling you that this is the covariance terms and look you actually are using the inverse of the covariance matrix is given here inverse of the co matrix x and mu are defined in the previous slide we will go back and have a look at it so that is the data that is the mean vector and we already talked about x minus mu that means center shifting it making the mean 0 this is what this will do and of course the normalizing factor here you can either use this expression in the form of a matrix or you can write it in terms of this as well which is summation of certain terms consisting of scalar elements sij of the matrix which is the ijth component of the inverse of the covariance matrix inverse of the covariance matrix so the covariance matrix inverse has to be taken and then it has to be used 
in this expression to compute this is the d dimensional normal density function and I would request you to almost attempt in due course of time in the next few lectures slides to almost memorize this by heart as much as possible because it will be used thoroughly in our analysis in many places for clustering classification distance measures immediately will propose the distance measure now and we will talk about more of that later in the next few classes as well okay where you just need to remember that there is a normalizing term root over determinant of this symbol indicates that is the determinant of this covariance term the dimension d is also here the 2 pi and x minus mu sigma inverse x the transpose here which you must not forget okay so this which or you can actually remember this but this is the best thing to follow because henceforth we will keep on concentrating on different properties of this covariance matrix or its inverse in various forms and see the net result of classification job the covariance matrix almost says about what the classification task you are trying to solve okay you might have heard somewhere in our earlier discussions and going to hear a lot more in the future about linear decision boundaries and nonlinear decision boundaries class separability distance between clusters we talked at the very beginning using an animation all those will get reflected in that covariance matrix it will hold all the information of the data except the class means except the class mean the class mean information is now here in these 2 terms but this covariance matrix will hold all the information about the way the data varies along each every direction and relationship between the 2 directions so please remember this expression here as given by the it is again repeating them in the next slide as you can see here and a special case when d equal to 2 because this is where we had talked about the bivariate normal density where the capital X is given here 
becomes a 2 dimensional vector x comma y the corresponding mean vector is also mu x mu y and the covariance matrix will be a 2 cross 2 because d is equal to 2 and as promised earlier you have the variance along the diagonal of diagonal terms having the what are these covariance terms or you can add in terms of the correlation coefficient also so if the dimension d is equal to 2 you can see easily that this d is equal to 2 and the root of all will cancel out you can simplify this expression you can come with the determinant of this matrix also it is a 2 cross 2 very easy to compute and what will happen here is the inverse of this matrix will come in ct which is also not difficult to compute 2 cross 2 matrix easy to compute and I leave it as an exercise to see that using the sigma substituting it here you get back this expression which we had a few slides back what was this expression the bivariate normal density with the correlation coefficient and the corresponding two different means I did tell you that this square and the root over 2 pi you will be available here for you the determinant of this should be sitting here I leave this as an exercise fit through that and the rest of it take the inverse substitute it here this is what you will get actually truly speaking there is a normalizing term here also which will appear out of the determinant of this matrix because inverse will also have a factor of the determinant in the value I leave this analytical derivation to you as a home task because this will help you to get used to this expression help you to understand and also memorize as much as possible the task of this expression of the normal d dimensional distribution for a Gaussian this is a picture which shows a wire frame diagram of a two dimensional Gaussian function it is a very standard method of representing surface in the field of computer graphics so this is basically called a wire frame or a mesh diagram of such a surface in 2d and what you see here are 
with that corresponding Gaussian function you are seeing at the bottom what I called as contours of lines which are at a certain distance as defined by the density function for the d equal to 2k that means we are talking about two dimensional case the Gaussian distribution function or the Gaussian density function and these circles are intersections of this Gaussian surface with a horizontal plane of the value given by that corresponding distance that means if I take this blue circle at the outside which appears as an ellipse because of the projection basically it is a circle on that surface all the points in this surface are at equal distance it is like a circle obviously it has to be equal distance from the center which is given by the mu and that distance is some value which is defined by the covariance term and the density distribution function okay so closer the circle is to the mean you are at a lesser distance or at a higher density you can think that if you have a Gaussian surface and find an intersection for a planar horizontal plane with that corresponding surface you will get a contour those contours are shown below the surface at they are radially concentric circles with as you get out of the mu away from the mu you are talking about larger and larger values of distance from the mean given by this particular expression now you see what I have done is just switch the logic this expression is the same as the expression which we got in the previous and let us go back this expression what I have taken out is this particular term which is within the exponential and it can be easily taken out using a simple mathematical operation like as for example if you take a log of this expression let us say which will in fact do after sometime in the next class you take a log of this expression you will get this expression now which you can actually is the key which because it convince it actually contains the covariance matrix of a dimension D its inverse and this is 
within the exponential term which you can I have now taken that term and say it is a distance D of a point x I repeat again it is a distance D of a point x from the mean it is given you can suppress this term and use the rest of these two terms and you will also get a distance value which is typically what you will get is the I repeat again if you suppress the covariance term and take only these two and compute the distance you have the simple expression of distance measure which is called Euclidean distance very simple and that is a special case when the covariance matrix is an identity matrix that is a special case the covariance the individual variances are equal to 1 and the off diagonal terms are 0 all the correlation coefficients all the covariance are all 0 the diagonal terms which are sitting in the covariance matrix remember that expression which I asked you to almost memorize as well as this term it is an identity matrix and now this is what you have so now seamlessly we have moved from a distance function sorry I repeat again we have moved from a distribution to a distance the relationship is very close the covariance term is there in both the distance from the mean is there in both only the normalizing term and the exponent is taken out in one which was there in the expression of the Gaussian or normal distribution or density function in this case the distance we just take that term within the exponent because it seems to give an expression for a distance of a point that means what is the distance of this point to this point well you can measure it in two dimension x comma y then mu 1 mu 2 or mu x mu y and compute that is a Euclidean distance but if you want to take distributions of point into account then you must use the expression of which contains the covariance matrix as given in the slide this expression which you take will actually give you the correct distance incorporating the density distribution and the distribution of points within that and 
this sort of a distance measure is actually called in the field of statistics and estimation theory as well as pattern recognition the Mahalanubis distance determined by the covariance matrix and these lines are usually quadratic functions well in this case they appear to be elliptical or circular but they could be any other quadratic functions so we have just introduced the concept of a distance from the normal distribution and we will now see how this distance plays a very important role in the job of classification where we will now bring in concepts of what where did you have probability distributions for the classification task so far we have discussed one algorithm earlier by Professor Murthy that was the Bayes rule for classification it had probability functions probability density functions class priors class conditional distribution functions if some of them or one of them is a Gaussian function if we put that constraint now then we can derive distances out of those to put inside the Bayes decision rule so Bayes decision rule will now become a criteria based on distances instead of comparing probability we will compare distances so you can see that there is a nice correlation between the two probability and distances because the expression of the Gaussian function allows you to do such operations and you have already done so taken out the key term if you look back this is the expression we are talking about which contains the covenants matrix it was in the probability distribution function and is now taken as a distance and we will see using this distance how it can be incorporated in the classification function it will give us distance it will give us decision rules and it will give what are called decision boundaries and decision regions based on the certain discriminant functions you may get linear boundaries sometimes you may get nonlinear the next set of discussions will follow this and we will go towards linear and nonlinear decision boundaries as 
well as discriminant analysis which will slowly lead us to an LDA linear discriminant analysis for classification we will stop here.