Today we shall be discussing probability density estimation. You have already seen in these lectures that whenever we discussed classifiers, many times we assumed the form of the density function: either it is a normal density or some other density function. In reality, how do you know what density function the dataset follows? This is the problem which I am going to tackle today.

There are various approaches for finding densities given a dataset. One of the approaches, which we generally learn in the second or third year of any statistics course, is fitting a distribution. Given a dataset, you assume that it follows, let us just say, a binomial distribution, then you estimate the parameters of the binomial distribution and check whether the fit is good or not using a chi-square test or some other such test. The same is true whether you want to fit a Poisson distribution, a normal distribution, or any other distribution. So basically, in this approach you assume a functional form of the distribution and estimate its parameters.

But it is not necessarily true that every dataset follows a standard distribution. If a dataset follows something that is not one of the usual distributions, then how do we actually get the distribution? In those cases we cannot assume the functional form of the density function. So the question is: how does one estimate the functional form of the density function automatically from the dataset? That is the basic question that needs to be answered. As you can see, it is a fundamental problem, and it is a fundamental question even for statisticians.

One of the first works on this was done by Parzen in 1962, and the paper was published in the Annals of Mathematical Statistics (AMS). This is usually known as Parzen density estimation. Now let me just explain this
thing to you. Suppose you have n observations, drawn randomly from an unknown distribution, or I should say independent and identically distributed. The setup that one always likes to write is: $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables. I have already explained to you the meaning of independent and identically distributed, so I will not explain it once again. They have the same probability density function f, where f is unknown.

So this is the basic problem: you have independent and identically distributed random variables; since they have the same distribution, they have the same density function f, but f is not known. Another way of saying it is that you have a dataset drawn independently from the same distribution, and the dataset is represented by $x_1, x_2, \ldots, x_n$. This is the way it is usually formulated. Now the question is, how does one find f?

I will do it in the following way. We shall take a positive quantity h; let $h > 0$. How to choose h, I will come to later. Now, for every $x \in \mathbb{R}$ you can define a function $g_x(y)$, which depends on x: it takes the value $1/(2h)$ if y lies between $x - h$ and $x + h$, and it takes the value 0 if y does not belong to this interval. Now what is the meaning of
this? Suppose this is x, this is $x - h$ and this is $x + h$. The height is the same everywhere in between, and it is $1/(2h)$. Note that the length is $2h$ and the height is $1/(2h)$, so the area under this curve is 1.

After we define g, we define the estimate of the function f using n points. Since it is an estimate, I am using the symbol hat; since it is based on n random variables, I am using the subscript n; and I am finding the value of the estimate at the point y. So it is $\hat{f}_n(y) = \frac{1}{n} \sum_{i=1}^{n} g_{X_i}(y)$, where g is the function I have already defined.

Let me first explain some of the mathematical properties; after that I will explain intuitively what is happening here. Note that $\int g_x(y)\,dy = 1$ for all x. You can divide the whole range into three parts: $-\infty$ to $x - h$, $x - h$ to $x + h$, and $x + h$ to $\infty$. On the middle portion the integral is 1, and on the other portions it is 0, since the function takes the value 0 there. So in total it is 1, and this is true for all x. Now what about the integral of $\hat{f}_n$? It is the average of n integrals, each of which is 1, so it is also 1.

What are the properties of densities? A function f is said to be a density function when $f(y) \geq 0$ for every y and its integral is 1. Here the estimate is nonnegative and its integral is 1, and it does not matter what $X_i$'s you take. So $\hat{f}_n$ is actually a density function.

Now let me tell you what exactly is happening. Say this is our dataset. g is actually a window, and the
integral is 1; this is a special case. I will be using the words window and other things slightly later.

Say this is our given dataset. How many points are here? 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 points, so n = 10. Now we are supposed to estimate the density function. Given these 10 points, where do you think the density would be high? Somewhere here, right? And then the density is decreasing.

Now suppose this much length is h. Say this point is the minimum, and this point is the minimum minus h. At anything less than this, the estimate of the density will be 0, because $g_x(y)$ is 0 for any y less than $x - h$, and here x is the minimum. On the other side, this is your maximum and this is the maximum plus h; at anything greater than this, the estimate of the density will be 0.

Now let us see what happens when you take a y here. With respect to this point you will get the value $1/(2h)$, but with respect to this one, or this one, you will not get any $1/(2h)$. So for all these y, the height is $1/(2hn)$. Are you understanding me? I will repeat it. If you take any y here, it is in the 2h interval of this point, but it is not in the 2h interval of this point or of any other point. So the value is $1/(2h)$, and there is an n in the denominator, so the height is $1/(2hn)$ for all the y here.

Now let us look at this next point. Let us just say that for this point the h interval is coming, say,
somewhere here. Then from this point onwards the height will go to $1/(2hn) + 1/(2hn)$; it is increased from this point onwards, so this height is $2/(2hn)$. Maybe the influence of the first point comes up to here. Then the interval of the next point is starting, say, somewhere here, so from this point onwards the height will be $3/(2hn)$, because this y is in the h interval of this point, of this point, and of this point. Somewhere further along it will be more. Actually, one should not draw these vertical lines; one should draw only the horizontal pieces. Somewhere here you are going to have some maximum, maybe.

So this is basically a step function that you are seeing, and it goes up like this; depending on the dataset, the shape will change. This is $\hat{f}_n(y)$. Given the dataset, whatever our feeling was about where the density should be maximum, $\hat{f}_n$ is actually giving that.

Intuitively, yes, it works, but we are getting a step function; we are not getting a continuous function. It is actually discontinuous at many points, as you can see. Now the question is, does it have any theoretical properties? What Parzen did was show that, under some conditions, this density estimate goes to the actual density, whatever its functional form may be. Now let us see what those conditions are.

The first condition is regarding h: h needs to be a function of n, so let me write it as $h_n$, where n is the number of points. Now, as the
number of points n goes to infinity, $h_n$ should go to 0. That is the first condition: $h_n \to 0$ as $n \to \infty$. Why? Suppose $h_n$ is not going to 0; it is going to, let us just say, some quantity $\delta > 0$. Suppose the original density is $f(x)$ where x takes values in the interval 0 to 1. Then $\hat{f}_n(y)$ can be positive over $-\delta$ to $1 + \delta$, because you are always taking a neighbourhood around each one of the points given to you; you are decreasing the neighbourhood size, but not to 0, only to some quantity greater than 0. So points lying in those intervals outside 0 to 1 are going to have positive values, and in that way the estimate will not go to the original density. So if you want to get the original density, $h_n$ should go to 0 as n goes to infinity.

There is a second condition. Suppose $h_n$ goes to 0 very fast. What is the meaning of that? It goes to 0 so fast that the intervals or open sets we are considering around the given points ultimately contain only those points and no other points. In that case you will only get some height at each given point. That should not happen; you want each interval to include some more points other than the given point. So $h_n$ should go to 0, but slowly. The second condition is that $n h_n \to \infty$ as $n \to \infty$. There are actually many sequences which satisfy these two conditions. I will tell
you a few such sequences. One sequence is $h_n = 1/n^{1/3}$: it goes to 0 as n goes to infinity, but $n h_n$ goes to infinity. Another is $1/\log n$; you are computer scientists, we like to look at $\log n$. $1/\log n$ goes to 0, but $n/\log n$ goes to infinity, right? So there are many sequences which satisfy these two conditions. These are two very mild conditions, and intuitively we understand how they are coming.

In fact, he used one other condition, which is $n h_n^2 \to \infty$. Using these conditions, he showed that this density estimate is asymptotically unbiased and consistent. What is the meaning of asymptotically unbiased? First you find the expectation of $\hat{f}_n(y)$, which is naturally a function of n; this expectation goes to the actual value of the function. I will write the statement clearly: $E[\hat{f}_n(y)] \to f(y)$ for every continuity point y of f, and the squared error goes to 0 as n goes to infinity. For the latter, I think he used the condition $n h_n^2 \to \infty$.

Note that if $n h_n^2 \to \infty$ is satisfied, then surely $n h_n \to \infty$ is also satisfied, because $h_n$ is going to 0; but $n h_n \to \infty$ being satisfied does not mean that $n h_n^2 \to \infty$ is satisfied. So if you take a sequence with $h_n \to 0$ and $n h_n^2 \to \infty$, then naturally $n h_n$ will also go to infinity. He showed that this estimate is asymptotically unbiased and consistent; consistent basically means that the error goes to 0.

Now I will come to the wording used by Professor Das, who used the word window. What exactly are we doing? We are taking a window extending h on either side of every point, and in that window we are considering some value, and otherwise we are
taking 0, and we are moving that window over the entire dataset. So this is also known as the Parzen window technique for estimating density; you will also see the words Parzen windows in the literature.

Now let us look at this function g. This is the uniform distribution, as you know. You might ask me a question: suppose I do not take uniform, I take some other distribution, then what is going to happen? Yes, you need not take the uniform distribution. You can take, for example, a triangular distribution, a symmetric one. So this is h and this is h, so the base is 2h; half times base times height must be equal to 1, so the height must be $1/h$. You can take this, and then also you will get a similar result.

In fact, he showed it for a class of functions, and he called them kernel functions. One of the first places where you see the word kernel is in Parzen's work, so this is also called kernel density estimation. He gave the properties of kernels; you will see them in his paper and in many books, such as Fukunaga's book and many other pattern recognition books. Whichever kernel function you take, if it satisfies these properties, it will give you that result, that is, asymptotically unbiased and consistent. This is a very strong statement. That is why, though Parzen density estimation was done in 1962, even now we are talking about it: the conditions are simple, he used a wide variety of functions, and for all those wide
varieties of functions he could prove the same result. Naturally, the normal is also included here. If you take the normal distribution for g, then h is going to play the role of the spread: you take a normal distribution with mean x and standard deviation h, that is, variance $h^2$. Are you understanding me?

Now, this has been proved for the univariate case. What is the meaning of the univariate case? Note that we are assuming random variables, not random vectors. Suppose instead you have random vectors, that is, you have observations in two-dimensional or three-dimensional or, in general, m-dimensional space, and you would like to estimate the density. The generalization to multiple dimensions was done by Cacoullos, in a paper in the Annals of the Institute of Statistical Mathematics in 1966. The generalization is fairly simple, virtually writing the same things many times. For example, for two dimensions, instead of one h you might have two, $h_1$ and $h_2$. If you are in m-dimensional space, you have m such values $h_1, h_2, \ldots, h_m$, each one dependent on n, and for each of them you have to assume that it goes to 0, and all the corresponding conditions.

I will just write down the denominator portion for the m-dimensional case. Can you tell me what it would be? Our random vectors take values in m-dimensional space, and each of $h_1, h_2, \ldots, h_m$ is again dependent on n, so basically you have $h_{1n}$, $h_{2n}$ and
$h_{mn}$. Here you have x, and here it is y: $g_x(y) = \frac{1}{2^m h_{1n} h_{2n} \cdots h_{mn}}$ if $y_i \in (x_i - h_{in},\, x_i + h_{in})$ for all $i = 1, \ldots, m$, and 0 otherwise. What are these $x_i$'s and $y_i$'s? Your vector x is $(x_1, \ldots, x_m)$ and your vector y is $(y_1, \ldots, y_m)$. Instead of a single x you have a vector x, and instead of a single y you have a vector y; in $g_x(y)$, y is the variable. Look at the ith component $y_i$: if it lies between $x_i - h_{in}$ and $x_i + h_{in}$ for all i, then the value is this one; otherwise it is 0.

So here also you are considering a window. And what is this denominator? It is the volume of an m-dimensional rectangle whose first side is $2h_{1n}$, whose second side is $2h_{2n}$, and whose mth side is $2h_{mn}$. If you write down the estimated density function with respect to this, then at each point y what you are going to see is how many of the $X_i$'s are lying in that particular volume around y, divided by n times this volume. I will repeat it: in the denominator you will see a volume and also n, and in the numerator the number of points lying in the rectangle around the point y. This is if you assume the uniform distribution.

Now let me ask you a simple question. When did you first come across the word density? Probably in class 8 or 9. And what did we learn about the meaning of the word density? Mass by volume, right? Here you see volume, and instead of mass you have the number of points, out of the n points. So instead of mass, if you replace that thing
by this number of points, it is basically like that word density that we have used, and there is a volume here around y. Do you get my point?

This is for the single dimension. What does it mean? You take y, you take the h interval around it, and you find the number of $X_i$'s lying in that interval; if there is no point, the estimated density is 0. This is generalized to higher dimensions: you have a volume, and you count the number of points in the rectangle of this volume around y, that is, among these $X_i$'s, how many are lying in this volume around y.

The basic expression shows the interval around x, while here I am taking it around y. Forget about all i and let us look at only one. $y_i$ belongs to $x_i - h_{in}$ to $x_i + h_{in}$ means $x_i - h_{in} < y_i < x_i + h_{in}$, right? Bring this $h_{in}$ to the other side and you get $y_i - h_{in} < x_i < y_i + h_{in}$. So either you write this or you write that; it is the same. y is the variable at which you want to get your density, and these $x_i$'s are the things that are given to us.

Can this m-dimensional rectangle also be replaced by an m-dimensional sphere? Yes, of course; that is what those kernel functions do. You can use any kernel function; the uniform distribution is just one of them. You can use arbitrary shapes, no problem.

So, looking at this, Loftsgaarden gave his
own density estimate. Note that here we are fixing the volume and counting the number of points. What Loftsgaarden and Quesenberry did in 1965 was to fix the number of points and vary the volume. How do you fix the number of points? Say for a particular point y you want to find its estimated density. You find its kth nearest neighbour, and you find the distance between y and the kth nearest neighbour. Use that distance as the radius and find the volume. Then the estimated density is k divided by n times that particular volume with that radius, where the number of points k is fixed. This density estimation procedure is known as the k-nearest-neighbour density estimation procedure (Loftsgaarden and Quesenberry, 1965, Annals of Mathematical Statistics). Now, using this k-nearest-neighbour density estimation procedure, you can derive the k-nearest-neighbour decision rule.
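
The two ideas in this lecture, fixing the volume and counting points (the Parzen window with the uniform kernel) versus fixing the number of points and letting the volume vary (the k-nearest-neighbour estimate), can be put side by side in a small one-dimensional sketch. This is only an illustration under stated assumptions: the function names `parzen_box_estimate` and `knn_estimate` and the sample data are hypothetical, and in one dimension the "volume" is taken to be the length 2h or 2r of the interval around y.

```python
import math


def parzen_box_estimate(y, data, h):
    """Uniform-window Parzen estimate at y.

    Counts the samples x with |y - x| < h and divides by n times the
    window length 2h, i.e. (1/n) * sum of g_x(y) over the samples.
    """
    if h <= 0:
        raise ValueError("h must be positive")
    count = sum(1 for x in data if abs(y - x) < h)
    return count / (len(data) * 2.0 * h)


def knn_estimate(y, data, k):
    """k-nearest-neighbour estimate at y in one dimension.

    Uses the distance r from y to its k-th nearest sample as the radius,
    so the 'volume' is the interval length 2r, and returns k / (n * 2r).
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of samples")
    distances = sorted(abs(y - x) for x in data)
    r = distances[k - 1]
    if r == 0.0:
        raise ValueError("k-th neighbour coincides with y; volume is zero")
    return k / (len(data) * 2.0 * r)


# Hypothetical sample of n = 5 points, chosen only for illustration.
sample = [1.0, 2.0, 2.5, 3.0, 5.0]

# Fixed window h = 0.5 around y = 2.4: two samples fall inside.
print(parzen_box_estimate(2.4, sample, h=0.5))

# Fixed k = 3: the radius is the distance to the third nearest sample.
print(knn_estimate(2.4, sample, k=3))
```

Far from all the samples the Parzen estimate is exactly 0, as in the discussion of minimum minus h and maximum plus h, while the k-nearest-neighbour estimate is always positive because the interval simply grows until it contains k points.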