Good to see you guys. Let's get started. So today I'm going to spend a bit of time teaching you about how to do classification by modeling the posterior probabilities. So the idea is that we have a number of classes that we wish to classify our data into, and we build a model of those classes. And the way we described it is that we have a likelihood function, we have a prior probability, and we form a posterior probability. And today what I'm going to show you is that we can write the ratio of the posterior probabilities, class 1 to class 2, class 2 to class 3, and the log of those ratios can be written as a formula that depends on the input x. And if the variances of the classes are equal, then the posterior probability ratio is going to be a linear function of x. If the variances are unequal, it's going to be a quadratic function of x. This is assuming that the distributions are Gaussian. And then at the end of the class, I'll show you what to do when your distributions aren't Gaussian. So for example, you have some collection of data with some arbitrary distribution, and you want to know how to build a model of the posterior probability for it. This is called kernel density estimation. Basically the idea is that we're going to put a little kernel at every data point that we have, and the sum of these kernels forms a density function. And then using that, we're going to be able to classify our data. In fact, the homework for today is classification using kernel density estimates. So to start off, let's go over what we have as our data. We have some vectors x and we have some labels y, and we have n of these. And these y's belong to some classes, maybe 0, 1, up to L, so L classes. And our objective is to build a model that allows us to discriminate between the various classes. And to give you an idea, suppose that we have a two-dimensional problem with coordinates x1 and x2.
And we have some x1's and we have some x2's. And when we build a classifier, what it's going to do is separate these two into regions. So for example, if the likelihood functions are Gaussian with equal variance, if this class has the same variance as this one, then the classification is going to separate them with a line, and today we're going to see what the equation for that line is. If on the other hand we have a scenario that looks like this, suppose that we have another class, a blue class as well, something like this. In which case now I'm going to have another line that separates this, and I'm going to end up with a region that I'm going to call blue, a region that I'm going to call red, and a region that I'm going to call black. And these lines that I drew, I drew two lines here, line 1 and line 2, and there's going to be another line as well, let's call it line 3. What these lines are, are the logs of ratios. Line 1 is the log of p of x given that y is equal to blue, times the prior probability of y is equal to blue, divided by p of x given that y is equal to red, times the prior probability of y is equal to red. Then I'm going to have another line: p of x given y is equal to blue, times the probability of y is equal to blue, divided by p of x given that y is equal to black, times the probability of y is equal to black. And finally the third line: I have red and blue already, so what I need is p of x given that y is equal to red, times the prior probability of y is equal to red, divided by p of x given y is equal to black, times the probability of y is equal to black. So the three lines are going to be where these three ratios are equal to one. And I'm going to show you that the logs of these ratios are indeed linear functions of x.
So this is the case if I'm assuming that the probability of x given that y is equal to some class i is normally distributed with mean mu i and variance sigma, and the probability of x given that y is equal to j is normally distributed with mean mu j and the same sigma. So equal variance. If this is the case, if my likelihood functions look like this, then my classifier is going to separate the classes based on these lines, and I'm going to show you why that's true. So in principle, if you wanted to classify, what you'd do is say, all right, compute the probability of x given that I belong to the blue class, the likelihood times the prior probability, and for each x here I want to compute this ratio. And if it turns out that for a point down here the log of this ratio is positive, and positive for this other ratio too, then that means the point most likely belongs to blue. So for each point you compute the ratio of the likelihood times prior for that class against every other class, and if it turns out that for that class it is higher than for every other class, that means its posterior probability is higher. So let me show you why these are going to be lines. What we're going to do today is start with the simplest case, where the variances are equal and the classifier is going to be linear. Then we do the condition where the variances are unequal, and in that case we're going to see that the classifier is going to be quadratic, so the boundaries are going to have shapes that are not lines but quadratic functions. And finally we'll do a classifier where the distribution is not Gaussian at all, it's some other function, and there you're going to see that it becomes a more complicated thing that depends on these kernels.
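The decision rule just outlined — for each point, compare likelihood times prior across classes, which is the same as checking that every pairwise log posterior ratio for the winning class is positive — can be sketched in a few lines. This is my own minimal Python illustration, assuming Gaussian likelihoods; the function names are mine, not from the lecture.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log of the multivariate normal density N(x; mu, sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma, diff))

def classify(x, mus, sigmas, priors):
    """Assign x to the class with the largest log(likelihood * prior);
    equivalently, the class whose log posterior ratio against every
    other class is positive (the p(x) denominator is common to all
    classes and cancels out of each ratio)."""
    scores = [log_gaussian(x, mu, s) + np.log(q)
              for mu, s, q in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))
```

Taking the argmax of the scores is equivalent to the pairwise-ratio test, since the winner's score exceeds every other score exactly when every pairwise log ratio is positive.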
Any questions? Okay, so why is it that the log of the ratio is linear when the likelihoods are Gaussian with equal variance? Well, let's look at what this is. What's the log of p of x given that y is equal to class 1? So if we begin with the vector form of the normal distribution, what I have is 1 over the square root of 2 pi raised to the power m times the determinant of sigma, where m is the size of x and sigma is the variance-covariance matrix, times the exponential of minus one half, x minus mu 1 transpose, sigma inverse, x minus mu 1. So that's the normal distribution associated with a vector x of size m. So what's the log of this function? From the constant in front I get minus m over 2 log of 2 pi, minus one half log of the determinant of the variance-covariance matrix. And expanding the exponent, I get minus one half times x transpose sigma inverse x, plus mu 1 transpose sigma inverse mu 1, minus 2 mu 1 transpose sigma inverse x, and then minus m over 2 log of 2 pi, minus one half log of the determinant of our variance-covariance matrix. So similarly, when I want to write the distribution associated with some other class, say class 2, the log of the probability of x given that y is equal to 2 is going to be exactly that, except that the mean is now mu 2 rather than mu 1. Okay, so I have the likelihood functions. Now, what about the prior probabilities? So the prior probability of y is equal to 1 is going to be some number, let's call it q1, and the prior probability of y is equal to 2, let's call that q2. Okay, so then what's the posterior probability? The probability of having class 1 given x is equal to the probability of x given that I have class 1, times the probability of being in class 1, divided by p of x. So this is the posterior probability of being in that class.
Well, when I want to find the ratio of the posterior probabilities, the log of that, the log of the probability of y is equal to 1 given x divided by the probability of y is equal to 2 given x, that's going to be equal to the log of the numerators, right? p of x given y is equal to 1 times the probability of y is equal to 1, divided by p of x given y is equal to 2 times the probability of y is equal to 2. So the log of the ratio of the posterior probabilities is equal to the log of the ratio of the likelihoods times the priors, because the denominator, p of x, the marginal, cancels out. So let's see what we have here. We have the logs of the likelihood functions that I wrote here, and I also can tell you the logs of the prior probabilities. So this is going to be equal to the log of the numerator, which is the log of p of x given that y is equal to 1, plus the log of the probability of y is equal to 1, minus the log of p of x given that y is equal to 2, minus the log of its prior probability, right? Okay, so this is just log of q1, and this is log of q2. So what I really need is this log minus this log. So I have written these logarithms out for you. Here's the log of the likelihood for being in class 1, here's the log of the likelihood for being in class 2, and notice what happens when I subtract them. The log of p of x given y is equal to 1 times the probability of y is equal to 1, divided by p of x given y is equal to 2 times the prior probability of y is equal to 2: the x transpose sigma inverse x terms cancel, and I'm left with mu 1 transpose sigma inverse x minus mu 2 transpose sigma inverse x. Let's see if that's right. Yeah, looks good. So notice that x appears here as a linear function: I have minus one half times mu 1 transpose sigma inverse mu 1 minus mu 2 transpose sigma inverse mu 2, and then this is a constant and this is a constant.
So I'm going to collect the constants and call them w0, and the whole thing is equal to some vector w transpose times x, plus w0. Yeah, sorry, you're right, there should be a 2 here, from the cross term; thank you. Okay, so what this says is that the ratio of the posterior probabilities is a linear function of x. So what happens at the decision boundary? What's the log of the ratio there? Zero. Zero, right, because the top and bottom are equal, so the log of 1 is zero. So one can compute the log of the ratio if one knows the mean of each of the distributions and the variance. And if the variances are equal, then it turns out that the log ratio can be represented as a linear function of x. So say that I want to know the log of the ratio between blue and red. What I need is the likelihood for blue and the likelihood for red, and then for any input x I can compute the log of the posterior ratio. So then which class am I going to say this data point belongs to? The class for which that log is larger, positive. I compare class 1 with class 2, class 1 with class 3; if the ratio is greater for 1, it belongs to 1. All right, so now let's do another example of this. Before I go to unequal variance, let me show what this means in one dimension. Suppose I have some likelihood function that has been multiplied by its prior, so this is the probability of x given that y is equal to 1 times the prior probability of y is equal to 1, and I have another one, p of x given that y is equal to 2 times the probability of y is equal to 2, and the variances here are the same.
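As a quick aside, the w and w0 of the equal-variance result can be computed directly from the two means, the shared covariance, and the priors. A minimal sketch, with my own function name; it just transcribes the terms of the derivation above.

```python
import numpy as np

def linear_log_ratio(mu1, mu2, sigma, q1, q2):
    """Coefficients of log[ p(y=1|x) / p(y=2|x) ] = w.x + w0
    for Gaussian likelihoods sharing one covariance sigma."""
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu1 - mu2)                      # linear term in x
    w0 = (-0.5 * (mu1 @ sigma_inv @ mu1 - mu2 @ sigma_inv @ mu2)
          + np.log(q1 / q2))                         # constant term
    return w, w0
```

The decision boundary is then the set of x where w dot x plus w0 equals zero, i.e. where the log ratio is zero.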
So the variance of this one is the same as that one; the reason this one is lower is that its prior probability is lower. So the decision boundary, of course, is going to be here, and if I plot the log of p of x given that y is equal to 1, times the probability of y is equal to 1, divided by the corresponding product for class 2, so the log of the ratio of the posterior probabilities, what I get is a line that crosses zero at the decision point. That's where it's equal to 0. On the other hand, suppose I have one distribution that looks like this, p of x given that y is equal to 1 times the prior probability of y is equal to 1, and another distribution that looks like this, p of x given that y is equal to 2 times the prior probability of y is equal to 2. So clearly now the variances are different; the variance of class 2 is much larger than the variance of class 1. So what are the decision boundaries here? My decision boundaries are here and here: when I'm out here, I'm going to say this belongs to class 2, in here it belongs to class 1, and out here it belongs to class 2 again, right? Okay, but what does the log of the ratio of the posterior probabilities look like? I want to plot the log of p of x given that y is equal to 1 times the probability of y is equal to 1, over the same thing for class 2. What do you think this looks like? Let's begin out here on this side. Is it going to be positive or negative? It's going to be negative. Why? Yeah, class 2 is bigger there, right? So it's going to be negative down here. What about here? It's going to be positive. What about here? Yeah, negative. So it's going to look like this: a quadratic function. Let's show that it's going to be a quadratic function. All right, so what we have is we want to compute the log of the probability of x given that y is equal to 1 times the probability of y is equal to 1, and we're going to assume that this likelihood function is a normal with mean mu 1 and variance sigma 1. Okay?
So the only thing that's different from the derivation we just did is that each of the likelihood functions is going to have its own variance. So this one is going to be sigma 1, and the second class is going to have sigma 2. So the log of this function, let's see what that's going to be. We have the terms of the exponent, then minus m over 2 log of 2 pi, minus one half log of the determinant of the variance-covariance matrix, and then we have the log of q1, where q1 is the prior probability. So let's simplify just a little bit. All right, so what's the log of this ratio? This time the quadratic term doesn't cancel, because the term below has a different variance: I get minus one half x transpose sigma 1 inverse x, minus the same term with sigma 2 inverse. And then this term here, minus 2 mu 1 transpose sigma 1 inverse x, minus, although it doesn't really matter, it's just more convenient to write it this way, minus 2 mu 2 transpose sigma 2 inverse x. And then I have the differences in the mean terms, and then some constants get added at the end; these logs add at the end of it. The minus m over 2 log of 2 pi terms cancel, but the determinants don't: I get minus one half log of the determinant of sigma 1 over the determinant of sigma 2. Does that look right? Yes. Okay, so this is a quadratic function of x, right? I have a term that's quadratic in x, then a linear function of x, then some constants. So this is equal to x transpose times W12 times x, plus w12 transpose times x, plus w0, where W12 is a matrix, w12 is a vector, and w0 is a constant. So I get a quadratic function of x. That's the log of the ratio of the posterior probabilities. So for example, let's try three classes, so that we can now draw the decision boundary. I have x1 and x2, and I have some black dots, some blue dots, and some red dots. What I'm going to compute is the log of the ratio of the posterior probabilities, say blue over red.
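Before drawing the three-class picture, the quadratic form just derived can be written out explicitly: a matrix W for the quadratic term, a vector w for the linear term, and a constant w0 collecting the means, determinants, and priors. A sketch under the lecture's Gaussian assumptions; the naming is mine.

```python
import numpy as np

def quadratic_log_ratio(mu1, sigma1, mu2, sigma2, q1, q2):
    """Coefficients of log[ p(y=1|x) / p(y=2|x) ] = x.W.x + w.x + w0
    when each class has its own covariance (quadratic boundary)."""
    s1, s2 = np.linalg.inv(sigma1), np.linalg.inv(sigma2)
    W = -0.5 * (s1 - s2)                              # quadratic term
    w = s1 @ mu1 - s2 @ mu2                           # linear term
    w0 = (-0.5 * (mu1 @ s1 @ mu1 - mu2 @ s2 @ mu2)   # mean terms
          - 0.5 * (np.linalg.slogdet(sigma1)[1]      # determinant terms
                   - np.linalg.slogdet(sigma2)[1])
          + np.log(q1 / q2))                          # prior terms
    return W, w, w0

def log_ratio_at(x, W, w, w0):
    """Evaluate the quadratic log posterior ratio at a point x."""
    return x @ W @ x + w @ x + w0
```

When sigma1 equals sigma2, W vanishes and this collapses back to the linear, equal-variance case.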
What I'm going to get is, for each x, a number, the log of the ratio of the posterior probabilities, and I'm going to assign x to the class for which that ratio is largest. And the decision boundaries are going to become quadratic. So for example, for blue I might get something like this, and so forth. So I will be able to separate out a quadratic region associated with one class, and then other regions associated with the red and the blue. So the idea is that you don't need to compute the posterior at each point numerically; you model it basically as a closed-form relationship between the variances and the means. Those W's tell you what the ratio of the posterior probabilities is going to be. Okay, alright. Any questions about this? Alright, so let me now spend a minute telling you about what to do when your data is not Gaussian. So our problem is that we have some data that we've collected. Let's begin with a one-dimensional version of this, and we can't assume that the distribution is Gaussian. So my problem is that I want to formulate the probability of x given that y is equal to 1, but what I have is a bunch of data, and the data looks like this. So here's x, and when I look at my data, I have one data point here, two here, three here, something that looks like this. Maybe it's even worse than this; maybe I have a couple of data points out here. So clearly it doesn't look like a Gaussian. This is my x given y. So I want to represent this with a density function. Why do I need the density function? Because I need to be able to describe the likelihood. So how do I do that? Well, the way to do it is to plop down a kernel, and I'm going to show you what a kernel is, at each data point. When you have a data point, you just plop a kernel on top of it, and the more data points you have, the more the kernels build up on top of each other.
And then what happens is that you need to normalize the whole thing so that the integral of your distribution is one. So that's what kernel density estimates do. So our kernel here is a function of u; I need to separate what u is from x. So suppose that u is my variable and the kernel is K, and we're going to describe what this kernel is. The kernel is going to be just a Gaussian sitting at a particular data point. So when I say form a probability p of x, suppose that I take all my data points that belong to class one, say n of them, and I formulate my density function as follows: I sum up all the K's, where each kernel is plopped on top of a data point, so the argument is x minus xi divided by h. So this is a function of u, and this is how I define u: K of u, with u equal to x minus xi over h for each data point that I have. What is h? What h does is the smoothing of this kernel. The kernel is going to be really sharp if h is small; it's going to be broad if h is large, right? So h acts like a standard deviation. If h is large, you're plopping down a kernel that has a lot of breadth; if h is small, you're saying I'm going to have a kernel that's really sharp. What that means is that when you end up computing your density, you're going to get things that look like this if h is one, and like this if h is two, and everything in between. It depends on how sharp you want your density function. Yeah. [Student asks about how finely x is quantized.] Exactly. Okay, what about if you have a multidimensional x? Then K becomes a function of the vector u, and that's just the usual normalized Gaussian.
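The kernel density machinery just described — a Gaussian kernel on each data point, averaged and scaled so the estimate integrates to one — can be sketched in a few lines, in one dimension and in m dimensions. This is a minimal illustration of the idea, not the homework solution; the function names and the single bandwidth h shared across dimensions are my own choices.

```python
import numpy as np

def kde_1d(x, data, h):
    """1-D kernel density estimate at x: Gaussian kernel K(u) on each
    data point; the 1/(n*h) factor makes the estimate integrate to 1."""
    u = (x - data) / h                                  # one u per data point
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)        # Gaussian kernel K(u)
    return k.sum() / (len(data) * h)

def kde_md(x, data, h):
    """m-dimensional version: one m-dimensional Gaussian kernel per row
    of data, with the same bandwidth h in every dimension."""
    n, m = data.shape
    u = (x - data) / h                                  # shape (n, m)
    k = np.exp(-0.5 * (u**2).sum(axis=1)) / (2 * np.pi) ** (m / 2)
    return k.sum() / (n * h**m)

def kde_classify(x, class_data, priors, h):
    """Assign x to the class maximizing KDE likelihood times prior."""
    scores = [kde_md(x, d, h) * q for d, q in zip(class_data, priors)]
    return int(np.argmax(scores))
```

Small h gives a spiky estimate hugging the individual data points; large h gives a broad, smooth one, exactly the trade-off described above.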
And then this p of x given that you belong to class y, the probability of x given that you have some class, is going to be as follows, summing over all the points that are in class y equals l. So now instead of summing up one-dimensional Gaussians, you're summing up m-dimensional Gaussians, where m is the size of the vector x, K is the kernel function, and h again is our smoothing parameter. Okay, all right. One last topic to tell you about. Sometimes when you're doing this kind of analysis, what happens is that you end up with a data point that's missing some values. So I want to tell you about missing or bad data. You're doing classification and you have some vector space that you're trying to classify in. Say that we have this two-dimensional space, with x1 and x2, and then we have some data point that's missing a component; say that we are missing x1. So we pick a point, and one of these components is there and the other one isn't. What do we do with that? How do we classify it? So let me show you the process. Suppose that we have a number of classes. We have some class out here, call it class 1. What am I plotting? I'm plotting p of x given that y is equal to 1 times the prior probability of y is equal to 1; that's the numerator of my posterior probability. Say it looks like this, with a center sitting out there. And I have another class out here and another class out here: this is class 2, this is class 3. Three classes. So suppose that you give me a data point, and for that data point I have the x2 value, here it is, but I don't have the x1 value. So what do I do? Let's consider two approaches. My problem is that I am missing x1 for some data point i; I'm missing this component of my vector, I have a bad data point. So suppose that we do this.
Suppose we take the expected value of x1 over my data and substitute it for the part that I'm missing. So if you look at this distribution here, what's the expected value of x1? Approximately where is it? In the middle. So someplace out here, right? So if I were to replace my missing value with this, the expected value of x1, what happens? Well, then I end up with some point out here, right? That becomes my point. Then I would say, well, which class does that belong to? Class 2. Okay, but that's wrong. Let me show you why. If you don't have that component, you can still make a classification. Why is that? Because if you were to compress everything onto x2, since you don't have x1, right? If you were to compress everything onto x2, what you end up with is a distribution that looks like this: this is 1, this is 2, this is 3. So if you compress, draw the distributions over to one side, find the marginal on that side, then what happens is that the point I wanted actually belongs to class 3. So this is the intuition; I'll show you the math for why that's true. But the basic intuition is that when you're missing a component, you can still make a decision by integrating out the effect of the variable that you're missing. So if I'm missing x1, I can still compute the probabilities based on x2, which means integrating out x1, projecting everything onto x2. Here's what the distribution looks like: I have a blob at 1, a little blob at 2, and a blob at 3. Well, my data point is sitting here, at the center of 3. Okay, so let me just show you why that's true. So my vector x, I'm going to split into good and bad. There's some component that I have, that I like, which I'm going to call x g, and some component that I don't have, and that's my bad component, x b.
So the probability of y is equal to some class i, given that I have the good data, is the joint probability of being in that class and x good, divided by the probability of x good, which is the integral over x bad of p of y is equal to i, x good, x bad, divided by p of x good. So I just wrote the Bayes probability as the joint probability divided by the marginal, and this joint with only x good is the same as the full joint with x bad integrated out. So this quantity, the joint probability, I can write as the integral of p of x good, x bad, given that y is equal to i, times the probability of y is equal to i, d x bad. So what this numerator is doing is integrating out the effect of the dimension x b, the bad dimension, which is what we did here: we wrote the likelihood times the prior probability after we integrated out the effect of x1 by projecting the data onto x2, which is exactly this numerator. So that's the posterior probability, given that you only have good data. You just project the data onto the dimensions that you have, and that becomes your likelihood function. Alright, see you on Wednesday.
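For the Gaussian likelihoods used throughout the lecture, integrating out the bad components is especially simple: the marginal of a Gaussian over a subset of coordinates is the Gaussian you get by keeping only the corresponding entries of the mean and rows and columns of the covariance. A sketch under that assumption; the function name and argument layout are mine.

```python
import numpy as np

def marginal_log_score(x_good, good_idx, mu, sigma, q):
    """log[ p(x_good | y=i) * q_i ], the numerator of the posterior when
    only the 'good' components of x are observed. For a Gaussian,
    integrating out the missing components just means dropping their
    rows and columns from mu and sigma."""
    mu_g = mu[good_idx]                               # marginal mean
    sig_g = sigma[np.ix_(good_idx, good_idx)]         # marginal covariance
    d = len(good_idx)
    diff = x_good - mu_g
    logN = -0.5 * (d * np.log(2 * np.pi)
                   + np.linalg.slogdet(sig_g)[1]
                   + diff @ np.linalg.solve(sig_g, diff))
    return logN + np.log(q)
```

Classifying with the argmax of this score over classes reproduces the lecture's example: a point whose observed x2 sits at the center of class 3 goes to class 3, even if mean-imputing x1 would have put it in class 2.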