So we are done with the Bellman equation, which probably means that the next few lectures are going to be easy for you guys, since I think the hard stuff is done. But before I tell you about classification, which is the next three lectures, do you have any thoughts about the Bellman equation that I can help you with? OK, well, you have homework with signal-dependent noise due, I think, today. Maybe on Monday you'll have some questions about it. I want to spend some time talking about classification. This is the basics of classification, thinking about how to do some of the techniques that are fairly common. Today we're going to cover Fisher discriminant analysis, which is useful from the point of view of thinking about how to optimize cost functions whose derivatives are not easy to work with. So you'll see a special kind of cost function, one where what you want to do is take the means of two samples and make them as far apart as possible, weighted by the variance. You want to make the variance as small as possible and the means as far apart as possible so you can classify these two things. Fisher discriminant analysis is a way of doing that, but the mathematics involved in maximizing this distance while minimizing the variance is a little bit interesting, and so it's worthwhile looking at because it's an example of a cost function for which there is a closed-form solution, but not through the derivative approach, which is what we've been doing. So that's one of the things that is worth knowing how to do. The second thing I want to show you is Bayesian classification. It's the beginnings of Bayesian classification, which is basically the idea that you're going to have a likelihood function, you're going to have a prior, and then from the likelihood and the prior you're going to formulate a posterior probability.
And we're going to talk a little bit about the uncertainty associated with that posterior probability: when you make a classification, how sure are you that the classification is good, and what are the error rates associated with it? So let's begin with a simple classification problem where you've been given some data and the data is labeled: you have an x vector, which gives you the features, and then you have a label associated with it that tells you the class. From that you're going to build a model, and we're going to build a linear model. The natural approach initially is to just use regression, which is to say, well, I have all this data x and the labels for it, and I just want to build a model that predicts the class from a weighted combination of the features. If we use regression to do it, then the problem that comes about concerns the distribution of the errors in our estimation. When you use regression, you want the output to be class zero or one, but you're going to get a real number, like 0.3 or 0.8. So you can say, all right, anything above 0.5 I'm going to call class one, anything below 0.5 I'm going to call class zero. But the problem is that if you use regression to do classification, then the errors that you have in estimation, the difference between the truth and your prediction, are not going to have a Gaussian distribution. And that is the basis of our assumption when we say that we make a prediction, we make an observation, and the difference between those two is just noise: the noise is supposed to be Gaussian, with a variance that is independent of the mean.
So we'll see that if we use regression, we end up with an error distribution where the variance depends on the mean, and that becomes a signal-dependent noise problem. So regression is not a good way to go, and we'll use other approaches: Fisher discriminant is the first thing I'll show you, and then a simple Bayesian approach. So we have a feature x that we've been given. This is a vector, and we're going to classify it as belonging to either class 0 or class 1. The data comes as pairs: x1 with label y1, a second labeled data point x2 with y2, and so forth, where each y is either class 0 or class 1. What we can do is build a model that says my prediction is going to be equal to some weighted combination of these features. I can write this in matrix form, where each row of the matrix X is 1, the first element of that x vector, the second element of that x vector; then 1, the first element of the next x vector, the second element; and so on. And my weight vector w is going to be w0, w1, through as many weights as the number of features I have. So Xw gives 1 times w0 plus the x's times the remaining w's, in this vector representation. So I can find my maximum likelihood estimate w hat: the maximum likelihood solution is (X transpose X) inverse times X transpose Y. So using regression, we can find our set of weights. Now, when we do this, what happens? Suppose we have a scenario that looks like this. We have some data points and two classes; the axes are x1 and x2, and each data point is an instance where we've collected one of these feature vectors. The color shows whether it belongs to class 0 or class 1. That's the data. So if you use regression to do the classification, what you're going to end up with is a line that separates these two classes.
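As a sketch of this regression-based classifier, here is a minimal numpy version; the cluster locations, spreads, and sample counts are made-up illustrative numbers, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two labeled clusters in a 2-D feature space (class 0 and class 1)
X0 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
X1 = rng.normal([3.0, 3.0], 0.5, size=(50, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Design matrix with a leading column of ones for the intercept w0
A = np.column_stack([np.ones(len(X)), X])

# Maximum-likelihood weights: w_hat = (A^T A)^-1 A^T y
w_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Regression gives a real number; threshold it at 0.5 to classify
y_hat = (A @ w_hat > 0.5).astype(float)
accuracy = float(np.mean(y_hat == y))
```

For well-separated clusters like these the thresholded regression works fine; the trouble described next appears when the class variances are unequal.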
And the equation for this line is 0.5 = w0 + w1 x1 + w2 x2. That's the equation for that line, and the line is the boundary that separates these two classes. W0, w1, w2 are what comes out of this regression. All right, so that's fine. But now let me show you what happens when you use regression to try to find these variables on another example. Suppose that I have a data set that looks like this: one group with a few data points with small variance, and another group with a few data points but with large variance. What's going to happen is that when you use regression to find the decision boundary, you're going to find a boundary, like before, someplace out here. But then, if you plot the line for 1 = w0 + w1 x1 + w2 x2, you're going to find that there are some data points that are outside of your class, if you will: they're beyond 1. And there are going to be a lot of data points outside of 1. So it's kind of strange. What does this mean? You use regression, and you don't get either class 0 or class 1; you get this continuum. And this particularly happens when the variances are unequal. So let me show you the root cause of the problem. Suppose that on every trial we define the error as the difference between the true class and our estimate, and we're going to use regression to do it. So we have some probability of the class being 1 given x; say this probability is theta. That is, for this x, the probability of being in class 1 is theta. And of course, the probability of y being class 0 given x is 1 minus theta. So let's ask: what's the expected value of y given x?
Well, that's the probability of y equal to 1 given x times 1, plus the probability of y equal to 0 given x times 0, right? These are the two values of y: it's either going to be 0 or 1, and these are the two probabilities. So this is equal to theta. What's the expected value of y squared given x? That's the probability of y equal to 1 given x times 1 squared, plus the probability of y equal to 0 given x times 0 squared. That's just equal to theta again. So what's the variance of y given x? That's the expected value of y squared given x minus the expected value of y given x, quantity squared, which is theta minus theta squared, which is theta times (1 minus theta). So why is it that I'm interested in the variance of y? Because the variance of epsilon is going to be equal to the variance of y given x. And we see that this variance depends on the mean: the variance of my error depends on the mean of y. It's not independent of y. So that's the problem, and it's a problem that we can't fix if we're going to use regression. So one way people have approached it, in a frequentist framework, where we maximize likelihoods and so forth, is something called Fisher discriminant analysis. Let me show you what that is. The idea is that when you take the data set and you project it, so let me do it on this example here. When I find a linear function like this to represent the data, effectively what I'm doing is fitting some line to my data and then looking at the distribution of the projections of the data points onto that line. So if you look at these blue dots here and you project each one of them onto this line, what you're going to get is a distribution that looks like this. And if you do the red ones, you get a distribution that looks like this.
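The derivation above, Var[Y|X] = theta(1 - theta), can be checked empirically with simulated Bernoulli labels; the particular theta values and sample size below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# For a Bernoulli label Y with P(Y=1 | X) = theta:
# E[Y|X] = theta and E[Y^2|X] = theta, so Var[Y|X] = theta*(1 - theta);
# the variance of the regression residual therefore depends on the mean.
thetas = np.array([0.1, 0.5, 0.8])
analytic = thetas * (1 - thetas)       # theta - theta^2
empirical = np.array([(rng.random(200_000) < t).var() for t in thetas])
```

The empirical variances track theta(1 - theta), which is exactly the signal-dependent noise that breaks the Gaussian assumption behind regression.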
So when you fit your data to these linear models, you're projecting them onto some line, and that projection is going to have a distribution associated with w transpose x. What you want is to make the mean of the distribution of one class as far away as possible from the mean of the distribution of the other class. So for example, take this data and let me pick another line. If I pick a line that looks like this, then when I project my data points onto it, I get distributions that look like this, and you notice they have greater overlap and the centers are closer to each other than with the first line. So in some sense, the first line is a better way of separating my data than this one. And so what Fisher discriminant analysis does is basically find the equation for the line such that when you take your data and project it onto it, the classes are as far apart as possible. Now, it's not good enough by itself for these means to be as far apart as possible. What you also want is for each distribution to be as tight as possible. So the means should be far apart, but the variances should be small. That's what discriminant analysis is: it defines this w so that the projected means of the data are as far apart as possible relative to the variances. So what does the Fisher discriminant look like? We have some class, call it y = 0, where we have n0 data points, and there's a mean mu0 associated with it. What I mean by this mean is a vector: 1 over n0 times the sum, over all i belonging to class 0, of x of i. So mu0 is a vector which represents the mean of my data that belongs to class 0. Okay, that's what I mean by mu0.
And I also have some variance, sigma0, associated with that data, which is going to be 1 over (n0 minus 1) times the sum, over i belonging to class 0, of (x of i minus mu0) times (x of i minus mu0) transpose. So I have some labeled data, like the blue points there. Each point is a vector; I find the mean of those vectors, and I find the covariance of those vectors, okay? That's what mu0 and sigma0 are. And similarly I have class 1, where I have n1 data points with mean mu1 and covariance sigma1. So for all the data points that belong to each class, we find the mean and the variance, okay? All right, so what Fisher says is that you want to find the vector w such that, after you project your data onto this vector, so basically you have this model that says y hat is equal to w0 plus w transpose times x, this is my model, the centers of the projected data are as far apart as possible between the two classes. So the expected value of y hat, given that x belongs to class 0, is going to be w0 plus w transpose mu0, and the variance of y hat, given that x belongs to class 0, is going to be w transpose sigma0 w. So this is the mean of my data when it's projected onto this line, right? And this is the variance of that data when it's projected onto this line. When I take my x and multiply it by this w, I'm projecting it someplace, and that projection is going to have a mean and a variance. A student points out a notation slip: yeah, exactly, I shouldn't put this under it, but that's what I mean. Note that sigma0 is a matrix, while w transpose sigma0 w is a number, a scalar. And similarly, the expected value of y hat, given that x belongs to class 1, is just going to be w0 plus w transpose mu1.
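These per-class statistics and the projected mean and variance can be sketched in a few lines of numpy; the class locations and the direction w below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Labeled 2-D data for the two classes
X0 = rng.normal([0.0, 0.0], 1.0, size=(300, 2))   # class 0
X1 = rng.normal([4.0, 1.0], 1.0, size=(300, 2))   # class 1

# Per-class sample mean (a vector) and covariance (a matrix)
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sigma0 = np.cov(X0, rowvar=False)   # divides by n0 - 1, as in the lecture
Sigma1 = np.cov(X1, rowvar=False)

# Projecting onto a direction w gives a scalar with
# mean w^T mu and variance w^T Sigma w
w = np.array([1.0, 0.5])
proj_mean0 = w @ mu0
proj_var0 = w @ Sigma0 @ w

# Empirical check: project the raw data first, then take mean/variance
check_mean0 = (X0 @ w).mean()
check_var0 = (X0 @ w).var(ddof=1)
```

The two routes agree exactly, which is the point: once you know mu and Sigma, the projected mean and variance follow for any w without touching the data again.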
And that variance is going to be w transpose sigma1 w. All right, so what Fisher says is that you want to find the w that maximizes a function. The numerator of that function is the difference between the two means of the projected data, w transpose mu0 minus w transpose mu1, squared. Of course, it doesn't matter which order I write the difference in; it means the same thing. I want to make those as far apart as possible. And at the same time I want to make the variances as small as possible, so the denominator is n0 times w transpose sigma0 w plus n1 times w transpose sigma1 w, weighted by the counts n0 and n1, not 1 over them. Depending on how many data points you have, this is saying that if I have a whole lot of data points that belong to class 0, then that term matters more than the other one. Okay? All right, so what we want to do is maximize this ratio. And it's interesting to ask: how does one find a w that maximizes this scalar quantity? I want to show you a little trick that helps us do it. So obviously this is different from everything else we've seen so far, because our cost function is not linear in w, right? I have this nonlinear function of w here, and when I take the derivative, it's not going to be simple to maximize. So finding a derivative and maximizing it is not going to work. A better way is as follows. Let me simplify and rewrite the numerator: (mu0 transpose minus mu1 transpose) times w, all squared. So far I haven't done anything interesting. Let me also factor the w's out of the denominator, like this, and let me call this difference between the means m.
So let me define m as the difference between the means. Then the numerator is (m transpose w) squared, just a scalar quantity. Down in the denominator, let me define a matrix S equal to n0 sigma0 plus n1 sigma1. These are covariance matrices, so S is going to be a positive definite symmetric matrix, and the denominator is w transpose S w. So we have a w in the numerator and a w in the denominator, and the question is: can we represent this in a way that lets us easily maximize over the vector w? The idea is as follows. Given how S is defined, I can always write S as the product of two matrices. These are called square root matrices: there exists some matrix R such that S = R transpose R, so R is in a sense the square root of S. That means I can write the ratio as (m transpose w) squared over w transpose R transpose R w. All right, that doesn't seem very interesting so far. Let's now see if we can represent w in a way that gets rid of the R's in the denominator. Suppose I introduce a vector v equal to R times w; I'm going to map w through R to make some new vector v. So let's see what happens. If I do that, then w, of course, is equal to R inverse v. See what happens in the denominator now: I have w transpose, which is v transpose R inverse transpose, times R transpose, times R, times w, which is R inverse v. This cancels, this cancels, and I get v transpose v. All right. So what's v transpose v? Well, that's just the magnitude of v squared, right? So what I end up doing is representing v in terms of a normalized vector, v divided by its magnitude: a scalar times a unit vector. Just to make it a little bit easier for us to see, let me write the numerator like this.
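One concrete way to get such a square-root matrix for a positive-definite S is a Cholesky factorization; the matrix below is just a small made-up example:

```python
import numpy as np

# A positive-definite symmetric matrix S (e.g. a weighted sum of covariances)
S = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Cholesky gives a lower-triangular L with S = L L^T; taking R = L^T
# yields the square-root form S = R^T R used in the derivation
L = np.linalg.cholesky(S)
R = L.T
reconstructed = R.T @ R
```

Any R with R transpose R = S works for the argument; Cholesky is simply a convenient one for positive-definite matrices.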
Instead of writing it as m transpose R inverse times v, I'm going to write it as (R inverse transpose times m) transpose times v, because that's the same thing. So this is a product of two vectors: a vector transpose times another vector. What I want to do is maximize this projection, and the way I do it is by making v point in the same direction as that quantity. So the vector v that maximizes this is v = a times R inverse transpose times m, where a is some arbitrary scalar. It's a projection of two vectors, right? When v is in the same direction as R inverse transpose m, the projection is as large as possible. So that's v. So what's w? w = R inverse v, so that's equal to a times R inverse times R inverse transpose times m, where m is mu0 minus mu1. And R inverse times R inverse transpose is (R transpose R) inverse, and R transpose R is just S. a is arbitrary, so let me just set it equal to 1. So w = S inverse times (mu0 minus mu1), where S is the weighted sum of the two variances, n0 sigma0 plus n1 sigma1, and this is a matrix inverse, not a 1 over. So the w that maximizes the distance between the projections is the difference in the means divided, in the matrix sense, by the sum of the variances. That's w. Now, our model was a little bit more complicated than that: our model also had a w0, right? This was our model. I just found w for you; what about w0? w0 is going to be the difference between the expected value of y and w transpose times the expected value of x, which is 1 over n times the sum over i of (y of i minus w transpose x of i). So we use this equation to find the Fisher w, and then the average difference between the observations y and w transpose x gives you w0.
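Putting the pieces together, here is a sketch of the whole Fisher discriminant computation, w = S inverse (mu0 - mu1) plus the intercept w0; class locations, spreads, and counts are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two labeled classes with unequal spreads
X0 = rng.normal([0.0, 0.0], [0.5, 0.5], size=(200, 2))
X1 = rng.normal([3.0, 2.0], [1.5, 0.7], size=(150, 2))
n0, n1 = len(X0), len(X1)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sigma0 = np.cov(X0, rowvar=False)
Sigma1 = np.cov(X1, rowvar=False)

# Fisher direction: w = S^-1 (mu0 - mu1), with S = n0*Sigma0 + n1*Sigma1
S = n0 * Sigma0 + n1 * Sigma1
w = np.linalg.solve(S, mu0 - mu1)

# Intercept: w0 = mean(y) - w^T mean(x), averaged over all points
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n0), np.ones(n1)])
w0 = y.mean() - w @ X.mean(axis=0)

# Separation of the projected class means relative to the projected spreads
gap = abs(w @ mu0 - w @ mu1)
spread = np.sqrt(w @ Sigma0 @ w) + np.sqrt(w @ Sigma1 @ w)
```

With this w the projected means should sit well apart relative to the projected standard deviations; the arbitrary scale a is fixed at 1, since only the direction of w matters.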
That's going to be the Fisher discriminant model. And what it does is maximize a particular cost. What we did is find the w such that, after the data points were projected onto the direction described by this w, the means were as far apart as possible and the variances, the weighted sum of the two variances, as small as possible. That's what the cost did. Okay. All right. A student offers an interpretation: really what this is doing is taking the axis between the two means and sort of rotating it by a factor that represents the skewness of the variances of the two clouds. If you just used the difference in the means, it would just be a line going between the means of the two clouds, and the matrix added on the front of it is like a rotation matrix that rotates that axis to accommodate the skewness of the variance. That's a good point, nicely put. Yeah, the first term is like a rotation of the axis between the means. Good point, I like that; I hadn't thought about that geometric interpretation. Nice. All right. That was a frequentist approach; let me now show you a different way of classifying, using a Bayesian approach. In the Bayesian approach, what we assume is that when we get data like that, we can form a likelihood function, and from the data we can also formulate a prior. Based on those, we can then compute the Bayes posterior probability. The point is to show you simple ideas about what the shape of the posterior probability looks like, and what the uncertainty about your classification looks like, when the variances are equal or unequal. So let's do some examples of this to get a sense of it. So, Bayes classification. Our data is labeled and comes as pairs (x1, y1) and so on, where x is a vector and y belongs to some class, 0 or 1. Let's write it as a set.
So when we formulate the problem as Bayes classification, what we have is the probability of our class y being equal to, say, class 1, given that we have some x. And that is, of course, the probability distribution of x given that I'm in class 1, times the prior probability of being in class 1, divided by the marginal, which is just the probability of x. And this probability of x is going to be equal to the sum of two terms: the probability of x given that y equals 1, times the probability of y equals 1, plus the probability of x given that y equals 0, times the probability of y equals 0. So this is the marginal, this is our likelihood, this is our prior. OK. All right. So let's do a simple classification. Suppose that my data set is height: we have the heights of some individuals, and we want to classify them as being male or female. Say 0 represents female and 1 represents male. So let's begin with our likelihood: the probability of x given that you are male. So here's height on the axis, and the probability of a particular height, given that you're a male, maybe looks like this. And then you also have the probability of height given that you're a female, and the mean is going to be, as we know, somewhat smaller. Suppose the two variances of our likelihood functions are equal. These are my probability of x given y equals 1 and probability of x given y equals 0. So what am I doing? I'm multiplying each by its prior probability. Say that I have equal probability of seeing a male and a female; then each just gets multiplied by some number, 0.5 in this case. And then I divide by this quantity here, which is p of x, the marginal.
Well, the marginal is just the sum of those two things, the likelihoods times the prior probabilities, so p of x is going to look something like this, right? All right, so now if I take this red curve and divide it by the blue one down here, I'm going to get a function that looks like this. If I take the blue and divide it, I get a function that looks like this. And this is now the probability of male given a particular height. And you notice that the place where these two cross is your decision boundary, and it's going to be the same place as where the scaled likelihoods cross. So in principle, your decision boundary is where the ratio of (the probability of x given y equals 0, times the prior probability of y equals 0) to (the probability of x given y equals 1, times the prior probability of y equals 1) is equal to 1. The numerator and denominator are equal to each other, and at that point I don't know which class I'm in. On the other hand, if the numerator is bigger than the denominator, then I'm going to be in class 0; if the denominator is bigger, then I'm going to be in class 1. My decision boundary is where those two probabilities are the same. So that's pretty simple. Do you have any questions about this? So far, all right, that's pretty easy. But now let's consider a condition where the variances are unequal. So suppose my likelihoods look like this. Here's height, the x variable, and suppose my males have some really wide distribution in height: I have some really short males and some really, really tall males. On the other hand, the females are like this: much lower variance in the females as compared to the males.
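The equal-variance case just described can be sketched numerically; the means, standard deviation, grid, and units below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Gaussian likelihood p(x | class) with mean mu and std sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Heights in cm: male (class 1) and female (class 0), EQUAL variances
x = np.linspace(140.0, 200.0, 1201)
lik_m = gauss_pdf(x, 178.0, 7.0)    # p(x | male)
lik_f = gauss_pdf(x, 165.0, 7.0)    # p(x | female)
prior_m = prior_f = 0.5             # equal priors

# Marginal p(x) and posteriors via Bayes' rule
p_x = lik_m * prior_m + lik_f * prior_f
post_m = lik_m * prior_m / p_x
post_f = lik_f * prior_f / p_x

# With equal variances and priors, the single decision boundary
# falls at the midpoint of the two means
boundary = x[np.argmin(np.abs(post_m - 0.5))]
```

The two posteriors sum to one at every x, and the boundary lands halfway between the means, exactly where the two scaled likelihoods cross.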
So what I'm plotting here are the likelihoods, p of x given the class. Okay, so what is the marginal probability, what does p of x look like? Suppose again that my prior probability is 0.5, so the classes are equally likely. I multiply these curves by 0.5 and add them up, and of course what I'm going to get has a long tail. All right, so you notice that if you look at this function, I have two decision boundaries now. Because if my height is bigger than this number here, I'm going to say it's a male; and if the height is less than this other number, I'm also going to say it's a male; only in between am I going to say it's a female. So let me now plot the posterior, the probability of y equals male given x: it's going to look like this. This is the probability of y equals 1 given x. And the probability of being female given x is going to look like this. So all I'm doing is plotting the posterior probabilities. A student asks what the prior probability does. Yeah, if the priors are unequal, then the two curves get weighted by different numbers. So if the prior probability of y equals 1 is large, say you have a 90% chance that what you're looking at is a male, then what that means is that this curve gets pushed up and this one gets pushed down, because each likelihood gets multiplied by its prior probability before it's divided by p of x, the marginal. So what the prior probability does is increase the weight of this curve before you add it to the other: it multiplies it by something, and it pushes down the other one if they're unequal. So if my prior probability is large for y equals 1, then my decision boundary comes in, because the prior is pushing up the blue line. All right. So, uncertainty of my classifier. These are my posterior probabilities.
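The unequal-variance case, with its two decision boundaries, can be checked the same way; again, all the numbers below are made-up illustrative choices:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# UNEQUAL variances: males widely spread, females tightly clustered
x = np.linspace(120.0, 220.0, 2001)
lik_m = gauss_pdf(x, 178.0, 15.0)   # wide male likelihood
lik_f = gauss_pdf(x, 165.0, 4.0)    # narrow female likelihood

p_x = 0.5 * lik_m + 0.5 * lik_f     # marginal with equal priors
post_m = 0.5 * lik_m / p_x          # posterior P(male | x)

# The posterior crosses 0.5 twice: "male" wins for both very short
# and very tall heights, "female" only in between
decide_male = post_m > 0.5
crossings = int(np.sum(decide_male[1:] != decide_male[:-1]))
```

The wide Gaussian dominates in both tails while the narrow one dominates near its own mean, which is exactly why two boundaries appear.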
So I can tell you how probable it is that I'm in class 1 or class 0 based on this posterior probability. But what about my uncertainty? What I want to know is basically: what's the variance associated with my estimate given x? The posterior gives the probability, but what about the variance? And to give you intuition, you would sort of guess that my uncertainty is going to be highest right around my decision boundary, basically right around the place where the two probabilities are close to each other; that's going to be the place where I'm most uncertain. And as I get closer to the edges, that's going to be the place where I'm most certain. So what I want to show you is how to compute that variance, your uncertainty about your classifier. And here's how we're going to do it. So I have a probability of y being equal to class i given x; that's what I just showed you how to compute. And that gives me a distribution p of y given x, which has two numbers: a probability at y equals 1 given x, and a probability at y equals 0 given x. That's what the distribution is. Okay, so what is the variance of this distribution? So I have the expected value of y given x, which is the probability of y equals 1 given x times 1, plus the probability of y equals 0 given x times 0, and this is just equal to the probability of y equals 1 given x. That's the mean of my distribution. And then the expected value of y squared given x is also equal to this. So the variance of y given x is going to be equal to the probability of y equals 1 given x, minus the probability of y equals 1 given x quantity squared.
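This classifier uncertainty, p(1 - p), can be computed directly on the equal-variance height example from before (same illustrative, made-up numbers):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Equal-variance height example with equal priors
x = np.linspace(140.0, 200.0, 1201)
num_m = 0.5 * gauss_pdf(x, 178.0, 7.0)
num_f = 0.5 * gauss_pdf(x, 165.0, 7.0)
p = num_m / (num_m + num_f)            # p = P(Y = 1 | x)

# Classifier uncertainty: Var[Y | x] = p - p^2 = p * (1 - p)
uncertainty = p * (1 - p)

# The variance peaks at the decision boundary, where p = 0.5
peak_x = x[np.argmax(uncertainty)]
```

The maximum possible value of p(1 - p) is 0.25, reached exactly where p = 0.5, i.e. at the decision boundary, matching the intuition above.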
So let me plot this relationship, the probability of y equals 1 given x minus its square, for the two examples I showed you. Let's plot it for the first one. I'm going to plot the variance of y given x, which is this quantity, the probability of y equals 1 given x minus the probability of y equals 1 given x squared, as a function of x. What I'm going to see is that this function looks like this, with a maximum value right at the decision boundary. If I do it for the other example, plotting the same quantity, it's going to look like this: my uncertainty is going to be greatest right here. That's basically the variance of y given x, my uncertainty about my classification. So now the final topic, a little topic here: what's going to be your error rate when you do your classification? What do we mean by how good your classifier is? You can tell me this by looking at your misclassification. Let's look at this plot on top. If I plot the likelihoods multiplied by the priors, then the misclassification is going to be right here. The area underneath this curve tells me the probability of picking class zero when in fact it was class one, or the opposite. So the sum of these areas, from one side of the boundary to the other, is going to be my error probability, so let me write that down for you. Well, maybe it's easier here. This area here is the integral, over the region r0, of p of x given y equals 1, times the probability of y equals 1, dx.
This area is where I'm going to make a mistake in classifying something: if I fall in here, from here to here, I'm going to say it belongs to this red region, but in fact it belongs to the blue region, so I've misclassified it. Plus the integral, over the region r1, of p of x given y equals 0, times the probability of y equals 0, dx, this region here. This is my error rate, the Bayes error: it's going to be the sum of these two terms, because in both I've misclassified my data. That's what this error rate is. What you do is take your likelihood p of x given y equals 1, times the prior probability that y equals 1, and integrate it over the region r0 where you are classifying points as the other class; plus p of x given y equals 0, times the probability of y equals 0, integrated over the region r1 that you're calling class 1. It's the area underneath those two curves, the blue and the red, just adjacent to the decision boundary. So for the unequal-variance example, the error rate is as follows. Take the red curve; that's how I made my decision. When my height is over here, I say it's a male; when the height is over here, I also say it's a male; only when the height is here do I say it's a female. So my error rate has to do with this area underneath the red curve, and this area out here; that's where I misclassify, right? And then, in addition, I misclassify here where I chose blue. The sum of those three areas is my error rate. That's the error rate of your Bayes classifier. Okay? Yeah? A student asks: can we select where to separate the regions? Okay, so you have to know your decision boundary first, and the decision boundary comes where the ratio of the likelihoods times the priors is equal to one. So what does this mean? Okay.
So in this case, what I've drawn are the likelihoods: the likelihood p of x given y equals 0 and the likelihood p of x given y equals 1. That's my likelihood function. Suppose that the prior probabilities are equal; if they're not equal, they're just scaling these curves. So you have a prior probability and you have a likelihood, right? I've plotted the likelihood. Say again? Yeah. So your decision boundary is going to be where they become equal, and of course you want to pick whichever function is larger: as soon as they cross each other, you switch to the function that's larger. So in all of this, here you're going to pick the blue, here you're going to pick the red, here you're going to pick the blue. That's because it gives you a higher posterior probability, right? Because the posterior probability is just these quantities divided by some constant. Here it is. So it really doesn't matter; all that matters is which one is bigger. A student asks: what if the two means are the same, so I can't separate them? Oh, no, no, this still works. If you have two means that are exactly the same, I can still use Bayes. Let me show you: here's one likelihood, and here's another likelihood with exactly the same mean but a different variance, right? I can still form posterior probabilities: I'm going to pick this one here, and here are my decision boundaries. So Bayes still works; Fisher doesn't work here, because the numerator, the difference in the means, is going to become zero in this case. Yeah. Good question. Okay, did you guys understand the error rate? So the error rate for this function is going to be this area, right, because I'm misclassifying things there, and it's going to be this area, the integral of those two things. Good. And these are probabilities, right? So x can be a vector; I don't care. These are just probabilities described over a many-dimensional space. It doesn't matter. All right. Thank you so much.
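The Bayes error described above can be checked numerically on the equal-variance height example (the same made-up illustrative numbers as before): decide each x by whichever likelihood-times-prior is larger, then integrate the losing curve over each decision region.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Equal-variance, equal-prior height example
x = np.linspace(100.0, 240.0, 20001)
dx = x[1] - x[0]
f1 = 0.5 * gauss_pdf(x, 178.0, 7.0)   # p(x | male)   * P(male)
f0 = 0.5 * gauss_pdf(x, 165.0, 7.0)   # p(x | female) * P(female)

# Decide whichever scaled likelihood is larger; the Bayes error is the
# integral of the losing curve over each decision region
decide_male = f1 > f0
bayes_error = float(np.sum(np.where(decide_male, f0, f1)) * dx)
```

For two equal-variance, equal-prior Gaussians this numerical integral matches the closed-form error, the Gaussian tail probability beyond half the distance between the means.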