Hello and welcome to lecture six of our introduction to machine learning course. Today we're going to talk about linear discriminant analysis, which is another linear classification method in addition to the logistic regression that we discussed last week. In fact, there are several different ways to approach the linear classification problem. For example, if you open The Elements of Statistical Learning, one of the textbooks that I always recommend, you will find a chapter called "Linear Methods for Classification", and it consists of three parts: one part is about logistic regression, another is about linear discriminant analysis, and the third, called "Separating Hyperplanes", talks about methods like the perceptron and linear support vector machines. This is interesting because the previous chapter is called "Linear Methods for Regression" and it only talks about linear regression. I don't have a very good answer for why there are more approaches to linear classification than to linear regression; part of it might just be tradition, and to some extent it could be that the fact that the variable we're predicting is categorical allows several natural approaches.

We talked about logistic regression last week, and there the whole point is to model the probability of each class given the observed predictors: you're predicting the probability that an observation is, say, class one as opposed to class zero. In linear discriminant analysis, which we cover today, we will instead primarily model the distribution of the data given the class; we will discuss exactly what that means on the next slides. The methods that can be called separating hyperplanes directly optimize the linear decision boundary without a probabilistic model, so these are not probabilistic models; we will not talk about them in this course, simply for lack of time, since we would need another lecture for that.

Here is the same thing as a schematic: as I said, logistic regression can be seen as modeling the class distribution given the data, while LDA can be seen as modeling the distribution of the data within each class. Let's try to understand this difference better. Of course, if we're doing classification, what we're actually interested in is the probability of each class: in a binary problem, the probability that it's class one and the probability that it's class zero given a data point x. This is what logistic regression estimates directly. In this lecture we will approach the same thing the other way around; it's a bit of a roundabout way. We will assume some model for each class, a probability distribution of X in class zero and a probability distribution of X in class one. Assuming this model, and assuming some prior probabilities for class zero and class one, we can use Bayes' rule to get the probability of a class given an observed data point, and this is what we want, right?
So we get to the same object as in logistic regression, but in a roundabout way, via Bayes' theorem. Here's how it works. This is Bayes' theorem written out for you. We have the object that we want, the probability that a given sample x belongs to class k out of potentially multiple classes, and Bayes' theorem gives it as the probability the other way around, the probability of observing x given that it belongs to this class, times the prior probability of class k, with a normalization factor in the denominator that sums the same term over all possible classes. We talked about that in one of the previous lectures. In fact, the denominator is not very important if you just want to choose the class, because it is the same for all classes; so if you just want to assign x to one of the classes, you need to find the class for which the numerator is the largest. I will introduce this notation here: the prior of class k is denoted π_k, and f_k(x) is the probability density function of the data in class k.

You may wonder, in connection with the analogy to linear regression that I mentioned earlier: in linear regression, if you go back to one of the first lectures, we simply modeled the probability of y, a continuous variable, given x. One can ask whether the same trick works there: postulate a probabilistic model of x given y, with y continuous, put some prior over y, and then use the observed data to estimate the most likely y, the same roundabout way via Bayes' theorem. I'm just posing this question here, and we might get back to it in one of the last lectures of this course.

In today's lecture we will be talking exclusively about Gaussian densities. The Bayes' theorem you saw on the previous slide applies to any densities; here we use Gaussians, multivariate Gaussians, because we can have multiple features. This is the probability density written here for a multivariate Gaussian, and I will use the letter p to denote the dimensionality of the predictors. The entire factor in front is just a normalization so that the density integrates to one, as probability densities should. The more interesting objects are the mean μ and the covariance matrix Σ, and we can have a separate Gaussian for each class; that's why they are indexed by k. So here's an example. Imagine we have two predictors, x1 and x2, nothing else, and there are three classes, and let's assume that we know the true parameters of all the Gaussians: we know the μ's, the Σ's, and the priors. This one is a spherical Gaussian: x1 and x2 are uncorrelated and have the same variance. Here the covariance matrix is diagonal: the variance of x2 is larger than the variance of x1, but there's no correlation, so the off-diagonal entries are zero. And this third covariance matrix has some correlation in it. Now you observe a point, say the star here, and you ask: which class does this point belong to?
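Here is a minimal numerical sketch of exactly this computation (my own illustration, not code from the lecture): it evaluates the numerator of Bayes' rule, f_k(x) times π_k, for each class and normalizes. The three classes and their parameters are made-up examples.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters for three classes in two dimensions (illustration only).
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs = [np.eye(2),                           # spherical: equal variances, no correlation
        np.diag([1.0, 4.0]),                 # diagonal: unequal variances, no correlation
        np.array([[1.0, 0.8], [0.8, 1.0]])]  # correlated
priors = np.array([0.2, 0.3, 0.5])

def posterior(x):
    # Numerator of Bayes' rule: f_k(x) * pi_k for each class k.
    likelihoods = np.array([multivariate_normal.pdf(x, mean=m, cov=S)
                            for m, S in zip(means, covs)])
    unnormalized = likelihoods * priors
    # The denominator only normalizes, so the posteriors sum to one.
    return unnormalized / unnormalized.sum()

x_star = np.array([1.0, 1.0])   # the "star" test point
print(posterior(x_star))        # posterior probability of each class; argmax gives the prediction
```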
Bayes' theorem tells you exactly how to compute this: if you know all these parameters, written here as Greek letters, and you know the position of the point, you can find the probability that it belongs to this class, the probability that it belongs to that class, and the probability that it belongs to the third class, and they will sum to one. In this case it most likely belongs to this class, but you just plug the point in and compute the probabilities. Why do you need the priors? Imagine that this prior is very low. It means that most of the points do not come from this class; it is a very rare class. Say the prior is 1%: there are very, very few points of this class in your data. Whereas here the prior is, say, 49%, and here is the remaining 50%. Then it will actually be most likely that the point came from here: the probability density may be a bit larger for the rare class, but it's very unlikely that the point comes from it, because the class is so rare, so it's much more likely that the point came from the other class, given its high prior. That's why you need both things, the prior and the likelihood, and multiplying them together you get the actual probability, the posterior. That's how Bayes' theorem works.

Okay, that was the conceptual picture; let's now talk through the math. For simplicity I don't want to talk about the priors anymore, so I will just assume that the priors are all the same. This doesn't change the math much, it just simplifies it a little, because we don't need to carry the priors with us. And let's consider a binary classification problem: from now on, throughout this lecture, we'll talk about the binary problem with two classes. That's again the multivariate density, and what is helpful is to write down the equation for the so-called decision boundary, which is the line, or hyperplane, or more generally the surface, separating the part of the predictor space where the probability that a point belongs to class one is larger than 50% from the part where the probability that it belongs to class two is larger than 50%. We talked a lot about this last week: if we actually want to make binary predictions, we don't necessarily have to say that whichever class has probability over 50% wins. Depending on our goals we might want to reduce the false positives, or increase the true positive rate, and so on, so we might choose a different cutoff or threshold; see the last lecture for this discussion. But we will ignore this for now, so let's say our threshold is 50%. Then the decision boundary is given by this equation: whenever the probability of class one equals the probability of class two, the point lies on the decision boundary. Let's just work it out. Here is our density, and since the priors are the same, we just need to equate this expression for class one with the same expression for class two. As often, we have exponentials here, so it's convenient to take logarithms: if the probability of class one equals the probability of class two, we can take the logarithm on both sides.
That's what I'm doing here: taking the logarithm of the entire expression for class one, I get the factor with 2π, which cancels because it's the same on both sides, and then there is the factor with the determinant of the first covariance matrix and of the second covariance matrix, but these are just numbers, so this is just a constant. Then come the most interesting terms, and this is an equation, the equation for the decision boundary. If you look at it more closely, you will see that it is a quadratic function: this is just a number, this is another constant, and if you open the brackets, whenever you have μᵀΣ⁻¹μ that is also a constant, and whenever you have xᵀΣ⁻¹x that is a quadratic function of the coordinates of x. So what is written here is, perhaps, a complicated way, through matrices, of writing down a quadratic polynomial in the coordinates of x. That's why this is called quadratic discriminant analysis. So what we get (it's a binary problem, so we have class one and class two, here is mean one, here is mean two, here is covariance one, and a different covariance matrix in class two) is a decision boundary that is a quadratic function: some parabola, or perhaps a circle, some quadratic curve in this two-dimensional plane.

But this is not linear discriminant analysis, and I promised you that we're going to talk about linear discriminant analysis. So how do we get from this QDA, which is actually rare (at least in my experience it's rarely used in practice), to LDA, which is used very often? The trick is to make an additional assumption. QDA was a very general setting: whenever you know that your class densities are Gaussian and you make no further assumptions, quadratic discriminant analysis just follows mathematically. To get LDA we need one more assumption, namely that the class covariances are the same: the covariance matrix in class one equals the covariance matrix in class two. Of course the means are different, one class sits here and another sits there, but they have the same covariance matrices. And then this simplifies a lot. Let's see what happens if I replace Σ₁ and Σ₂ with a single Σ: the determinant term cancels out, which is very convenient, and then you can open the brackets. The first thing you will notice is that there is a term xᵀΣ⁻¹x on this side and xᵀΣ⁻¹x on that side. This is the quadratic term, the thing that makes QDA quadratic, the Q in QDA, and it now cancels because Σ is the same on both sides. So again, great: we cancel the quadratic term. What survives are the linear terms: here you have xᵀΣ⁻¹μ₁ and here you have xᵀΣ⁻¹μ₂, and these are different. This is just the same thing rewritten; now I can collect all the terms with x on the left and all the terms without x on the right.
On the right we have just some constant, and on the left we have a function of x which is actually a linear function. I will divide both sides by two, and then this is the equation of the decision boundary. That's what we're computing here: this is the decision boundary of linear discriminant analysis, and it is linear. We have x transpose times something, and that something is just some vector; for the moment we can think of it as some vector, because if you know Σ and the μ's you plug them in and compute it. So what we have is x times a vector equals a constant. This is a linear projection of x onto the vector Σ⁻¹(μ₁ - μ₂), the inverse covariance times the difference between the means, and it corresponds to a linear decision boundary.

Let's look at a picture of how this works. From now on, whenever I draw a picture for linear discriminant analysis, I will try to draw it so that the covariances are the same, because this is our assumption from now on. So here's class one, here's class two, and the decision boundary is a line. Why, how can you see mathematically that this is a line? Make sure you understand why it follows from this equation. Take this vector (the one without the x) and say it points in that direction; the exact direction doesn't matter. Now for any point x we compute the scalar product with this vector: we can say that we project the point onto the vector and multiply the two lengths. Whenever this product equals the given constant, the point lies on the decision boundary; whenever it's larger, it's class one; whenever it's smaller, it's class two. So you project the points onto this axis, there is some threshold, and if a point falls on one side of the threshold it's class one, and on the other side it's class two. The decision boundary is therefore a line perpendicular to this vector, crossing it at the point corresponding to the threshold. Just to say it again, to make it clear: it is not an assumption here that the decision boundary is linear. The fact that it is linear follows from our assumption that the covariances are the same. Whenever the true covariances are the same, the optimal decision boundary is a line.
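Here is a minimal sketch of this decision rule (my own summary of the formulas above, with made-up numbers), assuming equal priors, a shared covariance Σ, and known class means:

```python
import numpy as np

def lda_direction_and_threshold(mu1, mu2, Sigma):
    # w = Sigma^{-1} (mu1 - mu2): the vector onto which x is projected.
    w = np.linalg.solve(Sigma, mu1 - mu2)
    # With equal priors the boundary passes through the midpoint of the means,
    # so the threshold is the projection of (mu1 + mu2) / 2 onto w.
    c = 0.5 * w @ (mu1 + mu2)
    return w, c

def lda_predict(x, w, c):
    # Class 1 if the projection exceeds the threshold, otherwise class 2.
    return 1 if w @ x > c else 2

# Hypothetical numbers, purely for illustration.
mu1, mu2 = np.array([2.0, 0.0]), np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
w, c = lda_direction_and_threshold(mu1, mu2, Sigma)
print(lda_predict(np.array([1.5, 0.2]), w, c))
```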
Okay, let's think a little about the role of this inverse-Σ factor. In fact, before doing any of this math I could have asked you what you think the best decision boundary is, and one could naively think that we can simply project onto the line that connects the two centroids of the classes. So that's the same picture, the same two Gaussians: here's my μ₁, here's my μ₂. Imagine the inverse of Σ were not there; let's just pretend it isn't. Then this would be the difference μ₁ - μ₂, this is the line orthogonal to it, and this would be our decision boundary. And it's clear that it's not actually very good: it will classify points here correctly, and points there correctly as well, but points over here end up on the wrong side of the decision boundary; they are actually more likely to come from this Gaussian than from that one. What the Σ⁻¹ factor does is correct for the covariances. The reason the naive boundary is wrong is that the covariances are stretched in this direction, and this factor accounts for that: it finds a different vector, and the decision boundary orthogonal to it, and that is a much better decision boundary, one that obviously separates these Gaussians much better in this particular case. So it works out; I think it's intuitive, and the math works out nicely.

I just said that it's the fact that the covariances are stretched in a particular direction that makes it not always optimal to use the line connecting the two centroids, with the orthogonal line as the decision boundary. But in some cases it is optimal, and the condition for that, or one of the conditions, is that the covariances are not stretched in any direction, that they are spherical. Whenever the covariance matrix is proportional to the identity matrix, a diagonal matrix with the same value σ² everywhere on the diagonal, it is called a spherical covariance matrix, and in this case linear discriminant analysis in fact reduces to something that is sometimes called the nearest centroid classifier. Let's look at the picture. Now my covariance matrices are spherical: in two dimensions each is just a circle, which can be large or small depending on the variance, but it's a circle, and I have two of them. Here you don't need the Σ⁻¹ correction, because the inverse of (a multiple of) the identity matrix is again (a multiple of) the identity, so it doesn't change anything, and you get a decision boundary orthogonal to the line connecting the centroids. This also means that to classify a point, say this one, you can note that it's on the right side of the decision boundary, but you can equivalently look at the distances to this centroid and to that centroid and pick the class whose centroid, whose mean, is closer; for this point it will be that class. So the decision rule is the same as the one this decision boundary gives. There is one small difference, though: the nearest centroid classifier, as people usually talk about it, is a non-probabilistic decision rule. It just says: you have your point, you know where the class means are, you measure the distance from your point to each class mean, and you choose the smallest; it gives you the answer, your best guess of the class, but it does not tell you the probability of this class versus another. This "spherical LDA", LDA with the assumption of a spherical covariance matrix, is equivalent in the sense that it makes the same prediction, but it additionally lets you compute the probability through exactly the same machinery, with Bayes' rule and everything; you just need to plug this particular Σ in there.
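A minimal sketch of the nearest centroid rule described above (my own illustration, not code from the slides): pick the class whose mean is closest, with no probability attached.

```python
import numpy as np

def nearest_centroid_predict(x, centroids):
    """centroids: dict mapping class label -> class mean vector."""
    dists = {k: np.linalg.norm(x - mu) for k, mu in centroids.items()}
    return min(dists, key=dists.get)   # closest class mean wins; no probabilities here

# Example with hypothetical means:
print(nearest_centroid_predict(np.array([0.4, 1.2]),
                               {1: np.array([0.0, 0.0]), 2: np.array([2.0, 2.0])}))
```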
So maybe it's not exactly correct to say that it reduces to the nearest centroid classifier, but in some sense it does: in terms of the binary prediction it does, under the assumption that the priors are the same. If the priors were different, the line would still be orthogonal, but it would shift; it would not cross this connecting line in the middle. It crosses in the middle only because the priors for both classes are 50%. If it's much more likely that points come from this class, because this class is much more numerous, then the decision boundary moves in that direction, and then it is not the nearest centroid classifier anymore; that's another difference. And I put "spherical LDA" in quotes because I don't think it's a standard term, but one could call it that.

Okay. So far, everything I said about QDA, LDA and spherical LDA assumed that we know the true covariances, the true means, and also the true priors for each class. Of course, in any real situation you don't know them: you observe some training data and you want to fit the model on it, without knowing a priori what the Σ's, the μ's and the π's are. So how do you do that? Well, you can, and have to, estimate all these parameters from your training data. I didn't dwell on this because it's very easy: we just use the standard formulas for estimating the parameters of a Gaussian density. If you have a bunch of points that you know belong to one class, say class one (it's training data, so you know the labels), you simply fit a Gaussian to them. That's all. You fit the mean as the average, which is the maximum likelihood estimate; we talked about that in previous lectures and you did exercises on it. And you fit the covariance by summing the squared deviations from the mean and averaging them. This can be done either without the minus one, which gives the maximum likelihood estimate of the covariance, or you can correct the bias by subtracting one; it doesn't matter much for today's lecture. You estimate the parameters of these Gaussians (that's why I'm putting hats here, because these are estimates from the training data), and then you use the hatted values in all the formulas from the previous slides to compute your prediction for test data. There is one ingredient still missing, and that is the prior, but it is in fact the simplest thing to compute. If there are two classes and your training set has, say, 1000 points, of which 600 are from class one and 400 from class two, then you estimate the prior of class one as 0.6 and the prior of class two as 0.4.
It's as easy as that. If you do this for each class, for class one and for class two, then you are in the QDA situation: you have two means, the class one mean and the class two mean, and you have two Σ's, the class one covariance matrix and the class two covariance matrix. And of course, if you do this on real data, then even if in reality the covariances are the same, you get one sample of points in class one and another sample in class two, and the empirical covariance matrices computed with this formula will not be exactly identical. So even then you will still get a slightly quadratic decision boundary if you use QDA. If you want to use LDA, you have to estimate a slightly different object: a single covariance matrix that describes both classes. You can think about how to do this, and one could do it in different ways, for example compute the two class covariances and then average them, or combine them in some other way, but here is the more direct way to define it, and this is how it's usually done. It is called the pooled covariance estimator. "Pooled" means that we pool points from both classes: we take all points from class one and their squared deviations from the mean of class one (that's the inner sum here), then all points from class two and their squared deviations from the mean of class two, and we add everything together; that's why there is a double sum, summing across both classes pooled together, and then we normalize. It turns out that if you want an unbiased estimate, you have to subtract not one but the number of classes; again, this is a small difference that doesn't matter for us today.

As an illustration, here is some data in two dimensions: here is a bunch of points that belong to class one (that's the training set) and here is a bunch of points that belong to class two. Estimating the means is easy: this is the mean of these points and that is the mean of those points. Here is the covariance of class one and here is a different covariance of class two; each is a two-by-two symmetric matrix. In LDA you do a slightly different thing. This is the same formula as before, and this is an illustration of it: imagine you subtract the mean of this class from all the crosses and plot them here; the mean is now zero, because it was subtracted, so there is a cloud of crosses around zero. Then you take all the circles and subtract their mean from all of them, so this whole cloud also moves to zero, and you should imagine it overlapping with the first cloud entirely. Then you fit one single covariance matrix to this combined cloud, and that is what the formula is in disguise: because you always subtract the respective class mean, you can interpret the double sum as fitting one covariance matrix to the pooled cloud with the means subtracted. I hope this makes sense. And since two means were subtracted, not one, you subtract two here, and not one, if you want an unbiased estimate. Okay, and that's it.
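Here is a minimal sketch of this estimation step (my own code, with my own variable names, not the lecture's): class means, class priors, and the pooled covariance matrix.

```python
import numpy as np

def fit_lda(X, y):
    """X: (n, p) data matrix; y: array of class labels."""
    classes = np.unique(y)
    n, p = X.shape
    means, priors = {}, {}
    pooled = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        means[k] = Xk.mean(axis=0)            # hat(mu_k): class mean
        priors[k] = len(Xk) / n               # hat(pi_k): class fraction
        centered = Xk - means[k]              # subtract the respective class mean
        pooled += centered.T @ centered       # inner sum of squared deviations
    pooled /= (n - len(classes))              # divide by n minus the number of classes (unbiased)
    return means, priors, pooled
```

These hatted estimates are then plugged into the decision rule from the earlier slides.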
Let me also say that you can do this in principle whether or not the true covariances are the same: if you make the assumption that they are the same and just compute this Σ, you can technically do it even if the true covariances are very different. You will still get some Σ, and it will be the single covariance matrix that best fits both classes in your training data, the maximum likelihood choice. Fine. So now I have told you how to find μ₁, μ₂, Σ and the priors π, and we derived the formula into which you plug all of that to get the prediction for a given point x: the probability that it belongs to class one and to class two. And if you are happy with thresholding at 50%, then you have your binary decision. That's the full story of LDA.

There is, of course, the problem that we discussed a lot in previous lectures in different contexts, and that is overfitting; you will run into such issues here too. I hope many of you already suspected it, because we have this inverse of the covariance matrix in the formula, which is related to what we had in linear regression, where there was also an inverse of the XᵀX term. We talked a lot back then about how this inverse can cause numerical problems, especially if some singular values are small, which will always happen if your dimensionality is somehow too large for your sample size, or, to say the same thing the other way around, if your sample size is not large enough for your dimensionality. In fact you run into even worse problems: the formula is not even computable if the dimensionality is larger than the number of points. This causes overfitting because we are in the high-variance regime; see previous lectures for these discussions. But we can use exactly the same logic as before to tackle it here. Recall that in linear regression we had that term and, for example, ridge regularization: we defined ridge regularization via a penalty term on the loss function and so on, but in the end it just meant that you plug λI inside the brackets and use the same formulas everywhere else; that is the effect of the ridge penalty. We can do the same here, because we again have a problematic term that is the inverse of some matrix: we can add an identity matrix scaled by some small factor λ and then take the inverse. If Σ has small singular values, or eigenvalues, then after adding λI they are not so small anymore, so when you compute the inverse they do not explode, and the variance will not explode either, hopefully.
We can write the same thing a little differently, which is how it is usually done in discussions of LDA: instead of writing Σ + λI, I insert a (1 - λ) factor in front of Σ. This allows a perhaps more natural interpretation, at least in this context, because λ now ranges only between 0 and 1; it cannot grow arbitrarily large. If λ is 0, you are not regularizing anything: λ = 0 is just plain LDA. λ = 1 means that you forget you even had a Σ and use the identity matrix, which corresponds to the spherical LDA, as I called it before, and so basically to the nearest centroid classifier. So you can see this as interpolating between vanilla LDA and the nearest centroid classifier (or the thing that corresponds to it), and in between you have an entire spectrum. You can imagine doing cross-validation to choose the best λ; you can imagine how the test curve goes up due to high variance in the LDA regime, near λ = 0, and how it is also large due to high bias in the nearest-centroid regime, with a sweet spot somewhere in the middle, which is where you want to be. You need a test set or cross-validation to find that λ. It is exactly the same logic as always in this course.

That is not the only thing you can do here, though: you can interpolate different things using the same, or very similar, logic. Here we interpolated between LDA and the nearest centroid classifier, but I can also ask: why not interpolate between quadratic and linear discriminant analysis? Imagine you computed the separate Σ_k for each class, the two separate covariance matrices, and you also computed the pooled estimate Σ, and then you interpolate between the two. The model stays quadratic unless λ is one, but the decision boundary becomes more and more linear, straightening up until it is completely straight at λ = 1. This may help with overfitting if you are in a regime where QDA overfits but LDA does not. So there is a hierarchy of model complexity here: QDA is the most complex model, LDA is simpler, spherical LDA is even simpler. That is why you can move smoothly, using these λ terms, from a more complex model to a simpler one, and optimize that on the problem at hand.
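A minimal sketch of these two interpolations (function and parameter names are mine): shrinking the pooled covariance toward the identity (regularized LDA) and shrinking each class covariance toward the pooled one (QDA toward LDA).

```python
import numpy as np

def regularized_lda_cov(Sigma_pooled, lam):
    """(1 - lam) * Sigma_pooled + lam * I, lam in [0, 1]: lam=0 is LDA, lam=1 is 'spherical LDA'."""
    return (1.0 - lam) * Sigma_pooled + lam * np.eye(Sigma_pooled.shape[0])

def qda_to_lda_cov(Sigma_k, Sigma_pooled, lam):
    """(1 - lam) * Sigma_k + lam * Sigma_pooled: lam=0 is QDA, lam=1 is LDA."""
    return (1.0 - lam) * Sigma_k + lam * Sigma_pooled

# In practice lam would be chosen by cross-validation, as discussed above.
```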
And there are actually even more choices one can write down in this framework, depending on how you model the two covariance matrices for the two classes. There are two orthogonal choices you can make. One choice is whether you assume that the covariances are the same (a single shared covariance matrix) or allow them to be different, fitting two separate covariance matrices. The second dimension is how complicated a covariance matrix you allow for each class: it can be unconstrained, which I call a full covariance matrix; it can be a diagonal covariance matrix, with zeros off the diagonal and different values on the diagonal; or it can be a spherical covariance matrix, with the same value everywhere on the diagonal. This gives you six options. This one is QDA, this one is LDA, this one is called diagonal LDA (that is an actual term, people use it sometimes), and this one is almost the nearest centroid classifier. Here you would get "spherical QDA"; I have never seen this term in the literature, but I think it makes sense: it means you assume the covariances are different but both spherical, so maybe one class looks like this and the other like that, fine. And finally, over here, you get something you could call "diagonal QDA", also not a term I see often, but it has another name: it is called naive Bayes, and I will talk a bit about that on the next slide.

These six models have an order on them in terms of which are more complex and which are simpler: whenever you go down, the complexity decreases, and if you go to the right, the complexity also decreases. Assuming the dimensionality is p, one can work out how many parameters these covariance models need in each case. I can tell you, for example, that there is only one parameter in the bottom right: there you fit a spherical covariance matrix, which has a single parameter σ, and it is the same in both classes, so that is one parameter. As an exercise, you can count the number of parameters in every other cell of this table. Another comment is that one can interpolate between more complicated and less complicated models in any combination. Interpolating between LDA and the nearest centroid classifier is what is usually called regularized LDA, but you can interpolate between QDA and LDA, or in any other way; in principle all of that would make sense.

Okay, so I want to talk separately a little bit about this diagonal QDA situation, which is known, in this Gaussian case, as Gaussian naive Bayes. Naive Bayes is a more general term that can be used in non-Gaussian situations, but I'm not going to talk about that today; I just want to introduce the concept of what naive Bayes means, because you may hear the term often in applied contexts. Remember that this is just diagonal QDA; it is not a new concept. It is QDA with the covariance matrix being diagonal here and diagonal there, but possibly different in the two classes. What it means is that all correlations between the features, the within-class correlations, are assumed to be zero; we are ignoring the correlations. That of course simplifies everything a lot, because the probability density f_k(x) of class k then decomposes into a product of marginal densities, one per coordinate. If a Gaussian has no correlation, then a Gaussian over coordinate one times a Gaussian over coordinate two gives you the two-dimensional Gaussian. This is written down here as an exercise, but it is a very simple one: you just have to check that the density here, under the assumption that Σ is diagonal, can be written as a product of univariate Gaussians, each centered at the respective component of μ and with variance equal to the respective diagonal entry of Σ. That's a very easy exercise.
So here is how it looks if you have a diagonal covariance matrix; that's what I tried to say before. "Diagonal covariance" does not mean that the cloud is somehow stretched along the diagonal. It means that if you write the covariance matrix as a two-by-two matrix, with the variance of x1 and the variance of x2 on the diagonal and the covariance between x1 and x2 off the diagonal, then this off-diagonal covariance is zero, so the cloud is not stretched in any oblique direction at all. That is called a diagonal covariance matrix, and the joint density is the product of the two marginal densities that you get by projecting the entire data onto x1 or onto x2. Just for comparison: if you have a non-diagonal covariance matrix, one with some correlation, the marginal densities are still Gaussian, because if you take a Gaussian and project it linearly in any direction you still get a Gaussian distribution, a very convenient property of the Gaussian. So the marginal here is Gaussian and here it is also Gaussian, but the joint density is not the product of these two marginals: if you take the product of the two marginals, you get something that has no correlation; that is the better way to phrase it.

Okay, and now naive Bayes is the QDA classification that arises if you have two classes with two covariance matrices that are both diagonal but not necessarily the same. So here is my class one, here is class two, and each can be decomposed, so to say, into these two projections, this one and that one. Now if you want to classify this point, or in fact compute the probability that it belongs to class one, you can look at its coordinate over here and evaluate the density of this class in this one-dimensional situation, then evaluate the density of this class in the other one-dimensional situation, and multiply the two together, and that's it: this gives you the correct two-dimensional density for that class, which you then feed into Bayes' rule. So all computations can be done univariately: you don't need to think about the covariance matrix at all here, you only need the individual variances of the individual features within each class. You can erase the two-dimensional picture and keep only these one-dimensional pictures, one per feature. So it is very fast to compute, and it will not overfit: it may have high bias, but it will have low variance, and in some situations it can actually be very practical. That is why it is called "naive": it ignores all the correlations that may be in the data, by assuming that the covariance matrices are diagonal.
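A minimal sketch of Gaussian naive Bayes, i.e. "diagonal QDA" (my own code, not the lecture's): per-class, per-feature means and variances, univariate densities multiplied across features and combined with the priors via Bayes' rule.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_nb(X, y):
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        # per-feature mean, per-feature variance, and class prior
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0), len(Xk) / len(y))
    return params

def predict_proba(x, params):
    scores = {}
    for k, (mu, var, prior) in params.items():
        # Product of univariate Gaussian densities over the features, times the prior.
        scores[k] = prior * np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}
```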
If you think about the decision boundary here, though, it will not be linear; I just want to say this one thing, because this is still QDA. If you work out the decision boundary, it will still be some quadratic curve; in this case, for example, it will probably go like this and then curve around here.

All right, now to something slightly different. In fact, there is a whole parallel, different way to derive LDA, and depending on the textbook you pick, you may see LDA introduced as I just did, with this other view mentioned later, or the other way around: it is introduced the way I will now describe, and the connection to all the probabilistic machinery and Bayes' theorem from before is made afterwards. So there are two different ways to look at LDA that arrive at the same thing, and this one was probably, historically, the first. LDA is also sometimes called Fisher's discriminant analysis, or Fisher's linear discriminant; all of these mean exactly the same thing. What Fisher originally did to derive, essentially, LDA was a very different picture. He posed the following problem: imagine you have two classes, and assume the covariances are the same; this is the same assumption as before. Now let's try to find a good linear projection of the data: we can project the data linearly onto a line, which could be here, or here, or anywhere. What is a good one-dimensional projection in terms of separating the two classes from each other? We want to find the projection such that, after projecting all the data, the classes are as separated as possible. For that we need to quantify what it means for the classes, in one dimension, to be as separated as possible, and the definition Fisher used was to measure the between-class spread, how far the class means are from each other, and the within-class spread, how much variance the classes have, and take their ratio. The higher the ratio the better, because we want the between-class spread to be high and the within-class spread to be low; both contribute to a high ratio. Here is one way to define that. Imagine you projected everything onto one dimension, so you now have one-dimensional quantities instead of vectors: the mean of class one, the mean of class two, and some within-class scatter, which I denote by s, the sum of squared deviations divided by the sample size, and the same for the other class. Then here is Fisher's ratio: the squared difference between the projected means in the numerator (the further apart the means, the higher the numerator), and the sum of the within-class squared deviations from the respective means in the denominator. Now, with a little bit of algebra, or calculus, one can find the solution: what is the best direction to achieve this? I put a star over here because I don't want to explain all the steps mathematically in much detail (maybe we will in the exercise session); I will just briefly go over it to give you an idea of how it works. If I denote by w the direction, the axis onto which I project the data, then I can rewrite this term as follows: m₁ is just the projection of the mean vector μ₁ onto w, and we have the square of the difference of these projections; and then there is the denominator.
In the denominator I use the same Σ as I defined before, the pooled covariance matrix, and you can easily convince yourself that if you compute this term, multiplying Σ by w on the right and by wᵀ on the left, you get a number that equals the denominator up to a normalization constant; that's why I write "proportional to" and not "equal to". So in the original ratio you don't see the w explicitly, while here you do, and the question is: what w maximizes this? How do you solve that? There is a trick that is very useful whenever you deal with problems like this, so I want to mention it: you make a change of variables. It is not obvious what the best w is, because the denominator is somewhat complex, so introduce a different variable v, equal to the square root of Σ times w. If a matrix is positive definite, meaning all its singular values are positive, you can always define its square root as the same matrix with all singular values replaced by their square roots; the square of this matrix, the matrix times itself, is then just Σ. So we define v like that, plug everything in, and in the denominator we get simply the norm of v, which is very convenient, because now we can see directly that multiplying v by any constant puts the constant squared in the numerator and squared in the denominator, so it cancels: the length of v does not matter. We can therefore choose any length we want, say length one; then we can erase the denominator, because it is one, and we want to maximize the dot product of v with a given vector. That is of course maximized when v points in the direction of that very vector. So it is a long explanation but a pretty obvious result in the end: v should be proportional to this expression here. And now we need w, so we convert v back into w using this formula, and we get the same thing as before, w proportional to Σ⁻¹(μ₁ - μ₂). So that is a very short account of how, not obviously a priori, maximizing this ratio of class separability gives exactly the same direction that we had before, which followed from a very different logic in LDA.

Here is an example, just an intuitive explanation. Here are the two Gaussians, with the covariances assumed to be the same, and now think about what happens if you project them onto different directions. If you project here, for example, here is one projected mean and here is the other, and the variance is pretty large, so the separability between the two classes according to Fisher's definition will be rather low, because you are dividing by large variances. The optimal projection is around here: if you project there, the distance between the means may even be a bit smaller, but the variances are much smaller in that direction, so the class separability between these projected Gaussians will be larger, which is what Fisher's criterion is after.
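A minimal sketch of Fisher's criterion in code (notation mine): the ratio of between-class to within-class spread for a candidate direction w, and the optimal direction proportional to Σ⁻¹(μ₁ - μ₂).

```python
import numpy as np

def fisher_ratio(w, mu1, mu2, Sigma_pooled):
    between = (w @ (mu1 - mu2)) ** 2   # squared distance between the projected means
    within = w @ Sigma_pooled @ w      # within-class spread of the projection
    return between / within

def fisher_direction(mu1, mu2, Sigma_pooled):
    # The maximizer of the ratio, up to scaling: Sigma^{-1} (mu1 - mu2).
    return np.linalg.solve(Sigma_pooled, mu1 - mu2)
```

For any other direction w, fisher_ratio(w, ...) is never larger than the ratio achieved by fisher_direction(...).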
Okay, a comment on discriminant analysis versus logistic regression, and we are approaching the end. These are two linear methods, both popular, and in practice you see both of them applied very often. One should always remember, when applying them, that there can be overfitting issues in both, and both can be regularized in different ways, for example with a ridge penalty, which can be applied to logistic regression and to LDA. So whenever you do this in practice, please make sure to regularize, to cross-validate, to somehow choose the best regularization. And if you do all that, then in practice there is very often not much difference between LDA and logistic regression: both can give you probabilistic predictions if you want them, and both, with a threshold, can give you binary predictions if that is what you want. You can find situations where the performance is very different, but usually, or at least often, it is not. Nevertheless, one can say the following: if the data are truly Gaussian, then linear discriminant analysis is provably optimal. In fact, we just proved it: if the data are Gaussian and the covariances are the same, you cannot do better than LDA; it follows from Bayes' rule, and it gives you the best possible linear classifier. The catch, of course, is that if the data are not really Gaussian, then anything can happen with LDA in principle, whereas logistic regression does not make this assumption. So if the data are strongly non-Gaussian, logistic regression can perform better, can outperform LDA. An example would be data that are roughly Gaussian but contain outliers: you have a nice Gaussian over here and an outlier over there, and that will completely screw up your estimate of the covariance matrix; LDA will start performing badly, whereas with logistic regression it could be that nothing much happens and it still performs well. Of course the data can also have a completely non-Gaussian shape without any outliers, and even then LDA can perform surprisingly well. Still, I think most people would agree that logistic regression is probably the best first choice, because it does not need this additional assumption, and you do not lose much even if the data are close to Gaussian: logistic regression will still be pretty good, so not much is lost, and it is the safer option.

On the final slide I just want to mention something non-linear at the end. I introduced the nearest centroid classifier before, so here is an example for you; let's talk about this picture. Here is one class, the crosses, and here is another class, the circles; this point denotes the mean of the circle class and this point denotes the mean of the cross class. Let's say your test point is somewhere over here. Well, it is closer to the cross mean, so it will be classified as a cross. But of course this only makes sense if you assume that the classes are Gaussian and spherical, and here they are obviously not spherical, or even Gaussian. In fact, the point here is most likely a circle, because the circle class has this elongated shape. An alternative idea, conceptually not that far, I think, from looking at which class centroid is closer, is to ask to which class the points that are the neighbors of my test point belong. Let's say my test point is here, where my fingertip is; then I find several nearest neighbors, say four in this case.
These will be this point, this point, this point, and that point over there; I circle them. So I found four nearest neighbors, and among these four nearest neighbors three are circles; well, then I classify the test point as a circle. Okay, so that is called the k-nearest-neighbor classifier. It is super simple, but it can actually be very effective in very different situations, and it can often be a good first choice as well. The value of k (you can have a one-nearest-neighbor classifier, a 10-nearest-neighbor classifier, a 100-nearest-neighbor classifier) sort of controls the position on the bias-variance trade-off. If k equals one, that is a high-variance situation, because you can imagine what the decision boundary looks like with k equal to one: it will wind around all the crosses, so it will be prone to overfitting the noise in the data. Whereas if k is large, it will rather be a high-bias situation. So you can see k as, well, not exactly a regularization parameter, but something that directly controls the complexity of the model, and you can tune k by cross-validation or by applying the classifier to a test set.

Several comments about this. This is a so-called non-parametric method, the first non-parametric method we encounter in this course. It means that, in some sense, we are not building any model and we are not fitting any parameters; that is what "non-parametric" means, we don't have parameters. If we want to make a prediction for a test point, we just need to find its neighbors in the training set, which means that we have to keep the entire training set available at test time. It is not that we fit the model on the training data, forget the training data, keep our model parameters and apply them to the test set; that does not work here. Here you have no model parameters, you just have your training data: given a query point, you find its neighbors in the training data, and that gives you your prediction.
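A minimal sketch of this classifier (my own code, not from the lecture), including the simple vote-fraction probability estimate that the next paragraph interprets:

```python
import numpy as np

def knn_predict_proba(x, X_train, y_train, k=10):
    # Euclidean distances from the query point to every training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbor_labels = y_train[np.argsort(dists)[:k]]
    # Fraction of the k neighbors in each class, e.g. 7 of 10 -> 0.7.
    return {c: np.mean(neighbor_labels == c) for c in np.unique(y_train)}

def knn_predict(x, X_train, y_train, k=10):
    proba = knn_predict_proba(x, X_train, y_train, k)
    return max(proba, key=proba.get)   # majority vote among the k nearest neighbors
```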
You have a query point You find the neighbors in the training data This gives you your prediction interestingly though Or nevertheless, it can be given a probabilistic interpretation pretty similar to the To the one we were developing in this lecture because you can view this nearest neighbor thing as constructing a non-parametric estimate of this probability of x given the class k, so this is also something I will leave as an exercise, but it's pretty pretty simple to see that if you If you Found your if you if you say well, I will look at Some amount of nearest neighbors Let's say in this case four nearest neighbors is in this picture and I will denote by c k the number of points among these nearest neighbors that belong to class k Right and nk is as before the number of points in this class overall on the entire training set Then this fraction so let's say I overall have 10 points 10 circles and 10 crosses and here I get three circles And only one cross This means that the probability is three times larger that It belongs to the to the circle that it came from the circle class then that it came from the from the cross class so one can I think when when using nearest neighbor classified people usually don't think about probabilities rather they just say We take the majority vote of the 10 nearest neighbors and that's it But in principle one can say well if you taking the 10 nearest neighbors you You are you are making a probabilistic prediction if it's 7 out of 10, then it's 70 percent That's the likelihood you can still multiply it with the prior If you want to get the posterior Um, so this is this is this doesn't fit the lda story, right? This is not parametric method. The decision boundary is not linear. It's not quadratic. It's not parametric Um some function that just depends on the training set directly. All right. Thank you