Hello and welcome to lecture five of this introduction to machine learning class. The topic of today's lecture is logistic regression, which means that we switch from regression, the topic of the first four lectures, to classification, which will be our topic for the next four lectures. So what is the difference between regression and classification problems? In regression we use some predictor variables to predict a response variable that is usually continuous. In classification we want to assign our samples to discrete classes. For example, if you're given an image and you want the algorithm to tell whether it's a cat or a dog, that's a classification problem. There can be more than two classes: it can be a cat, a dog, a crocodile, or a panda; that's also a classification problem. So we can think of the response variable in classification as a categorical variable, a variable that takes values in the set {cat, dog, crocodile, panda}. It's important that these are discrete categories and that there is no order on them: the crocodile is as far from a cat as from a panda. We do not assume any ordering, because you can also have discrete values that are ordered, like integers; that is not what we are going to talk about now. So in all these classification lectures we assume that the categories are unordered and the response variable Y is categorical. If there are only two classes, which is by far the most common situation, at least in textbooks, this is called a binary classification problem; that would be cat versus dog. If there are more than two classes, this is called a multiclass or multinomial classification problem. It is mathematically convenient to label each of the categories with an integer starting from zero. So for a binary classification problem we will be talking about a response variable Y taking values in the set {0, 1}: it can be zero or it can be one.
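Since we will use these integer labels throughout, here is a minimal sketch (in Python, with made-up category names) of encoding unordered categories as integers starting from zero:

```python
# Map unordered categories to integer labels 0..k-1 (the order itself is arbitrary).
categories = ["cat", "dog", "crocodile", "panda"]
label_of = {name: i for i, name in enumerate(categories)}

print(label_of["cat"])    # 0
print(label_of["panda"])  # 3

# For binary classification the response y takes values in {0, 1}:
y = [label_of[c] for c in ["cat", "dog", "dog", "cat"]]
print(y)  # [0, 1, 1, 0]
```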
If there are more than two classes, the labels run from 0 to k-1, where k is the number of classes. We will start with, and actually spend most of the lecture on, the binary classification problem, so I will not be talking about cats and dogs anymore; from now on I will just be talking about zeros and ones. The first question that's really important and that we need to clearly understand is: why do we need something else? Why can't we just keep using linear regression as we did? We have all the machinery nicely set up; now we have Ys that can be zero or one, so maybe we can just do the same thing, and in fact in some situations it's not too bad. But there are problems with that, so let me try to explain them. Here's a very simple example dataset: there is one predictor X, as in the simple linear regression we had before, and a response Y, and all points are either on the zero line or on the one line, as we agreed on the previous slide. Here I assume that one can actually predict pretty well. Let's take a concrete example: previously I was talking about predicting height from age, I think, so here we can talk about predicting, for example, whether a person has started going to school, depending on age. X is age, and for very young ages the answer is no, which is why all the points here are zero; then from some age on it's mostly ones, and of course there's a gray zone where some people go to school earlier or not until later. So imagine that's the data, and we run linear regression and get the line of best fit, which looks like this. Then we can additionally say that whenever the prediction is above 0.5 we call it one, and whenever it's below 0.5 we call it zero, so we discretize the prediction into an actual predicted class. Is this a good model or a bad model? In this case it actually
will not be so bad. Conceptually, though, it is unsatisfactory, I would say, because we are often predicting values that are above one or below zero, and this doesn't seem to make a lot of sense. But there is a particular practical problem too. Imagine I add one more observation over here on the right: a person who is really old and of course went to school, so the point sits at y equals one. For our linear regression model this is a huge outlier; there will be an enormous error, so our mean squared error will grow by a lot, and the line will change to adapt, dipping much lower over here to minimize that error. That means we'll start making errors for all these points over here, and the model will get much worse. This often happens in linear regression if you have a strong outlier. The problem here is that this point is not really an outlier: it fits perfectly with the categorical classification model we are trying to build. It is only an outlier because of the linear model, which is of course inappropriate for this case. So that illustrates why it's not a good idea to use linear regression, even though in some cases it might work fine. We want something else. In fact, we want to construct our model so that the predicted values y hat stay between zero and one, not outside. And why do I say between zero and one, even though the actual values y (without the hat) are either zero or one? That's another important point: at least in this lecture, we will want to predict something that can take any value from zero to one, because we will interpret it as the probability that the sample belongs to class one. If the probability is 100%, that coincides with the sample being class one, but maybe the probability is 80%, and then we'll be predicting
0.8 probability according to our model. So the deep conceptual statement here, philosophical perhaps, is that in many cases it actually makes more sense to predict probabilities than just class membership. Think about it in practice: maybe it's a classification problem where you have patient data and you want to predict whether a person is sick or not. It's much better if your model tells you the person is sick with probability 99%, or sick with probability 52%, rather than having a model that just says sick or not sick. You want this probabilistic statement, and in most cases you actually do want it. So I'm arguing that it's good, and that's what logistic regression tries to do: predict the probability. That's why the predicted values can take any value between zero and one. That is also, I think, why logistic regression is called regression even though it solves a classification problem: funnily enough, we will be dealing with continuous predictions, not discrete ones. Okay, so how do we achieve that? Here's our x variable, say, and we want some function that goes to zero here and goes to one here, with this kind of shape; this would fit our example with age and going to school or not. It turns out there are many such functions, and one can choose different things, but a very convenient and common choice is the so-called logistic function, which has a sigmoid form (it is also often called the sigmoid function) and this equation. You can easily see that when x goes to minus infinity, the function goes to zero, because the exponential in the denominator becomes a large number; when x goes to plus infinity, the exponential becomes very small, so you have one over one and the function goes to one; and at zero it is right in the middle, at 0.5.
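As a quick sketch in Python (the function name is my own choice), the logistic function and the limiting behavior just described:

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-10))  # close to 0 as x goes to minus infinity
print(sigmoid(0))    # exactly 0.5, the midpoint
print(sigmoid(10))   # close to 1 as x goes to plus infinity
```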
Okay, so this is the logistic function without any parameters, the raw logistic function as it is. We of course want to fit the logistic function to our dataset: with this imaginary dataset of age versus school attendance, for example, we want to adjust some parameters so that the logistic function fits the points. How would you adjust these parameters? Well, maybe you want to shift the entire logistic function left or right, or maybe you want to squeeze or stretch it; those are two parameters one can think of. In fact, if I apply the logistic function to a linear function of x, it will do exactly that. I can rewrite it a bit differently, plug it into the formula, and then you will see that, written like that, the midpoint is wherever x equals a: that's where the argument is zero, so you get one half as the predicted value. The second parameter, beta one, governs the slope: the larger beta one, the steeper the curve, and the smaller beta one, the shallower the slope. One can take the derivative and verify very easily that the slope at the midpoint is proportional to beta one; it is beta one over four. So the analog of simple linear regression for logistic regression would be fitting beta one and a (or, equivalently, beta one and beta zero) so that this sigmoid, stretched and shifted, fits the points the best. If there is more than one predictor, it's exactly the same thing as in linear regression: we just replace the linear function of scalar x with a linear function of the vector x, where we assume that the vector x includes the intercept entry, so its first value is one, as we had in linear regression. That's exactly the same here, and we're just talking about the beta vector. Observe that after you multiply beta with x you just get one
number, and then you can put it through the logistic function to get a probability. Okay, great. One thing that I'd like to discuss here is that logistic regression is called a linear classifier. Why is it linear? The logistic function is obviously a nonlinear function. However, people say that this is a linear classifier because the probability is a transformation of a linear function of x: you take x, you multiply it by beta, and that's a linear function; only then do you apply the nonlinear transformation to squeeze it into the [0, 1] range and get a probability out of it. Before that step, it is a linear function of x. Another way to see why it makes sense to call it linear is to look at the decision boundary you get when you have several predictors. Let me explain this little cartoon. Here you have two predictors, x1 and x2; this is the kind of image I will be using a lot during this lecture, where circles denote one class and crosses denote the other class. Say you fit the logistic regression model to this data; then for each point on this plane you get a predicted probability. You can think of the plane as lying horizontally, and then you fit a prediction surface, a plane, that is linear, and you transform it through the sigmoid to get the predicted probability. So where are all the points that have predicted probability one half? I argue that they form a straight line here, because they correspond to the points where beta times x is zero, and beta times x equals zero is the equation of a line. And we can call it a decision boundary, because if we say that everything with probability above 50% of belonging to class one is called class one, and everything below is class zero, then this is our decision boundary. And this boundary is linear in logistic
regression. Whenever the decision boundary is linear, we call the algorithm a linear classifier. Not all problems, obviously, can be solved by a linear classifier or by logistic regression. Here's an example of a problem where the two classes are clearly separated, so it should in fact be easy to distinguish class one from class zero, but no linear decision boundary can do it: you need a circle as the decision boundary. But in fact it's the same situation we discussed before with regression. Remember that we talked about polynomial regression, which is still linear regression, just with added polynomial features, and with it you can actually fit nonlinear functions. The same trick works here: you can add polynomial features to your problem, and this will, at least in some cases, allow you to convert this kind of problem into one that can be solved by a linear classifier. In this case the correct decision boundary is a circle, and the equation of a circle is quadratic, so let's add quadratic features. In fact it's enough to add just x1 squared, so you now have three features: x1, x2, and x1 squared. All the crosses have a larger value of x1 squared than all the circles, so you have to imagine that along this third dimension all the circles are at the bottom and the crosses are higher up, and it's very easy to linearly separate them: the decision boundary is just a plane that goes between them. If you collapse it back and draw the same thing here, the decision boundary is a circle. So, exactly analogous to polynomial regression in the regression case, one can do the same thing with logistic regression and obtain nonlinear classification boundaries. But let's get back to just setting up logistic
regression, because we are actually not finished. We discussed the formula for obtaining the y hat values from x given some parameters beta, but what we have not described, the missing ingredient, is the loss function: how do we judge whether a given sigmoid, with given parameters, fits our data well or not? For this we need a loss function, and again the conceptual question is: why can't we just keep using the same loss function we had for linear regression? We could use this formula for the y hats and still use the mean squared error loss. Whenever the true value is zero, y hat should be close to zero, which is what we want; whenever it is one, y hat should be close to one, again what we want. It seems rather reasonable. From the way I'm asking, you probably guess that it's a bad idea, but it's not an entirely stupid idea; one can do it like that. The reason why it's arguably not so good is the following. Imagine that you predict class one with 99% probability, and another model predicts class one on the same sample with 99.99% probability. From the point of view of mean squared error these give about the same error. However, probabilistically these are very different things. Say the truth is that this sample actually belongs to class zero. In the first case, something happened that you predicted would happen with 1% probability; in the second case, with the other model, something happened that it predicted would happen with probability 0.01%. That is a hundred times more unlikely, and you should definitely want to penalize the second model much more for this specific case. This is something that mean squared error will not do. So we want a better loss function, one that somehow takes into account the probabilistic nature of our prediction problem. We could either just
think of one, but let's try to derive it in some sense. Recall that we previously discussed how the mean squared error loss function actually follows from an assumption of Gaussian noise if we use the maximum likelihood principle: we assume a probabilistic model with Gaussian noise, we ask for the most likely solution (that's the maximum likelihood principle), and the mean squared error loss function just pops out of the formulas. Can we do the same trick here? Yes, we can, so let's try. Of course we don't have Gaussian noise here; we actually don't have any additive noise at all. We are predicting that a sample, with, say, 80% probability, should be class one, but it can be either class one or class zero. Mathematically this is described by a Bernoulli random variable, which is basically just a coin flip: something that can be one or zero, where the probability of one is p and the probability of zero is one minus p. So it's a biased coin; a Bernoulli random variable is nothing other than a biased coin. Now let's convert this into the likelihood of our entire training set, where i goes over all samples in the training set, from one to n, and h of x_i is our model's prediction of the probability. Whenever the sample is actually one, h of x_i enters the likelihood, because that was our predicted probability of observing one; and whenever the sample is zero, one minus our prediction, as written here, enters the likelihood. The entire likelihood is just a product, because it's the probability of seeing sample number one under our model, times the probability of seeing sample number two, times the probability of seeing sample number three, and so on: a huge product. It's just a bit inconvenient to write, because here I'm writing it as a product over all samples that have value
one, and this is a product over all samples that have value zero in the training data. Okay, so that's the likelihood; nothing much has happened yet. We discussed before that it's convenient to work with the log-likelihood, because products turn into sums, which is often very convenient, and it's also convenient to flip the sign. So we do the same here: we take the negative log-likelihood and get this equation. One little trick: it's convenient to rewrite this as a single sum, because it's a bit cumbersome to always write these two sums, and I would like one sum that goes over the entire training set. How can I achieve that? Here's the trick: I write it like this, and then just observe what happens. Whenever y_i is one, the factor y_i is one and I keep the first logarithm, which is the same as before, while the factor (1 - y_i) is zero, so the second term falls out. And whenever y_i is zero, the first term falls out, the factor (1 - y_i) is one, and I'm left with just the second logarithm, again the same as before. So I can rewrite the two sums as one sum using the y_i variable, and this only works because the labels take values zero and one; that's one place where this notation is actually convenient. So this is the loss we will use, and you see that logarithms have appeared here. Let me rewrite it on a new slide: this is the loss function, and whenever I write h of x_i I mean this expression, so you can also just plug it in in these two places, and then this is the full specification of the problem. If you give me a dataset and a beta vector, I can compute the loss and give you a number; if you give me another beta vector, the loss function gives you another number. So, given your training data, you have the space of possible parameter vectors beta.
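The loss we just derived, written as one sum over the training set, can be sketched in a few lines of NumPy (variable names are my own; the design matrix X is assumed to carry the intercept column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood (cross-entropy) loss for logistic regression.

    X is the design matrix (first column all ones); y has values in {0, 1}.
    """
    p = sigmoid(X @ beta)  # predicted probabilities h(x_i)
    # One sum over all samples, using the y_i in {0, 1} trick from the lecture:
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Given a dataset and a beta vector, this returns a single number, exactly as described: a different beta gives a different number, and those numbers form the loss surface over the space of parameters.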
For each beta you get a value, so you have a loss surface that you want to minimize. This is not a linear regression problem; in fact it is not a linear model, because, as we discussed before, a linear model is linear in the parameters, and here we obviously have a nonlinear transformation of the parameters beta. This is called a generalized linear model, or GLM, and it is part of a big family of different GLMs. We will not discuss the general theory of GLMs here; I'm just mentioning the term because you may come across it later on. In some sense it's still pretty linear, because of the beta-times-x factor, as I tried to explain before; there is just this nonlinearity at the end, which keeps it fairly simple to work with, and that's why it is called a generalized linear model. But remember that a generalized linear model is not a linear model. Okay, good news: this loss function is convex; just believe me here, one can show it without much difficulty. That means it has just one minimum: you can start anywhere you want, do gradient descent, and converge to that minimum. Great. The bad news is that there is no way to write down the solution, the position of the minimum, the beta hat vector, as a formula. There is no closed-form solution like we had for linear regression; it is not possible here. So if we actually face this problem, we have to do some optimization: start with some guess, go down the loss surface, and converge to the minimum, and that's how we obtain beta hat in practice. Usually so-called second-order methods are used to optimize logistic regression. I didn't mention second-order methods before; what this means is that the method doesn't only use the derivative of the loss with respect to beta to know which way is downhill, as gradient descent does, but also computes the second derivative of the loss with respect to beta and uses it to choose
the step size, essentially. I just wanted to mention this briefly, because second-order methods didn't come up previously; here, for simplicity, we'll just use gradient descent to solve it, which is still possible. So let's try to derive gradient descent for logistic regression. I will not do the full thing, but almost. I will leave it as an exercise for you to compute the derivative of the sigmoid function; it's very simple, and you can verify that you get this. Using that, we can start computing the gradients. We have this term in the loss function, the logarithm of h of x, and we compute its gradient; h of x is g applied to the linear combination beta times x. It's pretty simple to see what happens when you start computing: it's just the chain rule from standard calculus. You have the log of something, so you get one over that something, times the gradient of the something; the gradient of the something is given by this formula over here, so the first factor cancels with the one-over term and you are left with the second bracket; and then you still have the gradient of what's inside g, but that's just beta times x, whose gradient with respect to beta is just x, so you get x at the end. You can take a piece of paper and derive it in more steps if you want, but that is the end result. Similarly, for the other term in the loss function, the gradient is given by this. Now let's put it all together. Here was our loss function; we want its gradient, so we push the gradient inside the sum and apply these formulas, and some things cancel very nicely: you will see that y_i times this term over here cancels with y_i times this term over here. Just open the brackets and you will see the cancellations, and this is what you are left with. It's very simple: here you have y
i, and here you have y_i hat, essentially: the prediction y_i hat, times x, summed over all i. We can write it in matrix form too, where capital X is the design matrix, the matrix that has all our samples as rows (we talked about that before), y is a column vector of responses, and y hat is a column vector of our predictions. So we get this gradient, which as you see is a super simple formula; one can program it very easily if one wants to implement gradient descent, it's essentially one line of code. Now, amazingly, if you remember how this looked for linear regression, you might have noticed that this is the same formula: the formula I have here in red is identical, up to a constant factor like one over n, which doesn't matter, to the formula we had for linear regression, even though y hat has a different meaning: y hat is now given by this very different formula, not just beta times x but the sigmoid of beta times x. But it works out that the final formula in terms of y hat is the same. This actually follows from the general GLM theory, so it's the same for any GLM, which hints at how elegant the GLM framework is; here I'm just pointing it out as an interesting fact. So we're done: we have the loss function for logistic regression, we know how to do gradient descent, we can obtain the solution. Let's now discuss its properties. Can we overfit by doing this? Yes. In fact, everything we discussed in previous lectures about overfitting, regularization, and the bias-variance tradeoff applies to logistic regression just as it applied before. Let me show you an example. Here is training data with some crosses and some circles; look closely and you will see that if I fit a straight line here, I do pretty well, but I misclassify a few points, and that's not the best one can do.
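A minimal gradient-descent sketch of what was just derived (NumPy; the learning rate and step count are arbitrary choices of mine, not values from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.05, n_steps=5000):
    """Gradient descent for logistic regression.

    The gradient of the negative log-likelihood is X.T @ (y_hat - y):
    formally the same expression as in linear regression, except that
    here y_hat = sigmoid(X @ beta) rather than X @ beta itself.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        y_hat = sigmoid(X @ beta)   # predicted probabilities
        grad = X.T @ (y_hat - y)    # the "one line of code" gradient
        beta -= lr * grad
    return beta
```

Note that the gradient step really is one line, as claimed; everything else is setup and iteration.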
Probably this fit has high bias, because I will always consistently misclassify the crosses over here if I sample from this data distribution over and over again. Now let's do exactly the same thing we did in one of the previous lectures: imagine we add polynomial predictors to the model. I will not draw the polynomial predictors themselves; I will still show the x1-x2 plane and just draw the now-curved decision boundary, curved because I have higher-order polynomial predictors. So maybe I added quadratic or third-power terms and I get a decision boundary like that, and maybe that is actually how the data were generated: these crosses really belong to class one, and there's one misclassified circle, well, that's noise in the data, it can happen. But if I keep increasing the dimensionality of my predictor space, then at some point my model becomes so flexible that the decision boundary gets so curved it just snakes around every single example in the training set, and I get 100% classification accuracy on the training data. Of course, if you then take test data and use the same curved decision boundary, you will most likely be wrong very often, because this is a high-variance situation. To remind you what high variance means: if you generate different datasets (or imagine generating different datasets) from the same distribution, you will get a very different decision boundary every time, because the boundary depends on, and fits, the noise; that's why it's called high variance. So this is the bias-variance trade-off: the training error decreases as you increase model complexity, while the test error decreases and then increases again, and there is a sweet spot that you can find with cross-validation. If you use a penalty or regularization term, like the ridge penalty or the lasso penalty, one can use either of them
with logistic regression, and then the model complexity is controlled by the penalty: low complexity corresponds to a high penalty, and high complexity to a low penalty. Okay. One thing that deserves special mention here: if you enlarge the predictor space enough, you get into the regime that, for linear regression, I called the interpolation regime. Remember, whenever you have more predictors than samples in linear regression, your training error is zero; that's one aspect of it. You can have multiple beta hat vectors that all give a training loss of zero; that's another aspect. And additionally we discussed that, under some conditions, if you select a particular one out of these multiple solutions, the minimum-norm beta hat, it may actually perform well in practice. For logistic regression we have a conceptually similar phenomenon, but it manifests itself a bit differently, and it is called perfect separation. In fact it can occur even if you just have two predictors without strong correlation between them; this is different from regression, where this cannot happen. So let's say we have only two predictors, nothing is strongly correlated, and it's a nice dataset: here are all the zeros, here are all the ones. It seems there is no problem: we can classify them perfectly, with 100% accuracy on the training data, and maybe even on the test data, because this is such a simple problem. However, for logistic regression there is a problem in a sense, because if you think about what happens to your beta hat: the training loss will converge to zero as you run your gradient descent, but your beta hat will diverge to infinity. Why does this happen? Say this is your decision boundary right here; the model then wants the predictions on this side to be as close
to zero as possible and the predictions on this side to be as close to one as possible. You should imagine putting these values through the sigmoid, and you get something that looks like this, or let me show it like this: if you're predicting the zeros down here, then the curve goes up and you move toward the ones, and the loss gets smaller the steeper this curve is, so the optimum is basically a step function. And a step function corresponds to beta hat diverging to infinity. That's not very nice, and it is a problem that happens whenever you have perfect separation in the training data. One way to deal with it is regularization: adding a regularization term is like putting a prior on the beta coefficients, as we talked about before, so they will not diverge to infinity anymore; they will settle somewhere. A very interesting topic that I will only mention briefly is that, even among the beta hats that diverge to infinity, there can be different decision boundaries: the magnitude diverges to infinity, but the direction of the beta vector can differ, and some of these directions may actually perform well in some cases. We may talk about this in other lectures. If you're in a situation with many more predictors than samples, which happens for example when you train neural networks (not logistic regression, but a very flexible model that can fit any training data), then you can reach a situation where the training loss is zero and at the same time the network performs well on the test set. That's because the training procedure doesn't give you just any beta hat, but some particular one, and that is another example of implicit regularization. Okay, so let me, in the
interest of time, proceed to several more comments. Logistic regression is designed to give you probabilistic predictions, as I tried to explain: it does not give you a class label, zero or one, it gives you a probability. If in some cases you actually want a class assignment, if you want to go from the probabilistic prediction to a binary class label, you can do that, but you need a cutoff, or threshold. You get a probability out of the model, say 75% for class one, and then you say: if it's above 50%, I call it class one. That's your decision rule: you threshold the probabilities at some level. It is common to threshold at 50%, but you don't have to; it's a choice, a decision you have to make about where to draw the boundary between class zero and class one, and many considerations can enter into this choice. It's common to choose a cutoff that maximizes accuracy, and by accuracy I just mean the fraction of samples you classify correctly. Even if you want to maximize accuracy, it doesn't necessarily mean that a cutoff of 0.5 will be the best. You might select the cutoff separately: first you fit the logistic regression model, looking only at the loss (which is not the accuracy), and the procedure gives you some beta hat as the solution; then you can additionally choose the cutoff based on, for example, maximizing the accuracy, say in a separate cross-validation loop. The result may often be close to 0.5 if you're maximizing accuracy. However, you will not always want to maximize accuracy: in some cases you might prefer making errors in one direction over the other. Say you're predicting whether a person is sick, the same example I
You really don't want to miss potentially sick people, so you would rather get a false positive, which you can check later, than miss a true class-one case. In this case you might lower your threshold, because you will have additional screening afterwards. Or you may be in a different situation where you don't want many false positives: you only want to flag something that is really, definitely class one, and then you increase the threshold, so that you only call class one the cases where the model is, say, 95 percent certain, and everything else is put in class zero. So you can imagine different real-life situations where some considerations tell you whether predicting one or predicting zero is more important; there are trade-offs, maybe costs, like actual financial costs, that will follow from this decision, and all of that then enters the decision about where to put the cutoff. People often just think about accuracy, but accuracy is not always a meaningful measure in real life. One way to think about this problem of choosing the threshold, which you often see in practice, is a curve called the ROC curve, or receiver operating characteristic, a weird name. What this curve shows is what happens when you change the threshold, essentially from zero to one. The x-axis on this plot is the false positive rate: out of all the samples that are actually zero, how many do I incorrectly flag as one? Flagging a true zero as a one is called a false positive, and the false positive rate can go from zero to one. On the y-axis is the true positive rate: what fraction of the true ones do I discover, flag as ones? There are other terms for these
things that you might have heard, like sensitivity and specificity. These are confusing terms that people always mix up, which is why I prefer true and false positive rates. Now think about how these quantities change if you have some model that separates the two classes pretty well and you vary the threshold from zero to one. If your threshold is zero, everything goes into class one, so your false positive rate is one and your true positive rate is also one: you are sitting at this corner point, which is obviously not useful. If your threshold is one, the other extreme, you classify everything as class zero, and you end up at the corner where both the true positive rate and the false positive rate are zero. As you vary the threshold between zero and one, you trace out some curve; exactly what it looks like depends on how the classes are distributed and how well your classifier does. The better the classifier, the further this curve is from the diagonal line: the diagonal is basically what you get with a classifier predicting a random outcome, because as you change its threshold you move along the diagonal, whereas a non-random classifier bends toward the ideal corner. For any given real-world situation you may have a curve like this, and then the question is which threshold to pick. The curve itself does not tell you: you have to bring in all those considerations about what false positive rate is tolerable for you, or what true positive rate you want to achieve, and then you look at the curve and make a choice. It does not have to be a subjective choice; you can write down some objective function and say, that is my criterion for choosing the threshold. I am just saying that the choice is often something additional to the logistic regression model.
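The two rates and the threshold sweep can be sketched as follows (made-up scores and labels; the loop prints the points you would plot as the ROC curve):

```python
import numpy as np

# Hypothetical predicted probabilities and true labels (made up for illustration).
p_hat  = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
y_true = np.array([0,   0,   0,   1,   0,   1,   1,   1,   1])

def roc_point(threshold):
    """One point on the ROC curve for a given cutoff.
    FPR: fraction of the true zeros incorrectly flagged as one.
    TPR: fraction of the true ones correctly flagged as one."""
    y_pred = (p_hat >= threshold).astype(int)
    fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)
    tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    return fpr, tpr

print(roc_point(0.0))  # threshold 0: everything called class 1 -> (1.0, 1.0)
print(roc_point(1.0))  # threshold 1: everything called class 0 -> (0.0, 0.0)

# Sweeping the threshold from 0 to 1 traces out the whole ROC curve.
for t in np.linspace(0.0, 1.0, 11):
    print(round(float(t), 2), roc_point(t))
```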
Okay, one note here on accuracy, related to the previous slide: accuracy can actually be a misleading number in many cases. One example is when the classes are very unbalanced, which is called class imbalance. Imagine that among all people 95 percent do not have a particular disease and 5 percent do, and that is what you are trying to classify. If you classify everybody as healthy, you have 95 percent accuracy: a meaningless number, because you are not predicting anything and you already start at 95 percent. Whether the accuracy is 95 or 96 percent may then actually be a big difference, but it does not sound like a big difference if you just look at the accuracy numbers. So with very unbalanced classes, accuracy is a particularly uninformative measure; there are other measures, for example combinations of the false positive and true positive rates, that can describe more sensibly how well your classifier performs. What is important to say, on the other hand, is that logistic regression itself works just fine under class imbalance. People sometimes get worried: we have class imbalance in our training set, what should we do? In most cases you do not need to do anything; logistic regression can handle it, at least as long as your training data and your test data have the same class imbalance, and we always assume the test set comes from the same distribution as the training set anyway. If that is not the case, you do have to do something, but if it is, everything will work out; you just might need to put additional thought into whether you use accuracy or some other measure to choose a threshold whenever you have a strong class imbalance. All right, this brings me almost to the end of this lecture.
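The always-healthy example can be made concrete in a few lines (a toy simulation; the prevalence and sample size are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced labels: roughly 95% healthy (0), 5% sick (1).
y = (rng.random(10_000) < 0.05).astype(int)

# A useless "classifier" that declares everyone healthy.
y_pred = np.zeros_like(y)

accuracy = np.mean(y_pred == y)
print(accuracy)  # around 0.95 even though nothing is ever detected

# The true positive rate exposes the problem: no sick person is found.
tpr = np.sum((y_pred == 1) & (y == 1)) / np.sum(y == 1)
print(tpr)  # 0.0
```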
So, this was binary logistic regression. What do we do if we have more than two classes? It turns out that one can use almost the same machinery to generalize it, and the result is sometimes called multinomial logistic regression. One way to formulate multinomial logistic regression is via the so-called softmax function, something you may have heard of if you have studied or implemented neural networks, where it is very common. So let's imagine a more general situation where you have k classes, say 10 different classes. For each class, from 0 to 9, I will use its own beta vector, so now I have 10 different beta vectors, and for each of them I can compute beta x. That is my linear prediction, which I then want to transform, in some sense, into probabilities, and I use the softmax function to do this: it takes the exponent of each beta x and divides by the sum of all the exponents. This makes sense, because then the sum across all classes is 1, which is what we want from probabilities, and everything is positive, of course, because I took the exponent, which is also something we want from probabilities. So you can think of softmax as just a way to transform any real-valued prediction for each of my 10 classes into something that can be interpreted as a probability: everything is positive and sums to 1. An interesting thing to note here is that this formula is actually over-specified: if I add any fixed vector to all the betas, nothing changes. Imagine that I add some vector psi to every beta; then in the numerator I will have an extra plus psi, which I can factor out as a separate exponent with psi in it, and the same will happen in the denominator, in the
sum, so I can take it out of the sum, and it cancels. So the probabilities I am predicting are not affected by any constant shift of all the betas, and we might as well introduce one constraint that fixes this psi for convenience. For example, we can demand that one of the betas is zero; this does not change any predictions, because it is equivalent to choosing psi to be minus that beta, and once one of the betas is set to zero there is no freedom left, everything else has to stay what it is. So, another way to say this: we can use the softmax function with an additional linear constraint, for example constraining one of the betas, the last one in this case, to be zero. An interesting thing is that you can easily convince yourself that, with this constraint, for a binary classification problem this becomes equivalent to logistic regression. Let's see what happens if you have beta one equal to zero. Then for the prediction that y equals one, the numerator is just one, because the exponent of zero is one, and the denominator is one plus the exponent of beta zero x. That is basically the logistic regression formula, just up to a minus sign, which does not matter, we can flip the sign. So with this constraint we can clearly see that it becomes equivalent to logistic regression; and without the constraint it is still equivalent, because of the freedom I explained before. So, as a summary: logistic regression is a special case of multinomial logistic regression defined by the softmax function.
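The softmax construction just described, including the shift invariance and the binary special case, can be sketched in a few lines (the scores are made-up numbers for illustration):

```python
import numpy as np

def softmax(z):
    """Map one real score per class to probabilities: all positive, summing
    to 1. Subtracting max(z) changes nothing (shift invariance) but avoids
    overflow in the exponent."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# Scores beta_k x for, say, k = 3 classes (made-up numbers).
scores = np.array([2.0, 0.5, -1.0])
p = softmax(scores)
print(p, p.sum())  # three valid probabilities that sum to 1

# Shift invariance: adding the same constant to every score
# leaves the probabilities unchanged.
print(np.allclose(softmax(scores + 7.0), p))  # True

# Binary special case with the constraint beta_1 = 0: the softmax
# probability of class 1 is 1 / (1 + exp(s)), i.e. the logistic function
# of the remaining score up to a minus sign, as noted in the lecture.
s = 1.3
p1 = softmax(np.array([s, 0.0]))[1]
print(np.isclose(p1, 1.0 / (1.0 + np.exp(s))))  # True
```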
Whenever, later on, we use neural networks, for example to predict which object is in an image, where there can be 10 different objects, we will just use the softmax function at the end to get the probabilistic predictions out of the network; and if you have only two classes, it relates back to logistic regression, because then you just have the logistic function at the end. All right, this is all for today, thank you.