Welcome back. Last time, we talked about the bias-variance decomposition of the out-of-sample error. And we managed, by taking the expected value of the out-of-sample error with respect to the data set, whose size is fixed at N, to get rid of the variation due to the data set. And we ended up with a very clean decomposition of the expected value of E out into a bias term and a variance term that have very interesting interpretations. And they are illustrated here in terms of a trade-off. If you have a small hypothesis set, the chances are the target function is far away. And therefore, there is a significant bias term, which is the blue one. And if you have a bigger hypothesis set, perhaps big enough to include your target, you don't have very much of a bias, perhaps none at all. But on the other hand, you do have variance, depending on which hypothesis you zoom in on based on the data set you have. So in the bias-variance decomposition, we basically had two hops from the final hypothesis you produce based on a particular data set to the target function, which is what we are trying to approximate. The intermediate step was the new notion, which is the expected value of the final hypothesis with respect to the data set D, which we called g bar. And the jump from your actual hypothesis to that fictitious hypothesis describes the variance, because here is the centroid of this, which is g bar, and you are somewhere here. So in order to get there, that is described by the size of the red region. And that is the variance. And then there is another one which is inevitable for a given hypothesis set, which is the hop from that one, which designates the best approximation, in a certain sense, within the hypothesis set of your target function, to your target function. And that difference is the bias, which is captured here. After doing the bias-variance decomposition, we went into an illustrative tool called the learning curves, where we plot the expected values of the in-sample error and the out-of-sample error as you increase the sample size. If you look at a curve, let's look at the bottom one here. Not surprisingly, as you increase the number of examples, the out-of-sample error goes down. If you have more examples to learn from, you are likely to perform better out-of-sample. And another interesting observation is that when you have fewer examples, the in-sample error is smaller. And that is because you are fitting fewer examples with the same resources, so you tend to fit them better. And the discrepancy between the two curves describes the generalization error. Then we contrasted the analysis of the bias-variance decomposition, which is the top curve, with the VC analysis which we have done before. And we realized that both of them describe a trade-off between approximation and generalization. In the bias-variance case, the approximation is an absolute approximation, how your best hypothesis approximates the target. And that is described by g bar, again with a certain liberty. And in the case of the VC analysis, the approximation was approximation in-sample only. So it was E in. And then the jump from the approximation to the final performance describes the generalization, whether it's this red region or this red region, which have basically the same monotonicity, except that they have different terms.
The final lesson from the theory, which is really what we are going to carry with us through the techniques, which start today and continue until the end of the course, is that the number of examples needed to achieve a certain performance is proportional to the VC dimension. And I'm putting it in quotes, because we defined it formally only for classification. But then we took the linear regression case, which is not classification. And we found out that the corresponding quantity, which is the degrees of freedom, same thing, d plus 1, happens to also describe the generalization property. And therefore, we basically have a rule that you need examples in proportion to the VC dimension, or to the effective degrees of freedom, in order to generalize. And the more you have, the better performance you get. This is the key observation. So today I'm going to start a series of techniques. And today is special because the techniques of the linear models have already been covered in part. Remember, this is the part that we split into two portions. And we got a portion very early on, out of sequence, just to give you something to work with. And then I'm going to complete the exposition of the linear models today. So let's see where we are. This is the big picture of linear models. They start with linear classification, perceptrons. We have seen that. And then go on to linear regression. We have also seen that. So that was the part that was covered. There is a third one, which is another linear model that is neither linear classification nor linear regression, which will be the bulk of the lecture today. It is called logistic regression. And then for all of these linear models, we have this nice trick called nonlinear transforms that allows us to take the learning algorithms of linear models, which are very simple ones, and apply them to nonlinear transformations. And if you remember, the observation here was that linearity in the parameters was the key issue for deriving the algorithm. So let's see what we finished and what we didn't finish in these topics. Linear classification is pretty much done. We know the algorithm, perceptron or pocket. There are obviously more sophisticated algorithms. And we did the generalization analysis. We got the VC dimension of perceptrons explicitly, and therefore we are able to predict the generalization ability of linear classification. So this is a done deal. Linear regression is also a done deal. We have the algorithm. Remember, that was the pseudo-inverse, the one-step learning. And last time we did a very brief analysis of generalization ability, and we found that it parallels that of the perceptron. d plus 1 is the operative quantity in this case. And therefore, we have linear regression as a technique, and we know its generalization ability. In the case of nonlinear transforms, we are almost done. We did the techniques, but remember we left off at a point where we said, OK, nonlinear transforms are a very useful tool. And they actually can let us separate any data points by going to a sufficiently high-dimensional space. And we had a suspicion that this is not really a safe process to follow. We have to worry about generalization issues. So the generalization issues were left out. And I'm going to start this lecture by tying up the loose ends in generalization for nonlinear transforms, before I go into the main topic, which is the third one, that is, logistic regression.
So the very brief analysis of nonlinear transforms in terms of generalization, then logistic regression beginning to end. That's the plan. OK, nonlinear transforms. Let's remind ourselves. We were working in an x-space, and we had a d-dimensional vector that represents the input, and we added the constant plus-1 coordinate to take care of the threshold. And now we are going to transform this into another space using a transformation we called phi. This takes us into the z-space, or the feature space. So each of these guys is derived from the raw inputs x. And the transformation we have can be quite general, if you look at it. Any one of these coordinates can be an arbitrary nonlinear transformation of the entire vector x. It doesn't take one coordinate; it takes the entire vector, all of these guys, computes any formula you want, and then puts it as the feature here. So you can imagine how general this can be. And also the length of this can be arbitrary. You can make it as long as you want. As a matter of fact, when we move to support vector machines, we will be able to go to an infinite-dimensional feature space, which is an interesting generalization. So each of them is a general transformation. And therefore, the small phi i is a member of the big transformation, capital phi, that takes the vector x and produces the vector z in the z-space. OK, so that's the transform. An example of that, which we used, was second-order. Instead of using linear surfaces here, we wanted to use quadratic surfaces. And quadratic surfaces in the x-space correspond to linear surfaces in a quadratically transformed space, z. So this would be the transformation. We got all possible factors that contribute to a second-order term. And now if you put coefficients on each of these and sum them up, you will get a general second-order surface in the x-space. OK. Now the final hypothesis for us will always live in the x-space. The z-space is transparent to the user. This is our tool for getting more sophisticated surfaces in the x-space, while we are able to use the linear techniques. So the final hypothesis will be written as g of x equals something. So you have the linear thing, but you have the transformed version of x. That's what you take the dot product with. So this is z. And in the case of classification, you take the sign, plus or minus 1, according to whether the signal is positive or negative. And in the case of linear regression, you get the raw signal itself. And as we will see, we'll have a third one, which is in between taking the sign and leaving the quantity alone, when we talk about logistic regression. So that's the summary of nonlinear transforms that we have seen so far. And now we talk about generalization and ask ourselves, what is the price we pay when we do a nonlinear transform? The price will obviously be in terms of generalization. OK. So this is the transform. Now, look at the x-space, and I ask you, what is the generalization behavior in the x-space? Let's say that you don't do the nonlinear transform. You do the linear model in the x-space. Well, in that case, you are going to get a weight vector in the x-space, and the dimensionality of the weight vector is the same as here. So it would be d plus 1. There is a weight corresponding to each of these coordinates. And then you take the dot product, and whether you threshold it or report it depends on which type of linear model you are talking about. But basically, you have d plus 1 free parameters.
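Before we count those parameters, here is a minimal sketch of the second-order transform just described, in Python. The specific weight vector at the end is an illustrative choice of mine (a unit circle), not a value from the lecture:

```python
import numpy as np

def phi_second_order(x):
    """Map a 2-D input x = (x1, x2) to the second-order feature space.

    Returns z = (1, x1, x2, x1^2, x1*x2, x2^2), so a linear surface in
    z-space corresponds to a general quadratic surface in x-space.
    """
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

# A hypothesis in z-space is still linear: h(x) = sign(w_tilde . phi(x)).
w_tilde = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # circle x1^2 + x2^2 = 1
z = phi_second_order((0.5, 0.5))
print(z)                        # the transformed point in z-space
print(np.sign(w_tilde @ z))     # -1: the point is inside the circle
```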
And we realized that d plus 1 free parameters corresponds directly to the VC dimension. In the case of the z-space, the feature space, we have a potentially longer vector, much longer possibly. And the dimensionality here is d tilde. That's the notation we give it. And the vector that we apply here will be w tilde. That will be, in general, a much longer vector than w. So for example, in our case, if we use the linear model with x0, x1, x2, we would have 3. If we did the full second order, we would get 6. So we get more there. And we have seen that the VC dimension, in this case, for the perceptron, is d tilde plus 1. So this is the price you pay for the generalization. And here, the price you pay for the generalization can be pretty serious. So now you see that, OK, if I want to separate the points, and I go to a 17th-order polynomial, and then you count the number of coordinates you have gone to, the VC dimension would be so large that, in spite of the fact that you were able to fit, because you went to a 17th-order polynomial, you don't really have a real chance of generalizing. Just to be accurate, this is really not an equality here, because we always measure the VC dimension in the x-space. Again, the z-space is transparent to the user. So we go here, and we come back, and we ask ourselves, what can we shatter here, and whatnot. So in spite of the fact that, in this case, if you had the full space, and you were able to choose the points any which way you want, you would be able to get exactly that VC dimension, it is possible that there are certain combinations of points here that are impossible to come by as transformations of legal points here. If you want a simple case, let me just take two coordinates to be identical, the same transformation. So obviously, now I'm stuck. I don't have the full benefit of the coordinates, because if I choose one, the other one is dictated. So just because of that fact, in order to be accurate, we will say that it is actually at most d tilde plus 1. Usually, it's very close to d tilde plus 1. So let's apply this to two cases where we use nonlinear transformations, in order to appreciate in practical terms what is the price we pay. The first case is non-separable, but it's a pretty easy one. It's almost separable, except for some points that you can consider, maybe, outliers. This red point is in the blue region. This blue one is in the red region. But otherwise, everything can be classified linearly. So one may think of this case: OK, this case is really linearly separable, and we just have a bunch of outliers. Maybe we shouldn't use a nonlinear transform, and should just settle for the linear model. We will talk about that. So this is one class of things that we encounter when we look at nonlinear transforms. The other one is genuinely nonlinear. This thing, I really don't stand a chance if I use a line, and therefore, I'm really talking about something that needs to be transformed. So let's see how the generalization behavior goes for both of them when we apply nonlinear transforms. The first case is pretty easy. OK, so it's almost linearly separable. So here are the choices. You can use a linear model in the x-space, in the input space that you have, and then accept that the in-sample error will be positive. It's not going to be 0. So in this case, here's the picture. There is an in-sample error, because this guy is erroneously classified, and this guy is erroneously classified by your hypothesis. So this is option number one.
Now option number two is to say, OK, I would like to get E in to be 0, so you insist on E in being 0. And in order to do that, you have to go to another space. So you decide to go to a high-dimensional space. Now you can see what the problem is here, because we are just taking care of two points, for crying out loud. And in order to actually classify them all correctly, believe it or not, you are not going to be able to do it with a second-order surface, or a third-order surface. You will have to go to a fourth-order surface in order to get it all right. And when you do that, this is what you get. Now you don't need the VC analysis to realize that this is an overkill, and this doesn't have a very good chance of generalizing. Of course, you can do the formal thing. You say, OK, fourth order: instead of having three parameters, I have however many there are. And therefore, for the limited number of examples I have, when I look at the generalization behavior, I am completely in the dark. So this case is a straightforward application of the approximation-generalization trade-off. We went to a more complex model. We were able to approximate the data better, but we are generalizing worse. So this has been completely covered already. So there is no surprise in this, other than to understand that at times, you might as well settle for a small training error in order not to use too high a complexity for the hypothesis set. The other one is the case where you really don't stand a chance with linear. I mean, it would be a very, very poor approximation and generalization. The data seems to be coming from an inherently nonlinear surface. And in this case, we used this transformation. And this transformation is my way of putting in a general second-order surface. And if you look at it, if we used only the x, which would be conveniently the first three guys, that would be the vector x, with weights. I have three weights, so I pay the price of three. Whereas if I use this, I have six. So I pay the price of six. So basically, in our mind, you need twice as many examples to achieve the same level of performance. Not that we have a choice in this case, because the linear doesn't work. But this is basically the formula we have in mind. Now comes an interesting discussion. I don't want to pay the six. I want to go to the nonlinear space, but I don't want to pay the six. So I want to get a discount. So here is a way to get a discount. Not necessarily legitimate, but let's pursue it and see why it would be legitimate or not. Why not transform x into this guy only? The idea here: this is the origin. So x1 goes like this, x2 goes like this. So it seems that I only need x1 squared and x2 squared. These other guys are just making me pay without really contributing, so I'm just going to use this model. Well, if I use this model, it looks like I now have three. So I need exactly the same number of examples as if this was linear, and as if I was doing it in the x-space. If you are smelling a rat, you are correct. And in order to make it clear, let's just pursue this line. I can do even better than this. Why not take two guys instead of three? I have the second guy being x1 squared plus x2 squared, because I really don't care about x1 squared and x2 squared being independent. They are just the radius, in my mind. So I do this. Now we have achieved a lot, because now we have even fewer parameters than if we used the linear guy. So the generalization must be getting better and better.
Now let's get carried away and go for the ultimate. I have one guy. I even let go of the mandatory constant. I just have this guy. And all I'm learning in this case is just what is outside and what is inside the circle, really. And it doesn't even need a real-valued parameter; it just needs a binary one. So now I have one guy, and the VC dimension is one, and I can generalize greatly. Well, something is wrong. Now it's clear that something is wrong, but it's very important to articulate what is wrong. What is wrong is the charging of the VC dimension of this hypothesis set. Think of the VC inequality as providing you with a warranty. Now, in order for the warranty to be valid, you cannot look at the data before you choose the model; that will forfeit the warranty. Why does it forfeit it? Because of the following. I am going to charge you, if I do the analysis correctly, not the VC dimension of the final guy you got. I'm not going to charge you the VC dimension of this fellow. I'm going to charge you the VC dimension of the entire hypothesis space that you explored in your mind in getting there. Because you have acted as a learning algorithm, unknowingly. Before you looked at the data, you had no idea. Let's say that you decided ahead of time: I'm going to use second order. Now you look at the data, and you realize that some coefficients are zero. That's called learning. I don't need this, I don't need this, I don't need this. You did it very quickly in your mind. So what happened here was hierarchical learning. First, you learned, and then you passed it on to the algorithm to complete the learning. So the effective hypothesis set is what you started with. You see what the point is. And the lesson learned from this is that if you look at the data before you choose the model, this can be hazardous to your health. Not your health, but the generalization health. Why is that? Because now the quantities that describe generalization become very vague. When you propose a particular model, I can go and mathematically estimate the VC dimension. If you look at the data, we said that you did learning. So now I'm asking, what exactly is the full hypothesis space that you explored in the beginning? That's a little bit vague, very difficult to pin down. Definitely bigger than what you would charge if you used the VC dimension of the hypothesis set that you ended up with. And this is a manifestation of the biggest trap that practitioners fall into. When you go into machine learning, you want to learn from the data, and choosing the model is very tricky. Some model may work, some model may not work. So it's very tempting: let me just look at the data and pick something suitable. Well, you are allowed to do that. I'm not saying that this is against the law. You can do it. Just charge accordingly. Remember that if you do this, and you end up with a small hypothesis set that has a small VC dimension, you have already forfeited the warranty given by the VC inequality; it applies only to the VC dimension of what you started with. And this is a manifestation of, basically, snooping. You snooped into the data. You looked at it in a way that is not allowed. And when you do this, bad things happen. And the formal term for it, actually, in machine learning, is called data snooping. And we will dedicate one-third of a lecture just to describing data snooping. This is the most obvious manifestation of data snooping: you look at the data before you choose the model.
But there are other ways that are so subtle that it's very likely that even a smart person may fall into those traps. And it's very important to understand what these traps are in order to avoid them, and make sure that when you apply the theory, after all of the sweat we went through in order to get these things, it is actually valid. And you can immediately make it not valid by doing these things. So this is the subject of data snooping. And I'm not minimizing the idea of choosing a model. There will be ways to choose the model. When we talk about validation, model selection will be the order of the day. But it will be a legitimate means of model selection. It's a model selection that does not contaminate the data. The data here was used to choose the model, and therefore it's contaminated. It is no longer trusted to reflect the real performance, because you have already used it in learning. So this is the lesson we get. And if you remember, I said linear models are an economy car, and nonlinear transforms give you a truck. And we saw that the truck is very strong. I can go to a very high-dimensional space. I can have a very sophisticated surface. And then I warned you to be careful when you drive a truck. And this is what I meant by that: there are dangers, and you could be well-meaning, and you could simply crash. Instead of having the smaller car that may not be as impressive, but will definitely get you where you want to go. Now we move into the main topic of the lecture, which is logistic regression, a very important linear model. And it complements the two models we have seen so far, linear classification, the perceptron, and linear regression. And there are three pieces. First, I'm going to describe the model, what is the hypothesis set that I'm trying to implement. And then we are going to devise an error measure for it, which is a pretty interesting error measure. And finally, we are going to go for the learning algorithm that goes with it. It turns out that the model is different, the error measure is different, and the learning algorithm is different from what we have seen before. So by the time we have done this, we will have covered enough territory in these variations. And this will really be very representative of machine learning at large. So linear models are not only very useful models to use; they also cover the concepts of many techniques that you will see elsewhere. For example, the learning algorithm here is the same learning algorithm we are going to use in neural networks next time. So let's start with the model. So here is the third linear model. Being linear means that we take your inputs and compute a signal s that is a linear combination of the inputs with weights. And then I take s and do stuff with it. And the stuff could be linear classification, perceptrons. And what was that? In that case, you take your hypothesis to be a decision, plus or minus one. And that decision is a direct thresholding, with respect to zero, of the signal. So you take this linear signal, and this is what you do to it in order to get the output. And that will give you the perceptron. So let's put it in a picture. So here are your inputs, x1 up to xd. This is the genuine input. This is the plus one that takes care of the threshold. They go into the sum. So these would be the weights attached to these guys. And then they are summed in order to give me s. And then one linear model or another will be doing different things to s.
So the first model will take s and pass it through this threshold in order to get plus or minus one. Now, the second guy was linear regression. What did we do to the signal in the case of linear regression? Nothing. We left it alone. That was our output. So if you want to put it in the same diagram, you can. And in this case, you do the linear sum, et cetera, and you get the signal, and then you have the identity function, if you want. You output what you input. And that's what you get. Now when you go to the third guy, the new guy, which is called logistic regression, we are going to take s and apply a nonlinearity to it. The nonlinearity, which we're going to call theta, is the logistic function. It is not as harsh as this nonlinearity. It is somewhere between this and leaving it alone. And it looks like this. So you can see, OK, that's an interesting thing. I am bounded. This is the least I can report; this is the most I can report. So it looks bounded, like this one. It actually looks pretty much like this one, except for the softening of it. But it's real-valued. I can return any real value between this value and this value, so it has something of linear regression in it. And the main utility of logistic regression is that the output is going to be interpreted as a probability. And that will cover a lot of problems where we want to estimate the probability of something. So let's be specific. Let's look at the logistic function theta, the nonlinearity I talked about. OK, it looks like this. So it can serve as a probability, because it goes from 0 to 1. And if you look at the signal, if the signal is very, very negative, you are close to probability 0. If the signal is very, very positive, you are close to 1. And at signal 0, the probability is one half. So that actually corresponds to something meaningful. The signal corresponds to the level of certainty I have about something. If I have a huge signal, I am pretty sure that a binary event will happen. And if the signal is very negative, I'm pretty sure that it will not happen, so the probability is 0. Now, there are many formulas I could use that would give you this shape. This shape is what I'm interested in. And I'm going to choose a particular formula. And the formula is this. It's not that difficult. So I have an exponential here and here. Let's say that you take s to plus infinity, a very large signal. This will be huge. And this will be huge by the same amount. The one will be negligible. So the ratio will get closer and closer to 1. If s is negative, this is very small. This is very small, so the 1 dominates. And you will get something small divided by 1, which is very close to 0. So that's what I get here. And indeed, if you take s equals 0, you will get 1 over 1 plus 1, which is a half, so it will give you this. So this is a nice function. It's indeed odd around one half, if you will. This part is the same as this part. And the reason for taking this particular form is that when we plug it in, and we get the error measure, and then we go for the optimization, it will be a very friendly formula. You could have another formula that has the same shape, and then you run into trouble when you go into the next steps. So this is with a view to what is going to happen. So this thing is called a soft threshold, for obvious reasons. That is, the hard version would be to just decide this or this. So this softens it and gives you the reliability of a decision.
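Here is a minimal sketch of that formula in Python, showing the limiting behavior just described (the specific test values are just for illustration):

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = e^s / (1 + e^s)."""
    return np.exp(s) / (1.0 + np.exp(s))

print(theta(-10.0))  # ~ 0:  very negative signal, probability close to 0
print(theta(0.0))    # 0.5:  zero signal, probability one half
print(theta(10.0))   # ~ 1:  very positive signal, probability close to 1
```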
Think of the credit card application. It used to be that I wanted to say a customer is good or bad. Instead of deciding that a customer is good or bad, which is a binary classification, I ask myself, what is the probability that this customer will be good? Or what is the probability that they will be bad? That is, what is the probability of default? And then I can say 0.8, 0.7, 0.3, and let the bank decide what to do according to this probability. Do we extend credit, how much credit to extend, and so on. So there is a utility for that. And the soft threshold reflects the uncertainty. Seldom do we know the binary decision with certainty. And it may be more informative to give you the uncertainty as part of the deal, and that is reflected in this soft threshold. It's also called a sigmoid, for a simple reason: because it looks like a flattened-out S. So this is an S, so it's called sigmoid. So you'll hear sigmoidal function, or soft threshold, and whatnot, and there is more than one sigmoidal function, or soft threshold. I told you one formula; there are other formulas. In fact, when we go to neural networks, there will be another formula that is very closely related, and we can invent other formulas as well. So this is the logistic function, this is the model, so we know what the model does. The main idea is the probability interpretation. So we have the model. The model is: you take the linear signal, pass it through this logistic function, and that will be the value of the hypothesis at the x that gave rise to that signal. And we are going to interpret it as a probability. So we think that there is a probability sitting out there generating examples. Let's say a probability of default based on credit information. So I'm going to give you an example where it's patently clear that you actually need a probability. It would be absurd to try to predict the thing outright as a decision. And that is the unfortunate prediction of heart attacks. Now, the risk of heart attacks depends on a number of factors. And you would like to predict whether there is a big risk or a small risk. So the kind of input you will have is data that are relevant to having a heart attack. So you can look at the cholesterol level, the age, the weight, the weight of the person, not the weight of the input, and so on. And then the output here would be a probability. Because if I told you to predict whether the person will have a heart attack or not, and you say plus one or minus one, I think this would be laughable. Because there are so many factors that affect it, you would be correct a very, very small amount of the time. And it would be very difficult to tell that your predictions are better than someone else's. Both of you are wrong most of the time. So what we are doing here, we are actually predicting the probability of a heart attack within a time horizon. Let's say that you take this data today, and I'm asking, what is the probability that you will get a heart attack within the next 12 months? That's the game. So you return a number, and that will be reflected by the output of logistic regression. Now, if you look at the signal that goes into this thing, the signal goes from minus infinity to plus infinity, and it's a linear sum of those guys that we take and process in order to make it a probability. Two things to observe. First, this remains linear. That is, you are actually giving an importance, I'm going to call it importance, because there is a weight here.
So you're going to give an importance to the age, an importance to the cholesterol level, and an importance to the other factors. But from then on, all you do is give each one its importance weight, and sum them up. Now, it's conceivable, obviously, that this is bad. Because look at the age: in terms of risk, basically 40 is critical for these things. So it's not really linear. Above 40 or below 40 makes a big difference. So there is a nonlinearity there. Does this bother us? No. We know that we can study the clean linear system, and when the time comes to apply it, we can always transform those into relevant features, and we have the same machinery in place. The other aspect of the signal is that it can be interpreted. You can think of it as a risk score, if you will. Remember the credit score? We had the credit score, and then compared it to a threshold to decide: extend credit, don't extend credit. So this is a risk score. Although we translate it to a probability to make it meaningful, I can tell you: you add this up, and if you're at 700, you're in trouble. If you're at minus 200, you're in good shape, in general. But obviously, in order to interpret them in an operational way, you need to put them through the logistic function in order to get a probability, which can be interpreted. This is the probability that someone will get a heart attack within a certain time horizon. Now, I'd like to make the point that this is a genuine probability. What do I mean by that? You have a hypothesis that goes from 0 to 1. I am interpreting it as a probability. But you could just think of it as a function between 0 and 1. If I give you examples, here is x, and here is the probability, which is a number between 0 and 1, I'm going to learn it. And the fact that you are using it as a probability is your business. I'm just going to take two functions, try to get the difference between them, let's say the mean squared error, and learn. The main point here is that the output of logistic regression is treated genuinely as a probability, even during learning. So why is that? This is because the data that is given to you does not tell you the probability. I don't give you the first patient and say: here are the data, and, since this is supervised learning and I have to give you the label, the probability of getting a heart attack in 12 months is 25%. How the heck would I know that? I can only tell whether someone got a heart attack or didn't get a heart attack. Well, that is affected by the probability, but you don't have access to the probability. So this is a noisy case, where the nature of the example is that I give you a binary output that is affected by the probability. So this is generated by a noisy target. So let's write down the noisy target in order to understand where the examples are coming from. It's the probability of y given x. That is what noisy targets are. And it has the form of a certain probability that the person gets a heart attack, and a certain probability that they don't get a heart attack, given their data. And this is generated by the target that I want to learn. So I'm going to call the probability the target function itself. So the probability that someone gets a heart attack is f of x, and the probability that they don't, since it's a binary thing, has to be 1 minus f of x. And I'm trying to learn f, notwithstanding the fact that the examples I am getting give me just sample values of y that happen to be generated by f.
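A minimal sketch of this noisy-target setup in Python (the particular f values are made up for illustration): the learner is never handed f(x); each example only reports a binary outcome drawn according to f(x).

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_labels(f_values):
    """Draw binary labels y = +/-1 from the noisy target:
    P(y = +1 | x) = f(x),  P(y = -1 | x) = 1 - f(x)."""
    return np.where(rng.random(len(f_values)) < f_values, 1, -1)

f_values = np.array([0.25, 0.50, 0.90])  # hypothetical target probabilities
print(generate_labels(f_values))          # e.g. [-1  1  1]; f itself stays hidden
```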
I want to take the examples and then generate an h that approximates the hidden target function. So you understand the game. That's why it's a genuine probability. It's not only that the output is between 0 and 1; it's also that the examples I am given already have, inherently, a probabilistic interpretation. So the target is from the d-dimensional Euclidean space to the interval [0, 1]. And it is interpreted as a probability. And now you want to learn the final hypothesis, which will be called g of x, which happens to have the form of logistic regression. That's the model we are talking about. So you are going to find the weights, and then you are going to take the inner product with x and pass it through the nonlinearity. And the claim you are going to end up making is that this is approximately f of x, the real guy, the probability here. And you are going to try to make that as true as possible, according to some error measure that we are going to define. And what is under your control, as always with linear models, are the parameters. You change the parameters in order to get one hypothesis or another. So the question now becomes: how do I choose the weights such that the logistic regression hypothesis reflects the target function, knowing that the target function is the way the examples were generated? That's the game. OK. So let's talk about the error measure. Now again, remember that with error measures, we had the proper way of generating an error measure, and then we had plan B. In the absence of a very specific way of paying a price (if you predict a heart attack and it doesn't happen, what is the cost? the guy is alarmed, et cetera; if you don't predict it and it happens, then maybe precautions should have been taken, and whatnot), where you could put prices on everything and do all of the analysis, that is not done here. We are resorting to the case where we use analytic properties: something plausible that makes this look like an error measure, so that if I actually minimize it, I'm going to be doing well. Or something that will be friendly to the optimizer: after I write it down and pass it to the optimizer, the optimizer will have an easy time minimizing it. Well, it turns out that in this case, the error measure that I'm going to describe has both properties. It's plausible and friendly. It's a very popular error measure. So let's construct it. For each point x and y, and remember that y is binary, plus or minus 1, and is generated by the target function f, we have the following plausible error measure. Here's the argument. It is based on likelihood. Likelihood is a very established notion in statistics, not without controversy, but nonetheless it's very widely applied. And the idea of it is that I am going to grade different hypotheses according to the likelihood that they are actually the target that generated the data. So let's be specific. We take your current hypothesis and assume, just for the moment, that it was actually the target function. You have the data, right? The data was generated by the target function. So you can ask: what is the probability of generating this data if your assumption is true? If that probability is very small, then your assumption must be poor. And if that probability is high, then your assumption has more plausibility.
So I can use this to build a comparative way of saying that one hypothesis is more plausible than another, because the data becomes more likely under the scenario of this hypothesis being the actual target function rather than that one. So this is the idea. You ask: how likely? Now, the controversy I mentioned. The controversy is a bit subtle, related to the controversy that I raised early on in the course. The thing you are really trying to find, if you decided to use a probabilistic approach for choosing the hypothesis, is: what is the most probable hypothesis given the data? That would be completely clean. Here, you are asking: what is the probability of the data given the hypothesis? Which is backwards. It has plausibility in it. That's what's called likelihood. It's not exactly the probability we want. And the people who don't like likelihood would add a prior and use a Bayesian approach, which looks principled, but then there is a big assumption in it. This is never a completely clean thing. It could be clean in terms of derivation, but conceptually, there is always a funny aspect to it. But we will sort of swallow that, because it looks very reasonable: if I choose a hypothesis under which having that data is very plausible, it looks like this hypothesis is likely, hence the name likelihood. So this is the probability distribution we have. This is the genuine probability distribution for generating the y. Under the assumption that h is f, the probability of the data, computed using h to generate it, would be my measure under this assumption. So this would be the way for you to define the likelihood. Assume the data was generated by h, compute this probability, and that would be the likelihood of the hypothesis given one data point, (x, y). Now let's use this in order to derive a full-fledged error measure. So what are you going to do? You are going to take the formula for likelihood, this one, which is for one point. And then you have a formula for h of x. You are using logistic regression. So in this case, this thing happens to be that formula. It does depend on x, as you expect it to, and the dependency goes through the choice of parameters w, passed through the nonlinearity theta. Now, I don't like the fact that this is split into cases, because I want something analytic. This is a number that I'm going to take and multiply and take logarithms of and whatnot. And I don't want to keep worrying about cases. So now something comes in handy here, which is the following observation. The sigmoid, the logistic function, happens to satisfy theta of minus s equals 1 minus theta of s. So let's first look at it pictorially and then see why this is useful. Can you verify that? We said that this is odd around one half. If you go with theta of minus s, this would be 1 minus this guy. Very easy to verify, and you can verify it from the formula directly. The good thing about it is that 1 minus theta looks like this fellow, and I only need to add a minus sign in this case. But the minus sign is readily available, because this case goes with y equals minus 1. So it's already crying out for a simplification. And the simplification would be that P of y given x, in general, equals this: theta of y times the signal. What did I do? For the case plus 1, I have it straightforward, because this is plus 1, nothing changes. And for the case minus 1, I have minus the signal, which gives 1 minus theta, and that gives me this formula. So it's summarized by this very simple formula.
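A quick numerical check of that symmetry and of the collapsed formula (a minimal sketch; the signal value is arbitrary):

```python
import numpy as np

def theta(s):
    """Logistic function, in the equivalent form 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

s = 1.7  # an arbitrary signal value w.x

# The symmetry used above: theta(-s) = 1 - theta(s)
assert np.isclose(theta(-s), 1.0 - theta(s))

# So both cases collapse into one formula: P(y | x) = theta(y * s)
p_plus, p_minus = theta(+1 * s), theta(-1 * s)
assert np.isclose(p_plus + p_minus, 1.0)  # the two cases sum to 1
```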
So I have one example, (x, y), and I want to get the likelihood of this w, given that single example. This would be the measure for that likelihood. OK, that's good. Now we want the likelihood of the entire data set. So someone gives you a bunch of patients, and whether they had a heart attack within 12 months of the measurements, and quite a number of them. And now I would like to ask: what is the likelihood of this entire data set? The assumption, as always, is independence from one example to another. And therefore, if I want to get the likelihood of the full data set, I simply, let me magnify this, multiply the likelihoods of the individual ones, from n equals 1 to capital N, covering the data set. So now I need a formula for that, and it's ready, because I already have a formula for P of y given x. All I need to do is plug it in. And when I plug it in, I end up with this thing. That's a very nice formula, because now you realize: OK, I have a bunch of examples. They have different plus-or-minus-one y's that come in here, and different x n's. The same w of my hypothesis contributes to all of these terms. So now you can see that there will be a compromise. If I choose w to favor one example, I may be messing up another one, so I have to find a compromise. And the compromise is likely to reflect that I'm capturing something of the underlying probability distribution that generated these examples in the first place. So now let's see what happens when we maximize this likelihood. We'll write it down, and then we'll work with it. And the maximization of likelihood will translate into the minimization of an error measure, as we have been familiar with. So, first thing: we are maximizing. Remember, with an error measure, we are minimizing. So something will happen during this slide that will make maximization turn into minimization. That shouldn't be too difficult. But let's look at what we are maximizing. We are maximizing the likelihood of this hypothesis under the data set that we're given. So you're given the data set, and you ask how likely this hypothesis is, which means: what is the probability of that data set under the assumption that this hypothesis is indeed the target? And maximizing with respect to what? With respect to the thing that is about to turn purple on the slide: our parameter. We are maximizing this function with respect to w. So now I'm going to play with this. We are maximizing this, right? Can I maximize this instead? This is the natural logarithm. First, it's legitimate to take it, because the quantity here is positive. Theta, by nature, is positive. So I'm not taking the log of 0 or the log of a negative number; those things are not allowed. So first, I am allowed to take it. The second part is that the logarithm happens to be monotonically increasing in its argument. So if you maximize it, you maximize its argument, and vice versa. So I am allowed to do that. I kind of like that. So let me play with it further. Can I do this? Yes, that's just a constant of proportionality. The monotonicity still holds. But you can see where this is going, right? I'm trying to get an error measure. An error measure on the training set. That used to be what? 1 over N times a summation of errors on individual guys. So you can see that this is taking shape. OK, one final thing. Can I put a minus sign in? Not while maximizing; but then all you need to do is, instead of maximizing, minimize. We are cool. So this is the problem. So now let's see what it's equal to.
Miraculously, the log of the product becomes a sum of the logs, and I do this. The minus sign takes the guy under the log and puts it in the denominator. And after all of this very sophisticated algebra, we end up with something that looks suspiciously familiar: 1 over N, a summation from n equals 1 to capital N, of something that involves the value of the example and the parameters that I'm trying to learn. Has anybody seen something like that before? OK, we're going to give it the proper name in a moment. But this is what we have. Now I'd like to reduce this further. So I'm going to remember what theta was. Because theta is a mysterious quantity, I want to put it in terms that I'm completely familiar with. So theta was this guy, e to the s divided by e to the s plus 1. Now I can rewrite this by dividing both the numerator and the denominator by e to the s, in which case this becomes 1 over 1 plus e to the minus s. No surprise. Why is this good? Well, this is good because I have 1 over theta here. So if I substitute it here, the 1-plus will go into the numerator, and good things will happen. So let me substitute and see what happens. Now, because what I'm going to get is very clean and very close to the formula, I'm going to officially declare it the in-sample error of logistic regression. I'm minimizing it, so it's legitimate. And it looks like this: E in of w equals 1 over N, times the sum over n, of the natural log of 1 plus e to the minus y n w transpose x n. This is simply substituting into the above formula. Now, this is very nice. I have this term depending on this example, and I'm summing them up. So I'm completely within my rights, since I'm minimizing this as my in-sample error, to call this fellow what? Call it the error measure, so let me magnify it. This is something that depends on the particular example. I'm going to call it the error measure between my hypothesis, which depends on w, applied to x n, and the value you gave me as a label for that example, which is y n. That is the way we define error measures on points. So this is my formula for the error measure. And under that, maximizing the likelihood is like minimizing the in-sample error. Now, let me leave this for a moment just to mention a point. There is an interesting interpretation here. If you look at w transpose x n, this is what we called the risk score. If this is very positive, the guy is likely to get a heart attack. If it's very negative, the guy is very unlikely to get a heart attack. Now, y n is whether that particular person who supplied the data ended up with a heart attack or not. So let's see how agreement and disagreement affect the error measure. If the signal is very positive, and this guy is plus 1, so the person actually, unfortunately, got a heart attack, then the exponent is minus a lot. And therefore, this is a very small number, and the contribution to the error is small. I'm already in good shape. My predictions are right. However, if the signs differ, if I say this is very positive and this ends up being minus 1, or if this is very negative and this is plus 1, the end result is a positive exponential, and the error will be huge. And I need to do something in order to knock it down. So indeed, it is very intuitive, under that interpretation, that this would be an error measure I would be trying to minimize. What is this error measure called? It is called the cross-entropy error. And I'm putting it in quotation marks, because the way to get it to be strictly cross entropy is to interpret a binary event as if it were a probability.
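As a small sketch in Python, the in-sample error just derived can be computed directly (the toy data at the end is made up for illustration):

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """In-sample cross-entropy error:
    E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w.x_n)).
    X is N-by-(d+1) with a leading column of 1s; y holds +/-1 labels."""
    signals = X @ w                       # the risk scores w.x_n
    return np.mean(np.log(1.0 + np.exp(-y * signals)))

# Toy usage: two examples in 2-D (plus the constant coordinate).
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -0.3, 0.8]])
y = np.array([1, -1])
print(cross_entropy_error(np.zeros(3), X, y))  # ln(2) at w = 0
```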
We'll accept, though, that it's universally referred to as cross entropy, and we'll call this the cross entropy. It's basically between, supposedly, h and f; not really h and f, but h and a particular realization of f. So this is what we want to do. Now we have defined the model, and we have defined the error measure. The remaining order of business is the learning algorithm. I know the error measure; I just want to minimize it. How can I do that? So it's an easy question. If you look at logistic regression, we have just developed the error measure for it. So now I have this function, and I want to minimize it with respect to w. Let's look at a previously tackled case in order to see how we went about that. Remember linear regression? We also had an error function. It was written down directly, just by squaring the difference. We didn't have to go through this long derivation. But the in-sample error ended up being this. Very similar to this one, except there we penalized the difference squared, and here we penalize it according to this cross-entropy thing. Now, in the case of linear regression, if we want to minimize this, we found a very simple way to do it, because we found a closed-form solution for the minimum. That was the pseudo-inverse, the one-step learning. Unfortunately, here we are out of luck. If you take a derivative here, and you try to solve it, there are exponentials, and things come out, and you sum them up, and you cannot find a closed-form solution. Well, in the absence of a closed-form solution, we usually go for an iterative solution. We are not going to go for the solution directly. We are going to improve, improve, improve, improve, and finally, we'll get to the good solution. This is not a foreign concept to us. This is what we did with perceptrons. Although we didn't explicitly do it at the time in terms of a declared error function, we went and tried to improve one example at a time, and kept repeating until we got what we wanted. So that's what we are going to do here. But here, we are going to do it based on calculus. And the method that will be used for minimization can be applied to any error measure, even nonlinear ones, and so on, as long as there is some smoothness constraint. And it will be the one that will be applied to neural networks next time. It's a very famous one, and a very simplistic one. And there are more sophisticated versions of it. It's called gradient descent. So let's see what it is. First, let me show you what the error measure for logistic regression looks like. As you vary the weights, the value of the error changes. But it has this great property that it has one minimum, and otherwise it goes like that. A function that goes like that is called convex. And it goes with convex optimization, which is very easy, because obviously, wherever you start, you will go to the same valley, and whatnot. You can imagine a more sophisticated nonlinear surface, where it does this and that. And then depending on where you start, if you are sliding down, you will end up in one minimum or another, and other issues arise. We will only tackle those when we need them, when we talk about the error measure for neural networks. Right now, we have a very friendly guy. So we're going to describe gradient descent in terms of this friendly guy. So what do you do in gradient descent? First, we admit that it's a general method for nonlinear optimization. And what you do is start at a point, an initialization, pretty much like initializing a perceptron.
And then you take a step, and you try to make an improvement using that step. The step is taken along the steepest slope. The steepest slope is not an easy notion to see in two dimensions, because I either go right or left. There aren't too many directions. So let's do the following. Let's say that I'm in three dimensions. And in this room, I have a very nonlinear surface, going up and down, up and down. I'm going to assume one thing: that it's twice differentiable. That's what you need in order to invoke gradient descent. So you can think of hills and this and that. Now I'm trying to get to the minimum of that surface. First thing to remember in optimization: you don't get to see the surface. You don't have a bird's-eye view where you look at it and say, ah, OK, that region looks good, let's go there. That doesn't happen. You only have local information at the point you evaluate. So the best thing to imagine is that you are sitting on the surface, and then you close your eyes. And all you do is feel around you, and then decide: this is a more promising direction than that. That's all you do at one step. And then when you go to the new point, you repeat, repeat, repeat, until you get to the minimum. That is how all of the iterative methods we are going to use work. So we go back, and we start at w0. And then we look at a fixed step size. I am at a point, and I'm going to move in w space by a certain amount. And I'm going to take that amount to be fixed and small. The reason I'm doing that is because I am going to apply local approximations based on calculus, Taylor series. And I know that these apply well if the move is not that big. If I go very far, the higher-order terms kick in, and I'm not sure that the conclusion I got locally will apply, if I take just the first-order term, or the second-order as in other methods. So for the fixed step size, I'm going to say: OK, I'm moving in the w space, and I am moving by a unit vector v. v hat is a unit vector. So this tells me just the direction. Should I go this way or that way? So if I'm doing the sensing, oh, this is steep, so I'm going to go this way; the unit vector in this direction would be my v hat. And I'm going to modulate the amount of the move by a step size, which I'm going to call eta. So this is the amount of the move. The only unknown I have is: what is v? I have already decided on the size, but I want to know which direction to go. And the formula would be: the next weight, which is w1, will be the current weight plus the move. And I have already decided on the size of the move. So now, under this condition, you are trying to derive what v hat, the direction, is. If you solve for it one way, that gives you gradient descent. Another way will give you conjugate gradient, which has second-order stuff in it, and so on. So that is always the question. So let's actually try to solve for it. We said that we are going to go in the direction of steepest descent, so we are really talking about the change in the value of the error. The change in the value of the error, if I move from w0 to w1, would be E in at one point minus E in at another point. Which two points? w1 and w0. That is, if I decide to move to this guy, this is the amount. So what I want to do is make this guy negative, as negative as possible, because I want to go down, by the proper choice of w1. But w1 is not free. It's dictated by the method, and it has the very specific form that it is the original guy plus the move I made. So here is what I would like to do.
I would like to make this as small as possible. Now, I can write this down using the Taylor series expansion with one term: E in at the original point plus the move, minus E in at the original point, is the derivative times the difference. The derivative times the difference here is the gradient of E in, transposed, times the vector v hat, times eta. I just took eta outside to make it clean. So this would be the change according to the first-order approximation of the surface. If the surface were linear, this would be exact. But the surface is not linear, and therefore I have other terms, which are of order eta squared and up. And the assumption for gradient descent is that I'm going to neglect this fellow, as if it didn't exist. When you go to conjugate gradient, you keep the second guy and neglect the third, and you can see the idea. So now, how do I choose the direction in order to make this as negative as possible? By simple observation, I realize that this quantity, for any choice of v hat, will be greater than or equal to this fellow. So this guy is gone; I'm only dealing with this guy. I'm taking the inner product between a vector and a unit vector. The unit vector could be aligned with that vector, or could be opposed to that vector, or could be orthogonal to that vector, but it doesn't contribute magnitude. Its magnitude is 1. So the most I can get is the norm of this, and the least I can get is negative the norm of this, if they are opposed. So this will be the least I can get, and I inherit eta from here. This is true for any choice of v hat. So knowing that, if I choose the v hat that achieves this, that will be my v hat, because it gives me the most negative value that I can get. Not that difficult to do, because this is a unit vector, and clearly what I want is a unit vector opposed to the gradient. So I end up with this fellow, the formula for it: v hat equals minus the gradient, normalized by its norm, because v hat is a unit vector, and therefore I have to normalize in order to make it one. So that's the solution. And you can see now why it's called gradient descent: you descend along the gradient of your error. Now, we have said it's a fixed step size, and that was a way for us to make sure that the linear approximation holds. We are going to modify eta in a moment. But you can see that there is a compromise. I can get close to a perfect linear approximation by taking the step size to be very small, but then it will take me forever to get to the minimum. I'll be moving, et cetera. Or I could take bigger steps, which looks very promising, but then the linear approximation may not apply. So there is a compromise. So let's look at how eta affects the algorithm. This is the case I talked about. If eta is very small, you will get there, but it will take you forever. And in optimization, it's a very simple game. I charge you for two things: the value you arrived at, and how long it took you to get there. I don't care how you do it. Don't bother me. If you do a beautiful thing, computing the Hessian and taking the inverse and whatnot, that's your business. All I care about is how long it took you to do it, and what value you are going to deliver. So just by this measure, this is not good, because it took me forever. Now I go to the other extreme and make eta large, and all of a sudden I am really bouncing around. You can even do worse.
You can do even worse. Let's say that eta is really large. You start here, and your next step lands way up there on the error surface. You went up instead of down, obviously, because the second-order and third-order terms dominate. So that is not good. If you look at it, you realize that the best compromise is to have a large eta initially, when the surface is very steep and I want to take advantage of that, and to become more careful when I am closer to the minimum, so that I don't bounce around. So here is a rule of thumb; it is not a mathematically proven statement, it is an observation about surfaces. It looks like a very good idea, instead of having a fixed step, to have eta increase with the slope. If I am on a very steep slope, I move a lot, because I'm going down. And when I am close to the minimum, I had better be careful, in order not to miss the minimum and overshoot. Here is an easy implementation of this idea. The direction will not change; this is the formula we had for the fixed step size. Now I make eta proportional to the size of the gradient, so the step is bigger when the slope is bigger. That's very convenient, because the norm of the gradient is sitting right there in the formula, and it cancels out nicely. I get a new constant, the purple eta on the slide, the norm cancels completely, and I have a very simple formula. It is not a fixed step anymore, but a fixed learning rate, eta now being the learning rate. You just compute the gradient and use that learning rate, and that takes care of the previous observation. That's all I'm going to say about gradient descent for this case; the more complicated issues come up when we talk about neural networks next time. So this is how to minimize, and now we can write out the logistic regression algorithm. You start by initializing at w(0), and you iterate. At every step, you compute the gradient. How? I have a formula for the in-sample error; all I need to do is differentiate it. Differentiating it is not difficult; you get the formula on the slide, which you can verify. Then you take the next weight to be the current weight minus your learning rate times the gradient, the formula we had. You go to the next iteration, and the next, until it is time to stop, and then you return the final weight. That is the algorithm (a code sketch appears below). Now let me spend two minutes summarizing all the linear models in one slide, and then we will be completely done with them. We had three models: the perceptron, which is linear classification; linear regression; and, added today, logistic regression. Let's take one application domain, credit, and see how each of them contributes. If you apply each of these to credit analysis, what do you implement? With linear classification, the perceptron, you accept or deny; that was our very first example. With linear regression, you decide the credit line; we have seen that example as well. And with logistic regression, you compute the probability of default, and then you let the bank decide what to do with it. So that is the view from the application domain.
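Here is the promised code sketch: a minimal Python rendering of the algorithm just described, assuming the inputs X already include the constant coordinate x0 = 1 and the labels y are plus or minus 1. The vectorized gradient implements the standard derivative of the cross-entropy error from the lecture; the function names and the stopping threshold are illustrative choices, not the lecture's.

```python
import numpy as np

def gradient(w, X, y):
    # Gradient of E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w.x_n)),
    # namely -(1/N) * sum_n y_n * x_n / (1 + exp(y_n * w.x_n)).
    s = y * (X @ w)                          # signals y_n * w^T x_n
    return -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)

def logistic_regression(X, y, eta=0.1, max_iters=10000, tol=1e-6):
    # X: N x (d+1) inputs with x0 = 1; y: N labels in {-1, +1}.
    w = np.zeros(X.shape[1])                 # initialize at w(0) = 0
    for _ in range(max_iters):
        g = gradient(w, X, y)
        w = w - eta * g                      # fixed-learning-rate step
        if np.linalg.norm(g) < tol:          # stop when the surface is flat
            break
    return w
```

The probability assigned to a new point x is then theta(w.x) = 1 / (1 + exp(-w.x)), which the bank can use as the estimated probability of default.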
Now let's look from the tools point of view. The three models had different error measures. The perceptron had the binary classification error, linear regression had the squared error, and logistic regression had the cross-entropy error: different error measures, each with its own plausible motivation, and we tackled all three. Then there is the learning algorithm that goes with each, and it depends heavily on the error measure you choose. For the classification error, which is a combinatorial quantity, we went for something like the perceptron learning algorithm, or the pocket version if the data is non-separable; there are other, more sophisticated methods. The squared error was the easiest: one-step learning with the pseudo-inverse, and you have your solution. And with the cross entropy, we had gradient descent, which is a very general method; all we needed was a twice-differentiable surface, and we were ready to go. This case was particularly friendly because the surface happens to be convex, so it avoids a lot of the traps we will see next time when we talk about neural networks. I'll stop here, and we will continue after the short break. OK, let's start the Q&A. The first question: for the algorithm, what does the termination criterion mean? Is it that the error does not change, or what criteria do you usually use? When I discuss gradient descent, there are several aspects of the algorithm that need to be discussed. There is the question of the learning rate, which is what I focused on today, because it is the most relevant to the particular application, logistic regression, which has a very well-behaved error surface. There are other questions. One is initialization: at the top of the algorithm, I set the weights to w(0). What is w(0)? How do I initialize? That's a question. It is not critical in logistic regression; initializing to 0 is fine. And if you think about it, initializing to 0 means initializing the probability to one half, the most uncertainty about any example before you learn anything, so it looks like a reasonable place to start. Then there is the question of termination. Termination is an issue here, but less of an issue than in other cases, for a reason I will explain. In general, termination is tricky, and you use a combination of criteria. What do I want? I want to minimize the error. So one criterion: if the surface gets flatter and flatter, to the level where moving from one point to another does not make much improvement, then I must be close to the minimum and should stop. That turns out to be reasonable, but sometimes (not in the convex case, but sometimes) you have a surface that goes down, then flattens, then goes down again. If you use this criterion by itself, you may stop prematurely. You may think this is pathological; it happens more often than you think. So now you say: let me also set a target error. Not only must the changes be small, but if I have not reached the target error I want, I am not going to stop. That gets you over the flat hump until you reach the next descent, and maybe that achieves your target error. Very nice, except that if your target error is not achievable, you will continue forever.
So you patch this up and say: I am going to put a limit on the number of iterations anyway. I will run for 10,000 epochs, and regardless of what happens, I will stop. In practice, some combination of the above works; a small sketch combining these criteria appears below. But the main point is that termination, as a properly analyzed matter, is a bit tricky, because of the many unknowns in the error surface we are dealing with. It will become more of an issue in neural networks than it is here. Another issue in gradient descent that I did not talk about is local minima versus global minima. When you do gradient descent, I said, you close your eyes and roll down the surface, and when you get to a minimum, you know you are at a minimum. But suppose the surface goes down, then up, and then down again to a better minimum. If you start near the first minimum, you go, go, go, and once you reach it, you have absolutely no reason to leave, according to the prescription of gradient descent, because leaving means going up a hill, and that looks like a bad idea. So you end up in a local minimum rather than the global minimum. I did not mention this today because I have a convex function: there is only one minimum, you will get there, and everybody is happy. When we get to neural networks, this is an issue, and it will be addressed. So the short answer to the question: termination is tricky, and a combination of criteria is the best way. My coverage of gradient descent here covers only the part most relevant to logistic regression; the rest of the story will come up when we talk about neural networks. Next question: why was gradient descent picked? Isn't it usually a slow method to converge? Think of it this way. If I could see the surface, I obviously could go for the minimum directly. But I am doing something here that depends only on first-order information. Let's say you are playing golf, and you want to get to the hole; that is your minimum. Gradient descent would be tapping the ball along, one small putt after another. Nobody in their right mind would do that. With a second-order method, your first move is a swing: you may not land exactly at the hole, but you land close, because you have a second-order approximation of the surface, and then you home in. Having said that, gradient descent is a remarkably efficient algorithm to use, especially its stochastic version. In many applications, you just apply gradient descent in a very simple way, and you often get very, very good results. Conjugate gradient, which is the king of the derivative-based methods, is very attractive, and in some optimizations it completely trumps the alternatives. On the other hand, the stochastic version of gradient descent, and its simplicity, make it the algorithm of choice in many applications. Now, although it is not the case for this error function: what happens for gradient descent if there are local minima? If there are local minima and you apply the algorithm faithfully, you will get to the local minimum nearest to where you started. There is a huge amount of research in optimization about local minima and how to handle them, and there are algorithms right and left.
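To close the loop on termination before continuing: here is a minimal sketch of the kind of combined stopping rule described above. The function and all the threshold values are illustrative choices of mine, not prescriptions from the lecture.

```python
def should_stop(err, prev_err, t, target_err=0.01, tol=1e-6, max_iters=10000):
    # Stop only when the improvement is tiny AND the target error has been
    # reached, so a temporary flat region alone does not stop us early...
    converged = abs(prev_err - err) < tol and err <= target_err
    # ...but cap the iterations anyway, in case the target error is
    # unachievable and we would otherwise continue forever.
    return converged or t >= max_iters
```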
From a practical point of view, here is my experience. Let's say I am using neural networks, where local minima are abundant, so on face value it looks like a serious problem. If all you do is run the learning a number of times starting from different initial conditions, that is, one session starting from this point, another session starting from that point, and so on, each session will go to its nearest local minimum. Do this enough times; I am not talking about a million times, even 100 times will do. Then, after all of these sessions, pick the one that gave you the best minimum, and you know which is best by evaluating the error, which is accessible to you. That usually gets you what you want. It will not get you the global minimum. Formally, you can prove that getting to the global minimum is NP-hard, so if you insist on reaching it in every case, it is simply not tractable in terms of computation time. But this very simple heuristic, repeating from different points and picking the best, works pretty well in most of the cases I have seen. And if local minima become a real issue in your application, there is no shortage of optimization methods that deal with them explicitly. The one thing to remember is that avoiding local minima has almost nothing to do with the order of your algorithm. I could be using first order, second order, et cetera, and all of them will happily stop when they reach a minimum, even if it is local. Escaping is an added layer that makes you explore further in spite of the fact that you are at a minimum. Sometimes you have a temperature, as in simulated annealing, and you escape a shallow local minimum to a better one. And there are other methods designed deliberately for this. If your application calls for it, there are methods to help, but there is no method that will fully solve the problem, because the problem is NP-hard. Can I quickly explain what stochastic gradient descent is? I will do that at the beginning of the next lecture. Basically, instead of taking the whole training set at once, you take one example at a time. But I will cover it then, because that is the part applicable to neural networks in general. Can I explain a little more the notion of cross entropy? OK. The formal definition: entropy is a function of a probability distribution, basically the expected value of the log of one over the probability. That is the classical definition of entropy. When you have two different probability distributions, you can get a cross entropy between them by taking the expected value, under one distribution, of the log of one over the other; there are a number of related quantities in the literature, with different definitions and different scopes. But basically, you are relating two probability distributions using logarithms and expected values; that is the common thread. The reason I put the term between quotation marks is that here you are really getting the cross entropy between the h that you are trying to learn and a binary event, something with probability 1 or 0 at a time. So it is a little loose relative to the way cross entropy is usually defined, but it is referred to as cross entropy. (The definitions are written out symbolically below.)
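For reference, here is a reconstruction of the quantities just described; the notation is mine. Entropy and cross entropy:

$$H(p) = \mathbb{E}_{p}\!\left[\log\frac{1}{p(x)}\right], \qquad H(p, q) = \mathbb{E}_{p}\!\left[\log\frac{1}{q(x)}\right].$$

The in-sample error minimized by logistic regression in this lecture is

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\ln\!\left(1 + e^{-y_n\,\mathbf{w}^{\mathsf{T}}\mathbf{x}_n}\right),$$

and each term can be read as the cross entropy between the observed binary outcome (all of the probability on the label y_n) and the hypothesis h(x_n) = theta(w^T x_n).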
OK, so the next question is why a method like binary search wouldn't work to find the minimum quickly. Binary search works well once you decide on a direction. I am in a space; think of a 1,000-dimensional space. Where do I move? If you decide on a direction, and you say it is very good to go along this direction, then it becomes a legitimate question: that is the direction, how far should I go? We used a very crude method, a fixed step, and then, by looking at it more closely, decided it should not really be fixed but proportional to the gradient, which gave us a fixed learning rate. One can get more sophisticated and say: let me explore along that direction until I reach the minimum along it. There are line search methods, binary search among them, for doing exactly that. Again, when you judge one optimization method against another, do not get excited about the sophisticated method, because you will be charged for it. If I have to evaluate a lot of points, I have to show a much better value in return. If I evaluate many more points and gain only a small difference, I lose the optimization game, because I used CPU cycles without improving the error enough. So whenever you look at a method, whether it pays off is a very practical question. For example, second-order methods: hands down, approximating the surface to second order is better than first order. So why don't we always do that? Because to use the second-order approximation you have to compute the second-order derivatives, and that is a full matrix called the Hessian. If I do that outright, I will get a better minimum, and I will get there quickly in terms of the number of steps, but each step becomes very expensive. Conjugate gradient, which I mentioned very quickly, is a way to use second-order information without explicitly computing the Hessian, which is why it is effective and famous. So it is used, but be careful where you use it. Can logistic regression be applied in a multi-class setting? Yes. Look at the full picture: this is the linear model as we covered it. The types of functions we produce are binary, real-valued, or bounded real-valued, specifically a probability. There are obviously other classes of functions, but in many cases they can be derived in terms of these. Take the multi-class case being asked about. Remember recognizing the digits we talked about? We had 10 digits, 0, 1, 2, up to 9, in the zip codes, and we wanted to classify them. What did we do? We used the perceptron. Wait a minute, the perceptron does a binary thing. How did we do it? We did what is usually done for multi-class problems. Instead of deciding one versus two versus three versus four all at once, we either take one class versus another class, say one versus five, two versus three, et cetera, and then combine the decisions; or we take one versus all: recognize one from the rest, two from the rest, three from the rest, and so on (a small one-versus-all sketch appears below). And there are other methods. Many multi-class approaches build a tree of sorts out of binary decisions, and that is how the binary models are applied. There are other settings, such as ordinal regression, where the classes have a specific order, and those are dealt with differently. So there are modifications to these models that accommodate other kinds of outputs.
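For instance, here is a minimal one-versus-all sketch built on the logistic_regression function from the earlier code block; everything here is my illustration, not the lecture's.

```python
def one_vs_all(X, y_multi, classes, eta=0.1):
    # Train one logistic regression per class: class k versus the rest.
    return {k: logistic_regression(X, np.where(y_multi == k, 1.0, -1.0), eta)
            for k in classes}

def classify(models, x):
    # Pick the class whose model gives the largest signal w^T x, which is
    # also the largest estimated probability, since theta is monotonic.
    return max(models, key=lambda k: models[k] @ x)
```

For the zip-code digits, classes would be range(10), and each of the 10 models estimates the probability that an image is its digit rather than any other.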
The multi-class case is the easiest of these, because the machinery is ready for us, and we have seen an example of it. What other sigmoid functions can be used instead? We will use one next time; I am glad you asked. The logistic function goes between 0 and 1. We are going to use the tanh, which goes between minus 1 and plus 1, and that will be the neuron's nonlinearity in neural networks next time. I will discuss it a little then, so you get a feel for it. It is also based on exponentials; the tanh is essentially a scaled and shifted version of the logistic function, so it is very close to the sigmoid. You can use other functions. It turns out that, from an analytic point of view, using the exponential-based soft thresholds has advantages, as we saw: we got a simple formula for the error, and the error measure that resulted was convex. That is not guaranteed for every choice, so there are criteria for the choice. It is not just about a good-looking formula; it is a formula that has to survive the whole chain of processing we go through when we do the learning. There is a conceptual question about how eta is derived, but the learning rate is not really derived. The way I described it, at zeroth order I had a fixed eta, a fixed step. Then I made the step proportional to the gradient, and that gave the fixed learning rate. You can take that and make it more sophisticated, and you get an adaptive learning rate. It is a very simple heuristic: you take a learning rate and do a step, and if the step is successful, you say, maybe I can afford a bigger learning rate, so you increase it. You keep doing that until you hit a point where you used too big a learning rate, because after your one-step minimization you ended up with a bigger error value; obviously you went too far, and in that case you shrink the learning rate. You can do this adaptively, and it does buy you time. There are lots of heuristics that are add-ons to plain-vanilla gradient descent, and one of them has to do with the learning rate. When you go to conjugate gradient, which has second-order aspects, you can eventually interpret the method as having a principal direction and a principal learning rate; that is one way to look at it. But for gradient descent, the learning rate is really chosen heuristically. There is a rule of thumb I will mention: certain values work in many cases, and for this case there is a particular value that works. But again, this is a practical observation, and other people may have different experiences. Going back to the first part of the lecture, and a few lectures back: you gave the example of character recognition, and you chose features like symmetry and, I don't remember, something else. How much are you charged, in terms of the VC dimension, for choosing those features? I am glad this question was asked. That was the nonlinear transformation, and we called the new space the feature space, and those quantities are features. There are basically two types of features. With the first type, I am generically looking for a more sophisticated surface, because I realize the points will not be linearly separable, so I cannot separate them in the original space.
The other type is along the lines we started with: meaningful features, like symmetry in the case of the digits. Or, say, years in residence as an input to the credit decision. You may not want it as a linear variable; instead you ask, is it more than five years or less than five years? Those are meaningful features. So the key distinction you need to make in your mind is: did I choose the feature by understanding the problem, or did I choose the feature by looking at the specific data set that was given to me? The latter is the problem. If I look at the data and then choose features, then I am doing the learning myself, at least the first stage of it, and therefore I am really working with a bigger hypothesis set than the one I end up with. And therefore, as I said, the VC warranty is forfeited in that case. If instead you look at the problem and derive features that are meaningful, depending not on the data set you are going to learn from but on general understanding, for example: looking at credit, years in residence seems relevant, but I don't think the effect is simply proportional; I think the thresholds are five years for stability, less than one year for not so much stability, and so on, so I will derive those. Then you are charged absolutely nothing for doing that. More power to you. You may have helped the learning algorithm by taking properties of the learning problem you are working on and getting a better representation of it. And this is an art, purely an art, and it depends on the application domain. What I warned about very explicitly is deriving features from looking at the data. To state the warning precisely: if you derive features based on the data and still think that the final hypothesis set you ended up with is what dictates the generalization behavior, that is where the fallacy lies. OK, there is a question: is it possible to choose parameters automatically? I am guessing they are referring to the learning rate. Oh, no, wait, sorry, they corrected it: how to select the features automatically? So, back to the previous answer. Automatically is what we are in business with; it is machine learning, things are automatic. But then feature selection becomes part of learning. Here we had an explicit nonlinear transformation. When we go to neural networks, we will find that they choose features automatically, in a sense. But that is part of learning, and you are charged for it in terms of the VC dimension. That question will probably be better answered and understood when I talk about neural networks and hidden layers, and we see what that means. OK, I think that's it. Very good. Then we'll see you on Thursday.