 The following program is brought to you by Caltech. Welcome back. Last time we talked about linear models. And linear models share what we would refer to as the signal, which is this formula. It's a linear sum involving the input variables and weights that can be put in vector form. And all linear models in one form or another have that as their basic building block. And you can have a linear classification system, like the perceptron, that uses that signal and takes the sign of it to make a decision, plus or minus one. Or you can take something like regression, which is real-valued, that takes the signal as it is, and has that as output. We looked at the linear regression algorithm, which was a particularly easy algorithm. All it does is take the inputs and put them in a particular matrix form, and likewise the outputs; these are the inputs and outputs of the data set. And then by computing this very simple formula, in one shot, it can get you the optimal value of the weight vector. If you look at linear models, you can think of them as an economy car. They get you where you want to go, and they don't consume a lot of gas. You may not be very proud of them, but they actually do the job. If you want a luxury car, wait until you get to support vector machines. And you'll have to pay the price for that. For linear models, it is remarkable how often they succeed on their own, and they are sufficient to get you the learning performance that you want. So I urge you to give linear models in general more attention than you would otherwise give, and try to use them first when you face a learning problem, and see if they will actually achieve what you want. To strengthen linear models even further, we introduced nonlinear transformation. And the idea behind it is that the signal is not only linear in x, which is what you would think of as the reason we call these linear systems, but it is actually linear in w, the weight vector. 
And the reason this is important is that learning actually modifies w in the learning process until it gets to the optimal one, while x, which you usually think of as a variable, is actually a bunch of constants, which are the data set that is handed to you. So the linearity in w is the key point. And if you take x and transform it in any way you want to another vector z, in a very nonlinear way, if you want, this will still preserve this linearity, the linearity in w. Obviously it will not be linear in x. And that is all that matters for you to apply the machinery that we got here, like the simple linear regression algorithm. And we took an example where we had the two variables, x1 and x2, and we transformed them nonlinearly to x1 squared and x2 squared. And we found that in this case, the transformation helps us separate the data, where if we worked in the original space, we would not be able to. So this time I'm going to talk about error and noise. And these are practical considerations that we have to take into account when we consider real-life problems. And we are going to modify the learning diagram that we have by incorporating the notion of error and the notion of noise. And I will do that for the bulk of the lecture. However, my starting point will be to wrap up the nonlinear transformation that we started last time. So let's look at what we had last time. We had this space, let me magnify it a little bit. So this space is the original x space, and the origin is in the middle. And you have these points, which are your data set. And each point belongs to that space. And as you realize, there is really no way of separating the blue from the red using a line, which is what linear models do. And the idea for us was, OK, let's do a nonlinear transformation. We called it phi. And if you look at what happened here, let's look at both of them at the same time. This is where you took x1 squared, so this is x1, and this is x2. 
And the transformation here was x1 squared, so this would be x1 squared, and this x2 squared, which we are going to label as z. The transformation is that you take every point xn in the input space, you put it through the transformation phi, and you get the corresponding point zn, and now you are working in the feature space or the nonlinear space z. When we did this, we realized that a data set like this can become linearly separable in the new space. And that allows us to apply the linear model algorithm here. And when you do that, you will get a separating boundary here. And that separating boundary is obtained by applying your simple linear model, like linear regression, or in this case linear classification, the perceptron, to the data in the z space. So that's what we get. But we are not working in the z space. When I give you a test point, it will be x. And you have managed to separate things in the z space. The way you do it is you go back to the input space. And as you realize, I am putting the inverse transformation between quotation marks, because the transformation in principle may not have an inverse. There are some points in the z space that may not be the mapping of any point in x, and some points in the z space which may be the mapping of more than one point. And therefore, in spite of the fact that phi is a mapping, a function, phi minus 1, as we call it, is not. But when you apply this figuratively, what you are going to get is a separating surface in the x space that is not linear. And that was obtained by applying purely linear methods. And therefore, you can classify a new point by applying g of x, which would be the hypothesis that you find here, the linear one, which happens to have that formula. So you look at the diagram altogether. And this is basically the cycle you have when you are doing the nonlinear transformation. You take the data set, transform it, classify it, and interpret it. 
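As a rough illustration of the cycle just described (transform, classify in z space, interpret back in x space), here is a minimal sketch in plain Python. The toy data set, the circle of squared radius 0.5, the margin filter, and the names `phi`, `perceptron`, and `g` are all assumptions made for this example; they are not taken from the lecture slides.

```python
import random

def phi(x1, x2):
    """Nonlinear transform from x space to z space: (1, x1^2, x2^2).
    The leading 1 plays the role of the constant coordinate."""
    return (1.0, x1 * x1, x2 * x2)

def sign(v):
    return 1 if v > 0 else -1

def perceptron(zs, ys, max_passes=1000):
    """Plain perceptron learning algorithm, run entirely in z space."""
    w = [0.0] * len(zs[0])
    for _ in range(max_passes):
        mistakes = 0
        for z, y in zip(zs, ys):
            if sign(sum(wi * zi for wi, zi in zip(w, z))) != y:
                # Standard perceptron update on a misclassified point.
                w = [wi + y * zi for wi, zi in zip(w, z)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def g(w, x1, x2):
    """Final hypothesis in x space: sign of w . phi(x)."""
    return sign(sum(wi * zi for wi, zi in zip(w, phi(x1, x2))))

# Hypothetical toy data: +1 inside a circle, -1 outside. It is not
# linearly separable in x space but it is in z space. Points close to
# the boundary are dropped so the perceptron converges quickly.
random.seed(0)
raw = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
data = [((a, b), 1 if a * a + b * b < 0.5 else -1)
        for a, b in raw if abs(a * a + b * b - 0.5) > 0.1]

w = perceptron([phi(a, b) for (a, b), _ in data], [y for _, y in data])
training_errors = sum(g(w, a, b) != y for (a, b), y in data)
print("weights in z space:", w)
print("training errors:", training_errors)
```

Note that a new point is never mapped back through an inverse of phi. It is simply pushed forward through `phi` and classified by the linear rule found in z space.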
In reality, when you get the new x, what you are going to do, you are going to take the new x wherever it might be. You are going to transform it, and then look here where it lies and classify it accordingly. So it's a very simple procedure. And as you can see, although we are illustrating here in a case where you are going from two-dimensional to two-dimensional, you could, in principle, go from two-dimensional to a hundred-dimensional with highly nonlinear coordinates, and the same principle will apply. You will be classifying here with a hyperplane in that case. And then this surface would be very, very complicated. It could be completely jagged and whatnot. And that enables you to implement a lot of sophisticated surfaces. OK. So let's look at the nonlinear transformation and ask ourselves what transforms to what, to make sure that all the notions are clear. The first thing is the input point x. This is a single point that is represented by its coordinates x1 up to xd, together with the mandatory constant x0, which equals 1, that takes care of the threshold term. So this is a general representation of a point in the x-space. What does this transform to? I'd like you to think before I give the answer. Well, it transforms to a z. That is a vector. Each of these coordinates, let's say z1, is a nonlinear function, potentially nonlinear function, of all of the x's, of the entire vector x. So for example, this could be x1, x2, e to the x3. The next one would be 1 over x2 times x3 cubed, whatever it is. And you can go on and on and on and on and on. And there is really no limit. If we thought of linear methods here as an economy car, this could be a truck. This could be actually an 18-wheeler. And we must be proud of that, because with such a simple method, we are able to create such a strong machine. But be careful, because you may not be able to drive it. And if you do the wrong transformation, you will end up crashing. In this case, generalization-wise crashing. 
That is, although you did everything right, and you did this transformation, and this is a powerful machine, you don't know how to drive the powerful machine, and you end up with very poor generalization. And we will need the theory in order to get our driver's license. That will tell us what to do in order to be able to drive this machine. OK, so that is x. Now, what do x1 up to xN go to? Remember, this is the data set, the inputs of the data set. Each of these guys by itself is a full vector x that has all of these coordinates. So this is the data set we have. What does it transform to? Not surprisingly, z1 up to zN. So you end up with the same number of points. That is obvious. And each of them is a vector. And each vector can be very long, according to the transformation you chose. Next one, the labels. The data set comes with inputs and outputs. So the inputs, I did the transformation. What do y1 up to yN transform to? Well, they transform to y1 up to yN. These are untouched. These are the values. They are not touched. And these are the ones we learn. If it's classification, they are plus 1 or minus 1, exactly the same way they were there before. How about the weights? When we use linear models, we have a weight vector. So what are the weights when we are in the x space? The answer is that there are no weights in the x space when you do a nonlinear transformation. The weights are done in the z space. And I label the weights here as w tilde. And I'm using tilde as nonlinear, so that you remember it's a nonlinear space. And everything here is tilde. So this is w tilde. And if you look here at the dimension, you may not have seen it. Let me magnify it. The dimensionality here is also d tilde. So whenever we need to distinguish between z and x, we will add tilde to the z counterpart so that you are not confused about which is which. So we have those weights. And finally, you ask yourself, OK, I have done all of this machinery. 
Could you please tell me what is the hypothesis that I'm delivering to my customer? We're still calling it g of x, the final hypothesis of your learning process. And it happens to be exactly the same way, except in z space. So you take the linear form here and take the sign, and that would be your hypothesis. Except it's a little bit annoying, because this is g of x, and you are telling me this is w tilde transpose times z. Where is x? Don't worry. Here is x. What z is, is the transformation of x. So when you want to evaluate this for any x, all you need to do is plug into this formula, and you are ready to go. That's the entire story of the nonlinear transformation. OK? Now, with that out of the way, let's go to the main topic of the day, which are error measures and noisy targets. So when we face a real learning problem, we realize that there are practical components in real life that we have not fully taken into consideration. And what I'm going to do, I'm going to take the learning diagram, which we introduced in the first lecture, and then I'm going to adjust it according to these practical components until we get the final general form of the supervised learning diagram. That will take us through both topics, which are the error measures and the noisy targets. OK. So here is the learning diagram in case you forgot what it was. That's where we left it. Let's see what it looks like. Remember, it's a pretty simple diagram, and we built it from scratch. I need to rebuild it in order for you to realize that I'm not just flashing a jungle on you. It has a sense. This is what we are trying to learn. It's an unknown target function, represented to us by training examples. We have a learning algorithm that will take these examples and produce the final hypothesis. All of this is nice. We said that the learning algorithm is picking the hypothesis from the hypothesis set. And we said that this is a convenient technicality that has no loss of generality. 
So we accepted that we will always have a hypothesis set. And then we went into the feasibility of learning, and we realized that for that to happen, we need to introduce a probability distribution on x, any probability distribution, and generate the points x1 up to xN, which constitute the inputs to the training examples, using this probability distribution independently. Once you do that, you get the benefit of Hoeffding, and you can make a statement that what you are going to do out of sample is reflected by the in-sample. That's where we stood. This is the diagram we are going to be modifying piece by piece. Let's talk about error measures, the first notion. Error measures try to answer the following question. What does it mean for h to approximate f? You have two functions, and you say this is a good approximation, this is a bad approximation. Is it a qualitative statement, or is it quantitative? It is quantitative, and because it's quantitative, we are going to define an error measure that measures how well or how badly h approximates f. So the error measure will be defined as E of two guys, and these will be h and f. It returns a number for any two functions you plug in. One of them will be the target function. One of them will be a hypothesis of interest, and you ask yourself how badly in this case does h approximate f, and you get an error. If the error is zero, then h perfectly reflects f, and you are home-free. If there is an error, then maybe you need to look for another h that has smaller error. That formalizes the question of search in learning into minimizing an error function. We call it error function, and we call this error measure. It is neither a measure in the measure-theoretic sense nor a function. This is actually a functional, but we are not worrying about that. 
We just take these objects and return a number, and we refer to it as a function, and we talk about error measure in the sense of the English word measure, not the mathematical measure. So the error function in principle returns a number for a pair of functions, but it is almost always defined in terms of the difference on a particular point, and then you put these points together. That's the point-wise definition. In this case, you define a small e that goes with the capital E that also takes two arguments, and these two arguments are the value of h at a particular point and the value of f at the same point. That makes sense. I'm trying to compare functions. I want them to be the same on the same point. Therefore, if I compare them for every point this way, then I will get something meaningful, and then I will need to do something about the different small e's that will get me the big E. So although this is not strictly required by the definition, and you could have a crazy error function that does not reduce to corresponding points, invariably this is what you are going to do. Have we seen this before? Yes, we have. Remember the squared error? How do we formalize it in this sense? We can say that the error in this case is h of x minus f of x, squared. That's what we did, and that is indeed an error function that measures the difference between the two, and indeed, if the error is zero, it means that h of x equals f of x, and we have exactly what we want. We also saw it before, although we didn't explicitly talk about it in those terms, when we talked about binary error. Every point could be either right or wrong, and that doesn't look like a quantitative error function. It's binary. However, you can also put it in those terms as follows. We agreed on this notation, let me magnify it for you. This notation, which is the funny bracket, means that you return one if the statement enclosed here is true, and you return zero if it is false. That's a standard notation. 
If you take this as your error function, what will happen? If h of x equals f of x, then this statement is false, and you return zero, so the error is zero. Good. And if this statement is true, you return one, and indeed, in that case, you are making an error. So it's a binary error because of this. And if you take it as a formal error, and do the rest of the development to get the other one, the big E for the global function, you will find that this is exactly what we did when we were talking about frequency of error and probability of error. That will become clear in a moment. So now let's move from the point-wise version, where you define it on one point in the space, to defining the error function, the capital E, on the entire space. So the way it is done is that the overall error, which has this notation, will always be the average of point-wise errors. So you take these point-wise errors and average them. And all we need to do is articulate what we mean by average, in order to get that. So let's look at the in-sample error. When we have the in-sample error, this is the formula for it. And now you think of in-sample error as the in-sample version of this. Because now we are going to use the point-wise error that goes with that error measure in defining the in-sample error. If you take a single point from your training set, you will be having small n going from 1 to capital N. So one of them, generically, is small n. And I'm putting it in red because that's what we are going to average with respect to. So you compute this error measure, whatever it may be, squared error, binary error, any other error you can think of. And now you get the average. And the average in that case will be the simple average, which is 1 over N times the sum. So this is indeed consistent with what we thought of as the training error. 
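To make the step from the point-wise error to the in-sample error concrete, here is a small sketch in plain Python. The four-point data set and the threshold hypothesis `h` are made up for illustration, and the function names are hypothetical.

```python
def squared_error(h_x, f_x):
    """Point-wise squared error e(h(x), f(x)) = (h(x) - f(x))^2."""
    return (h_x - f_x) ** 2

def binary_error(h_x, f_x):
    """Point-wise binary error: 1 if h(x) differs from f(x), else 0."""
    return 1 if h_x != f_x else 0

def in_sample_error(e, h, xs, ys):
    """E_in(h): the simple average of the point-wise error e over the
    N training points, where ys[n] holds the target value f(x_n)."""
    return sum(e(h(x), y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training set and hypothesis, for illustration only.
xs = [-2.0, -1.0, 0.5, 3.0]
ys = [-1, 1, 1, 1]

def h(x):
    return 1 if x > 0 else -1

e_in_binary = in_sample_error(binary_error, h, xs, ys)
e_in_squared = in_sample_error(squared_error, h, xs, ys)
print(e_in_binary)   # h is wrong on one point out of four: 0.25
print(e_in_squared)  # that one point contributes (-1 - 1)^2 = 4: 1.0
```

With the binary error plugged in, the average is just the fraction of training points on which h disagrees with the target.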
And if you go back to the binary error, the one with the funny bracket, and you look at what this formula will return, it will return exactly the frequency of error in the sample. Correct? So now let's go for the out-of-sample error. Again, the out-of-sample error is the out-of-sample version of this error measure. Now in this case, the point is a general point in the space. So we are labeling it as x in general, so this could be any point in the space that is picked from the script X, which is the input space. And in order to get an average in that case, what you do is you get the expected value, in this case, with respect to x. So that is the average for the out-of-sample case. And again, if you take the binary error and you take the expected value of this, this will be identically the probability of error overall. And we are using the probability distribution over the input space X in order to compute this quantity. So that's how we get from a definition that you invoke on a single point to the in-sample and out-of-sample versions. Now let's revise the learning diagram with this added component. Here is the learning diagram. There is nothing that changed here, except that now this is the standard color, because we already got used to it. The red stuff is the new stuff. So you have it here, and you want to add the error measure. I'd like you to think for a moment what we are going to do. We're going to take two steps. The first one is to realize that we are defining the error measure on a point. So here's the addition. The addition is that in deciding whether g is close to f, which is the goal of learning, we test this with a point x. And the criterion for deciding whether g of x is approximately the same as f of x is our point-wise error measure. Furthermore, this x is created from the space using something very specific. And that is, it comes from the same probability distribution that generated those points. This was implicit when we talked about the bin. 
Mu was the probability distribution in the bin, and nu was the sample distribution in the sample that you pick. When you test the system that you train using a certain probability distribution, I ask you to test it with points drawn from the same distribution. That's the only requirement in order to invoke Hoeffding or the counterparts of Hoeffding for more elaborate types of functions. Now, if you do that, then you have the guarantee. And it does make sense, with the benign assumption that P can be any unknown probability distribution. All you are asked to do in order to get the guarantees that we talked about is use it to generate the examples and use it to test the hypothesis. So that is what we have. Now we come to the question. I understand what the rule is. I'm going to define the error measure point by point. I know how to move from a point to the in-sample and to the out-of-sample. Now we come to the crux of the question. How do I define the error measure? What is the number that I should return for h differing from f on a particular point? I will give you an example to make the point. And my example will be fingerprint verification. So you declare yourself, and you want to authenticate yourself. So you put your finger, and the system will decide whether it's you and return plus 1, or it's an intruder and return minus 1. That's what the system does. And we would like to see how to define an error measure in this case. There are two types of error that you can make when you have a system like this. One of them is called false accept. I think it's self-explanatory. False accept, meaning that someone who shouldn't be there was accepted. He was falsely accepted. So the intruder got in. That's an error. The other type is false reject. You are the owner of the system, the one who paid for it, you put your finger, and you are rejected. And you are mad at them. That's a false reject. 
Now, in defining an error measure, I like this case because there is a great intuition about what is going on. So if we can come up with a meaningful error measure here that captures both the false accept and the false reject, we will have a handle on what the error measures are all about. So how do we penalize each type? That's what you do. When you give an error, you penalize it, such that the error is large. So you move away from that hypothesis to get a better hypothesis that is not penalized as much. Now, we can put it in a matrix form. So this is the target. This is the perfect system. This returns plus 1 whenever it's you, and returns minus 1 whenever it's an intruder. That's our dream system. We don't have that. We are going to use machine learning using examples. And we are going to come up with a hypothesis. This may not be the final hypothesis. We are talking about the general hypothesis here. When it becomes the final hypothesis, we are going to call it g. So h could be plus 1 or minus 1. Plus 1 means you are accepted, or the person is accepted, whoever he may be. And minus 1 means they are rejected. So now let's try to put under the four possibilities here what the errors would be. First, the easy one. The diagonal corresponds to no error. And I'm putting it in faint color, because in that case it's clear that you are going to make zero error. It's you and you are accepted. It's not you and you are rejected. That's fine. What's interesting are these two. We need to get a number for the false accept, which means it's an intruder, but they were accepted. Or the false reject, which means it's you, but you were rejected. If I can come up with four numbers here, two of which I already know, then I have the answer. The key message I am conveying with this discussion is that there is no inherent merit to choosing one error function over another. It's not an analytic question. It's an application domain question. And I'm going to argue for that. 
So let's look at the error measure for this problem for the important application of supermarkets. So what happens with supermarkets? Well, supermarkets decide to become fancy, and when you go and you want the discount for your special program, you not only declare yourself, they need to verify that it's you, because recently too many people just claim any number and get the discount. They want to control it a little bit. So at the checkout, you identify yourself, and then you put your finger, and then the system will verify you or decide that you are an intruder. Now, given this application, let's try to see false accepts and false rejects and how to penalize them. The false reject in this case actually is costly. Think of it this way. You are a customer. You go out, and you have this huge bag of stuff, $100 worth of stuff, and you think you are an important customer to the supermarket. You put your finger, you are rejected. You put your finger again, you are rejected. Put your finger again, you are rejected. You are embarrassed in front of the entire queue, and in your mind you say, the heck with this supermarket, I'm going to go to the competitor. So when they have a false reject, they run the risk of losing a customer. The customer gets annoyed. False accept is not that big of a deal. Someone comes in and claims to be you, and the system passes them. What is the downside? They got a discount. One more discount. For the business, it's not that important. And furthermore, if you think about it, that must be a very brave person, because they are an intruder, and they left their fingerprint on the system. That's part of the deal. All to get a discount. I think really they are in trouble. So we really shouldn't penalize the false positives that much if that will help us with the false negatives. So if you look at a matrix that fits this situation, this one qualifies. I'm going to penalize false rejects. That's not a question. But I'm just going to penalize them just by one. 
When it comes to the other one, which are the false accepts, let me try again. This is the false reject. It's you, and you are rejected, and you are penalized dearly. And this one is the false accept. It's not you, and you are accepted, and therefore you give it less weight. So this looks like a reasonable matrix for that case. Now let's look at the exact same system. You are given training examples. You are told that the target function is fingerprint verification, and you go about your machine learning algorithm. Now one of them is for a supermarket, and the other one is for the CIA. Let's see the situation here. So now what is the CIA going to use the system for? It uses it for security. It uses it for verification that you are an authentic person authorized to do something, which could be entering the building, or could be looking at a document. So you put your fingerprint first. Now let's look at the false accept and false reject. You have to agree with me that false accept in this case is an unmitigated disaster. Someone got access to something that they are not authorized for, and national security is at stake. That's a no-no. False reject in this case can be tolerated. You are not a customer. You are an employee. It's you, but the system rejected you. Just try again and again and again, because you are paid to be here. Just take the inconvenience and do this in order to avoid the false accept. So in this case, it makes sense that we are going to put the weights in exactly the opposite way, even more extreme, and you will have a matrix that looks like this. If you are the wrong person and you are accepted, that's a false accept, that's a huge penalty. The other one, you put a much smaller penalty. If you are really cruel to your employees, you put this as 0.1 or 0.001, and then have them really go for this thing 20 times until they are accepted. 
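The two penalty matrices described above can be sketched as lookup tables. The specific penalty values below (10 and 1 for the supermarket, 1 and 1000 for the CIA) are illustrative assumptions and are not stated in this transcript; only their ordering follows the argument. The two candidate systems, `lenient` and `strict`, are also hypothetical.

```python
# Keys are (f, h): f is the truth (+1 = you, -1 = intruder) and
# h is the system's decision (+1 = accept, -1 = reject).
# The numeric penalties are assumed values chosen for illustration.
SUPERMARKET = {(+1, +1): 0, (+1, -1): 10,   # false reject is costly
               (-1, +1): 1, (-1, -1): 0}    # false accept is cheap
CIA = {(+1, +1): 0, (+1, -1): 1,            # false reject is tolerated
       (-1, +1): 1000, (-1, -1): 0}         # false accept is a disaster

def average_penalty(penalties, decisions, truths):
    """Average point-wise penalty of a hypothesis over a sample."""
    total = sum(penalties[(f, h)] for h, f in zip(decisions, truths))
    return total / len(truths)

# Two hypothetical systems evaluated on the same six attempts.
truths  = [+1, +1, +1, -1, -1, -1]
lenient = [+1, +1, +1, +1, -1, -1]   # one false accept
strict  = [+1, -1, -1, -1, -1, -1]   # two false rejects

for name, matrix in (("supermarket", SUPERMARKET), ("CIA", CIA)):
    print(name,
          "lenient:", average_penalty(matrix, lenient, truths),
          "strict:", average_penalty(matrix, strict, truths))
```

The same two systems are ranked in opposite order under the two matrices, which is the point of the example: the choice of error measure comes from the application, not from the learning problem itself.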
But in general, you can see that this has to be a much bigger number than this, whereas in the supermarket case, this was a bigger number than that. So the take-home lesson is that when you are dealing with a practical learning problem, the error measure should be specified by the user. You are going to deliver a system to them. The system is not perfect. They want the target function, and you give them a hypothesis instead. You should ask them, how much does it cost you to use my imperfect system in place of the perfect system? That is their decision to make, and if they articulate that as a quantitative error function, this is the error function you should work with. However, this does not always happen. People may not have the formalization that will capture the error measure in reality, and sometimes they capture it, but it's very difficult to optimize. There are other considerations. Now, what I am going to talk about are the alternatives to this approach, and the alternatives are a compromise. They are a very common and popular compromise, and people indulge in them. I don't mind indulging in them, because actually there are some nice properties to them. But you should always remember, this is the second choice. If we knew what the error measure that needs to be used by the user is, we would use that. So here are the two alternatives. You don't have the user-specified error measure. Then you resort to plausible measures, measures that you can argue analytically have merit. Usually, the analytic argument starts with an assumption that is usually a loaded assumption, and from then on, the going is very smooth. But nonetheless, in the absence of a meritorious error measure, we might as well resort to that. We have seen an example of that, which is squared error. 
If you look at squared error, you can say that if the noise is Gaussian (I didn't do that in the lecture, but it's not a difficult thing to imagine), then the corresponding error measure in this case would be squared error. So that is the plausibility of it. And you can take other cases, for example, the binary error. You can go and get a cross-entropy type of error corresponding to the binary error and whatnot. So these guys have an error measure that goes with them. The other approach is not to have a plausible measure, but to have a friendly measure. You're not justifying that this is a meritorious measure. You're just using it because it's easy to use. And we have also seen that. For example, we can get a closed-form solution if we choose a particular error measure. Linear regression comes to mind. If you didn't use squared error in that case, you would not have gotten the very easy formula that we started the lecture with. And also, even if you cannot get a closed-form solution, you might be able to use an optimization that is favorable. For example, the cross-entropy that I referred to ends up, in the case of a linear model, logistic regression, being convex, which means that optimization is efficient. In that case, you get a global minimum and all of that. So now you resort to either conceptual appeal, the plausibility, or practical appeal, which is the friendly aspect, to choose the error measure. This is completely legitimate, because in many cases you are not going to have the user-specified error measure. Now let us modify the learning diagram once more to introduce the error measure as we defined it. So here it is, and I'd like to ask you to look at this for maybe 15 seconds, and tell me where you think the error measure will fit in this diagram. The error measure itself. What does it affect? What does it take from? Exactly. So I can put the error measure, for example, here. That's an option. I can put it inside the unknown target function. 
What does the error has to do with the target function? So you can think, where would it be? It's not difficult to realize that that's where it belongs. It has two roles. One of them is to evaluate this statement. This statement is qualitative. The final hypothesis approximates the target function. This is what gives it a number. This gives it a grade, and you use the error measure to quantify this approximate thing. The other thing is that you have to feed the error measure to the learning algorithm, because what does the learning algorithm do when you have an error measure? It minimizes the in-sample error, let's say in this case. And the in-sample error depends on your error measure. If you are minimizing squared error, that's different from minimizing another type of error. So the error measure feeds into those two. Now we go for the next guy, which is the noisy targets. New topic, another addition to the learning diagram. The noisy targets are actually very important, because in reality, these are the only types you are going to encounter in the problems in life. Very seldom, you get a very clean target function. So the first statement is the target function is not always a function. Why are you calling it target function, if it's not a function? Well, function is a mathematical notion. You have to return a unique value for every point in the domain. That's what qualifies it as a function. We used it here a little bit liberally. So far, we dealt with it as if it was really a function. But let's take the example we started with, the credit example. You consider credit card approval, and here is a historical record. This is an input. Isn't it possible that two identical customers have these fields, and one of them ended up being good credit, and one of them ended up being bad credit? Sure, this doesn't capture all the information in the world. There is information that is not given that contributes noise, if you will. 
And there are circumstances that people will go through that will make it probabilistic whether they will be good credit or bad credit. So we come to realize that two identical customers, in the sense that their input representation is the same, can have two different behaviors. And having one point map to two values means it is not a function. What do we do about that? Well, we use a target distribution, as in probability distribution. Instead of having y equals f of x, where you tell me what x is and I tell you what the value y is for sure, you use a target distribution, and the notation for that is probability of y given x. Again, it depends on x, but its dependence is probabilistic. Some y's are more likely than others in this case. Before, one y was possible, and the rest were impossible. So now we make it a little bit more accommodating this way. So now we have a target distribution instead of a target function. Let's follow it through. X used to be generated by the input probability distribution. It will still be generated by that distribution. This is an artifact that we introduced in order to get the benefit of the Hoeffding type inequality. Nothing has changed. But what will change now is that instead of y being a deterministic function of x once you generate x, y is also probabilistic, generated by this fellow. So you can think now of x, y as a pair being generated by the joint distribution, which is p of x times p of y given x. There is no assumption of independence once you put it this way. The only assumption here is that the p of y you are given is actually conditional on x. Now you get noisy targets. What is a noisy target in this case? Well, a noisy target can be posed as a deterministic target, like the one we had before, plus noise. This applies to any numerical target function.
So if y is a real number, or binary, or something numerical, you can always pose the question of a target distribution as if it was a deterministic target function, proper, plus noise. This is just a convenience to show you that this is not far from what we have already. And why is that? Because you can now define a target function to be the expected value, the conditional expected value of y given x, and that's a function. Although p of y given x gives you different values, you take the expected value, that's a number, and you call this the value of the function f of x. Then whatever is left out, you call the noise. It's a nice trick. So you've got the bulk of it, and then you go here and you call the rest the noise. And that is usually the form it is given in. So you think that you are really trying to learn the target function still, but there is this annoying noise, and you're trying to make your algorithm pick up this pattern, and there is nothing it can do about the remaining noise, which averages to 0. Now, by the same token, there is no loss of generality when we talk about probability distributions. If you actually have a proper function, which happens once in a blue moon, you can still pose this as a probability distribution. How do you do that? You get here p of y given x, and you define it to be identically 0, unless y equals the f of x that you have in mind. So if we were talking about finite domains, you put all the probability, 1, on this value, and you put probability 0 on all other values. If it happens to be continuous, which is almost always the case, you put all the mass on the point, you put the delta function there, and you let the other ones be identically 0. The target function is a special case of a probability distribution. Therefore, if we decide that the target is always a distribution, there is no loss of generality. Now, let's do the final installment of the learning diagram. Once we are done with this, we will freeze it forever.
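The decomposition just described can be written out as follows. This is only a restatement of the spoken argument in symbols; the noise symbol $\varepsilon$ is notation introduced here for illustration, and the target is assumed to be numerical so that the expectation makes sense.

```latex
\begin{align*}
f(x) &= \mathbb{E}[\,y \mid x\,]
  && \text{(the deterministic part of the target)} \\
y &= f(x) + \varepsilon, \qquad \varepsilon = y - f(x)
  && \text{(whatever is left over is called noise)} \\
\mathbb{E}[\,\varepsilon \mid x\,] &= \mathbb{E}[\,y \mid x\,] - f(x) = 0
  && \text{(the noise averages to zero)} \\[4pt]
P(y \mid x) &=
  \begin{cases}
    1 & \text{if } y = f(x) \\
    0 & \text{otherwise}
  \end{cases}
  && \text{(a proper function as a special case, finite domain)}
\end{align*}
```

In the continuous case the last line becomes a delta function placed at $y = f(x)$.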
This will be the general learning diagram for supervised learning. And here is what we have. We are going to include the noisy targets. This is what we had so far. It's getting crowded, isn't it? And we are going to make it more crowded. And in this case, we are including the noisy targets. Obviously, the modification will happen here. And what you do is you replace this with a target distribution. So let me magnify it. So the unknown target function became an unknown target distribution. You define it as a conditional probability distribution of y given x. And you can think of this as if it was a target function, with the expected values that I talked about, plus noise, which is the remaining part. But as a target, it is a distribution. And I'd like you to look at this and appreciate the time we spent to build the blocks here. In spite of the fact that this looks like a complete jungle, you can go back and understand every single component here. Every component has a reason. This is to accommodate a practical consideration, which is the fact that we are learning something that is noisy. This is because the penalty or the cost you pay for not being perfect needs to be specified by the user. This is our artificial addition to the problem, in order to make learning feasible, and so on and so forth. So that is the final diagram for supervised learning. Now, I'd like to make one final point about noisy targets, which is the distinction between the two probabilities we have. We have probability of x, which we artificially introduced to accommodate Hoeffding, and then this one was introduced in a completely different context. That is to accommodate the fact that real, as in genuine, functions that you encounter in practice are not functions, but are actually noisy distributions. So let's look at this. They look very much related. They both pour into the training examples. That's how the training examples are generated. This guy passes on the probability of y given x.
This guy passes on the probability of x. When this guy gets it, it generates those guys according to the joint distribution, multiplies these, if you will, and then uses it as a way to generate the pair x, y. So they look like they belong to the same category. Both of them are unknown. This one is unknown so that my machine learning statement can be as general as I can afford. You are learning something, you don't know what it is. So that's good to have. This one is unknown because that is the weakest assumption we could make and still get Hoeffding. We needed a probability distribution, but we didn't need to make any assumptions about it. So we left it arbitrary and unknown, and we don't want to know it. So these are the similarities. Now let's look at the differences. So both have probabilistic aspects, we have seen that. Now the target distribution is what you are trying to learn. You are not trying to learn the input distribution. As a matter of fact, when you are done, you will not know what the input distribution is. The input distribution is merely playing the role of quantifying the relative importance of the point x. Let me give you an example. Let's say you are approving credit again. The target distribution is the probability of creditworthiness given the input. Let's simplify the input and say it's the salary. So I give you the salary, you decide what is the risk of this person defaulting, and then you decide that the output is plus 1, approve credit, with probability 0.9, and minus 1, disapprove credit, with probability 0.1. That is the target distribution, and that is what you are trying to learn. You are going to approximate it by a hard decision probably, or you can actually learn the probability distribution, as we will see later on. The input distribution just tells you the distribution of salaries in the general population. How many people make 100,000? How many people make 10,000? Et cetera.
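To make the two roles concrete, here is a minimal sketch of how a training pair could be generated: first x from the input distribution, then y from the target distribution given that x. The salary levels and every probability in the tables are made-up illustrative numbers, not values from the lecture, and `sample_example` and `joint_probability` are hypothetical helper names.

```python
import random

# Illustrative input distribution P(x) over salary levels (made-up numbers).
P_X = {10_000: 0.3, 50_000: 0.5, 100_000: 0.2}

# Illustrative target distribution P(y = +1 | x): probability of good credit.
P_APPROVE_GIVEN_X = {10_000: 0.2, 50_000: 0.6, 100_000: 0.9}


def sample_example(rng):
    """Draw one (x, y) pair: x from P(x), then y from P(y | x)."""
    salaries = list(P_X)
    x = rng.choices(salaries, weights=[P_X[s] for s in salaries])[0]
    y = +1 if rng.random() < P_APPROVE_GIVEN_X[x] else -1
    return x, y


def joint_probability(x, y):
    """P(x, y) = P(x) * P(y | x) for the two tables above."""
    p_plus = P_APPROVE_GIVEN_X[x]
    return P_X[x] * (p_plus if y == +1 else 1.0 - p_plus)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
training_set = [sample_example(rng) for _ in range(5)]
print(training_set)
```

Note that a learner handed `training_set` only ever tries to recover something about `P_APPROVE_GIVEN_X`; the table `P_X` just decides which salaries show up and how often.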
So in spite of the fact that the probability distribution over the input matters, in the sense that, let's say, that you encounter a population where the salaries are very high. So p of x is tilted very much towards the high salaries. And let's conjecture that high salaries correspond to high creditworthiness. In this case, the same system that you train that will take any salary, low or high, and then decide whether to approve credit or not, will be tested mostly in the very comfortable region of high salaries. So it will be returning yes, approve, yes, approve, yes, approve with very small probability of error. And if you go and put the mass of probability around the borderline cases, the cases where the decision is difficult, the same system that you learned will probably perform worse just because there are so many points that are borderline. So it does give the weight that will finally grade your hypothesis, but you are not trying to learn that distribution. And when you put them together analytically, which you are allowed to do, you can merge them as p of x and y. And that's what you will find in the literature. It's very nice and pleasant, and you generate the example using that joint distribution. However, you just need to remember that this merging mixes two concepts that are inherently different. Definitely, p of x and y is not a target distribution for supervised learning. The target distribution, the one you are actually trying to learn, is this fellow. The other component is just a catalyst in the process. That covers the error and noise, and we have arrived at the final statement of the learning problem. So now let me spend the rest of the lecture trying to prepare you for the coming two weeks, which will consider the theory of learning. It's a very important theory, and I encourage everyone to bite the bullet and go through it. I will do my best to make it user-friendly. 
However, it's important not because just of the mathematical derivation, the insight and the secondary tools you are going to get are extremely important. It's worth two weeks, not full two weeks, but four hours' worth of listening to a lecture and actually trying to study the material well. So that's my pitch. Let me give you the preamble to the theory. Let's see what we know so far in order to put the theory in perspective. We know that learning is feasible in a probabilistic sense, and the way we did this is by stating that it is likely that the out-of-sample performance is close to the in-sample performance, that in our mind corresponded to the feasibility of learning. I am going to test this premise now and ask, is this really learning? Is this condition the condition that captures what we mean by learning? Let's raise some doubt. Learning means that you got the hypothesis right. You give it to your customer, and it behaves well as close as possible to the target function. That's success. That means that the condition for learning is really that g approximates f. And now we are sophisticated people. We know what approximates means. So this condition is not really this condition. What is this condition in terms of the e in and e out and stuff? Very simple. That's what it means. It means that the out-of-sample error for g is close to zero, because the out-of-sample error measures what? Measures how far you are from the target function over the entire space. And therefore, the statement that you are close means that the out-of-sample error is small. So this is what we want, and this is what we have. Now, what was that? If it's not learning, what was it? Well, this was actually good generalization. And it's an important building block, because you never will have access to this. If I gave you this as the condition, a quantity that I will never know is close to zero. Thank you very much. 
But now with the theory, I was able to tell you that you have a window on E out by dealing with E in, if you have the right checks in place, which so far we have stated only vaguely as the number of hypotheses not being too big. And we will define it very, very accurately when we get to the theory part. So if you have that, all of a sudden, E in is an important quantity, because now it acts as a proxy for the E out that you don't know. So this is not a total waste, but it's only half the story. So the full story of learning has two questions. And if you look at this slide and remember that this is always the way the learning problem is posed, you will dismiss a lot of misconceptions about learning being impossible or learning being possible. You find all kinds of results all over the literature. So here is the deal. This condition patently says we learned well. That's what it means. So now we are going to achieve this through two conditions. The first condition is the one we developed using Hoeffding. E in is close to E out. The second condition is that E in is small. You put them together and you have learning. And you can see the difference between the two. This is a theoretical result. This is a practical result. E in, I know E in, I can try to use linear regression or something else to knock this down and get it as close to zero as possible. And indeed, if you look at where we handled these questions, we didn't explicitly state the questions in this form, but we already covered them in two lectures. This was the subject of lecture two. Hoeffding was all about the fact that E in is close to E out. This was the subject of lecture three. We had data, and we wanted to get the in-sample error to be small, and we looked for techniques to do that. So now, because this is important, let's put it in a box. Learning reduces to two questions. First question. Can we make sure that the out-of-sample performance is close enough to the in-sample performance?
This is a theoretical question, and we are going to spend two weeks answering this question. Second one. Can we make the in-sample error small enough? This is a practical question, and we are going to spend four weeks doing this one. And then, by the way, we'll have one week to reflect, and it's always sweet to reflect when you have concrete ground to stand on. We can do all the philosophy in the world, and we will have very concrete mathematics, very concrete algorithms, and very concrete results on real data, to know that what we are talking about means something. So that's the plan. Now, let me just make a small comment about this one. Small enough has meant close to zero so far. There is a very important class of applications where there is no way under the sun that you will get an out-of-sample performance close to zero, anywhere near zero, and by proxy, E in will not be near zero either. And I'll give you a very simple example. Let's say that you are doing financial forecasting, trying to detect whether the market is going up or down. Under idealized conditions, this is impossible. The data is purely noisy, and there is nothing to learn. The fact that the conditions are not ideal is what lets hedge funds make money. They exploit a little bit of inefficiency, but they don't get it right 100% of the time. They don't get it right 70% of the time. They will be very, very happy if they get it right 53% of the time consistently. Under those conditions, you will make a lot of money. So the out-of-sample error that we are trying to get here is very close to a half. It's 47%. So the correct rate is 53%, and the error is 47%. So as small as possible, in some applications, is not really near zero at all, but actually closer to a half.
As long as you have Hoeffding, or at least have the theoretical guarantee that with what you are seeing in sample, when you add the Hoeffding allowance, if you will, the out-of-sample error will be comfortably smaller than a half, consistently, you are in business. If you don't have the Hoeffding guarantee, then you are so happy in sample. You've got the stock market right 75% of the time. You think you're going to make money, and then when you look at the jump from E in to E out, you find that the error bar is that big, and you are in trouble. So let me talk about what the theory that we are going to cover in the next two weeks will achieve, and then I will stop. This is a typical figure that you are going to get. The theory deals with in-sample error and out-of-sample error. Let me actually magnify it just to... So we will see the behavior of in-sample error as you get the model to be more and more elaborate, which will be measured by a quantity which we are going to call the VC dimension. You will find that there are certain behaviors of the in-sample and the out-of-sample error and the model complexity and whatnot, and all of the things that appear in this figure will get a formal definition and an ability to evaluate them when we get to the theory. So it is worthwhile. But if you summarize what the theory does, there are two steps that are the most important. The first one, which is the most remarkable result in the theory of learning, is that we are going to characterize the feasibility of learning for the case of infinite M. Remember what M was? M was the number of hypotheses. We worked with a finite hypothesis set in order to be able to work with simple Hoeffding, and we realized that the bigger M is, the looser the bound, and if M gets too big, the bound is meaningless. So if M is infinity, we are toast.
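The interplay between the allowance and M can be sketched numerically. The code below assumes the finite-hypothesis-set form of the bound, P[|E_in - E_out| > epsilon] <= 2 M exp(-2 epsilon^2 N), and the function names, the confidence level 0.05, and the grids of M and N are illustrative choices, not numbers from the lecture. It asks when the allowance drops below 0.03, which is what an in-sample error of 0.47 would need to keep the out-of-sample error under a half.

```python
import math


def hoeffding_bound(epsilon, n_examples, n_hypotheses=1):
    """Upper bound on P[|E_in - E_out| > epsilon] for a finite hypothesis set.

    Uses the union-bound form 2 * M * exp(-2 * epsilon^2 * N).
    """
    return 2.0 * n_hypotheses * math.exp(-2.0 * epsilon ** 2 * n_examples)


def tolerance(delta, n_examples, n_hypotheses=1):
    """The epsilon at which the bound equals delta: sqrt(ln(2M / delta) / (2N))."""
    return math.sqrt(math.log(2.0 * n_hypotheses / delta) / (2.0 * n_examples))


# Illustrative scenario: in-sample error 0.47, so the allowance must stay
# below 0.03 for the guarantee to keep the out-of-sample error under 0.5.
for m in (1, 100, 10_000):
    for n in (1_000, 10_000, 100_000):
        eps = tolerance(delta=0.05, n_examples=n, n_hypotheses=m)
        print(f"M={m:>6}  N={n:>7}  allowance={eps:.4f}  useful={eps < 0.03}")
```

For fixed N the allowance only grows as M grows, which is the sense in which a very large M makes the bound meaningless.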
So now we would like to be able to find a counterpart that can take an infinite hypothesis set, like every hypothesis set we have seen so far. The perceptron, the linear regression model, all of these are infinite hypothesis sets. And we are going to try to find a way to deal with infinite hypothesis sets. This is the bulk of the development, because where we'll end up is that we are going to measure the model not by the number of hypotheses, but by a single parameter which tells us the sophistication of the model. And that sophistication will reflect the out-of-sample performance as it relates to the in-sample performance. Once we do this, lots of doors open, so we are going to characterize a trade-off that we observed on and off as we went through the lectures. We realized that we would like our model, the hypothesis set, to be elaborate in order to be able to fit the data. The more parameters you have, the more likely you are going to fit the data and get... So E in goes down if you use a more complex model. We also suspected that if you make the model more complex, in the same direction, the discrepancy between E out and E in gets worse and worse. E in tracks E out much more loosely than it used to. Now the good news from the theory is that this will be pinned down so concretely that we are going to derive techniques from it that will make a lot of difference in practical learning. Regularization is a direct result of this. And without regularization, you basically cannot do machine learning other than extremely naively. So this will set the foundation for a practical method that is used in almost every machine learning problem you will have. It's really worth knowing. Okay. Now I will stop here and we'll take questions, and we'll start the theory next time. Please get ready, do your exercises, and get ready for two weeks' worth of very interesting derivation. Okay. Now let's go to the questions. Okay, so... So how does p of x impact the learning algorithm?
So does it matter if p of x is different between the training and the test data sets? Okay. So there is the absolute impact of p of x and then there is a relative impact. So you are asking about the relative impact. Let's say that I pick the training points according to one distribution and then test the system using another. Let's answer that question first. Okay. There is a correction to the theory that takes into consideration the difference between the two probability distributions, assuming that they are not extreme. For example, if one probability distribution completely vanishes on part of the space, then obviously there's a problem, because the points in that part of the space will never happen and you shouldn't be hoping to learn anything at all about it. But there are modifications to the theory where you get a correction term based on the difference between the two probabilities. The absolute version, I don't know whether this was asked, but let me address it anyway: how does p of x affect the learning algorithm? Well, the emphasis that p of x gives to certain parts of the space over others will affect the choice of the learning examples. And if you have a limited resource in your hypothesis set, which you always have to have, otherwise the model is too complicated, then there is always a compromise about which part of the space you should try to get right more than the other. I mean, you don't think of this explicitly, but that's what the algorithm does when it tries to satisfy a number of examples. If you change the probability distribution, even if it's the same for both training and testing, then you will end up with a slightly different hypothesis that takes into consideration the emphasis of the new one. Nonetheless, you are not learning that input distribution, it's just affecting your choices. Okay, and in this discussion, so you introduced the probability of y given x and probability of x. Will probability of x given y ever play a role? Okay, I mean, I can imagine cases where it plays a role.
Okay, so you have p of x and y, and you ask yourself, if I get this output, what is the likely input? I mean, this is a role. I don't know whether it's a machine learning role or not. But in general, the merging of p of x and p of y given x into the same quantity, although it's mathematically convenient, as I mentioned, is a little bit, not misleading, misleading is a strong word, but it blurs the picture. So what you are really looking at, you always think, I have p of y given x. That's the genuine thing that I'm trying to learn, and that is an integral part of the learning problem. On the other hand, p of x plays a technical role, and a technical role that is fairly minor. It is essential that it exists, but it's not nearly as important as p of y given x. In the case of considering the target function as a probability distribution, then what is better to have, m pairs of x and y, or m y's per x? Okay, so I don't have a theoretical proof for it. It seems to me obvious that if you get all the outputs corresponding to one input, you are dealing with a very specific part of the input space, and you are unlikely to infer anything about the rest of the space. So the answer to the question is that intuitively, and I think it would probably not be that difficult to prove, you should get them independently rather than get them for the same input. Also, by the argument we mentioned before, you should be choosing the inputs independently according to the input probability distribution p of x. So if you get all the examples at the same x, this really means that the p of x you are using is a delta function on that x. So if you live up to the expectations and use the same probability distribution to generate the test points, then you are in good shape. But if you change the game on me and generate all the examples according to this delta function, and then when you want to test it, you go out and give me points that I haven't seen before, then I'm in trouble.
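The point that p of x grades the hypothesis without being learned can be illustrated with an exact calculation. In this sketch the target distribution, the fixed cutoff hypothesis, and both input distributions are made-up numbers chosen for illustration; only the structure, the same hypothesis and the same p of y given x evaluated under two different p of x, follows the discussion.

```python
# Illustrative target distribution P(y = +1 | salary); made-up numbers.
P_GOOD_GIVEN_X = {20_000: 0.10, 40_000: 0.45, 60_000: 0.55, 120_000: 0.95}


def hypothesis(salary):
    """A fixed hard decision: approve (+1) at or above a hypothetical cutoff."""
    return +1 if salary >= 50_000 else -1


def out_of_sample_error(p_x):
    """Exact binary error: sum over x of P(x) * P(y != h(x) | x)."""
    total = 0.0
    for salary, weight in p_x.items():
        p_good = P_GOOD_GIVEN_X[salary]
        # If we approve, we are wrong when the customer is bad, and vice versa.
        p_mistake = (1.0 - p_good) if hypothesis(salary) == +1 else p_good
        total += weight * p_mistake
    return total


# Two input distributions over the same salary levels (each sums to 1).
mostly_high = {20_000: 0.05, 40_000: 0.05, 60_000: 0.05, 120_000: 0.85}
mostly_borderline = {20_000: 0.05, 40_000: 0.45, 60_000: 0.45, 120_000: 0.05}

print("population tilted to high salaries:", out_of_sample_error(mostly_high))
print("population near the borderline:    ", out_of_sample_error(mostly_borderline))
```

The hypothesis and the target distribution are identical in both calls; only the weighting over inputs changes, and the error on the borderline-heavy population comes out larger.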
Can you clarify what you mean by poor generalization? It's a common question. Okay, this will be part of the theory. There will be a very specific quantity we measure, which is the discrepancy between E out and E in. And we are going to call this the generalization error. And that will quantify poor generalization or good generalization. Okay, going back to slides 11 and 12. Oh, the supermarket and the CIA. Yes, so you chose numbers like 1 and 10. So is there a principled way of choosing these numbers? The principled way is to estimate the cost of each occurrence and then translate it into those numbers. This was only an illustration, and I wasn't really interested in the one or the 10. I was only interested in making the point crisp that the error measure is different between two application domains for exactly the same system, the same system as in machine learning system, same training data, same target function. But the error measure can be different depending on the application domain. So in this case, we can actually go and see, for example, the loss of revenue from giving an unwarranted discount for the supermarket, and the probability that a customer will be annoyed and the revenue lost because of that customer, and then actually come up with the right balance between false accept and false reject for the supermarket. It may not be 10, but definitely the number that is 10 here would be bigger than the number that is one here. Similarly for the CIA, you can go and ask yourself what is the risk and how much does it cost, versus the lost time for the employees from trying the system again, and then come up with a more principled way of doing it. This is not really the crux of what I'm doing here. I was only making the point that they are different. That's all. Okay. Once the theory is explained, will it quantify the errors that result from not knowing parts of p of x, especially if p of x has maybe long tails and things like that? Okay.
P of x has been assumed to be an unknown distribution, and I only used it as a utility to invoke a probabilistic setup. There are no assumptions about p of x. As long as you pick the points from the same distribution to train as to test, everything that I said and will say during the theory part will be valid. If it's a long tail, it's a long tail for training and for testing. Let's say it's a heavy tail and I get something that is an outlier. I will get a certain error, I will get an in-sample error and I'll get an out-of-sample error. I basically don't worry about the structure of p of x because I'm assuming it's unknown, and I'm assuming that in the course of supervised learning, I'm not going to learn it. What happens in the case that both the false positives and the false negatives have higher values? If you scale both of them up, it makes no difference whatsoever. Then the error measure is scaled up and you are minimizing it, so it's just multiplied by a constant. If they are scaled relative to each other, then obviously the emphasis of the system changes, trying to get more false positives and fewer false negatives or vice versa, and that's what happens between these two examples. For the supermarket, we're trying not to reject customers, and in the CIA case, we are trying not to accept people who are intruders. Okay, there's also a question asking to reiterate the relation of p of x to the Hoeffding inequality. Okay. The Hoeffding inequality was based on the bin. The bin had marbles and we would pick them according to some probability, which we labeled as a Bernoulli trial, so it's a binary outcome and the probability was mu. The bin became the input space. The input space, when we started talking about machine learning, did not have a probability distribution, it was just a set.
So in order to be able to invoke the probabilistic aspect, we need to put a probability distribution over the input space, such that when you color the points as green and red marbles according to agreement and disagreement, and the input space becomes a bin, there is a probability that goes with it for picking red versus green marbles. It doesn't matter which one, because any probability you put will correspond to some mu and then you have the rest of it. And we know that the bound on the right-hand side of the Hoeffding inequality is independent of the value of mu. So any probability will do. Will do what? Will do the legitimization of the learning problem as far as the probabilistic approach is concerned. Obviously, we can enter a discussion about the probability being concentrated or spread out, or being zero on parts of the space. All of that is good and valid, except that it doesn't affect the basic question, which is to make sure that learning is feasible in a probabilistic sense. Any p of x will achieve that. Okay, a clarification. So some people are asking for an example of the case of a squared error measure and a closed-form solution. This actually goes to the review. Let me go to the... to the review, because this is from the last lecture. Okay. There is an algorithm that we derived for linear regression. And the algorithm is based on minimizing squared error. Remember that we took a gradient, and we took advantage of the form being squared error in order for the thing to be differentiable and for the derivative to have a simple form. And that simple form is what ended up giving us the formula for w at the bottom, which is X transpose X, inverse of that, times X transpose y, as a simple closed-form solution for the final hypothesis of linear regression. So in that case, it is the squared error measure that defines linear regression that enabled us to find such a simple solution.
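As a small illustration of that closed form, here is a sketch of w = (X^T X)^{-1} X^T y for the simplest case: one input variable plus the constant coordinate, so that X^T X is 2 by 2 and can be inverted by hand. The function name `fit_line` and the sample points are assumptions made for this example, not part of the lecture.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ w0 + w1 * x via the normal equations.

    With rows (1, x_n) in X, this solves (X^T X) w = X^T y directly.
    X^T X = [[N, sum x], [sum x, sum x^2]] and X^T y = [sum y, sum x*y].
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular; need at least two distinct x values")
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1


def in_sample_squared_error(w, xs, ys):
    """Average squared error of the line w on the data."""
    w0, w1 = w
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data with a little scatter around a line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
w = fit_line(xs, ys)
print(w, in_sample_squared_error(w, xs, ys))
```

There is no iteration anywhere in `fit_line`: the weights come out in one shot, which is exactly what the squared error measure buys you here.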
If you take another error measure, you may or may not get a simple solution, but for sure in this case we got it, and there are definitely error measures that you can put where you cannot find a simple solution like this one. Okay. Going back to the problem of the CIA and the supermarket. Yes. If the probability of y equals one and y equals minus one is not balanced, should you do something regarding p of x in order to have a correct estimate of your error? Okay. The probability of y in the absolute depends on two things, the probability of y given x and the probability of x. So if you put them together and you get an imbalanced probability of y, this means that the building quantities, which are p of x and p of y given x, are what produced that. And those quantities will definitely affect the learning process. So if you want to answer what happens when y is not balanced, go back and see what gives rise to it, and then you will be able to find the answer more directly, through the quantities that directly affect the learning process. And again on these costs, is there ever a case where you can use rewards instead of costs, as in assigning negative values to the... Yeah, they are equivalent indeed. You are just maximizing the reward or minimizing the punishment. I think it's just two ways of looking at the same thing. Okay, question. So in the example of the bins, when you say there's a bin that becomes the input space, does the input space include just the training data points, or does the input space include all possible points? The input space includes all possible points, but includes only the input part of those possible points. If you look at the training data, it is X and Y, and the input space deals only with the X part. When you talk about the input space in general, it covers all possible X's. When you talk about the training data, you are talking about the X's that were picked as a training set, and there are capital N of them.
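The remarks about costs, that scaling both costs by the same constant changes nothing, that changing their ratio changes the decision, and that rewards are just negated costs, can be checked on a toy problem. Everything concrete below (the scores, labels, candidate thresholds, and cost values) is hypothetical and chosen only to make the comparison visible.

```python
def weighted_error(threshold, examples, cost_false_accept, cost_false_reject):
    """Average cost of a threshold classifier on (score, label) examples.

    The classifier accepts (+1) when score >= threshold. Accepting a -1 is a
    false accept; rejecting a +1 is a false reject.
    """
    total = 0.0
    for score, label in examples:
        decision = +1 if score >= threshold else -1
        if decision == +1 and label == -1:
            total += cost_false_accept
        elif decision == -1 and label == +1:
            total += cost_false_reject
    return total / len(examples)


def best_threshold(examples, candidates, cost_false_accept, cost_false_reject):
    """Candidate threshold with the smallest weighted error (first one on ties)."""
    return min(
        candidates,
        key=lambda t: weighted_error(t, examples, cost_false_accept, cost_false_reject),
    )


# Hypothetical (score, label) data and candidate thresholds.
examples = [(0.1, -1), (0.3, -1), (0.4, +1), (0.5, -1), (0.6, +1), (0.7, -1), (0.9, +1)]
candidates = [0.0, 0.2, 0.35, 0.45, 0.55, 0.65, 0.8, 1.0]

# Supermarket-like setting: rejecting a genuine customer is the costly mistake.
supermarket = best_threshold(examples, candidates, cost_false_accept=1, cost_false_reject=10)
# CIA-like setting: accepting an intruder is the costly mistake.
cia = best_threshold(examples, candidates, cost_false_accept=10, cost_false_reject=1)
# Same ratio as the supermarket, both costs multiplied by 5.
scaled = best_threshold(examples, candidates, cost_false_accept=5, cost_false_reject=50)
# Rewards instead of costs: maximize the negated cost.
by_reward = max(candidates, key=lambda t: -weighted_error(t, examples, 1, 10))

print(supermarket, cia, scaled, by_reward)
```

The same data and the same family of classifiers give a permissive threshold in one setting and a strict one in the other, while `scaled` and `by_reward` agree with `supermarket`.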
Okay, regarding the transformation, what relation does phi have to something like principal component analysis? Okay, this is a different subject. So there is a subject of processing the input in order to make it more compact, in order to get rid of irrelevant parts and whatnot, and that is a legitimate processing step, but it's not what I was alluding to here. What I was alluding to here in the nonlinear transformation is an ability to implement more sophisticated hypotheses using the same simple method, which is the linear method. Okay, and therefore the transformation is with a view to that, not with a view to getting rid of some of the artifacts of the input. However, feature extraction is feature extraction. You can think of the nonlinear transformation as feature extraction. You can also think of other methods for processing the input and getting rid of some of the irrelevances as feature extraction. And if you think of the example of the digits that we talked about, we started with the full image, which I think was 257 bits worth, counting the constant one, and then we ended up with only two features plus the one. The two features were symmetry and intensity, and in some sense these are informative features, and in that case you lost some information about the input, but hopefully what you lost is not relevant. Principal components and other methods are fairly systematic ways to detect that without attaching meaning. So you don't really study the subject, you just apply a standard method that will pick the most informative directions in the input representation space, and those will be your coordinates. So it's a different subject, it's not related to nonlinear transformation per se. Regarding the error measures, so the squared error measure is used mainly for mathematical convenience. How much do we lose by replacing it with something like just an absolute value, or the one norm? You lose optimization.
So squared error is this way, and that is nice and smooth and has all kinds of properties. You take the absolute value and you have this guy, and the edge is really bad news. All of a sudden it becomes a combinatorial optimization instead of a smooth function. So yes, you lose in terms of optimization. If you have a specific merit for using the absolute value, that is, you know, the guy tells you that this is my function and I want to make sure that this is what you minimize, then we have to bite the bullet and work through it. But if you are making an analytic choice just for the heck of it, you might as well pick something that is friendly either to the concept or to the optimizer. So, a question regarding the use of a linear model: when you have p of y given x, that represents f of x, and then f of x would be the result of w transpose x? So if this is the case, then when you subtract f of x from y, does it mean you have a p of y given x shape? The target is f of x, and the target is not w transpose x. W transpose x is the final hypothesis, that is, my attempt to approximate the target function. So I was talking about target function versus target distribution even without any learning taking place. Someone has a target function, and it's noisy. I'm telling them that they can model it this way. Take the expected value, assuming it's a numerical function, the expected value under the probability distribution, and you will get the average y given a particular x. That's a function of x; call this f of x. The remaining part, which is the value of y minus that, will be pure noise in the sense that it averages around zero. So this was just a way of looking at it, but definitely it does not touch at all on linear models or any other models. It's a characterization of the target function versus the target distribution. There's a trade-off between complexity and performance. Is there a way to simultaneously improve the generalization as well as minimize the error?
If you sit through the next four lectures very, very attentively, you will get the answer to that at the end of the four lectures. I'm half-joking, but that's the reality of it. You will have enough tools to be able to answer questions like that. I think that's it. That's it. Thank you, and we will see you next week.