The following program is brought to you by Caltech. Welcome back. Last time, we discussed the feasibility of learning. And we realized that learning is indeed feasible, but only in a probabilistic sense. And we modeled that probabilistic sense in terms of a bin, where the bin corresponds to the out-of-sample performance, the performance we don't know. And in order to be able to tell what E out of h is, h being the hypothesis that corresponds to that particular bin, we look at the in-sample, and we realize that the in-sample tracks the out-of-sample well through a mathematical relationship, the Hoeffding inequality, which tells us that the probability that E in deviates from E out by more than our specified tolerance is a small number, and that small number is a negative exponential in N. So the bigger the sample, the more reliably E in will track E out. That was the basic building block, but then we realized that this applies to a single bin, and a single bin corresponds to a single hypothesis. So now we go for a case where we have a full model, h1 up to hM, and we take the simple case of a finite hypothesis set, and we ask ourselves what would apply in this case. We realize that the problem with having multiple hypotheses is that the probability of something bad happening could accumulate: if there is a half a percent chance that the first hypothesis is bad, in the sense of bad generalization, and half a percent for the second one, we could be so unlucky as to have these half percents accumulate and end up with a significant probability that one of the hypotheses will be bad. And if one of the hypotheses is bad, and if we are further unlucky and this is the hypothesis we pick as our final hypothesis, then E in will not track E out for the hypothesis we pick. So we need to accommodate the case where we have multiple hypotheses, and the argument was extremely simple.
g is our notation for the final hypothesis. It is one of these guys that the algorithm will choose. Well, the probability that E in doesn't track E out for g is obviously included in the event that E in for h1 doesn't track the out-of-sample for that one, or E in for h2 doesn't track, or E in for hM doesn't track. The reason is very simple. g is one of the guys. If something bad happens with g, it must happen with at least one of these guys, the one that was picked. So we can always say that this event implies these events: this, or this, or this. And after that, we apply a very simple mathematical rule, the union bound. The probability of an event or another event or another event is at most the sum of the probabilities. That rule applies regardless of the correlation between these events, because it takes the worst-case scenario. If all the bad events happen disjointly, then you add up the probabilities. If there is some correlation and they overlap, you will get a smaller number. In all of the cases, the probability of this big event will be less than or equal to the sum of the individual probabilities. And this is useful because in the coin-flipping case which started this argument, the events are independent. In the case of the hypotheses of a model, the events may not be independent, because we have the same sample, and we are only changing the hypothesis. So it could be that the deviation here is related to the deviation there. But the union bound doesn't care. Regardless of such correlations, you will be able to get a bound on the probability of this event. And therefore, you will be able to bound the probability that you care about, the one that has to do with generalization, by the individual Hoeffding bound applied to each of those. And since you have capital M of them, you get an added factor of M.
So the final answer is that the probability of something bad happening after learning is less than or equal to this quantity, which is a helpfully small quantity, times capital M. And we realize that now we have a problem, because if you use a bigger hypothesis set, M will be bigger. And therefore, the right-hand side here will become bigger and bigger as M grows. And therefore, at some point, it will even become meaningless. And we are not even worried yet about capital M being infinite, which will be true for many hypothesis sets, in which case this is totally meaningless. However, we weren't establishing the final result in learning. We were establishing the principle that through learning, you can generalize. And we have established that. It will take us a couple of weeks to get from that to the ability to say that a general linear model, an infinite one, will generalize, and to get the bound on generalization. That's what the theory of generalization will address. So today, the subject is linear models. And as I mentioned at the beginning, this is out of sequence. If I were following the logical sequence, I would go immediately to the theory, take the result for the finite case of capital M, and generalize it to the more general case. However, as I mentioned, I decided to give you something concrete and practical to work with early on. And then we will go back to the theory after that. So the linear model is one of the most important models in machine learning. And what we are going to do in this lecture, we are going to start with a practical data set that we are going to use over and over in this class. And then, if you remember the perceptron that we introduced in the first lecture, the perceptron is a linear model. So here is the sequence of the lecture. We are going to take the perceptron and generalize it to non-separable data.
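To make the bound concrete, here is a minimal sketch in Python (the function name and the particular numbers are illustrative, not from the lecture) of the single-hypothesis Hoeffding bound 2·exp(−2ε²N) picking up the factor of M via the union bound:

```python
import math

def hoeffding_bound(eps, n, m=1):
    """Upper bound on P[|E_in - E_out| > eps] for M hypotheses:
    the union bound multiplies the single-hypothesis Hoeffding
    bound 2*exp(-2*eps^2*N) by M."""
    return 2 * m * math.exp(-2 * eps ** 2 * n)

# A single hypothesis with N = 1000 samples and tolerance eps = 0.05:
single = hoeffding_bound(0.05, 1000)

# The same tolerance with M = 100 hypotheses: the bound is 100x looser.
many = hoeffding_bound(0.05, 1000, m=100)

assert many == 100 * single
```

Note how the bound degrades linearly in M while improving exponentially in N, which is why a bigger sample can absorb a bigger hypothesis set.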
That's a relief, because we already admitted that separable data is very rare, and we would like to see what happens when we have non-separable data. Then we are going to generalize this further to the case where the target function is not a binary classification function, but a real-valued function. That also is a very important generalization. And linear regression, as you will see, is one of the most important techniques that is applied, mostly in statistics and economics, and also in machine learning. Finally, as if we didn't do enough generalization already, we are going to take this and generalize it to a nonlinear case, all in a day's work, all in one lecture. It's a pretty simple model, and at the end of the lecture, you will be able to actually deal with very general situations. And you may ask yourself, why am I calling the lecture the linear model when I'm going to talk about nonlinear transformations? Well, you will realize that the nonlinear transformation remains within the realm of linear models. That's not obvious. We will see how that materializes. So that's the plan. Now let's look at a real data set that we are going to use, and that is going to be available to you, to try different ideas. And it's very important to try your ideas on real data, regardless of how sure you are when you have a toy data set that you generate. You should always go for real data sets and see how the system that you thought of performs in reality. So here is the data set. It comes from zip codes at the post office. So people write the zip code, and you extract individual characters, individual digits, and you would like to take the image, which happens to be 16 by 16 gray-level pixels, and be able to decipher what the number in it is. That would look easy, except that people write digits in so many different ways. And if you look at it, there will be some cases like this fellow. Is this a 1 or a 7? Is this a 0 or an 8? So you can see that there is a problem.
And indeed, if you get a human operator to actually read these things and classify them, they will probably be making an error of about 2.5%. And we would like to see if machine learning can at least equal that, which means that we can automate the process, or maybe beat that. So this is a data set that we are going to work with. So let's look at it a little bit more closely to see how we input it to our algorithm. We have one algorithm so far, which is the perceptron learning algorithm. We are going to try it on this, and then we are going to generalize it a little bit. So the first item is the question of input representation. What do I mean? This is your input, the raw input, if you will. Now, this is 16 pixels by 16 pixels. So there are 256 real numbers in that input. So if you look at the raw input x, this would be x1, x2, x3, dot, dot, dot, up to x256. That's a very long input to encode such a simple object. And we add our mandatory x0. Remember, in linear models, we have this constant coordinate, x0 = 1, which we add in order to take care of the threshold. So this will always be in the background, whether we mention it or not. So if you take this raw input and try the perceptron directly on it, you realize that the linear model in this case, which has a bunch of parameters, has really just too many parameters. It has 257 parameters. If you are working in a 257-dimensional space, that is a huge space. And the poor algorithm is trying to simultaneously determine the values of all of these w's based on your data set. So the idea of input representation is to simplify the algorithm's life. We know something about the problem. We know that it's not really the individual pixels that matter. So we'll probably extract some features from the inputs and then give those to the learning algorithm and let the learning algorithm figure out the pattern. So this gives us the idea of features. What are features? Well, you extract the useful information.
And as a suggestion, a very simple one, let's say that in this particular case, instead of giving the raw input with all of the pixel values, you extract some descriptors of what the image is like. For instance, depending on whether the digit is an 8 or a 1, there is a question of intensity, average intensity. One doesn't have too many black pixels. Eight has a lot. Five has some. So if you simply add up the intensity of all the pixels, you probably will get a number that is related to the identity. It doesn't uniquely determine it, but it's related. It's a higher-level representation of the raw information there. Same with symmetry. If you think of the digit 1, the one will be symmetric. If you flip it upside down or you flip it right to left, you will get something that overlaps significantly with it. So you can also define a symmetry measure, which means that you take the symmetric difference between something and its flipped versions, and you see what you get. If something is symmetric, things will cancel because it's symmetric. You will get a very small value. And if something is not symmetric, say like the five, you will get lots of values in the symmetric difference, and you will get a high value for that. So what you are measuring is the anti-symmetry. You take the negative of that, and you get the symmetry. So you get another guy, which is the symmetry. So now x1 is the intensity variable, x2 is the symmetry variable. Now, admittedly, you have lost information in that process. But the chances are that what you lost was mostly irrelevant information. So this is a pretty good representation of the input as far as the learning algorithm is concerned. And you went from 257-dimensional to 3-dimensional. That's a pretty good situation. And you probably realize that having 257 parameters is bad news for generalization, if you extrapolate from what we said. Having 3 is a much better situation.
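As a rough illustration of these two features, here is a sketch. The exact normalization, and the choice of measuring symmetry with only the left-right flip, are my assumptions for illustration; the lecture mentions both flips, and the course's actual feature definitions may differ.

```python
import numpy as np

def extract_features(img):
    """img: a 16x16 array of grayscale pixel values.
    Returns (intensity, symmetry): intensity is the average pixel
    value; symmetry is the negative of the average absolute
    difference between the image and its left-right flipped version
    (illustrative normalization, assumed for this sketch)."""
    intensity = img.mean()
    asymmetry = np.abs(img - np.fliplr(img)).mean()
    return intensity, -asymmetry

# A perfectly left-right symmetric image scores 0, the maximum
# possible value of this symmetry measure.
sym_img = np.ones((16, 16))
_, s = extract_features(sym_img)
assert s == 0.0
```

Any digit image then collapses to the feature vector (1, x1, x2), with the constant coordinate prepended for the threshold.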
So this is what we are going to work with. So when you take the linear model in this case, you just have w0, w1, and w2. And that's what the perceptron algorithm, for example, needs to determine. So now let's look at the illustration of these features. You have these as your inputs, and x1 is the intensity, x2 is the symmetry. What do they look like? They look like this. This is a scatter diagram. Every point here is a data point. It's one of the digits, one of the images you have. And I'm taking the simple case of just distinguishing the ones from the fives. So I'm only taking digits that are ones or fives. And you can always take other digits versus each other, and then combine the decisions. So if you can solve this smaller problem, you can generalize it to the full problem. So when you put all the ones and all the fives in a scatter diagram, you realize, for example, that the intensity of the five is usually more than the intensity of the one. There are more pixels occupied by the fives than the ones. So this is the coordinate which is the intensity, and indeed the red guys, which happen to be the fives, are tilted a little bit more to the right, corresponding to more intensity. If you look at the other coordinate, which is symmetry, the one is often more symmetric than the five. Therefore, the blue guys, which happen to be the ones, tend to be higher on the vertical coordinate. And just by these two coordinates, you already see that this is almost linearly separable, not quite. But it's separable enough that if you pass a boundary here, you'll be getting most of them right. Now, you realize that it's impossible really to ask to get all of them right, because, believe it or not, this fellow is a five. At least it was meant to be a five by the guy who wrote it. So we have to accept the fact that there will be stuff that is completely undoable. And we will accept an error. It's not a zero error, but hopefully it's a small error.
So this is what the features look like. Now, what does the perceptron learning algorithm do? What it does is this complicated figure, which shows the evolution of E in and E out as a function of the iteration. When you apply the perceptron learning algorithm, you apply it only to E in. E in is the only value you have. E out is sitting out there. We don't know what it is. We just hope that E in tracks it well. So let's look at the figure. These are the iteration numbers. So this is the first misclassified example. You go and apply the perceptron learning algorithm again, again, again, again, for 1,000 iterations. As you do that, E in, which is the green curve, will go down, and sometimes will go up. We realize that the perceptron learning algorithm takes care of one point at a time, and therefore may mess up other points while it's taking care of that point. So in general, it can go up or down. But the bad news here is that the data is not linearly separable. And we made the remark that the perceptron learning algorithm behaves very badly when the data is not linearly separable. It can go from something pretty good to something pretty bad in just one iteration. So this is a very typical behavior of the perceptron learning algorithm. Because the data is not linearly separable, the perceptron learning algorithm will never converge. So what do we do? We force it to terminate at iteration 1,000. That is, we stop at 1,000 and take whatever weight vector we have. And we call this the final hypothesis of the perceptron learning algorithm. Now you obviously look at this, and you say, OK, if I only took this guy, this is a better guy than the other guy. But you are just applying the algorithm and cutting it off. Now, one of the things you observe from here: I plotted E out. You are not going to be able to plot E out in a real problem that you deal with, if E out is really an unknown function. You may be able to estimate it using some test examples and whatnot.
But all you need to know here is that E out is drawn here for illustration, just to tell you what is happening in reality as you work on the in-sample error. And in this case, you find that E out actually tracks E in pretty well. There is a difference. So if you go from here to here, that's our epsilon. It's a big epsilon. But the good news is that it tracks it. When this goes down, this goes down. When this goes up, this goes up. So if you make your decision based on E in, the decision based on E out will also be good. That's good for generalization. And that is one of the advantages of something as simple as the perceptron learning algorithm. It doesn't have too many parameters. And because of our efforts in getting only three features, it now has only three parameters. So the chances are that it will generalize well, which it does. Now, what does the final boundary look like? This is only the illustration here; this is the evolution. Eventually, you end up with a hypothesis. The hypothesis would separate the points in the scatter diagram you saw. So what does it look like? Well, it looks like this. So this is your boundary. This is the final hypothesis that corresponds to the weights you got at the final iteration. Well, it's OK, but definitely not good. It's too deep into the blue region. You would have been better off doing this. And the chances are maybe earlier guys that had better in-sample error would do that. But that's what you have to live with if you apply the perceptron learning algorithm. So now we go and try to modify the perceptron learning algorithm in a very simple way. It is the simplest modification you can ever imagine. So let's see what happens. This is what the PLA did. And when we looked at it, we said, OK, if we only could keep this value. Well, this value is not a mystery. It happened in your algorithm. You can measure it explicitly. It's an in-sample error. And you know that it's better than the value you ended up with.
So in spite of the fact that you are doing these iterations according to the prescribed perceptron learning algorithm rule, modifying the weights according to one misclassified point, you can keep track of the total in-sample error of the intermediate hypotheses you got, and only keep the guy that happens to be the best throughout. So you are going to continue as if it's really the perceptron learning algorithm. But when you are at the end, you keep this guy and report it as the final hypothesis. What an ingenious idea. Now, the reason the algorithm is called the pocket algorithm is that the whole idea is to put the best solution so far in your pocket. And when you get a better one, you take the better one, put it in your pocket, and throw away the old one. And when you are done, report the guy in your pocket. We can do that. So what does this diagram look like when you are looking at the pocket algorithm? Much better. You can look at these values, and each is the best value so far. So here we went down, and here we indeed went down. Here we went up. You see this green thing? Here we didn't, because the good guy is in our pocket, and that's what we're reporting the value for. And we continued with it until we dropped again. We dropped again, and we never changed that, because there was never a better guy than this guy. So when we come to iteration 1,000, we have this fellow. Now, when you do that, you can use the perceptron learning algorithm with non-separable data, terminated by force at some iteration, and report the pocket value, and that would be your pocket algorithm. And if you look at the classification boundary, PLA versus pocket, this is what we had with the perceptron learning algorithm. We complained a little bit that it's too deep in the blue region. And when you look at the other guy, which is the pocket algorithm, it looks better. It actually does what we thought it would do. It separates them better. Still, obviously, it cannot separate them perfectly.
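The pocket idea can be sketched as follows. This is a minimal illustrative implementation, not the course's official code; details such as picking the misclassified point at random, and the tie-breaking at sign zero, are my assumptions.

```python
import numpy as np

def pocket_pla(X, y, max_iter=1000, rng=None):
    """Pocket version of the perceptron learning algorithm.
    X: N x (d+1) matrix with the constant coordinate x0 = 1 included.
    y: N-vector of +/-1 labels. Runs ordinary PLA updates, but keeps
    ('pockets') the weight vector with the lowest in-sample error seen
    so far, and reports that one at the end."""
    rng = rng or np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), np.mean(np.sign(X @ w) != y)
    for _ in range(max_iter):
        preds = np.sign(X @ w)
        preds[preds == 0] = -1          # sign(0) matches neither label
        wrong = np.flatnonzero(preds != y)
        if wrong.size == 0:
            return w                    # separable case: PLA converged
        i = rng.choice(wrong)           # pick a misclassified point
        w = w + y[i] * X[i]             # the standard PLA update
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:              # better than the pocket? swap.
            best_w, best_err = w.copy(), err
    return best_w
```

On separable data this reduces to plain PLA; on non-separable data it returns the best intermediate hypothesis instead of whatever weights iteration 1,000 happened to land on.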
Nothing can, because they're not linearly separable. On the other hand, this is a good hypothesis to report. So with this very simple algorithm, you can actually deal with general inseparable data. Inseparable data, that is, in the sense that it's basically separable. This guy is bad, and this guy is bad, and there's nothing we can do about them. But they are few, so we will just settle for this. We will see that there are other cases of inseparable data that are truly inseparable, in which we have to do something a little bit more drastic. So that's as far as classification is concerned. So now we go to linear regression. The word regression simply means real-valued output. There is absolutely no other connotation to it. It's a glorified way of saying, my output is real-valued. And it comes from earlier work in statistics. And there is so much work on it that people could not get rid of that term. And it is now the standard term: whenever you have a real-valued function, you call it a regression problem. So that's with that out of the way. Now, linear regression is used incredibly often in statistics and economics. Every time you ask, are these variables related to that variable, the first thing that comes to mind is linear regression. So let me give you an example. Let's say that you would like to relate your performance in different types of courses to your future earnings. So this is what you do. You look at, here are the courses I took. Here are the math, science, engineering, humanities, physical education, other. And you get your GPA in each of them. So here I got 3.5. Here I got 3.8. Here I got 3.2. Here I got 2.8. 2.8? No, no, that doesn't happen at Caltech. So you go for the other one. So you just have the GPAs for the different groups of courses. Now, you say, someone graduates. I'm going to look 10 years after graduation and see their annual income. So the inputs are the GPAs in the courses at the time they graduated.
The output is how much money they make per year, 10 years after graduation. Now, you ask yourself, how do these things affect the output? So you apply linear regression, as you will see in detail. And you finally find, oh, OK, maybe the math and sciences are more important. Or maybe all of that is an illusion, and it was actually the humanities that are important. You don't know. You will see the data, and the data will tell you what affects what. And in any other situation like that, people simply resort to linear regression. So in order to build it up, we are going to use the credit example again, in order to be able to contrast it with the classification problem we have seen before. So what do we have? In the classification case, we have the credit approval, yes or no. That's a classification function, a binary function, where the output is plus or minus 1. In the case of regression, we will have a real-valued function, and the interpretation in this case is that you are trying to predict the proper credit line for a customer. So the customer applies, and it's not a question of approving the credit or not. Do you give them a credit limit of $800 or $1,200 or $30,000 or what? Depending on their input. So this is a real-valued function, and we are going to apply regression. Now you take the input. This is the same input as we had before. Data from the applicant that are related to credit behavior: the age, the salary. I suspect that the salary will figure very significantly now when you are trying to tell the credit line. Because if someone is making $30,000 a year, you probably are not going to give them a credit line of $200,000. So you can see that this will probably be effective. And there are other guys that merely have to do with the stability of the person, like years in residence. If the person has been in the same residence for 10 years, they are unlikely to skip town. On the other hand, if they have been there for only one month, well, you don't know.
That type of thing. So you have these variables. You encode them as the input x. And then your output, in this case, which is the linear regression output, is a hypothesis that takes this particular form. So let's spend some time with it to understand it. First, it's regression because the output is real. It's linear regression because the form, in terms of the input, is linear. Now, we have seen this before. We sum up from 1 to d over the genuine inputs, the weighted version of the input variables. And then we add the mandatory x0, which is 1, which takes care of the threshold, which is w0. So this is the form we have seen before. Except that, when we saw it before, we took this as a signal that we only care about the sign of. If it's plus, we approve credit. If it's minus, we don't approve credit. And we treated it as a credit score, per se, when you take out the threshold. Now, in this case, this is the output. We don't threshold it. We don't say it's plus 1 or minus 1. The w0 is in there, but we don't take the output as plus 1 or minus 1. We take it as a real number. And this is the dollar amount we are going to give you as a credit line. Now, the signal here will play a very important role in all the linear algorithms. This is what makes the algorithm linear. And whether you leave it alone, as in linear regression, or you take a hard threshold, as in classification, or, as we will see later, you take a soft threshold and you get a probability and all of that, all of these are considered linear models. And the algorithm depends on this particular part, which is the signal being linear. We also took the trouble to put it in vector form. And the vector form will simplify the calculus that we do in this lecture in order to derive the linear regression algorithm. But if you hate the vector form, you can always go back to this. There is nothing mysterious about it. It simply has a bunch of parameters, w0, w1, up to wd.
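In other words, all these linear models share the same linear signal and differ only in what they do with it. A tiny sketch, with made-up weights and inputs, to show the classification and regression outputs side by side:

```python
import numpy as np

# The linear signal s = w^T x is shared by all linear models.
# w0 is the threshold weight, paired with the constant x0 = 1.
w = np.array([-1.0, 0.5, 2.0])          # illustrative weights
x = np.array([1.0, 3.0, 0.25])          # x0 = 1, then the real inputs

signal = w @ x                          # -1 + 1.5 + 0.5 = 1.0

classification = np.sign(signal)        # hard threshold: +/-1 label
regression = signal                     # left alone: real-valued output

assert classification == 1.0
assert regression == 1.0
```

The soft-threshold case mentioned in the lecture (a probability) would apply yet another function to the same signal, which is taken up later in the course.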
And if I'm trying to minimize something, I can minimize it with respect to the scalar variables, which uses very primitive calculus. But we obviously will do it in the shorthand version, which is the vector one, or the matrix form, in order to be able to get the derivation in an easier way. So that's the problem. What is the data set in this case? It's historical data, but it's a different set of historical data. The credit line is decided by different officers. Someone sits down and evaluates your application and decides that this person gets a 1,000 limit, this person gets a 5,000 limit, and whatnot. All we are trying to do in this particular example is to replicate what they are doing. So we don't want the credit officer to do that. The credit officers are sometimes inconsistent with one another. They may have a good day or a bad day. So we'd like to figure out what pattern they collectively have in deciding the credit, and have an automated system decide that. That's what the linear regression system will do for us. So the historical data here are, again, examples from previous customers. For a previous customer, this is x1, and this is y1. So this is the application that the customer gave, and this is the credit line that was given to them. No tracking of credit behavior. We're just trying to replicate what the experts do in this case. And then you realize that each of these y's is actually a real number, which is the credit line that is given to customer xn. And that real number will likely be a positive integer. It's a credit line. It's a dollar amount. And what we are doing is trying to replicate that. That's the statement of the problem. So what does linear regression do? First, we have to measure the error. We didn't talk about that in the case of classification, because it was so simple there. Here, it's a little bit less simple. And then we'll be able to discuss the error function for classification as well. What do we mean by that?
You will have an algorithm that tries to find the optimal weights. These are the weights you are going to have. So these weights are going to determine what hypothesis you get. Some hypotheses will approximate f well. Some hypotheses will not. We would like to quantify that, to give guidance to the algorithm in order to move from one hypothesis to another. So we'll define an error measure. And the algorithm will try to minimize the error measure by moving from one hypothesis to the next. So if you take linear regression, the standard error function used there is the squared error. So let me write it down. Well, in classification, there is only a simple question of agreement on a particular example. You either got it right or you got it wrong. There is nothing else. Therefore, in that case, we just defined a binary error: did you get it right or wrong? And we found the frequency of getting it wrong, and we got the E in and E out. Here, you are estimating a credit line. So if the guy should get 1,000, and you tell them 900, that's not too bad. If the guy should get 1,000, and you tell them 5,000, that's bad. So you need to measure how bad the situation is. And you define an error measure, and you define it by the simple squared error. Now, squared error doesn't have an inherent merit here. It just happens to be the standard error function used with linear regression. And its merit really is the simplicity of the analytic solution that we are going to get. But when we discuss error measures in the next lecture, we will go back to the principles: does the error measure matter? Why? How do we choose it? Et cetera. This will be answered in a principled way next time. But for this time, let's take this as the standard error measure we are going to use. So when you look at the in-sample error, you use the error measure on each particular example, small n, from 1 to capital N. For each example, this is the contribution to the error.
Each of these is affected by the same w, because h depends on w. So as you change w, this value will change for every example. And this is the error on that example. And if you want to get the whole in-sample error, you simply take the average of those. So that will give me a snapshot of how my hypothesis is doing on the data set. And now we are going to ask our algorithm to take this error and minimize it. So let's actually just look at what happens, as an illustration. This is the simplest case for linear regression. The input is one-dimensional. I have only one relevant variable. I want to relate your overall GPA to your earnings 10 years from now. Your overall GPA is x. Your earnings 10 years from now is y. That's it. So I would have properly called this x1 according to our notation, and then there would be an x0, which is the constant 1 and whatnot. But I didn't bother, because I have only one variable. But this is what we have. So you look at this, and you see that for different x's, you have these guys. Wow. Your earnings are going down? That may not have been the best example to draw here. So what linear regression does is try to produce a line, which is what you have here, that tries to fit this data according to the squared error rule. So it may look like this. And in this case, the threshold here depends on w0. The slope depends on w1, which is the weight for x. And that is the solution you have. Now, you didn't get it exactly right; what you got has some errors. And you realize that this is the error on the first example, this is the error on the second example, and if you sum up the squares of the lengths of these bars, that is what we call the in-sample error that we defined on the previous slide. Well, linear regression can apply to more than one dimension. And I can plot two dimensions here just to illustrate it. It's the same principle. What you have here is x1, if I can get the pointer. We'll leave it to rest. We have x1 and x2.
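Summing the squares of those bars and averaging gives the in-sample error. A minimal sketch (the example data points are made up for illustration):

```python
import numpy as np

def in_sample_error(w, X, y):
    """Squared in-sample error for linear regression:
    E_in(w) = (1/N) * sum_n (w^T x_n - y_n)^2."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# One-dimensional example with the constant coordinate x0 = 1
# prepended as the first column:
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

assert in_sample_error(np.array([0.0, 1.0]), X, y) == 0.0   # perfect fit
assert in_sample_error(np.array([0.0, 0.0]), X, y) > 0.0    # worse fit
```

Different choices of w give different values of this one number, and the algorithm's job is to pick the w that makes it smallest.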
And in this case, the linear thing is really a plane. And you are, again, not separating, but trying to estimate these guys. And you are making errors. And in general, when you go to higher-dimensional space, the line, which is the reason why we call it linear, is not really a line. It's a hyperplane. One dimension short of the space you are working with. And that's what you are trying to use to approximate the guys. Now let's look at the expression for E in. And that is the analytic expression we are going to try to minimize. And that will make us derive the linear regression algorithm. So we wrote this before. And you have the value of the hypothesis minus yn squared. That is because it's a squared error. And because it's linear regression, this value, h of xn, happens to be w transpose xn. It's a linear function of xn. Now let us try to write this down in a vector form. I will explain this in detail. But let's look at this. Instead of the summation, all of a sudden, I have a norm squared of something that is capital X. I haven't seen capital X before. I haven't seen vector Y before. Well, it's basically a consolidation of the different xn's here. xn is a vector. So you put the vectors in a matrix, you call it x. And you put the scalars, the yn, in a vector, and you call it y. So the definition of capital X and the vector y is as follows. The matrix X, what you do, you put your first example here. So this would be the constant coordinate one. The first coordinate, second coordinate, up to the d-th coordinate, the last coordinate. And then you go for the second example and do the same, and construct this matrix. And for y, you put the corresponding output. This is the output for the first example, output for the second example, output for the last example. Now one thing to realize about the matrix capital X is that it's pretty tall. The typical situation is that you have few parameters. 
We reduced them to three, for example, in the case of the classification of the digits. But you usually have many, many examples, in the thousands. So this would be a very, very long matrix. Now, the norm squared is simply this vector transposed times itself. And when you do it, you realize that what you are doing is summing up contributions from the different components, and each component happens to be exactly what you have here. So this becomes a shorthand for writing this expression. Now let's look at minimizing E in. When you look at minimizing, you realize that the matrix X, which has the inputs of the data, and y, which has the outputs of the data, are, as far as we are concerned, constants. This is the data set someone gave me. The parameter I'm actually playing with in order to get a good hypothesis is w. So E in is a function of w: w appears here, and the rest are constants. If I do any calculus of minimization, it is with respect to w. So I try to minimize this. And what you do, you take the derivative and equate it with 0, except here it's a glorified derivative: you take the gradient, which is the derivative with respect to a bunch of variables all at once. And there is a formula for it, which is pretty simple in this case. I will explain it. By the way, if you hate this and you want to make sure, because linear regression is so important, and you want to verify that it's true, you can always go for the scalar form: take the partial derivatives of E in with respect to every weight, partial w0, partial w1, up to partial wd, get a formula that is a pretty hairy one, and then try to reduce it. And surprise, surprise, you will get the solution here that we have in matrix form in two steps. Now if you look at this, deal with it in terms of calculus as if it were just a simple square. If this were a simple square and w were the variable, what would the derivative be? You would get the 2 sitting outside, which you got here. And then you would get the same thing in a linear form, which you got here.
And then you would get whatever constant was multiplied by w to sit outside, which you got here, except with a transpose, because this is really not a square; this is the transpose of this times itself. That's where you get the transpose. Pretty straightforward and standard matrix calculus. So that's what you have. And then you equate this to 0, but it's a fat 0, a vector of 0s. You want all the derivatives to be 0, all at once. And that will define a point where this achieves a minimum. Now you would suspect that the solution will be simple, because this is a very simple quadratic form. And indeed, the solution is simple. And if you look at it, you realize that if I want this to be 0, then I want this to cancel out: I want, when I multiply X transpose X w, to get the same thing as X transpose y, so they cancel out and I get my 0. So you write this down, and you find that this is the situation. I want this term to be equal to this term, and that will give me the 0. The interesting thing is that, in spite of the fact that capital X, the matrix X, is a very tall matrix, definitely not square, hence not invertible, X transpose X is actually a square matrix, because X transpose is this way and X is this way. Multiply them and you get a pretty small square matrix. And as we will see, the chances are overwhelming that it will be invertible. So you can actually solve this very simply by inverting it. You multiply by the inverse in this direction, and multiply by this, this will disappear, and you will get an explicit formula for w, which you are trying to solve for. And when you do that, you will get w equals this funny symbol, x dagger, times y. What is x dagger? It is simply a shorthand for writing this: I take the inverse of that, and then multiply it by this. So this is really what gets multiplied by y. I call it x dagger, and indeed it gets multiplied by y to give me my w. Now the x dagger is a pretty interesting notion.
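The derivation above can be checked numerically, as a sketch with made-up data: setting the gradient 2 X^T (Xw - y) to the zero vector gives X^T X w = X^T y, so when X^T X is invertible, w = (X^T X)^{-1} X^T y = x-dagger times y.

```python
import numpy as np

# Sketch of the closed-form solution: w = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(1)
N, d = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])  # tall matrix
y = rng.normal(size=N)

X_dagger = np.linalg.inv(X.T @ X) @ X.T   # the pseudo-inverse
w = X_dagger @ y

# At this w, every component of the gradient is (numerically) zero,
# and X_dagger acts as a left inverse of X.
gradient = 2 * X.T @ (X @ w - y)
print(np.allclose(gradient, 0.0, atol=1e-8))
print(np.allclose(X_dagger @ X, np.eye(d + 1)))
```

Note that X times X_dagger, in the other order, is not the identity; it is the projection matrix mentioned next.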
It's called the pseudo-inverse of X. X, being a non-invertible matrix, does not have an inverse. But it does have a pseudo-inverse, and the pseudo-inverse has interesting properties. For example, take the x dagger and multiply it by X, so x dagger times X. What do you get? You append X here, so you get X transpose X, and I have (X transpose X) inverse here. So they cancel out, and I get an identity. So when I multiply x dagger by X, I get the identity, and it's OK to call it an inverse of sorts. It doesn't work the other way around. The other way around gives us an interesting matrix, which we'll talk about later. But basically, this is the essence of it. If we were in a trivial situation where X was square, say I have three parameters and exactly three examples to determine them, that could be solved perfectly: I can actually get this to be 0. And how would you get it to be 0? You would just multiply by the proper inverse of X in this case, and you would get X inverse times y. The situation here is pretty much similar, except that when X is tall, we are not going to get a 0; we are just going to get a minimum, using the pseudo-inverse. Now, I would like you to appreciate what the pseudo-inverse is from a computational point of view. OK. So this is the formula for the pseudo-inverse that you will need to compute in order to get the solution for linear regression. So let's look at it. Something is inverted. And when you see inversion in a matrix formula, you say, oh, computational trouble. If this were a million by a million, I'm in trouble. If this is like 5 by 5, I'm in good shape. So we'd like to know what kind of matrix we have here. Well, nothing mysterious about what's inside. You have this fellow, which is X transpose. It's small: d plus 1, where d is the length of your input and 1 is the added constant variable. So these are the number of parameters. This would be 3 in the digit classification case: we have only x1 and x2, so d equals 2.
d plus 1 equals 3, which corresponds to x0, x1, x2, or to w0, w1, w2. So this is 3 times capital N. Capital N is the scary one. That's the number of examples, and that could be in the thousands. Now, you multiply this by X, and that's what you have. So the multiplication is not that difficult; even if this dimension is 10,000, I can carry it out. But the good news is that when I get to the inversion, I will be dealing with a simpler object. Let's just complete the formula first. This is what you have; this is what you are computationally doing. And if you look at what's inside here, it completely shrinks: the matrix inside is just 3 by 3 in our case. You can invert that. Accumulating it is the part where you have to go through all of the examples, and there's a very simple way of doing it. So it's not that difficult to get this fellow. And you can see now that, oh, good thing we had three parameters. If we had the 257 parameters that we began with, this would have been 257 by 257. Not that this would discourage us, but if you go for some raw inputs, you can get something really in the thousands, or sometimes even more than that. So the computational aspect of this is very simple. And there are so many packages for computing the pseudo-inverse, or for outright getting the solution for linear regression, that you will never have to do it yourself, except if you're doing something very specialized. And if you do something very specialized, it's not that bad. So that is the final matrix, and the final matrix will have the same dimensions as this guy. And if you look at it, this will be multiplied by y, which is y1, y2, up to yN, corresponding to the different outputs. And as a result, you will get w0, w1, up to wd. Indeed, if you multiply this by an N-tall vector, you will get a (d plus 1)-tall vector, and that's what we expect. OK. So let's now flash the full linear regression algorithm here. That's a crowded slide. That is what you do.
The first thing is you take the data that is given to you and put it in the proper form. What is the proper form? You construct the matrix X and the vector y, which are what we introduced before. This is the input data matrix, and this is the target vector. And once you construct them, you are basically done, because all you are going to do is plug them into a formula, the pseudo-inverse, and then return the value w, which is the multiplication of that pseudo-inverse with y. And you are done. Now you can call this one-step learning, if you want. The perceptron learning algorithm looked more like learning: I have an initial hypothesis, and then I take one example at a time and try to figure out what is going on, move things around, et cetera, and after 1,000 iterations, I get something. That looks more like the way we learn; we learn in steps. This looks like cheating. You are given the thing, and you have the answer. Well, as far as we are concerned, we don't care how you got it. If it's correct and gives you a correct E out, you have learned. And because this is so simple, this is a very popular algorithm that is used often, and used often as a building block for other methods. We can afford to use it as a building block, because the step here is so simple that we can become more sophisticated in using it. One remark about the inversion. This has to be invertible in order for the formula to hold. Now, the chance that this will be invertible in a real application is close to one. The reason is the following. Usually, you use very few parameters and tons of examples. You would have to be very, very, very unlucky for these to be so dependent on each other that you cannot even capture the dimensionality, which is the number of columns. The number of columns is 3, 5, 10, and you have 10,000 of those. So the chances are overwhelming in a real problem that this will be invertible.
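The one-step algorithm can be sketched in a few lines, using one of the packaged routines mentioned above (here numpy's np.linalg.pinv, which computes the pseudo-inverse via the SVD and so also covers the rare non-invertible case); the data here is made up as a sanity check:

```python
import numpy as np

# One-step learning: construct the input matrix X and target vector y
# from the data, then return w = pseudo-inverse(X) @ y.
def linear_regression(X, y):
    return np.linalg.pinv(X) @ y

# Noise-free data generated by a known target, as a sanity check.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
w_target = np.array([0.5, -1.0, 2.0])
y = X @ w_target

w = linear_regression(X, y)
print(np.allclose(w, w_target))  # recovers the target weights
```

There is no iteration: constructing X and y and applying one formula is the entire algorithm.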
Nonetheless, if it is not invertible, you can still define the pseudo-inverse. It will not be unique and has some elaborate features, but it's not a big deal; that is not a situation you will encounter in practice. Now, we have linear regression. So I'm going to tell you that you can use linear regression not only for real-valued functions, for regression problems, but also for classification. Maybe the perceptron is now going out of business: it has a competitor, and the competitor has a very simple algorithm. So let's see how this works. The idea is incredibly simple. Linear regression learns a real-valued function. We know that. The value belongs to the real numbers. Now, the main observation, the ingenious observation, is that binary-valued functions, which are the classification functions, are also real-valued. Plus 1 and minus 1, among other things, happen to be real numbers. So linear regression is not going to refuse to learn them as real numbers. So what do we do? You use linear regression in order to get a solution such that the signal is approximately yn in the mean-squared sense. So for every example, the actual value of the signal is close to the numerical plus 1 or the numerical minus 1. That's what linear regression does. Now, having done that with yn equal to plus or minus 1, you take the classification version of it: you take the sign of that signal in order to classify as plus 1 or minus 1. If the value is genuinely close to plus 1 or minus 1 numerically, then the chances are that when the target is plus 1, the signal will be positive, and when the target is minus 1, the signal will be negative. The chances are that in getting close to the right number, you end up on the correct side of 0. And if you are on the correct side of 0, the classification will be correct.
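A sketch of this trick with made-up, linearly separable data: treat the plus-or-minus-1 labels as real numbers, run linear regression on them, and classify with the sign of the resulting signal.

```python
import numpy as np

# Sketch: linear regression used for classification.  The +1/-1 labels
# are treated as real targets; classification is the sign of the signal.
rng = np.random.default_rng(3)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]))  # labels from a true line

w = np.linalg.pinv(X) @ y                    # regression on the labels
accuracy = np.mean(np.sign(X @ w) == y)
print(accuracy)                              # typically close to 1
```

The regression weights are not tailored to classification, so the accuracy is usually high but not guaranteed perfect, which is exactly the point made next.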
So if you take this and then plug it in as weights for classification, you will likely get something that agrees with the plus-or-minus-1 labels. That's a pretty simple trick, because it's almost free. All you need is a classification problem: run linear regression, which is almost for free, do this one-step learning, get a solution, and use it for classification. Now, let's see if this is as good as it sounds. Well, the weights are good for classification, so to speak, just by conjecture. But they may also serve as good initial weights for classification. Remember that the perceptron algorithm or the pocket algorithm is really very slow to get there. You start with a random hypothesis, half the points are misclassified, and it just goes around, tries to correct one, messes up the others, until it gets to the region of interest, and then it converges. Why not give it a jump start? Why not run linear regression first and get the w's? We know that the w's are OK, but they are not really tailored toward classification. But they are a good initial condition. Feed those to the pocket algorithm, and let it run to the solution, which is a classification solution. That's a pretty nice idea. So let's actually look at the linear regression boundary. Now, I take an example here. Again, I have the plus-1 class and the minus-1 class, and I applied linear regression to find the solution. Now, you remember, the blue region and the pink region belong to classification. When you talk about linear regression, you have the value: the signal is 0 here, the signal is positive and more positive and more positive in this direction, and here the signal is negative, more negative, more negative. There is a real-valued function that we are trying to interpret as a classification by taking the sign. Now, if you look at what linear regression is trying to do when you use it for classification, all of these points have a target value of minus 1.
It is actually trying to make the numerical value equal minus 1 for all of them. So the chances are this one will be minus 1, this one minus 2, that one minus 3, and the linear regression algorithm is very sad about that. It considers it an error, in spite of the fact that when we plug it into the classification, it just has the correct sign, and that's all we care about. But we are applying linear regression, and it is actually trying very hard to make all of them minus 1 at the same time, which obviously it cannot. And you can see now the problem with linear regression. In its attempt to make a far-away minus 3 into a minus 1, it may move the boundary to where it sits in the middle of the red region, and now it's very happy because it minimized its error function, but that's not really the classification we want. Nonetheless, it's a good starting point: you then run the classification algorithm, which forgets about the numerical values and adjusts the boundary according to the classification, and you will get a good boundary. That's the contrast between applying linear regression for classification and doing linear classification outright. Now we are done with that. I'm going to start on nonlinear transformation, and I'm going to give you a very interesting tool to play with. So here is the deal. You probably realize that even when dealing with non-separable data, we have been dealing with non-separable data that are really basically separable, with few exceptions. But in reality, when you take a real-life problem, you will find that the data you get could be anything. It could be, for example, something that looks like this. So you want to classify these as plus 1s and these as minus 1s; let's take the classification paradigm here. Now I can put the line anywhere, and obviously I'm in trouble, because this is not linearly separable, even by a long shot. You can look at this and say, OK, I can see what the pattern is here. Closer to the center, you have blues. Closer to the periphery, you have reds.
So it would be very nice if I could apply a hypothesis that looks like this. Yes, the only problem is that that's not linear. We don't have the tools to deal with that yet. Wouldn't it be nice if, in two viewgraphs, you could use linear regression and linear classification, the perceptron or the pocket, and apply them to this case? That's what will happen. I told you this is a practical lecture. So we take another example of nonlinearity. We take the credit line. Now if you look at the credit line, the credit line is affected by years in residence. We argued that if someone has been in the same residence for a long time, there is stability and trustworthiness, and if someone has been there a short time, there is a question mark. It is one thing to say that this is a variable that affects the output. It is another thing to say that this variable affects the output linearly. It would be strange, if I'm trying to determine a credit line, to decide that the credit line should be proportional to the time you have lived in your residence. If you lived there 20 years instead of 10, I will give you twice the credit line? It doesn't make sense, because stability is established probably by the time you get to five years. After that, it's diminishing returns. So it would be very nice if, instead of using the linear variable, I could define nonlinear features. Here is what I mean. Let's take the condition, the logical condition, that the years in residence are less than one. In my mind, I'm considering that, OK, this is not very stable: you haven't been there for very long. And another one, which is xi greater than five: you have been there for more than five years, so you are stable. The notation here, when I put something between these brackets, means that the expression returns one if the condition is true, and returns zero if the condition is false. So this is one or zero, and this is one or zero.
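The two bracket features just described can be sketched directly (the function name and the sample years are made up for illustration):

```python
import numpy as np

# Sketch of the two nonlinear "years in residence" features:
# [[x < 1]] and [[x > 5]], where the bracket notation returns 1 if the
# condition is true and 0 if it is false.
def residence_features(years):
    years = np.asarray(years, dtype=float)
    return np.column_stack([(years < 1).astype(float),
                            (years > 5).astype(float)])

print(residence_features([0.5, 3.0, 10.0]))
# rows: [1, 0] (unstable), [0, 0] (in between), [0, 1] (stable)
```

These two columns, rather than the raw years, would then be handed to linear regression as inputs.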
Now, if I had those as variables in my linear regression, they would be much more friendly to the linear formula in deciding the credit line than the crude input. But these are nonlinear functions of xi. And again, we have the nonlinearity, and we wonder if we can apply the same techniques to a nonlinear case. So this is the question: can we use linear models? The key question to ask is, linear in what? What do I mean? Look at linear regression. What does it implement? It implements this. This is indeed a linear formula. And when you look at the linear classification counterpart, it implements this. This is a linear formula, and the algorithm being simple depends on this part being linear; you then just make a decision based on that signal. Now, these, you would think, are called linear because they are linear in the x's, which they are. Yes, I get these inputs, and I combine them linearly, and I get my surface. That's why I'm calling it linear. However, you realize that, more importantly, these formulas are linear in w. When you go from the definition of a function to learning, the roles are reversed. The inputs, which are supposed to be the variables when you evaluate a function, are now constants. They are dictated by the training set; they are just a bunch of numbers someone gave me. The real variables, as far as learning is concerned, are the parameters. The fact that it's linear in the parameters is what matters in deriving the perceptron learning algorithm and the linear regression algorithm. If you go back to the derivation, it didn't matter what the x's were. The x's were sitting there as constants, and the linearity in w is what enabled the derivation. So the algorithms work because of linearity in the weights. Now, that opens a fantastic possibility, because now I can take the inputs, which are just constants.
Someone gives me data, and I can do incredible nonlinear transformations to that data, and it will just remain more elaborate data, but constant. When I get to learn using the nonlinearly transformed data, I'm still in the realm of linear models, because the weights given to the nonlinear features will have a linear dependency. So let's look at an example. Let's say that you take x1 and x2. I omitted the constant x0 here for simplicity. And these are the points that gave us trouble. These are the coordinates: this is x1, this is x2. These points should map to plus 1; those points should map to minus 1. I don't have a linear separator. Fine. These are data, right? So everything that appears within this box is just a bunch of constants, x's, and corresponding constants, y's. So now I'm going to take a transformation; I'm going to call it phi. Every point in that space, I'm going to transform to another space. And my formula for the transformation will be this. I'm assuming here that the origin of the coordinate system is here. So I'm taking x1 squared and x2 squared, and you can see where I'm heading, because now I'm measuring distances from the origin, and that seems to be helpful here. Now, in doing this, all I did was take constants and produce other constants. Now you can look at this and say, OK, this is my training data: I take your original training data, do the transformation, and forget about the original. Can you solve the problem in the new space? Oh yes, you can, because that's what the points look like in the new space. All of a sudden, the red points, which happened to be far away, will have bigger values for x1 squared and x2 squared; they will sit here. And the points that are closer to the origin, by the time you transform them, will have smaller values here. So this is now your new data set. Can you separate this using a perceptron? Yes, I can. I can put a line going through here. Great.
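The whole scenario can be sketched numerically (with made-up radii and a made-up separating line in the z space): data separated by a circle in the x space is not linearly separable there, but under phi(x1, x2) = (x1 squared, x2 squared) a straight line in the z space separates it perfectly.

```python
import numpy as np

# Sketch: an inner class (+1) and an outer ring class (-1), generated
# at different radii from the origin.
rng = np.random.default_rng(4)
r_inner = rng.uniform(0.0, 1.0, 50)         # blue class, close to origin
r_outer = rng.uniform(2.0, 3.0, 50)         # red class, far from origin
r = np.concatenate([r_inner, r_outer])
theta = rng.uniform(0, 2 * np.pi, 100)
x1, x2 = r * np.cos(theta), r * np.sin(theta)
y = np.concatenate([np.ones(50), -np.ones(50)])

# In the z space, z1 + z2 = x1^2 + x2^2 = r^2, so the line z1 + z2 = 2.25
# (any threshold between 1 and 4 works here) separates the classes.
z1, z2 = x1**2, x2**2
predictions = np.sign(2.25 - (z1 + z2))
print(np.all(predictions == y))             # every point classified correctly
```

A perceptron run on (z1, z2) would find such a line on its own; the threshold 2.25 is just written in by hand to keep the sketch short.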
When you get a new point to classify, transform it the same way, classify it in the new space, and then report that. That's the game. And there is really no limit, at least computationally, in terms of what you can do here. You can dream up really elaborate nonlinear transformations, transform the data, and then do the classification. There is a catch, and it's a big catch. So I will stop here, and we'll continue with the nonlinear transformation at the beginning of the next lecture. We'll take a short break now before we go to the Q&A session. We have questions from the online audience. OK, so a popular question is how to figure out the nonlinear transformations in a systematic way, instead of from the data. OK, so I said that the nonlinear transformation is a loaded question, and there will be two steps in dealing with it. I will talk about it a little more elaborately at the beginning of the next lecture, and then we are going to talk about the guidelines for the choice, and what you can do and what you cannot do, after we develop the theory of generalization, because it is very sensitive to the generalization issue. And that should not come as a surprise, because I can see that, OK, I can take the input, which is, let's say, two variables corresponding to two parameters, and I want the transformation to be as elaborate as possible in order to stand a good chance of being able to separate the classes linearly. So I'm going to go all out. I'm just going to keep adding nonlinear coordinates: x1, x1 squared, x1 cubed, x1 squared times x2, and so on. I'm just going. Now at some point, you should smell a rat, because you realize that, OK, I have this very, very long vector and a corresponding number of parameters, and generalization may become an issue, which it will. So there are guidelines for how far you can go, and also guidelines for how you can choose them. Do I look at the data and figure out what is a good nonlinear transformation? Is this allowed? Is this not allowed?
What are the ramifications? All of these will become clear only after you look at the theory part. OK. And there's a question about slide 15, regarding the expression of E in: how does the in-sample error here, or the out-of-sample error, relate to the probabilistic definition of last time? OK. Here we dealt only with the in-sample error. So we decided on E in, and in general in learning, you only have the in-sample error to deal with. You have, on the side, a guarantee that when you do well in sample, you will do well out of sample. So you never handle the out-of-sample error explicitly; you just handle the in-sample error and have the theoretical guarantee that what you are doing will help you out of sample. Now, the error measure here was a squared error. Therefore, when you define the in-sample error, you take the squared error and average it. And when you define the out-of-sample error, it's really the expected value of the squared error. Now, in the case of binary classification, the error was binary: you're either right or wrong. So you can define the in-sample error as the average of the question, am I right or wrong on each point? If you are right, there's no error, you get zero. If you are wrong, you get one. So you ask yourself, what is the frequency of ones in sample? And that gives you the in-sample error. The expected value of that error happens to be the probability of error. That's why, without going into expectation and in-sample average versus out-of-sample expected value, in the case of classification we simply talked about frequency of error and probability of error. Not because they are different, but just because they are simple to state. But in reality, the aspect that made them qualify as in-sample and out-of-sample errors is that the probability is the expected value of an error measure, which happens to be a binary error measure.
And the frequency of error happens to be the average value of that error measure. So you showed us a very nice graph with a negative slope about the dependence of future income on GPA. That was unintentional; I didn't think of the income at the time I drew the graph. So I disown any implication that you should do worse in school in order to earn more money. OK. But you mentioned the example of determining future income from a grade point average, or at least finding some correlation. So the question I'm interested in is, where can we get the data? I mean, obviously, the alumni association of every school keeps track of the alumni, and they send them questionnaires, and one of the inputs they have is how much money they make. I mean, there are a number of parameters. So there will be a number of schools that have that. And this is actually used: if you realize that something is related to success, you can go back and revise your curriculum or revise your criteria and whatnot. So the data is indeed available. That's the question: I mean, it's available in principle, but can we get it? Oh, "we" get it, OK? I thought it was a generic "we." I mean, OK. So obviously, the data will be anonymous after a while: you will just get the GPA and the income without knowing who the person is. You will be dependent on the kindness of the alumni associations at different schools, I guess. Or maybe there are some data available in the public domain; I have not looked. So my understanding is that you want to run linear regression, see what happens, and then focus your time on the courses that matter. That's the idea now. That's your feedback. OK. OK, so a technical question. Why is the w0 included in the linear regression? There's some confusion about this. And also, related to that point, what do you do specifically in the binary case? How do you incorporate the plus 1 or minus 1? People are asking about this. Let me take one question at a time.
Let's talk about the threshold first. Why is the threshold there? So let's look here. If you look at the line here, the linear regression line, it is not a homogeneous line; it doesn't pass through the origin. If I told you that you cannot use a threshold, then the constant part of the equation goes away, and the line you have would have to pass through the origin. Can you imagine trying to fit these points with such a line? I mean, obviously, it would be down there if you have the negative slope, or it would miss the points up there. So obviously, I need the constant in order to get a proper model. And in general, there is an offset, depending on the values of these variables, and the offset is compensated for by the threshold. So that's why we need the threshold for linear regression. What is the second question? So in the binary case, when you use y as plus 1 or minus 1, why does that just work? OK, well, if you apply linear regression, you have the following guarantee at the end: the hypothesis you have has the least squared error from the targets on the examples. That's what has been achieved by the linear regression algorithm. Now, the outputs of the examples are plus or minus 1. Put that together with the first statement, and we realize that the output of my hypothesis is closest to the value plus 1 or minus 1 in the mean-squared-error sense. The leap of faith is that if you are close to plus 1 versus minus 1, then the chances are that when you are close to plus 1, you are at least positive, and when you are close to minus 1, you are at least negative. If you accept that leap of faith, then the conclusion is that when you take the sign of the value of the signal from linear regression, you will get the classification right, because positive will give you plus 1, and negative will give you minus 1.
This is not quite the case, because in the attempt to numerically replicate all the points, the signal for linear regression can become, let's say, as I mentioned, plus 7 for some points and minus 7 for another point, and linear regression is trying to push the w, which is what will end up being the boundary, in order to capture that numerical value. So in attempting to fit stuff that is irrelevant to the classification, it may mess up the classification. And that's why the suggestion is: don't use it as the final thing for classification; just use it as an initial weight, and then use a proper classification algorithm, something as simple as the pocket algorithm, in order to fine-tune it further and get the classification part right, without having to suffer from the numerical angle. So also on that point, does it make a difference what you use? Is it plus 1, minus 1, or something else? If it's plus something and minus the same thing, it's a matter of scale. If it's plus one value and minus another, not symmetric, the difference will be absorbed in the threshold. So it really doesn't matter; it will just make things look different. OK, so regarding the first part of the lecture: how do you usually come up with features? OK, the best approach is to look at the raw input, and look at the problem statement, and then try to infer what would be a meaningful feature for this problem. For example, in the case where I talked about the years in residence, it does make sense to derive some features that are closer to the linear dependency. There is no general algorithm for getting features. So this is the part where you work with the problem, and you try to represent the input in a better way. And the only catch is that if you look at the data in order to derive the features, there is a problem there that will become apparent when we come to the theory.
But the bottom line is that if you don't look at the data, and you study the problem and derive features based on that, that will almost always be helpful, as long as you don't have too many of them. If you have too many of them, then it starts becoming a problem. But to first order, usually when I get a problem, I look at it, and I can probably think of less than a dozen variables that would be helpful. I put all of them in, and usually a dozen variables, in this case, doesn't increase the input space by much; these are big problems. And so I don't suffer much from the generalization issue. So, added to that, a short clarification: the nonlinear transformations, they become features? Yes. We are going to use the word feature: there's a feature space, which is called Z, and anything where you take the input and transform it into something else will be called a feature. And features of features will also be features. So if you take, for example, the classification of the digits, we had the pixel values; that's the raw input. And then we had the symmetry and the intensity; these were features. If you go further and find nonlinear transformations of those, they will also be called features. A feature is any higher-level representation of a raw input. So another question is, how does this analysis change if we cannot assume that they are independent, if they are correlated? I'm not clear about the question. OK, so I think I get it. When we get the inputs, the question is independence versus dependence, and the independence was used in getting the generalization bound. That's probably the direction of the question. The independence was from one data point to another. I have capital N inputs, and I want these to be generated independently according to a probability distribution. If they were originally independent, and I transformed one of them and transformed the other, the independence is inherited.
There's no question of independence between coordinates of the same input; the question was about independence between the different input points. Another question: are there methods that use different hyperplanes, and intersections of them, to separate data? Correct. The linear model that we have described is the building block of so many models in machine learning. You will find that if you take a linear model with a soft threshold, not the hard-threshold version, and you put a bunch of them together, you get a neural network. If you take the linear model and try to pick the separating boundary in a principled way, you get support vector machines. If you take the nonlinear transformation and try to find a computationally efficient way of doing it, you get kernel methods. So there are lots of methods within machine learning that build on the linear model. The linear model is somewhat underutilized. It's not glorious, but it does the job. The interesting thing is that if you have a problem, there is a very good chance that a simple linear model will achieve what you want. You may not be able to brag about it, but it will do the job. And obviously, the other models will give you incremental performance in some cases. A question getting a little bit ahead: how do you assess the quality of E in and E out systematically? This is a theoretical question. E in is very simple: I can evaluate it at any given point, so I can assess it by just looking at its value. And this is what enables the algorithm to pick the best in-sample hypothesis, by picking the one that has the smallest in-sample error. The out-of-sample error I don't have access to. There will be some methods, described after the theory, that will give us an explicit estimate of the out-of-sample error.
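The point that E in can be evaluated directly, and that learning on a finite hypothesis set amounts to picking the hypothesis with the smallest in-sample error, can be made concrete with a toy sketch. The hypothesis set here (a few fixed thresholds on a one-dimensional input) is a made-up example for illustration.

```python
import numpy as np

def in_sample_error(h, X, y):
    # E_in: fraction of training points that h misclassifies.
    # Unlike E_out, this is directly computable from the data.
    return np.mean(h(X) != y)

# A hypothetical finite hypothesis set: sign(x - t) for a few thresholds t
hypotheses = [lambda X, t=t: np.sign(X - t) for t in (-1.0, 0.0, 1.0)]

X = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# "Learning" on a finite set: pick the hypothesis with smallest E_in
g = min(hypotheses, key=lambda h: in_sample_error(h, X, y))
```

E out, by contrast, would require the unknown input distribution and target, which is why the theory (and later, explicit estimation methods) is needed to argue that E in tracks it.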
But in general, I rely on the theory that guarantees that the in-sample error tracks the out-of-sample error, in order to go all out for the in-sample error and hope that the out-of-sample error follows. We saw this in the graph when we were looking at the evolution of the perceptron: the in-sample error was going down and up, and the out-of-sample error was also going down and up, albeit with a discrepancy between the two. But they were tracking each other. Here's a question that's kind of a follow-up: if you want to fit a polynomial, is this still a linear regression case? Correct. Say we have a single input variable x, like the case I gave, so you have x and y, and you fit a line. If you use the nonlinear transformation, you can transform this x to x, x squared, x cubed, x to the fourth, x to the fifth, and then fit a line in the new space. The line in the new space will be a polynomial in the old space. So this is covered through the nonlinear transformation. What is the relation between linear regression, least squares, and maximum likelihood estimation? When you look at linear regression in the statistics literature, there are many more assumptions about the probabilities and what the noise is, and you can actually get more results about it. Under certain conditions, you can relate it to maximum likelihood: for example, Gaussian noise goes with the squared error, and in that case minimizing the squared error corresponds to maximum likelihood. So there is a relationship. On the other hand, I prefer to present linear regression in the context of machine learning without making too many assumptions about distributions and whatnot, because I want it to apply to a general situation rather than a particular one. As a result, I will be able to say less in terms of the probability of being right or wrong; I just have the generalization between in sample and out of sample.
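The polynomial-fitting answer above can be sketched directly: transform x into powers of x, run ordinary linear regression in the new space, and the fitted "line" in z-space is a polynomial in x-space. This is an illustrative sketch of that recipe, not code from the lecture.

```python
import numpy as np

def poly_features(x, degree):
    # Nonlinear transform: x -> (1, x, x^2, ..., x^degree)
    return np.vander(x, degree + 1, increasing=True)

def fit_poly(x, y, degree):
    # Linear regression (pseudo-inverse) in the transformed space.
    # The weights define a line in z-space, i.e. a polynomial in x.
    Z = poly_features(x, degree)
    return np.linalg.pinv(Z) @ y

x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2                     # the target happens to be a quadratic
w = fit_poly(x, y, degree=2)   # recovers coefficients (0, 0, 1)
```

The learning algorithm never changes; only the representation of the input does, which is why this still counts as linear regression.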
But that suffices for most machine learning situations. So there is a relationship, and it's studied fairly well in other disciplines, but it is not of particular interest to the line of logic that I'm following. A popular question: can you give at least a set of usual nonlinear transformations? There will be many. When we get to support vector machines, we will be dealing with a number of transformations, some of them polynomial, as we mentioned. One of the useful ones is referred to as radial basis functions; we will talk about that as well. So there will be transformations. And the main point is to understand what you can and cannot do, in terms of jeopardizing the generalization performance, by taking a nonlinear transformation. After we are done with that theory, we will have a significant level of freedom in choosing what nonlinear transform to use, and we'll have some guidelines about some of the famous nonlinear transforms. So this is coming up. I think I already answered this question last time, but again, someone asks: is it impossible for machine learning to find the pattern of a pseudo-random number generator? Well, if it's pseudo-random, then in principle, if you get the seed, you can reproduce it. But the way it's usually used, you take a pseudo-random number, then take a few bits and have them as an output for different inputs. So just looking at the inputs and trying to decipher it is next to impossible. It's a practical question: philosophically, yes, you can; practically, it looks random for all intents and purposes. What are the different treatments for continuous responses versus discrete responses in these linear models? Obviously, this is dictated by the problem. If someone comes and they want to approve credit, et cetera, I'm going to use a classification hypothesis set. If someone wants to get a credit line or something else, then I will have to use regression.
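The radial basis functions mentioned above are only named in passing here; as a preview, a common form is the Gaussian radial basis feature, where each feature measures closeness to a center point. The choice of centers and the width parameter gamma below are illustrative assumptions, not details from the lecture.

```python
import numpy as np

def rbf_features(X, centers, gamma=1.0):
    # Gaussian radial basis features: z_k(x) = exp(-gamma * ||x - mu_k||^2).
    # Each input point gets one feature per center, measuring proximity.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)
```

A linear model fitted on these features yields a decision boundary that is highly nonlinear in the original space, which is the same transformation idea as the polynomial case.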
So it really depends on the problem. And the funny part is that real numbers look more sophisticated, yet the algorithm that goes with them, linear regression, is much easier than the other one. The reason is that the other problem is combinatorial, and combinatorial optimization is pretty difficult in general. So the answer to the question is that it depends on the target function the person is coming up with. And when there is cross-fertilization between the techniques, it's just a way to use an analytic advantage from one method to give the other one a jump start, or to give it a reasonable solution. That's a computational question; the distinction is really in the problem statement itself. Can you say what makes a nonlinear transformation good? I will be able to talk about this a little more intelligently after the theory. I would like to emphasize that the theory part will be very important in giving us all the tools to talk with authority about the issues that are being raised. So there is a reason for including the theory before we go into more details. This lecture was meant to give you a set of standard tools, and if you look at it now, you can use them for many applications and many data sets, because now you can deal with nonseparable data, you can deal with real-valued data, and you can even deal with some nonlinear situations. So it's a toolbox for you to get your feet wet, and then things will become more principled when we develop more material. I think that's it. OK. That's it. So we will see you on Thursday.