The following program is brought to you by Caltech. Welcome back. Last time we talked about validation, which is a very important technique in machine learning for estimating the out-of-sample performance. The idea is that we start from the data set of N points that is given to us. We set aside K points for validation, for estimation only, and we train with the remaining N − K points. Because we are training with a subset, we end up with a hypothesis that we label g− instead of g, and it is on this g− that we get an estimate of the out-of-sample performance, the validation error. Then there is a leap of faith: we put all the examples back in the pot in order to come up with the best possible hypothesis, trained on the most examples. That gives us g, and we use the validation error we had on the reduced hypothesis, if you will, to estimate the out-of-sample performance of the hypothesis we are actually delivering. There is a question of how accurate an estimate of E_out this is, and we found that K cannot be too small and cannot be too big for the estimate to be reliable. We ended up with a rule of thumb that about 20% of the data set goes to validation, and that gives you a reasonable estimate. Now, this is an unbiased estimate of the performance of g−: in any particular case the validation error can come out above or below E_out, but on average it is right on. On the other hand, once you use the validation error for model selection, which is the main utility of validation, you end up with a little bit of an optimistic bias, because you chose the model that performs well on that validation set. Therefore the validation error is no longer an unbiased estimate of the out-of-sample error; it will have a slight optimistic bias. We showed an experiment, using very few examples in order to exaggerate the effect, where the blue curve is the validation error and the red curve is the out-of-sample error on the same hypothesis, just to pin down the bias. And we saw that as we increase the number of validation examples, the bias goes down; the difference between the two curves goes down. Indeed, if you have a reasonable-size validation set, you can afford to estimate a couple of parameters without contaminating the data too much, so you can assume that the measurement you are getting from the validation set is a reliable estimate. Then, because the number of examples turned out to be an issue, we introduced cross-validation, which is by and large the method of validation you will use in a practical situation, because it gets you the best of both worlds. We illustrated 10-fold cross-validation: you divide the data set into 10 parts, train on 9 of them, validate on the 10th, and keep that estimate of the error. You keep repeating, choosing the validation subset to be a different one of the 10 parts each time. So you have 10 runs, each giving an estimate based on a small number of examples, one tenth of the data. But by the time you average all of these estimates, you get a good estimate of what the out-of-sample error would be when training on nine tenths of the data, despite the fact that it is a different nine tenths each time. The advantage is that nine tenths is very close to the whole data set, so the quantity you are estimating is very close to the one you care about.
And furthermore, the number of examples that goes into the estimate of the validation error is really N — you used all of them, albeit in different runs. So this is really the way to go with validation. Invariably, in any learning situation you will need to choose a model or a parameter, to make some decision, and cross-validation is the method of choice for making it. So we move on to today's lecture, which is support vector machines. Support vector machines are arguably the most successful classification method in machine learning. They are very nice because there is a principled derivation of the method, there is a very nice optimization package you can use to get the solution, and the solution also has a very intuitive interpretation. So it's a very neat piece of work in machine learning. The outline is the following. We are going to introduce the notion of the margin, which is the main notion in support vector machines, and we will ask the question of maximizing the margin — getting the best possible margin. After formulating the problem, we are going to get the solution, and we are going to do that analytically. It will be a constrained optimization problem. We faced one before, in regularization, where we gave a geometric solution, if you will. This time we will do it analytically, because the formulation is simply too complicated to have an intuitive geometric solution. Finally, we will expand from the linear case to the nonlinear case in the usual way, extending all of the machinery so you can deal with nonlinear surfaces instead of just a line. We will stay with the separable case, which is the main case we are going to handle. So let's talk about linear separation. Say I have a linearly separable data set — just take four points, for example. There are lines that will separate the red from the blue. Now, when you apply the perceptron, or any other algorithm, you will get a line that separates them. You get zero training error, and everything is fine. But there is a curious point when you ask yourself: I can get different lines — is there any advantage in choosing one of the lines over another? That is the new addition to the problem. So let's look at it. Here is a line I chose to separate the two. You may not think this is the best line, and we will try to take our intuition and understand why it is not the best line. I'm going to think of a margin: if this line moves a little bit, when is it going to cross over — when is it going to start making an error? In this case, let's put a yellow region around it; that's the margin you have. If you choose this line, this is your margin of error — again, an informal notion for now. Now look at this other line; it does seem to have a better margin. And you can look at the problem closely and say, let me try to get the best possible margin, and then you get this line, whose margin is exactly right for the blue and red points. Now let us ask ourselves: which is the best line for classification? As far as the in-sample error is concerned, all of them give in-sample error zero. As far as generalization is concerned, as far as our previous analysis goes, all of them are a linear model applied to four points.
So by that estimate, the generalization will be the same. Nonetheless, I think you will agree with me that if you had your choice, you would choose the fat margin — somehow it's intuitive. So let's ask two questions. The first one is: why is a bigger margin better? The second one: if we are convinced that a bigger margin is better, can we solve for the w that maximizes the margin? Now, it is quite intuitive that a bigger margin is better. Think of the process that is generating the data, and say there is noise in it. If you have the bigger margin, chances are that a new point will still be on the correct side of the line, whereas if I use the thinner one, there is a chance that the next red point will land on the wrong side and be misclassified. I'm not giving any proofs; I'm just giving you the intuition. So it stands to reason that the bigger margin is better. Now we are going to argue that the bigger margin is better for a reason that relates to our VC analysis from before. Does anybody remember the growth function from ages ago? What was it? We take the dichotomies generated by a line on points in the plane, and let's say we take three points. On three points, you can get all possible dichotomies with a line — blue region versus not-blue region — and by varying where the line is, you can get all 2^3 = 8 dichotomies. So the growth function is as big as it can be, and we know that a big growth function is bad news for generalization; that was our take-home lesson. Now let's see how this is affected by the margin. We now count dichotomies generated not just by a line, but by a line that achieves a fat margin. So let's look at dichotomies and their margins. I take the same three points, and for each dichotomy I put the line that has the biggest possible margin for the constellation of points I have — I sandwich the points, pushing the margin out until it touches them and cannot extend any further. When you look at the results: this is a thin margin for this particular dichotomy, this is an intermediate one, this is a fat one, and this is a hugely fat one, but that's the constant dichotomy, so it's not a big deal. Now let's say I told you: you are allowed to use a classifier, but it has to have at least a certain margin for me to accept it. All of a sudden, some of these dichotomies that used to be legitimate under my model are no longer allowed. So effectively, by requiring the margin to be at least something, I am putting a restriction on the growth function. Fat margins imply fewer possible dichotomies. Therefore, if we manage to separate the points with a fat margin, we can say that fat-margin separators have a smaller effective VC dimension — a smaller growth function — than if I didn't restrict the margin at all. Although this is all informal, we will come at the end of the lecture to a result that estimates the out-of-sample error based on the margin, and we will find that indeed, when you have a bigger margin, you can achieve better out-of-sample performance. So now that I have completely and irrevocably convinced you that fat margins are good, let us try to solve for them.
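To make the "fat margins imply fewer dichotomies" claim concrete, here is a small brute-force sketch of my own (not from the lecture): it fixes three points, samples many random lines, and records which dichotomies are achievable at all versus achievable with a margin of at least rho. The specific points and the threshold rho are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three fixed points in the plane (arbitrary choice for illustration).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rho = 0.4  # required minimum margin

def margin(w, b, y):
    """Smallest signed distance y_n * (w.x_n + b) / ||w|| over the three points."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

achievable, achievable_fat = set(), set()
for _ in range(200_000):                    # random lines (w, b)
    w = rng.normal(size=2)
    b = rng.normal()
    y = np.sign(X @ w + b)
    if np.any(y == 0):
        continue
    dichotomy = tuple(y.astype(int))
    achievable.add(dichotomy)
    if margin(w, b, y) >= rho:              # accept only fat-margin separators
        achievable_fat.add(dichotomy)

print(len(achievable), "dichotomies with any line")            # typically all 8
print(len(achievable_fat), "dichotomies with margin >=", rho)  # typically fewer
```

The exact counts depend on the points and on rho, but the qualitative effect is the point: once a minimum margin is required, some dichotomies that a bare line could realize are no longer available.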
So the goal is: find the w that not only classifies the points correctly, but does so with the biggest possible margin. How are we going to do that? Well, the margin is just the distance from the plane to a point. I'm going to take from the data set the point x_n which happens to be the nearest data point to the line — the line we used in the previous example, given by the equation wᵀx = 0. And since we are going to work in higher dimensions, I'm not going to keep calling it a line; I'm going to call it a plane — a hyperplane really, but just "plane" for short. So we are talking about a d-dimensional space in general. We would like to estimate that distance, and we ask ourselves: if I give you w and the x's, can you plug them into a formula and give me the distance between the plane described by w and the point x_n? I'm taking the nearest point because then that distance will be the margin I'm talking about. Now there are two preliminary technicalities I'm going to deal with first. The first one is to normalize w. What do I mean by that? For all the points in the data set, near and far, when you take wᵀx_n you get a number that is different from zero, and indeed its sign agrees with the label y_n, because the points are linearly separable. So I could take the absolute value of this and try to relate it to the distance, but there is a minor technicality that is annoying. Say I multiply the vector w by a million — does the plane I'm talking about change? No. This is its equation; I can multiply by any positive number and I get the same plane. The consequence is that any genuine distance formula will have to divide by something that takes out that scale factor, because the scale does not affect which plane I'm talking about. So I'm going to deal with it now, in order to simplify the analysis later. I'm going to consider all representations of the same plane and pick the one that is normalized by requiring that, for the nearest point, this quantity is one in absolute value. I can always do that — I can scale w up and down until I get there — and there is obviously no loss of generality, because every plane is still represented; I have not missed any planes by doing that. Now, the quantity wᵀx_n, which is the signal as we called it, is a pretty interesting thing, so let's look at it. I have the plane; on the plane the signal equals zero, and the plane doesn't touch any points, since the points are linearly separable. As you move away from the plane on one side, the signal becomes positive: you hit the nearest point on that side first, and then the points further out. Going in the other direction, the signal is negative, and again you hit the nearest point on the negative side first and then the interior points which are further out. So that signal does relate to the distance, but it's not the Euclidean distance; it just orders the points according to which is nearest and which is furthest. What I'd like is the actual Euclidean distance, because I want to compare with the same yardstick, and the yardstick I'm going to use is the Euclidean distance. So I'm going to take this normalization as a constraint, and when I solve under it, I will find that the problem I'm now solving is much easier to solve, and the plane I get is fully general under this normalization. The second technicality is a pure technicality.
Remember that we had x in the Euclidean space R^d, and then we added the coordinate x0 in order to take care of w0 — the threshold, if you think of it as comparing with a number, or the bias, if you think of it as adding a number. That was convenient just to have a nice vector and matrix representation and so on. Now it turns out that when you solve for the margin, w1 up to wd will play a completely different role from the role w0 plays, so it is no longer convenient to keep them in the same vector. For the analysis of support vector machines we are going to pull w0 out. The vector w from now on is w1 up to wd only, and the component we took out, in order not to confuse it with the others, we are going to call b, for bias. So the equation of the plane becomes wᵀx + b = 0, and there is no x0 anymore — x0 was the artificial coordinate that used to multiply w0, which is now b. Every w you will see in this lecture follows this convention. Under it, the normalization condition reads |wᵀx_n + b| = 1 for the nearest point, and the plane is wᵀx + b = 0. That will make our math much more friendly. These are the technicalities I wanted to get out of the way — and they go in a big box, because this convention is important and will stay with us. Now we go on to computing the distance. We would like the distance between x_n — we took x_n to be the nearest point, so that distance will be the margin — and the plane. So let's look at the geometry of the situation: the equation of the plane, the condition I talked about, a plane, and a point x_n whose distance from the plane I want. First statement: the vector w is perpendicular to the plane. That should be easy enough if you have seen any geometry before, but it's not difficult to argue. Remember, I'm not talking about the weight space here: you plug in the values of w and you get a vector, and I'm looking at that vector in the input space X, and I'm saying it's perpendicular to the plane. Why? Pick any two points, call them x′ and x″, lying on the plane proper. What do I know about these two points? Since they are on the plane, when I plug x′ into the equation I get zero, and when I plug in x″ I get zero. Conclusion: if I take the difference of these two equations, I get wᵀ(x′ − x″) = 0 — and notice that good old b dropped out, which is the reason it gets a different treatment here; the other components actually matter for the direction, but b plays no role. So when you see an equation like that, what is your conclusion? That w, as a vector, must be orthogonal to the vector x′ − x″. Look at the plane: here is the vector x′ − x″ — let me magnify it — and it must be orthogonal to w. The interesting thing is that we made no restrictions on x′ and x″; they could be any two points on the plane. So the conclusion is that w, the same vector w that defines the plane, is orthogonal to every vector lying in the plane, and therefore it is orthogonal to the plane itself. So now w has a geometric interpretation: it is the normal to the plane.
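To keep the conventions straight before computing the distance, here is a compact restatement of the lecture's "box" in equation form:

```latex
% Conventions for the rest of the lecture
\text{Plane:}\quad w^{\top}x + b = 0, \qquad w \in \mathbb{R}^{d},\ b \in \mathbb{R} \ \ (\text{no } x_0,\ w_0 \text{ anymore})

\text{Normalization (nearest point } x_n\text{):}\quad \min_{n}\ \bigl| w^{\top}x_n + b \bigr| = 1

\text{Orthogonality:}\quad w^{\top}(x' - x'') = 0 \ \text{ for any } x', x'' \text{ on the plane}
\;\;\Longrightarrow\;\; w \perp \text{plane}
```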
Now we can compute the distance between x_n and the plane, and it can be done as follows. Take any point on the plane — call it a generic x. Then take the vector going from x to x_n and project it onto the direction orthogonal to the plane; that projection is your distance. So we just need the mathematics that goes with that. Here is the vector from x to x_n, and here is the vector w, which we know is orthogonal to the plane. If you project the first onto the direction of the second, the length you get is the distance. To get the projection, you form the unit vector in that direction: take w, which could be of any length, and normalize it by its norm, getting ŵ = w/‖w‖. With a unit vector, the projection is simply a dot product. So the distance is simply the inner product ŵᵀ(x_n − x) — except for one minor issue: this could be positive or negative, depending on whether w faces x_n or faces the other way, so to get a proper distance you take the absolute value. So now we can write the distance as |ŵᵀ(x_n − x)|, and since I know the formula for ŵ, this is |wᵀx_n − wᵀx| / ‖w‖. This can be simplified if I add the missing terms, plus b minus b: the distance becomes |wᵀx_n + b − (wᵀx + b)| / ‖w‖. Why is that useful? Can someone tell me what wᵀx + b, the quantity being subtracted, is? It is the value of the plane's equation at a point on the plane, so it happens to be zero. And how about the quantity wᵀx_n + b for the nearest point x_n? Well, that was the quantity we insisted be one in absolute value, remember, when we normalized w: we could scale w up and down, and we scaled it so that this quantity has absolute value one. So all of a sudden the numerator is just one, and you end up with the distance, under that normalization, being simply 1/‖w‖. That's a pretty easy thing to work with: if you take the plane and insist on the canonical representation of w, making this quantity one for the nearest point, then your margin is simply 1/‖w‖. This is what I can now use in order to choose the plane that gives me the best possible margin, which is the next step. So let's formulate the problem. Here is the optimization problem that results: we are maximizing the margin, and the margin happens to be 1/‖w‖, so that is what we maximize. Subject to what? Subject to the fact that for the nearest point — the one with the smallest value of |wᵀx_n + b|, the minimum over all points in the training set — that quantity is one: we scaled w up or down to make it so. So I take this as a constraint; when you constrain yourself this way and maximize 1/‖w‖, you get the maximum margin. Now, what do we do with this? This is not a friendly optimization problem, because constraints that have a minimum inside them are bad news — a minimum is not a nice function to have. So we are going to find an equivalent problem that is more friendly — completely equivalent, by a few very simple observations.
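Before moving on to that reformulation, here is a quick numerical sanity check of the formula we just derived — my own sketch, with an arbitrary made-up w, b, and points. The distance from any point to the plane is |wᵀx + b|/‖w‖, and once w and b are rescaled so that the nearest point has |wᵀx_n + b| = 1, the margin comes out as 1/‖w‖:

```python
import numpy as np

# Arbitrary plane and data points, for illustration only.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, 1.5], [0.0, 3.0]])

def dist_to_plane(w, b, x):
    """Euclidean distance from point x to the plane w.x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Rescale (w, b) so that the nearest point satisfies |w.x_n + b| = 1.
scale = np.min(np.abs(X @ w + b))
w_c, b_c = w / scale, b / scale

margin = min(dist_to_plane(w_c, b_c, x) for x in X)
print(margin, 1.0 / np.linalg.norm(w_c))   # the two numbers agree
```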
The first observation is that I want to get rid of that minimum — that's my biggest concern — not to mention the absolute value. And indeed, the absolute value |wᵀx_n + b| happens to equal y_n(wᵀx_n + b). Why is that? Because every point is classified correctly — we are only considering planes that classify the points correctly, and among those we pick the one that maximizes the margin — so the signal agrees with the label, and since the label is just plus or minus one, multiplying by y_n takes care of the absolute value. So I can use y_n(wᵀx_n + b) instead of the absolute value. I still haven't gotten rid of the minimum, and I don't particularly like maximizing one over the norm either, when I could be dealing with the friendly quadratic quantity ½wᵀw. Maximizing 1/‖w‖ is equivalent to minimizing ½wᵀw — everybody sees that it's equivalent. Does anybody see quadratic programming coming up on the horizon? It's a quadratic objective, and that will be fine; the only thing I need is for the constraints to be friendly — not a minimum or an absolute value, just inequality constraints that are linear in nature. So here is my candidate: y_n(wᵀx_n + b) ≥ 1 for all points. The absolute value doesn't bother me, because I already established that y_n takes care of it. But there is a subtlety. I can see that if the minimum of these quantities is one, then the constraints are true; but it is conceivable that I do this optimization and end up with a solution for which all of these quantities are strictly greater than one. That is a feasible point according to the constraints, and if it happens to give me the minimum of ½wᵀw, that's the solution I'll get — and it would be a different statement from the one I made before, where the minimum is exactly one. That's the only possible difference between the two problems. Well, is it possible that the minimum of ½wᵀw is achieved at a point where the constraint is strictly greater than one for all of them? A simple observation shows this is impossible. Say you got such a solution: you tell me, this is the smallest wᵀw I can get, and I got it at values where all the constraints are strictly greater than one. What am I going to do? I'm going to scale w and b proportionally down until one of the constraints touches one — you have slack, so I can pull all of them down slightly until one of them hits one. Under that scaling, if the original constraints were satisfied, the new constraints are still satisfied, since everything scales by the same positive factor. And the w I end up with is smaller than yours, because I scaled it down — so my solution is better than yours, a contradiction. Conclusion: when you solve this problem, the w you get necessarily satisfies at least one of the constraints with equality, which means the minimum is one, and therefore this problem is equivalent to the original one. This is really very nice: we started from a concept, went through geometry and simplification, and now we have a clean problem that we are going to solve, and when you solve it you get the separating plane with the best possible margin. So let's look at the solution. Formally, here is the constrained optimization problem: minimize the objective ½wᵀw subject to the constraints y_n(wᵀx_n + b) ≥ 1 for n = 1, …, N, over the domain w ∈ R^d and b a scalar, belonging to the real numbers. That is the statement. Now, when you have a constrained optimization with a bunch of constraints like these, we will need to go analytic in order to solve it; the geometry won't help us very much.
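To have the problem we just arrived at in one place (this is simply the lecture's formulation written out):

```latex
\min_{w \in \mathbb{R}^{d},\ b \in \mathbb{R}} \;\; \tfrac{1}{2}\, w^{\top} w
\qquad \text{subject to} \qquad
y_n \left( w^{\top} x_n + b \right) \;\ge\; 1, \qquad n = 1, \dots, N .
```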
What we are going to do is ask ourselves: constrained optimization — Lagrange. That is pretty much what we had in regularization before; we did it geometrically rather than explicitly with Lagrange, but that's what you get. The problem here is that the constraints we have are inequality constraints, not equality constraints. That changes the game a little bit — but just a little bit, because what people did is simply look at these and realize that there is a slack: if I call the slack s², I can turn the inequality into an equality and then solve the old Lagrange problem with equality constraints. I can comment on that in the Q&A session, because it's a very nice approach. It was derived independently by two sets of people — Karush, and then Kuhn and Tucker — and the conditions for the Lagrangian under inequality constraints are referred to as the KKT conditions. So now let us try to solve this, and before I go through the mathematics I'd like to remind you that we actually saw this before: the constrained optimization we solved before with an inequality constraint was regularization. It is good to look at that picture, because it puts the analysis here in perspective. In that case we were minimizing something — you don't have to worry about the exact formula — under an inequality constraint, and that resulted in weight decay, if you remember, and we had a picture that went with it. What we did is look at the picture and find a condition for the solution, and the condition was that the gradient of your objective function — the thing you are trying to minimize — becomes something related to the constraint itself, in that case along the normal direction. The most important thing to realize is that when you solve the constrained problem, the end result is that the gradient is not zero. It would have been zero if the problem were unconstrained; with the constraint, the gradient becomes something related to the constraint, and that is exactly what will happen with the Lagrangian here. One other benefit of recalling regularization is that there is a conceptual dichotomy — no pun intended — between regularization and SVM. Let's look at both cases and ask: what are we optimizing, and what is the constraint? In regularization, we already had the equation: what we are minimizing is the in-sample error E_in, under a constraint related to wᵀw, the size of the weights — that was weight decay. In the equation we just derived, in order to maximize the margin, what we are minimizing is wᵀw, and the constraint is that you get all the points right — that E_in is zero. So it's exactly the other way around. But in both cases the objective and the constraint blend together in the Lagrangian, and you end up with a compromise between the two. So now let's look at the Lagrange formulation, and I would like you to pay attention to this slide, because once you get the formulation, we are not going to do much beyond getting a clean version of the Lagrangian and then passing it on to a quadratic programming package to give us the solution.
But at least arriving at that formulation is important, so let's look at it. We are minimizing our objective function ½wᵀw subject to the constraints of this form. First step: take the inequality constraints and put them in the zero form. Instead of saying y_n(wᵀx_n + b) is greater than or equal to one, you write y_n(wᵀx_n + b) − 1 and require that this is greater than or equal to zero. Now each of these gets multiplied by a Lagrange multiplier. Think of it this way: since this quantity should be greater than or equal to zero, it is the slack; the Lagrange multiplier gets multiplied by the slack, the products are added up, and they become part of the objective. They come in with a minus sign simply because the inequalities are in the direction "greater than or equal to" — that's what goes with the minus. I'm not proving any of this; I'm just motivating that the formula makes sense, but there is mathematics that pins it down exactly. So let's give it a name: it is the Lagrangian, and it depends on the variables we are minimizing over, w and b, plus a bunch of new variables, the Lagrange multipliers — the vector α, which was called λ in other contexts; here the standard name is α — and there are capital N of them, one Lagrange multiplier for every point in the data set. We are minimizing this with respect to the original variables, and the interesting part, which you should pay attention to, is that we are maximizing it with respect to α. Again, I'm not giving a mathematical proof that this method holds, but this is what you do, and it is interesting because when we had equality constraints we didn't worry about maximization versus minimization — all you did was set the gradient to zero, and that applies to both maxima and minima. Here you have to pay attention, because you are maximizing with respect to the α's, and the α's have to be non-negative. Once you restrict the domain, you cannot just set the gradient to zero: if the function were unconstrained, the maximum would be where the gradient is zero, but if I tell you to stop at the boundary, the function could still be increasing there; that boundary point is the one you pick, and the gradient there is definitely not zero. We are not going to worry too much about it, because we will just tell the quadratic programming package "please maximize" and it will give us the solution — but that is the problem we are solving. Now let's do at least the unconstrained part: with respect to w and b we are just minimizing, so let's do it. We take the gradient of the Lagrangian with respect to w — the partial derivative with respect to every weight that appears. How do I get that? I differentiate: from ½wᵀw I get w, since the square goes with the half; and from the other term I ask myself what multiplies w — it's α_n y_n x_n, for every n from 1 to N — and the minus sign comes along. Everything else drops out. So the condition is that w minus the sum over n of α_n y_n x_n must be the zero vector. What is the other condition? I take the derivative with respect to b — b is a scalar, the remaining parameter. What multiplies b? Not just the α's; it's y_n times α_n, summed over n, and everything else drops out. So the condition is that the sum over n of α_n y_n must equal the scalar zero.
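Written out, the Lagrangian and the two conditions we just derived are (the lecture's formulas, collected in one place):

```latex
% Lagrangian: minimize over w and b, maximize over alpha_n >= 0
\mathcal{L}(w, b, \alpha) \;=\; \tfrac{1}{2}\, w^{\top} w
  \;-\; \sum_{n=1}^{N} \alpha_n \bigl[\, y_n \left( w^{\top} x_n + b \right) - 1 \,\bigr]

% Stationarity with respect to w and b
\nabla_{w} \mathcal{L} \;=\; w - \sum_{n=1}^{N} \alpha_n y_n x_n = 0
  \;\;\Longrightarrow\;\; w = \sum_{n=1}^{N} \alpha_n y_n x_n

\frac{\partial \mathcal{L}}{\partial b} \;=\; -\sum_{n=1}^{N} \alpha_n y_n = 0
  \;\;\Longrightarrow\;\; \sum_{n=1}^{N} \alpha_n y_n = 0
```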
So optimizing with respect to w and b resulted in these two conditions. What I'm going to do now is go back and substitute these conditions into the original Lagrangian, so that the maximization with respect to α — which is the tricky part, because α has a restricted range — becomes a problem in α alone, free of w and b. That formulation is referred to as the dual formulation of the problem. So let's substitute. Here is what I got from the last slide: from the gradient with respect to w being zero, w has to be the sum over n of α_n y_n x_n; and from the partial derivative with respect to b being zero, the sum over n of α_n y_n is zero. Now substitute these into the Lagrangian, which has the form we wrote. Let's do this carefully, because things drop out nicely and we get a very nice formula at the end which is a function of α only. First, I get the sum of the Lagrange multipliers, Σ α_n. Where did that come from? The −1 inside the bracket gets multiplied by α_n for every n, and its minus sign cancels with the overall minus, so I get plus Σ α_n; that part of the bracket is now used up. Next, look at the b term: b gets multiplied by y_n α_n, summed from n = 1 to N. But the sum of α_n y_n is zero, so the b term is killed. Now I'm down to the quadratic part, and it's easy to see what happens. Look at the form for w: when you take wᵀw you get a double summation, α_n α_m y_n y_m x_nᵀx_m, with the proper renaming of the dummy variable. And when you substitute w into the remaining term, you get exactly the same double summation — another α, another y, another x. One of these comes with a factor of one half and the other with a factor of minus one, so when you add them up you end up with minus one half of the double sum. The result is: Σ_n α_n − ½ Σ_n Σ_m y_n y_m α_n α_m x_nᵀx_m. This is a very nice quantity to have, because it is a very simple quadratic form in the vector α: α appears linearly in one place and quadratically in the other, and that's all. Now I need to put back the constraints I set aside. The maximization is with respect to α subject to α_n ≥ 0 for every n, and I also have to carry the conditions I inherited from the first stage. The condition Σ_n α_n y_n = 0 is a constraint on the α's, so I must impose it here. But I don't have to impose the condition on w, because whatever α's you come up with, you simply call the resulting formula w — since w no longer appears in the optimization, I don't worry about it at all. So I end up with this problem. If I didn't have those somewhat annoying constraints, I would basically be done: unconstrained optimization of a quadratic form — I solve it, maybe invert something, and I'm done. But I cannot do that, because I'm restricted to those choices of α, so I have to work with a constrained optimization — albeit a very minor one. Now let's look at the solution. The solution goes with quadratic programming.
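Putting the dual problem together in one place (again, just restating the lecture's result):

```latex
\max_{\alpha} \;\; \mathcal{L}(\alpha) \;=\; \sum_{n=1}^{N} \alpha_n
  \;-\; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m\, \alpha_n \alpha_m\, x_n^{\top} x_m

\text{subject to}\quad \alpha_n \ge 0 \;\; (n = 1, \dots, N),
\qquad \sum_{n=1}^{N} \alpha_n y_n = 0,
\qquad \text{and afterwards}\quad w = \sum_{n=1}^{N} \alpha_n y_n x_n .
```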
The purpose of this slide is to translate the objective and the constraints we have into what you are going to pass to a quadratic programming package — so this is a practical slide. First, what we are doing is maximizing this quantity with respect to α, subject to a bunch of constraints. Quadratic programming packages usually come as minimization, so we need to translate into minimization: we simply negate the objective, and now it's ready to go. The next step looks a bit scary: I expand the expression, isolating the coefficients from the α's. The α's are the variables; you are not passing α values to the quadratic programming package — it works with a vector of variables that you call α and finds the values that minimize the quantity. What you pass are the coefficients. So this is what it looks like: I have a quadratic term, and the coefficients in the double summation — the entries y_n y_m x_nᵀx_m — are numbers you read off your training data; you give me x_1, y_1 and so on, and I compute all of them. The linear term, just to be formal, happens to be minus one transposed times α, since we are just taking minus the sum of the α's — that's the bunch of linear coefficients you pass. Then you pass the constraints in the same way: the linear equality constraint, and finally the range of the α's, which happens to go from zero up — so the vector 0 is your lower bound and infinity is your upper bound. You read all of this off the slide, give it to quadratic programming, and it gives you back an α. If you are completely discouraged by this, let me remind you that all of this work is just to determine what to pass to the package. You are minimizing a very simple quadratic function with a linear term, subject to a linear equality constraint plus a bunch of range constraints, and when you expand it in terms of numbers, this is what you get and what you use. So now we are done with the analysis; we know what to optimize, and it fits one of the standard optimization tools — it happens to be a convex function in this case — so we pass it in and get the answer back. Just a word of warning before we go there: look at the size of this matrix — it is N by N, so the dimension of the matrix depends on the number of examples. If you have 100 examples, no sweat; 1,000 examples, no sweat; if you have a million examples, this is real trouble, because all the entries matter, and with a huge matrix quadratic programming will have a pretty hard time finding the solution — to the point where there are tons of heuristics for solving this problem when the number of examples is big. It's a practical consideration, but an important one. Basically, if you are working with problems in the thousands, it's not formidable; tens of thousands is flirting with danger. So pay attention to the fact that, in spite of there being a standard way of solving it and the formulation being so friendly, it is not that easy when you have a huge number of examples, and people use hierarchical methods in that case. So now we want to take this solution and solve our original problem: what is w, what is b, what is the surface, what is the margin — to answer the questions that all of this formalization was meant to tackle. So, the solution is a vector of α's.
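As a concrete illustration of "read the coefficients off the slide and hand them to a package", here is a minimal sketch of the hard-margin dual using the cvxopt solver. The lecture does not prescribe any particular package; cvxopt is just one common choice, and the toy data set below is made up. Its qp routine minimizes ½αᵀPα + qᵀα subject to Gα ≤ h and Aα = b, so the pieces below are exactly the quantities from the slide.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data (made up for illustration).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(y)

# Quadratic coefficients: P[n, m] = y_n y_m x_n . x_m  (tiny ridge for numerical stability)
P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N))
# Linear coefficients: minus the sum of the alphas
q = matrix(-np.ones(N))
# Range constraints 0 <= alpha_n, written as -alpha_n <= 0
G = matrix(-np.eye(N))
h = matrix(np.zeros(N))
# Equality constraint  y^T alpha = 0
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).flatten()
print(alpha)   # most entries come out (numerically) zero
```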
The first thing is that it is very easy to get w, because the formula w = Σ_n α_n y_n x_n is the one we got when we set the gradient with respect to w to zero. So you get the α's, plug them in, and you get w — the vector of weights you want. Now I would like to tell you about a condition which is very important and will be the key to defining support vectors. It is another of the KKT conditions, and it is satisfied at the solution. It is the following. Say you have 1,000 data points, so you get back a vector of 1,000 α's. You look at the vector and, to your surprise — you don't know yet whether it's a pleasant or unpleasant surprise — a whole bunch of the α's are just zero. The α's are restricted to be non-negative; they all have to be greater than or equal to zero, and if you find any of them negative then quadratic programming made a mistake — but it won't; it will give you non-negative numbers. The remarkable part is that out of the 1,000, more than 900 are zero. Is something wrong? Is there a bug somewhere? No, because the following condition holds. It looks like a big condition, but let's read it. This is the constraint in its zero form — y_n(wᵀx_n + b) − 1, which is greater than or equal to zero — and this is what we call the slack. The condition guaranteed to be satisfied at the solution is that either the slack is zero or the Lagrange multiplier is zero: the product of the two is definitely zero. So consider a positive slack, which means you are talking about an interior point. Remember, I have a plane and a margin, and the margin touches the nearest points; that is what defines the margin. For those points the value y_n(wᵀx_n + b) is exactly one, so their slack is zero, while for the interior points the value is bigger than one and the slack is positive. So for all the interior points, you are guaranteed that the corresponding Lagrange multiplier is zero. I claim we saw this before, in the regularization case. Remember this picture: we had a constraint, to be within the red circle, and we were trying to optimize a function whose equipotentials grow around its absolute minimum; because of the constraint we couldn't reach the absolute minimum, so we went to the boundary. When the constraint was vacuous — when it didn't really constrain us and the absolute optimum was inside — we ended up with no need for regularization, and the λ for regularization in that case was zero. That is the case of an interior point: the multiplier is zero. For a genuine constraint, where you actually have to compromise, you ended up with a condition that requires λ to be positive — those are the cases where the constraint is active, and you get a positive λ, while the slack itself is zero. So now we come to an interesting definition. The α's are largely zero; those are the interior points. The most important points in the game are the points that lie on the boundary of the margin, and these are the ones for which α_n is positive. They are called support vectors. I have N points, I classify them, I get the maximum margin, and because it is the maximum margin it touches some of the +1 points and some of the −1 points. Those points support the plane, so to speak, and they are the support vectors.
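The condition doing the work in this argument is the complementary slackness part of the KKT conditions. Stated compactly:

```latex
\alpha_n \,\bigl[\, y_n \left( w^{\top} x_n + b \right) - 1 \,\bigr] \;=\; 0 \quad \text{for every } n,

\text{so}\quad y_n(w^{\top}x_n + b) > 1 \ (\text{interior point}) \;\Rightarrow\; \alpha_n = 0,
\qquad
\alpha_n > 0 \;\Rightarrow\; y_n(w^{\top}x_n + b) = 1 .
```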
The other points are interior points. And the mathematics tells us we can identify the support vectors, because we can look for the multipliers — the α's in this case — that happen to be positive: α_n > 0 identifies a support vector. Again, when I put a box around something it's an important thing, so this is an important notion. So let's talk about support vectors. I have a bunch of points to classify, I go through the entire machinery — formulate the problem, build the matrix, pass it to quadratic programming, get the α's back, compute w, all of the above — and this is what I get. Where are the support vectors in this picture? They are the points closest to the plane, where the margin region touches, and they happen to be these three: this one, this one, and this one. All the other points here and here contribute nothing to the solution; they get α_n = 0. The support vectors achieve the margin exactly; they are the critical points. The other points' margin, if you will, is bigger, while for the support vectors the constraint is satisfied with equality, y_n(wᵀx_n + b) = 1. Now, we compute w as the sum of α_n y_n x_n from n = 1 to N — that is the equation we got from setting the gradient to zero, and it is how we convert the α's, the currency we get back from quadratic programming, into w. But now that I notice that many of the α's are zero, and α is positive only for support vectors, I can sum over the support vectors only. It looks like a minor technicality — the zero terms just made the notation clumsier — but there is a very important point here. Think of the α's as the parameters of your model. When they are zero, they don't count, and you expect almost all of them to be zero. What counts is the number of parameters whose value is actually bigger than zero. So your weight vector is expressed in terms of the x_n's and their labels, plus a few parameters — hopefully few — as many as there are support vectors. If you have three support vectors and you are working in, say, a twenty-dimensional space, then it's a twenty-dimensional space in disguise: because of the constraints, what you got is effectively three-dimensional. And now you can see why there might be good generalization: I end up with fewer parameters than the apparent parameters in the weight vector. Okay, so now that we have this, we can also solve for b — you want both w and b, where b is the bias, corresponding to the threshold term if you will. It's very easy to do, because all you need is to take any support vector, any one of them, and for any of them you know that y_n(wᵀx_n + b) = 1 holds with equality. You already solved for w, so you plug it in, and the only unknown in this equation is b. And as a check for you: take any other support vector and plug it in, and you have to find the same b coming out. That is your check that everything in the math went through.
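Continuing the cvxopt sketch from above — same assumptions and toy data, plus a hand-picked tolerance for deciding which α's count as nonzero — recovering the support vectors, w, and b looks like this:

```python
# Continuing from the earlier sketch: alpha, X, y are already computed.
tol = 1e-6                          # threshold for "alpha is nonzero" (arbitrary choice)
sv = alpha > tol                    # boolean mask of support vectors
print("support vectors:", np.where(sv)[0])

# w = sum over support vectors of alpha_n y_n x_n
w = ((alpha[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)

# b from any support vector: y_n (w.x_n + b) = 1  =>  b = y_n - w.x_n
b = y[sv][0] - w @ X[sv][0]

# Sanity checks: every support vector should give (about) the same b,
# and every point should satisfy y_n (w.x_n + b) >= 1, up to numerics.
print("b from each SV:", y[sv] - X[sv] @ w)
print("margins:", y * (X @ w + b))
```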
So now you have w and b, and you are ready with the classification line — or hyperplane — that you wanted. Now let me close with the nonlinear transforms, which will be a very short presentation that has an enormous impact. We have been talking about a linear boundary and about the linearly separable case, at least in this lecture; in the next lecture I'll go to the non-separable case. But a non-separable case can be handled here the same way we handled it with perceptrons: instead of working in the x-space, we go to the z-space. I'd like to see what happens to the support vector machine problem, as we stated and solved it, when you move to the higher-dimensional space. Does the problem become more difficult? Does the machinery still hold? Let's look at it. We are going to work with z instead of x, in the capital-Z space instead of the X space. First, let's recall what we are doing. Analytically, after all that work — and I can even forget what the details were — all I care about is: would you please maximize this quantity with respect to α, subject to a couple of sets of constraints? Look at it, and you can see that when I transform from x to z, nothing happens to the y's; the labels are the same. It is these quantities, the inner products, that will change, because now I'm working in a new space — so I'm putting them in a different color. If I work in the x-space, these are the numbers I multiply in order to get the matrix that I pass on to quadratic programming. Now let's take the usual nonlinear transform. This is your x-space, and in the x-space I give you this data. This data is not linearly separable — and not just slightly; definitely not. This is the case where you need a nonlinear transformation, and we did this one before: say you take just the squares, x1² and x2². Then you get this picture, and this one is linearly separable. So all you are doing is working in this space, and instead of getting just some generic separator, you are getting the best separator according to SVM, and then mapping it back, hoping that it pays dividends in terms of generalization. So I'm moving from x to z. When I go back to the dual problem, what do I do? All I need to do is replace the x's with z's, and then I can forget there ever was an x-space. I have the vectors z, I compute the inner products to get these numbers, I pass the numbers on to quadratic programming, and when I get the solution back I have the separating plane — or line — in the z-space. When I want to know what the surface is in the x-space, I map it back; I get its pre-image, and that's my boundary. Now, the most important thing to observe here is the following. The solution is easy. Say I move from two-dimensional to two-dimensional: nothing happens. Say I move from two-dimensional to a million-dimensional: how much more difficult did the problem become? What do I do? I now have a million-dimensional vector, inner product with another million-dimensional vector — that doesn't faze me at all; it's just an inner product, and I get a number. But when I am done, how many α's do I have — what is the dimension of the problem I'm passing to quadratic programming? Exactly the same as before: it's the number of data points.
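Here is a tiny sketch of that observation — my own illustration, not the lecture's: whether you compute the inner products in the raw x-space or in a transformed z-space, the matrix handed to quadratic programming is N by N either way; only the numbers inside it change.

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(6, 2))   # 6 points in 2 dimensions
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

def transform(x):
    """Example nonlinear transform, like the one in the lecture's picture."""
    return np.array([x[0] ** 2, x[1] ** 2])

Z = np.apply_along_axis(transform, 1, X)

# Coefficient matrices passed to quadratic programming in each space.
Q_x = np.outer(y, y) * (X @ X.T)
Q_z = np.outer(y, y) * (Z @ Z.T)

print(Q_x.shape, Q_z.shape)   # both (6, 6): the size depends on N, not on the dimension of z
```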
The number of α's has nothing to do with the dimensionality of the space you are working in. So you can go to an enormous space without paying a price for it in terms of the optimization. You will get a plane in that space — you can't even imagine it, because it's million-dimensional — it has a margin, and the margin would look very interesting in this case, and supposedly it has good generalization properties. Then you map it back here. But the difficulty of solving the problem is identical. The only thing that is different is computing those coefficients: you'll be multiplying longer vectors, but that is the least of our concerns. The other consideration is that you get a full N-by-N matrix of those numbers, and quadratic programming has to manipulate that matrix — that's where the price is paid. And that price is the same as long as you give it the numbers: it doesn't care whether each one was the inner product of two-dimensional vectors or of million-dimensional vectors. It will just hand you the α's, and then you interpret the α's in the space you created them from — so the w here belongs to the z-space. Now, if I do the nonlinear transformation, do I have support vectors? Yes, you have support vectors for sure — in the z-space, because you are working exclusively in the z-space. You get the plane there, you get the margin, the margin touches some points, and those are your support vectors, by definition. And you can identify them without even looking geometrically at the z-space, because what are the support vectors? I look at the α's I get, and the α's that are positive correspond to support vectors. So without even imagining what the z-space looks like, I can identify which points happen to achieve the critical margin in the z-space, just by looking at the α's. So the support vectors live in the space where you do the process — in this case the z-space. In the x-space there is an interpretation. Let's look at the x-space here. I have these points, not linearly separable, and you decided to go to some high-dimensional z-space — I'm not going to tell you which — and you solved the support vector machine. You got the α's, you got the line or hyperplane in that space, and then you draw the boundary here that corresponds to it, and this is what the boundary looks like. Now the alarm bells go off: overfitting, overfitting — whenever you see something like that, that's what you'd say. But here is the big advantage you get out of support vectors. This surface is simply what the line with the best margin in the z-space maps back to; that's all. So if I look at what the support vectors are in the z-space, they happen to correspond to points here — they are just data points. Let me identify them here as pre-images of support vectors. People will loosely call them support vectors, but you need to be careful, because the formal definition lives in the z-space. So they may look like this: this is one, this is another, this is another, and this is another. And when you look at them, you would think that in the z-space these are the points being sandwiched by the margin; that's what it's likely to be. Now the interesting aspect is that, if this is true, then — one, two, three, four — I have only four support vectors, so I really have only four parameters expressing w in the z-space. Because that's what we did: we said w equals the sum over the support vectors of α_n y_n z_n.
Now that is remarkable, because I just went to a million-dimensional space; w is a million-dimensional vector. And when I get the solution, if I get only four support vectors — which would be very lucky in a million-dimensional space, but just for illustration — then effectively, in spite of the fact that I used the glory of the million-dimensional space, I actually have four parameters, and the generalization behavior will go with the four parameters. So this looks like a sophisticated surface, but it's a sophisticated surface in disguise. It was very carefully chosen: there are lots of snakes that could wiggle around these points and mess up the generalization, and this one is the best of them. And you have a handle on how good the generalization is just by counting the number of support vectors. That will get us to the generalization result — but first, a good point I forgot to mention. The distances between these pre-images of support vectors and the surface here are not the margin; the margins are in the z-space. These points are likely to be close to the surface, but the distances won't all be the same, and there may be other points that look like they should be support vectors and aren't. What makes a point a support vector is that it achieves the margin in the z-space; the picture here is just an illustration. And now we come to the generalization result that makes all of this fly, and here is the deal. The result says E_out is less than or equal to something. You are doing classification, using the binary classification error, so E_out is the probability of error in classifying an out-of-sample point. The statement is very much what you would expect: you take the number of support vectors — which happens to be the number of effective parameters, the α's that survived — and divide by N, well, N − 1 in this case, and that gives you an upper bound on E_out. I wish this were exactly the result; the actual result is very close to it. To get it exactly right you need to run several versions and average — the real result is in terms of expected values of these quantities. But if the expected value lives up to its name and you get what you expect, then the E_out you get in a particular situation will be bounded above by this, which is a very familiar type of bound: number of parameters, degrees of freedom, VC dimension — dot, dot, dot — divided by the number of examples. We have seen this before. And again, the most important aspect is that, pretty much the way quadratic programming didn't worry about the nature of the space — it could be million-dimensional, and that didn't figure into the computational difficulty — the dimension doesn't figure into the generalization difficulty either. The bound doesn't ask about the million-dimensional space; it asks, after you are done with the entire machinery, how many support vectors did you get. If you have 1,000 data points and you get 10 support vectors, you are in pretty good shape regardless of the dimensionality of the space you visited, because 10 over 1,000 is a pretty good bound on E_out. On the other hand, this does not say that you can now go to any dimension and things will be fine, because you still depend on the number of support vectors. If you go through this machinery and the number of support vectors out of 1,000 comes out to be 500, you know you are in trouble.
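For reference, the bound as the lecture describes it, stated in terms of expected values over data sets — treat this as the shape of the result rather than its fully stated form:

```latex
\mathbb{E}\!\left[ E_{\text{out}} \right] \;\le\;
\frac{\mathbb{E}\!\left[\, \#\,\text{support vectors} \,\right]}{N - 1}
```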
And trouble is understood in this case, because that snake will really be a snake, going around every single point, just trying to fit the data hopelessly, getting so many support vectors that the generalization statement becomes useless. But this is the main theoretical result that makes people use support vectors, and use them with the nonlinear transformation: you don't pay for the computation of going to the higher dimension, and you don't pay for it in generalization either. And when we go to kernel methods, which are a modification of this, next time, you are not even going to pay the simple computational price of getting the inner product. Remember when I told you that you take an inner product between a million-dimensional vector and another one, and that was minor? Even though it's minor, we are going to get away without it, and when we do, we will be able to do something rather interesting: the z-spaces we visit can be infinite-dimensional — something completely unthought of when we dealt with generalization in the old way. Obviously, in an infinite-dimensional space I am not going to be able to compute the inner product explicitly, so there has to be another way, and the other way will be the kernel. That will open up a set of spaces we never imagined touching, while still getting not only the same computation, but also generalization that depends on something we can measure: the number of support vectors. I will stop here and take questions after a short break. Let's start the Q&A. OK, so can you please first explain again why you can normalize w-transpose-x plus b to be 1? We would like to solve for the margin given w. The margin depends on the direction of w — which way the plane is facing; that is the relevant part — but w also has an inherent scale in it, and the scale has nothing to do with which plane you are talking about. When I take the full w and b, and take 10 times them, they look like different vectors as far as the analysis is concerned, but they describe the same plane. If I solve without the normalization, I will still get a solution, but whatever I'm optimizing will invariably have in its denominator something that takes out the scale, so that the quantity is scale-invariant. I cannot possibly have the solution tell me that w has to be this specific vector when, in fact, any positive multiple of it serves the same plane. So all I am doing is simplifying my life in the optimization. I want the optimization to be as simple as possible; I don't want it to be something divided by something, because then I will have trouble actually getting the solution. Therefore I started by putting a condition that does not result in any loss of generality: I restrict the w's, not the planes — all planes are still admitted, but every plane is represented by an infinite number of w's, and I am picking the one particular w that happens to satisfy that normalization. When I do that and put it as a constraint, the thing I end up optimizing happens to be a friendly quantity that goes with quadratic programming, and I get the solution. I could definitely have started without this condition, except that I would run into mathematical trouble later on. That's all there is to it.
Similarly, I could have kept w0 in. But then, every time I write something, I would have to tell you: take the norm of only the components w1 up to wd, and forget the first one. So all of this was just pure technical preparation that does not alter the problem at all, and that makes the solution friendly later on. Okay, so many people are curious: what happens when the points are not linearly separable? Okay, there are two cases. In one of them, the points are horribly not linearly separable, like that, and in this case you go to a nonlinear transformation, as we have seen. And then there is the slightly non-separable case, as we have seen before. And in that case, you will see that the method I described today is called hard-margin SVM. Hard margin, because the margin is satisfied strictly. And then you are going to get another version of it, which is called soft margin, that allows for a few errors and penalizes them. And that will be covered next. But basically, it's very much in parallel with the perceptron. The perceptron needs linearly separable data. If there are a few violations, then you apply something else, say the pocket algorithm in that case. But if the data is terribly non-separable, then you go to a nonlinear transformation. A nonlinear transformation here is very attractive because of the particular positive properties that we discussed. But in general, you actually use a nonlinear transformation together with the soft version, because you don't want the snake to go out of its way just to take care of an outlier. You are better off just making an error on the outlier and making the snake a little bit less wiggly. And we will talk about that when we get to the details. Could you explain once again why, in this case, just the number of support vectors gives an approximation of the VC dimension, whereas in other cases the transformation matters? Well, the explanation I gave was intuitive. It's not a proof. There is a proof for this result that I didn't even touch on. And the idea is the following. We have come to the conclusion that the number of independent or effective parameters is the VC dimension in many cases. So to the extent that you can actually accept that as a rule of thumb, then you look at the alphas. I have as many alphas as data points. So if these were actually my parameters, I would be in deep trouble, because I would have as many parameters as points; I would basically be memorizing the points. But the particulars of the problem result in the fact that, in almost all cases, the vast majority of the parameters will be identically zero. So in spite of the fact that they were free to be non-zero, the expectation that almost all of them will be zero means that, more or less, the effective parameters are the ones that end up being non-zero. Again, this is not an accurate statement, but it's a very reasonable one. So the number of non-zero parameters, which corresponds to the VC dimension, also happens to be the number of support vectors by definition, because the support vectors are the ones that correspond to the non-zero Lagrange multipliers. And therefore we get a rule which counts either the number of support vectors or the number of surviving parameters, if you will. And this is the rule that we had at the end, the one I said I didn't prove, but it actually gives you a bound on E_out.
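As an aside, here is a minimal sketch of how you might eyeball that rule of thumb in practice. It uses scikit-learn's SVC as one concrete solver, which is my own choice for illustration rather than the quadratic programming package referred to in the lecture, with a very large C to approximate the hard margin described today:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, (500, 2)),
               rng.normal(+3.0, 1.0, (500, 2))])
y = np.hstack([-np.ones(500), np.ones(500)])

# A very large C approximates the hard-margin SVM described in the lecture.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# clf.support_ holds the indices of the support vectors; count them.
n_sv = len(clf.support_)
N = len(X)

# Rule-of-thumb bound discussed above: (# of support vectors) / (N - 1).
# The exact theorem involves expectations over data sets; this is just the estimate.
print(f"support vectors: {n_sv} out of {N}")
print(f"rough bound on E_out: {n_sv / (N - 1):.3f}")
```

On a cleanly separable set like this, the support-vector count is typically a small fraction of N, which is exactly the regime where the bound is informative.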
Is there any advantage in considering the margin, but using a different norm? So there are variations of this. And indeed some of the aggregation methods, like boosting, have a margin of their own, and then you can compare that. It's really a question of the ease of solving the problem. If you have a reason for using one norm or another for a practical problem, for example if I see that the loss goes with the square or the loss goes with the absolute value or whatever, then I design my margin accordingly, and we go back to the idea of a principled error measure, in this case a margin measure. On the other hand, in most cases there is really no preference, and it is the analytic considerations that make me choose one margin or another. But different measures for the margin, with the one-norm, the two-norm, and other choices, have been applied, and there is really no compelling reason to prefer one over the other in terms of performance. So it really is the analytic properties that usually dictate that choice. Is there any pruning method that can maybe get rid of some of the support vectors? So you are not happy with even reducing it to the support vectors; you want to get rid of some of them as well. Well, offhand, I cannot think of a method that I can directly describe as getting rid of some support vectors. What happens, for computational reasons, is that when you solve a problem with a huge data set, you cannot solve it all at once. So sometimes what happens is that you take subsets and get the support vectors, and then you take the union of the support vectors and get the support vectors of the support vectors, and so on. But these are really computational considerations. Basically, the support vectors are there to support the separating plane, so if you let one of them go, the thing will fall. Obviously I am only half joking. But really, they are the ones that dictate the margin, so their existence tells you that the margin is valid, and that's really why they are there. Some people are worried that a noisy data set would completely ruin the performance of the SVM. How does it deal with that? It will be hurt as much as any other method would be. It's not particularly susceptible to noise, except that, obviously, when you have noise, the chance of getting a cleanly linearly separable data set is not there, and therefore you are using the other methods. If you were using strictly a nonlinear transformation with the hard margin, then I can see the point about ruining the performance, because now the snake is going around the noise, and obviously that is not good because you are fitting the noise. But in those cases, and in almost all cases, you use the soft version of this, which is remarkably similar. It has different assumptions, but the solution is remarkably similar. And therefore, in that case, you will be as vulnerable or not vulnerable to noise as you would be by using other methods. All right. I think that's it. OK. Very good. So we will see you next week.