The following program is brought to you by Caltech. Welcome back. Last time we introduced support vector machines, and if you think of linear models as the economy car, which is what we said when we introduced them, you can think of support vector machines as the luxury line of those cars. Indeed, they are nothing but a linear model in their simplest form, except that they are a little more keen on the performance, and the key to the performance was the idea of the margin: if the data is linearly separable, there is more than one line that can separate the data, and if you take the line that has the biggest margin, furthest away from the closest point, you have an advantage. It's both an intuitive advantage and an advantage that can be theoretically established, which we did through the idea of the growth function in this case.

After we determined that it's a good idea to maximize the margin, we set out to do that, and after a chain of mathematics we ended up with a Lagrangian that we are going to maximize. The Lagrangian has very interesting properties: it's quadratic, so it's a simple function, and the constraints are very simple inequality constraints, plus one equality constraint. We are not going to do the solving ourselves; we pass the problem on to a quadratic programming package and wait for it to give us back the values of the alphas. Now, quadratic programming will have trouble solving this if the number of examples is large, so once you get to the thousands it becomes an issue, and there are all kinds of heuristics to deal with that case. In general, quadratic programming sometimes needs babysitting, tweaking, limiting ranges and whatnot, but at least someone else wrote it, and we only have to do these things in order to get the solution rather than writing it from scratch, so it's not a bad deal for us.

Once we get the alphas back, there is a very interesting interpretation. You look at the alphas, the Lagrange multipliers, and some of them will be greater than zero, while most of them will be zero. They should be identically zero; in reality, because of rounding errors, you might get them very, very small and you set them manually to zero. The points whose alphas happen to be greater than zero are special, and they are called the support vectors. Whether you are working in the X space, or took the X space, moved to a Z space, and came back, the support vectors are the ones that achieve the margin; they sit exactly at the critical boundary, and they are used to define the separating plane. The most important aspect about them is that you can bound the out-of-sample error based on the number of support vectors you get. It is the familiar form of complexity: the number of parameters, in this case the non-zero alphas, or the number of support vectors that corresponds to them, divided by more or less the number of examples. We have seen that before, but the key point here is really worth noting. On the right-hand side, setting aside the expected value (the expected value just tells us that we have to average this over a number of data sets for the statement to be true), we are dividing the number of support vectors by N. The number of support vectors is an in-sample quantity: you do all of this, you get the alphas back, and you can tell what the number of support vectors is, in sample.
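(For reference, the bound being described, reconstructed from the description above; the denominator is N − 1, which is more or less the number of examples:)

$$ \mathbb{E}\bigl[E_{\text{out}}\bigr] \;\le\; \frac{\mathbb{E}\bigl[\#\ \text{of support vectors}\bigr]}{N-1} $$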
So we are able to bound the out-of-sample error using an in-sample quantity, and we know from previous experience that this is a big deal, because now we don't only measure the in-sample error; we can also bound the out-of-sample error using a quantity we can actually measure.

We applied support vector machines only to linearly separable data, at least in the previous lecture. In this lecture we will generalize that. In order to deal with cases where the data is not linearly separable in the X space, what we did is use a nonlinear transform, as we did before with linear models. And a curious thing happened when we did that, because we went to a fairly high-dimensional Z space and we got a surface that is wiggly and so on, which in our mind raises alarm bells as far as generalization is concerned. But we ended up with something that can be stated, in a somewhat simplistic form, as follows: we get a complex hypothesis, which is the snake, but we don't pay the price for it in terms of the complexity of the hypothesis set. Remember, the complexity of the hypothesis set is what we pay the price for in the VC analysis, right? And it is typically the case that when the hypothesis set, capital H, is more complex, each individual hypothesis is also complex. But here we sort of did some cheating: we used a high-dimensional Z space, so it's a complex hypothesis set, but the hypothesis we get, although it looks very complex, really belongs to a simple set, because it maximizes the margin. So we get the benefit of a fairly low out-of-sample error, in spite of the fact that we fit the data very well by getting zero in-sample error. Now, this statement is exaggerated, I grant you that, but it has an element of truth in it, and it captures what support vector machines do: they allow you to go very sophisticated without fully paying the price for it.

So today we are going to continue by extending support vector machines beyond the basic case, and we are going to cover the main method, which is kernel methods, in the bulk of the lecture. There are two topics. The first is kernels, formally referred to as the kernel trick, which takes care of the nonlinear transformation when the Z space can be very sophisticated, so sophisticated that you can't even write it down; it's an infinite-dimensional space, which would be completely unheard of if you were using plain-vanilla linear models. The other topic is to extend support vector machines from the linearly separable case to the non-separable case by allowing yourself to make errors. Pretty much as you went from perceptrons to the pocket algorithm, here you go from the support vector machines we introduced, which we will now label hard margin because they strictly obey the margin, to a soft margin that allows some errors. Both of these extensions will expand your horizons in terms of the problems you are able to deal with, and chances are that in a practical problem you are going to use both: you go to a high-dimensional space, sometimes an infinite-dimensional space, without paying the price for it, as we will see in a moment, and in addition you allow some errors in order not to let outliers dictate an unduly complex nonlinear transformation. So both of them will come in handy. Let's start with the kernels.
The idea of the kernel is that I want to go to the Z space without paying the price for it, and we are already halfway there. If you remember from the last lecture, the way z manifests itself in the computation is very simple: you do an inner product in the Z space, and from then on it's a regular quadratic programming problem. The dimensionality of the problem depends on the number of examples, not on the dimensionality of the Z space, once you have the inner product; and when you get the result back, you count the number of support vectors, which again depends on the number of examples, not on the dimensionality. Obviously the dimensionality can come in indirectly — you may end up with such a wiggly surface that every other point becomes a support vector in order to support that type of boundary — but the dimensionality of the Z space does not appear explicitly. Nonetheless, we still have to take an inner product in the Z space.

So in this viewgraph I'm going to zoom in on a very simple question: what do I need from the Z space in order to carry out the machinery that I have seen so far? We have a Lagrangian to solve, and since we are interested in what we do in the Z space, I'm going to make those parts purple. In order to form the Lagrangian, I need the inner product in the Z space. Now, think of it this way: I am a guardian of the Z space. I'm closing the door; nobody has access to the Z space. You come to me with requests. If you give me an x and ask me for the transformation, that's a big demand — I have to hand you a big z — and I may not allow that. But let's say that all I am willing to give you are inner products: you give me x and x′, anything, and I come back with a number, which is the inner product between z and z′, without actually telling you what z and z′ were. That would be a simple operation, and if you can get away with it, that's a pretty good deal, because now we can focus entirely on inner products in the Z space and see whether that leads to a simplification.

So in this slide we are going step by step through the machinery to see whether we need anything from the Z space other than the inner product. In forming the Lagrangian, we need the inner product. Let's look at the constraints, which we have to pass to quadratic programming. This is the first constraint: I don't see any z, so we are fine. The other one, the equality constraint, has no z either. So if you have the inner product, you are ready to pass the problem; back comes the vector of alphas, alpha_1, alpha_2, up to alpha_N. Now you need to implement your final hypothesis — you are not just solving this, you are going to hand the hypothesis to your customer, right? — and the hypothesis looks like this.
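(For reference, the two quantities being pointed at on the slides — the dual Lagrangian and the final hypothesis — reconstructed in the notation of the previous lecture, with the Z space appearing only through the z's:)

$$ \mathcal{L}(\boldsymbol{\alpha})=\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\,\alpha_n\alpha_m\,\mathbf{z}_n^{\mathsf T}\mathbf{z}_m, \qquad g(\mathbf{x})=\operatorname{sign}\!\bigl(\mathbf{w}^{\mathsf T}\mathbf{z}+b\bigr) $$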
Now I look at this and I am a little bit worried, because here is w and z. Although this is an inner product, it's an inner product between w and z, and I don't know what w is — w lives in the Z space — so I want to make sure: can I get away with this? Well, w is no mystery to us; we solved for it explicitly, and we found that you get w by adding up, over all the points — but in particular over the support vectors, which are the ones with non-zero alpha — this quantity. If you take that expression and plug it in for w, what do you get in terms of what you need to compute? You need to compute inner products. That's encouraging.

One more item: this innocent-looking b is loaded. It is one of the parameters we solved for; maybe that's what will kill us. Let's see. How do I solve for b? I take any support vector and solve this equation. So I take a support vector, number m, and plug it in. Am I in trouble because of the w? No — we already saw that w reduces to inner products, and all that remains in order to solve for b is this fellow, again an inner product. Done. We only need the Z space as far as the inner product is concerned.

Now that is a very interesting possibility: if I am able to compute the inner product in the Z space without visiting the Z space, I will still be okay. You may wonder how I am going to do that — that's a different question — but all we need now is the following: I give you x and x′, two points in the X space, you do your thing, and you come back with a number, promising me that it is the inner product in your space, a space that I never visited. We come up with the support vectors, which live in your space, get the performance guarantee based on the number of support vectors, and deliver the hypothesis to the customer, telling them we used a really sophisticated space. Then they ask, what is it? — and we have one of our stunned-silence moments in machine learning, where you do something knowing that the mere existence of the space is sufficient.

So this quantity is a function of x and x′ — that much we know. We don't know which function, but it is a function. Why? Because z is exclusively a function of x, and z′ is exclusively a function of x′, being transformed versions of them, and therefore their inner product is determined by x and x′. So this is the function I'm looking for, and we are going to call it the kernel — hence the name. I put "inner product" between quotation marks because, in general, as a function of x and x′ it is not a straight inner product; it is an inner product after a transformation.

Now let me give you an example. It's a bit of a simplistic example, but it illustrates the idea. Let's say X is two-dimensional Euclidean space, and we have a transformation that takes the vector x and produces the vector z, the full second-order transform, so we have six coordinates corresponding to all the terms up to second order involving x_1 and x_2 — we used that one before. Therefore, if you want to get the kernel, which is formally the inner product between the transformation of x and the transformation of x′, you just substitute x, substitute x′, multiply the corresponding terms, and add them up, and this is what you get.
So the only lesson we are learning here is that this is indeed just a function of x and x′. If I didn't know it was an inner product, I could still look at it as a function I can compute. Fine. Now we come to the trick: can I get such a function without transforming x and x′ at all?

Let's look at the example again. I am going to improvise a kernel. It doesn't transform anything to a Z space and then take an inner product; it just tells you what the kernel is, and then I am going to convince you that this kernel actually corresponds to a transformation to some Z space and an inner product there. Here is my kernel: it's a function of x and x′, and it happens to have the form (1 + xᵀx′)². This special form will help me later on, but the main thing to notice is that this is not an inner product in the X space, in spite of the fact that it involves one computationally — I take the inner product, add one, and square. It's just a function; I happen to have written it in terms of an inner product, but otherwise it is just a function.

Now I am going to take this function and write it out explicitly in terms of the components. I am still working with two-dimensional inputs, so the inner product here is x_1 x_1′ + x_2 x_2′. I can certainly square that, and I get this quantity. So this is the value of the kernel, and it looks awfully familiar — it looks like an inner product, except for these annoying 2's. It would have been exactly as if I transformed to the second order and took the inner product, except for those factors. But is that going to discourage me? No. This still is an inner product, and the transformation to the space that makes it an inner product is this fellow: x goes to this. See, I put in a square root of 2; I can put anything — it's my transformation. The only test you need to apply is whether I used exactly the same transformation for x′, and yes, I did. So this is indeed a transformation of x into a z, and when I take the inner product, I get my kernel.

Good — concept established. But you might say that's a lot of fuss about nothing; I could have done the explicit transformation in the first place. Now think of what happens if, instead of squaring, I raise it to the power 100. Look at the difference between computing this quantity and actually going to the 100th-order transformation, expanding one vector, expanding the other, and then doing the inner product.

So let's see how this works in general; this is called the polynomial kernel. Now I take a d-dimensional input space, not 2 but a general d, and I would like the transformation of that space into a Qth-order polynomial. Here is the equivalent kernel — I put "equivalent" between quotation marks because the square roots will happen here again, in abundance, but those are just scales; the main idea is still there. My kernel is (1 + xᵀx′) raised to the power Q. First, let's establish what it takes to compute this. A valid kernel is an inner product in some space; I haven't shown that yet — I strongly suspect it will be, by the previous argument, and it will become clearer in a moment — but first the computation. This is an inner product: I have d components corresponding to each other, I multiply and add them, I add 1, and then I need to raise the result to the power Q. Whether Q is 10 or 100 or a million, it's the same complexity: this is just a number; I'm not expanding anything, I'm just plugging in and raising the number to the power 100 or 1000. So this is a very simple operation to carry out.
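As a small sketch of that point (my own illustration, not code from the lecture), here is the two-dimensional, second-order case: the explicit transform with the square-root-of-2 scalings gives the same number as computing (1 + xᵀx′)² directly, and raising to a higher power costs essentially nothing extra.

```python
import numpy as np

def phi2(x):
    # Explicit second-order transform of a 2-d point, with the sqrt(2)
    # scalings that make its inner product match (1 + x.x')^2.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, xd, Q=2):
    # The kernel computed entirely in the X space: (1 + x.x')^Q.
    return (1.0 + np.dot(x, xd)) ** Q

x, xd = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.dot(phi2(x), phi2(xd)))   # inner product in the Z space
print(poly_kernel(x, xd, Q=2))     # same number, without visiting Z
print(poly_kernel(x, xd, Q=100))   # power 100 costs essentially the same
```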
Now think of what happens if you actually take d = 10 and Q = 100. If I expanded this — conceptually, not computationally — I would get a tremendous number of terms, of all orders up to Q, in different combinations of the components of x; every time a component of x appears in a product, the corresponding component of x′ appears with it, so it shouldn't be a surprise that I will be able to decompose the whole thing into something-of-x dotted with something-of-x′, the same trick we did before, only more elaborately. But if you actually went at it explicitly, you would have to hand me the entire vector in the Z space that results from a 100th-order polynomial transformation of a 10-dimensional input — an ugly beast to deal with. Instead: just a number. Take your x and x′, get the number, raise it to the power Q, and it is as if I visited the Z space and computed the inner product there.

If you're worried about the square roots and the coefficients: obviously, when you raise to the power 100 you get a bunch of cross terms, so there will be various constants in front of the terms. You can adjust the scales a little bit — not fully — by taking your kernel to be, instead of 1 plus the inner product, a version with scales a and b, which mitigates the diversity of the coefficients. But the bottom line is that a kernel of this form does correspond to an inner product in a higher-dimensional space, and I am computing it entirely in the X space using this formula.

Now, with this in mind: I did this case by construction, because the polynomial is easy — we can write it out in the two-dimensional case and extrapolate mentally to the bigger case. What we realize is that we only need the Z space to exist. Here I showed you what Z is explicitly for d = 2, and by a bit of hand-waving in the bigger case, but you can visualize what Z is. So we take this to be an inner product in some space Z, and once we do, we are good with the entire machinery and the guarantees: we will get the support vectors, we will get the generalization bound, all of the above.

And here is another example of a kernel — a genuinely useful one. Let's look at it: it is a function of x and x′, and it is symmetric in them, which is the minimum requirement for this to have a chance, but it doesn't have an inner product term at all, clearly, either in the X space or in a Z space, and offhand I have no idea what it corresponds to — I can just compute it. It is the exponential of minus gamma times the squared distance between x and x′. My question is: does this actually correspond to an inner product in some Z space? Is this equivalent to taking each of the two points, transforming it into that space, and taking a straight inner product there, getting the same number by visiting some space? The answer is yes, and the interesting thing is that that space is infinite-dimensional. So by doing this operation, which is not very difficult to compute, you have done an inner product in an infinite-dimensional space.
Congratulations — you get the full benefit of a horrific nonlinear transformation. If back in the third lecture, when I introduced nonlinear transforms with linear models, I had told you to go to an infinite-dimensional space, you would probably have been screaming at me, because the generalization would have become completely ridiculous. But here we don't worry: we carry out the machinery and then we count the number of support vectors. If I have a thousand examples and only ten support vectors, I know I'm in good shape; if I get 500 support vectors, that's a different story.

So let me try to convince you that this indeed is the case — that this kernel is an inner product in some space. I'm going to take a simple case that I can illustrate: the same kernel, but applied to a one-dimensional input space, so x and x′ are both scalars, and I'm going to take gamma to be 1. I do the following. I expand the exponent: (x − x′)² is x² plus x′² minus twice x x′, and the minus twice gets the overall minus sign and becomes a plus, so I get e to the minus x², times e to the minus x′², times e to the 2 x x′. That's a legitimate expansion. Now I take the last factor and expand it using Taylor: whatever the argument is, raise it to the power k, divide by k factorial, and sum from k equals 0 to infinity — that's the Taylor series for the exponential. Then I conveniently separate out x to the k, x′ to the k, and 2 to the k; this is just to put it in a particular form, and I get this.

Very nice, you say, but you seem to be complicating matters rather than simplifying them. Remember, my purpose is to convince you that there is a Z space in which this is an inner product. Last line — and miraculously some terms will turn blue; keep your eyes on it. You see where I am going with this: the terms that go with x have turned blue, the terms that go with x′ have turned red, and I am doing that because I am going to separate this into an inner product, something coming from x and something coming from x′, making sure the two transformations are the same. Once they are the same, the summation is the inner product: each value of k is a coordinate, and each term is the contribution of that coordinate to the inner product. So here I have x′ to the k multiplied by x to the k, each of them normalized, by e to the minus x² and e to the minus x′² respectively. If I want to see the transformation of the first point, its k-th coordinate would be e to the minus x² times x to the k, and as k goes from 0 to infinity I get the different coordinates. I would be ready to go except for the annoying constants — so let's put them in purple. What do you do with them? You divide them between the red and the blue: take the square root and put one factor with the red, and the other square root with the blue. And now we formally have two identical transformations, one applied to x and one applied to x′, and they are infinite-dimensional because you are summing from 0 to infinity.

Now, this is a very interesting kernel: it is called the radial basis function kernel, and if that rings a bell, indeed it is related to the subject of the next lecture.
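(Written out, the expansion just described, for the one-dimensional case with gamma equal to 1:)

$$ e^{-(x-x')^2} \;=\; e^{-x^2}\,e^{-x'^2}\,e^{2xx'} \;=\; \sum_{k=0}^{\infty}\left(e^{-x^2}\sqrt{\frac{2^k}{k!}}\;x^k\right)\left(e^{-x'^2}\sqrt{\frac{2^k}{k!}}\;(x')^k\right) $$

so the k-th coordinate of the transformation is $\Phi_k(x) = e^{-x^2}\sqrt{2^k/k!}\,x^k$.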
So let us look at this kernel in action. I'm going to take a slightly non-separable case in the X space — after all, if I'm taking this glorious nonlinear transformation, I'd better have something that is not linearly separable in order to show you the goods, but I'm making it only slightly non-separable in order to make a point. I generate the points at random and I get them here, and if you look at the hundred points, there really is no line that separates them.

Now what I'm going to do is lighten the target function — the target function did its job, it generated the examples — but I'm going to leave it in the picture in order to compare it with the final surface that we get, so I'll just show it as a light surface; you can probably see it, the light green surface. So this is the data set I'm working with.

Now I transform x into an infinite-dimensional space. Someone else worries about that: all I'm doing is, in effect, that transformation, by computing the kernel instead of the plain inner product in the X space, and the kernel is the one from the last slide, which happens to be a simple exponential. I compute it and I get the result. So what happens when you do that? You get the support vectors — let me magnify it. You have the two classes, and I darkened the points that ended up being support vectors: one, two, three, four blue ones, and in the red I have one, two, three, four, five — nine support vectors all together. Nine support vectors, and how many points? 100 points. Can you tell me the bound on the out-of-sample error? It looks like it should be less than about 10%. I have gone to an infinite-dimensional space — you are witness to that — I used what is effectively an infinite number of parameters, completely suicidal in terms of generalization, and yet, nine support vectors out of a hundred.

So now let's look at the surface. In the Z space the separator is a mysterious object — a hyperplane in a space of dimension infinity minus one; that's very nice — so instead I transform the boundary back to the X space and look at it, and it looks like this. How did I get that? I didn't go to the Z space. What I did is evaluate the hypothesis in the X space and find where it crosses from minus one to plus one — that's my only tool — and I can do it because the kernel is easy to compute. If I had to go to the Z space itself, you would never have heard from me again. So you look at it, and it's really very pretty: you don't recover the green target exactly, but you can see that it's pretty good.

The other thing to notice is the notion of distance. The data is linearly separable in that Z space — it had better be; if you don't get linear separability there, you are really in trouble — and when I get linear separability there, I get a margin, which the machinery has already maximized, so I get a respectable margin. But the value of that margin lives in the Z space; I cannot see it here. What you can see here is that, judged by distance in the X space, these two support vectors are awfully close to the surface, while this support vector is not that close — maybe it will become closer as the surface extends, but it's definitely further away. So in the X space, support vectors can look far from the boundary, and it seems like a strange thing; don't sweat bullets over it. It's happening in a space that we don't see. As long as the machinery for the solution is correct, and I get the support vectors — the points with alpha greater than zero — I am in business. So let's shrink this back.
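If you want to run this kind of experiment yourself, here is a minimal sketch — my own, using scikit-learn's SVC rather than whatever package produced the lecture figures — that fits an RBF-kernel SVM with a very large C (approximating the hard margin) on a toy data set and counts the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
# A slightly non-separable toy target, just for illustration.
y = np.where(X[:, 1] - np.sin(np.pi * X[:, 0]) + 0.3 * rng.normal(size=100) > 0, 1, -1)

# A very large C approximates the hard margin; gamma is the coefficient
# inside exp(-gamma * ||x - x'||^2).
clf = SVC(kernel="rbf", gamma=1.0, C=1e6)
clf.fit(X, y)
print("number of support vectors:", len(clf.support_))
```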
So we have a nice tool in hand, and we ask ourselves: was it overkill to go to an infinite-dimensional space? Early on, before we started all of this, we would have said it's a complete overkill — even for this two-dimensional, slightly non-separable case, if you went to a fifth-order polynomial I would already be worried that you were doing too much, and now you went to an infinite-dimensional space — but it is the number of support vectors that tells you about the generalization.

So now, assuming we are completely sold on the idea of kernels: if I give you a kernel, and it's a valid kernel that corresponds to an inner product in some Z space, how do you formulate the problem? This is just formality — you already know it. There is the big quadratic-coefficient matrix that you pass on to the quadratic programming algorithm, and you compute it in terms of inner products; these were genuine inner products when you were working with linearly separable data in the X space. Now the only thing you are going to do is, instead of passing those inner products to quadratic programming, you pass the kernel values instead. That's it — absolutely nothing else. If you look at the rest of the details, nothing is affected by the transformation other than this quadratic programming matrix.

Good. Now quadratic programming passes you back the alphas, and you need the hypothesis, so how do I construct the hypothesis in terms of the kernel? I write g of x equals this — and it is safe to write it, because there is a Z space — and I just want to translate it into terms of the kernel. I know that I can, because we spent a lot of time realizing that we don't need anything from the Z space other than the inner product, and the inner product is the kernel; I just want the explicit form. So you take w to be the sum over support vectors, you substitute it, every inner product becomes the kernel, and you get this. Now this is very interesting, because this is your model, so to speak: "support vector machines" is plural; it doesn't dictate a particular model. You choose a kernel, and it gives you a different model. The kernel you choose appears here, summed up with coefficients; the coefficients happen to be determined by the alphas, and they all happen to agree in sign with the label — that is one of the artifacts, because the alphas are non-negative — plus b. And again, b is the one parameter we haven't solved for directly, but I can solve for it using a support vector, and I end up with this: you plug it in and you have the full definition of your hypothesis, and this holds for any support vector, which is defined by alpha_m greater than zero.
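A small sketch of that bookkeeping (the function names are mine, and a real implementation would vectorize this): building the quadratic-coefficient matrix from the kernel, and evaluating the final hypothesis once the alphas and b come back.

```python
import numpy as np

def quad_coef(X, y, kernel):
    # Entry (n, m) is y_n * y_m * K(x_n, x_m); this matrix is what gets
    # handed to quadratic programming in place of y_n y_m x_n.x_m.
    N = len(y)
    return np.array([[y[n] * y[m] * kernel(X[n], X[m]) for m in range(N)]
                     for n in range(N)])

def hypothesis(x, X, y, alpha, b, kernel):
    # g(x) = sign( sum over support vectors of alpha_n y_n K(x_n, x) + b )
    s = sum(alpha[n] * y[n] * kernel(X[n], x)
            for n in range(len(y)) if alpha[n] > 1e-8)
    return np.sign(s + b)
```

Here b itself comes from any support vector x_m with alpha_m greater than zero, by solving y_m (sum_n alpha_n y_n K(x_n, x_m) + b) = 1.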
Now let me make a point. The nonlinear transformation that started all of this, in the support vector machine, is the usual kind: a fixed set of features that depend only on x — if x is one-dimensional I might end up with 1, x, x squared, x cubed, x to the fourth, and if x is more than one-dimensional I get x_1 x_2, x_1 squared, x_1 x_3, the whole thing — and I just avoided the labor of computing them by using the kernel. But you can also look at the final hypothesis itself as defining a transformation: each term K(x_n, x) is a coordinate, there are as many of them as there are terms in the sum, and each has a coefficient. That is a legitimate way of looking at it. The only thing to remember — and it's very important to remember — is that this transformation depends on your data set. You see this x_n? It comes from the data set. If I decide to use a fixed transform like 1, x, x squared, x cubed, x to the fourth, all of that is determined without looking at the data set; but to define the kernel-based transformation, I need to know what the x_n's are. We have seen this kind of thing before — remember the hidden layer in neural networks: it also amounted to a learned nonlinear transformation that depended on the data.

This view is useful because it allows us to compare support vector machines to other approaches. For example, if I put the RBF kernel here — the one with e to the minus gamma times the norm of x minus x_n squared — I get a functional form, and it is completely legitimate to say: let me fit functional forms of that type directly, without ever hearing of support vectors. It's just a model; let me see if I can get a solution. It is a very interesting exercise to compare the result of doing it that way versus doing it the SVM route, and you can do the same for a neural network and for other kernels.

Now, the remaining question. I am completely ready: if you give me the kernel, everything is understood — I can solve the problem, interpret the solution, judge its quality, all of that. The only problem is that we may not know that the kernel is valid. If I just hand you a formula for K(x, x′), the whole idea of the kernel is that you don't visit the Z space — so how are you going to verify that this is a valid kernel, namely an inner product in some space, without visiting that space? That is the question. You will want to come up with your own kernels, so it's a good idea to ask what the conditions are for a kernel to be valid. There are three approaches.

The first approach we have already seen: construction — conceptual construction, if not explicit construction, like we did with the polynomial. We looked at it and realized that there will be corresponding terms and we will be able to separate them, so in our minds that is the Z space, and without constructing it explicitly we realize that the kernel we wrote corresponds to an inner product in that space. That is a very effective approach, and the polynomial transformations are the most famous ones there.

The second approach is the one we will talk about on the next slide: using mathematical properties of the kernel, something called Mercer's condition. I wish it were a practical condition; it is very appealing theoretically, but you will find it a little difficult to apply in a given situation. The good news is that people have applied it to a bunch of kernels and declared them legitimate, so you can pick from that catalogue without worrying — those have already been established. It comes into play when you want to test a new kernel: not an easy endeavor, not an impossible one, but not easy.

The third approach is the one I find rather interesting: how do you know that Z exists? Who cares. This is the approach followed by people who say, this looks like a great machine — you give me the kernel, I do this, I do that — so I'll just improvise a kernel, and who cares whether there is a Z space or not; I never visit it anyway. Wait a minute: you don't visit it, but it has to exist for all the guarantees I talked about. The support vectors with alpha greater than zero, the generalization statement — all of that depends on the Z space being there and on the fact that you are actually separating the data there.
Believe it or not, there are quite a number of people who just improvise a kernel, apply the machinery, and see what happens — and sometimes they succeed. I have my reservations; let me put it this way.

So let's go the mathematical route, in case you do care rather than "who cares". Here is the condition, Mercer's condition. The following statement holds: the kernel you wrote down is a valid kernel — that is, the Z space you are talking about actually exists — if and only if two conditions are satisfied in conjunction. One is that the kernel is symmetric, which should be abundantly obvious: symmetric meaning K(x, x′) equals K(x′, x). This kernel is supposed to be a dot product in the Z space, right? We transform x and x′ into z and z′, and in the Z space certainly z dot z′ is the same as z′ dot z — the inner product is commutative — so if this has any chance, it had better be symmetric.

The other condition is a property of a matrix, which looks very similar to the quadratic-coefficient matrix we pass to quadratic programming: you list the values of the kernel on all pairs of points, K(x_1, x_1), K(x_1, x_2), and so on. If this were a genuine inner product and you had the transformation explicitly, each entry would be an inner product — this one would be z_1 transpose z_1, that one z_1 transpose z_2, et cetera. The condition, stated without visiting the Z space, is that when you fill in these numbers, the matrix must be positive semidefinite: for any vector a, a transpose times the matrix times a is greater than or equal to zero. And this has to hold not just for your data set, but for any choice of the points — and there you can see the difficulty: I need some mathematics helping me corner the fact that this matrix is positive semidefinite for every possible choice of points.

But this is indeed the condition, and you can see why it is plausible: if you know the transformation into Z, the matrix is built from the inner products of the z's, and when you sandwich it between a vector and the same vector, you are computing the squared length of a vector, which is guaranteed to be greater than or equal to zero — that is what positive semidefinite means. If you manage to establish the condition for a kernel, then you have established that the Z space exists, even if you don't know what the Z space is.
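A quick numerical sanity check along those lines (necessarily a spot check on one particular set of points, not a proof that the condition holds for all choices of points):

```python
import numpy as np

def looks_like_valid_kernel(kernel, X, tol=1e-9):
    # Build the kernel matrix on a set of points and check symmetry and
    # positive semi-definiteness (all eigenvalues >= 0, up to tolerance).
    K = np.array([[kernel(a, b) for b in X] for a in X])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol)
    return symmetric and psd

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X = np.random.default_rng(1).normal(size=(50, 2))
print(looks_like_valid_kernel(rbf, X))   # True on this sample of points
```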
Done with kernels — that's half the deal. Now we go to the case where the data is not separable, which turns out to be the other subject of this lecture. If the data is not separable, it could be slightly non-separable, like this, where these two points are outliers. I really don't want to go to a high-dimensional nonlinear space just to accommodate them — even counting support vectors, by the time I wiggle the boundary around and come back, I will have touched so many points that chances are the number of support vectors will be huge. In that case, if there is a method, like the pocket algorithm, where I just accept errors on those points — accept an E_in which is nonzero — then, since the generalization is good, that's fine. And then there is the seriously non-separable case, as in this picture: it is not a question of outliers, the surface is genuinely nonlinear, and you have to go to a nonlinear transformation. Kernels deal with that case; soft-margin support vector machines deal with the other one. In reality, your data will have a built-in nonlinearity, and even modulo that nonlinearity some annoying points will be there just to test your learning ability, so you will be combining the kernel with the soft-margin support vector machine in almost every problem you encounter.

So now let's focus on the soft margin. I am back in the X space; the data is not linearly separable, and I want an algorithm notwithstanding that. Once I have it, I am not even going to go through the route of saying "and by the way, you can transform X into Z, and by the way, instead of visiting Z you can use the kernel" — you know how to extrapolate to both the Z space and the kernel case yourself. I will just do the basic case.

Here is the idea of the error measure. As before, I am going to consider margin violations. In the linearly separable case, you maximize the margin: these points achieve the margin, and those are interior points. Now we are going to allow errors, and there are many ways to account for errors. I could count the number of points I misclassify, but we already realized that this is not a good idea, because the optimization becomes completely intractable — it is combinatorial; we talked about the perceptron and the pocket, and getting the absolute optimum in that setting is generally NP-hard. So instead we use a numerical value, and because the margin means something to me now — it is not just a question of being on the right side of the line, but of how far you are from it — I am going to define the error in terms of the margin. Say this point, which used to be here, has violated the margin. I am not saying that once you move the point the same solution still holds; I am just illustrating what a violation of the margin is and how to quantify it. So the point went inside, in spite of the fact that it is still correctly classified — it is on the blue side of the line, so there is no change in terms of the label — and the amount of violation is measured by this displacement.

Here is what I am going to do. This is the condition when the margin is satisfied for every point — that is the canonical form we adopted — and when it fails, the margin is violated, and I would like to quantify by how much. So I introduce a slack for every point — potentially every point; hopefully most of them will satisfy the margin and only a few will violate it — and I say that the quantity that used to be greater than or equal to one is now greater than or equal to one minus a slack, xi_n. That is what I allow. The movement from here to here results in the red xi, and the slack is greater than or equal to zero, so I am only considering violations. And the total violation? I just add up the individual violations, the sum of the xi's.
Now, we have seen error measures before, and we know that the choice is largely hand-waving: I have something in mind — either I am thinking of an optimizer and I want to hand it something friendly, or I am thinking of something that is analytically plausible. This is no different. Why did I choose this rather than the square, or any of the other options? These are considerations that always come up. This quantity does seem to measure the violation of the margin, so in the absence of further evidence one way or the other, it is a reasonable error measure to have — and when I plug it into what we had, things will collapse right back to the problem we already solved. That is the big advantage.

So that is my error measure, and the new optimization is the following. It used to be that I minimize this — one half w transpose w, which is what we did last lecture — and now I add an error term that corresponds to the violation of the margin: C times the sum of the xi's. The sum of the xi's is the quantity that I promised captures the total violation, and C is a constant that gives the relative importance of this term versus that term. This is no different from our notion of an augmented error: we used to have the in-sample performance — which here, I guess, corresponds to the violation of the margin; if you violate too much, you will start making errors — plus lambda times a regularization term, and this looks very much like regularization with weight decay. This C essentially plays the role of one over the lambda we had, but writing it with C is the standard formulation in SVM, for a good reason: C will appear in a very nice way in the solution. So this is an augmented error. If C is close to infinity, what am I saying? You had better not violate the margin at all, because the slightest violation ruins what you are minimizing, so the end result is that you will pick the xi's all close to zero — and then the data had better be linearly separable, and you are back to the hard margin. If C is very small, then you can violate the margin right and left, because violations hardly cost you anything. And in between there is a compromise. That is what you are minimizing.

Subject to what? This is what I had before, and now the constraint has xi_n added to it, so I require this to hold, and I require the xi's to be non-negative: I am only penalizing violations of the margin; I am not rewarding points for staying well clear of it. If one of the points is comfortably inside its own region, good for it, but I am not going to give it credit that would allow me to violate the margin on other points, because that is not going to help me. So xi_n is non-negative, and I get this condition for all points, all capital N of them. Finally, the variables over which I am optimizing used to be w and b, and now the xi's are added, in red. So you have the problem: you already solved the hard-margin SVM in the linearly separable case, and this is what has been added.
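(Putting the pieces together, the soft-margin optimization just described is:)

$$ \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \ \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w} + C\sum_{n=1}^{N}\xi_n \qquad \text{subject to}\qquad y_n\!\left(\mathbf{w}^{\mathsf T}\mathbf{x}_n + b\right) \ge 1-\xi_n,\quad \xi_n \ge 0,\quad n=1,\dots,N. $$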
So now we are going to go through the Lagrangian again, because it is not that much different from before — you can take it as a review. We have L of w, b, and alpha, with some missing pieces that will now be filled in. What is written here is exactly the Lagrangian you worked with before: this was the objective, and this was the zero form of the inequality constraint — the quantity minus one, greater than or equal to zero — multiplied by alpha_n with a minus sign, because the constraint is in greater-than-or-equal-to form. That is the Lagrangian we solved, and it led to the quadratic programming problem we had.

Now there is a new variable, xi, that I am also determining. How does it appear in the Lagrangian? Well, the constraint used to be "greater than or equal to one", so it appeared as "minus one" in the zero form; the new constraint is "greater than or equal to one minus xi_n", so I put the new constraint in, and minus minus gives a plus xi_n. Not scary at all. Then xi_n itself has a constraint in zero form: xi_n is greater than or equal to zero, pretty much like the other constraint, so I need to multiply it by a Lagrange multiplier — the new multiplier beta_n — and I get this. Now, I am proud of this for a reason: the slides are widescreen for this course, and I had to have an equation that takes the full width of a slide; finally, in lecture number 15, I managed to do that. You may say "forget it", but please bear with me, because terms will be dropping like flies. Just follow along and see where we arrive.

We are going to minimize this with respect to w and b, which we used to do, and also with respect to xi, the new variables, and then maximize with respect to the Lagrange multipliers: the alphas, which we used to have, and the betas, which are new. Let's take the first one, the derivative with respect to w, which we did before: I get a w here, the new red term doesn't contribute, this term doesn't interfere, and I get exactly what I got before. Encouraging. Let's do partial by partial b: this term plays no role, that term plays no role, b doesn't appear in the new terms, and again I get exactly what I got before. The final one is the partial with respect to xi_n, the new variable. I will do it one component at a time rather than as a gradient, just to keep it simple. xi_n gets multiplied by C, xi_n gets multiplied by alpha_n with a negative sign, and xi_n gets multiplied by beta_n with a negative sign, so differentiating with respect to xi_n I get C minus alpha_n minus beta_n, and I equate that to zero.

Now isn't that grand — because this quantity is zero for every n from 1 to capital N. Let's look at the ramification for the Lagrangian: substituting back, I have a C here, a minus alpha_n here, and a minus beta_n here, all multiplying xi_n, and that combination happens to be zero. So, conveniently, these terms drop out together with beta, and we are back to exactly the same Lagrangian we had before, with exactly the same solution. And what happened to beta? Beta did its service, we thank it for that service, and we bid it farewell.
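(For reference, the full-width Lagrangian just assembled, and the new stationarity condition that makes the xi terms drop out:)

$$ \mathcal{L}(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta}) = \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w} + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N}\alpha_n\bigl(y_n(\mathbf{w}^{\mathsf T}\mathbf{x}_n+b) - 1 + \xi_n\bigr) - \sum_{n=1}^{N}\beta_n\,\xi_n, \qquad \frac{\partial\mathcal{L}}{\partial\xi_n} = C - \alpha_n - \beta_n = 0. $$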
The only remaining trace of beta is this: because beta_n is greater than or equal to zero and C minus alpha_n minus beta_n equals zero, alpha_n is not only greater than or equal to zero, as it used to be; it also cannot be bigger than C, because if it were bigger than C, this quantity would become negative, and all of a sudden I could not find a legitimate beta_n to make the condition true. So the only outcome of this whole adventure is that we are going to require that alpha be at most C. Everything is as before, plus this added condition. You get the solution — maximizing with respect to alpha, as we saw before; beta doesn't appear — and you have the alphas being non-negative with the added red condition, less than or equal to C; the inequality constraint is there, and the equality constraint inherited from the derivative with respect to b is the same as before. When you get the solution, w will be this, and it is guaranteed to minimize the old objective plus the new violation term. So if you have already written your routine to apply support vector machines, all you need to do now is go to that routine and, instead of zero less than or equal to alpha less than or equal to infinity, make it zero less than or equal to alpha less than or equal to C, and you have the soft-margin support vector machine. That is a good bargain.

Now let's look, very quickly, at the types of support vectors. This is the picture for the hard-margin case, where there are only two types of points: interior points, for which the quantity corresponding to the margin is strictly greater than one, and boundary points, which happen to be support vectors, for which it is exactly one. That is all we had. With the soft version, we are going to call those points margin support vectors, because there will now be other support vectors that violate the margin. The margin support vectors still have Lagrange multipliers greater than zero, but they also happen to be strictly less than C. You can verify that independently, but let me give you the hint: when alpha hits C, beta — the other multiplier — hits zero, and when a Lagrange multiplier is zero, the corresponding slack is allowed to become positive; that was one of the conditions we had. Here the slack is zero — xi is zero, the point sits exactly on the margin — so alpha stays clear of C. These are the points you use to solve for b, and they are as clean as they used to be.

Now we add the non-margin support vectors, and by those we mean that alpha_n equals C: they are support vectors, alpha_n is greater than zero, but it has hit C, and now the slack xi_n becomes positive, and therefore the margin is violated — the quantity is one minus xi_n, less than one. So let's look at what these non-margin support vectors look like. Again, just for illustration, I am going to take these two points and make them violate the margin — not that the new solution would stay exactly the same; these points are now inside, and you would have to re-solve with your value of C and so on. You have violated the margin, but you are still classifying these points correctly: that is one type of non-margin support vector. You can violate further and cross the boundary, so that the points are misclassified; they are still non-margin support vectors.
And now E_in is affected, and you can go wild — these points end up deep on the wrong side — but all of them, as long as they violate the margin, are support vectors: not clean support vectors, but support vectors. Now, the value of C is a very important parameter here, because it determines how much violation you tolerate versus how wide the margin — the yellow region — is, and it is a quantity that, in a practical problem, will be decided by good old-fashioned cross-validation. Whenever we have one parameter to determine, we can ultimately use cross-validation, so, as you see, validation and cross-validation sit on top of all of this: here I am using a very elaborate algorithm, support vector machines, yet I am resorting to cross-validation in order to determine C.

I'll make two quick technical remarks and end the lecture here. These are just practical points in case they bother you; if you didn't notice them, they are not going to bother you. With the hard margin, I apply this machinery, I get the dual, and I pass it to quadratic programming. So I ask myself: if the data is not linearly separable, what gives? Think about it — I never told you to check that the data is linearly separable. I gave you the data and formulated "minimize this subject to that"; now, if the data is not linearly separable, "subject to that" is impossible to satisfy, and there is no feasible solution. Nonetheless, that didn't prevent me from writing down a dual and passing it to quadratic programming, which will give me back a solution, and now I am in a strange world. The key thing to realize is that the translation from the primal form, minimizing w transpose w, to the dual form, maximizing the Lagrangian with respect to alpha, is mathematically valid only if there is a feasible solution; if there is no feasible point in the domain, then I am working pretty much like the fellow who improvised a kernel that does not correspond to a Z space — you can plug it in and get a solution, but there are no guarantees. In this case, if you look at the Lagrangian, quadratic programming will try to push something off to infinity. But you need not worry much about this case: quadratic programming passes the alphas back to you — it's not as if the data all of a sudden became linearly separable — and you can always check whether the solution separates the data. You evaluate the solution on every point and compare it with the label, and when it doesn't agree with the label, you realize that something is wrong. So just be lazy if you want: go through the machinery, and when the solution comes back, check whether it separates the data. If it does, things are valid — there was a feasible solution, the dual solution is valid, and quadratic programming worked. If it doesn't, something went wrong. Chances are you won't even get to that stage, because quadratic programming will be complaining — though quadratic programming will be complaining anyway, as you may have experienced when you tried it; it is not a perfect package. So this is just a reminder that we will never be susceptible to a big mistake like accepting a solution when none exists.
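A lazy check of exactly that kind might look like this (a sketch; the names are mine): evaluate the hypothesis that came back on every training point and compare with the labels.

```python
import numpy as np

def separates_training_data(X, y, alpha, b, kernel):
    # If the hard-margin dual was solved on data that really is separable
    # in the Z space, every training point should be classified correctly.
    g = lambda x: np.sign(sum(alpha[n] * y[n] * kernel(X[n], x)
                              for n in range(len(y))) + b)
    return all(g(X[n]) == y[n] for n in range(len(y)))
```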
The last point: when we transformed to the Z space, you may have noticed that some of those transformations had a constant coordinate, 1. The coordinate 1, in our minds, used to correspond to w_0, and we made a point at the beginning of our discussion of support vectors that there is no w_0 — we took it out, called it b, the bias, and treated it differently. So now we are effectively working with both w_0 and b, because if you have a constant coordinate, you may not call its weight w_0, but effectively it is w_0 — it is the weight that gets multiplied by the constant. So what gives, now that two parameters play the same role? You don't have to worry about it. Let the Z space have twenty constant coordinates if it likes: when you get the solution, all the corresponding weights will go to zero, and the bulk of the bias will go to b. How do I know that? Because you are charged for the size of w — those weights are part of w, and you are minimizing one half w transpose w — while you are not charged for the size of b. So if you can achieve the same thing with either, everything will go into b and those weights will go to zero. With that, we'll stop here and take questions after a short break.

So let's start the Q&A, and we have an in-house question. "It seems intuitive to me that the number of support vectors goes linearly with the dimension of the space you're looking at — for example, in an n-dimensional Euclidean space you need n vectors to define an (n−1)-dimensional hyperplane, and one more to define the thickness of the fat plane." It's not that easy, because I could get, say, two clusters of points that are far away, and then two points, one plus-one and one minus-one, that are close to each other, and in order to separate them the plane has to be sandwiched between those two points; the orientation is decided by the two points around it. So although in a generic case the number may grow that way, I can construct cases where it is not linear in the dimension — let's put it this way. "So you're saying it's less than linear?" Usually better than linear, and obviously if it were strictly linear and I transformed to an infinite-dimensional space, I would be in trouble. Should it grow with the dimension? It is likely to increase with increasing dimension, but the exact form depends on the data set, including the position of the interior points. Let me put it this way: even without considering interior points, if I give you just two points, one plus-one and one minus-one, there is an optimal separating plane, and how many support vectors am I going to get? I cannot get more than two, because I only have two points, even if I go to a 100-dimensional space. So the linearity is an impression that requires further assumptions; in general it will not hold.

"And, for example, the RBF kernel — in its form it looks infinite-dimensional, but in reality its effective dimension is very small, because the high-order terms decay very fast." I agree — they decay with both an exponential term and a factorial term — and indeed that matters, because you are measuring a proper Euclidean distance in that space; if a dimension contributes very little, it doesn't affect the distance very much. So there is a question of whether it is a genuinely infinite dimension or an infinite dimension in disguise. In general, when you have an infinite-dimensional space, the only way to define the inner product properly is to have a decaying term so that the sum converges — this is essential for being able to compute it at all.
So if it doesn't converge, say you change the negative sign in the RBF kernel into a positive sign, then the inner product is not well defined, right? So it won't be a valid kernel, and you'll get horrible results.

Right. For a kernel to be valid, I have to be able to evaluate the inner product, and infinity, or a lack of convergence, would not allow that.

OK, so people are curious: how can you generalize SVMs to the regression case?

There is a huge body of knowledge for generalizing it, and I didn't touch on it for two reasons. First, pretty much as with the VC analysis, it is more technical, and you get the basic concept without having to go through the technicality. The other aspect is that the major success of support vector machines is really in classification; they are not as successful, competitively, in regression. That is the practical experience, so I have found it is not worth the amount of time to go into it.

So is it safe to assume, then, that if you do the transformation to an infinite-dimensional space, the data will be linearly separable there?

It is safe, but not certain. I can create situations where it fails. But again, this is one of the reasons I made the final remarks: say I take my data and apply the RBF kernel. I really don't know whether the points will be linearly separable in that space or not; I just apply the machinery. But I can always map the solution back and check whether the points are classified correctly.

A technical question on the quadratic programming: if the matrix you give it is not positive semidefinite, will the package complain?

In my experience with quadratic programming (there are tons of packages out there, so I am necessarily describing the subset I have tried), they tend to complain. It is almost like asking MATLAB for an inverse and being told that the condition number is bad. In most cases the problem has to have a certain reliability in order not to trigger a complaint, so invariably, when you use quadratic programming, there will be a complaint one way or another. But I have learned not to be completely discouraged by that, and to tweak, limit variables, and whatnot. This is purely a practical matter that depends on the package.
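As an illustration of handing the dual problem to a generic quadratic-programming package, and of the kind of tweaking just mentioned, here is a minimal sketch (not from the lecture) that sets up the hard-margin dual on a toy data set with the cvxopt package and adds a tiny ridge to the diagonal of the quadratic term, the sort of adjustment that often silences complaints about conditioning. The data set, the ridge size, and the thresholds are illustrative choices only.

```python
# Minimal sketch (illustrative, not the lecture's code): the hard-margin SVM
# dual handed to the cvxopt quadratic-programming package, with a tiny ridge
# added to the quadratic term so the solver does not complain that the matrix
# is not numerically positive semidefinite.
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data: N points in 2 dimensions with labels +1 / -1.
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(y)

# Quadratic coefficient matrix of the dual: Q[i, j] = y_i y_j x_i . x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T

# The "tweak": a small ridge keeps Q numerically positive semidefinite.
Q += 1e-8 * np.eye(N)

# cvxopt solves: minimize (1/2) a'Pa + q'a  subject to  Ga <= h,  Aa = b.
P = matrix(Q)
q = matrix(-np.ones(N))        # linear term: maximize the sum of alphas
G = matrix(-np.eye(N))         # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, N))    # equality constraint: sum_i alpha_i y_i = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).flatten()

# Alphas below a small threshold are set to exactly zero; the rest mark the
# support vectors, from which w and the bias are recovered.
sv = alpha > 1e-6
w = ((alpha[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)
bias = np.mean(y[sv] - X[sv] @ w)
print("support vectors:", np.where(sv)[0], " w:", w, " bias:", bias)
```

The ridge perturbs the problem only at the level of numerical noise, so the support vectors and the resulting separator should be essentially unchanged; for data sets beyond toy size, the dedicated SVM packages mentioned below are the more realistic route.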
Going back to a previous question, when you said "safe but not certain", what does that mean in practice?

With a real data set that is not completely ridiculous, I have never seen it fail. In some sense, especially with the radial basis function, you can put a bump on top of each point, so you can separate whatever is there. I simply have not encountered the other case.

Another question: is it possible, and how useful is it, to combine kernels?

You can do it, as long as the combination is legitimate, meaning that there is still a Z-space in which the combination is an inner product. That is really the requirement. If you have that, then it can be done; there are many variations of the basic SVM method in the literature, and people have tried several things, as long as the guarantees of the SVM are maintained.

Since we only talk about inner products, and inner products usually induce a norm, do we always prefer the Euclidean norm, or can the norm be changed while we still use an inner product?

The way I derived it is based on the Euclidean norm and the straightforward inner product. Variations of that are not impossible, but you need to make sure that the quadratic programming problem you are solving corresponds to the version of the norm and the version of the inner product that you used.

What would you say is the scale of the problems that can be solved by SVMs, in terms of the number of points?

The scale of problems that can be solved by quadratic programming is the more pointed question, because that is the bottleneck. It depends on the package, MATLAB versus something else; they saturate at different stages. I would say that 10,000 points is pretty formidable, and below a thousand you should be OK, although some packages will still give you a hard time. There are also packages written specifically for SVMs that use heuristics: instead of passing the whole problem to quadratic programming directly, they break it up, get the support vectors for each piece, and then take the union, using hierarchical methods and other methods. These are basically heuristic methods for solving the SVM when straightforward quadratic programming would fail, and they are available and should be used when you have too many data points.

I think that's it. OK, so we'll see you on Thursday.