This program is brought to you by Caltech.

Welcome back. Last time we talked about kernel methods, which are a generalization of the basic SVM algorithm to accommodate feature spaces Z which are possibly infinite, and which we don't have to explicitly know, or transform our inputs to, in order to carry out the support vector machinery. The idea was to define a kernel that captures the inner product in that space. If you can compute that kernel, the generalized inner product for the Z space, this is the only operation you need in order to carry out the algorithm, and in order to interpret the solution after you get it.

We took an example, the RBF kernel, suitable since we are going to talk about RBFs, radial basis functions, today. The kernel is very simple to compute in terms of x; it's not that difficult. However, it corresponds to an infinite-dimensional Z space, and therefore by using it, it's as if we transformed every point in this two-dimensional space into an infinite-dimensional space, carried out the SVM there, and then interpreted the solution back here. The separating surface here corresponds to a plane, so to speak, in that infinite-dimensional space.

With this, we went into another way to generalize SVM: not by having a nonlinear transform in this case, but by having an allowance for errors. Errors in this case are violations of the margin; the margin is the currency we use in SVM. We added a term to the objective function that allows us to violate the margin for different points, according to the variable xi_n. We have a total violation, which is this summation, and then a degree to which we allow those violations. If C is huge, then we don't really allow the violations, and as C goes to infinity we are back to the hard-margin case. If C is very small, then we are more tolerant: we might allow some violations here and there, and then have a smaller w, which means a bigger margin, a bigger yellow region that is violated by those guys.

Think of it as another degree of freedom in our design. It might be the case that in some data sets there are a couple of outliers, where it doesn't make sense to shrink the margin just to accommodate them, or to go to a higher-dimensional space with a nonlinear transformation in order to go around that point, and thereby generate so many support vectors. It might be a good idea to ignore them. Ignoring them means that we are going to commit a violation of the margin; it could be an outright error, or it could be just a margin violation, where we are inside the margin but haven't crossed the boundary, so to speak. This gives us another way of achieving better generalization: by allowing some in-sample error, or margin error in this case, for the benefit of better generalization prospects.

Now, the good news is that in spite of this significant modification to the statement of the problem, the solution was identical to what we had before. We are applying quadratic programming with the same objective, the same equality constraint, and almost the same inequality constraint; the only difference is that alpha_n, which used to be allowed to be as big as it wants, is now limited by C. When you pass this to quadratic programming, you get your solution.

Now, C is a parameter, and it is not clear how to choose it; there is a compromise that I just described. The best way to pick C, and the way it is done in practice, is to use cross validation.
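As an illustration of that procedure, here is a minimal sketch, assuming scikit-learn's SVC and cross_val_score are available; the candidate grid of C values and the data arrays X, y are placeholders, not anything from the lecture.

```python
# Hypothetical sketch: choosing C by cross validation.
# The C grid and data (X, y) are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pick_C(X, y, C_grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    best_C, best_score = None, -np.inf
    for C in C_grid:
        # Estimate out-of-sample accuracy for this C with 10-fold CV.
        scores = cross_val_score(SVC(C=C, kernel='rbf'), X, y, cv=10)
        if scores.mean() > best_score:
            best_C, best_score = C, scores.mean()
    return best_C
```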
You apply different values of C, run this, and see what the out-of-sample error estimate is using your cross validation; then you pick the C that minimizes it. That is how you choose the parameter C.

So that ends the basic part of SVM: the hard margin, the soft margin, and the nonlinear transforms, together with the kernel version of them. Together they are a technique that is superb for classification, and it is, by the choice of many people, the model of choice when it comes to classification. Very small overhead, and there is a particular character that makes it better than just choosing a random separating plane, and that does reflect on the out-of-sample performance.

Today's topic is a new model, radial basis functions. Not so new, because we had a version of it under SVM, and we will be able to relate to it, but it is an interesting model in its own right. It captures a particular understanding of the input space that we will talk about. And the most important aspect that radial basis functions provide for us is the fact that they relate to so many facets of machine learning that we have already touched on, and other aspects in pattern recognition that we didn't touch on, such that it's worthwhile to understand the model and see how it relates. It almost serves as a glue between so many different topics in machine learning, and this is one of the important aspects of studying the subject.

As for the outline, it's not like I'm going to go through one item and then the next according to it. What I'm going to do is define the model, define the algorithm, and so on, as I would describe any model, and in the course of doing that I will, at different stages, be able to relate RBFs to, first, nearest neighbors, which is a standard model in pattern recognition; to neural networks, which you have already studied; to kernel methods, since it obviously should relate to the RBF kernel, and it will; and finally to regularization, which is actually the origin, in function approximation, of the study of RBFs.

So let's first describe the basic radial basis function model. The idea here is that every point in your data set will influence the value of the hypothesis at every point x. Well, that's nothing new; that's what happens when you are doing machine learning. You learn from the data and you choose a hypothesis, so obviously that hypothesis will be affected by the data. But here it's affected in a particular way: it's affected through the distance. A point in the data set will affect the nearby points more than it affects the faraway points. That is the key component that makes it a radial basis function.

So let's look at a picture. Imagine that the center of this bump happens to be a data point, x_n, and the bump shows you the influence of x_n on the neighboring points in the space. It's most influential nearby, and then the influence decays and dies. The fact that it is symmetric around x_n means that it's a function only of the distance, which is the condition we have here.

Let me give you concretely the standard form of a radial basis function model. It starts from h(x), and here are the components that build it. As promised, it depends on the distance, and it depends on the distance such that the closer you are to x_n, the bigger the influence, as seen in the picture. So you take the norm of x minus x_n, squared, and you multiply it by minus gamma, inside an exponential.
Gamma is a positive parameter, fixed for the moment. You will see that this exponential really reflects that picture: the further away you are, the more you go down, and you go down as a Gaussian. So this is the contribution to the point x, at which we are evaluating the function, from the data point x_n in the data set.

Now, we get an influence from every point in the data set, and those influences will each have a parameter that reflects, as we will see in a moment, the value of the target there; the influence is having the value y_n propagate. I'm not going to put it as y_n here; I'm just going to put it generically as a weight w_n to be determined, and we'll find that it's very much correlated with y_n. Then we sum up all of these influences from all the data points.

Now let me, in terms of this slide, describe why it is called a radial basis function. It's radial because of the dependence on the distance, and it's a basis function because this is your building block. You could use another basis function, another shape that is also symmetric around the center. But this is the model in its simplest and most popular form; most people will use a Gaussian like this, and this will be the functional form for the hypothesis.

Now we have the model; the next question we normally ask is: what is the learning algorithm? In general, you want to find the parameters, and here we call the parameters w_1 up to w_N; everything else in the functional form is fixed. We would like to find the w_n that minimize some sort of error, and we base that error on the training data, obviously. So what I'm going to do now is evaluate the hypothesis on the data points and try to make it match the target values on those points. As I said, w_n won't be exactly y_n, but it will be very much related.

Right now there is an interesting point of notation, because the points appear explicitly in the model: x_n is the n-th training input, and now I'm going to evaluate the model on a training point in order to evaluate the in-sample error. So let's ambitiously ask for the in-sample error to be zero. I should expect to be able to do that. Why? Because I really have quite a number of parameters here, don't I? I have N data points and I'm trying to learn N parameters. Notwithstanding the generalization ramifications of that statement, it should be easy to get parameters that knock the in-sample error down to zero.

So I'm going to apply this to every point x_n and ask that the output of the hypothesis be equal to y_n, no error at all, so that indeed the in-sample error will be zero. Let's substitute in the equation, and this should be true for all n up to N. First you realize that I changed the name of the dummy variable, the index of the sum. The reason I did that is that I'm going to evaluate the sum at x_n, and obviously you shouldn't recycle the dummy variable as a genuine variable. So you want this quantity, which is the evaluation of h at the point x_n, to be equal to y_n; that's the condition, and you want it to be true for n equals 1 to N. Not that difficult to solve. So these are the equations; let's ask ourselves how many equations, and how many unknowns.
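Written out, the model and the exact-interpolation conditions just described are:

$$h(\mathbf{x}) = \sum_{n=1}^{N} w_n \, \exp\!\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}_n \rVert^2\right), \qquad \sum_{m=1}^{N} w_m \, \exp\!\left(-\gamma \,\lVert \mathbf{x}_n - \mathbf{x}_m \rVert^2\right) = y_n, \quad n = 1, \dots, N,$$

with the dummy index m distinguished from the evaluation index n, as noted above.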
Well, I have N data points, so I'm listing N of these equations; indeed I have N equations. How many unknowns do I have? The unknowns are the w's, and I happen to have N unknowns. That's familiar territory: all I need to do is solve it.

So let's put it in matrix form, which will make it easy. Here is the matrix form, with all the coefficients indexed by n and m: the first index goes from 1 to N, and the second index also goes from 1 to N. These are the coefficients; you multiply this matrix by a vector of w's, putting all N equations at once in matrix form, and you ask this to be equal to the vector of y's. Let's give the matrix a name: I'm going to call it Phi. I am recycling the notation Phi; Phi used to be the nonlinear transformation, and this is indeed a nonlinear transformation of sorts, with a slight difference that we will discuss. And the other quantities get the standard names: the vector w and the vector y.

What guarantees a solution is that Phi be invertible. Under that condition, the solution is very simply: w equals the inverse of Phi times y. In that case you interpret your solution as exact interpolation, because what you are really doing is, on the points where you know the value, the training points, you are getting the value exactly; that's what you solved for. And the kernel, which is the Gaussian in this case, interpolates between the points to give you the value elsewhere. It's exact because you get it exactly right on those points.

Now let's look at the effect of gamma. There was a parameter gamma that I considered fixed from the very beginning, and I'm highlighting it in red. When I give you a value of gamma, you carry out the machinery I just described, but you suspect that gamma will affect the outcome, and indeed it will. So let's look at two situations. Say gamma is small. What happens is that the Gaussian is wide. Now, depending obviously on where the points are, how sparse they are, it makes a big difference whether you are interpolating with something wide or something narrow, and it is reflected in this picture. Say I have three points, just for illustration. The total of the three contributions interpolates exactly through the points, because that is what I solved for, what I insisted on. The small gray curves are the contributions of each point: w_1, w_2, w_3 times their Gaussians. When you add w_1 times the Gaussian, plus w_2 times the Gaussian, et cetera, you get a curve that gives you exactly y_1, y_2, and y_3. And because of the width, the interpolation between the points here is successful; you can see that there is a meaningful interpolation between two points.

If you go for a large gamma, this is what you get. The Gaussians are still there, you may see them faintly, but they die out very quickly. So in spite of the fact that you are still satisfying your equations, because that's what you solved for, the influence of the points dies out, and in between you just get nothing. So clearly, gamma matters.
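To make the solve concrete, here is a minimal numpy sketch of the exact interpolation just described; the data arrays and the value of gamma are assumed placeholders, and varying gamma reproduces the wide-versus-narrow behavior above.

```python
# Minimal sketch of exact RBF interpolation (assumed data; gamma illustrative).
import numpy as np

def rbf_interpolate(X, y, gamma):
    # Phi[n, m] = exp(-gamma * ||x_n - x_m||^2), an N x N matrix.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-gamma * d2)
    w = np.linalg.solve(Phi, y)   # w = Phi^{-1} y, assuming Phi is invertible
    return w

def rbf_predict(x, X, w, gamma):
    # h(x) = sum_n w_n exp(-gamma * ||x - x_n||^2)
    return w @ np.exp(-gamma * ((X - x) ** 2).sum(axis=1))
```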
And you probably think that gamma matters also in relation to the distance between the points, because that's what the interpolation is about. We will discuss the choice of gamma towards the end: after we settle all the other parameters, we will come back and visit gamma and see how we can choose it wisely.

Now, that model, if you look at it, is a regression model: I consider the output to be real-valued, and I match the real-valued output to the target output, which is also real-valued. Often we will use RBFs for classification. So you take h(x), which used to be the regression output, a real number, and now, as usual, you take the sign of that quantity and interpret the output as a plus-or-minus-one, yes-no decision. We would like to ask ourselves how to learn the w's under these conditions.

That shouldn't be a very alien situation for you, because you have seen linear regression used for classification before, and that is pretty much what we are going to do here. We are going to focus on the inner part, the signal, before we take the sign, and we are going to try to make the signal itself match the plus-or-minus-one target, like we did when we used linear regression for classification. Since we are trying here to make it +1 or -1, then if we are successful and get the exact solution, obviously the sign of it will be +1 or -1. If we are not successful and there is an error, as will happen in other cases, then at least, since you tried to make it close to +1 for some points and close to -1 for others, you would think that the sign will at least agree with the +1 or -1. So the signal here is what used to be the whole hypothesis value; what you are trying to do is minimize the mean squared error between that signal and y on the training set, and then you report the sign of that signal as your value. So we are ready to reuse the solution we had before, in case we are using RBFs for classification.

Now we come to the observation that radial basis functions are related to other models, and I'm going to start with a model that we didn't cover; it's extremely simple to cover in five minutes, and it is important. This is the nearest neighbor method. The idea of nearest neighbor is that I give you a data set, and each data point has a value y_n, which could be a label if you are talking about classification, or could be a real value. What you do for classifying other points, or assigning values to other points, is very simple: you find the x_n in the training set that is closest to you in Euclidean distance, and then you inherit the label or the value that that point has. Very simplistic.

So here is a case of classification. The data set consists of the red pluses and the blue circles, and what I'm doing is applying this rule to classify the plane, the script X, the input space, according to the label of the nearest point within the training set. As you can see, if I take a point here, this plus is the closest, and that's why this region is pink; here, it's still the closest; once I'm here, this blue guy becomes the closest, and therefore the region gets blue. So you end up, as a result, breaking the plane into cells, each of which takes the label of the point that happens to be in the cell, and this tessellation of the plane into cells describes the boundary for your decisions. That is the nearest neighbor method.
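A minimal sketch of this rule, covering both plain nearest neighbor (k = 1) and the k-nearest-neighbor vote discussed next; the arrays and the choice of k are placeholders:

```python
# Hypothetical sketch of the (k-)nearest-neighbor rule with Euclidean distance.
import numpy as np

def knn_classify(x, X, y, k=1):
    # Squared distances from the query point x to every training point.
    d2 = ((X - x) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]       # indices of the k closest points
    return np.sign(y[nearest].sum())   # majority vote of +/-1 labels (use odd k)
```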
Now, if you want to implement this using radial basis functions, there is a way to do it. It's not exactly this, but it has a similar effect, where you basically take the influence of a nearby point, and that is the only thing you take. The basis function in this case looks like a cylinder instead of a Gaussian. It's still symmetric, it depends on the radius, but the dependence is very simple: I'm constant, and then I go to zero; very abrupt. So in that case I'm not exactly getting the nearest-neighbor picture; what I'm getting is a cylinder around every one of those points, which inherits the value of that point, and that is what makes the difference from the nearest-neighbor cells.

In both of those cases, it's fairly brittle. You go from here to here and you immediately change values, and if there are points in between, you keep changing from blue to red to blue and so on; in the cylinder case it's even more brittle. So in order to make it less abrupt, nearest neighbor is modified into k-nearest neighbors: you look, say, for the three closest points, or the five closest, or the seven closest, and then take a vote. If most of them are +1, you consider yourself +1. That helps even things out a little bit; an isolated point in the middle that doesn't belong gets filtered out by this. This is the standard way of smoothing the surface, so to speak; it will still be abrupt going from one region to another, but at least less so. The way you smooth the radial basis function version is, instead of using a cylinder, you use a Gaussian. So now it's not 'I have an influence, I have an influence, I have no influence'; it's 'you have an influence, you have less influence, you have even less influence,' and eventually you have effectively no influence, because the Gaussian went to zero.

In both of those cases you can consider the model, whether it's nearest neighbor or k-nearest neighbor or a radial basis function, a similarity-based method: you are classifying points according to how similar they are to points in the training set, and the particular form of applying the similarity is what defines the algorithm, whether it's this way or that way, abrupt or smooth, and so on.

Now let's take the model we had, the exact-interpolation model, and modify it a little bit to deal with the problem that you probably already noticed, which is the following. In the model we have N parameters, w_1 up to w_N, and it is based on N data points. I have N parameters and I have N data points: alarm bells, which calls for a red color. You usually have generalization in your mind related to the ratio between data points and parameters, the parameters being more or less the VC dimension, and therefore in this case it's pretty hopeless to generalize. It's not as hopeless as in other cases, because the Gaussian is a pretty friendly guy; nonetheless.

So you might consider the following idea. I'm going to use radial basis functions, so I'm going to have influences, symmetric and all of that; but instead of every point having its own influence, I'm going to elect a number of important centers for the data, and have those influence the neighborhoods around them. So you take K, the number of centers, which is hopefully much smaller than N, so that the generalization worry is mitigated, and you define the centers, vectors mu_1 up to mu_K, as the centers of the radial basis functions, instead of having the data points x_1 up to x_N themselves be the centers.
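Written out, the modified functional form with K centers, described next, is:

$$h(\mathbf{x}) = \sum_{k=1}^{K} w_k \, \exp\!\left(-\gamma \,\lVert \mathbf{x} - \boldsymbol{\mu}_k \rVert^2\right).$$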
Now, those centers live in the same space as the data points, say a d-dimensional Euclidean space, except that they are not necessarily data points. They may be; we may have elected some of the data points as being important. But generically there will just be mu_1 up to mu_K. In that case the functional form of the radial basis function model changes and becomes the one above. Let's look at it: it used to be that we count from 1 to N; now it's from 1 to K, and we have w_k, so indeed we have fewer parameters. And now we are comparing the x at which we are evaluating not with every data point, but with every center, and according to the distance from that center, the influence of that particular center, which is captured by w_k, is contributed. You take the contributions of all the centers and you get the value; exactly the same thing we did before, except using centers instead of points.

The parameters here are now interesting, because the w_k's are parameters, and I'm supposedly going through this entire exercise because I didn't like having N of them. But the mu_k's are now also parameters, right? I don't know what they are. Having K of them is not the worry, because I already said that K is much smaller than N, but each of them is a d-dimensional vector, isn't it? That's a lot of parameters, and if I have to estimate those from the outputs, I haven't made a lot of progress in this exercise. But it turns out that I will estimate those without touching the outputs of the training set, so without contaminating the data. That's the key.

So, two questions. How do I choose the centers? That is the interesting question, because if I want to maintain that the number of effective parameters is small, I have to choose them without really consulting the y_n's. And how do I choose the weights? Choosing the weights shouldn't be that different from what we did before; it will be a minor modification, because the model has the same functional form. The first question is the interesting part, or at least the novel part.

So let's talk about choosing the centers. We are going to choose the centers as representatives of the data, one for each radial basis function. It would be nice, for every group of points that are nearby, to have a center near them, so that it captures that cluster. This is the idea. So you are going to take each x_n, find the center which is closest to it, and assign that point to it. Here is the picture: I have the points spread around, and I'm going to select centers. It's not yet clear how to choose the centers, but once you choose them, I'm going to consider the neighborhood of each center within the data set, the x_n's closest to it, as the cluster that has that center. If I do that, then those points are represented by that center, and therefore I can say that their influence will be propagated through the entire space by the radial basis function that is centered there.

So let's do this. It's called K-means clustering, because the center of the points will end up being their mean, as we'll see in a moment. And here is the formalization: you split the data points x_1 up to x_N into groups, clusters so to speak, hopefully points that are close to each other, and you call these S_1 up to S_K; each cluster will have a center that goes with it.
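Formally, the objective that the clustering minimizes, as described next, is:

$$\min_{\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K;\; S_1, \dots, S_K} \; \sum_{k=1}^{K} \; \sum_{\mathbf{x}_n \in S_k} \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2.$$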
What this objective says is: in order to make this a good clustering with good representative centers, you try to make the points close to their centers. So you take the squared Euclidean distance for every point, and you sum over the points in the cluster whose center is mu_k; that takes care of one cluster, S_k. You want this to be small over all the data, so you sum over all the clusters, and that becomes your objective function for clustering.

Someone gives you K; the choice of the actual number of clusters is a different issue, but let's say K is 9. Then I'm asking you to find the mu's, and the breakup of the points into the S_k's, such that this value attains its minimum. If you succeed in that, then I can claim that this is good clustering and these are good representatives of the clusters.

So now I have good news and some bad news. The good news is that we finally have unsupervised learning: I did this without any reference to the labels y_n. I'm taking the inputs and producing some organization of them, which, as we discussed, is the main goal of unsupervised learning. So we are happy about that. Now the bad news: the problem, as I stated it, is NP-hard in general. It's a nice unsupervised problem, but not so nice; it's intractable if you want to get the absolute minimum.

So our goal now is to go around that. A problem being NP-hard never discouraged us before. Remember, with neural networks, we said that finding the absolute minimum of that error in the general case would be NP-hard, and we ended up with a heuristic, gradient descent, which led to backpropagation: start with a random configuration, then descend, and get not the global minimum, whose finding is NP-hard, but a local minimum, hopefully a decent local minimum. We'll do exactly the same thing here.

So here is the iterative algorithm for solving the K-means problem; it's called Lloyd's algorithm. It is extremely simple, to the level where the contrast between this algorithm, not only in its definition but in how quickly it converges, and the fact that finding the global minimum is NP-hard, is rather mind-boggling. What you do is iteratively minimize the objective: you start with some configuration and get a better configuration. As you see, I now have two quantities in purple, which are my parameters. The mu's are parameters by definition; these I am trying to find. But the sets, the clusters, are also parameters; I want to know which points go into them. These are the two things that I am determining.

The way this algorithm works is that it fixes one of them and minimizes with respect to the other. It says: for this particular membership of the clusters, could you find the optimal centers? Now that you've found the optimal centers, forget about the clustering that resulted in them; for these centers, could you find the best clustering? And keep repeating, back and forth. So let's look at the steps; you minimize with respect to the mu's and the S's, one at a time. First, you update the value of each mu. How do you do that? You take the fixed clustering that you have, the clustering inherited from the last iteration.
What you do is take the mean of each cluster: you take the points that belong to that cluster, add them up, and divide by their number. The mean is very good at minimizing the mean squared error, because the squared distance to the mean is the smallest sum of squared distances you can get to any single point; the mean is the closest point to the cluster collectively, in the mean-squared sense. So if I do that, I know these are good representatives, if this were the real clustering. That's the first step.

So now I have new mu_k's. Forget about the clustering you had before; now you create new clusters, and the idea is the following. You take every point and measure the distance between it and mu_k, the newly acquired mu_k, and you ask yourself: is this the closest of the mu's that I have? You compare this with all the other centers, and if it happens to be the smallest, then you declare that this x_n belongs to S_k. You do this for all the points and you create a full clustering.

Now, if you look at the first step, we argued that it reduces the error; it has to, because you pick the mean for every cluster, and that will definitely not increase the error. The second step will also decrease the error, because the worst it can do is take a point from one cluster and put it in another, and in doing that, the term that used to be in the sum is now smaller, because the point went to the closer center. So this step reduces the value, and that step reduces the value; you keep going back and forth, and the quantity is going down.

Are we ever going to converge? Yes, we have to, because by construction we are dealing with a finite number of points, and there is a finite number of possible values for the mu's, given the algorithm, because they have to be averages of subsets of the points. If I have 100 points, there is a finite, albeit tremendously big, number of possible values; but it's finite, and that's all I care about. As long as it's finite and I'm going down, I will definitely hit a minimum. It will not be the case that it's a continuous thing where I keep doing half, and then half of half, and never arrive; you will arrive perfectly at a point.

The catch is that you are converging to a good old-fashioned local minimum. Depending on your initial configuration, you will end up with one local minimum or another. But again, it's exactly the same situation as we had with neural networks: we did converge to a local minimum with backpropagation, and that minimum depended on the initial weights. Here it will depend on the initial centers, or the initial clustering, whichever way you want to begin. What you do is try different starting points, get different solutions, and evaluate which one is better, because you can definitely evaluate the objective function for all of them and pick the best out of a number of runs. That usually works very nicely; it's not going to give you the global minimum, but it will give you a very decent clustering and very decent representative mu's.
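Here is a minimal numpy sketch of Lloyd's iteration as just described; the random data-point initialization and the stopping rule are illustrative choices, not the only ones.

```python
# Minimal sketch of Lloyd's algorithm for K-means (illustrative choices).
import numpy as np

def lloyd(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centers at K random data points.
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest center.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: move each center to the mean of its cluster
        # (keep the old center if a cluster went empty).
        new_mu = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):   # converged to a local minimum
            break
        mu = new_mu
    return mu, labels
```

In practice, as just noted, you would run this from several random starts and keep the run with the smallest objective value.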
So now let's look at Lloyd's algorithm in action, and I'm going to take the problem from last lecture with the RBF kernel; this is the one we are going to carry through, because we can relate to it now. The first step in the algorithm: give me the data points. Thank you, here are the data points. If you remember, the target was slightly nonlinear; we had -1 and +1 regions, and the points are colored accordingly. That is the data we have.

First thing: I only want the inputs. I don't see the labels, and I don't see the target function. You probably don't see the target function anyway, it's so faint, but really, you shouldn't see it at all. So I'm going to take away the target function and the labels, and keep only the positions of the inputs. This is what you get. Looks more formidable now, right? I have no idea what the function is.

And now you realize one interesting point: I'm going to cluster those without any benefit of the labels. So I could have clusters that belong to one category, +1 or -1, and I could just as well have clusters that happen to straddle the boundary, half of them +1 and half of them -1. That's the price you pay when you do unsupervised learning: you are going for similarity, but the similarity is as far as the inputs are concerned, not as far as the behavior with respect to the target.

So I have the points; what do I do next? You need to initialize the centers. There are a number of methods for doing that; I'm going to keep it simple here and initialize the centers at random. I'm just going to pick nine points, and I'm picking nine for a good reason: remember, last lecture, when we did the support vector machines, we ended up with nine support vectors. So here are my initial centers, totally random. It looks like a terribly stupid thing to have three centers near each other and have this entire area empty, but let's hope that Lloyd's algorithm will place them a little more strategically.

Now you iterate. I would like you to stare at this; I will even make it bigger. Stare at it, because I'm going to do a full reclustering and re-evaluation of the mu's, and then show you the new mu's, one step at a time. This is the first step; keep your eyes on the screen. They moved a little bit, and I am pleased to find that those centers that used to be crowded are now serving different points; they are moving away. Second iteration. These are animated at a certain rate, I have to say, in order not to completely bore you; otherwise we would be clicking through to the end of the lecture and have only the final clustering to show, nothing else. Next iteration, look at the screen: the movement is becoming smaller. Third iteration, just a touch. Fourth: nothing happened. I actually flipped the slide, and nothing happened, so we have converged, and these are your mu's.

It does converge very quickly, and you can see now that the centers make sense. These points have a center, these have a center, this one, and so on. This one, I guess, started here, got stuck, and is just serving two points or something like that; but it more or less makes sense, given that there was no natural clustering in the points. It's not like I generated these points from nine centers; they were generated uniformly, so the clustering is incidental, but nonetheless the clustering here makes sense.

Now, this is the clustering, right? Surprise: we have to go back to the labels. Now you look at the clustering and see what happens. This center takes points from both +1 and -1, because they looked very similar to it; the clustering depended only on x. Many centers are deep inside a region and indeed deal with points that are all the same. The reason I'm making an issue of this is the way the center will serve as a center of influence for affecting the value of the hypothesis: it will get a w_k.
And it will propagate that w_k according to the distance from itself. So the centers that happen to be centers of both positive and negative points will cause me a problem; but indeed, that is the price you pay when you use unsupervised learning. So this is Lloyd's algorithm in action.

Now I'm going to do something interesting. We have nine points that are centers, obtained by unsupervised learning, in order to carry the influence of the radial basis functions; that's number one. Last lecture, we also had nine points: they were support vectors, and they were representative of the data points. Since the nine support vectors were representative of the data points, and the nine centers here are representative of the data points, it might be illustrative to put them next to each other, to understand what is common, what is different, where each came from, and so on.

So let's start with the RBF centers. Here they are, and I put them on the labeled data, not that I got them from the labeled data; I have the same picture right and left. These are where the centers are; everybody sees them clearly. Now let me remind you of what the support vectors from last time looked like. Here are the support vectors. Very interesting indeed. The support vectors are obviously all around the boundary; they had no interest whatsoever in representing clusters of points, that was not their job. The centers here, on the other hand, absolutely didn't even know that there was a separating surface; they just looked at the data. You basically get what you set out to do: here you are representing the data inputs, and you got a representation of the data inputs; there you are trying to capture the separating surface, because that's what support vectors do, they support the separating surface, and that's what you got. These are generic centers, so they are all black; those are blue and red, because support vectors come with a label, the value y_n, so some of them are on one side and some on the other. They serve completely different purposes, and it's rather remarkable that we get two solutions using the same kernel, the RBF kernel, via such incredibly different routes: the choice of important points there was patently supervised, choosing the support vectors depended very much on the value of the target, while here it was unsupervised.

The other thing you need to notice is that the support vectors have to be points from the data set; the mu's here are not. You can find a point that became a center and a point that became a support vector; but this center, on the other hand, doesn't exist in the data set. It's just a point that happens to be anywhere in the plane.

So now we have the centers. I give you the data, I tell you K equals 9, you go and run your Lloyd's algorithm, and you come up with the centers, which are vectors of dimension d. And I found the centers without even touching the labels, so I know that I didn't contaminate anything, and I have only the weights, K of them, to determine using the labels. Therefore I have good hopes for generalization.

So now I look at the same condition as before: I want the hypothesis to match the target on all the data points, if I can, and I ask myself how many equations and how many unknowns. I want this to be true for all the data points; I have N data points, so I have N equations. How many unknowns? The unknowns are the w's, and I have K of them.
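In matrix form, the conditions just described read:

$$\sum_{k=1}^{K} w_k \, \exp\!\left(-\gamma \,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2\right) \approx y_n, \quad n = 1, \dots, N \qquad \Longleftrightarrow \qquad \Phi \mathbf{w} \approx \mathbf{y}, \quad \Phi \in \mathbb{R}^{N \times K}.$$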
And whoops, K is less than N: I have more equations than unknowns. So something has to give, and the equals sign is the one that has to give; that's all I can hope for. So I'm going to get it close in a mean-squared sense, as we have done before.

I don't think you'll be surprised by anything in this slide; you have seen this before. This is the matrix Phi; it's a new Phi, with K columns and N rows, since by our criterion K is smaller than N. You multiply it by w, which is K weights, and you should get approximately y. Can you solve this? Yes, we have done this before, in linear regression. All you need is to make sure that Phi-transpose Phi is invertible, and under that condition you have a one-step solution, the pseudo-inverse: you take Phi-transpose Phi, inverted, times Phi-transpose, times y, and that gives you the value of w that minimizes the mean squared difference. So you use the pseudo-inverse instead of the exact interpolation, and in this case you are not guaranteed to get the correct value at every data point; you are going to be making some in-sample error. But we know that this is not necessarily a bad thing, and on the other hand we are only determining K weights, so the chances of generalization are good.
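Putting the pieces together, here is a minimal sketch of this fit, assuming centers mu already obtained from something like the lloyd sketch above; np.linalg.lstsq computes the same pseudo-inverse solution in a numerically stabler way.

```python
# Minimal sketch: fit the K-center RBF model by pseudo-inverse (assumed data).
import numpy as np

def fit_rbf(X, y, mu, gamma):
    # Phi[n, k] = exp(-gamma * ||x_n - mu_k||^2), an N x K matrix.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-gamma * d2)
    # w = (Phi^T Phi)^{-1} Phi^T y, computed via least squares.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```

Appending a column of ones to Phi would add the bias term discussed shortly.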
And this will help me relate it to neural networks; this is the second link. We already related RBFs to nearest-neighbor methods, similarity methods; now we relate them to neural networks. So let me first put up the diagram. Here is my illustration of it: I have x, and I compute the radial aspect, the distance to each center, passed through the basis function, in this case the Gaussian. You could have other basis functions, like the cylinder we had in one case, though the cylinder is a bit extreme. That gives you features, which are combined with weights in order to give you the output. Now, the output node could just pass the sum, if you are doing regression; it could be a hard threshold, if you are doing classification; it could be something else. What I care about is that the configuration looks familiar to us: it's a layer, I extract features, and then I go to the output.

So let's look at the features. The features depend on the mu's, which in general are parameters. If I didn't have this slick algorithm, K-means, and the unsupervised approach, I would need to determine what these are, and once the value of the feature depends on the data set, all bets are off: it's no longer a linear model, pretty much like a neural network whose first layer extracts the features. Now, the good thing is that because we used only the inputs in order to compute the mu's, it's almost linear: we didn't have to go back and adjust the mu's because we don't like the value of the output. They were frozen forever, based on the inputs, and then we only had to get the w's, and the w's appear as multiplicative factors, in which case the model is linear in those w's and we get the one-step solution.

Now, in radial basis functions there is often a bias term added: you don't only get those terms, you get either a w_0 or a b, and it enters the final layer. You just add another weight, this time multiplied by 1, and everything remains the same; the Phi matrix gets another column because of it, and you apply the same machinery you had before.

Now let's compare it to neural networks. Here is the RBF network we just saw, with x highlighted in red: this is what gets passed in, produces the features, and gets you the output. And here is a neural network that is comparable in structure. You start with the input and you compute features. The features in the RBF network depend on the distance, and they are such that when the distance is large, the influence dies: if you look at this distance value and it is huge, you know that this feature will have zero contribution. In the neural network, this quantity, big or small, goes through a sigmoid; it could be huge, small, or negative, and it always has a contribution. So one interpretation is that what radial basis function networks do is look at local regions in the space and worry about them, without worrying about the faraway points. I have a function over this space; I look at this part and I want to learn it, so I get a basis function, or a couple of them, that capture it, and I know that by the time I go to another part of the space, whatever I have done here is not going to interfere. Whereas in the case of neural networks, it did interfere, very much so, and the way you actually got something useful was by making sure the combinations of the units give you what you want; it's not local, as it is in this case. That's the first observation.

The second observation is that here the nonlinearity is what we call phi, and the corresponding nonlinearity there is theta; then you combine with the w's and you get h. Very much the same, except the way you extract features is different. And the first-layer w's in the neural network were full-fledged parameters that depended on the labels; we used backpropagation to get those, so these are learned features, which makes it completely not a linear model. The RBF network would be the same if we learned the mu's based on their effect on the output, which would be a pretty hairy algorithm; but we didn't, and therefore it is almost linear: this part is fixed, and then we got the other part using the pseudo-inverse.

One last thing: this is a two-layer network, and a two-layer network of this type of structure lends itself to being a support vector machine. The first layer takes care of the kernel, and the second is the linear combination that is built into support vector machines. So you can solve a support vector machine by choosing a kernel, and you can picture in your mind having one of these where the first part is a neural-network kernel for support vector machines; but it deals only with two layers, as you see here, not multiple layers as general neural networks would.

Now, the final parameter to choose here is gamma, the width of the Gaussian, and we now treat it as a genuine parameter; we want to learn it, and because of that it turned purple. So now the mu's are fixed, according to Lloyd; I have parameters w_1 up to w_K, and then I also have gamma. You can see that this is actually pretty important, because, as you saw, if we choose it wrong, the interpolation becomes very poor, and it does depend on the spacing in the data set and so on. So it might be a good idea to choose gamma in order to also minimize the in-sample error and get better performance. Of course I could do that, and I could do it, for all I care, for all the parameters at once: here is the value I am minimizing, the mean squared error. I compare the hypothesis, with x_n plugged in, against the value y_n, and I get an in-sample error which is a mean.
square. I can always find parameters that minimize that using gradient descent, the most general approach: start with random values, descend, and you get a solution. However, it would be a shame to do that, given the simple algorithm that goes with the w's: if gamma is fixed, getting them is a snap, you do the pseudo-inverse and you get them exactly. So it is a good idea to separate the two. As for gamma, it sits inside the exponential, and I don't think I have any hope of finding a shortcut; I probably will have to do gradient descent for that one. But I might as well do gradient descent only for gamma, not for the w's.

The way this is done is by an iterative approach, the theme of this lecture, and in this case it resembles a pretty famous algorithm, EM, expectation-maximization, which is used for solving the case of mixtures of Gaussians, which we actually have here, except that we are not calling them probabilities; they are basis functions implementing a target. So here is the idea. Fix gamma; we have been fixing gamma all along. If you want to solve for the w's with gamma fixed, you just solve using the pseudo-inverse. So now we have w's. You fix them, they are frozen, and you minimize the squared error with respect to gamma, one parameter. It's pretty easy to do gradient descent with respect to one parameter. You find the minimum, you find gamma, freeze it, and then go back to step one and find the new w's that go with the new gamma. Back and forth; it converges very quickly, and then you get a combination of both the w's and gamma.

And because it is so simple, you might even be encouraged to ask: why do we have one gamma? In some data sets, these points could be close to each other while another group is far away. If I have a center here that has to reach out further, and a center there that doesn't, it looks like a good idea to have different gammas. Granted; and since this is so simple, all you need is K parameters, gamma_1 up to gamma_K. You double the number of parameters, but the number of parameters was small to begin with. You do the first step exactly as before: you fix the vector of gammas and solve for the w's. And then you do gradient descent in a K-dimensional space, which we have done before; it's not a big deal. You find the minimum with respect to those gammas, freeze them, and go back and forth.
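A minimal sketch of that back-and-forth, assuming the fit_rbf helper from the earlier sketch and scipy's bounded scalar minimizer; the starting value, bounds, and number of rounds are illustrative.

```python
# Hypothetical sketch: alternate between solving for w (pseudo-inverse)
# and minimizing the in-sample squared error over a single gamma.
import numpy as np
from scipy.optimize import minimize_scalar

def fit_rbf_gamma(X, y, mu, n_rounds=10, bounds=(1e-3, 1e3)):
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K distances
    gamma = 1.0                                  # illustrative starting value
    for _ in range(n_rounds):
        w = fit_rbf(X, y, mu, gamma)             # step 1: w for fixed gamma
        err = lambda g: ((np.exp(-g * d2) @ w - y) ** 2).sum()  # E_in vs gamma
        gamma = minimize_scalar(err, bounds=bounds, method='bounded').x  # step 2
    return fit_rbf(X, y, mu, gamma), gamma
```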
Now, very quickly, I am going to go through two aspects of RBFs: one of them relating them to kernel methods, of which we have already seen the beginning, since we used the RBF as a kernel, so we would like to compare the performance; and then I will relate them to regularization. It's interesting that RBFs, as I described them, with the intuitive local influence and all of that, actually arose in the first place in function approximation, as you will find in a moment.

So let's do the straight RBF versus its kernel version. Last lecture we had a kernel, the RBF kernel, and we had a solution with nine support vectors, and therefore we ended up with a solution that implements the following. Let's look at it. I am taking a sign, the built-in part for classification, of the quantity I got after expanding w-transpose-z in terms of the kernel. I am summing over only the support vectors; there are nine of them. The coefficient alpha_n y_n becomes my parameter, the weight, and it happens to have the sign of the label. That makes sense: if I'm going to see the influence of x_n, it might as well agree with y_n; if it's +1, I want the +1 to propagate. And because the alphas are non-negative by design, the weights get their sign from the label of the point. The centers here are points from the data set; they happen to be the support vectors. And I have a bias there. That's the solution we have.

What did we have on the other side? The straight RBF implementation with nine centers. The sign is not an integral part; I could have kept it as regression, but since I am comparing here, I am going to take the sign and consider this classification. I also added a bias, also in blue, because it is not an integral part either, but I am adding it in order to be exactly comparable. So the number of terms here is nine, the number of terms there is nine; I am adding a bias here and a bias there. The parameter here is called w_k, and the centers here are the mu's, which most likely are not data points, whereas there the centers are the support vectors, actual points from the data set. So these are the two contenders.

How do they perform? That's the bottom line. Can you imagine: it is exactly the same functional form in front of me, and in one of them, what did I do? Unsupervised learning of centers, followed by a pseudo-inverse, using linear regression for classification. That's one route. What did I do in the other? Maximize the margin, equipped with a kernel, solved by that special quadratic programming. Completely different routes, and finally I have comparable functions. So let's see how they perform.

Just to be fair to the poor straight RBF implementation: the data doesn't cluster naturally, and I chose nine centers because I got nine support vectors, so the SVM has the home advantage here; I didn't optimize the number of centers. So if the SVM ends up performing better, fine, SVM is good, but it really has a little bit of an unfair advantage in this comparison. But let's look at what we have.

So this is the data; let me magnify it so that you can see the surface. Let's start with the regular RBF; both of them are RBFs, but this is the regular one. This is the surface you get after you do everything I said, the Lloyd's algorithm and whatnot, and the first thing you realize is that the in-sample error is not zero: there are points that are misclassified. Not a surprise; I had only K centers and I'm trying to minimize a mean squared error, so it is possible that some points close to the boundary go one way or the other. I'm interpreting the signal as being closer to +1 or to -1, and sometimes it will cross. So this is what I get there.

Here is the SVM result. Rather interesting. First, it's better: I have the benefit of looking at the faint green line, which is the target, and this one is definitely closer to the green one, in spite of the fact that it never used the target explicitly in the computation; it used only the data, the same data as the other. But this one tracks the target better, it gets zero in-sample error, and it's fairly close to the target. So here are two solutions coming from two different worlds, using the same kernel, and I think by the time you have done a number of problems using these two approaches, you'll have it cold: you'll know exactly what is going on, the ramifications of doing unsupervised learning and what you miss by choosing the centers without knowing the labels, versus the advantages of support vectors, and so on.
So the final item that I promised was RBF versus regularization. It turns out that you can derive RBFs entirely based on regularization: you are not talking about the influence of a point, you are not talking about anything of that sort. Here is the formulation from function approximation that resulted in RBFs, and this is why people consider RBFs to be very principled, to have a merit and whatnot. It is modulo assumptions, as always, and we will see what the assumptions are.

Let's say that you have a one-dimensional function, and you have a bunch of points, the data points. What you are doing now is trying to interpolate and extrapolate between these points in order to get the whole function, which is what you do in function approximation, and what you do in machine learning if your function happens to be one-dimensional. What do you do in this case? There are usually two terms: one of them tries to minimize the in-sample error, and the other is regularization, to make sure that your function is not crazy outside the points. That's what we do. The in-sample term is what you expect: you take the value of your hypothesis, compare it with the target value y, and square; this is your in-sample error.

Now we are going to add a smoothness constraint, and in this approach the smoothness constraint is almost always taken as a constraint on the derivatives. If I have a function and I tell you that the second derivative is very large, what does that mean? It means that the function is wiggling a lot, so it's not smooth; and similarly if I go to the third derivative, the rate of change of that, and so on. So I can look at derivatives in general, and if you can tell me that the derivatives are not very large, that corresponds, in my mind, to smoothness. The way I formulate smoothness is: take the k-th derivative of your hypothesis, which is a function of x that I can differentiate k times, assuming it's parameterized in a way that is analytic; square it, because I'm only interested in its magnitude; and integrate from minus infinity to plus infinity. That gives an estimate of the size of the k-th derivative: if this is large, that's bad for smoothness; if it's small, that's good for smoothness. Now I'm going to up the ante and combine the contributions of the different derivatives, with coefficients; if you only want some of them, all you need to do is set the coefficients of the ones you are not using to zero. Typically you would use, say, the first derivative and the second derivative. And you try to minimize the augmented error,

$$\sum_{n=1}^{N}\bigl(h(x_n)-y_n\bigr)^2 \;+\; \lambda \sum_{k} a_k \int_{-\infty}^{\infty} \left(\frac{d^k h}{dx^k}\right)^{2} dx.$$

The bigger lambda is, the more insistent you are on smoothness versus fitting, and we have seen all of that before.

The interesting thing is that if you actually solve this, under conditions and assumptions, and after the incredibly hairy mathematics that goes with it, you end up with the Gaussian. Looking for as smooth an interpolation as possible, in the sense of the sum of the squares of the derivatives with these coefficients, it's not stunning that the best interpolation happens to be the Gaussian; that's what comes out. And that's what gives RBFs a bigger credibility, as being sort of inherently self-regularized and whatnot. You get the smoothest interpolation, and that is one interpretation of radial basis functions.

On that happy note, we will stop, and I'll take questions after a short break.

Okay, let's start the Q&A. The first question: can you
explain again how an SVM simulates a two-layer neural network?

Okay, look at the RBF case in order to get a hint. What does this feature do? It actually computes the kernel, right? So think of what this unit is doing as implementing the kernel. What is the neural-network unit implementing? It's implementing theta, the sigmoidal function, the tanh in this case, of this quantity. So now take that as your kernel, and verify that it is a valid kernel. In the case of radial basis functions we had no problem; in the case of neural networks, believe it or not, depending on your choice of parameters, that kernel could be a valid kernel, corresponding to a legitimate Z space, or it could be an illegitimate kernel. But basically, use that as your kernel, and if it's a valid kernel, carry out the support vector machinery. What are you going to get? You are going to get that kernel evaluated at different data points, which happen to be the support vectors; these become your units, and then you combine them using the weights, and that is the second layer of the neural network. So it will implement two-layer neural networks this way.

Next question: in a real example, where you're not comparing to a support vector machine, how do you choose the number of centers? Okay, this is perhaps the biggest question in clustering, and there is no conclusive answer. There are lots of information criteria and this and that, but it really is an open question; that's probably the best answer I can give. In many cases there is a relatively clear criterion: I'm looking at the minimization, and if I increase the number of clusters by one, the sum of squared distances should go down, because I have one more parameter to play with. If the objective function, the mean squared error, goes down significantly, then it looks like it was warranted to add this center; if it doesn't, then maybe it's not a good idea. There are tons of heuristics like that, but it really is a difficult question. The good news is that if you don't get it exactly right, it's not fatal; the exact number of clusters is not critical if the goal is to plug them in later for the rest of the RBF machinery. So validation would be one way of doing it; there are so many things to validate with respect to, but this is definitely one of them.

Another question: do RBFs, in applications with high dimensionality, suffer from the curse of dimensionality? It's a question of, you know, distances becoming funny, or sparsity becoming funny, in a higher-dimensional space. So the question of the choice of gamma and other things becomes more critical, and if it's a really very high-dimensional space and you have few points, then it becomes very difficult to expect good interpolation. So there are difficulties; but then you use other methods, and you also suffer from one problem or another there.

And can you review again how to choose gamma? Okay, so this is one way of doing it. Here, I am trying to take advantage of the fact that determining a subset of the parameters is easy. If I didn't have that, I would have treated all the parameters on an equal footing and just used a general nonlinear optimization, like gradient descent, to find all of them at once, iteratively, until I converge to a local minimum with respect to all of them. Now that I realize that, when gamma is fixed, there is a very simple one-step way to get to the w's, I would like to take advantage of that.
Can you review again how to choose gamma?

This is one way of doing it. Here I am trying to take advantage of the fact that determining a subset of the parameters is easy. If I didn't have that, I would have treated all the parameters on an equal footing and just used a general nonlinear optimization, like gradient descent, to find all of them at once, iterating until I converge to a local minimum with respect to all of them. Now that I realize that when gamma is fixed there is a very simple, one-step way to get the w's, I would like to take advantage of that. The way I take advantage of it is to separate the variables into two groups, similar in spirit to the expectation and maximization steps of the EM algorithm. When I fix gamma, I can solve for the w's directly; I get them in one shot. That is one step. Then I fix the w's I have and try to optimize with respect to gamma according to the mean squared error: I take the hypothesis with the w's held constant and gamma as the variable, apply it to every point x_1 up to x_N in the training set, take the squared difference from y_n, and sum them up; then I compute the gradient of that and minimize until I get to a local minimum. That is a local minimum with respect to gamma alone, with the w's held constant, so there is no question of the w's varying during that step; I simply get a value of gamma at which I achieve a minimum. Now I freeze it and repeat the iteration. Going back and forth is far more efficient than doing gradient descent on everything, just because the step that involves the many variables is a one-shot solution, and the algorithm usually converges very quickly to a very good result. It is a very successful algorithm in practice (a compact sketch of this alternation follows after the next question).

Going back to neural networks, now that you mentioned the relation with SVMs: in practical problems, is it necessary to have more than one hidden layer?

In terms of approximation, there is a result that tells you that you can approximate anything using a two-layer neural network, and the argument is fairly similar to the argument we gave before. So it is not necessary, and if you look at the people who use neural networks, I would say a minority use more than two layers. So I wouldn't consider the restriction to two layers, dictated by the support vector machine construction, to be a very prohibitive restriction. But there are cases where you need more than two layers, and in that case you go for the straightforward neural network, and you have an algorithm that goes with that.
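Here is a compact sketch of the alternation just described, for the model h(x) = sum_k w_k exp(-gamma ||x - mu_k||^2). It assumes the centers mu come from a prior clustering step; the step size and iteration count are illustrative choices, and a production version would add a stopping test and possibly a bias term.

```python
# Alternating scheme: with gamma fixed, the weights w have a one-shot
# least-squares solution (pseudo-inverse); with w fixed, gamma is nudged
# by a gradient step on the mean squared error.
import numpy as np

def basis(X, mu, gamma):
    """Return the N x K matrix exp(-gamma * ||x_n - mu_k||^2) and the squared distances."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2), d2

def fit_rbf(X, y, mu, gamma=1.0, outer_iters=50, lr=1e-3):
    for _ in range(outer_iters):
        # Step 1: gamma fixed -> solve for the w's in one shot.
        Phi, d2 = basis(X, mu, gamma)
        w = np.linalg.pinv(Phi) @ y
        # Step 2: w fixed -> one gradient step on the MSE with respect to gamma.
        residual = Phi @ w - y
        dPhi = -d2 * Phi                      # d/dgamma of each basis value
        grad = 2.0 * residual @ (dPhi @ w) / len(y)
        gamma = max(gamma - lr * grad, 1e-6)  # keep the Gaussian width positive
    return w, gamma
```

The efficiency claim in the answer is visible here: the step over the many parameters (the w's) is a single linear solve, and only the scalar gamma needs iterative descent.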
There is an in-house question. Hi, professor, I have a question about slide one: why do we come up with this radial basis function? You said it is because the hypothesis is affected most by the data points closest to x. Is it because you assume that the target function is smooth?

It turns out, in hindsight, that this is the underlying assumption, because when we looked at solving the approximation problem with smoothness, we ended up with these radial basis functions. There is another motivation which I didn't refer to, and this is a good opportunity to raise it. Let's say I have a data set (x_1, y_1), (x_2, y_2), up to (x_N, y_N), and I am going to assume that there is noise, but a funny noise: it is not noise in the value y, it is noise in the value x. That is, I cannot measure the input exactly, and I want to take that into consideration in my learning. Say the noise is Gaussian, which is a typical assumption: although this is the x that was given to me, the real x could be a little to either side. The value y itself I am going to consider noiseless; I am just uncertain which x it corresponds to. When you solve this, you realize that you have to make the value of your hypothesis not change much as x changes, because otherwise you run the risk of missing. And if you solve it, you actually end up with an interpolation which is Gaussian in this case. So you can arrive at the same thing under different assumptions. There are many ways of looking at it, but smoothness definitely comes in one way or the other: through the regularization view, through the input-noise interpretation, or through other interpretations.

Another question is about slide 6, where we choose a small gamma or a large gamma. From this example, can we say that a small gamma is definitely better than a large gamma?

Well, small is relative. The question is related to the distances between points in the space, because the value of the Gaussian decays in that space. The same width looks great if two points are close together, and the same width looks terrible if the two points are far apart, because by the time you get from one to the other the Gaussian will have died out. So it is all relative, but relatively speaking it is a good idea to have the width of the Gaussian comparable to the distances between the points, so that there is genuine interpolation (a sketch of this rule of thumb follows below). The objective criterion for choosing gamma will reflect that: when we solve for gamma we are using the K centers, so you have points that are the centers of the Gaussians, but you need to worry about each Gaussian covering the data points that are nearby. The good news is that there is an objective criterion for choosing it. That slide was only meant to make the point that gamma matters; now that we know it matters, the other slide gave the principled way of choosing it.

Does that mean choosing gamma makes sense when we have fewer clusters than the number of samples? Because in this case we have three.

Sure. That figure was meant just to visually illustrate that gamma matters, but the main utility is indeed for the K centers.

I see, because here, for both cases, the in-sample error is zero, and the issue is the generalization.

No question about that, absolutely.
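As a side note on the "width comparable to the distances" remark, a common rule of thumb (an assumption on my part, not from the lecture) is the median heuristic sketched below: with basis functions exp(-gamma ||x - mu||^2), choosing gamma from the median pairwise distance makes the Gaussians decay on the scale of the data. The 1/(2 * median^2) constant is a convention.

```python
# Median heuristic: set the Gaussian width from the typical distance
# between points, so each basis function genuinely covers its neighbors.
import numpy as np
from scipy.spatial.distance import pdist

def gamma_from_median_distance(X):
    med = np.median(pdist(X))      # median pairwise Euclidean distance
    return 1.0 / (2.0 * med ** 2)  # exp(-gamma * d^2) = exp(-1/2) at d = med
```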
Can we then take K, the number of clusters, as a measure of the VC dimension in this sense?

Well, it is a cause and effect. When I decide on the number of clusters, I decide on the number of parameters, and that will affect the VC dimension. That is the direction it goes, rather than the reverse. I didn't want people to take the question as: we want to determine the number of clusters, so let's look for the VC dimension; that would be the argument backwards. The statement is correct, they are related, but the cause and effect is that your choice of the number of clusters affects the complexity of your hypothesis set.

Can you reverse it? For example, if we have the data and we know what VC dimension will give good generalization, can we choose K based on that?

Ah, so this is out of necessity. You are not saying this is the inherent number of clusters that is needed; you are saying this is what I can afford. In that case it is true, but then it is not really the number of clusters you can afford; it is, indirectly, the number of parameters you can afford because of the VC dimension, regardless of whether they capture the data points correctly or not. The only thing I am trying to avoid is people thinking that this carries an answer to the optimal choice of clusters from an unsupervised-learning point of view; that link is not there.

I see, but in this example there seems to be no natural cluster in the input sample; it is uniformly distributed in the input space.

Even if there is clustering, you don't know the inherent number of clusters. But again, the saving grace here is that we can do a sort of half-cooked clustering, just to have representatives of some of the points, and then let the supervised stage of learning take care of getting the values right. That is the way to think of the clustering here: instead of using all the points, I want a few points that are as representative as possible, which puts me ahead of the game, and then the real test comes when I plug them into the supervised stage. Okay, thank you.

Are there cases where RBFs are actually better than SVMs?

You can run both in a number of cases, and if the data is clustered in a particular way and the clusters happen to have a common value, then you would expect the RBF centers to be ahead, whereas the support vectors sit on the boundary and have to be such that the cancellations of the RBFs give the right value. So you can definitely create cases where one wins over the other. That said, most people will use the RBF kernel within the SVM approach. I think that's it for today. Thank you, and we will see you next week.