This is basically a lecture on support vector machines, though the title you see on the screen is statistical learning theory. Let me tell you a bit of history. There were two statisticians named Vapnik and Chervonenkis; they are the main people who created this subject, statistical learning theory. They are statisticians from Russia. You know there was a cold war period between America and Russia, and these people did their work basically during the cold war period; after the cold war was over, communication between Russia and America became normal, and they went to the United States and presented statistical learning theory at a computer science conference. The basic problem that they attempted to solve is this: when you design a classifier, you have a training set; using the training set you design the classifier; then using the test set you somehow measure its performance; and if the performance on the test set is also satisfactory, you say that everything is fine with the classifier. But is it really fine? Even if it does well on the test set, how do you say that your classifier is generalizable? Is there any mathematical way of expressing its generalization capability; if you express it mathematically, is there any way of obtaining it; and for the different classifiers that we use, is there any way to calculate this generalizability?
So this is the basic question they attempted to solve. Since they are statisticians, they attempted to solve the whole thing in statistical language, so they coined the term statistical learning theory. Support vector machines, which you have probably heard of from many people, are a sort of byproduct of statistical learning theory. This is the basic history, and you will find a book by Vapnik on statistical learning theory which is basically a book on statistics. You will find support vector machines considered a part of neural networks; you will find them considered a part of machine learning and data mining; and of course, since we are talking about classifiers and their performance, you will consider them a part of pattern recognition. So you will find support vector machines, and their generalizations such as kernel machines, in almost all these fields. That is a little bit of history; now let me explain the basic terminology, so look at your screens. First, you are given small n points x1, x2, ..., xn; you see, the very first step, you are given n points, and they are in capital N dimensional space. yi denotes the class label of the point xi; I am assuming that you have two classes only, and the class labels are given as -1 and +1. p(x, y) is a probability distribution on the data: there is some probability distribution where small p is the density function and capital P is the actual probability, P(A) = integral over A of p. Now these points x1, x2, ..., xn and the corresponding yi are assumed to come from the distribution p(x, y), where p(x, y)
is not known; and here it is written that they are i.i.d., independent and identically distributed. Now, in classification, what exactly is the problem? You are given n points and the corresponding class labels, and somehow you need to find a function from xi to yi; are you understanding it? If you find the functional form which, for every xi, gives the value yi, then you are done; finding that functional form is the basic problem of classification. These f's can go by many names. Some f's will give you +1 for one class and -1 for the other class, and at some places they take the value 0; where the value is 0, you call it the separation between class +1 and class -1. Now, what is it that we are given? We are given a set of functions, script F: the functions are f(x, alpha), where x is the input and alpha, written as a vector, is the set of adjustable parameters. Let me try to explain. I hope all of you know what a multilayer perceptron is: you have an input layer, some hidden layers, and an output layer. Going from the input layer to hidden layer one, from hidden layer one to hidden layer two (if you assume two hidden layers), and from hidden layer two to the output layer, between every two such layers you have many connections and connection weights, and you start with some initial connection weights. Put all the connection weights together and write them as one vector; that
vector you take as alpha, and x is your input. Given an x and given an alpha, the usual feed-forward neural network will give you an output, and that output is f(x, alpha). So given the input vectors x1, x2, ..., xn and given alpha, the adjustable parameters, applying your neural network methodology gives you f. If you change the alphas, the values of f are going to change; if you change your network architecture, then f itself changes. For a given input x and a given choice of alpha, f(x, alpha) will always give the same output. A particular choice of alpha generates a trained machine. Why a particular choice? When you are training a neural network, you start with some choice and go on changing it; you have given some rule for termination, and when it terminates you assume that you have got nice values for alpha, so you are training the machine to get nice values for alpha. A neural network with a fixed architecture, with alpha corresponding to the weights and biases, is a learning machine. Now, when you have a function f and you fix alpha, what is the exact risk that you are taking? The observed value is f(x, alpha) and the expected one, the actual one, is y: take the difference, take its modulus, and integrate with respect to dP(x, y) over all the x's; that gives you the actual risk, R(alpha).
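To make the notion of a learning machine concrete, here is a minimal sketch (my own illustration, not from the slides): a tiny fixed-architecture feed-forward network whose connection weights are packed into one vector alpha, so that a fixed pair (x, alpha) always produces the same output f(x, alpha), and changing alpha, which is what training does, changes the function.

```python
import math

def mlp_output(x, alpha):
    # A 2-input, 2-hidden-unit, 1-output feed-forward net; alpha packs
    # all connection weights as one vector (biases omitted for brevity):
    # alpha = [w11, w12, w21, w22, v1, v2]
    h1 = math.tanh(alpha[0] * x[0] + alpha[1] * x[1])  # hidden unit 1
    h2 = math.tanh(alpha[2] * x[0] + alpha[3] * x[1])  # hidden unit 2
    return alpha[4] * h1 + alpha[5] * h2               # output layer

def f(x, alpha):
    # The learning machine's class label: the sign of the network output.
    return 1 if mlp_output(x, alpha) >= 0 else -1
```

A fixed alpha is one trained machine: calling f with the same x and the same alpha always returns the same label, while a different alpha may label the same x differently.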
What is it that we are actually calculating? For the ith point and a given alpha, yi is the target output and f(xi, alpha) is the observed output; take the modulus of the difference, sum over i = 1 to n, and multiply by 1/2n. This is the empirical risk, and it is what we actually calculate; what we are supposed to calculate is the actual risk. Here I need to tell you one thing: everything here is explained using the modulus sign, and you get similar results when you take squared terms, which is what you do in neural networks when you try to minimize the error: you take the squared error, take the summation, and use some gradient descent to do the minimization. The results are similar either way. Another thing: you will get all this material from a famous tutorial, I do not want to call it lecture notes, written by Christopher Burges and available on the internet, "A Tutorial on Support Vector Machines for Pattern Recognition"; most of what I am going to tell you about support vector machines you will find in that tutorial. So this empirical risk is what we are calculating, and the actual risk that we are supposed to calculate is R(alpha). Now choose eta so that 0 < eta < 1; I am sorry, the less-than-or-equal signs on the slide should not be there, eta should be strictly between 0 and 1. Then what Vapnik proved is that the actual risk that we are supposed to calculate is less than or equal to the empirical risk plus
the square root of (h (log(2n/h) + 1) - log(eta/4)) / n, where n is the number of points, and this relationship holds with probability 1 - eta. This is what was shown by Vapnik; I think it was in the year 1983-84, or is it 93-94, I am not exactly sure. Now look at this expression. Note that in neural networks we are trying to minimize the empirical risk, but what we are supposed to be doing is minimizing the actual risk; that is the thing we really want to do. By minimizing the empirical risk, are we actually able to minimize the actual risk? The problem is that after minimizing the empirical risk, this second term is still there, and the actual risk is only known to be less than or equal to their sum, with probability 1 - eta. If we want this relationship to hold with high probability, we need to take the value of eta to be very small: if we take eta = 0.05, then R(alpha) is less than or equal to the empirical risk plus this term with probability 0.95. So usually people take eta to be 0.05 or 0.01, some very small value. Now there is an unknown term here: h. n is the number of points, eta is the parameter we fixed at 0.05 or 0.01, and then there is this h. h is a non-negative integer called the VC dimension, VC for Vapnik-Chervonenkis; it has to be an integer, it cannot take fractional values, and it provides what is known as capacity. We will come to what h is slightly later. Now let us denote this whole square-root term, sqrt((h (log(2n/h) + 1) - log(eta/4)) / n), by the capital Greek letter Xi.
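Both ingredients of this bound are easy to compute once h, n and eta are fixed. A minimal sketch (the function names are mine): for labels in {-1, +1}, each term |yi - f(xi, alpha)| is either 0 or 2, so the empirical risk is just the fraction of misclassified training points, and the confidence term shrinks as n grows for fixed h.

```python
import math

def empirical_risk(labels, predictions):
    # (1/2n) * sum_i |y_i - f(x_i, alpha)|; with labels in {-1, +1}
    # this equals the fraction of misclassified training points.
    n = len(labels)
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / (2 * n)

def vc_confidence(h, n, eta=0.05):
    # sqrt( (h*(log(2n/h) + 1) - log(eta/4)) / n ): the term added to the
    # empirical risk; with probability 1 - eta the actual risk is at most
    # empirical_risk(...) + vc_confidence(h, n, eta).
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)
```

For instance, with h = 10 and eta = 0.05 the confidence term is about 0.67 at n = 100 but drops below 0.1 at n = 10000, so the empirical risk becomes a trustworthy proxy for the actual risk only with enough data.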
This Xi is independent of the distribution P: in Xi, eta is a constant that we have already fixed and n is the number of points, so the only remaining quantity is h, and h, the VC dimension, which we will define slightly later, is independent of the distribution of the points. So the whole of Xi is independent of the distribution P; Xi is called the VC confidence, and if we know h we can compute it. Now, a learning machine is another name for a family of functions F, and we take the machine which minimizes the right-hand side of (1): we minimize the empirical risk plus this distribution-independent term, and by minimizing it we hope that R(alpha) and the empirical risk end up very close to each other. Now let us see what the VC dimension is; it is actually a nice quantity. The (xi, yi) are the given points: the xi belong to R^N, there are n of them, the yi take the values -1 and +1, and F is the family of functions under consideration. Now, if you have n points, in how many different ways can you label them? In 2^n different ways; do you agree to that? All the points in class +1, that is one way; one point in class -1 and the remaining n - 1 in class +1; two points in class -1 and n - 2 in class +1; and so on. If you count like this, there are 2^n different ways in which you can label n points.
Now, a set of functions F is said to shatter a collection of n points if for every labeling of these n points we can get a function f in F which provides that labeling. Is this clear? Let me explain. You have a set of functions and a set of n points; the n points can be labeled in 2^n ways, and for every labeling you need to get a function. Note that when I started this lecture I told you our aim: from the set of points you need to get a function taking each xi to its label yi, and once you get that function, you are through. Now 2^n different labelings are possible, and if for each labeling you have a function in the family which gives that labeling, then we say that this set of functions shatters the n points; I will explain it in a bit more detail after a few minutes. The VC dimension of a set of functions is defined as the maximum number of points that can be shattered by it. Suppose the maximum number of points is 10: that means some set of 10 points is shattered by the collection of functions, and no set of 11 points, or 12, or 13, or 14 points can be shattered by it; then the VC dimension of the set of functions is that value, 10. VC dimension h implies there exists one set of h points that can be shattered by F,
but it does not mean that every set of h points can be shattered by it. I will explain all these things using an example, which I will do on the board. Suppose your set F is all possible straight lines, and let us say we are in two-dimensional space. Now take two points: how many labelings are possible? Four. Take one straight line here, with the arrow denoting its positive side: both points are in class +1, and this straight line gives you that labeling. For the same two points, a second straight line puts this point in class +1 and that point in class -1; a third does the reverse, putting that point in class +1 and this point in class -1; and a fourth puts both points in class -1, with the other side being class +1. Is this clear?
So two points can be shattered by the set of straight lines. I placed these two points in one particular way, but I could have taken them in any way I like: every set of two points can be shattered by the set of straight lines, am I correct? Now, instead of two points I will take three. First I put all three of them in class +1; then I start putting two points in class +1, these two, then these two, then these two; then one point in class +1, this one, this one, this one; and finally no point in class +1. So here I have taken one set of three points, and it is shattered by the set of all possible lines; is this clear? Now let us see whether every set of three points can be shattered. The answer is no: take three points on a single line, and say the middle point goes to -1 while the other two are +1; can you get a single straight line which gives you this labeling? No. So we have one set of three points that can be shattered by straight lines, but not every set of three points. Now take any set of four points: no set of four points can be shattered by a straight line. You have the famous example, you remember it, the XOR configuration, and here you see the connection between this and neural networks; many people, when they introduce support vector machines, first give this example and then go on to shattering and the rest. In fact this can be proved mathematically
that no set of four points can be shattered by a straight line. So in R2 the VC dimension of the set of straight lines is 3; let me repeat, in R2 the VC dimension of the set of straight lines is 3, because there exists a set of three points that can be shattered by the set of straight lines, and no set of four or more points can be shattered by it. That is the example on the slide: in two dimensions F consists of all straight lines, the VC dimension of straight lines is at least 3, and it is not 4 because no set of four points can be shattered, so the VC dimension is 3. More generally, it can be proved that the VC dimension of hyperplanes in R^N is N + 1: if you look at R^N and at all possible hyperplanes, you will be able to find a set of N + 1 points which can be shattered by these hyperplanes, and no set of N + 2 or N + 3 or N + 4 points can be shattered by them. Why is this shattering important? Note that in an MLP we assume an architecture and then make it learn; the moment you assume an architecture, you have assumed a certain functional form, and with that functional form you vary all those alphas. Let me put it this way: if, with the functions under consideration, you are able to shatter at least one set of n points, where n is the number of points you are given, then we can at least think of getting the classification properly for the given set of n points.
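The straight-line examples above can be checked by brute force: enumerate all 2^n labelings of a point set and test each one for linear separability. The sketch below is entirely my own, not from the lecture; it uses a perceptron as the separability test (the perceptron converges exactly when a separating line exists), and the iteration cap is a heuristic that is adequate for these tiny integer examples.

```python
from itertools import product

def linearly_separable(points, labels, max_passes=2000):
    # Perceptron with a bias term; convergence within the cap means the
    # labeling is realizable by a line, hitting the cap is read as "no".
    w = [0.0, 0.0, 0.0]
    aug = [(x, y, 1.0) for (x, y) in points]
    for _ in range(max_passes):
        updated = False
        for p, label in zip(aug, labels):
            if label * sum(wi * pi for wi, pi in zip(w, p)) <= 0:
                w = [wi + label * pi for wi, pi in zip(w, p)]
                updated = True
        if not updated:
            return True
    return False

def shattered_by_lines(points):
    # Shattered: every one of the 2^n labelings is achieved by some line.
    return all(linearly_separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))
```

Three points in general position come out as shattered; three collinear points do not (the +1, -1, +1 labeling fails); and no four points do, for example the XOR configuration.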
Now, can we shatter the given set of n points? The given set of n points has two classes, some labeling is there, and by assuming an architecture you have assumed a functional form; can this functional form shatter at least one set of n points at all? If the VC dimension is less than the value n, then whatever you do, you have a problem; are you understanding me? If the VC dimension is less than n, you do have a problem. Now there are some comments. First, it is not necessarily true that learning machines with more parameters have a high VC dimension and learning machines with fewer parameters have a low VC dimension; examples exist in the literature. Second, a family of classifiers has infinite VC dimension if it can shatter a set of n points however large n may be. Now, examples exist in the literature where a set of functions has infinite VC dimension and yet is not able to shatter some particular finite point set; here I wrote finitely many points, and the example given there was for four points: the family has infinite VC dimension, but that particular set of just four points cannot be shattered. Why? The point is that if some set of n points can be shattered by the given function set, then the VC dimension is at least n; we are not saying that every set of n points must be shattered. So the VC dimension, by its very definition, is a very weak notion; I hope you are understanding this. It is weak because you are satisfied if one set of n points is shattered; but your given point set also has n points, and while one set can be shattered, this one may not be, and then you
have a difficulty here. This is one of the problems, precisely because the VC dimension as per its definition is a very weak one. Yes, VC dimension 10 means at least one set of 10 points can be shattered; so for 11, 12, or 13 points you know that you probably cannot get the classification, but for 9, 8, 7, or 6 points, whether you can get the classification is not clear. Above the VC dimension you know you cannot get it; below it you have no idea. That is the basic difficulty with the VC dimension, and that is a place where the theory needs to be developed: one needs something stronger than the statement that some one set of points can be shattered. I do not want to delve into these things; the theory connecting SVM to the structural risk minimization principle is not dealt with here; these are extremely, highly mathematical subjects, and I do not want to go into all that mathematics, at least not now. So, the maximum margin classifier: I am not going into the connections between VC dimension theory and SVM, I am coming directly to SVM. You have the data (xi, yi), i = 1 to n, where each label y
i belongs to {-1, +1} and each xi is in R^N. Now I am assuming that the data is linearly separable: there exists a hyperplane such that on one side of the hyperplane you get the positive points, the +1 points, and on the other side you get the negative points, the -1 points. There is a basic theorem here: the given data set is linearly separable if and only if the convex hulls of the two classes do not intersect. I hope you know this result; I will repeat it. The data is linearly separable, that is, there exists a hyperplane with all the +1 points on its positive side and all the -1 points on its negative side, if and only if, when you take all the positive points and construct their convex hull, and take all the negative points and construct their convex hull, these convex hulls do not intersect. It is if and only if: if the convex hulls do not intersect, then you get such a hyperplane, and if you get such a hyperplane, then the convex hulls do not intersect. So if the data is linearly separable, there exists a vector w with w'xi > 0 for all i with yi = +1, and w'xi < 0 for all i with yi = -1; or, multiplying by yi, yi w'xi > 0 for all i, because when y
i is equal to +1 it is +1 times w'xi, which is greater than 0, and when yi is -1 it is -1 times a negative quantity, which again becomes greater than 0. So if there exists one such vector w for which this holds (on the slide, these subscripts should be replaced by primes, transposes), then there are infinitely many such vectors; how does one choose one optimal classifier? I hope this is known to all of you: if you have one separating hyperplane, then you have infinitely many separating hyperplanes. If you look at the basic hard-limiting simple perceptron, in its convergence theorem you assume linear separability of the classes, you start with some hyperplane, and you go on changing it; it can be proved that as the iterations proceed, the error actually goes to 0 in a finite number of steps. That is called the perceptron convergence theorem, and with so many hyperplanes available, the perceptron will simply converge to one of them. Now the question here is: how does one choose an optimal classifier, and optimal from the point of view of what? Note that if there exists w such that yi w'xi > 0 for all i, then for any positive scalar lambda, lambda
W also satisfies this condition. So what we can do is set the margin, that is, the minimum distance from the hyperplane to the positive points and the minimum distance from the hyperplane to the negative points, equal to one on each side, and achieve it with minimal weight. Let us see the meaning of that; please look here. You have two classes, six points in this class and seven points in that class, and this is one hyperplane. Take the distance from this hyperplane to every point of one class and find the point with the minimal distance; then again take the distance from this hyperplane to every point of the other class and find the point with the minimal distance. Now you can scale the hyperplane in such a way that this shift is 1 and that shift is 1, so that together the separation becomes 2, and the actual distance is 2 divided by the norm of the weight vector, 2/||w||. Now compare: for this hyperplane the minimum distances give one separation, and for that other hyperplane, doing the same thing, they give another, and this separation is more than that one. So basically what we would like to do is choose the hyperplane in such a way that, with the same shift on both sides, you just reach a negative point on one side and a positive point on the other; the distance between these two shifted hyperplanes is 2/||w||, where w gives the equation
for this hyperplane, w'x equal to a constant; similarly in the other case w gives the equation for that hyperplane. What we would like to do is maximize this distance. Another way of putting it: ||w|| is the square root of w'w, so maximizing 2/||w|| is the same as minimizing w'w. You want to maximize the margin, the distance between this hyperplane and that hyperplane, and this classifier has another name, the maximum margin classifier: the word margin is used for the distance between the two shifted hyperplanes, different separating hyperplanes give different margins, and this one is the maximum of all possible margins. So one way of saying it is: you compute the margin and maximize it; equivalently, you minimize w'w, and if you write a half in front it does not matter, because half is a constant. So the problem is: minimize (1/2) w'w subject to yi w'xi >= 1 for all i = 1 to n. This is what is known as a QP problem, a quadratic programming problem. There is quite a bit of literature on convex optimization, and the functions under consideration here are mostly convex functions; in fact w'w is a convex function.
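The scaling argument above can be checked numerically. The sketch below (the function name is mine, and a bias b is included for generality) computes the geometric margin, the smallest signed distance min over i of yi (w'xi + b) / ||w||; multiplying w and b by any positive constant leaves it unchanged, which is exactly why we are free to fix the functional margin at 1, after which the gap between the two shifted hyperplanes is 2/||w||.

```python
import math

def geometric_margin(w, b, points, labels):
    # Smallest signed distance from the hyperplane w'x + b = 0 to a
    # training point; positive iff the hyperplane separates the classes.
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(points, labels))
```

For example, with the two points (1, 1) labeled +1 and (-1, -1) labeled -1 and the hyperplane w = (0.5, 0.5), b = 0, the functional margin yi(w'xi + b) is exactly 1 at both points (the canonical form), and the geometric margin is 1/||w||, so the gap between the two margin hyperplanes is 2/||w||.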
Do you know the meaning of a convex function? A function f is said to be convex if, for every pair of vectors x and y and every lambda in [0, 1], f(lambda x + (1 - lambda) y) <= lambda f(x) + (1 - lambda) f(y). As an example, please look at this: this point is x, that point is y, and this is your function. lambda x + (1 - lambda) y is a point in between, and this is f(x), that is f(y). If you vary lambda over the interval 0 to 1, lambda x + (1 - lambda) y moves along the segment from y to x, and lambda f(x) + (1 - lambda) f(y) traces out this straight line segment. Now consider the value of the function at every point of the interval: that value is less than or equal to the corresponding value on the line segment, so this is a convex function; is this clear? There is quite a bit of literature on convex optimization, and this quadratic programming problem of how to get the w's is basically solved using many results available in convex optimization. I hope all of you know what linear programming means: the constraints are linear and the function to be optimized is also linear. In a quadratic programming problem the function to be optimized is quadratic; as you can see, w'w is quadratic, and that is why this is called a QP problem. So far we have assumed that the data is linearly separable. If it is not linearly separable, then what people do is relax the constraint to yi w'xi >= 1 - xi_slack, and you can take this slack to be dependent on i: either you keep something fixed or, as you usually see in the literature, you have the slack dependent on i, that is, 1 - xi_
i, and you again minimize w'w subject to these constraints. This is in fact an extremely complicated problem to solve, and when the classes are not linearly separable, this is what is known as a soft formulation of the problem. Now, it is a quadratic programming problem, that is true, and the quadratic programming comes slightly later. Look at the objective function under consideration: (1/2) w'w minus the sum over i of alpha_i (yi w'xi - 1), where the alpha_i are Lagrange multipliers. The basic problem is a quadratic programming problem because the function to be optimized, (1/2) w'w, is quadratic, and one of the ways to solve it is by using Lagrange multipliers: you do partial differentiation with respect to w, set it equal to 0, and that gives you w. With the dual formulation the minimization can be achieved, that is true, but the programming part is quite intensive; it is not actually a simple one. Then this is generalized to the case where the two-class problem is not linearly separable, which we discussed on the previous slide: you take the constraint as 1 - xi_slack, or with the slack dependent on i, as 1 - xi_i.
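The stationarity step the slide alludes to can be written out explicitly; under the lecture's hard-margin formulation (no explicit bias term, as in the constraints above), setting the gradient of the Lagrangian to zero expresses w as a combination of the training points, and substituting back gives the dual QP in the multipliers alone:

```latex
% Lagrangian of the hard-margin problem (bias omitted, as in the lecture)
L(w,\alpha) \;=\; \tfrac{1}{2}\, w^{\prime} w
  \;-\; \sum_{i=1}^{n} \alpha_i \bigl( y_i\, w^{\prime} x_i - 1 \bigr),
  \qquad \alpha_i \ge 0 .

% Stationarity in w:
\frac{\partial L}{\partial w} \;=\; w - \sum_{i=1}^{n} \alpha_i y_i x_i \;=\; 0
\quad\Longrightarrow\quad
w \;=\; \sum_{i=1}^{n} \alpha_i y_i x_i .

% Substituting w back into L gives the dual QP in the \alpha_i alone:
\max_{\alpha \ge 0} \;\; \sum_{i=1}^{n} \alpha_i
\;-\; \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n}
      \alpha_i \alpha_j\, y_i y_j\, x_i^{\prime} x_j .
```

Note that the data enter the dual only through the inner products x_i' x_j, which is the doorway to the kernels mentioned at the end of this lecture.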
The second point is: suppose you have more than two classes, say 3, 4, 5 classes; then the problem becomes more complicated. There are two ways in which people have tried to solve it. One is one-against-the-rest: belonging to class 1 versus not belonging to class 1, belonging to class 2 versus not belonging to class 2, class 3 versus not class 3, and so on. That is one way. The second way is that you take every pair, 1 vs 2, 1 vs 3, 1 vs 4, 1 vs 5, and for each pair you try to get the linear boundary or the soft boundary. Now in all these cases the problem formulation, the solution of the problem, becomes extremely complicated. This work has also given rise to another class of problems, known as support vector regression problems. You see, in this figure this is the actual decision boundary, the one that is giving you the maximum margin, and this margin line is passing through this data point, and this one is passing through this data point. What is a support vector? These two points are actually known as support vectors, because this is the actual decision boundary, and if you set the function value to −1 you get this line, if you set it to +1 you get this line; the +1 line is passing through this point and the −1 line through this one. These points are known as support vectors. Now what is the usual regression problem? I will do it here. You have a data set and, if you are looking at linear regression, you would like to approximate this data set by a line. Now take a line and draw two parallel lines at the same distance on either side, in such a way that there are no points outside them on this side or on that side; all the points are lying in between these two lines.
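The one-against-the-rest scheme mentioned above can be sketched in a few lines: train one binary scorer per class and label a point with the class whose scorer responds most strongly. The scorers below are hypothetical hand-picked linear functions, standing in for trained SVM decision functions:

```python
def one_vs_rest_predict(scorers, x):
    """One-against-the-rest prediction: `scorers` maps each class
    label to a scoring function (think w_k'x from a binary SVM of
    class k versus everything else); return the best-scoring class."""
    return max(scorers, key=lambda k: scorers[k](x))

# Three hypothetical linear scorers for a 3-class problem in 2-D.
scorers = {
    1: lambda x: x[0],    # class 1: large first coordinate
    2: lambda x: -x[0],   # class 2: small first coordinate
    3: lambda x: x[1],    # class 3: large second coordinate
}
print(one_vs_rest_predict(scorers, (2.0, 0.5)))   # 1
print(one_vs_rest_predict(scorers, (-2.0, 0.5)))  # 2
print(one_vs_rest_predict(scorers, (0.0, 3.0)))   # 3
```

The pairwise (one-vs-one) alternative would instead train a scorer for every pair of classes and take a majority vote among the pairwise decisions.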
Getting hold of that middle line amounts to getting hold of these three lines, and that amounts to the problem of support vector machines, except that in the support vector machine formulation the points lie either on this side of the line or on that side, two classes, whereas here all the points are lying in between. Are you following me? It is basically the complementary situation: all the points are lying in between. So when you get this line with all the points lying in between, this is basically your regression line, the one that best approximates the data. This has given rise to what is known as support vector regression, and regression has very many applications. Any forecasting problem, or let me say most forecasting problems, are regression problems: for the past 20 days the price of this stock is so and so, what is the price tomorrow? What line approximates this, what curve approximates this, that is regression. Just as classification has many applications, regression has many applications, and quite a few people are working on support vector regression. There is one other comment I would like to make. I have been talking about a linear boundary, and when a linear boundary does not exist, what I have done is put in a margin, the soft margin. Now there is an extension of this to non-linear boundaries, where people actually consider kernels, quadratic kernels or other kernels, for obtaining non-linear boundaries. You will find several subjects named around this; say, one subject is kernel machines, and you might have found a book titled Kernel Machines. It is basically the extension of SVM to the non-linear case: when you have a non-linear boundary, you use kernels to obtain, or at least try to obtain, that boundary.
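To see why kernels give non-linear boundaries, it helps to check that a kernel is just an ordinary inner product taken in a higher-dimensional feature space. Here is a small Python sketch for the quadratic kernel; the feature map written out is the standard degree-2 expansion for 2-D inputs, and the data values are made up:

```python
import math

def quadratic_kernel(x, y):
    """k(x, y) = (x'y + 1)^2, a polynomial kernel of degree 2."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** 2

def phi(x):
    """Explicit degree-2 feature map for 2-D input whose ordinary
    inner product reproduces the kernel above (a standard identity,
    written out here for illustration)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2,
            math.sqrt(2) * x1, math.sqrt(2) * x2, 1.0)

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = quadratic_kernel(x, y)                      # computed in the input space
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product in feature space
print(abs(lhs - rhs) < 1e-9)  # True: the two agree
```

A linear boundary in the six-dimensional feature space of phi corresponds to a quadratic boundary back in the original two-dimensional space, and the kernel lets the SVM work with those inner products without ever forming phi(x) explicitly.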
So these are the extensions of support vector machines: one is the linear case where the error is not exactly zero, that is the soft margin; another is the non-linear case, where you consider kernels; and another is support vector regression. A lot of work is going on in all these fields, and the work in these fields nowadays is termed machine learning. With this I stop the lecture.