You need to solve a classification problem. What algorithm do you use? Most people, from noobs to professionals, would consider support vector machines. But why is that the case? Why are support vector machines thrown around so much, even when people don't know that much about them? Well, we're going to answer that question. In this video, we're going to see exactly why support vector machines are so versatile by getting into the math that powers them. So, let's get to it. So, why are support vector machines so versatile? If you watched my last video, you probably already know the answer. Specifically, SVM makes use of the kernel trick to model nonlinear decision boundaries. Before we see how, let's set the groundwork and understand what we're dealing with. First, what is a point? We have a set of points we want classified. Say each point is represented by some feature vector x. But this may be too simple for many applications, so we map it to a more complex, nonlinear feature space, say phi(x). Thus we have a transformed feature space where each input feature vector x is mapped to a transformed basis vector phi(x). Nice. Now, what is a decision boundary? The decision boundary is the separator that divides these points into their respective classes. This hyperplane is represented as w transpose phi(x) + b = 0, where I write the intercept (bias) term b separately. In d-dimensional space, a hyperplane is a (d minus 1)-dimensional separator. So in two-dimensional space, the hyperplane is a one-dimensional line; in three-dimensional space, it's a 2D plane. You get it. Now, how do we represent distance? Well, the distance of the line ax + by + c = 0 from a point (x0, y0) is |a x0 + b y0 + c| divided by the square root of a squared plus b squared. In the same way, we can write the distance of a hyperplane from a point vector: |w transpose phi(x) + b| divided by the norm of w. Now, what are we optimizing? Consider the case where the data is perfectly separable.
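As a quick sanity check on those distance formulas, here's a minimal sketch in plain Python. The function names are my own, just for illustration:

```python
import math

# Distance from a point (x0, y0) to the line ax + by + c = 0:
# |a*x0 + b*y0 + c| / sqrt(a^2 + b^2)
def point_line_distance(a, b, c, x0, y0):
    return abs(a * x0 + b * y0 + c) / math.hypot(a, b)

# Same idea in d dimensions: the distance of a transformed feature
# vector phi_x from the hyperplane w . phi(x) + b = 0 is
# |w . phi(x) + b| / ||w||.
def point_hyperplane_distance(w, b, phi_x):
    dot = sum(wi * pi for wi, pi in zip(w, phi_x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

print(point_line_distance(3.0, 4.0, -5.0, 0.0, 0.0))            # 1.0
print(point_hyperplane_distance([3.0, 4.0], -5.0, [0.0, 0.0]))  # 1.0
```

In 2D both functions agree, since a line is just a hyperplane with w = (a, b) and bias c.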
This is when there exists at least one hyperplane that can separate the training data groups with 100% accuracy. To make things simpler, we consider binary classification: just two groups. Let's call them the positive and negative groups for now. For this data, there can be many hyperplanes that give us the same 100% accuracy. But how do we know which is the best to choose? Intuitively, we place it right down the center, where the distance from the closest points is maximum. This way, we make fewer mistakes during testing. And SVM does exactly this. The goal of SVM is to find the hyperplane that has the maximum margin from the closest points. In other words, the goal is to maximize the minimum distance. It's so intuitive. When making a prediction, substituting any group-one (positive) point into the hyperplane equation gives a value greater than zero, and any group-two point gives a negative value. Since we have training data, our labels will be plus one for the positive group and minus one for the negative group. The product of the actual label and the value of the hyperplane equation will be greater than zero if the point is classified correctly; otherwise, it's less than zero. Since we're dealing with a perfectly separable data set, the optimal hyperplane will classify all points correctly. Substituting this into the distance equation, and pulling the 1-over-norm-of-w factor (which doesn't depend on n) outside the min, we get the final form. The inner term, min over n of y_n times (w transpose phi(x_n) + b), represents the unnormalized margin of the closest point to the decision boundary. Now, for ease of simplification, let's normalize this equation. By this, I mean let's rescale things so that this value for the closest point is one. This is possible by multiplying both the weight vector w and the bias b by the same constant factor. Scaling both terms leaves their directions unchanged, so the margin and the hyperplane equation don't change.
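Written out, the max-min objective just described looks like this (a sketch of the standard derivation, with labels y_n in {+1, -1}):

```latex
(\mathbf{w}^{*}, b^{*})
  = \arg\max_{\mathbf{w},\, b}
    \left\{
      \frac{1}{\lVert \mathbf{w} \rVert}
      \min_{n}\,
      \left[ y_n \left( \mathbf{w}^{\top} \phi(\mathbf{x}_n) + b \right) \right]
    \right\}
```

Rescaling w and b so that the closest point satisfies y_n (w transpose phi(x_n) + b) = 1 is what makes this tractable.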
Using this back in our equation, the second term, representing the margin of the closest point from the hyperplane, becomes one. In other words, any point is at least at distance one (in these rescaled units) from the hyperplane. We also convert it to a minimization problem and introduce the half for notational convenience; constants don't change the solution, after all. This is the primal formulation for perfectly separable data. By primal form, I mean that the constrained optimization problem we just formulated contains the original variables of the problem. This primal formulation is the first formulation we make based on intuition, or the problem statement. Of course, this form was constructed assuming the data is separable, that is, that all points are always classified correctly. However, in the real world, this is almost never the case. So instead of making our classifier try to classify every single training point correctly and risk overfitting, we allow it to make some mistakes. And this is done by introducing, for each data point, a slack variable, represented by the Greek letter xi. Think of it as credit, money you owe, a penalty for incorrect classification. So a data point classified correctly has a slack of zero, and a data point classified incorrectly has a positive slack. Introducing xi to account for non-perfect separation, we now have a new primal formulation, with two major changes. In the objective, we have an extra term: we can introduce slack for every data point, but the total slack should be as low as possible. This is regulated by the hyperparameter capital C. If C is zero, the classifier isn't penalized for slack at all, so the optimization can afford to use it everywhere, and even large misclassifications are acceptable. The decision boundary would then be very loose and simple, which may lead to underfitting. But an infinitely high C means that even a small slack is heavily penalized, so the classifier cannot afford to misclassify a single data point.
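Written out, the two primal formulations just discussed are (standard notation; xi_n are the slack variables and C the penalty hyperparameter):

```latex
% Hard margin (perfectly separable data):
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{s.t.} \quad
y_n\!\left( \mathbf{w}^{\top} \phi(\mathbf{x}_n) + b \right) \ge 1
\quad \forall n

% Soft margin (slack variables allow mistakes, C sets their price):
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \
\tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{n} \xi_n
\quad \text{s.t.} \quad
y_n\!\left( \mathbf{w}^{\top} \phi(\mathbf{x}_n) + b \right) \ge 1 - \xi_n,
\quad \xi_n \ge 0 \quad \forall n
```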
And this may lead to very complex decision boundaries, and hence overfitting. So it's important to choose an appropriate C. The second change is the introduction of the non-negative slack terms in the constraints. This is an example of convex quadratic optimization. Why? The objective function is quadratic in w, and the constraints are linear in w and xi. However, there is a problem here. To solve this optimization, we need to know what phi is, and that makes it impractical, because phi may be extremely high-dimensional and we often don't know its form explicitly. What we need to do is rewrite this primal formulation in a different way, such that it is independent of phi. The alternate, or dual, formulation is going to make use of the concept we covered in the last video, kernels, to eliminate the dependence on phi. For now, the dual formulation involves rewriting the same problem using a different set of variables. To solve this constrained optimization problem, we use the method of Lagrange multipliers. In the general case, x is the original primal variable, where we seek to minimize a function f under a set of constraints determined by functions g_i. The new set of variables, also called the Lagrange multipliers, is a set of lambda_i. We now solve for the primal variables by differentiating this unconstrained Lagrangian. Then we plug the results back into the original equation to get an expression in only the dual variables. It'll be clearer when we actually apply this reduction to our primal form for SVM. First, write down the Lagrangian. Then solve by differentiating with respect to each primal variable individually: w, b, and xi. Express each primal variable as a function of the dual variables, the sets of lambdas and alphas. Now substitute these values back into the Lagrangian to eliminate the primal variables, and we just keep simplifying until we have our dual formulation. But what was the point of this again?
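For reference, here's the general Lagrangian and the resulting SVM dual (a sketch of the standard result; notice that phi only ever appears through inner products phi(x_n) transpose phi(x_m)):

```latex
% General Lagrangian for: min f(x) subject to g_i(x) <= 0:
L(x, \boldsymbol{\lambda}) = f(x) + \sum_{i} \lambda_i\, g_i(x),
\qquad \lambda_i \ge 0

% SVM dual after eliminating w, b, and xi:
\max_{\boldsymbol{\alpha}} \
\sum_{n} \alpha_n
- \tfrac{1}{2} \sum_{n} \sum_{m}
  \alpha_n \alpha_m\, y_n y_m\, \phi(\mathbf{x}_n)^{\top} \phi(\mathbf{x}_m)
\quad \text{s.t.} \quad
0 \le \alpha_n \le C, \quad \sum_{n} \alpha_n y_n = 0
```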
We already had a primal form, and we wanted to get rid of the complex phi term. We see phi here too, but this form actually doesn't depend on phi. How? Kernelization. Finally getting to the topic: a kernel is basically a function of two base feature vectors x that is symmetric and positive semi-definite. The inner product of the complex basis vectors can be represented by a simple function that depends only on the base feature vectors. That's the gist of it; for a detailed explanation of kernelization, check out my last video. Kernelized models are quite powerful. We can substitute this kernel into our dual formulation and hence make it independent of the complex basis feature map phi. So how are predictions made? Since we now have a dual that is much easier to compute, we pass it through a convex quadratic solver to get the dual variables, the sets of alphas and lambdas. When we want a prediction, we are given some test feature vector x, which is conceptually mapped to a complex phi(x). The prediction is w transpose phi(x) plus b, which can be rewritten using the duals. The prediction is then completely independent of the complex feature basis phi. In other words, we don't need to compute a complex basis to model non-separable data well. We just need a kernel function to do the job. You may have used the RBF or Gaussian kernels with SVM. Hopefully you now understand why that works so well. Man, this video took a long time to make, so show some love with a like and a subscribe, and ring that bell for notifications. If you want to watch some more content, click one of the videos right here. Please, just please support this young lad in his ventures. Don't you want to see skits explaining concepts too? I can be funny. Yeah, I can. It's true. My mom said so.
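To make the kernelized prediction rule concrete, here's a minimal sketch in plain Python. The support vectors, labels, alphas, and bias below are hand-picked toy values for illustration only; in practice they would come out of the convex quadratic solver:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian / RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2).
    # It acts like an inner product phi(x) . phi(z) in an
    # infinite-dimensional feature space -- phi is never computed.
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_predict(support_vectors, labels, alphas, b, x, kernel=rbf_kernel):
    # Kernelized decision rule: f(x) = sum_n alpha_n * y_n * k(x_n, x) + b,
    # predicted class = sign(f(x)). Note there is no phi(x) anywhere.
    f = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if f >= 0 else -1

# Toy example with hand-picked dual variables (illustrative only):
svs, ys, alphas, b = [[0.0, 0.0], [2.0, 2.0]], [-1, 1], [1.0, 1.0], 0.0
print(svm_predict(svs, ys, alphas, b, [2.0, 2.0]))  # 1
print(svm_predict(svs, ys, alphas, b, [0.0, 0.0]))  # -1
```

Points near the positive support vector get a positive decision value, and vice versa, exactly the w transpose phi(x) plus b prediction rewritten through the duals.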