# Support Vector Machines: All you need to know! Support vector machine is one of the best nonlinear supervise to machine learning models. Given a set of labeld training data SVM will help us to find optimal hyperplane which categorized new examples. In one dimensional space. The hyperplane is a point; in two dimensional space the hyperplane is a line and in three dimensional space, the hyperplane is a surface that divides a space into two parts.

Each class lies on either side. In the first section of this article, we’re going to go through the detailed process of finding the optimum hyperplane. Let’s first start with a two dimensional, linear, separable case.

• Figure 1 shows that the data are separated by a line. In this case, the data are linearly separable.
• Figure 2 is an example of non linearly separable data, which means that we can not find a line to separate the data.

Therefore, we say the data are linearly separable If we can find a line to separate the original data. Let’s have a closer look at figure 1. In general, there are a lot of possible solutions for drawing a hyperplane to separate the positive samples from the negative samples. But which one is the best?

SVM can help us choose the best hyperplane, which maximize the margin, or sometimes we call it a street around the hyperplane to separate the positive ones from the negative ones. That’s the reason why sometimes we also call SVM a maximum margin classifier.

Now let’s introduce another important concept – support vectors. Support factors are the data points that lie closest to the hyperplane. They are the data points most difficult to classify, which also decide which hyperplane we should choose. Unlike some other machine learning you like linear regression or neural network, which all data points influence the final optimization. For SVM. Only the difficult points, which is support vectors, has an impact on the final decision boundary. Moving a support vector moves the decision boundary, moving the other vectors has no impact.

Just like any other machine learning models, during the inference time, SVM will predict output y with input x. The input is just a set of features x for each sample. SVM will help us to find a set of weights W for each feature. Then, through the linear combination of input features and weights, we can predict the output y. So far, it is just I like neural nets. However, there is a major difference. SVM will optimize the weights in such a way that the only support vectors determine the weights and the decision boundary.

Now, let’s start the fun part of this section. We’re going to go through the underlying mathematical process for SVM to find the hyperplane and maximize the margin. This part only requires that you have some basic background of linear algebra and optimizations theory. Still, for this 2d example, we have three important lines, the middle line of the street, Hzero, which is a hyperplane, two gutters of the street, H1 and H2.

Now, let’s assume there is a vector w perpendicular to these three lines. For all the points on H0, they should have the same projection distance on w direction, which is w dotted x over the magnitude of w is equal to c. Let’s multiplied the magnitude of w on each side of the equation. We can obtain w dotted x is equal to c times the magnitude of w.

Now let’s substitute c times the magnitude of w with -b. we can get the function of the line H0 as w dotted x plus b is equal to 0, which is the equation for H0. Here we only have two dimensions, so this can be expanded as a following function. In fact, this expression works for any number of dimensions. Once we have the expression of hyperplane, we can then use it to make predictions, for this example, If the expression for any unknown sample is equal or greater than zero.

we predict it as a positive sample. Otherwise, we predict the unknown as negative. Similarly, we can write the expression for gutter H1 and H2. k is just another constant value and we can rescale w and b to make k is equal to 1 for mathematical convenience. As we have introduced before, SVM is a maximum margin classifier. To find a way to express a margin or street width here, Let’s introduce two vectors one x negative to a negative sample on the gutter H2, another one x positive to a positive sample on the gutter H1.

Now let’s take the difference of these two vectors and dot product it with the unit vector perpendicular to H0, we can get the expression of the width of the street. From previous derived function for gutter H1 and H2, we can get w dotted x positive is equal to one minus b for the positive sample and w dotted x negative is equal to negative y minus b for negative sample, let’s substitute them back to the street width equation and derive it as 2 over the magnitude of w.

In order to maximize this width, we thus need to minimize the magnitude of w. For mathematical convenience, it is equivalent to minimize one half times the magnitude of w squared, which also equivalent to one half times w dotted w. However, we cannot make the street width arbitrary large. We can only expand the street width under the condition that there are no data points between H1 and H2 mathematically.

This means for all the positive samples, they should obey the constraints the w dotted x plus b is greater or equal to 1, and for all the negative samples, the constraint is w dotted x plus b is less or equal than -1. Now let’s introduce another variable y to transforms these two equations to make it more convenient from mathematical perspective.

Here we assume y to be one for all positive samples and -1 for all negative samples. So the previous two equations can be rewritten as a single equation for all the samples. Let’s revisit the problems we are trying to solve here. We want to optimize the street width 1/2 times w dotted w subjected to the constrained equation for all the samples. It is a constrained optimization problems and we have a powerful technique Lagrange multiplier for this.

If we are going to find an extreme of a function with constraint, then we are going to have to use like Lagrange multiplier. This will give us some new expressions, using which we can find the optimization of it without thinking about the constraints anymore. For this problem, the new expression can be written as one-half times the w doted w minus the summation of all the constraints, and each of the constraints is going to have a multiplier alpha sub i. What we are going to do next is to find the extreme of this new expression.

Well, for this we just need to find all the derivatives of this expression and set them to zero. The first equation is a partial derivative of L, the Lagrangian with respect to the vector w. After setting it to zero. We can get to the w equals to the summation of alpha of sub i. times y sub i times x sub i over i for all the samples. This is very interesting. it tells us the vector w can be obtained with a linear sum of the sample vectors. Actually, at the end of the optimization.

A lot of alpha i is going to be zero and the non zero alpha. correspond to the samples in the gutter, which is the support vectors. Now we should differentiate Lagrangian with respect to to b as well, which is going to be equal to the sum of alpha sub i times y sub i, also at the extreme condition, it should be equal to zero.

Now we are going to plug the expression w back in the Lagrangian equation and the magic happens here. We can merge the first term and the second term and the rewrite the equation as a following one. What we’ve discovered is that the optimization depends only on the dot product of pairs of samples.

Now we can also substitute w back to the decision rule, which is going to be sum of alpha sub i times y sub i time x sub i dotted the unknown vector u plus b, if that is greater or equal to zero, then the prediction is a positive sample. Now, we discovered that the decision rule also depends only on the dot product of those sample vector with the unknown vector.

All our previous discussion is based on the assumption that samples are linearly separable and there are no outliers in the data, and the above SVM formulation is called Hard Margin SVM. What if there are some noise and outliers in the dataset, then the Hard Margin SVM simply can’t tolerate them and fail to find the optimization. Let’s revisit our inequality constraints here, for the optimization problem to be solved, all the constraints have to be satisfied.

In this part, we’ll talk about how to deal with this limitation using Soft Margin SVM. Basically, the trick is very simple, let’s add a slack variable zeta to the constraints of the optimization problem, it now becomes this equation with zeta. By adding these slack variables, when minimizing the objective function, it is now possible to satisfy the constraint even if some outliers do not meet the original constraint.

The problem is we can always choose a large enough value of zeta so that all the examples will satisfy the constraints. One technique to handle this problem is to use regularization. For example, we could use L1 regularization to penalize the large value of zeta. Now the regularized optimization problem becomes the following expression, notice that zeta here is always a positive number. And the regularization parameter C determines how important zeta should be, which means how much we want to avoid misclassifying each training example.

A smaller c emphasizes the importance of zeta and a larger C diminishes the importance of zeta, if we set c to positive infinite, we will get the same optimization result as the hard margin SVM. So small values of c will result in a wider margin, at the cost of some misclassifications, large values of c will result in the smaller margin and less tolerant to outliers.

Now, the soft margin SVM can handle the non-linearly separable data with some outliers and noise, what if the non-linearly separable data is not caused by the noise？ What if the data are characteristically non linearly separable like the following case？ In this part, we will talk about a technique called kernel trick to deal with this kind of problem.

For this example, it looks like there is no easy way to draw a line to separate the positive samples from the negative samples. However, if we transform the data from one space to another space, we would be able to find a hyperplane to separate the data. Now let’s revisit our lagaragin expression, we find that the optimization only depends on the dot product of the sample x sub i and x sub j.

So we could use any transformation function that can be expressed as the dot product of x sub i and x sub j. We call this function kernel function. So the kernel trick is to define a kernel function K (xi,xj) and use it to replace the x sub i dotted x sub j in the lagrangian expression. This is a small but powerful trick. For this problem, we could write the kernel as x one sub i square plus x two sub i square. If we plot these samples in the new space, a clear separation is visible and a line can be drawn to separate the positive samples from the negative samples.

Here I will also show two popular kernels, polynomial kernel, and RBF kernel. The polynomial kernel is defined as the following expression. This kernel contains two parameters, a constant c and a degree of freedom d. A larger value of d will make the decision boundary more complex and might result in overfitting.

The RBF kernel is defined as the following expression: The RBF kernel is also called Gaussian kernel. It will result in a more complex gaussian boundary, It has one parameter gama. Here are some examples for SVM with RBF kernel of different gamma values. The simulation is done with scikit learn, A small value of gamma will make the model behave like a linear SVM without kernel.

A large value of gamma will make the model heavily impacted by the support vectors and may result in overfitting.