Support vector machines (SVM) are a set of supervised learning models which can be used for binary data classification problems. They can be trained on labeled data sets describing objects as belonging to one category or another and can then be used to assign new unseen objects into one of the two categories. The objects are usually described via numerical vectors but other representations can also be used. The algorithm can also be used for classifications into multiple categories by reducing the multiple-class problem to multiple two-class problems.

This is a subject I ran into quite recently and this article is my attempt at trying to get a better understanding of the algorithms and the math behind them. I will go step by step through most of the mathematics necessary to derive support vectors and will try to explain how everything fits together.

Down the page you will find an interactive demo of SVM which is capable of classifying 2D points. You are able to add/remove points manually as well as select other customization options and see the SVM recalculate everything in real time.

In very simple terms what an SVM tries to accomplish is calculate "something" which separates points labeled in one of two ways into two distinct groups. It then classifies new points as belonging to one group/label or another based on which side of the "something" the points find themselves.

In the figure below (source: Wikipedia) we have white points and black points, and a bunch of lines are drawn to try and separate the points. Of the three lines, only the red and blue ones do their job. The green line is not acceptable because on the right-hand side of it we can find both black and white points.

We can however tell just by looking at the picture that the red line is somewhat better at its job than the blue one. The blue one gets awfully close to some of the points, which raises the question: if another white point is observed close to the border, will it be classified together with the other white points or misclassified as black? The red line generalizes better.

So how do we go about finding something which looks like the red line?

Well, first of all, the concept of a separating line is only valid if what we are trying to separate are 2D points. The "something" I mentioned earlier is not necessarily a line. More generally, it is called a hyperplane. In 1D this is a point. In 2D, a line. In 3D a plane (not an aircraft). In n dimensions a hyperplane. Note that in general we try to classify n-dimensional points. A hyperplane is defined as

$f(x) = \vec{w} \cdot \vec{x} + b = 0$

where $\vec{w}$ is the vector normal to the hyperplane (i.e. perpendicular to the hyperplane) and $b$ is something called a bias. To better understand where this is coming from, go back to 2D space and you'll see this is just another representation of a line. Take $\vec{w} = (-a, 1)$ and $\vec{x} = (x, y)$ and you get

$\vec{w} \cdot \vec{x} = y - ax$

$\vec{w} \cdot \vec{x} - b = y - ax - b$

Note that both $+b$ and $-b$ are used depending on who is doing the writing. Both are correct, as long as you are consistent with your notation. I will be using plus.

We know that for any $x$ belonging to the hyperplane $f(x)$ is going to be zero. We also know that for points on one side of the hyperplane $f(x) > 0$ and for points on the other side $f(x) < 0$. This is how we are going to decide in the future which class new unlabeled points belong to. So now we must find $f(x)$. The unknowns here are $\vec{w}$ and $b$.
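As a quick illustration of how the sign of $f(x)$ will eventually be used, here is a minimal sketch of the decision rule in Python (the weight vector and bias below are made-up illustrative values, not the result of any training):

```python
# Classify a point by the sign of f(x) = w . x + b.
def decision(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f >= 0 else -1

w = [1.0, -1.0]   # normal vector of the hyperplane (assumed)
b = 0.0           # bias (assumed); the hyperplane here is the line y = x

print(decision(w, b, [2.0, 1.0]))   # point below y = x -> 1
print(decision(w, b, [1.0, 2.0]))   # point above y = x -> -1
```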

We know that for points on one side of the hyperplane (labeled for example with plus)

$f(x_{+}) = \vec{w} \cdot \vec{x_{+}} + b > 0$

What we are going to demand from the optimal hyperplane is that for points on the plus labeled side of it we have

$f(x_{+}) = \vec{w} \cdot \vec{x_{+}} + b \geq \delta$

By convention $\delta = 1$. The idea is that our $f(x)$ is a discriminant function where we are only interested in its sign. We can select infinitely many forms of $f(x)$ by scaling $\vec{w}$ and $b$ but to keep the maths neat we choose $\delta = 1$. This gives rise to the following equations (where plus is one label and minus another):

$\vec{w} \cdot \vec{x_{+}} + b \geq +1$

$\vec{w} \cdot \vec{x_{-}} + b \leq -1$

If we imagine the separated groups as divided by a two-lane street then the hyperplane $f(x) = 0$ runs down the middle of the street, and these two equations demand that points lie on either side of the street (i.e. beyond the margins of the street, clear of the lanes).

If we introduce a new variable $y_{i}$ associated with each $x_{i}$, taking the value $+1$ when $x_{i}$ is of one label and $-1$ when it is of the other, then we can reduce the two equations above to:

$y_i (\vec{x_i} \cdot \vec{w} + b) - 1 \geq 0$
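To make the combined constraint concrete, here is a small sketch checking $y_i (\vec{x_i} \cdot \vec{w} + b) - 1 \geq 0$ against a toy data set (the hyperplane and points are illustrative assumptions, not a trained result):

```python
# True when the labeled sample (x, y) sits clear of the street margins.
def margin_ok(w, b, x, y):
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return y * f - 1 >= 0

w, b = [1.0, 0.0], 0.0  # assumed hyperplane: the vertical axis
samples = [([2.0, 0.5], +1), ([-3.0, 1.0], -1), ([0.5, 0.0], +1)]
for x, y in samples:
    print(x, y, margin_ok(w, b, x, y))  # last point is inside the street
```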

Why did we bother building a whole street to separate some points when a line would've been enough? Because we want to leave as much space between the labeled groups as possible. In other words we want as wide of a street as possible. But what is the width of the street?

Notice that for points lying precisely at the edge of the street (in the gutters if you will) the equation above is precisely equal to zero:

$y_i (\vec{x_i} \cdot \vec{w} + b) - 1 = 0 \ \ \ \ (1)$

We also know that if we take one negative sample lying in the gutter and one positive sample lying in the gutter we can calculate the width of the street as follows:

$W = (\vec{x_{+}} - \vec{x_{-}}) \cdot \frac{\vec{w}}{||\vec{w}||} \ \ \ \ (2)$

Note that $(\vec{x_{+}} - \vec{x_{-}})$ is simply giving us a vector going from a negative sample in the gutter to a positive sample in the gutter. To get the width of the street we must dot that vector with a unit vector normal to the street. We are using $\frac{\vec{w}}{||\vec{w}||}$ as our normal unit vector because we know $\vec{w}$ is normal to the hyperplane and dividing it by its magnitude makes it a unit vector.

Note that we can use equation $(1)$ to simplify equation $(2)$: for a positive sample in the gutter $\vec{w} \cdot \vec{x_{+}} = 1 - b$ and for a negative one $\vec{w} \cdot \vec{x_{-}} = -1 - b$. This gives us:

$W = \frac{(1 - b) - (-1 - b)}{||\vec{w}||} = \frac{2}{||\vec{w}||}$
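The width formula is simple enough to sketch directly (the weight vector here is an arbitrary example):

```python
import math

# Street width W = 2 / ||w|| for a given weight vector.
def street_width(w):
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(street_width([3.0, 4.0]))  # ||w|| = 5, so the width is 0.4
```

Shrinking $||\vec{w}||$ widens the street, which is exactly the optimization that follows.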

Because we said we want a street as wide as possible, it follows that we must try to maximize $W$. But this is equivalent to minimizing $||\vec{w}||$. Our problem then becomes:

Minimize $||\vec{w}||$ subject to $y_i (\vec{x_i} \cdot \vec{w} + b) - 1 \geq 0$. How do we do this?

Solving the minimization problem can be done using Lagrange multipliers. Generically, a Lagrangian function has the form $L$ = the function whose extremum we seek, minus a sum over all the constraints, each weighted by its own Lagrange multiplier $\alpha_i$. The function we are trying to minimize is $||\vec{w}||$ or, equivalently, $\frac{1}{2} ||\vec{w}||^2$. The latter form is preferred because it gives us nicer results later when we have to take some derivatives. The constraint was given above. The Lagrangian function is:

$L_P = \frac{1}{2} ||\vec{w}||^2 - \sum_{i}^{l} \alpha_i y_i (\vec{x_i} \cdot \vec{w} + b) + \sum_{i}^{l} \alpha_i$

To solve this we must take the derivatives of $L$ with respect to the unknowns $\vec{w}$ and $b$ and set them to zero.

$\frac{\partial L}{\partial w} = \vec{w} - \sum \alpha_i y_i x_i = 0$

$\frac{\partial L}{\partial b} = \sum \alpha_i y_i = 0$

An interesting result which stems from these equations is that in fact

$\vec{w} = \sum \alpha_i y_i x_i$

Plugging this back into $L_P$ gives us the following:

$L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \vec{x_i} \cdot \vec{x_j}$

where indexes $i$ and $j$ are used to differentiate between the summations which were in $L_P$ originally and the new summation we substituted into the equation. Note this last equation is what's called the Lagrangian dual, and maximizing it is equivalent to minimizing the original (primal) Lagrangian.
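To see what the dual looks like numerically, here is a sketch evaluating $L_D$ for given multipliers (the $\alpha$ values below are illustrative, not the solution of any optimization):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# L_D = sum(alpha) - 1/2 * sum over i,j of alpha_i alpha_j y_i y_j (x_i . x_j)
def dual(alpha, y, X):
    s = sum(alpha)
    for i in range(len(X)):
        for j in range(len(X)):
            s -= 0.5 * alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
    return s

X = [[1.0, 0.0], [0.0, 1.0]]
y = [+1, -1]
alpha = [0.5, 0.5]   # arbitrary illustrative multipliers
print(dual(alpha, y, X))  # -> 0.75
```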

But why did we bother with the Lagrangian dual? Because, as you might notice, it only relates to our original samples via dot products between pairs of $x$ samples. Both $\vec{w}$ and $b$ disappear, and we'll later see we can replace this dot product with any other function of the two $x$ vectors (a kernel function) for better results in certain situations.

The next question of course is how we actually maximize this $L_D$. There are various quadratic programming methods which can be used to solve this problem, but that subject area is something I know little about. However, a specific solution I was able to find and actually follow is Sequential Minimal Optimization (SMO), a technique invented by John Platt at Microsoft Research for solving precisely this problem. You can find his paper here, the wiki page here, and a simplified SMO version here.

Using SMO to solve this problem will in the end give us the multipliers $\alpha_i$, from which we can recover the unknowns $\vec{w}$ and $b$ we were looking for. These two determine the decision function $f(x)$ we set out to find.
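A sketch of that last step, assuming the multipliers have already been found. This is not SMO itself, just the recovery of $\vec{w}$ and $b$ from the $\alpha_i$ using $\vec{w} = \sum \alpha_i y_i x_i$ and equation $(1)$ applied to a support vector (a point with $\alpha_i > 0$):

```python
def recover_w_b(alpha, y, X):
    n_dim = len(X[0])
    # w = sum_i alpha_i * y_i * x_i, computed component by component
    w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X)))
         for d in range(n_dim)]
    # any support vector lies in the gutter, so y_i (w . x_i + b) = 1,
    # which for y_i in {+1, -1} gives b = y_i - w . x_i
    i = next(i for i, a in enumerate(alpha) if a > 0)
    b = y[i] - sum(wd * xd for wd, xd in zip(w, X[i]))
    return w, b

# Toy symmetric problem where the known solution has alpha = [0.5, 0.5]:
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [+1, -1]
print(recover_w_b([0.5, 0.5], y, X))  # -> ([1.0, 0.0], 0.0)
```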

An issue with our method so far is that for points which are not linearly separable we will not be able to determine our decision function. In our Lagrangian dual the coefficients will grow arbitrarily large and we will not be able to find the optimal hyperplane (because there isn't one). To solve this we will introduce some new variables which control how large the coefficients can grow. This will allow us to actually determine a hyperplane, with the added drawback that some points will be misclassified.

For a more in depth description of this non separable case see this paper. It's actually a full description of how SVM works and quite easy to follow.

To keep it short, what happens is we introduce a new "slack" variable $\xi$ which relaxes our initial street margin conditions (i.e. some points will in fact be allowed to exist on the wrong side of the road). Our constraints then become:

$\vec{w} \cdot \vec{x_{+}} + b \geq +1 - \xi_i$

$\vec{w} \cdot \vec{x_{-}} + b \leq -1 + \xi_i$

$\xi_i \geq 0$

A way to then incorporate these errors into our minimization problem is to add them to our original function $\frac{1}{2}||\vec{w}||^2$ like so:

$\frac{1}{2}||\vec{w}||^2 + C \sum \xi_i$

where $\sum \xi_i$ is an upper bound on the number of training errors and $C$ is a parameter to be chosen by the user, with larger values of $C$ corresponding to a higher penalty assigned to errors.
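Since the constraints above imply $\xi_i = \max(0, 1 - y_i(\vec{w} \cdot \vec{x_i} + b))$ for the smallest feasible slack, the penalized objective can be sketched as follows (the hyperplane and points are illustrative assumptions):

```python
# 1/2 ||w||^2 + C * sum of slacks, where each slack measures how far a
# sample sits on the wrong side of its margin.
def soft_margin_objective(w, b, X, y, C):
    slacks = [max(0.0, 1 - yi * (sum(wd * xd for wd, xd in zip(w, x)) + b))
              for x, yi in zip(X, y)]
    return 0.5 * sum(wd * wd for wd in w) + C * sum(slacks)

X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
# the third point violates its margin with slack 0.5
print(soft_margin_objective([1.0, 0.0], 0.0, X, y, C=1.0))  # -> 1.0
```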

What's interesting about this approach is that our optimization problem changes very little. Our Lagrangian dual will still look the same. What happens is that we now have an upper bound on the Lagrange multipliers. This bound means the coefficients will not grow arbitrarily large and a solution will always be found. Our new equations then become:

$L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \vec{x_i} \cdot \vec{x_j}$

subject to

$\frac{\partial L}{\partial w} = \vec{w} - \sum \alpha_i y_i x_i = 0$

$\frac{\partial L}{\partial b} = \sum \alpha_i y_i = 0$

and (only new thing)

$0 \leq \alpha_i \leq C$

This equation is solved just like before to obtain our classifier function (with the exception that this time some points may lie on the wrong side of the road).

Another thing to note about $x$ vectors which are not linearly separable is that, while they may not be separable in their current space, they may be separable in another space.

For example some points may look one way (inseparable) in 2D but look differently (separable) when projected into 3D. Remember how I said earlier what's cool about that Lagrangian is that we can replace the dot product? What we replace it with is called a kernel function. A kernel function computes the dot product of the feature vectors (our $x$ values) as if they had been mapped into a different space, one where the points may become separable.

See the figure below for a representation of this idea (source here).

So using a kernel function our Lagrangian becomes

$L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$

and is solved just like before. The difference is this time our separators will probably be more "creative", i.e. not as boring as a simple straight line, but curves, spheres etc.

Some popular kernel choices are the polynomial kernel

$K(\vec{x_i}, \vec{x_j}) = (\vec{x_i} \cdot \vec{x_j} + c)^n$

and the radial basis function (RBF) kernel

$K(\vec{x_i}, \vec{x_j}) = e^{-\frac{||\vec{x_i} - \vec{x_j}||^2}{2\sigma^2}}$

Note that not every function can be a kernel function, and choosing one seems to be a matter of experimentation, cross-validation, and domain-specific knowledge about the data the algorithm is trained on.
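A sketch of the two kernels above plus the plain dot product, written as simple functions (the parameter defaults are arbitrary choices for illustration):

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def poly_kernel(xi, xj, c=1.0, n=2):
    return (dot(xi, xj) + c) ** n

def rbf_kernel(xi, xj, sigma=1.0):
    # note the squared distance between the vectors, not a dot product
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(linear_kernel([1, 2], [3, 4]))  # -> 11
print(poly_kernel([1, 2], [3, 4]))    # (11 + 1)^2 -> 144
print(rbf_kernel([1, 2], [1, 2]))     # identical points -> 1.0
```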

In this section you will find a small interactive JS demo of SVM in 2D. You can add points to the white canvas by clicking on it. You can select the label of each point by clicking either the plus or the minus button before adding the point. Once you have added at least one point from each category they will be classified according to your other inputs.

You can select what kernel function you want to use. Linear represents the simple dot product. RBF stands for radial basis function and is the kernel written above which contains $e$. The other kernel is the polynomial one and is the other kernel written above. For this demo I've set $c = 1$. You can customize $\sigma$ for the RBF kernel and $n$ for the polynomial kernel.


The code actually computing the classifier is an implementation of the SMO algorithm mentioned above. Credits for the SMO implementation go to Andrej Karpathy who published this code under the MIT license.

SVM is inherently a binary classifier: an object can belong to either one label or the other. However, we can also use it for multiclass problems. This is usually done by reducing the multiclass problem to multiple two-class problems. Two popular ways of doing this are:

- One-vs-all
- One-vs-one

The first method involves building as many classifiers as there are labels and then running a new element through all of them (each classifier tells whether an item belongs to one specific label or to any of the others). The predicted class is given by the classifier which selects its class with the highest margin.

The second method requires creating one classifier for every possible pair of labels. The predicted class is the one selected by the most classifiers.
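The one-vs-one voting scheme can be sketched as follows; the pairwise classifiers here are hypothetical stand-ins (nearest-center rules on 1D points), not trained SVMs:

```python
from collections import Counter
from itertools import combinations

# One binary classifier per label pair; the prediction is the label
# winning the most pairwise votes.
def one_vs_one_predict(x, labels, pairwise_predict):
    votes = Counter()
    for a, b in combinations(labels, 2):
        votes[pairwise_predict(a, b, x)] += 1
    return votes.most_common(1)[0][0]

# Toy stand-in classifiers: pick whichever label's center is closer.
centers = {"A": 0.0, "B": 5.0, "C": 10.0}
def pairwise_predict(a, b, x):
    return a if abs(x - centers[a]) <= abs(x - centers[b]) else b

print(one_vs_one_predict(4.0, ["A", "B", "C"], pairwise_predict))  # -> B
```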

A paper comparing various methods for multiclass classification in SVMs can be read here. Based on this paper, libsvm, a popular SVM implementation, uses the one-vs-one approach.