Support Vector Machine

Introduction

Support Vector Machines are a class of supervised learning methods primarily used for classification, although they can also be formulated for regression and outlier detection. Instead of optimizing a set of parameters which compress or summarize the training set, they use a small subset of the training data to compute the decision function.

They rely on the data being linearly separable, so feature transformations are critical for problems in which the original representation of the data is not linearly separable.

Maximum Margin Classifier

Let’s start with a simple classification model as we studied with Logistic Regression. That is, we have

$$f(x) = w^T \phi(x),$$

where $\phi(x)$ is a function which transforms our original input into some new feature space. The transformed input is assumed to be linearly separable so that a decision boundary can be computed. In the original logistic regression problem, a decision boundary was found through optimization. For linearly separable data, there are infinitely many decision boundaries that separate the classes.

What about the quality of the decision boundary?

Is one decision boundary better than the other?

Formulation

Given a training set $\{x_1, \ldots, x_n\}$ with labels $\{y_1, \ldots, y_n\}$, where $y_i \in \{-1, 1\}$, we construct a linear model which classifies an input sample depending on the sign of the output.

Our decision rule for classification, given some input $x$, is

$$f(x) = \begin{cases} 1 & \text{if } w^T x + b \geq 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases}$$
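
As a quick illustration, here is a minimal sketch of this decision rule in Python; the weight vector `w` and bias `b` below are arbitrary placeholder values rather than learned parameters.

```python
import numpy as np

def predict(w, b, x):
    """Classify a sample by the sign of the linear score w^T x + b."""
    return 1 if w @ x + b >= 0 else -1

# Hand-picked parameters, for illustration only.
w = np.array([1.0, -1.0])
b = 0.5
print(predict(w, b, np.array([2.0, 0.0])))   # 1  (score =  2.5)
print(predict(w, b, np.array([-1.0, 3.0])))  # -1 (score = -3.5)
```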

How large should the margin be?

In the original formulation of Logistic Regression, we saw that the parameter vector $w$ described the normal to the decision boundary. The distance between a given point $x_i$ and the decision boundary is given by

$$\frac{y_i f(x_i)}{\|w\|}.$$
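
As a sanity check on this formula, here is a small Python sketch; the parameters and the point are made up for illustration.

```python
import numpy as np

def margin_distance(w, b, x, y):
    """Distance of a correctly classified point (x, y) from the hyperplane
    w^T x + b = 0, i.e. y * f(x) / ||w||."""
    return y * (w @ x + b) / np.linalg.norm(w)

# Hand-picked values: f(x) = 3 + 4 - 1 = 6 and ||w|| = 5, so the distance is 1.2.
w, b = np.array([3.0, 4.0]), -1.0
print(margin_distance(w, b, np.array([1.0, 1.0]), 1))  # 1.2
```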

We can frame this as an optimization problem: come up with values for $w$ and $b$ that maximize the margin.

$$\arg\max_{w, b} \frac{1}{\|w\|} \min_i \left[ y_i \left( w^T \phi(x_i) + b \right) \right]$$

We can arbitrarily scale the parameters, so we add an additional constraint that any point that lies on the boundary of the margin satisfies

$$y_i(w^T x_i + b) = 1.$$

Under this constraint, we have that all samples satisfy

$$y_i(w^T x_i + b) \geq 1.$$

That is, positive samples with target $1$ will produce an output of at least $1$, so the product is greater than or equal to $1$. Negative samples with target $-1$ will produce an output of at most $-1$, so the product is again greater than or equal to $1$.

Another way of writing this is

$$f(x) = \begin{cases} 1 & \text{if } w^T x_+ + b \geq 1 \\ -1 & \text{if } w^T x_- + b \leq -1, \end{cases}$$

where $x_+$ is a positive sample and $x_-$ is a negative sample. This pair of constraints can then be written as

$$y_i(w^T x_i + b) - 1 \geq 0.$$

The only samples for which this expression equals $0$ are those that lie directly on the margins of the decision boundary.

Given this constraint of $y_i(w^T x_i + b) - 1 = 0$ on the margin, we can derive our optimization objective.

The margin can be computed via the training data. To do this, consider two data points which lie on their respective margin boundaries, one positive and one negative, and compute the distance between them: $x_+ - x_-$. This distance with respect to our decision boundary, defined by $w$, is given by

$$(x_+ - x_-) \cdot \frac{w}{\|w\|}.$$

For clarity, we can rewrite this as

$$\frac{1}{\|w\|}\left(w^T x_+ - w^T x_-\right).$$

If we substitute the sample values into the equality constraint above, we can simplify this form. For the positive sample, we have $w^T x_+ = 1 - b$. For the negative sample, we get $w^T x_- = -1 - b$. The equation above then becomes

$$\frac{1}{\|w\|}\left((1 - b) - (-1 - b)\right) = \frac{2}{\|w\|}.$$
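
A quick numerical check of this result, with hand-picked parameters and margin points chosen to satisfy the equality constraints:

```python
import numpy as np

# Assumed parameters: w = [2, 0], b = 0, so 2 / ||w|| = 1.
w, b = np.array([2.0, 0.0]), 0.0
x_pos = np.array([0.5, 0.0])    # satisfies w^T x + b =  1
x_neg = np.array([-0.5, 0.0])   # satisfies w^T x + b = -1

width_from_points = (x_pos - x_neg) @ w / np.linalg.norm(w)
width_from_formula = 2 / np.linalg.norm(w)
print(width_from_points, width_from_formula)  # 1.0 1.0
```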

Thus, our objective is to maximize $\frac{2}{\|w\|}$, which is equivalent to minimizing $\frac{1}{2}\|w\|^2$ subject to the constraints $y_i(w^T x_i + b) \geq 1$. This is a constrained optimization problem. As discussed previously, we can simplify such problems by introducing Lagrange multipliers. Doing this produces the Lagrangian of our optimization objective:

$$L = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i \left( w^T x_i + b \right) - 1 \right).$$

To solve for $w$, we compute $\nabla_w L$:

$$\nabla_w L = w - \sum_{i=1}^{n} \alpha_i y_i x_i.$$

Setting this to 0 yields

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$

Doing the same for the other parameter $b$ yields

$$0 = \sum_{i=1}^{n} \alpha_i y_i.$$

We can now simplify our objective function by substituting these results into it:

$$
\begin{aligned}
L &= \frac{1}{2}\left\lVert\sum_{i=1}^{n} \alpha_i y_i x_i\right\rVert^2 - \sum_{i=1}^{n} \alpha_i \left( y_i \left( \Big( \sum_{j=1}^{n} \alpha_j y_j x_j \Big)^T x_i + b \right) - 1 \right) \\
&= \frac{1}{2}\left\lVert\sum_{i=1}^{n} \alpha_i y_i x_i\right\rVert^2 - \left\lVert\sum_{i=1}^{n} \alpha_i y_i x_i\right\rVert^2 - \sum_{i=1}^{n} \alpha_i y_i b + \sum_{i=1}^{n} \alpha_i \\
&= -\frac{1}{2}\left\lVert\sum_{i=1}^{n} \alpha_i y_i x_i\right\rVert^2 + \sum_{i=1}^{n} \alpha_i \\
&= \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
\end{aligned}
$$

The term involving $b$ vanishes because of the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$ derived above.
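
To make this concrete, here is a minimal sketch that solves the dual numerically for a tiny made-up dataset, using `scipy.optimize.minimize` as a generic stand-in for a proper quadratic programming solver (real SVM libraries use specialized algorithms such as SMO):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (assumed data, for illustration only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Q_ij = y_i y_j x_i . x_j, so the dual is sum(a) - 0.5 * a^T Q a.
Q = np.outer(y, y) * (X @ X.T)

def neg_dual(a):
    # Negate because scipy minimizes and we want to maximize the dual.
    return 0.5 * a @ Q @ a - a.sum()

result = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    bounds=[(0, None)] * len(y),                         # alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
)
alpha = result.x

# Recover w from the stationarity condition and b from one support vector.
w = (alpha * y) @ X
sv = np.argmax(alpha)
b = y[sv] - w @ X[sv]
print(alpha.round(3), w.round(3), b)
```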

Thus, the objective depends on the data only through the inner products of samples $x_i$ and $x_j$. If these were representations in some complex feature space, our problem would remain computationally inefficient. However, we can take advantage of kernels for this.
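
Since the dual only touches the data through these inner products, we can replace $x_i^T x_j$ with a kernel function $k(x_i, x_j)$ that computes an inner product in some implicit feature space without ever constructing $\phi(x)$. A minimal sketch (the data and the value of `gamma` are arbitrary choices):

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian (RBF) kernel: an inner product in an implicit feature space."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

# Building the Gram matrix with the kernel in place of plain dot products.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K = np.array([[rbf_kernel(a, c) for c in X] for a in X])
print(K.round(3))
```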

Note that, in most cases, $\alpha_i$ will be $0$; only the support vectors, the points that lie on the margins of the decision boundary, have nonzero multipliers and contribute to the solution.
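
For reference, a library such as scikit-learn exposes exactly these quantities after fitting. The sketch below reuses the toy data from the dual example above; a very large penalty value `C` is used so that the soft-margin solver approximates the hard-margin problem.

```python
import numpy as np
from sklearn.svm import SVC

# Same toy data as in the dual sketch above (assumed, for illustration).
X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A very large penalty C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_vectors_)   # only the points lying on the margins
print(clf.dual_coef_)         # alpha_i * y_i for the support vectors only
```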

Overlapping Class Distributions

The above formulation works for datasets that have no overlap in feature space; that is, they are completely linearly separable. However, this will not always be the case.

To account for misclassifications while still maximizing the margin between classes, we introduce a penalty value for points that are misclassified. As long as there aren’t too many misclassifications, this penalty will stay relatively low while still allowing us to come up with an optimal solution.

This penalty comes in the form of a slack variable $\xi_i \geq 0$ for each sample, which is $0$ for points that are on or inside the correct margin boundary and $\xi_i = |y_i - f(x_i)|$ for all others. If a point is misclassified, its slack variable satisfies $\xi_i > 1$.
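
A small sketch of how the slack values behave, using hand-picked parameters and points; note that for $y_i \in \{-1, 1\}$ the definition above is equivalent to $\xi_i = \max(0, 1 - y_i f(x_i))$:

```python
import numpy as np

def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * f(x)): zero for points with y * f(x) >= 1,
    between 0 and 1 for points inside the margin but correctly classified,
    and greater than 1 when the point is misclassified."""
    return max(0.0, 1.0 - y * (w @ x + b))

# Hand-picked parameters and points, for illustration only.
w, b = np.array([1.0, 0.0]), 0.0
print(slack(w, b, np.array([2.0, 0.0]), 1))    # 0.0 (outside the margin, correct)
print(slack(w, b, np.array([0.5, 0.0]), 1))    # 0.5 (inside the margin, correct)
print(slack(w, b, np.array([-1.0, 0.0]), 1))   # 2.0 (misclassified)
```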

Multiclass SVM

Like our simple Logistic Regression method, SVMs are binary classifiers by default. We can extend them to multiple classes in a similar way, but each approach has downsides.

The “one-vs-all” approach entails building $K$ classifiers, one per class, and choosing the class whose classifier predicts the input with the greatest margin.

The “one-vs-one” approach involves building $\frac{K(K-1)}{2}$ classifiers. In this case, training each classifier is more tractable since each one only requires the data from its two classes. For example, you would have a model for class 1 vs 2, class 1 vs 3, …, class 1 vs $K$. Then repeat for class 2: 2 vs 3, 2 vs 4, …, 2 vs $K$, and so on. Both schemes are sketched below.
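
Both schemes are available as wrappers in scikit-learn, which makes the difference in the number of underlying classifiers easy to see; the toy data below is made up for illustration:

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Assumed toy data with K = 4 classes, for illustration only.
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [0, 5], [1, 5], [5, 0], [6, 0]],
             dtype=float)
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)    # K binary classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)     # K(K-1)/2 binary classifiers
print(len(ovr.estimators_), len(ovo.estimators_))   # 4 6
```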

A third approach is to construct a single model using a feature vector that depends on both the data and the class label. When given a new input, the model computes

$$\hat{y} = \arg\max_{y} w^T \phi(x, y).$$

The margin for this classifier is the gap between the score of the correct class and that of the closest competing class.

Additional Resources