Support Vector Machine
Introduction
Support Vector Machines are a class of supervised learning methods primarily used for classification, although they can also be formulated for regression and outlier detection. Instead of optimizing a set of parameters which compress or summarize the training set, they use a small subset of the training data to compute the decision function.
The basic formulation relies on the data being linearly separable, so feature transformations are critical for problems in which the original representation of the data is not.
Maximum Margin Classifier
Let's start with a simple classification model as we studied with Logistic Regression. That is, we have

$$
y(\mathbf{x}) = \mathbf{w}^{\mathsf{T}}\mathbf{x} + b,
$$

where $\mathbf{w}$ is the weight vector and $b$ is the bias. An input $\mathbf{x}$ is assigned to one class if $y(\mathbf{x}) > 0$ and to the other if $y(\mathbf{x}) < 0$, so the decision boundary is the hyperplane where $y(\mathbf{x}) = 0$.
What about the quality of the decision boundary? For a linearly separable dataset there are many boundaries that classify every training point correctly. Is one decision boundary better than another?
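As a quick illustration (a minimal sketch on a hypothetical, synthetic dataset; the boundaries and NumPy code are illustrative, not part of these notes), the following compares two hand-picked boundaries that both separate the two clusters but sit at very different distances from the nearest samples:

```python
# Toy 2-D dataset: two well-separated clusters.  Two different linear
# boundaries typically classify every training point correctly here,
# but they leave very different amounts of room to the closest samples.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.4, size=(20, 2)),     # positive class
               rng.normal([-2.0, -2.0], 0.4, size=(20, 2))])  # negative class
t = np.hstack([np.ones(20), -np.ones(20)])

def min_distance_to_boundary(w, b):
    """Smallest signed distance from any training sample to the plane w.x + b = 0."""
    return np.min(t * (X @ w + b)) / np.linalg.norm(w)

candidates = {
    "boundary A": (np.array([1.0, 1.0]), 0.0),   # roughly centered between clusters
    "boundary B": (np.array([1.0, 0.1]), -0.8),  # skewed toward the positive cluster
}
for name, (w, b) in candidates.items():
    accuracy = np.mean(np.sign(X @ w + b) == t)
    print(f"{name}: accuracy={accuracy:.2f}, "
          f"distance to closest sample={min_distance_to_boundary(w, b):.2f}")
```

Both boundaries can score the same accuracy on the training data; what distinguishes them is how much room they leave to the nearest points, which is the question the maximum margin classifier answers.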
Formulation
Given a training set $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$ of $N$ samples, where each target $t_n \in \{-1, +1\}$ indicates the class of $\mathbf{x}_n$, we want to find a decision boundary that separates the two classes.
Our decision rule for classification, given some input $\mathbf{x}$, is to predict the class $\operatorname{sign}(y(\mathbf{x})) = \operatorname{sign}(\mathbf{w}^{\mathsf{T}}\mathbf{x} + b)$.
How large should the margin be?
In the original formulation of Logistic Regression, we saw that the parameter vector $\mathbf{w}$ is orthogonal to the decision boundary, so the signed distance from a sample $\mathbf{x}_n$ to the boundary is $y(\mathbf{x}_n)/\lVert\mathbf{w}\rVert$.
We can frame this as an optimization problem: come up with values for $\mathbf{w}$ and $b$ that maximize the margin, that is, the distance from the decision boundary to the closest training sample.
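One compact way of writing this objective (a standard restatement, using the notation introduced above) is

$$
\mathbf{w}^{*}, b^{*} \;=\; \arg\max_{\mathbf{w},\, b}\; \left\{ \frac{1}{\lVert\mathbf{w}\rVert} \min_{n} \bigl[\, t_n \bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b\bigr) \bigr] \right\},
$$

where the inner minimum picks out the training sample closest to the boundary.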
We can arbitrarily scale the parameters without changing the decision boundary, so we add an additional constraint that any point that lies on the boundary of the margin satisfies

$$
t_n\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b\bigr) = 1.
$$
Under this constraint, we have that all samples satisfy $y(\mathbf{x}_n) \geq +1$ or $y(\mathbf{x}_n) \leq -1$, depending on their class.
That is, all positive samples with target $t_n = +1$ satisfy $\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b \geq +1$, while all negative samples with target $t_n = -1$ satisfy $\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b \leq -1$.
Another way of writing this is

$$
t_n\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b\bigr) - 1 \geq 0,
$$

where $t_n \in \{-1, +1\}$ is the target value of sample $n$.
This implies that the only samples for which this expression equals 0, i.e. for which the constraint holds with equality, are those that lie directly on the margins of the decision boundary.
Given this constraint of $t_n\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b\bigr) - 1 \geq 0$, how large is the margin that it produces?
The margin can be computed via the training data. To do this, consider two data points which lie on their respective margin boundaries, a positive sample $\mathbf{x}_+$ and a negative sample $\mathbf{x}_-$, and compute the distance between them along the direction of $\mathbf{w}$:

$$
\text{margin} = \frac{\mathbf{w}^{\mathsf{T}}}{\lVert\mathbf{w}\rVert}\bigl(\mathbf{x}_+ - \mathbf{x}_-\bigr).
$$
For clarity, we can rewrite this as

$$
\text{margin} = \frac{\mathbf{w}^{\mathsf{T}}\mathbf{x}_+ - \mathbf{w}^{\mathsf{T}}\mathbf{x}_-}{\lVert\mathbf{w}\rVert}.
$$
If we substitute the sample values into the equality constraint above, we can simplify this form. For the positive sample, we have $\mathbf{w}^{\mathsf{T}}\mathbf{x}_+ + b = 1$, so $\mathbf{w}^{\mathsf{T}}\mathbf{x}_+ = 1 - b$; for the negative sample, $\mathbf{w}^{\mathsf{T}}\mathbf{x}_- + b = -1$, so $\mathbf{w}^{\mathsf{T}}\mathbf{x}_- = -1 - b$. Substituting these gives

$$
\text{margin} = \frac{(1 - b) - (-1 - b)}{\lVert\mathbf{w}\rVert} = \frac{2}{\lVert\mathbf{w}\rVert}.
$$
Thus, our objective is to maximize $\frac{2}{\lVert\mathbf{w}\rVert}$, which is equivalent to minimizing $\frac{1}{2}\lVert\mathbf{w}\rVert^{2}$, subject to the constraints above.
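Putting the pieces together (a compact restatement of the objective and constraints above), the maximum margin classifier is the solution of the constrained optimization problem

$$
\min_{\mathbf{w},\, b}\; \frac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad \text{subject to} \quad
t_n\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b\bigr) \geq 1, \qquad n = 1, \dots, N.
$$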
To solve for $\mathbf{w}$ and $b$ under these constraints, we introduce a Lagrange multiplier $a_n \geq 0$ for each constraint and form the Lagrangian

$$
L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\lVert\mathbf{w}\rVert^{2} - \sum_{n=1}^{N} a_n \bigl[ t_n\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b\bigr) - 1 \bigr].
$$

Taking the derivative with respect to $\mathbf{w}$ gives

$$
\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{n=1}^{N} a_n t_n \mathbf{x}_n.
$$
Setting this to 0 yields

$$
\mathbf{w} = \sum_{n=1}^{N} a_n t_n \mathbf{x}_n.
$$
Doing the same for the other parameter $b$ gives the condition

$$
\sum_{n=1}^{N} a_n t_n = 0.
$$
We can now simplify our objective function by substituting these results into it:

$$
\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m\, \mathbf{x}_n^{\mathsf{T}}\mathbf{x}_m.
$$
Thus, the objective is dependent on the inner products of pairs of samples, $\mathbf{x}_n^{\mathsf{T}}\mathbf{x}_m$.
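For completeness, the resulting dual problem (a standard restatement consistent with the derivation above) is to maximize $\tilde{L}(\mathbf{a})$ subject to $a_n \geq 0$ and $\sum_{n} a_n t_n = 0$. Once the multipliers are found, a new input $\mathbf{x}$ is classified by substituting $\mathbf{w} = \sum_n a_n t_n \mathbf{x}_n$ back into the model:

$$
y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n\, \mathbf{x}_n^{\mathsf{T}}\mathbf{x} + b.
$$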
Note that, in most cases, most of the multipliers $a_n$ turn out to be exactly zero. The samples with $a_n > 0$ are called support vectors; they are the only samples needed to evaluate the decision function, which is why the model depends on only a small subset of the training data.
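As an illustrative check (a sketch using scikit-learn's SVC on a synthetic separable dataset, which is not part of the derivation above), we can recover $\mathbf{w} = \sum_n a_n t_n \mathbf{x}_n$ from the fitted dual coefficients and confirm that only a handful of samples become support vectors:

```python
# Linearly separable toy problem fit with scikit-learn's SVC.
# dual_coef_ stores a_n * t_n for the support vectors, so
# w = sum_n a_n t_n x_n can be recovered as dual_coef_ @ support_vectors_.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.4, size=(50, 2)),
               rng.normal([-2, -2], 0.4, size=(50, 2))])
t = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1e6).fit(X, t)   # large C approximates the hard margin

w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("support vectors per class:", clf.n_support_)    # typically a small subset
print("w (primal):   ", clf.coef_.ravel())
print("w (from dual):", w_from_dual.ravel())            # should match coef_
```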
Overlapping Class Distributions
The above formulation works for datasets that have no overlap in feature space, that is, datasets which are completely linearly separable. However, this will not always be the case.
To account for misclassifications while still maximizing the margin between the classes, we introduce a penalty value for points that are misclassified. As long as there aren't too many misclassifications, this penalty will stay relatively low while still allowing us to come up with an optimal solution.
This penalty comes in the form of a slack variable $\xi_n \geq 0$ for each training sample: $\xi_n = 0$ for samples on or inside the correct margin boundary, $\xi_n > 0$ for samples that violate it, and $\xi_n > 1$ for samples that are actually misclassified. The constraints become

$$
t_n\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n + b\bigr) \geq 1 - \xi_n,
$$

and the objective becomes

$$
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\; C\sum_{n=1}^{N} \xi_n + \frac{1}{2}\lVert\mathbf{w}\rVert^{2},
$$

where the parameter $C > 0$ controls the trade-off between the slack penalty and the size of the margin.
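To see the trade-off in practice, here is a short sketch using scikit-learn's SVC on a synthetic overlapping dataset; the library's C parameter plays the role of the penalty weight described above:

```python
# Overlapping classes: vary the slack penalty C and watch the number of
# support vectors (and the training accuracy) change.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1, 1], 1.0, size=(100, 2)),
               rng.normal([-1, -1], 1.0, size=(100, 2))])   # clusters overlap
t = np.hstack([np.ones(100), -np.ones(100)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(f"C={C:>6}: support vectors={clf.support_vectors_.shape[0]:3d}  "
          f"train accuracy={clf.score(X, t):.2f}")
```

Smaller values of C tolerate more margin violations (more support vectors, wider margin), while larger values penalize violations heavily.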
Multiclass SVM
Similar to our simple Logistic Regression method, SVMs are binary classifiers by default. We can extend them to multiple classes in a few ways, but each approach has downsides.
The “one-vs-all” approach entails building $K$ classifiers, one per class, where the $k$-th classifier is trained to separate class $k$ from the remaining $K - 1$ classes combined. A new input is assigned to the class whose classifier produces the largest output. Downsides are that each classifier sees an unbalanced training set and that the real-valued outputs of the separate classifiers are not necessarily on comparable scales.
The “one-vs-one” approach involves building $K(K-1)/2$ classifiers, one for every pair of classes, and classifying a new input by majority vote among them. This requires training many more classifiers for large $K$, and it can still leave regions of the input space with ambiguous labels.
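Both strategies are available as wrappers in scikit-learn; the sketch below (on a hypothetical four-class toy dataset, not from these notes) simply counts how many binary models each approach trains:

```python
# One-vs-all vs. one-vs-one on a toy four-class problem.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)
centers = np.array([[0, 3], [3, 0], [0, -3], [-3, 0]])
X = np.vstack([rng.normal(c, 0.8, size=(40, 2)) for c in centers])
t = np.repeat([0, 1, 2, 3], 40)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, t)   # K binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, t)    # K(K-1)/2 binary classifiers

print("one-vs-all models trained:", len(ovr.estimators_))   # 4
print("one-vs-one models trained:", len(ovo.estimators_))   # 6
print("predictions for a new point:",
      ovr.predict([[0.0, 2.5]]), ovo.predict([[0.0, 2.5]]))
```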
A third approach is to construct a single model using a feature vector that depends on both the input and the class label. When given a new input, the model computes a score for each candidate class and predicts the class with the largest score.
The margin for this classifier is the difference between the score of the correct class and the largest score among the other classes.
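As a sketch of one common form of this idea (the joint feature map $\phi$ and class index $k$ below are illustrative symbols, not notation from these notes): the model scores each class as $\mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}, k)$ and predicts

$$
\hat{k} = \arg\max_{k}\; \mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}, k),
$$

and training requires that, for every sample, the score of the correct class $t_n$ (now a label in $\{1, \dots, K\}$) exceeds the score of every other class by at least the margin:

$$
\mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}_n, t_n) - \mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}_n, k) \;\geq\; 1 \quad \text{for all } k \neq t_n.
$$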