7_svm

Support Vector Machine

Table of Content

1. Support Vector Machine (SVM) 1.1. Hard-margin SVM 1.2. Soft-margin SVM 1.2.1. How to solve soft-margin SVM 1.3. SVM for non-linear data 1.4. Kernel trick 1.4.1. RBF kernel 1.5. The dual problem 1.5.1. Lagrange multiplier 1.6. More on kernel trick 1.7. Kernel trick - other resources 1.8. Multi-class SVM 1.8.1. Structured SVM 1.9. SVM for regression 1.10. Other considerations

1. Support Vector Machine (SVM)• SVM removes the concept of decision threshold by instead selecting a decision boundary which maximizes the distance between itself and the two most difficult examples to classify.• These two most difficult examples to classify are called the support vectors.– In comparison, logistic regression finds a decision boundary which minimized the negative log loss of the training examples.– SVM on the other hand, only focuses on the support vectors.• For SVM, the decision boundary (also called the hyperplane) is defined by•

w^{T} x - b = 0

• • The values of

w

and

b

will be optimized when the distance between the decision boundary and the support vectors are maximized.

Figure 1:SVM, decision boundary is the red line.

• You can think of SVM is actually making three lines; one line for the hyperplane, defined by equation

(1)

, two more lines (support vectors) with these equations:

\begin{array}{c} w^{T} x - b = 1 \\ w^{T} x - b = - 1 \end{array}

• Note: All the positive/negative examples lie in the left/right of these lines.• The distance between these two support vectors is called the margin →

\frac{2}{| | w | |}

, where

| | w | | = \sqrt{w_{1}^{2} + w_{2}^{2} + \dots + w_{n}^{2}}

. • Note: The distance between the decision boundary and the support vector →

\frac{1}{| | w | |}

. 1.1. Hard-margin SVM• • Maximizing the distance between the decision boundary and support vectors means minimizing the denominator of the margin, i.e. minimizing

| | w | |

→ this ensures the maximum margin between the two support vector. • What we can't do is maximize our margin so much so that we actually end up going past the support vectors themselves. Mathematically, this means:

\begin{array}{c} \min | | w | | \\ s . t . \\ w^{T} x - b ⩾ 1 w h e n y_{i} = 1 \\ w^{T} x - b ⩽ - 1 w h e n y_{i} = - 1 \end{array}

• Equation

(3)

ensures that SVM produces something greater (or equal) to 1, given that it is in fact a positive example (i.e.

y_{i} = 1

). Similarly for for equation

(4)

.• We can summarize equations

(3)

and

(4)

into one inequality equation:

y_{i} (w^{T} x - b) ⩾ 1

• This is a form of constrained optimization problem.• Since

| | w | |

is a linear term, we can use linear programming → however, linear programming won't guarantee us a unique solution and it can be unstable in some cases.• Typically, we square the

| | w | |

to use quadratic programming to minimize

| | w | |^{2}

. Quadratic programming will guarantee us a unique solution.• If we use quadratic programming, we'd have something called a hard-margin SVM.

Figure 2:Data becomes linearly separable as quadratic transformation

1.2. Soft-margin SVM • Hard-margin SVM is sensitive to outliers. Check out the figure below. • Obviously, the solution above (the orange) isn't the best separator.– What we'd really like is to be able to give some slack to the hard-margin SVM such that it could assign support vectors like the purple one and ignore the outliers.• To ignore the outliers, we're using something called slack. – Slack allows us to relax our constraints such that every single data point like the outlier above doesn't have to be to the left of the left support vector.– This gives us the soft-margin SVM.– We do this by adding an error term to the constraint. – Adding the error term means that no longer does every single point have to exist on the correct side of the margin → Instead, we can allow some points to go within the margin.

\begin{array}{c} \min | | w | |^{2} + C \sum_{N}^{} e_{i} \\ s . t . \\ y_{i} (w^{T} x - b) ⩾ 1 - e_{i} \end{array}

• The error,

e_{i}

, is the distance between the new support vector (in purple) and where the old support vector (left orange support vector) would have been in hard-margin SVM.–

e_{i} = 0

→ correct side of margin–

0 < e_{i} < 1

→ still correct but with a penalty–

e_{i} ⩾ 1

→ incorrect (example is on the other side of margin/hyperplane), higher penalty• Note: Here, we are minimizing the errors as well as the the sum of squared weights.• Note: The parameter

C

is a regularization parameter. It indicates how much we want to penalize a particular example lying within the margin. 1.2.1. How to solve soft-margin SVM• We have the error term,

e_{i}

, in both equations

(6)

and

(7)

. • We can solve for

e_{i}

in equation

(7)

and plug it in for

e_{i}

in equation

(6)

such that we no longer have any constraints.• We can do this as follows,

\begin{array}{c} e_{i} ⩾ 1 - y_{i} (w^{T} x - b) \\ since e_{i} ⩾ 0 \\ e_{i} = \max (0, 1 - y_{i} (w^{T} x - b)) \\ Substitute it in the optimization function \\ \min | | w | |^{2} + C \sum_{N}^{} \max (0, 1 - y_{i} (w^{T} x - b)) \end{array}

• This part →

\max (0, 1 - y_{i} (w^{T} x - b))

is called hinge loss.• Since hinge loss is not a differentiable function, we can't find its gradient.• To handle the not-differentiable point, we use a technique called sub-gradient descent, which in particular use Pegasus algorithm for soft-margin SVM → which can solve/optimize equation

(9)

with gradient descent.• The benefit is that we get a guaranteed minimum. 1.3. SVM for non-linear data• To handle non-linear data, we can project it into additional dimensions.– Primarily, we can multiply elements by themselves or have feature interaction terms.• Projecting data into higher dimensions make it possible to find a hyperplane that can separate the data. • Adding interaction terms gets difficult when we have many features → i.e. it would result in very high dimensions.– For example, starting with, say, 100 features, we can end up with thousands of features by adding all the interaction terms.– Also, the thing is based on equation

(9)

, we eventually have to dot product all the thousands of features into their weights which results in a single number.– Instead, we can doing that, we can do a more clever technique called the kernel trick. 1.4. Kernel trick• What the kernel trick does is that it allows us to avoid transforming all of our features into these larger dimensions but still allows us to still extract that dot product without performing the feature transformation.• The only caveat is that we can't use the kernel trick with the SVM in its primal form, i.e. equation

(9)

.• We need to implement the kernel trick on the dual form of SVM.• We can use the representer theorem to represent SVM weights by this equation:

\begin{array}{c} w = \sum_{N}^{} a_{i} y_{i} x_{i} \\ a_{i} = {\begin{cases} 1 when i is a support vector \\ 0 otherwise \end{cases} \end{array}

• Now, we take the Lagrangian dual of the SVM which gives us the new optimization form to represent SVM:•

\max \sum_{N}^{} a_{i} - \frac{1}{2} \sum_{N}^{} 𝛼_{j} 𝛼_{k} y_{j} y_{k} x_{j}^{T} x_{k}

• The

x_{j}^{T} x_{k}

term benefits us in two ways:– If the number of examples that you're trying to classify is far less than the number of dimensions that each example has, then this computation will be a lot more efficient compared to computing

w^{T} x

.– Now that we have this new form, equation

(11)

, we can apply the kernel trick.

\max \sum_{N}^{} a_{i} - \frac{1}{2} \sum_{N}^{} 𝛼_{j} 𝛼_{k} y_{j} y_{k} Φ (x_{j}^{T} x_{k})

• • With kernel trick, we can have some dimensions of

x

. The kernel trick function,

Φ (.)

, allows us to calculate the high dimensional dot product and it will return a singular scalar value.– This is massive savings when the number of examples that we have is far less than the dimensions that we want to project our data into.• The kernel trick function can be:– Linear function– Polynomial function– Gaussian (e.g. RBF kernel)– etc. • Note: Kernel trick is not only for SVMs. For instance, we can take Lagrangian dual of linear regression, and we can get our

x

terms together → that means we use the kernel trick here as well.

y = \sum_{N}^{} a_{i} Φ (x^{T} x)

1.4.1. RBF kernel• RBF → Radial Basis Function

\begin{array}{c} K_{R B F} (x, x') = e_{}^{- (\frac{| | x - x' | |^{2}}{2 𝜎^{2}})} \\ 𝜎 is a tuning parameter: \\ if 𝜎 is too small → overfitting \\ if 𝜎 is too large → underfitting \end{array}

• It represents a separate dimension per data point that you have, i.e. if you have a 100 data points, the RBF kernel would represent your data in 100 dimensions.• Why is this useful?– RBF kernel assigns every data point to a Gaussian distribution that can have different height or width.– Then, it traces a line (or hyperplane) of the sum of the Gaussian distributions.– It does it so it can project data points onto that line within its Gaussian distribution.– By doing this, we get a linearly separable data. 1.5. The dual problem• In convex optimization, for every primal problem (e.g. equation

(9)

) we can derive a dual problem.• Let

𝛼 \in R^{N}

be the dual variables, corresponding to Lagrange multipliers that enforce the

N

inequality constraints.• The generalized Lagrangian is given below,

L (w, w_{0}, 𝛼) = \frac{1}{2} w^{T} w - \sum_{n = 1}^{N} 𝛼_{n} ({\tilde{y}}_{n} (w^{T} x_{n} + w_{0}) - 1)

• To optimize this, we must find a stationary point that satisfies

(\hat{w}, \hat{w_{0}}, \hat{𝛼}) = \min_{w, w_{0}} \max_{𝛼} L (w, w_{0}, 𝛼)

• We can do this by computing the partial derivatives w.r.t.

w

and

w_{0}

and setting to zero,

\begin{array}{c} \nabla_{w} L (w, w_{0}, 𝛼) = w - \sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} x_{n} \\ \frac{𝜕}{𝜕 w_{0}} L (w, w_{0}, 𝛼) = - \sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} \\ and hence \\ \hat{w} = \sum_{n = 1}^{N} {\tilde{𝛼}}_{n} {\tilde{y}}_{n} x_{n} \\ 0 = \sum_{n = 1}^{N} {\tilde{𝛼}}_{n} {\tilde{y}}_{n} \\ Plugging these into the Lagrangian yields the following \\ L (\hat{w}, {\hat{w}}_{0}, 𝛼) = \frac{1}{2} {\hat{w}}^{T} \hat{w} - \sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} {\hat{w}}^{T} x_{n} - \sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} w_{0} + \sum_{n = 1}^{N} 𝛼_{n} \\ = \frac{1}{2} {\hat{w}}^{T} \hat{w} - {\hat{w}}^{T} \hat{w} - 0 + \sum_{n = 1}^{N} 𝛼_{n} \\ = - \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} 𝛼_{i} 𝛼_{j} {\tilde{y}}_{i} {\tilde{y}}_{j} x_{i}^{T} x_{j} + \sum_{n = 1}^{N} 𝛼_{n} \end{array}

• This is called the dual form of the objective.• We want to maximize this w.r.t.

𝛼

subject to the constraints that

\sum_{n = 1}^{N} {\tilde{𝛼}}_{n} {\tilde{y}}_{n} = 0

and

𝛼_{n} ⩾ 0

for

n = 1 : N

.• The above objective is a quadratic problem in

N

variables.• Standard QP solvers take

O (n^{3})

time.• However, specialized algorithms, such as the sequential minimal optimization (SMO) algorithm, is developed for this problem that takes

O (n)

.• Since this is a convex objective, the solution must satisfy the KKT conditions,

\begin{array}{c} 𝛼_{n} ⩾ 0 \\ {\tilde{y}}_{n} f (x_{n}) - 1 ⩾ 0 \\ 𝛼_{n} ({\tilde{y}}_{n} f (x_{n}) - 1) = 0 \end{array}

• Hence either

𝛼_{n} = 0

(in which case example

n

is ignored when computing

\hat{w}

) or the constraint

{\tilde{y}}_{n} ({\hat{w}}^{T} x_{n} + {\hat{w}}_{0}) = 1

is active.• This latter condition means that example

n

lies on the decision boundary (they're called support vectors). We denote the set of support vectors by

S

.• To perform prediction, we use

f (x; \hat{w}, {\hat{w}}_{0}) = {\hat{w}}^{T} x_{n} + {\hat{w}}_{0} = \sum_{n \in S}^{} 𝛼_{n} {\tilde{y}}_{n} x_{n}^{T} x + {\hat{w}}_{0}

• To solve for

{\hat{w}}_{0}

, we can use the fact that for any support vector, we have

{\tilde{y}}_{n} f (x; \hat{w}, {\hat{w}}_{0}) = 1

. • Multiplying both sides by

{\tilde{y}}_{n}

, and exploiting the fact that

{\tilde{y}}_{n}^{2} = 1

, we get

{\hat{w}}_{0} = {\tilde{y}}_{n} - {\hat{w}}^{T} x_{n}

. • In practice, we get better results by averaging over all the support vectors,

{\hat{w}}_{0} = \frac{1}{| S |} \sum_{n \in S}^{} ({\tilde{y}}_{n} - {\hat{w}}^{T} x_{n}) = \frac{1}{| S |} \sum_{n \in S}^{} ({\tilde{y}}_{n} - \sum_{m \in S}^{} 𝛼_{m} {\tilde{y}}_{m} x_{m}^{T} x_{n})

1.5.1. Lagrange multiplier• In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.• The basic idea is to convert a constrained problem into a form such that the derivative test of an unconstrained problem can still be applied.• The method can be summarized as follows: in order to find the maximum or minimum of a function

f (x)

subjected to the equality constraint

g (x) = 0

, form the Lagrangian function, equation

(23)

, and find the stationary of

L

considered as a function of

x

and the Lagrange multiplier

𝜆

L (x, 𝜆) = f (x) + 𝜆 g (x)

• • This means that all partial derivatives should be zero, including the partial derivative w.r.t.

𝜆

.• The solution corresponding to the original constrained optimization is always a saddle point of the Lagrangian function.• The great advantage of this method is that it allows the optimization to be solved without explicit parameterization in terms of the constraints.– As a result, the method of Lagrangian multipliers is widely used to solve challenging constrained optimization problems.

Figure 3:The red curve shows the constraint

g (x, y) = c

. The blue curves are contours of

f (x, y)

. The point wherethe red constraint tangentially touches a blue contour is themaximum of

f (x, y)

along the constraint, since

d_{1} > d_{2}

1.6. More on kernel trick• Kernel Definition: A function that takes as its input vectors in the original space and returns the dot product of the vectors in the feature space is called a kernel function.– More formally, if we have data

x, z \in X

and a map

𝜙 : X \to R^{N}

, then

k (x, z) = ⟨ 𝜙 (x), 𝜙 (z) ⟩

is a kernel function. • Once we converted our problem into a dual problem form, equation

(20)

, it has

N

unknowns (

𝛼

) which (in general) takes

O (n^{3})

time to solve, which can be slow.• However, the principal benefit of dual problem is that we can replace all inner product operations

x_{}^{T} x'

with a call to a positive definite (Mercer) kernel function,

K (x, x')

→ This is called the kernel trick.• In particular, we can rewrite the prediction function, equation

(21)

, as follows,

f (x) = {\hat{w}}^{T} x_{n} + {\hat{w}}_{0} = \sum_{n \in S}^{} 𝛼_{n} {\tilde{y}}_{n} x_{n}^{T} x + {\hat{w}}_{0} = \sum_{n \in S}^{} 𝛼_{n} {\tilde{y}}_{n} K (x_{n}, x) + {\hat{w}}_{0}

• We also need to kernelize the bias term. This can be done by kernelizing equation

(22)

{\hat{w}}_{0} = \frac{1}{| S |} \sum_{i \in S}^{} ({\tilde{y}}_{i} - (\sum_{j \in S}^{} 𝛼_{j} {\tilde{y}}_{j} x_{j})^{T} x_{i}) = \frac{1}{| S |} \sum_{i \in S}^{} ({\tilde{y}}_{i} - \sum_{j \in S}^{} 𝛼_{j} {\tilde{y}}_{j} K (x_{j}, x_{i}))

• This kernel trick allows us to avoid having to deal with an explicit feature representation of our data, and allows us to easily apply classifiers to structured objects, such as strings and graphs. 1.7. Kernel trick - other resources• PDF screenshots here.

1.8. Multi-class SVM• One way to do this is to create an SVM per class → i.e. classify one class against every other classes → This paradigm is called one-vs-rest.– To make a prediction for an example, we feed it to all the trained [one-vs-rest] SVMs and we measure the margin each SVM produces. We choose the prediction that produces the largest margin between that example and the other classes.• Another way to handle multi-classes is to create a one-vs-one paradigm.• Here, we're creating a pair-wise SVM so that for every single pair of classes we're creating a single SVM.– To get a prediction, we simply feed the unseen example through every single SVM and select whichever class was most often predicted for that example.• Note: Even though the one-vs-one paradigm requires a lot of SVMs, the data required per each SVM is only two classes → So, this can actually be faster in some cases depending on the data. 1.8.1. Structured SVM• Here, instead of the margin being

- 1

and

1

, the margin is actually the distance between the two closest classes. 1.9. SVM for regression• The idea is very similar to SVM classifier. • The only difference is that the goal is to keep all points within the margin.• The slack variable (i.e. the error terms) here comes from points that lie outside of the margin. 1.10. Other considerations• SVMs can linearly separate data out of the box → hard-margin SVM.• If we add slack variables → soft-margin SVM.• We can use sub-gradient descent to optimize soft-margin SVM.• We can add interaction terms to separate data even if it's not linearly separable. This is often preferred when you have either a low number of dimensions that you want to project into or if your data is extremely large.• However, if your data isn't so large, and you want to project your features into a very high dimensional space, you can use kernel trick, which allows us to avoid computing this actual feature transformation.• SVMs are distance-based. So, we have to consider scaling our features as well.• Why to use SVM over logistic regression?– If you have a low number of examples → just start with a linear SVM because they only focus on the support vectors.– If you have a ton of data and not many features → you might be better off starting with logistic regression. Back to Top