Support Vector Machine
Table of Content
1. Support Vector Machine (SVM) 1.1. Hard-margin SVM 1.2. Soft-margin SVM 1.2.1. How to solve soft-margin SVM 1.3. SVM for non-linear data 1.4. Kernel trick 1.4.1. RBF kernel 1.5. The dual problem 1.5.1. Lagrange multiplier 1.6. More on kernel trick 1.7. Kernel trick - other resources 1.8. Multi-class SVM 1.8.1. Structured SVM 1.9. SVM for regression 1.10. Other considerations
1. Support Vector Machine (SVM)SVM removes the concept of decision threshold by instead selecting a decision boundary which maximizes the distance between itself and the two most difficult examples to classify.These two most difficult examples to classify are called the support vectors.In comparison, logistic regression finds a decision boundary which minimized the negative log loss of the training examples.SVM on the other hand, only focuses on the support vectors.For SVM, the decision boundary (also called the hyperplane) is defined by wTx-b=0 The values of w and b will be optimized when the distance between the decision boundary and the support vectors are maximized.
Figure 1:SVM, decision boundary is the red line.
You can think of SVM is actually making three lines; one line for the hyperplane, defined by equation (1), two more lines (support vectors) with these equations: wTx-b=1wTx-b=-1 Note: All the positive/negative examples lie in the left/right of these lines.The distance between these two support vectors is called the margin2||w||, where ||w||=w12+w22++wn2 . Note: The distance between the decision boundary and the support vector → 1||w||. 1.1. Hard-margin SVM Maximizing the distance between the decision boundary and support vectors means minimizing the denominator of the margin, i.e. minimizing ||w||this ensures the maximum margin between the two support vector. What we can't do is maximize our margin so much so that we actually end up going past the support vectors themselves. Mathematically, this means: min||w||s.t.wTx-b1whenyi=1wTx-b-1whenyi=-1 Equation (3) ensures that SVM produces something greater (or equal) to 1, given that it is in fact a positive example (i.e. yi=1). Similarly for for equation (4).We can summarize equations (3) and (4) into one inequality equation: yi(wTx-b)1 This is a form of constrained optimization problem.Since ||w|| is a linear term, we can use linear programming → however, linear programming won't guarantee us a unique solution and it can be unstable in some cases.Typically, we square the ||w|| to use quadratic programming to minimize ||w||2. Quadratic programming will guarantee us a unique solution.If we use quadratic programming, we'd have something called a hard-margin SVM.
Figure 2:Data becomes linearly separable as quadratic transformation
1.2. Soft-margin SVM Hard-margin SVM is sensitive to outliers. Check out the figure below. Obviously, the solution above (the orange) isn't the best separator.What we'd really like is to be able to give some slack to the hard-margin SVM such that it could assign support vectors like the purple one and ignore the outliers.To ignore the outliers, we're using something called slack. Slack allows us to relax our constraints such that every single data point like the outlier above doesn't have to be to the left of the left support vector.This gives us the soft-margin SVM.We do this by adding an error term to the constraint. Adding the error term means that no longer does every single point have to exist on the correct side of the margin → Instead, we can allow some points to go within the margin. min||w||2+CNeis.t.yi(wTx-b)1-ei The error, ei, is the distance between the new support vector (in purple) and where the old support vector (left orange support vector) would have been in hard-margin SVM.ei=0 → correct side of margin0<ei<1 → still correct but with a penaltyei1 → incorrect (example is on the other side of margin/hyperplane), higher penaltyNote: Here, we are minimizing the errors as well as the the sum of squared weights.Note: The parameter C is a regularization parameter. It indicates how much we want to penalize a particular example lying within the margin. 1.2.1. How to solve soft-margin SVMWe have the error term, ei, in both equations (6) and (7). We can solve for ei in equation (7) and plug it in for ei in equation (6) such that we no longer have any constraints.We can do this as follows,ei1-yi(wTx-b)sinceei0ei=max(0,1-yi(wTx-b))Substitute it in the optimization functionmin||w||2+CNmax(0,1-yi(wTx-b)) This part → max(0,1-yi(wTx-b)) is called hinge loss.Since hinge loss is not a differentiable function, we can't find its gradient.To handle the not-differentiable point, we use a technique called sub-gradient descent, which in particular use Pegasus algorithm for soft-margin SVM → which can solve/optimize equation (9) with gradient descent.The benefit is that we get a guaranteed minimum. 1.3. SVM for non-linear dataTo handle non-linear data, we can project it into additional dimensions.Primarily, we can multiply elements by themselves or have feature interaction terms.Projecting data into higher dimensions make it possible to find a hyperplane that can separate the data. Adding interaction terms gets difficult when we have many features → i.e. it would result in very high dimensions.For example, starting with, say, 100 features, we can end up with thousands of features by adding all the interaction terms.Also, the thing is based on equation (9), we eventually have to dot product all the thousands of features into their weights which results in a single number.Instead, we can doing that, we can do a more clever technique called the kernel trick. 1.4. Kernel trickWhat the kernel trick does is that it allows us to avoid transforming all of our features into these larger dimensions but still allows us to still extract that dot product without performing the feature transformation.The only caveat is that we can't use the kernel trick with the SVM in its primal form, i.e. equation (9).We need to implement the kernel trick on the dual form of SVM.We can use the representer theorem to represent SVM weights by this equation:w=Naiyixiai={1wheniis a support vector0otherwiseNow, we take the Lagrangian dual of the SVM which gives us the new optimization form to represent SVM: maxNai-12N𝛼j𝛼kyjykxjTxk The xjTxk term benefits us in two ways:If the number of examples that you're trying to classify is far less than the number of dimensions that each example has, then this computation will be a lot more efficient compared to computing wTx.Now that we have this new form, equation (11), we can apply the kernel trick. maxNai-12N𝛼j𝛼kyjykΦ(xjTxk) With kernel trick, we can have some dimensions of x. The kernel trick function, Φ(.), allows us to calculate the high dimensional dot product and it will return a singular scalar value.This is massive savings when the number of examples that we have is far less than the dimensions that we want to project our data into.The kernel trick function can be:Linear functionPolynomial functionGaussian (e.g. RBF kernel)etc. Note: Kernel trick is not only for SVMs. For instance, we can take Lagrangian dual of linear regression, and we can get our x terms together → that means we use the kernel trick here as well.y=NaiΦ(xTx)1.4.1. RBF kernelRBF → Radial Basis Function KRBF(x,x)=e-(||x-x||22𝜎2)𝜎is a tuning parameter:if𝜎is too small → overfittingif𝜎is too large → underfitting It represents a separate dimension per data point that you have, i.e. if you have a 100 data points, the RBF kernel would represent your data in 100 dimensions.Why is this useful?RBF kernel assigns every data point to a Gaussian distribution that can have different height or width.Then, it traces a line (or hyperplane) of the sum of the Gaussian distributions.It does it so it can project data points onto that line within its Gaussian distribution.By doing this, we get a linearly separable data. 1.5. The dual problemIn convex optimization, for every primal problem (e.g. equation (9)) we can derive a dual problem.Let 𝛼RN be the dual variables, corresponding to Lagrange multipliers that enforce the N inequality constraints.The generalized Lagrangian is given below,L(w,w0,𝛼)=12wTw-n=1N𝛼n(y~n(wTxn+w0)-1)To optimize this, we must find a stationary point that satisfies(wˆ,w0ˆ,𝛼ˆ)=minw,w0max𝛼L(w,w0,𝛼) We can do this by computing the partial derivatives w.r.t. w and w0 and setting to zero, wL(w,w0,𝛼)=w-n=1N𝛼ny~nxn𝜕𝜕w0L(w,w0,𝛼)=-n=1N𝛼ny~nand hencewˆ=n=1N𝛼~ny~nxn0=n=1N𝛼~ny~nPlugging these into the Lagrangian yields the followingL(wˆ,wˆ0,𝛼)=12wˆTwˆ-n=1N𝛼ny~nwˆTxn-n=1N𝛼ny~nw0+n=1N𝛼n=12wˆTwˆ-wˆTwˆ-0+n=1N𝛼n=-12i=1Nj=1N𝛼i𝛼jy~iy~jxiTxj+n=1N𝛼n This is called the dual form of the objective.We want to maximize this w.r.t. 𝛼 subject to the constraints that n=1N𝛼~ny~n=0 and 𝛼n0 for n=1:N.The above objective is a quadratic problem in N variables.Standard QP solvers take O(n3) time.However, specialized algorithms, such as the sequential minimal optimization (SMO) algorithm, is developed for this problem that takes O(n).Since this is a convex objective, the solution must satisfy the KKT conditions, 𝛼n0y~nf(xn)-10𝛼n(y~nf(xn)-1)=0Hence either 𝛼n=0 (in which case example n is ignored when computing wˆ) or the constraint y~n(wˆTxn+wˆ0)=1 is active.This latter condition means that example n lies on the decision boundary (they're called support vectors). We denote the set of support vectors by S.To perform prediction, we usef(x;wˆ,wˆ0)=wˆTxn+wˆ0=nS𝛼ny~nxnTx+wˆ0To solve for wˆ0, we can use the fact that for any support vector, we have y~nf(x;wˆ,wˆ0)=1. Multiplying both sides by y~n, and exploiting the fact that y~n2=1, we get wˆ0=y~n-wˆTxn. In practice, we get better results by averaging over all the support vectors,wˆ0=1|S|nS(y~n-wˆTxn)=1|S|nS(y~n-mS𝛼my~mxmTxn) 1.5.1. Lagrange multiplierIn mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.The basic idea is to convert a constrained problem into a form such that the derivative test of an unconstrained problem can still be applied.The method can be summarized as follows: in order to find the maximum or minimum of a function f(x) subjected to the equality constraint g(x)=0, form the Lagrangian function, equation (23), and find the stationary of L considered as a function of x and the Lagrange multiplier 𝜆.L(x,𝜆)=f(x)+𝜆g(x) This means that all partial derivatives should be zero, including the partial derivative w.r.t. 𝜆.The solution corresponding to the original constrained optimization is always a saddle point of the Lagrangian function.The great advantage of this method is that it allows the optimization to be solved without explicit parameterization in terms of the constraints.As a result, the method of Lagrangian multipliers is widely used to solve challenging constrained optimization problems.
Figure 3:The red curve shows the constraint g(x,y)=c. The blue curves are contours of f(x,y). The point wherethe red constraint tangentially touches a blue contour is themaximum of f(x,y) along the constraint, since d1>d2.
1.6. More on kernel trickKernel Definition: A function that takes as its input vectors in the original space and returns the dot product of the vectors in the feature space is called a kernel function.More formally, if we have data x,zX and a map 𝜙:XRN, then k(x,z)=𝜙(x),𝜙(z) is a kernel function. Once we converted our problem into a dual problem form, equation (20), it has N unknowns (𝛼) which (in general) takes O(n3) time to solve, which can be slow. However, the principal benefit of dual problem is that we can replace all inner product operations xTx with a call to a positive definite (Mercer) kernel function, K(x,x) → This is called the kernel trick.In particular, we can rewrite the prediction function, equation (21), as follows,f(x)=wˆTxn+wˆ0=nS𝛼ny~nxnTx+wˆ0=nS𝛼ny~nK(xn,x)+wˆ0We also need to kernelize the bias term. This can be done by kernelizing equation (22),wˆ0=1|S|iS(y~i-(jS𝛼jy~jxj)Txi)=1|S|iS(y~i-jS𝛼jy~jK(xj,xi))This kernel trick allows us to avoid having to deal with an explicit feature representation of our data, and allows us to easily apply classifiers to structured objects, such as strings and graphs. 1.7. Kernel trick - other resourcesPDF screenshots here.
1.8. Multi-class SVMOne way to do this is to create an SVM per class → i.e. classify one class against every other classes → This paradigm is called one-vs-rest.To make a prediction for an example, we feed it to all the trained [one-vs-rest] SVMs and we measure the margin each SVM produces. We choose the prediction that produces the largest margin between that example and the other classes.Another way to handle multi-classes is to create a one-vs-one paradigm.Here, we're creating a pair-wise SVM so that for every single pair of classes we're creating a single SVM.To get a prediction, we simply feed the unseen example through every single SVM and select whichever class was most often predicted for that example.Note: Even though the one-vs-one paradigm requires a lot of SVMs, the data required per each SVM is only two classes → So, this can actually be faster in some cases depending on the data. 1.8.1. Structured SVMHere, instead of the margin being -1 and 1, the margin is actually the distance between the two closest classes. 1.9. SVM for regressionThe idea is very similar to SVM classifier. The only difference is that the goal is to keep all points within the margin.The slack variable (i.e. the error terms) here comes from points that lie outside of the margin. 1.10. Other considerationsSVMs can linearly separate data out of the box → hard-margin SVM.If we add slack variablessoft-margin SVM.We can use sub-gradient descent to optimize soft-margin SVM.We can add interaction terms to separate data even if it's not linearly separable. This is often preferred when you have either a low number of dimensions that you want to project into or if your data is extremely large.However, if your data isn't so large, and you want to project your features into a very high dimensional space, you can use kernel trick, which allows us to avoid computing this actual feature transformation.SVMs are distance-based. So, we have to consider scaling our features as well.Why to use SVM over logistic regression?If you have a low number of examples → just start with a linear SVM because they only focus on the support vectors.If you have a ton of data and not many features → you might be better off starting with logistic regression. Back to Top