In Part 2 we saw a description of the optimization problem we need to solve in order to train our SVM
(in other words, in order to find the optimal separating hyperplane).
In this part, we will start looking into how exactly we are going to solve this problem. In textbooks you
will find a lot of mathematical derivations and proofs here, I will try to avoid that, since my focus is on
intuitive understanding rather than mathematical rigor.
So let's start. The first breakthrough in the development of support vector machines was the realization that if you derive the Lagrangian of the optimization problem, most parameters disappear and you are left with a simpler form, one which also happens to open up some interesting doors (namely, the Kernel trick, which we will discuss in a later article).
Lagrange multipliers and optimization:
I will start by providing a relatively simple explanation of the Lagrangian technique, how it works and why. Then we will skip to the result (the steps of the Lagrangian derivation can be found in many textbooks if you are interested).
A convex function is a function that graphically looks like this:
Finding the minimum of such a function can be done by taking its derivative and finding the point at which the derivative equals 0. This is because at the bottom of the convex function, the value momentarily stops changing (hence a derivative of 0) before the trend reverses and the function starts going back up.
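To make this concrete, here is a tiny sketch in Python (my own example, not from any particular textbook) of a convex function whose minimum is found exactly where the derivative vanishes:

```python
# Minimal sketch: finding the minimum of a convex function by locating
# the point where its derivative is zero.
# f(x) = (x - 3)^2 + 1 is convex; f'(x) = 2(x - 3) vanishes at x = 3.

def f(x):
    return (x - 3) ** 2 + 1

def f_prime(x):
    return 2 * (x - 3)

# Solving f'(x) = 0 analytically: 2(x - 3) = 0  =>  x = 3
x_min = 3.0
assert f_prime(x_min) == 0

# Sanity check: f is larger on either side of x_min, so this is the bottom
assert f(x_min) < f(x_min - 0.5) and f(x_min) < f(x_min + 0.5)
print(x_min, f(x_min))
```

The same idea carries over to multiple variables: set each partial derivative to 0 and solve the resulting system.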
But remember that we have a slightly more complicated problem to solve here: we have a set of constraints. This is where Lagrange multipliers come in.
The idea of Lagrange multipliers is as follows:
Suppose you have a constraint:

y_i(w·x_i + b) ≥ 1

for all i.
And suppose your objective function is:

(1/2)||w||²
The Lagrangian is simply the objective function plus your constraints, each formulated in such a way that f(x) = 0. For each such constraint term we must also add a coefficient, called a Lagrange multiplier.
Formulating the constraint in f(x) = 0 form:

y_i(w·x_i + b) − 1 = 0
So for this example, the Lagrangian is:

L(w, b, α) = (1/2)||w||² − Σ_i α_i [y_i(w·x_i + b) − 1]
The point at which the derivatives of L equal 0 corresponds to the lowest point of the convex function at which the constraint also happens NOT to be violated.
In other words, when you take the derivatives of L, it's with respect to the objective function's variables AND the newly introduced Lagrange multipliers (the alphas). These derivatives must all be set to 0.
It's a bit like a game where you are asked to pick the value of the objective function's variables to maximize your score: the smaller the value of the function, the more points you win (for a minimization problem). But at the same time, your opponent is allowed to choose the Lagrange multiplier arbitrarily to maximize the value of the function, thereby making you lose points. The optimal play is clearly to set w (in this example) such that it is insensitive to the value of alpha (in mathematical terms: such that the derivative with respect to alpha is 0).
To summarize then, the idea is to get as close as possible to the point where the derivative of the objective function equals 0, while making sure that the sensitivity to the Lagrange multiplier is also 0.
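Here is the whole recipe worked by hand on a toy problem (my own example, deliberately much simpler than the SVM problem): minimize f(w) = w² subject to the single constraint w − 1 = 0.

```python
# Lagrangian for: minimize w^2 subject to w - 1 = 0
#   L(w, alpha) = w^2 - alpha * (w - 1)
# Stationarity conditions:
#   dL/dw     = 2w - alpha = 0   ->  alpha = 2w
#   dL/dalpha = -(w - 1)   = 0   ->  w = 1  (the constraint must hold)

w = 1.0        # forced by dL/dalpha = 0
alpha = 2 * w  # forced by dL/dw = 0, so alpha = 2

def L(w, alpha):
    return w ** 2 - alpha * (w - 1)

# At the solution, nudging w in either direction cannot decrease L,
# which is exactly the "insensitivity" described above:
eps = 1e-3
assert L(w, alpha) <= L(w + eps, alpha)
assert L(w, alpha) <= L(w - eps, alpha)
print(w, alpha)
```

Note how the constrained minimum (w = 1) differs from the unconstrained one (w = 0): the multiplier α = 2 is precisely what balances the pull of the objective against the constraint.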
The Support Vector Expansion:
Ok, so now we apply the same technique to our SVM optimization problem: calculate the Lagrangian, take its partial derivatives with respect to w and b (the parameters to optimize), set those derivatives to 0 and substitute the results back into the Lagrangian for the reasons just explained (this is where we skip the math), and we get the dual problem:

maximize over α:  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j)
subject to:  α_i ≥ 0 for all i, and Σ_i α_i y_i = 0
Lo and behold, the parameters w and b entirely vanished, and the whole problem is reduced to finding the right values of the Lagrange multipliers alpha! (a few additional mathematical tricks were used to get to this result, but the Lagrangian technique is the most fundamental part of it)
Furthermore, from this result we can derive (see textbooks) the "Support Vector expansion":

w = Σ_i α_i y_i x_i
Let's think a bit about what that means, and why it makes sense. The math behind this result is undeniable, but it doesn't really help you understand what is going on. First of all, what the Support Vector Expansion is saying is that you can express the separating hyperplane entirely in terms of a linear combination of some data points.
These data points that play a role in determining the hyperplane are the ones for which the corresponding alphas are non-zero. These happen, in fact, to be the Support Vectors! (we'll see later how the alphas are picked)
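A quick numeric sketch of the expansion w = Σ_i α_i y_i x_i (the data points and alpha values below are illustrative stand-ins for what the optimizer would produce, not computed here):

```python
import numpy as np

# Three data points; the third has alpha = 0, so it is NOT a support
# vector and contributes nothing to w.
X = np.array([[1.0, 1.0],    # support vector, label +1
              [3.0, 3.0],    # support vector, label -1
              [0.0, 0.0]])   # not a support vector (alpha = 0)
y = np.array([+1.0, -1.0, +1.0])
alpha = np.array([0.5, 0.5, 0.0])   # hypothetical multipliers

w = (alpha * y) @ X   # the expansion: only nonzero alphas shape w
b = 4.0               # offset placing the boundary between the support vectors

# The resulting hyperplane classifies the support vectors correctly:
assert w @ X[0] + b > 0   # positive side
assert w @ X[1] + b < 0   # negative side
print(w)
```

Dropping the third point entirely would leave w unchanged: that is the sense in which only the support vectors determine the hyperplane.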
Ok, but how does it make sense that the w vector (or, equivalently, the decision boundary/hyperplane) can be entirely calculated from a few data points? To understand that, let's look at the simplest case:
We have 2 data points to classify. The best hyperplane in this scenario is perpendicular to the line segment (or vector, if you prefer) formed from point a to point b. It might not be visually obvious that this line is what best maximizes the separating margin, but if you think of the Pythagorean theorem, it should be "mathematically evident" that in this case the distance from each point to the decision boundary is larger than if the boundary were placed, say, horizontally or vertically between the points:
(the width of the margin, represented by the line segment c, is, according to the Pythagorean theorem, greater than either of the line segments a or b that would be formed if we placed the separating hyperplane horizontally or vertically between the two points)
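If the picture isn't convincing, the arithmetic is easy to check with concrete coordinates (my own choice of points, purely illustrative):

```python
import math

# Two points at (0, 0) and (2, 2). The perpendicular-bisector boundary
# gives each point a margin of sqrt(2), while a horizontal (y = 1) or
# vertical (x = 1) boundary between them only gives a margin of 1.
ax, ay = 0.0, 0.0
bx, by = 2.0, 2.0

half_diag = math.hypot(bx - ax, by - ay) / 2  # margin of the perpendicular boundary
horiz = abs(by - ay) / 2                      # margin if the boundary is y = 1
vert = abs(bx - ax) / 2                       # margin if the boundary is x = 1

assert half_diag > horiz and half_diag > vert
print(half_diag, horiz, vert)
```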
Ok. I'm going to pretend this was convincing enough and keep going. The next logical step is to realize that, in this scenario, w is equal to b - a. It's a simple fact of linear algebra that the vector connecting two points is obtained by subtracting one point from the other.
We've just shown, in the simplest case, why the hyperplane can be described as a subtraction of a negative-labeled support vector from a positive-labeled support vector. To generalize this example to cases where you have more than 1 support vector on each side of the decision boundary, realize that the w vector is then a weighted average of the support vectors on one side minus a weighted average of the vectors on the other side (the alphas provide the weights). So you are still ultimately producing the w vector as a subtraction of two points across the decision boundary, except that now each of the 2 points represents the "gravity center" of its side (metaphorically speaking).
So there you have it, by summing and subtracting support vectors, we have expressed, in the general case, the w vector.
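The two-point picture above is easy to verify in code. A minimal sketch (the point names p and q are mine): with one positive point p and one negative point q, the normal of the best separating hyperplane is w = p − q, and the boundary passes through their midpoint.

```python
import numpy as np

p = np.array([2.0, 2.0])  # positive example
q = np.array([0.0, 0.0])  # negative example

w = p - q           # normal vector: the difference across the boundary
mid = (p + q) / 2   # the boundary passes through the midpoint
bias = -w @ mid     # chosen so that the midpoint lies exactly on the boundary

def f(x):
    # decision function: sign tells you which side of the hyperplane x is on
    return w @ x + bias

assert f(p) > 0 and f(q) < 0    # each point lands on its own side
assert abs(f(p)) == abs(f(q))   # equal margins, as the geometry promised
```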
The next step:
Now we have expressed the optimization problem only in terms of alphas. We understand that the weight vector w isn't strictly necessary: you can express the SVM's separating hyperplane directly as a linear combination (or weighted vote) of carefully selected data points, the support vectors. Furthermore, we understand that the selection mechanism consists of setting the alpha values correctly.
In summary, we have reached the conclusion that the entire business of SVMs is about finding the vectors that contribute the most to the determination of the decision hyperplane (i.e. the Support Vectors) and then "drawing" a vector w through them.
The next step, then, is to figure out how to identify the alpha values. Traditionally this is done through an algorithm called Sequential Minimal Optimization (SMO for short), developed by John Platt. We will get into how this algorithm works in a later article. Note also that in the era of Big Data there is ongoing research into more scalable algorithms (Stochastic Gradient Descent, L-BFGS, etc.) for numerically determining these optimal alphas.