In part 3 of the SVM series, we saw the theoretical framework behind finding the optimal separating hyperplane.
Here we are going to quickly discuss, at a high level, the traditional method of solving it, the Sequential Minimal Optimization (SMO) algorithm, and then take a quick look at the more scalable, modern techniques based on the Gradient Descent family of algorithms.
The Sequential Minimal Optimization algorithm
In this algorithm, we start with our alphas at some initial values, and then we iteratively update them in pairs, in order to approach the optimal solution. We update them in pairs for two reasons: the constraint that the alphas (weighted by their labels) must sum to zero means that any change to one alpha has to be compensated by a change to another; and, conveniently, optimizing exactly two alphas at a time admits a closed-form (analytical) solution for the best direction and magnitude by which to update them in any given iteration.
Remember that the alphas act as weights in a weighted average of the points on each side of the separating hyperplane. These two weighted averages are two points whose difference forms the vector w that was previously discussed.
So, if we have 3 support vectors on a given side of the separating hyperplane, for example, choosing the values of their alphas is the same as moving the "center point" (the weighted average) towards an optimal position somewhere within the triangle formed by these 3 support vectors. And when you pick two alphas, you are essentially moving that center point somewhere along the line segment formed by the two corresponding support vectors.
As an example of a possible optimization path: in a first step, we pick the alphas for p1 and p2, and optimize their weights relative to each other. We choose a weight of 2/3 for p1 and 1/3 for p2. p3 is still at alpha = 0, so there is no p3 component in the resulting center point.
In a second step, we pick the alphas of p1 and p3. We discover that the optimal relative weight between those 2 is half and half, so half of p1's weight shifts to p3. After these 2 steps, then, the center point ends up in the middle of the area formed by the 3 support vectors, because their respective alphas are all 1/3, giving 1/3 * p1 + 1/3 * p2 + 1/3 * p3.
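To make the arithmetic concrete, here is a small, self-contained sketch of that two-step path (the three 2D points are made up for illustration):

```python
# Hypothetical 2D support vectors on one side of the hyperplane.
p1, p2, p3 = (0.0, 0.0), (3.0, 0.0), (0.0, 3.0)

def center(alphas, points):
    """Weighted average of the points, with weights normalized to sum to 1."""
    total = sum(alphas)
    return tuple(sum(a * p[k] for a, p in zip(alphas, points)) / total
                 for k in range(2))

# Step 1: optimize the alphas of p1 and p2 only (p3 stays at 0).
print(center([2/3, 1/3, 0.0], [p1, p2, p3]))   # -> (1.0, 0.0)

# Step 2: split p1's weight evenly between p1 and p3.
print(center([1/3, 1/3, 1/3], [p1, p2, p3]))   # -> (1.0, 1.0)
```

After step 2 the center point sits at the centroid of the triangle formed by p1, p2 and p3, exactly as described above.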
The SMO algorithm's job is to identify the values of the alphas that maximize the dual objective we saw in part 3:

maximize: sum_i(alpha_i) - 1/2 * sum_i(sum_j(alpha_i * alpha_j * y_i * y_j * (x_i . x_j)))
subject to: alpha_i >= 0 for all i, and sum_i(alpha_i * y_i) = 0
In SMO, all we do (in a simplified version) is keep repeating the following until convergence:
1. we pick two alphas
2. we optimize the value of those two alphas while keeping all of the other ones fixed, using a closed-form solution.
(see more detailed pseudo-code below)
Conceptually, then, at each iteration we move the two center points (one on each side of the separating hyperplane) along the line segment formed by the support vectors corresponding to the two selected alphas, until we reach the center points that best optimize the formula above.
Let's wrap up this brief overview with some pseudo-code for a simplified version (i.e. minus some performance improvements that complicate things a little) of the SMO algorithm:
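Here is a rough, runnable Python sketch of that simplified version, loosely following the well-known "simplified SMO" formulation: a linear kernel, a random choice of the second alpha, and illustrative names and constants throughout (C, tol, max_passes are the usual regularization, tolerance, and convergence meta-parameters):

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simplified_smo(X, y, C=10.0, tol=1e-3, max_passes=10):
    """X: list of feature tuples, y: labels in {-1, +1}."""
    m = len(X)
    alphas, b = [0.0] * m, 0.0

    def f(k):  # decision function, with a linear kernel
        return sum(alphas[j] * y[j] * dot(X[j], X[k]) for j in range(m)) + b

    passes = total = 0
    while passes < max_passes and total < 1000:  # cap total passes as a safety net
        total += 1
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            # only touch alphas that violate the KKT conditions
            if (y[i] * E_i < -tol and alphas[i] < C) or (y[i] * E_i > tol and alphas[i] > 0):
                j = random.choice([k for k in range(m) if k != i])  # second alpha, at random
                E_j = f(j) - y[j]
                a_i, a_j = alphas[i], alphas[j]
                # bounds that keep sum(alpha * y) constant
                if y[i] != y[j]:
                    L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                eta = 2 * dot(X[i], X[j]) - dot(X[i], X[i]) - dot(X[j], X[j])
                if L == H or eta >= 0:
                    continue
                # closed-form update for the pair, clipped to [L, H]
                alphas[j] = min(H, max(L, a_j - y[j] * (E_i - E_j) / eta))
                if abs(alphas[j] - a_j) < 1e-5:
                    continue
                alphas[i] = a_i + y[i] * y[j] * (a_j - alphas[j])
                # recompute the bias from the updated pair
                b1 = (b - E_i - y[i] * (alphas[i] - a_i) * dot(X[i], X[i])
                      - y[j] * (alphas[j] - a_j) * dot(X[i], X[j]))
                b2 = (b - E_j - y[i] * (alphas[i] - a_i) * dot(X[i], X[j])
                      - y[j] * (alphas[j] - a_j) * dot(X[j], X[j]))
                if 0 < alphas[i] < C:
                    b = b1
                elif 0 < alphas[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alphas, b
```

The inner block is exactly the two-step loop described above: pick a pair of alphas, then solve for them analytically while holding all the others fixed.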
This YouTube video is a pretty good introduction to the intuition (and even some of the math) behind the SMO, in its simple form:
Scalable Gradient Descent-based approaches
In the era of Big Data, we've mostly abandoned the closed-form solution in favor of more stochastic approaches, because SMO becomes quite slow on large data sets.
The principle is the same as in SMO, in that we iteratively adjust the model's parameters to approach the optimal solution. While a few variations are being used and researched for optimizing SVMs (and large-scale machine learning algorithms in general), the fairly tried-and-tested (and common) approach is Stochastic Gradient Descent (SGD).
Before even attempting any explanation, I highly recommend this beautiful blog article that displays a few optimization algorithms (including Gradient Descent: the start point to understanding SGD) in an interactive visual manner:
Nothing beats an interactive visual chart, when it comes to intuitive learning. As for the rest, here it is:
Stochastic Gradient Descent is really just Gradient Descent, applied to randomly selected subsets of the data. We do this because, statistically speaking, the gradient computed on a randomly sampled subset is a good estimate of the gradient over the entire data set. This allows us to work on much less data per iteration, and is thus a performance and scalability improvement over the traditional (full-batch) Gradient Descent algorithm. We can sometimes push this performance improvement to an extreme and calculate the error over a single example per iteration, but a more conservative approach is to use a "mini-batch" of a pre-determined size (a meta-parameter of the algorithm, selected by the user).
Let's quickly go over Gradient Descent, since it is at the heart of SGD. In Gradient Descent, we start from an initial point in parameter space, and we iteratively move towards some other point in parameter space where the error is minimal. At each iteration, the error is computed by plugging the current parameter values into the loss function that is being minimized and looking at its output (the "error", as it is sometimes called).
The "gradient" part in "Gradient Descent" comes from how we select the next direction in feature space to move to. At each iteration, the algorithm calculates what changes in what features will result in the steepest descent towards a low error in the next step. Here's a code example of a step in the Gradient Descent process, with 2 parameters b and m, and the learning rate meta-parameter which is user controlled (and represents how fast we want to change direction towards the direction of error minimization).
def stepGradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        # partial derivatives of the mean squared error with respect to b and m
        b_gradient += -(2/N) * (points[i].y - ((m_current * points[i].x) + b_current))
        m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
    # move against the gradient, scaled by the learning rate
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]
It then, of course, takes a step in the direction of steepest descent. For a more in-depth coverage of the GD family of optimization algorithms, I recommend this blog post: http://sebastianruder.com/optimizing-gradient-descent/index.html
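To connect this back to the stochastic variant described earlier: a mini-batch version of the same step simply computes the gradient over a small random sample of points instead of the full set. A sketch, with the point structure and batch size made up for illustration:

```python
import random
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

def stepGradientMiniBatch(b_current, m_current, points, learningRate, batch_size=4):
    # sample a mini-batch; its gradient is a statistical estimate of the full one
    batch = random.sample(points, min(batch_size, len(points)))
    b_gradient, m_gradient = 0.0, 0.0
    N = float(len(batch))
    for p in batch:
        error = p.y - (m_current * p.x + b_current)
        b_gradient += -(2 / N) * error
        m_gradient += -(2 / N) * p.x * error
    # step against the estimated gradient, scaled by the learning rate
    return [b_current - learningRate * b_gradient,
            m_current - learningRate * m_gradient]
```

Repeatedly calling this step on data generated from y = 2x + 1 converges to m close to 2 and b close to 1, just as the full-batch version would, while touching only batch_size points per iteration.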
There are some particularities to implementing this in the context of an SVM problem, but I won't get into those details as the purpose of this series is to be more of a high-level intuitive overview.
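That said, to give a flavor of what those particularities look like: a single SGD step for the (primal, linear) SVM optimizes the hinge loss plus an L2 regularization term. The sketch below is illustrative rather than taken from any particular library; lam (regularization strength) and lr (learning rate) are the usual meta-parameters:

```python
import random

def hinge_sgd_step(w, b, X, y, lam, lr):
    """One SGD step on lam/2 * ||w||^2 + hinge loss of a randomly picked example."""
    i = random.randrange(len(X))
    margin = y[i] * (sum(wk * xk for wk, xk in zip(w, X[i])) + b)
    if margin < 1:
        # inside the margin (or misclassified): pull w towards this example
        w = [wk - lr * (lam * wk - y[i] * xk) for wk, xk in zip(w, X[i])]
        b = b + lr * y[i]
    else:
        # correctly classified with room to spare: only the regularizer acts
        w = [wk * (1 - lr * lam) for wk in w]
    return w, b
```

Run for a few thousand steps on a linearly separable set, this finds a hyperplane that classifies the training points correctly, which is the same end state SMO reaches through its pairwise alpha updates.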