In part 3 of the SVM series, we saw the theoretical framework behind the solution to the optimal separating hyperplane. Here we are going to quickly discuss, at a high level, the traditional method of solving it, i.e. the Sequential Minimal Optimization algorithm, and we will also have a quick look at the more scalable/modern techniques based on the Gradient Descent "family" of algorithms.

The Sequential Minimal Optimization algorithm

In this algorithm, we start with our alphas at some initial values, and then we iteratively update them in pairs in order to approach the optimal solution. The main reason we update them in pairs is that it allows us to calculate a closed-form (analytical) solution for the best direction and magnitude by which to update the alphas in any given iteration.

Remember that the alphas act as weights for a weighted average of points on each side of the separating hyperplane. These two averages constitute the two points forming the vector w that was previously discussed. So if we have 3 support vectors on a given side of the separating hyperplane, for example, choosing the values of their alphas is the same as moving the "center point" (the weighted sum) towards an optimal position somewhere within the triangle formed by these 3 support vectors. And when you pick two alphas, you are essentially moving the optimal center point somewhere along the line segment formed by the two corresponding support vectors. Visually:

As an example of a possible optimization path: in a first step, we pick the alphas for p1 and p2, and optimize their weights relative to each other. We choose 2/3 * p1 and 1/3 * p2. p3 is still at alpha = 0, so there is no p3 component in the resulting center point. In a second step, we pick the alphas of p1 and p3. We discover that the optimal relative weight between those two is half and half.
After these 2 steps, the center point ends up in the middle of the area formed by the 3 support vectors p1, p2 and p3, because their respective alphas are now 1/3 each: the center point is 1/3 * p1 + 1/3 * p2 + 1/3 * p3.

The SMO algorithm's job is to identify the correct values of the alphas such that:

In the SMO, all we do (in a simplified way) is keep repeating the following until convergence:

1. we pick two alphas
2. we optimize the value of those two alphas while keeping all of the other ones fixed, using a closed-form solution (see more detailed pseudocode below)

Conceptually then, we keep iteratively moving the 2 center points on each side of the separating hyperplane, along the axis formed by the two selected alphas' corresponding support vectors at each iteration, until we reach the center points that best optimize the formula above.

Let's wrap up this brief overview with some pseudocode for a simplified version (i.e. minus some performance improvements that complicate things a little) of the SMO algorithm:

This YouTube video is a pretty good introduction to the intuition (and even some of the math) behind the SMO, in its simple form: https://www.youtube.com/watch?v=I73oALP7iWA

Scalable Gradient Descent-based approaches

In the era of Big Data, we've mostly abandoned the closed-form solution approach in favor of a more stochastic approach. This is because SMO becomes quite slow on larger data sets.
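The two steps above can be sketched in Python. This is the well-known simplified variant of SMO (no pair-selection heuristics: the second alpha of each pair is simply chosen at random); the function and variable names are my own, and a real implementation would add the performance improvements mentioned above:

```python
import random

def smo_simplified(X, y, C=1.0, tol=1e-3, max_passes=10):
    """Simplified SMO: X is a list of feature vectors, y a list of +1/-1 labels."""
    n = len(X)
    alphas = [0.0] * n
    b = 0.0

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def f(x):
        # the decision function, expressed through the support vector expansion
        return sum(alphas[k] * y[k] * dot(X[k], x) for k in range(n)) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = f(X[i]) - y[i]
            # only touch alpha_i if it violates the optimality (KKT) conditions
            if (y[i] * E_i < -tol and alphas[i] < C) or (y[i] * E_i > tol and alphas[i] > 0):
                j = random.choice([k for k in range(n) if k != i])  # 1. pick a second alpha
                E_j = f(X[j]) - y[j]
                a_i, a_j = alphas[i], alphas[j]
                # bounds that keep 0 <= alpha <= C and sum(alpha_k * y_k) = 0
                if y[i] != y[j]:
                    L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                if L == H:
                    continue
                eta = 2 * dot(X[i], X[j]) - dot(X[i], X[i]) - dot(X[j], X[j])
                if eta >= 0:
                    continue
                # 2. closed-form update of the pair, all other alphas held fixed
                alphas[j] = min(max(a_j - y[j] * (E_i - E_j) / eta, L), H)
                if abs(alphas[j] - a_j) < 1e-5:
                    alphas[j] = a_j
                    continue
                alphas[i] = a_i + y[i] * y[j] * (a_j - alphas[j])
                # recompute the threshold b
                b1 = b - E_i - y[i] * (alphas[i] - a_i) * dot(X[i], X[i]) \
                       - y[j] * (alphas[j] - a_j) * dot(X[i], X[j])
                b2 = b - E_j - y[i] * (alphas[i] - a_i) * dot(X[i], X[j]) \
                       - y[j] * (alphas[j] - a_j) * dot(X[j], X[j])
                if 0 < alphas[i] < C:
                    b = b1
                elif 0 < alphas[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alphas, b
```

On a small separable dataset, the returned alphas are non-zero only for the support vectors, which matches the intuition above: the pair updates keep sliding the weighted center points along one line segment at a time.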
The principle is the same as SMO, in that we iteratively modify the values of alpha to approach the optimal solution. While there are a few variations being used and researched to optimize the SVM (and large-scale machine learning algorithms in general), the fairly "tried and tested" (and common) approach is Stochastic Gradient Descent.

Before even attempting any explanation, I highly recommend this beautiful blog article that displays a few optimization algorithms (including Gradient Descent: the starting point to understanding SGD) in an interactive visual manner: http://www.benfrederickson.com/numericaloptimization/?imm_mid=0eb541&cmp=emdatanananewsltr_20161207

Nothing beats an interactive visual chart when it comes to intuitive learning. As for the rest, here it is:

Stochastic Gradient Descent is really just Gradient Descent, but applied on randomly selected subsets of the data. We do this because, statistically speaking, if the parameters are optimal for a randomly sampled subset of the data, they should be close to optimal for the entire data set. This assumption allows us to work on much less data, and is thus a performance and scalability improvement over the traditional Gradient Descent algorithm. We can sometimes push this performance improvement to an extreme and calculate the error over a single example per iteration of the algorithm, but a more conservative approach is to use a "mini-batch" of a predetermined size (i.e. a meta-parameter of the algorithm that is selected by the user).

Let's quickly go over Gradient Descent, since it is at the heart of SGD. In Gradient Descent, we start from an initial point in parameter space, and we iteratively move towards some other point in parameter space where the error is minimal. The error is computed at each iteration by taking the parameter values represented by the current position in parameter space, plugging them into the loss function that is being minimized, and looking at the output of the function (the "error", as it is sometimes called).
The "gradient" part in "Gradient Descent" comes from how we select the next direction in parameter space to move to. At each iteration, the algorithm calculates what changes to which parameters will result in the steepest descent towards a low error in the next step. Here's a code example of a step in the Gradient Descent process, with 2 parameters b and m, and the learning rate meta-parameter which is user controlled (and represents how big a step we want to take in the direction of error minimization):

def stepGradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        b_gradient += -(2/N) * (points[i].y - ((m_current * points[i].x) + b_current))
        m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

It then, of course, takes a step in the direction of steepest descent. For a more in-depth coverage of the GD family of optimization algorithms, I recommend this blog post: http://sebastianruder.com/optimizinggradientdescent/index.html

There are some particularities to implementing this in the context of an SVM problem, but I won't get into those details, as the purpose of this series is to be more of a high-level intuitive overview.
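To turn the step above into mini-batch SGD, the only change is which points each step sees. A rough sketch (my own variable names; here points are (x, y) tuples rather than the point objects used above):

```python
import random

def step_gradient(b_current, m_current, batch, learning_rate):
    # same update rule as stepGradient above, but computed on whatever subset we pass in
    b_grad, m_grad = 0.0, 0.0
    n = float(len(batch))
    for x, y in batch:
        error = y - (m_current * x + b_current)
        b_grad += -(2 / n) * error
        m_grad += -(2 / n) * x * error
    return b_current - learning_rate * b_grad, m_current - learning_rate * m_grad

def sgd(points, learning_rate=0.01, batch_size=5, iterations=5000):
    b, m = 0.0, 0.0
    for _ in range(iterations):
        # the "stochastic" part: each step only sees a random mini-batch
        batch = random.sample(points, min(batch_size, len(points)))
        b, m = step_gradient(b, m, batch, learning_rate)
    return b, m
```

With batch_size equal to the dataset size this degenerates back into plain Gradient Descent, and with batch_size = 1 you get the "extreme" single-example variant mentioned earlier.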
In Part 2 we've seen a description of the optimization problem to solve in order to train our SVM (in other words, in order to find the ideal separating hyperplane). In this part, we will start looking into how exactly we are going to solve this problem. In textbooks you will find a lot of mathematical derivations and proofs here; I will try to avoid that, since my focus is on intuitive understanding rather than mathematical rigor. So let's start.

The first breakthrough that was made in the development of support vector machines was the realization that if you derive the Lagrangian of the optimization problem to solve, most parameters disappear and you are left with a simpler form, which also happens to open up some interesting doors (namely, the Kernel trick, which we will discuss in a later article).

Lagrange multipliers and optimization:

I will start by providing a relatively simple explanation of the Lagrangian technique, how it works and why. Then we will skip to the result (the steps of the Lagrangian derivation can be found in many textbooks if you are interested).

A convex function is a function that graphically looks like this:

Finding the minimum for these functions can be done by taking the derivative of the function and finding the point at which the derivative equals 0. This is because at the bottom of the convex function, the value momentarily ceases to change (thus a derivative of 0) before it reverses direction and starts going back up.

But remember that we have a slightly more complicated problem to solve here: we have a set of constraints. This is where Lagrange multipliers come in. The idea of Lagrange multipliers is as follows: Suppose you have a constraint: for all i. And suppose your objective function is: The Lagrangian is simply the objective function plus your constraints, each formulated in such a way that f(x) = 0. For each such constraint term, we also must add a coefficient, called the Lagrange multiplier.
Formulating the constraint in f(x) = 0 form: So for this example, the Lagrangian is:

The point at which the derivative of L equals 0 corresponds to the lowest point of the convex function where the constraint also happens NOT to be violated. In other words, when you take the derivatives of L, it's with respect to the objective function's variables AND the newly introduced Lagrange multipliers (the alphas). These derivatives must all be set to 0.

It's a bit like a game where you are asked to pick the values of the objective function's variables to maximize your score: the smaller the value of the function, the more points you win (for a minimization problem). But at the same time, your opponent is allowed to choose the Lagrange multiplier arbitrarily to maximize the value of the function, thereby making you lose points. The optimal play is clearly to set w (in this example) such that it is insensitive to the value of alpha (in mathematical terms: such that the derivative with respect to alpha is 0).

To summarize then, the idea is to get as close as possible to the point where the derivative of the objective function equals 0, while making sure that the sensitivity to the Lagrange multiplier is also 0.

The Support Vector Expansion:

Ok, so now we apply the same technique to our SVM optimization problem: calculate the Lagrangian, take the partial derivatives of it with respect to w and b (the parameters to optimize), set those derivatives to 0 and substitute them back into the Lagrangian for the reasons just explained (this is where we skip the math), and we get:

Lo and behold, the parameters w and b have entirely vanished, and the whole problem is reduced to finding the right values of the Lagrange multipliers alpha!
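To make the min-max "game" described earlier concrete, here is a tiny worked example of the Lagrangian mechanics. The problem itself (minimize w^2 subject to w >= 1) is my own illustration, not the SVM objective:

```python
# Toy problem: minimize f(w) = w^2  subject to  w >= 1.
# Constraint in "= 0" form (when active): w - 1 = 0.
# Lagrangian: L(w, alpha) = w^2 - alpha * (w - 1), with alpha >= 0.
# Stationarity: dL/dw = 2w - alpha = 0  and  dL/dalpha = -(w - 1) = 0
# give w = 1 and alpha = 2.

def brute_force_constrained_min():
    # sanity-check the Lagrangian answer by brute force over feasible w values
    best = None
    for i in range(-3000, 3001):
        w = i / 1000.0
        if w >= 1.0 and (best is None or w * w < best * best):
            best = w
    return best

w_star, alpha_star = 1.0, 2.0
assert 2 * w_star - alpha_star == 0   # dL/dw = 0 holds at the solution
```

The unconstrained minimum of w^2 sits at w = 0, which violates the constraint; the multiplier is exactly what pulls the solution back onto the constraint boundary at w = 1, matching the brute-force search.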
(a few additional mathematical tricks were used to get to this result, but the Lagrangian technique is the most fundamental part of it)

Furthermore, from this result, we can derive (see textbooks) the "Support Vector Expansion":

Let's think a bit about what that means, and why it makes sense. The math behind this result is undeniable, but it doesn't really help you understand what is going on.

First of all, what the Support Vector Expansion is saying is that you can express the separating hyperplane entirely in terms of a linear combination of some data points. The data points that play a role in determining the hyperplane are the ones for which the corresponding alphas are non-zero. These happen, in fact, to be the Support Vectors! (we'll see later how the alphas are picked)

Ok, but how does it make sense that the w vector (or, equivalently, the decision boundary/hyperplane) can be entirely calculated from a few data points? To understand that, let's look at the simplest case: we have 2 data points to classify. The best hyperplane in this scenario is perpendicular to the line segment (or vector, if you prefer) formed from point a to point b. It might not be visually obvious that this line is what best maximizes the separating margin, but if you think of Pythagoras' theorem, it should be "mathematically evident" that in this case the distance from each point to the decision boundary is larger than, say, if the boundary were placed horizontally or vertically between the points:

(the width of the margin, represented by the line segment c, is, according to Pythagoras' theorem, greater than either of the line segments that would be formed, a or b, if we placed the separating hyperplane horizontally or vertically between the two points)
Ok. I'm going to pretend this was convincing enough and keep going. The next logical step is to realize that, in this scenario, w is equal to b - a. It's a simple fact of linear algebra that the vector that connects two points is obtained by subtracting one point from the other. We've just shown, in the simplest case, why the hyperplane can be described as the subtraction of a negative-labeled support vector from a positive-labeled support vector.

To generalize this example to cases where you have more than 1 support vector on each side of the decision boundary, realize that the w vector is then the weighted average of the support vectors on one side minus the weighted average of the support vectors on the other side. So you are still ultimately producing the w vector as a subtraction of two points across the decision boundary, except that now the 2 points each represent the "center of gravity" on each side (metaphorically speaking). So there you have it: by summing and subtracting support vectors, we have expressed, in the general case, the w vector.

The next step:

Now we have expressed the optimization problem only in terms of alphas. We understand that the weight vector w isn't strictly necessary: you can simply express the SVM's separating hyperplane as a function of a linear combination (or weighted vote) of carefully selected data points, the support vectors. Furthermore, we understand that this selection mechanism works by setting the alpha values correctly. In summary, we have reached the conclusion that the entire business of SVMs is about finding the vectors that contribute the most to the determination of the decision hyperplane (i.e. the Support Vectors) and then "drawing" a vector w through them.

The next step, then, is to figure out how to identify the alpha values. Traditionally this is done through an algorithm called Sequential Minimal Optimization (or SMO for short), developed by John Platt. We will get into how this algorithm works in a later article.
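The two-point case above is easy to check in code. The coordinates here are my own made-up example:

```python
# Two points, one per class, as in the simplest case above.
a = [1.0, 1.0]   # the "-" labeled point
b = [3.0, 3.0]   # the "+" labeled point

# w = b - a: the vector connecting the two support vectors
w = [b[0] - a[0], b[1] - a[1]]

# place the decision boundary halfway between them:
# choose the bias so that <w, x> + bias = 0 at the midpoint
mid = [(a[0] + b[0]) / 2, (a[1] + b[1]) / 2]
bias = -(w[0] * mid[0] + w[1] * mid[1])

def f(x):
    return w[0] * x[0] + w[1] * x[1] + bias

# f(b) comes out positive and f(a) negative, so this w,
# built purely from the two data points, separates them
```

The boundary defined by f(x) = 0 is perpendicular to the segment from a to b and passes through its midpoint, exactly as described above.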
Note also that with the era of Big Data, there is ongoing research into more scalable algorithms (Stochastic Gradient Descent, L-BFGS, etc.) to numerically determine these optimal alphas.

Ok, you read Part 1 and you understand how to draw a hyperplane. You're now officially a machine learning expert. The rest is superficial. But let's continue building this support vector machine anyway.

The first step is to find the parameters of the hyperplane (i.e. the parameters w and b) that correspond to the best hyperplane we could draw in order to properly separate the "+" and "-" observations. How do we define the "best" separation? While writing this article I stumbled upon a blog post that explains the answer to this question so well that I couldn't do it better. So I will redirect you to that post; there is no point in me redoing the work: http://www.svmtutorial.com/2014/11/svmunderstandingmathpart1/

Quick summary: the best separating hyperplane is the one that sits right in the middle of the "gap" between the "+" and "-" support vectors. By definition, the margin is the distance between that separating hyperplane and the closest data points on each side (i.e. the support vectors). Therefore, we can define the SVM problem (i.e. finding the best separating hyperplane) as a constrained optimization problem:

1) Maximize the separating margin. Since the margin is the distance between the closest data points and the separating hyperplane, and we're trying to get the hyperplane to be as far as possible from these closest data points on each side, the mathematical equivalent of this is to maximize the margin.

2) But because we don't want to maximize the margin past the closest data points, we need to specify a constraint to make it stop at the support vectors. In other words, we need to tell the maximization algorithm when to stop.
Mathematically, this means that we minimize the function:

(1/2) * ||w||^2

subject to the constraint:

y_i * (<w, x_i> + b) >= 1, for all i

Let's first look at the objective function: we're essentially minimizing w (and for the moment let's ignore the less relevant stuff around it). If you remember the conclusions from Part 1 of this series, minimizing w is the same as maximizing the separating margin. Also, to be clear, because w is a vector, when I talk about minimizing w, what we're doing is trying to minimize the length of the weight vector as a whole. The norm of w, written ||w||, is:

||w|| = sqrt(w1^2 + w2^2 + ... + wn^2)

where the w's are the components of the w vector for each dimension. In other words, in a two-dimensional chart, w1 is the x-axis component, and w2 is the y-axis component.
The other things around the w are there as mathematical devices to simplify derivatives later on; they don't contribute much to the intuitive understanding of what is going on. As for the constraint, it's saying that all vectors in the dataset should be correctly classified: the decision function should output >= +1 for the "+" class, and <= -1 for the "-" class. Another way of putting it is that there should be no data points within the gap formed by the margin's edges.

Alright, we've successfully formulated the optimization problem that will allow us to determine the ideal separating hyperplane for a given dataset, and hopefully we've understood it as well. Stay tuned for Part 3, where we will begin explaining how to solve this optimization problem. If you want a deeper look into the math of the optimization problem we just described, I highly recommend this blog post, by the same author as the previous blog post I linked to.

What follows in this series of articles is an explanation of Support Vector Machines that focuses on intuitive understanding, rather than purely mathematical derivation. The target audience is those who have a basic knowledge of machine learning vocabulary, but aren't necessarily Ph.D.s in math.

SVMs are binary classifiers (although there exists Support Vector Regression, as well as ways to classify multinomial data, but we won't go into that, at least not yet) whose entire purpose is to draw a line (when the feature set is 2-dimensional; a hyperplane if the feature set is higher-dimensional). This will be the line that separates the "+" class observations from the "-" class observations (we'll call them "+" and "-", but technically we use +1 and -1 as the numerical labels). The equation for a line or hyperplane, as typically seen in machine learning textbooks, is:

f(x) = <w, x> + b

This should be obvious if you remember high school math, because a 2D line is y = mx + b, so if we have more than 2 dimensions it's y = m1x1 + m2x2 + ...
+ b, which is more concisely expressed as y = <w, x> + b, the dot product of all the w's with all the x's.

But here lies a first source of confusion already. Often, next to that formula, textbooks will show a 2D plot with a decision boundary line that separates the "+" observations from the "-" observations. Well, plot twist: that decision boundary that you see is NOT w, nor is it <w, x> + b. w is actually a vector that is orthogonal to the decision boundary. The decision boundary that you see in those graphs is, rather, the set of all values of x that satisfy the following equation:

<w, x> + b = 0

As a reminder, also keep in mind that the y axis in those graphs is NOT the y or f(x) of the f(x) = <w, x> + b formula. The y axis is in fact the second x variable, because the assumption behind those graphs is always that there are two features, x1 and x2.

So what is y = <w, x> + b? It's the decision plane: if y > 0, it classifies the observation with a "+" label; if y < 0, it classifies it with a "-" label. It's the function you will be using to perform classification. But to truly visualize this function, it is necessary to produce a 3D version of the previous 2D graph, where the Z axis is the f(x) of the formula. The black line across the center is the decision boundary, where the values of <w, x> + b give 0. Before that line, the values of y are below zero, hence classified as "-", and beyond that line, the values are above zero, hence classified as "+".

Before we delve further into the roles of w and b in that picture, we will introduce the notions of support vectors and separating margin. The Support Vectors are those observations that are closest to the decision boundary, on each side of it. The separating margin is the distance from the decision boundary to these support vectors. As we estimate the parameters of the SVM, we will choose them so that the y value for each of these support vectors is +1 or -1 (depending on whether they are "+" examples or "-" examples, respectively).
The role of b in the function y = <w, x> + b is to decide the positioning of the decision boundary. In other words, it decides where the value of y crosses 0. Increasing or decreasing b then has the effect of pulling the boundary back into the "-" region or pushing it further into the "+" region.

The role of w is to control the "scaling", i.e. the width of the margin. In other words, it controls how big the step in the x variables needs to be in order to get from the boundary to a y value of +1 or -1. To understand what exactly I mean by that, it suffices to look at two examples: one where the w's are all equal to 1, and one where the w's are all equal to 2. Let's have x1 = 1 and x2 = 1, for simplicity, and b = 0:

1) with w = (1, 1): f(x) = 1 * 1 + 1 * 1 = 2
2) with w = (2, 2): f(x) = 2 * 1 + 2 * 1 = 4

So in the first scenario, with w = (1, 1), the x values gave a classification output of +2, while in the second scenario, the same x values gave y = +4. This means that the width of the margin (i.e. the distance from the decision boundary required to reach +1 or -1) in the first scenario was represented by the x values (0.5, 0.5), while the x values (0.25, 0.25) are enough in the second scenario to reach f(x) = +1. The thing to remember here is that the greater the values of w, the smaller the margin.

Ok, that should cover the introductory theory. You're ready to delve into the core of Support Vector Machines. Stay tuned for Part 2.
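The arithmetic in the two scenarios above is easy to check directly; a quick sketch:

```python
def f(w, x, b=0.0):
    # the decision function f(x) = <w, x> + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

print(f([1, 1], [1, 1]))        # scenario 1: w = (1, 1) -> 2
print(f([2, 2], [1, 1]))        # scenario 2: w = (2, 2) -> 4
print(f([1, 1], [0.5, 0.5]))    # with w = (1, 1), x = (0.5, 0.5) reaches +1
print(f([2, 2], [0.25, 0.25]))  # with w = (2, 2), x = (0.25, 0.25) already reaches +1
```

Doubling w halves the step in x needed to reach f(x) = +1, which is exactly the "greater the w, smaller the margin" rule.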
Quantum Computing is probably the "next big thing" in computer science, and it will certainly bring AI and Machine Learning to entirely new levels of efficiency and capability. You could even argue that quantum computing (or, at the very least, a new form of fundamentally more powerful computation than the electronic switch) is a prerequisite to achieving the goal of Artificial General Intelligence. With IBM offering access to their quantum cloud computing, I got curious and started looking into this technology more "seriously". I stumbled upon a great website that essentially simulates a vector of QuBits (the quantum world's equivalent of a traditional bit) and quantum gate operations on them. It lets you play with them and get an intuitive understanding of how quantum computing works. That being said, the explanations on the website are quite minimalist, and it's not entirely easy to understand what is going on, or what you are looking at. I will explain here what I figured out from spending about 10 minutes playing with the simulator, and it revolves around the Hadamard gate.

Disclaimer: I have very little knowledge of the world of quantum computing, and what follows are essentially my baby steps into this world, trying to figure basic things out from scratch.

When you open the Hadamard Gate tutorial, you start with a 3D view that looks like this:

The first thing to understand is what this 3D image represents. It's a visual representation of the state of the array of QuBits in the simulated quantum register. If we have 8 QuBits (which we do in this example), this means we have a total of 2^8 possible QuBit states, i.e. 256. This can be represented visually in a 16 by 16 matrix, which is what we see. The areas that are elevated indicate a possible state. To understand what I mean by "possible state", we have to understand how a QuBit works. A single QuBit, rather than being 0 or 1, can be both at the same time.
That is, it's possible to apply an operation on a QuBit (as it turns out, this operation is the Hadamard Gate, but we'll come back to that) in order to put it into 2 possible simultaneous states. In quantum lingo: a measurement of the superposed QuBit will have equal probabilities of becoming 1 or 0 (for more theoretical background, look into the notions of quantum measurement and quantum wave function collapse). In effect, the Hadamard Gate is Schrodinger's Cat applied to bits: it makes the bits both dead and alive...

So, if we come back to the 3D model above, there is only one elevated state: the bottom left corner, which represents the initial state of 0. But now we're going to "step into" the Hadamard Gate tutorial code and run the first gate operation, on bit 0. Suddenly, the first bit has two simultaneous possible values: 0 and 1. We see that the 3D model now looks like this:

Which is essentially 2 bars in the bottom left corner, representing the states 0 and 1, respectively. If we keep stepping further into the code, the second QuBit is "Schrodingered" and we see that the 3D model now displays an elevation on all of the values from 0 to 3, inclusively. When we Hadamard the 3rd QuBit, it's all values from 0 to 7 (because 2^3 = 8 possible states, in case that wasn't obvious). The "for loop" proceeds to Hadamard all 8 bits, and soon enough the entire 3D model gets elevated, forming a cube (this indicates that all 256 values are possible, or simultaneously true).

As we step through the next loop (the one that will reapply Hadamard on each QuBit, in the same order as before), we'll discover another property of the Hadamard gate. So let's step into the next loop and re-Hadamard the first bit. Now, the 3D model seems to be sliced up:

The reason for this is simple, though possibly not too intuitive at first: reapplying the Hadamard gate on a QuBit will reset it to its original value.
So what this means is that we reset the first QuBit to 0, thus removing all odd-numbered values from the possible quantum states. This is the reason for the sliced-up 3D model.

But why does it reset the bit? One helpful way to think about the Hadamard gate is as a rotation within a spherical model of the QuBit called the Bloch Sphere:

Witness my amazing MS Paint skills. In this sphere, you have the x and y axes around the "equator", and the z axis that goes vertically. The z axis value is essentially what decides the state of the QuBit. If the state is at the top of the sphere (z value = 1, x and y values = 0), then the QuBit's value is 1. If the state is at the bottom of the sphere, the value is 0. What can happen, however, is that the state can be somewhere along the equator of the Bloch sphere, which represents the state of superposition (both 0 and 1) that the Hadamard gate can put it in.

Visually, if we start from a QuBit value of 1 (top of the sphere) and we apply the Hadamard Gate operator, what we are doing is a rotation of 180 degrees around the diagonal axis formed by x and z. Or, an equivalent operation: first rotate by 90 degrees about the y axis, then rotate by 180 degrees about the x axis. Visually, we get the following:

The state is now at the equator, which means that the value is simultaneously 0 and 1. If we apply the Hadamard gate again, we get back to the original state. Similarly, if we started at value 0 and applied the Hadamard gate twice, we would be back at 0.
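The "apply it twice and you're back where you started" property is easy to verify numerically: the Hadamard gate is just a small matrix acting on the two amplitudes of a QuBit. A sketch in plain Python (no quantum library needed):

```python
import math

# the Hadamard gate as a 2x2 matrix acting on the amplitude vector [amp_0, amp_1]
H = [[1 / math.sqrt(2),  1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

def apply_gate(gate, state):
    # plain 2x2 matrix-vector multiplication
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]

zero = [1.0, 0.0]                  # QuBit starting in state 0
superposed = apply_gate(H, zero)   # each squared amplitude becomes 0.5:
                                   # measuring gives 0 or 1 with equal probability
back = apply_gate(H, superposed)   # applying H a second time restores state 0
```

H is its own inverse (H * H is the identity), which is exactly the reset behavior seen in the sliced-up 3D model.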
Hopefully this will save some time for someone curious about Quantum Computing. If you're interested in the other quantum gates, stay tuned, who knows...

I always felt that the concept of cross-validation is a bit of a hack. It's a fundamentally random process, and not only does that mean that you will get different results from one run to the next, but it also means that you could "accidentally" end up with training or validation sets that aren't representative of the full dataset. This, in turn, can lead to various specific problems, which I will get into below. That being said, statistical tests of significance are a complicated world to navigate when you cannot assume normality. For the sake of simplicity, then, I tend to fall back on cross-validation. Here are some hacks that I apply to the hack that is cross-validation, in order to make it a bit more reliable.

a) The problem of "extreme" values

The other day I used cross-validation on a dataset that contained a handful of extreme values for the predictor (in the 50,000 to 500,000 range) while the rest of the dataset was concentrated in the 0 to 5,000 range. To be clear, these are not outliers in the regression, i.e. their Cook's distance isn't significantly different from the rest of the data. The result was that, due to the random nature of the split between the training and validation sets, these few extreme values all ended up in the validation set. The training data was modeled with some polynomial that worked well within the training data. But then, when it came to validating the model, the polynomial extrapolated to ridiculous values. The cross-validated Mean Absolute Error was horrible. Here is what it looked like:

Where the blue dots are the raw data (due to the scale being inflated by the bad model, you don't see many dots, but in fact there are several data points on top of each other) and the orange line is the fit of the polynomial model.
Another similar situation gave a graph like this one:

Ok, so I overfitted and cross-validation told me about it... how was that broken? Cross-validation did what it was intended to do, no? Yes, it did. The problem, however, is that I failed to model the real underlying relationship, because the training data didn't contain crucial data points. And this failure is a direct result of splitting the data for cross-validation purposes. In other words, I overfitted because of cross-validation, and then it came back and slapped me on the wrist for it!
In fact, if we take a step back and look at what happened, we see that I modeled some dataset over a given domain (0 to 5,000-ish), and then I validated it by estimating values beyond the domain for which I built the model. This is extrapolation. It is a basic fact of statistics that you cannot extrapolate beyond the domain of a given model (and expect it to give sensible results). Therefore, it follows that one should never produce a validation set containing values of the independent variables that are above or below the domain of the training data set. And, as far as I know, this isn't a caveat commonly mentioned in statistical learning theory textbooks (or maybe I missed the small print!).

My solution to this problem was to sort my dataset along the x-axis values, take the top and bottom 5th percentiles of the data (the choice of the percentile value is itself arbitrary), and include them immediately in my training set. The validation set is then selected randomly from the remaining data. This way I guarantee that my extreme values belong to my training set, not my validation set, and that, therefore, the validation process will not extrapolate beyond the trained domain.

I presented at the Montreal Apache Spark meetup on Wednesday. The talk was about automatic feature generation, and I went a little deeper into Amdahl's Law and parallelism issues (related to a previous post about Amdahl's law). In particular, we looked at MLlib's LASSO regression and its degree of parallelism.
The slides are available here.

I recently spent a lot of time trying to figure out how to run Spark code on a YARN cluster from my Scala code (a long-running service). There seem to be a lot of questions and problems about this on the Internet, but few answers and solutions.
I found a GitHub example by Mahmoud Parsian, which has a code example for doing that with the YARN Client class. The code I ended up using for submitting the SparkPi example task is as follows:

String[] args = new String[]{
    "--name", "testSparkPi",
    "--driver-memory", "1000M",
    "--jar", "/usr/lib/spark/lib/spark-examples.jar",
    "--class", "org.apache.spark.examples.SparkPi",
    "--arg", "10", // 10 decimals, the argument to SparkPi
    "--arg", "yarn-cluster"
};
Configuration config = new Configuration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkConf = new SparkConf();
sparkConf.set("spark.yarn.jar", "hdfs:///user/souellette/spark-assembly.jar");
ClientArguments clientArgs = new ClientArguments(args, sparkConf);
Client client = new Client(clientArgs, config, sparkConf);
client.run();

Notice that I set the spark.yarn.jar parameter (which isn't done in Mahmoud Parsian's example). I found that adding this line was necessary in my case, as well as uploading the spark-assembly.jar to the HDFS path I specified (hdfs dfs -copyFromLocal /usr/lib/spark/lib/spark-assembly.jar /user/souellette/spark-assembly.jar). I'm not sure why it was required. Without it, I was getting an error in the cluster-side execution logs: "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging".

A few other issues I ran into:
My presentation about spark-timeseries in early January at the Big Data Montreal meetup. The presentation slides in PDF format are available here.
When I first started looking into distributed processing, I thought it was magic. I thought I could write an algorithm on top of Spark or some similar framework, and then, if I could afford a million CPU cores, my algorithm would run almost a million times faster (minus a small "administrative" overhead for coordinating the distributed processing). That's still true, in theory. That is, if you can write code that is 100% parallelizable, then the above holds. In practice, though, there are (almost) always bottlenecks, and that forces us to write code that is maybe 95% parallelizable... or 90%... or 70%... What that means, according to Amdahl's law, is that there is an upper theoretical limit to how much speedup additional cores can provide. Namely, if P is the parallelizable fraction of the program and N is the number of cores:

Speedup(N) = 1 / ((1 - P) + P / N)

As N grows, the speedup approaches 1 / (1 - P), so the maximum comes up pretty quickly. Even if you can manage to write an algorithm that is 95% parallelizable, the best speedup you can expect is 20 times, and approaching that comes at the cost of about 2,000 CPU cores!
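A quick numerical check of Amdahl's law makes the ceiling concrete (the class and method names here are illustrative, not from the post):

```java
// Amdahl's law: P = parallelizable fraction, N = number of cores.
public class Amdahl {
    static double speedup(double p, int n) {
        // The serial fraction (1 - p) never shrinks; only the parallel part divides by n.
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 100, 1000, 2000}) {
            System.out.printf("P=0.95, N=%d -> speedup %.2fx%n", n, speedup(0.95, n));
        }
        // The limit as N grows is 1 / (1 - P), i.e. 20x for P = 0.95.
    }
}
```

For P = 0.95, going from 1,000 to 2,000 cores buys almost nothing: the speedup creeps from about 19.6x to about 19.8x, against the hard limit of 20x.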
Which brings me to the question of designing highly parallelizable algorithms. At first, I was writing code that parallelized at a very low level (for example, calling Spark MLlib functions to do regressions). The problem with this approach is that it tends to introduce a lot of unscalable bottlenecks. The LassoWithSGD implementation in MLlib is itself far from 100% parallelizable, so just by calling it you shoot yourself in the foot if you need near-100% parallelization (case in point, a great question/answer on Stack Overflow by someone who clearly went through the same disillusionment I did: http://stackoverflow.com/questions/33589449/spark-mllibs-lassowithsgd-doesnt-scale).

So what would the alternative be? To give an example, let's take the case of Nectarine, the software I'm developing. Essentially, Nectarine produces derived features from an original set of features, so we end up performing regressions about 120 times if we generate 120 features for a given dataset. In this case, instead of calling LassoWithSGD, we could do a mapPartitions over all of the features, and within each "thread" run a single-threaded version of the Lasso regression. If we do that, we have code that is much more parallelizable.

The problem with the latter solution is that it makes a crucial assumption: the data (for a feature and the target variable) fits inside a single executor's memory. If that isn't true, we have to fall back on the less parallelizable approach of calling LassoWithSGD and having our features partitioned across machines. Clearly, the optimal solution depends on the cardinality and the dimensionality of the dataset to analyze. Because in my specific case I cannot make assumptions about the size of the dataset I will have to analyze, I came up with what I call a "multi-lane execution strategy".
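The "parallelize across features, single-threaded fit per feature" pattern can be sketched locally without a Spark cluster. In Spark this would be a mapPartitions over the features; here a Java parallel stream plays that role, and a simple least-squares slope stands in for the Lasso fit (all names in this sketch are my own):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// One independent, single-threaded fit per feature; the features themselves
// are processed in parallel. The OLS slope is a placeholder for the Lasso
// regression described in the text.
public class PerFeatureFit {
    // Single-threaded OLS slope of y on x.
    static double slope(double[] x, double[] y) {
        double mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.length; my /= y.length;
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - mx) * (y[i] - my);
            den += (x[i] - mx) * (x[i] - mx);
        }
        return num / den;
    }

    // Map each feature index to its fitted slope, in parallel across features.
    static Map<Integer, Double> fitAll(List<double[]> features, double[] target) {
        return IntStream.range(0, features.size()).parallel().boxed()
                .collect(Collectors.toMap(i -> i, i -> slope(features.get(i), target)));
    }
}
```

Because each fit touches only its own feature's data, there is no cross-fit coordination, which is exactly what makes this lane close to 100% parallelizable.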
The idea is as follows:

1) Determine the cardinality (number of observations) and dimensionality (number of variables or features to analyze) of your dataset.

2) If the dataset qualifies as "small" (cardinality and dimensionality are both below a certain threshold, to be identified empirically), run everything locally, using a traditional 3rd-party library for example. You'll avoid the latency cost of setting up parallelized processing.

3) If the dataset qualifies as "medium" (dimensionality is high enough to justify parallelization, and the cardinality is small enough to fit inside the memory of an executor), parallelize the whole algorithm at once. In other words, the parallelization is done at a very high level, encompassing everything. The pattern is something along the lines of mapPartitions(features => applyAlgorithm(features)) if you want to use Spark. With this approach you will achieve maximum "parallelizability", and thus maximum performance.

4) If the dataset qualifies as "big" (as in Big Data: the cardinality is such that even a single feature/target variable pair doesn't fit in executor memory), use the full-blown Spark coding where every data structure is an RDD, and at no point in time is the entire dataset held in driver memory. This implies calling things such as LassoWithSGD, and suffering the consequences eloquently described in the answer to the above Stack Overflow link: "Problem is that you have to aggregate data locally, serialize, transfer to the driver, deserialize and reduce locally to get a final result after each step. Then you have compute new theta, serialize send back and so on. All of that can be improved by a proper usage of mini batches and some further optimizations but at the end of the day you are limited by a latency of a whole system. It is worth noting that when you increase parallelism on a worker side you also increase amount of work that has to be performed sequentially on a driver and the other way round. One way or another the Amdahl's law will bite you."
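The dispatch between the three lanes can be sketched as below. The thresholds and the strategy names are illustrative assumptions on my part, not values from the post; in practice they would be identified empirically, as described above.

```java
// A sketch of the multi-lane execution strategy dispatch.
public class MultiLane {
    enum Strategy { LOCAL, PARALLELIZE_ACROSS_FEATURES, FULLY_DISTRIBUTED }

    static Strategy choose(long cardinality, long dimensionality,
                           long smallCard, long smallDim, long executorMemRows) {
        if (cardinality <= smallCard && dimensionality <= smallDim) {
            return Strategy.LOCAL;                       // "small": skip parallelization overhead
        }
        if (cardinality <= executorMemRows) {
            return Strategy.PARALLELIZE_ACROSS_FEATURES; // "medium": one feature fits per executor
        }
        return Strategy.FULLY_DISTRIBUTED;               // "big": every data structure is an RDD
    }
}
```

The only inputs to the decision are the dataset's cardinality and dimensionality, which are cheap to compute up front before committing to an execution plan.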
Author: Simon Ouellette
