When I first started looking into distributed processing, I thought it was magic. I thought I could write an algorithm on top of Spark or some similar framework, and then if I could afford a million CPU cores, my algorithm would run almost a million times faster (minus a small "administrative" overhead for coordinating the distributed processing).
That's still true, in theory: if you can write code that is 100% parallelizable, the above really does hold.
In practice, though, there are (almost) always bottlenecks, and that forces us to write code that is maybe 95% parallelizable... or 90%... or 70%... What that means, according to Amdahl's law, is that there is a theoretical upper limit on the speedup you can get, no matter how many cores you throw at the problem. Namely:
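In its usual form, Amdahl's law says that if a fraction p of your program can be parallelized and you run it on n cores, the overall speedup is

S(n) = 1 / ((1 - p) + p / n)

which, no matter how large n gets, can never exceed 1 / (1 - p).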
As we can see, that ceiling kicks in pretty quickly. Even if you manage to write an algorithm that is 95% parallelizable, the best speedup you can ever get is 1 / 0.05 = 20 times, and just getting close to that limit already requires about 2000 CPU cores!
Which brings me to the question of designing highly parallelizable algorithms. At first, I was writing code that parallelized at a very low level (for example, calling Spark MLlib functions to do the regressions). The problem with this approach is that it tends to introduce a lot of unscalable bottlenecks. LassoWithSGD in MLlib is itself far from 100% parallelizable -- so just by calling it you shoot yourself in the foot if you need near-100% parallelization (case in point, a great question/answer on Stack Overflow by someone who clearly went through the same disillusionment I did: http://stackoverflow.com/questions/33589449/spark-mllibs-lassowithsgd-doesnt-scale).
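To make the contrast concrete, here is roughly what the low-level approach looks like. This is a minimal sketch only: the shape of the input RDD, the fitOneFeature helper and the hyperparameter values are illustrative placeholders, not Nectarine's actual code.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoModel, LassoWithSGD}
import org.apache.spark.rdd.RDD

// One regression per derived feature: the rows stay distributed across the
// cluster, so all the parallelism lives *inside* the call to LassoWithSGD.
def fitOneFeature(rows: RDD[(Double, Array[Double])]): LassoModel = {
  val labeled = rows.map { case (target, predictors) =>
    LabeledPoint(target, Vectors.dense(predictors))
  }
  // numIterations, stepSize and regParam are illustrative values only
  LassoWithSGD.train(labeled, 100, 1.0, 0.01)
}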
So what would the alternative be? To give an example, let's take the case of Nectarine, the software I'm developing. Essentially, Nectarine produces derived features from an original set of features, so if we generate 120 derived features for a given dataset, we end up running about 120 regressions.
In this case, instead of calling LassoWithSGD we could do a mapPartitions over all of the features, and within each partition (each "thread", if you will) run a single-threaded version of the Lasso regression. If we do that, the code becomes much more parallelizable.
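A rough sketch of that idea, assuming the data has already been reshaped so that each partition holds the complete values for one or more feature/target pairs; FeatureData and fitLassoLocally are placeholders here, with the latter standing in for any single-threaded Lasso implementation.

import org.apache.spark.rdd.RDD

// One element per derived feature: the feature's values and the target's values.
case class FeatureData(name: String, x: Array[Double], y: Array[Double])

// Placeholder for a single-threaded Lasso fit returning the coefficient.
def fitLassoLocally(x: Array[Double], y: Array[Double]): Double = ???

def fitAllFeatures(features: RDD[FeatureData]): RDD[(String, Double)] =
  features.mapPartitions { iter =>
    // Each task fits its features sequentially; the tasks themselves run in
    // parallel across the cluster, one per partition.
    iter.map(f => (f.name, fitLassoLocally(f.x, f.y)))
  }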
The problem with the latter solution is that it makes a crucial assumption: the data (for a feature and the target variable) fits inside a single executor's memory. If that isn't true, we have to fall back on the less parallelizable approach of calling LassoWithSGD, with the data for each feature partitioned across machines.
Clearly, the optimal solution depends on the cardinality and the dimensionality of the dataset to be analyzed. Because in my specific case I cannot make assumptions about that size, I came up with what I call a "Multi-lane execution strategy". The idea is as follows (a rough sketch of the dispatch logic comes after the list):
1) Determine the cardinality (number of observations) and dimensionality (number of variables or features to analyze) of your dataset.
2) If the dataset qualifies as "small" (cardinality and dimensionality are both below certain thresholds to be determined empirically), run everything locally using a traditional single-machine 3rd party library. You'll avoid the latency cost of setting up distributed processing.
3) If the dataset qualifies as "medium" (dimensionality is high enough to justify parallelization, and cardinality is small enough that the data for a feature/target pair fits inside the memory of a single executor), parallelize the whole algorithm at once. In other words, the parallelization is done at a very high level, encompassing everything. In Spark, the pattern is something along the lines of:

featuresRdd.mapPartitions(features => applyAlgorithm(features))

With this approach you achieve maximum "parallelizability", and therefore maximum performance.
4) If the dataset qualifies as "big" (as in Big Data -- the cardinality is such that even a single feature/target variable pair doesn't fit in executor memory), use the full-blown Spark approach where every data structure is an RDD and at no point in time is the entire dataset held in driver memory. This implies calling things such as LassoWithSGD, and suffering the consequences eloquently described in the answer to the Stack Overflow question linked above:
"Problem is that you have to aggregate data locally, serialize, transfer to the driver, deserialize and reduce locally to get a final result after each step. Then you have compute new theta, serialize send back and so on.
All of that can be improved by a proper usage of mini batches and some further optimizations but at the end of the day you are limited by a latency of a whole system. It is worth noting that when you increase parallelism on a worker side you also increase amount of work that has to be performed sequentially on a driver and the other way round. One way or another the Amdahl's law will bite you."
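Tying the lanes together, the dispatch itself is just a couple of comparisons against thresholds that have to be tuned empirically. Here is a sketch of how I think about it; the threshold values and the three run* functions are placeholders for the lanes described above, not Nectarine's actual code.

// Illustrative thresholds; the right values depend on the cluster and the algorithm.
val smallCardinality    = 100000L    // "small": run locally below this many observations
val smallDimensionality = 50         // "small": run locally below this many features
val mediumCardinality   = 10000000L  // "medium": a feature/target pair still fits in one executor

def runLocally(): Unit = ???          // lane 2: traditional single-machine library
def runPerPartition(): Unit = ???     // lane 3: mapPartitions + single-threaded regressions
def runFullyDistributed(): Unit = ??? // lane 4: RDD-based MLlib, e.g. LassoWithSGD

def analyze(cardinality: Long, dimensionality: Int): Unit =
  if (cardinality <= smallCardinality && dimensionality <= smallDimensionality)
    runLocally()
  else if (cardinality <= mediumCardinality)
    runPerPartition()
  else
    runFullyDistributed()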