A long time ago, back when dinosaurs roamed the earth, there was Hadoop. It allowed people to hunt mammoths in a distributed fashion, and it was great.
But then, some guy called Matei Zaharia decided that writing intermediate output to disk was an unnecessarily inefficient way of doing things (after all, hard disks are several orders of magnitude slower than RAM -- and RAM is cheap nowadays). He came up with Spark, a solution that does exactly(-ish) what Hadoop did, but in memory (unless there isn’t enough of it available, in which case it will use the hard disk as swap space). This was a revolution, and he became a rock star.
Spark is still the industry standard at the moment: it is both mature enough to be embraced by more conservative companies (I said "conservative", not "prehistoric") -- and innovative enough not to have been officially surpassed in scalability and performance… yet.
That being said, there are contenders. Apache Flink is one. Its biggest selling point, from what I understand, is that its streaming system is more efficient than Spark's, so Flink would be better suited for streaming applications.
The other one, which came out much more recently and is the development that got me excited enough to write this blog post, is called Ray. According to its GitHub commit history, the project was started in February 2016 (while Apache Flink, according to the same source, was started in December 2010). Its aim is to replace Spark for machine learning tasks. It's not yet at its alpha stage, so it isn't seriously considered for any production-quality work, and currently only appeals to hardcore tinkerers and enthusiasts such as myself.
So what’s supposed to be better about it? How is it different from Spark (and Flink)?
First, while Spark supports Java, Scala, Python and R as programming languages, Ray only supports Python (and I assume C++, since it’s written in that language -- but since I’m a civilized human being, I won’t use that).
Spark 1, Ray 0.
Ray replaces the "bulk synchronous" paradigm that is present in Spark. The latter forced you to design your tasks such that all partitions would take roughly the same processing time -- otherwise, the tasks that finished quickly would have to sit idle waiting for that one looooong straggler before anything could move on. This resulted in having to implement sometimes complex strategies just to make sure that your partitions were "balanced".
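To make the difference concrete, here is a small standard-library sketch (not Ray's actual API -- just `concurrent.futures` with a hypothetical `process_partition` stand-in) showing why a straggler hurts in the bulk-synchronous model, and how consuming results as they complete avoids the wait:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_partition(seconds):
    """Stand-in for a partition task; sleeps to simulate uneven workloads."""
    time.sleep(seconds)
    return seconds

durations = [0.01, 0.01, 0.2]  # one "partition" is much slower than the rest

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(process_partition, d) for d in durations]

    # Bulk-synchronous style: nothing downstream starts until ALL tasks
    # finish, so the fast partitions sit idle waiting for the 0.2 s straggler.
    # results = [f.result() for f in futures]

    # Asynchronous style (the model Ray favors): consume each result as soon
    # as its task completes, so fast partitions never wait for the straggler.
    finished_order = [f.result() for f in as_completed(futures)]

print(finished_order)  # the straggler's result arrives last
```

The point of the second style is that downstream work can begin on each partition's output immediately, instead of being gated on the slowest member of the batch.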
It can also handle GPU-based computation natively, which is a great improvement over Spark, especially with regard to deep learning tasks. In fact, it is "TensorFlow-ready", and its focus is ease of use for deep learning and machine learning applications.
Overall, while Spark values throughput over low latency, Ray's aim is to bring the latter to the world of distributed machine learning.