It is a post-MapReduce world. Last week's announcement out of Google I/O was that Google has retired MapReduce and replaced it with its home-grown cloud analytics system, Cloud Dataflow. Rhetoric aside, it is good to see this from the originator of the concept. (I recall a similar declaration from Apache Mahout in April.) But last week's Spark Summit 2014 brought announcements of its own.
Monday and Tuesday were the conference portion of Spark Summit. I attended the Summit with a lot of eagerness, as my conversations with the Databricks team over the last six months had led me to expect some announcements. I was not disappointed.
In the opener, Databricks CTO Matei Zaharia provided an update on the community's growth. But the big unveil came right afterward from Ion Stoica: Databricks Cloud. Databricks Cloud is an extremely good idea that, from a first glance at the demo, can be described more simply as "Elastic Spark" (in the style of Amazon's Elastic MapReduce).
Despite being hosted on AWS, Databricks Cloud is wholly a Databricks effort with no Amazon involvement. (The big AWS sticker on the stairs was quite the red herring.) A Databricks Cloud user merely specifies the cluster size he or she needs and, presto, the cluster is ready to use with a notebook: a workspace cum scriptable shell. The notebook is suggestive of RStudio and offers a way to execute and script your computation, then annotate and render the results. Your data resides in S3 or other cloud storage. Databricks Cloud has been in closed beta and will be available for public beta soon. Hot hot!
The second announcement seemed less obvious in intent. SAP has, of course, been signing up a steady stream of SAP HANA partners. (Recall Hortonworks, the now-defunct IDH, and Cloudera.) And now Databricks has joined the SAP HANA partner party, announced with mild fanfare on Monday morning.
SAP EVP Aiaz Kazi's Tuesday keynote was broad in vision and trumpeted the immediate availability of a Spark distribution. SAP plays a long game with HANA, so this type of partnership still deserves careful analysis.
The third announcement wasn’t quite as dramatic, but cleared up some confusion regarding Shark and Spark SQL. Recall that Spark SQL was announced at February’s Strata Santa Clara as a major initiative within the Spark Core. This led to the obvious question about the future of Shark. At the time, the question was unanswered because the situation was evolving.
But this past week, Databricks clearly articulated that Shark is being retired; all future work will be consolidated into Spark. The practical difference between Shark and Spark/Spark SQL comes down to a server component that client apps can connect to through a JDBC or ODBC driver. Spark 1.0.1 introduces a JDBC/ODBC server to fill this gap. For more elaboration, refer to Reynold Xin's blog post.
This year's Spark Summit was a milestone event. The vigor of a young (but rapidly maturing) project and its small-but-growing community made the 2014 summit an intimate affair. (Being held in a vintage hotel couldn't help but give the event a lot of charm and character.) With the major advances outlined above, Apache Spark is a genuinely competitive heir to Hadoop 1.0.
The next event was teased as Spark Summit East in early 2015. It’ll be one to watch.