What’s in a name? Pardon my paraphrase of a certain bard:

What’s in a name? that which we call a rose
By any other name would smell as sweet;

It was announced at the 2014 Spark Summit that Shark would retire. Since then, and despite Reynold’s post on the last day of Summit elaborating on the thinking and plans for Spark SQL, I keep running into people with questions about how Spark SQL differs from Shark.

So I decided to write this post to clear the air about the practical outcome of this change.

For starters, Spark SQL is not a replacement for Shark. Let me explain why this is the case. Incidentally, O’Reilly’s new (and in-progress) book Learning Spark was a great help to me in understanding this and, hopefully, in explaining it.

The key factor to understanding Spark is explained in chapter 2:

Introduction to Core Spark Concepts

Now that you have run your first Spark code using the shell, it’s time to learn about programming in it in more detail.

At a high level, every Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains your application’s main function and defines distributed datasets on the cluster, then applies operations to them. In the examples above, the driver program was the Spark shell itself, and you could just type in the operations you wanted to run.

Shark started life as a fork of Hive where the MapReduce compiler was replaced with Spark’s RDD engine. Both Hive and Shark are thus Thrift servers (applications) that provide services for clients. For example, both JDBC and ODBC drivers can consume the services of such a Thrift server. In Spark parlance, the Shark server is a driver program which makes use of the Spark framework. In contrast, Spark SQL is a capability within the Spark framework and nothing more. A framework requires a driver program for it to perform any useful work.
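To make the distinction above concrete, here is a minimal sketch of a driver program that uses Spark SQL as a library. It assumes a current PySpark installation and its `SparkSession` API, which postdates the 1.0-era releases discussed in this post; the table name, column names, sample rows, and the `run_driver` function are all made up for illustration.

```python
# Hypothetical query against a hypothetical "people" table.
QUERY = "SELECT name FROM people WHERE age >= 18"


def run_driver():
    """A tiny driver program: it owns the session and calls Spark SQL as a library."""
    # Imported here so the module can be loaded on a machine without Spark.
    from pyspark.sql import SparkSession

    # Assumption: run against a local master rather than a real cluster.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("spark-sql-driver-sketch")
        .getOrCreate()
    )
    try:
        # Define a small distributed dataset and expose it to SQL under a name.
        people = spark.createDataFrame([("Ann", 31), ("Bob", 17)], ["name", "age"])
        people.createOrReplaceTempView("people")

        # Spark SQL does its work only because this driver asks it to.
        return [row["name"] for row in spark.sql(QUERY).collect()]
    finally:
        spark.stop()
```

Calling `run_driver()` needs a working local Spark setup. Notice what the sketch does not do: nothing in it listens for JDBC or ODBC clients. That job belongs to a server application such as Shark or the Thrift server, which is itself just another driver program.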

Thus, the recent history in Spark amounts to this:

  1. Up to and including Spark 0.8.1, the Shark project was an implementation of HiveServer (not HiveServer2).
  2. In Spark 0.9.*, Shark was updated to HiveServer2. (Refer to SHARK-117.)
  3. As of Spark 1.0.*, Shark is no more. SAP ships a private branch of 1.0.1, as part of their announcement at Summit, that contains a Thrift server that will support JDBC and ODBC. The public 1.0.1 and 1.0.2 releases, however, do not contain this server.
  4. The plan is for Spark 1.1 to include the Thrift server once more.

Let me know below if you have other questions that I haven’t answered.