Strata Santa Clara 2014 was last week. If you didn’t make it out, there are a few summaries around the net already (here and here). The theme that I saw can be summarized in three letters: S, Q, L. Now that Hadoop is part of the establishment, SQL is a key enabling technology for interoperability. If you dig into the prevailing idea of a “data lake” or “data hub”, it’s all about being able to reach from the legacy end (either an application or a database) across into Hadoop/HDFS.
On Thursday, I attended a great session by Marcel Kornacker on Impala. The most insightful portion of the session was not necessarily Marcel’s material (which is useful and does illuminate Impala’s internals) but the attendees quizzing Marcel, by name, about all the options we have in 2014: Hive, Stinger, Presto, Shark, Hawq, and Drill. Marcel did let out that Cloudera is benchmarking Impala against Shark and that they are almost ready to share the results.
It’s worth noting that the grand-daddy of them all, Apache Hive, is still very much king of the hill. As the incumbent, Hive is not standing still. Hortonworks, Microsoft, et al. have been tirelessly advancing Hive under the ‘Stinger initiative’. Alan Gates’ session, held right after lunch, provided a tour of Stinger’s progress. Both Marcel’s and Alan’s sessions are worth catching on O’Reilly.
I submit that Hive is recognized as a de facto standard for SQL-on-Hadoop. Exhibit 1: Impala has stated its goal of delivering a high degree of HiveQL compatibility, even if its Thrift-based (“wire”) protocol deviates from Hive’s HiveServer and HiveServer2. Exhibit 2: AMPLab/Databricks’ Shark server is functionally a Hive server front-end (query parser and intermediate representation) paired with Spark for the back-end.
Finally, I’m finding a lack of up-to-date material on working with and working ‘in’ Hive. So I’m going to address some of this with a series of blog posts on Hive. I will begin the series with a recap of the history, a collection of helpful links, and some useful material. Further along, I want to share some benchmark results for recent changes to HiveServer2 (HS2) and also offer a glimpse of Hive’s future.