Day 1 of Strata Santa Clara continued the frenetic pace of day 0. In light of Stinger and Hawq, I was particularly keen to see the SQLization trend play out. The day started with Rajat Taneja, CTO of EA, sharing how Hadoop and Big Data have changed how EA approaches its entire portfolio of games. He summarized it as follows:
- Data is Only a means to an End
- Focus on the Usage and Action of the Data First
- The Right Data at the Right Time is More Valuable than Bigger Data Sets
- Use Data to Create Amazing Customer Experiences
- Single TRUE identity is the crucial glue that connects the power of data insight to the CONSUMER.
Right after, EMC’s Scott Yara stepped up, not to crow about Monday’s Pivotal HD announcement but rather to recognize influential “data people”. But it was the third keynote that I thought was bold and ambitious: Stitch Fix pairs machine algorithms with human curation to deliver (literally, shipped to a customer/subscriber) apparel recommendations. The theme of the keynote was that people matter.
I began the day with Berkeley’s Spark/Shark, which I had first seen at last year’s Hadoop Summit. Judging from the attendance, which was the highest I saw all day, the demand for and interest in interactive/low-latency SQL is huge.
I followed up with Tomer Shiran’s introduction to Apache Drill. Tomer, Jacques, and the team (now comprising 6-7 companies) are making progress and are looking at a Q2 alpha and a Q3 beta. Drill is of great interest to me because of its extensible front-end and back-end architecture, which allows for additional query languages and operators. The questions from the audience suggest that SQL on a scaled-out architecture is definitely top of mind.
EMC’s session on their Pivotal HD distribution, right after lunch, was no snoozer. The benchmark numbers that they shared comparing Hawq with Hive and Hawq with Impala are amazing. In the former, speedups ranged from 19X to 648X; in the latter, they still ranged from 9X to 69X. Of course, this isn’t too surprising given that they are leveraging Greenplum’s well-developed engine and optimizer. Hawq is a game-changer for the “SQL on HDFS” market because of EMC’s position in the storage market and the strength of Hawq.
Hortonworks’ Alan Gates followed with a survey of the various tools in the Hadoop ecosystem. He touched lightly on the new Tez project (pronounced “taze”). But, more importantly, I also got some longstanding misunderstandings about HBase and HCatalog cleared up. Thanks Alan!
Finally, it was the session that I had been waiting for all day: Tim O’Brien’s “The Future of Relational (or Why You Can’t Escape SQL)”. Tim did a breathless recap of the last 40 or so years of database technology, starting with the pre-SQL world and moving on to the early SQL market, the mature SQL market, the rise of NoSQL in the mid-2000s, today’s diverse Big Data technologies, and what I’ll offer as the post-Big Data NewSQL world of Impala/Drill/Spanner. From his experience as a database developer, he offered several choice statements about the need/motivation of many a team that chooses NoSQL. (You can refer to The Register’s article for those…) But the core point that he offered is that SQL and relational are not joined at the hip. Relational for data modeling may be out. But SQL remains applicable and valuable: expressiveness, a corps of practitioners with a vast body of knowledge, and widely available and deployed tools. A final curious reason he offered is suggestively summarized as “What Would Google Do”: Google itself has come full circle back to SQL with Spanner after inventing BigTable/MapReduce.
I ended the day with Citus Data’s Carl Steinbach. Citus took Postgres and tacked on an implementation of SQL/MED that they have prosaically named “foreign data wrapper”. Carl, being a Hive guy, described a foreign data wrapper as a Hive SerDe with extensions for predicate push-down and metrics. Works for me. In the end, it delivers a huge improvement even on non-HDFS data.
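For readers new to predicate push-down, here is a minimal sketch of the idea in Python. It is not Citus’s code and not the Postgres foreign data wrapper API; `ToyForeignWrapper`, `scan`, `query`, and the `rows_returned` counter are hypothetical names made up for illustration. The point is only that when the filter runs inside the wrapper, far fewer rows cross the boundary into the query engine.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


class ToyForeignWrapper:
    """Hypothetical stand-in for a wrapper over an external data source."""

    def __init__(self, source: Iterable[Row]) -> None:
        self.source: List[Row] = list(source)
        # Crude metric: how many rows the wrapper handed to the engine.
        self.rows_returned = 0

    def scan(self, pushed: Optional[Predicate] = None) -> Iterator[Row]:
        """Yield rows, applying the pushed-down predicate at the source if given."""
        for row in self.source:
            if pushed is None or pushed(row):
                self.rows_returned += 1
                yield row


def query(wrapper: ToyForeignWrapper, predicate: Predicate, push_down: bool) -> List[Row]:
    """Run the same filter either inside the wrapper or in the engine."""
    if push_down:
        # The wrapper filters; the engine only sees matching rows.
        return list(wrapper.scan(predicate))
    # The wrapper returns everything; the engine filters afterwards.
    return [row for row in wrapper.scan() if predicate(row)]
```

Both paths return the same answer; the difference shows up in `rows_returned`, which counts every row without push-down and only the matching rows with it.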
Summarizing day 1, the themes surrounding SQL are:
- A proven and understood query language, SQL, is not optional: it is table stakes for Big Data to enter any enterprise.
- But the query language must span all workloads. It must provide not only for bulk/ETL scenarios where latency is immaterial but also for interactive workloads where latency is king.
- SQL-like isn’t good enough. SQL compliance is no longer an ideal but a necessity: to enable the full gamut of tools to run unmodified, and to enable analysts to write unrestricted SQL.