Here’s a recap of last week’s Hadoop Summit 2014.
It’s difficult for me to convey my excitement through the medium of words. As I explained in my previous post, YARN is a giant step forward from MapReduce and makes it possible for Hadoop to become a general compute fabric. Together with Docker (which I almost overlooked), Hadoop 2.0 is really, really cool for what it affords. The best analogy is that a Docker container is to a virtual machine as a thread is to a process. In the same way that a thread is a lightweight context for execution, a Docker (Linux) container is a lightweight context for virtualization. The most compelling demonstration of this container technology was seeing a 7-node Hadoop cluster spin up on my laptop in under 20 seconds at the YARN course. Containers are so compelling as a technology that Altiscale is capitalizing on the idea and has built an elastic Hadoop 2.0 service on it (in the same way that Amazon EMR is an elastic Hadoop 1.0 service).
I thought the 2014 Summit was notable because I’m seeing some critical retrospection on Hadoop.
Merv Adrian opened the Summit with a prescient observation on the pace of technology innovation: a successful technology typically has a lifespan of 20 years, with mass adoption occurring at the halfway mark. When a technology reaches this milestone, traditional enterprises begin to get serious about using it. And then real-world concerns—such as authentication, security, and integration—become priorities.
Applying the above rule to Hadoop, Merv observed that Hadoop is running right on schedule. Created in 2005, Hadoop is about 9 years old. The concern of the day is evident in the latest acquisitions by Hortonworks (XA Secure, for security) and Cloudera (Gazzang, for security). Things aren’t so different this time around.
Tony Baer’s session offered a great historical perspective on the relatively recent trend of NoSQL. I had spoken on the same subject at a local school, so it was particularly gratifying to see industry pundits recognizing and speaking to the same trends.
Finally, David Mariani’s talk on “BI on Hadoop” was the big tease for this Summit. Mariani was the co-author (together with Denny Lee) of a case study on Klout’s data pipeline (circa 2012).
As he described it, the data pipeline was brilliant but for the limits of Analysis Services.
David has moved on from Klout and is now at the startup AtScale, working on an OLAP/cube/MDX engine on Hadoop. From what he showed (which didn’t include any product demonstration) and the subsequent Q&A, AtScale features both MDX and SQL and seemed to be positioned as a potential replacement for Analysis Services. David highlighted the revolutionary improvements that newer SQL-on-Hadoop engines (such as Shark, Impala, or Stinger/Hive) have made over the Hive of 2012. Together with the various sessions on SQL-on-Hadoop (including Alan Gates’ session on ACID transactions), the 2014 Summit is evidence of Hadoop’s solid position within the enterprise IT landscape.
I’m waiting impatiently for the videos to be released to catch up on a handful of sessions from Wednesday that I missed. Watch for an update on that.
Postscript: Ralph Kimball did a couple of sessions reconciling the EDW with Hadoop. Catch them here from my tweet.