
This year’s opening event on the Big Data circuit, Strata Hadoop World, was prefaced the day before by Pivotal’s swath of announcements around the Open Data Platform (ODP), the open sourcing of Pivotal’s own SQL-on-Hadoop engine Hawq, and more. You can re-watch/read about it all here. By the time I landed in San Jose Tuesday afternoon, Cloudera’s Doug Cutting had responded, as did Curt Monash the next day. The level of activity in this market makes it difficult to judge the merit of such a broad initiative. If I were to interpret the ODP, I would go out on a limb and declare it a move to ease future transitions of second-tier Hadoop distributions onto Hortonworks’ HDP.

But let’s get on with Strata. As a non-American, I probably can’t appreciate Obama’s keynote as much as I would otherwise. But I do recognize the significance of Obama’s naming of DJ Patil as the first Chief Data Scientist for the US. Together with Obama’s open government initiative of 2009, the appointment signals that the US is fully committed to a data-based future.

But balanced against the excitement of the possibilities are concerns around privacy and security. The contrast was particularly stark because I also attended the Privacy and Security Conference the week before Strata. The discussions there highlighted how much work is needed on the ‘backside’: government and the public-at-large. Tim Wu’s keynote on net neutrality was particularly enlightening.

In addition to the conference proper, Strata is a convenient anchor for scheduling the meetups that backstop Hadoop’s many open source projects. So for the week, there were also:

  1. HBase (http://www.meetup.com/hbaseusergroup/events/219260093/): Focused on yet another SQL-on-Hadoop engine (more accurately, SQL-on-HBase): HP’s Trafodion.
  2. Spark (http://www.meetup.com/spark-users/events/220031485/): Reynold introduced the new DataFrame API. Databricks’ Andy put a livestream up for this.
  3. Hive (http://www.meetup.com/Hive-User-Group-Meeting/events/219794523/): The most useful session was Szehon’s survey of Hive’s join techniques and the work-in-progress as part of Hive-on-Spark.

Apart from SQL-on-Hadoop, the other trend for this Strata was that of data curation. I am likely swayed by Tamr’s keynote via Michael Stonebraker. But beyond data curation, OLAP and cubes are taking another small step back into the spotlight. Recall that Dave Mariani had already evangelized cubes at a previous Strata. It was gratifying to see Kyvos on the show floor at this Strata with full-blown MDX cubes powered by Hadoop. The return of SQL is just the start; schemas and models matter.

There’s no letting up on the gas in 2015 for Big Data. 451’s Matt Aslett shared his retrospective on 2014 and ended with the open question “It couldn’t happen again. Could it?” Good question.