I was wary when Carl Steinbach didn’t publish the July Hive User Group meetup’s agenda until the morning of the meetup. But as it turns out, my concerns were unfounded.

Of course, the summer of 2013 is a busy one in Big Data, with a lot of work under way on many fronts, including Stinger/Tez/YARN, Impala, and other SQLization efforts.

The July Hive User Group was hosted at LinkedIn’s offices. A big thanks goes out to LinkedIn’s AV team, who recorded and edited the event. (Catch the recording here.)

LinkedIn’s Mohammad Islam started off the evening with a survey of their data architecture. He tag-teamed with Mark Wagner and Karthik Ramasamy to cover LinkedIn’s usage of Hive. LinkedIn’s data sources fall into three classes: event data (generated by the user’s interaction with the website), account/profile data (hosted in Oracle), and external data (from partners).

(Speaking of LinkedIn, I note that Monica Rogati recently joined Jawbone as their first VP of Data. Given my inconsistent sleep, I have joined the quantified-self fray with the Jawbone Up. More on this in the future as the data accumulates.)

Yahoo’s Chris Drome followed up with a session on Yahoo’s usage of HiveServer2. Chris gave a comprehensive survey of the challenges he and his team faced in making Yahoo’s various clusters available to both internal and external teams. Challenges remain, including HIVE-3746, which I had previously shared at the February meetup. I’m looking forward to working through that with Chris in the coming months.

Gunther Hagleitner next provided a retrospective on Hive 0.11 and Tez. Alan Gates and his team have been regularly reporting on the progress of Stinger since the February Strata.

Carter Shanklin had the best teaser of the evening, mainly because his demo of geospatial analytics was sunk by wireless networking problems. But he has since blogged about it, so catch his post.

Arvind Prabhakar closed off the evening with a hot-off-the-press look into Cloudera’s role-based authorization system, Sentry.


If it isn’t obvious, the SQLization of Hadoop is under way in earnest, and Hive, the granddaddy of all SQL-on/in-Hadoop efforts, is evolving with the times.