What Would Google Do, part N…

March 19, 2013 in Big Data, Excel Pivot Tables, Google BigQuery, Hadoop, MDX, Web/Tech

WWGD. “What Would Google Do?” I’ve seen this refrain often enough in the last few weeks, but I didn’t know that there’s a book by that name. Fittingly, a couple of clicks in Google lead you to a nice summary of the book.

But today, I want to talk about Dremel and the post-MapReduce world. To put things in perspective,
the following diagram (full disclosure: I grabbed it from Doug Cutting’s October Strata NYC keynote) shows the lineage of some of today’s Big Data technologies.

[Diagram: lineage of today’s Big Data technologies, from Doug Cutting’s Strata NYC keynote]

Everyone’s likely familiar with the first three rows. But Dremel, F1 and Spanner are new on the scene.

Let’s consider Dremel first. (Refer here for Google’s whitepaper.) Dremel is used extensively within Google. What is notable is that Google decided to commercialize it for external use via its BigQuery service. In typical Google fashion, it has been steadily firming up for the last couple of years. Google just announced several important new features (including big JOIN) for BigQuery last week. Previously, JOINs were only supported if all but one of the tables were “small” (small being at most 8MB compressed). The new big JOIN feature lifts this restriction. Concurrently, Tableau was on the verge of shipping support for BigQuery in its upcoming 8.0 release, and, just in time, Tableau 8.0, which includes BigQuery as a first-class data source, has now shipped. BigQuery is hitting its stride.
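
To see why a “small table” limit is a natural restriction, here is a minimal Python sketch of a broadcast hash join, a common technique in distributed query engines: the small table is loaded into an in-memory hash map (and, in a real cluster, copied to every worker), while the big table is streamed past it. I’m assuming this is roughly the shape of the pre-big-JOIN behaviour; it is an illustration, not BigQuery’s actual implementation, and the function name and sample data are made up.

```python
from collections import defaultdict


def broadcast_hash_join(big_rows, small_rows, key):
    """Inner-join two lists of dicts on `key`, hashing the small side in memory.

    Hypothetical helper for illustration only; not part of any BigQuery API.
    """
    # Build phase: index the small table by join key. This is the part that
    # must fit in memory, hence a size limit on the "small" table.
    index = defaultdict(list)
    for row in small_rows:
        index[row[key]].append(row)

    # Probe phase: stream the big table and look up matches one row at a time.
    joined = []
    for row in big_rows:
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)
            joined.append(merged)
    return joined


# Made-up sample data: a large fact table and a small lookup table.
visits = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 2, "url": "/b"},
    {"user_id": 3, "url": "/c"},  # no matching user, dropped by the inner join
]
users = [
    {"user_id": 1, "name": "ann"},
    {"user_id": 2, "name": "bob"},
]

for joined_row in broadcast_hash_join(visits, users, "user_id"):
    print(joined_row)
```

The memory needed grows with the small side only, which is why this approach breaks down when both tables are big; a big JOIN typically has to repartition both tables by key instead.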

So what is the open source community’s response to Dremel? In the diagram above, Impala is highlighted in red because the diagram is from Cloudera, which led the Impala effort. But if you cast your eyes wider, there’s another project, Apache Drill, that purports to carry the flame of Google Dremel:

[Image: Apache Drill]

In fact, there’s more… there’s also OpenDremel. The good news is that OpenDremel and Drill decided to join forces a little while back. So now there’s only one project, Drill, which should simplify things for the future.

Drill is making progress. In fact, there was a design review of its SQL parser just last week. If you want a summary of Drill’s vision and architecture, Ted Dunning has been touring and making the case. The exciting point I see for Drill is that it anticipates many areas of extensibility, including query languages and file formats.


One vision that I foresee is implementing a well-established analytics language, such as MDX, on Drill. To give you a flavour of what this could look like, Simba has implemented an MDX/SQL translation on top of Impala to demonstrate the viability and performance of such a system. In the case of Drill, we can potentially do one better with tighter integration and, hopefully, better performance and functionality.

If you’re sharp, you’ll note that there are two other Google projects, F1 and Spanner, that as yet have no open source equivalent. Based on material I’ve seen to date (including an excellent presentation in February at Strata Santa Clara), the hallmark of Spanner and F1 is Paxos combined with TrueTime, a radical new approach to achieving consistency using GPS and atomic clocks. Despite what Cloudera’s slide above shows, what I have seen of Impala to date doesn’t contain anything along those lines. It’s safe to say that F1 and Spanner are another step in the evolution of the database that Google has up its sleeve.
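
To give a flavour of how clock uncertainty can be turned into consistency, here is a toy Python sketch of the “commit wait” rule from Google’s Spanner paper: TrueTime reports the current time as an interval rather than a single value, and a transaction holds its commit until its chosen timestamp is guaranteed to be in the past. The `SimulatedTrueTime` class, its integer-millisecond clock and the `epsilon_ms` value are my own stand-ins for illustration, not Google’s API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    earliest: int  # milliseconds
    latest: int    # milliseconds


class SimulatedTrueTime:
    """Toy clock (an assumption for this sketch): the real time is known
    only to within +/- epsilon_ms of the simulated value."""

    def __init__(self, epsilon_ms):
        self.true_ms = 0
        self.epsilon_ms = epsilon_ms

    def now(self):
        # Return an interval that is guaranteed to contain the real time.
        return Interval(self.true_ms - self.epsilon_ms,
                        self.true_ms + self.epsilon_ms)

    def advance(self, ms):
        # Stands in for real time passing (a real system would sleep).
        self.true_ms += ms


def commit_with_wait(clock, tick_ms=1):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    timestamp = clock.now().latest
    waited_ms = 0
    # Commit wait: do not make the commit visible until every clock in the
    # system would agree that `timestamp` has already passed.
    while clock.now().earliest <= timestamp:
        clock.advance(tick_ms)
        waited_ms += tick_ms
    return timestamp, waited_ms


clock = SimulatedTrueTime(epsilon_ms=4)
timestamp, waited_ms = commit_with_wait(clock)
print(f"committed at {timestamp} ms after waiting {waited_ms} ms")
```

The wait comes out at roughly twice the clock uncertainty, which is why keeping that uncertainty small (hence the atomic clocks) matters so much.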