Evolving Hive

One aspect of the current climate for polyglot development is the wide selection and rapid evolution of tools. I was reminded of this when I saw Facebook’s latest update of Thrift.

As many of you may know, Thrift is widely used, including in some prominent Hadoop projects such as Apache Hive. Both HiveServer(1) and HiveServer2 are built using Thrift. One criticism that has been leveled at Hive is that its performance is partly capped by its use of Thrift as the transport and protocol layer. In particular, Thrift is a synchronous messaging system with weak C++ support.

Facebook’s announcement last month is a nice reset because Facebook chose to fork Thrift. The name, fbthrift, makes it clear enough that this isn’t Apache Thrift.

With fbthrift now at hand, I propose that it is a good time to revisit Hive’s protocol. In particular, re-implementing (or more accurately, porting the interface of) HiveServer2 to fbthrift looks to be a good first step toward improving Hive performance. Fbthrift includes low-hanging fruit such as compression and better encoding that should come for free. The next step, taking advantage of fbthrift’s asynchronous IO, should net even more performance.
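
To give a rough feel for why transport-level compression counts as low-hanging fruit, here is a minimal sketch in Python. It does not use fbthrift or Hive at all: JSON and zlib are stand-ins I picked for illustration, and the row layout and the helper names (encode_rows, compress_payload, decompress_payload) are hypothetical. The only point is that result sets with repetitive column values, which are common in query output, shrink considerably before they hit the wire.

```python
import json
import zlib


def encode_rows(rows):
    """Serialize rows to bytes. JSON is only a stand-in for a real wire encoding."""
    return json.dumps(rows, separators=(",", ":")).encode("utf-8")


def compress_payload(payload, level=6):
    """Compress an encoded payload with zlib (a stand-in for transport compression)."""
    return zlib.compress(payload, level)


def decompress_payload(blob):
    """Reverse compress_payload on the receiving side."""
    return zlib.decompress(blob)


# Hypothetical result set: many rows with repetitive column values.
rows = [{"id": i, "country": "US", "status": "active"} for i in range(1000)]

raw = encode_rows(rows)
packed = compress_payload(raw)

print("uncompressed bytes:", len(raw))
print("compressed bytes:  ", len(packed))
```

How much this helps in practice depends on the data and on the cost of compressing versus the cost of the network, so treat the numbers it prints as illustrative only.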

Do you have more ideas around this? Do you want to take part in this effort? Drop me a line!