As Hive 0.13 is just about to go out the door, I want to share one area of improvement that we at Simba have worked on: HiveServer2 performance.
Let me begin by explaining that the performance I am referring to here is not about HiveQL query performance. There has been a huge community effort under the umbrella of Stinger that’s been steadily improving Hive. What I’m referring to instead is HiveServer2’s intrinsic performance for conveying results from Hive to a client. For the purpose of the results I’m sharing here, I’ve used ODBC but since the ODBC driver uses Thrift, you can just about extrapolate the results here to similar interfaces (such as JDBC which is also built on Thrift).
Recall that at last April’s Hive User Group meeting I shared that we had observed severe performance degradation in the driver when switching from HiveServer(1) to HiveServer2. See HIVE-3746 for details, and thank Navis for the patches.
The problem turned out to be a detail in how unions were being passed.
We wanted to quantify the fix and also to compare the corrected HiveServer2 with HiveServer(1). As we were not concerned with query performance itself, we chose the simplest query that would fetch a large enough block of data to minimize timing noise.
Dataset and Query
|Test Dataset||denorm_vmart.online_sales_fact_denorm|
|Columns||124 columns (72 STRING, 47 INT, and 5 DOUBLE)|
|Dataset Size||646,650 rows (516.97 MB)|
All versions of Hive were running on a bare-metal, single-node install (CentOS 6.5) with an Intel Core2 Quad CPU (Q6600), 6GB of memory, a 2TB hard disk drive, and a Gigabit network connection.
We ran the query ten times against each of the following three versions of Hive. The traffic between the Hive server and the client was captured using tcpdump.
|HS1-V3||HiveServer(1) based on Hive 0.12, run with hive --service hiveserver|
|HS2-V3||HiveServer2 based on Hive 0.12 (CLI_SERVICE_PROTOCOL V3)|
|HS2-V6||HiveServer2 containing HIVE-3746 (CLI_SERVICE_PROTOCOL V6). Our results are based on the snapshot with commit hash cb2bd6d.|
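For readers who want to repeat this kind of measurement, here is a minimal sketch of a timing harness, not the tooling we actually used. It assumes a Python DB-API 2.0 connection (which an ODBC bridge such as pyodbc provides). The names time_full_fetch and make_demo_connection are hypothetical, as are the batch size and the demo table. An in-memory SQLite database stands in for Hive so the sketch runs anywhere. The timer starts after execute() returns, on the assumption that only the cost of conveying results to the client is of interest.

```python
import sqlite3
import statistics
import time


def time_full_fetch(connect, query, runs=10, batch_size=10000):
    """Run `query` `runs` times; return (row_count, list of fetch times in seconds).

    `connect` is any zero-argument callable returning a DB-API 2.0 connection.
    Only the fetch loop is timed, not query execution.
    """
    timings = []
    row_count = 0
    for _ in range(runs):
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(query)
            start = time.perf_counter()
            row_count = 0
            while True:
                rows = cur.fetchmany(batch_size)
                if not rows:
                    break
                row_count += len(rows)
            timings.append(time.perf_counter() - start)
        finally:
            conn.close()
    return row_count, timings


def make_demo_connection():
    """Hypothetical stand-in for a Hive connection: a small in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        ((i, "item%d" % i, i * 0.5) for i in range(1000)),
    )
    return conn


if __name__ == "__main__":
    rows, timings = time_full_fetch(make_demo_connection, "SELECT * FROM sales")
    print("rows fetched:", rows)
    print("median fetch time (s):", statistics.median(timings))
```

To point this at a real server, you would swap make_demo_connection for a callable that opens your own ODBC connection and pass in your own query. Taking the median over the ten runs is one way to damp timing noise. It is a choice made for this sketch, not necessarily how the results in this post were summarized.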
It’s clear that HS2-V6 is a big step forward for HiveServer2. Traffic between the Hive server and the client has been cut in half, and fetch time is about a quarter of what it was with the original HiveServer2. However, there’s still room for improvement compared with HiveServer(1).
The ideas from Facebook’s recent fbthrift may be applicable to Apache Thrift (in particular, compression). Applying these to Apache Thrift and upgrading Hive to use them is an area that I will be pursuing in the future. Drop me a line if you’re interested in this.