Understanding HiveServer2 0.13 Performance

March 20, 2014 in Facebook, Hadoop, Hive, ODBC

Introduction

As Hive 0.13 is just about to go out the door, I want to share one area of improvement that we at Simba have worked on: HiveServer2 performance.

Let me begin by explaining that the performance I am referring to here is not HiveQL query performance. There has been a huge community effort under the umbrella of Stinger that’s been steadily improving that side of Hive. What I’m referring to instead is HiveServer2’s intrinsic performance in conveying results from Hive to a client. For the results I’m sharing here, I’ve used ODBC, but since the ODBC driver uses Thrift, you can reasonably extrapolate them to similar interfaces (such as JDBC, which is also built on Thrift).

Recall that at last April’s Hive User Group meeting I shared that we had observed severe performance degradation in the driver when changing over from HiveServer(1) to HiveServer2. You can refer to HIVE-3746 for details, and thank Navis for the patches.

Data

The problem turned out to be a detail in how unions were being passed.

We wanted to quantify the fix and also to compare the corrected HiveServer2 with HiveServer(1). As we were not concerned with query performance itself, we chose the simplest query that would fetch a block of data large enough to minimize timing noise.

Dataset and Query

Test Dataset    denorm_vmart.online_sales_fact_denorm
Columns         124 columns (72 STRING, 47 INT, and 5 DOUBLE)
Dataset Size    646650 rows (516.97 MB)
Query           SELECT * FROM denorm_vmart.online_sales_fact_denorm

Hardware Configuration

All versions of Hive ran on a bare-metal, single-node install (CentOS 6.5) with an Intel Core2 Quad CPU (Q6600), 6 GB of memory, a 2 TB hard disk drive, and a Gigabit network connection.

Results

We ran the query ten times against each of the following three versions of Hive. The traffic between the Hive server and the client was captured using tcpdump.

HS1-V3    HiveServer(1) based on Hive 0.12, run with hive --service hiveserver
HS2-V3    HiveServer2 based on Hive 0.12 (CLI_SERVICE_PROTOCOL V3)
HS2-V6    HiveServer2 containing HIVE-3746 (CLI_SERVICE_PROTOCOL V6). Our results are based on the snapshot with commit hash cb2bd6d.
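For readers who want to repeat this kind of measurement, the loop of running one query several times and draining the full result set can be sketched in Python as below. This is an illustrative sketch, not the harness behind the numbers reported here. The names time_full_fetch, benchmark, and fake_query are hypothetical, and fake_query is only a stand-in so that the example runs without a Hive server. The timer covers query execution plus fetch, so isolating pure fetch time would require starting the clock after the query has been submitted.

```python
import statistics
import time
from typing import Callable, Iterable


def time_full_fetch(run_query: Callable[[], Iterable[list]]) -> tuple:
    """Run the query once, drain every batch, return (seconds, row_count)."""
    start = time.perf_counter()
    rows = 0
    for batch in run_query():
        rows += len(batch)  # count rows so we can confirm each run fetched everything
    return time.perf_counter() - start, rows


def benchmark(run_query: Callable[[], Iterable[list]], runs: int = 10) -> dict:
    """Repeat the measurement and summarize the timings."""
    timings = []
    row_counts = set()
    for _ in range(runs):
        elapsed, rows = time_full_fetch(run_query)
        timings.append(elapsed)
        row_counts.add(rows)
    return {
        "runs": runs,
        "rows": sorted(row_counts),  # should contain exactly one value
        "mean_s": statistics.mean(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def fake_query(total_rows: int = 1000, batch_size: int = 256):
    """Hypothetical stand-in for a real cursor: yields rows in batches."""
    for start in range(0, total_rows, batch_size):
        n = min(batch_size, total_rows - start)
        yield [(i, "x") for i in range(start, start + n)]


if __name__ == "__main__":
    print(benchmark(fake_query, runs=3))
```

Against a real server, run_query would instead execute the SELECT through an ODBC or JDBC connection and yield batches from the cursor. With pyodbc, for example, that could be a loop over cursor.fetchmany(n) until it returns an empty list.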

 

[Image: hs2-improvement]

[Image: hs2-improvement-graph]

Conclusion

It’s clear that HS2-V6 is a big step forward for HiveServer2. Traffic between the Hive server and the client has been reduced by half, and fetch time is about a quarter of that of the original HiveServer2. However, there’s still room for improvement when compared with HiveServer(1).

Some of the ideas in Facebook’s recent fbthrift (in particular, compression) may be applicable to Apache Thrift. Applying them to Apache Thrift and upgrading Hive to use them is an area that I will be pursuing in the future. Drop me a line if you’re interested in this.