Storage plugin extensibility is a key feature of Apache Drill. Drill supports Hive, HBase, and its DFS plugin for file systems, which covers the CSV, TSV, JSON, and Parquet file types. You can configure Drill data sources and data types via its web interface. (More details on that in my earlier blog here.) Though these data sources serve similar purposes, each comes with its own considerations that developers should keep in mind.


Hive supports a query language (HiveQL) that closely resembles SQL. With the Hive metastore providing metadata, Drill can query Hive seamlessly: you query Hive tables through Drill much as you would any other table.
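As a quick sketch (the table and column names here are hypothetical, not from an actual metastore), a Hive query through Drill is just plain SQL against the `hive` schema:

```sql
-- Hypothetical Hive table; Drill reads the schema from the Hive metastore,
-- so no casting or special syntax is needed.
SELECT name, major
FROM hive.students
WHERE gpa > 3.0;
```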

Querying HBase tables through Drill is a bit more complicated. With a typical SELECT * FROM query, the columns returned are column families rather than the individual columns. The data returned for each column family is a JSON string containing the individual columns and their data. For example, imagine a “students” table with an “account” column family containing the columns “address”, “gender”, and “name”. A SELECT * FROM hbase.students query will yield:

"row_key", "account"
0x726F7731, "{  "address" : "MTIzNCBXZXN0IDFzdCBBdmVudWU=",  "gender" : "TWFsZQ==",  "name" : "Sm9obg=="}"
0x726F7732, "{  "address" : "Njc4OSBXZXN0IDl0aCBBdmVudWU=",  "gender" : "RmVtYWxl",  "name" : "TGluYQ=="}"

To select individual columns within a column family, Drill uses a specific syntax: SELECT col_fam['col'] FROM table_name. Using the example above, if we wanted the address and gender columns, we would use the query SELECT account['address'], account['gender'] FROM hbase.students, which gives us something along the lines of:

"EXPR$0", "EXPR$1"
 0x31323334205765737420317374204176656E7565, 0x4D616C65
 0x36373839205765737420397468204176656E7565, 0x46656D616C65

Since HBase stores data as bytes, casting to the corresponding data types is necessary, and the columns need to be aliased to get readable names. Again using the students example, the resulting query would be SELECT CAST(account['address'] AS VARCHAR(20)) AS `Address`, CAST(account['gender'] AS VARCHAR(6)) AS `Gender` FROM hbase.students, which would result in:

"Address", "Gender"
"1234 West 1st Avenue", "Male"
"6789 West 9th Avenue", "Female"

To avoid writing these long and complex HBase queries over and over again, a common workflow is to first create a view containing the useful information, then query the view instead.
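For instance (a sketch building on the hypothetical students table above; the workspace `dfs.tmp` is an assumption and can be any writable workspace), a view can absorb the casts and aliases once:

```sql
-- Wrap the byte-level details of the HBase table in a view.
CREATE VIEW dfs.tmp.students_view AS
SELECT
  CAST(account['address'] AS VARCHAR(20)) AS `Address`,
  CAST(account['gender']  AS VARCHAR(6))  AS `Gender`,
  CAST(account['name']    AS VARCHAR(20)) AS `Name`
FROM hbase.students;

-- Subsequent queries stay simple:
SELECT `Name`, `Gender` FROM dfs.tmp.students_view;
```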

The DFS (“Distributed File System”) plugin is used to query file types such as Parquet, JSON, CSV, and TSV. For example, if we have a storage plugin registered as “dfs” with a workspace named “tmp” located at “/tmp”, then to query files from the “/tmp” folder, the query would be similar to SELECT * FROM `dfs.tmp`.`<filename>`.


Querying Parquet files is straightforward. The only issue to be aware of is that not every data type maps cleanly to Parquet; some columns may come back as VARBINARY, so casting may be necessary.
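As a hedged sketch (the file and column names are made up for illustration), a VARBINARY column can be cast back to a readable type the same way as with HBase:

```sql
-- Hypothetical Parquet file; if `status` comes back as VARBINARY,
-- an explicit cast restores a readable string.
SELECT CAST(status AS VARCHAR(10)) AS `status`
FROM `dfs.tmp`.`orders.parquet`;
```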

With JSON files, the tricky part is that they may be nested arbitrarily deep. The query syntax is similar to HBase's, except that it may go many levels deeper. The general syntax is SELECT name1['nestedname1']['nestednestedname1'] FROM `<schema>`.`<filename>.json`. For example, if we had a JSON file like this:

{"menu": {
   "id": "file",
   "value": "File",
   "popup": {
      "menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"}
      ]
   }
}}

To query for “menu”, it is as simple as SELECT menu FROM `dfs.tmp`.`menu.json`, and the result will be a JSON string. To query for “id”, it will be SELECT menu['id'] FROM `dfs.tmp`.`menu.json`, and the result will be “file”. Querying for “menuitem” just adds another layer: SELECT menu['popup']['menuitem'] FROM `dfs.tmp`.`menu.json`.

Arrays are slightly different in that they are indexed, starting from 0. So to query for the second “onclick”, the query will be SELECT menu['popup']['menuitem'][1]['onclick'] FROM `dfs.tmp`.`menu.json`, and the result will be “OpenDoc()”.
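When the array length varies, indexing element by element gets tedious. Drill also provides a FLATTEN function that expands an array into one row per element; as a sketch against the menu.json file above (note this uses Drill's dot notation with a table alias, an equivalent alternative to the bracket syntax):

```sql
-- One row per element of the menuitem array.
SELECT FLATTEN(t.menu.popup.menuitem) AS item
FROM `dfs.tmp`.`menu.json` t;
```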

CSV and TSV files do not have column names. If you use SELECT * FROM `<filename>.csv`, the result will be a single column named “columns” containing an array whose size matches the number of columns in the file. So to query specific columns, you index into that array, starting from 0. For example, if we had a CSV file with this data:

UBC, 2329 West Mall, V6T 1Z4, Vancouver, Canada
SFU, 8888 University Drive, V5A 1S6, Burnaby, Canada

To query for the city, we would use SELECT columns[3] AS `City` FROM `dfs.tmp`.`universities.csv`. As this shows, aliasing the indexed columns is recommended.
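Putting this together (the column labels are my own interpretation of the sample rows above), a full query might label every positional column at once:

```sql
-- Label each positional column; names are inferred from the sample data.
SELECT
  columns[0] AS `University`,
  columns[1] AS `Address`,
  columns[2] AS `PostalCode`,
  columns[3] AS `City`,
  columns[4] AS `Country`
FROM `dfs.tmp`.`universities.csv`;
```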

Apache Drill currently supports only these storage plugins out of the box, but more may be added in the future.