Storage plugin extensibility is a key feature of Apache Drill. Drill supports Hive, HBase, and its own DFS file-system plugin, which covers the CSV, TSV, JSON, and Parquet file types. You can configure Drill's sources and data types via its web interface. (More details on that in my earlier blog here.) Although these data sources serve similar purposes, each comes with its own considerations for developers.

DATA SOURCES

Hive
Hive supports a query language (HiveQL) that closely resembles SQL. With the Hive metastore providing metadata, Drill can query Hive seamlessly: you query Hive tables through Drill much as you would any other table.
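As a sketch (the “students” table and its columns here are hypothetical, used only for illustration), a Hive query through Drill looks like plain SQL:

```sql
-- Query a Hive table through Drill's hive storage plugin.
-- "students" and its columns are hypothetical.
SELECT name, gender
FROM hive.`students`
WHERE gender = 'F'
LIMIT 10;
```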

HBase
Querying HBase tables through Drill is a bit more complicated. With a typical SELECT * FROM query, the columns returned are column families rather than the individual columns. The data returned for each column family is a JSON string containing the individual columns and their data. For example, imagine a “students” table with an “account” column family holding the columns “address”, “gender”, and “name”. A SELECT * FROM hbase.students query returns the entire account column family as a single JSON-encoded column.
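The original post showed the result as a screenshot; roughly, the shape of the output is as follows (the row values here are made up for illustration):

```sql
-- All columns of a column family come back as one JSON-encoded value.
SELECT * FROM hbase.students;
-- Illustrative output shape (values are invented):
-- | row_key | account                                               |
-- | alice   | {"address":"123 Main St","gender":"F","name":"Alice"} |
```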

To select individual columns within a column family, Drill uses a specific syntax: SELECT col_fam['col'] FROM table_name. Continuing the example, if we wanted the address and gender columns, we would use the query SELECT account['address'], account['gender'] FROM hbase.students, which returns just those two columns in their raw, byte-encoded form.

Since HBase stores data as bytes, casting to the corresponding data types is necessary, and the columns should be aliased to readable names. Again using the students example, the resulting query is SELECT CAST(account['address'] AS VARCHAR(20)) AS Address, CAST(account['gender'] AS VARCHAR(6)) AS Gender FROM hbase.students.
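Formatted out, the full query reads more clearly (the output row below is invented, matching the earlier illustration):

```sql
-- Cast the byte values to readable types and alias the columns.
SELECT
  CAST(account['address'] AS VARCHAR(20)) AS Address,
  CAST(account['gender']  AS VARCHAR(6))  AS Gender
FROM hbase.students;
-- Illustrative output (values are invented):
-- | Address     | Gender |
-- | 123 Main St | F      |
```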

To avoid writing long, complex HBase queries over and over again, a common workflow is to first create a view containing the useful casts and aliases, then query the view instead.
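A sketch of that workflow, assuming a writable dfs.tmp workspace to hold the view (the view name is hypothetical):

```sql
-- The view hides the casts and aliases behind readable column names.
CREATE VIEW dfs.tmp.students_view AS
SELECT
  CAST(account['address'] AS VARCHAR(20)) AS Address,
  CAST(account['gender']  AS VARCHAR(6))  AS Gender
FROM hbase.students;

-- Subsequent queries stay short:
SELECT Address, Gender FROM dfs.tmp.students_view;
```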

DFS
The DFS (“Distributed File System”) plugin is used to query file types such as Parquet, JSON, CSV, and TSV. For example, if we have a storage plugin registered as “dfs” with a workspace named “tmp” pointing at “/tmp”, then to query files from the “/tmp” folder, the query would be similar to SELECT * FROM dfs.tmp.`<filename>`.
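For reference, a dfs plugin with a “tmp” workspace is configured through the web interface with JSON along these lines (a trimmed sketch; your connection URL, writability, and format list will differ):

```json
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  }
}
```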

FILE TYPES

Parquet
Querying Parquet files is straightforward. The only issue to be aware of is that Drill does not support every Parquet data type; unsupported values come back as VARBINARY, so casting may be necessary.
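A minimal sketch, assuming a hypothetical orders.parquet file with a column “note” that surfaces as VARBINARY:

```sql
-- Cast a VARBINARY column back to a readable type.
-- "orders.parquet" and "note" are hypothetical.
SELECT CAST(note AS VARCHAR(100)) AS note
FROM dfs.tmp.`orders.parquet`;
```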

JSON
With JSON files, the tricky part is that they may be nested arbitrarily deep. The query syntax is similar to HBase's, except that it can go many levels down. The general form is SELECT name1['nestedname1']['nestednestedname1'] FROM <schema>.<filename>.json. For example, consider a menu.json file whose top-level “menu” object contains an “id” field and a nested “popup” object holding a “menuitem” array.
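The sample file appeared as an image in the original post; based on the queries and results described below, it matches the well-known json.org menu sample, roughly:

```json
{
  "menu": {
    "id": "file",
    "value": "File",
    "popup": {
      "menuitem": [
        { "value": "New",   "onclick": "CreateNewDoc()" },
        { "value": "Open",  "onclick": "OpenDoc()" },
        { "value": "Close", "onclick": "CloseDoc()" }
      ]
    }
  }
}
```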

To query for “menu”, it is as simple as SELECT menu FROM dfs.tmp.`menu.json`, and the result is a JSON string. To query for “id”, use SELECT menu['id'] FROM dfs.tmp.`menu.json`; the result is “file”. Querying for “menuitem” just adds another layer: SELECT menu['popup']['menuitem'] FROM dfs.tmp.`menu.json`.

Arrays are slightly different, as they are indexed starting from 0. So to query for the second “onclick”, the query is SELECT menu['popup']['menuitem'][1]['onclick'] FROM dfs.tmp.`menu.json`, and the result is “OpenDoc()”.
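Taken together, the drill-down progresses one level at a time (the noted results come from the example above):

```sql
-- Walk down nested JSON one level at a time.
SELECT menu['id'] FROM dfs.tmp.`menu.json`;                -- "file"
SELECT menu['popup']['menuitem'] FROM dfs.tmp.`menu.json`; -- the whole array
-- Array elements are zero-indexed, so [1] is the second entry.
SELECT menu['popup']['menuitem'][1]['onclick']
FROM dfs.tmp.`menu.json`;                                  -- "OpenDoc()"
```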

CSV/TSV
CSVs and TSVs do not have column names. If you use SELECT * FROM <filename>.csv, the result is a single column named “columns” holding an array whose size matches the number of columns in the CSV file. So to query for specific columns, we index into that array, starting from 0. For example, consider a universities.csv file where the city sits in the fourth field (index 3).
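The sample data was an image in the original post; a hypothetical universities.csv consistent with the query below (city in the fourth field) might look like:

```csv
Harvard,Private,1636,Cambridge
Stanford,Private,1885,Stanford
Michigan,Public,1817,Ann Arbor
```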

To query for the city, we would use SELECT columns[3] AS City FROM dfs.tmp.`universities.csv`. (Note that single quotes around an alias would denote a string literal; use backticks or bare identifiers instead.) Needless to say, aliasing is recommended.

Apache Drill currently supports only these storage engines out of the box, but thanks to its plugin extensibility, more may appear in the future.