This demo walks you through the steps required to set up the Simba Hive ODBC driver and connect Microsoft Power BI Designer to a Hive data source. You will also see a short demonstration of Power BI Designer's data cleanup features and the use of native Hive Query Language (HiveQL).

Sample connection strings

  • SQL-92 translation:
    driver={Simba Hive ODBC Driver};host=hive01.local;port=10000;schema=movielens;AuthMech=2;HiveServerType=2;
  • Native HiveQL (pass-through; see the query example after this list):
    driver={Simba Hive ODBC Driver};host=hive01.local;port=10000;schema=movielens;AuthMech=2;HiveServerType=2;UseNativeQuery=1;
  • Please refer to the Apache Hive ODBC Driver User Guide for a description of the driver configuration options you will need to create your own connection string.
  • You can also define a system DSN using the ODBC Administrator and refer to it by name in your connection string, e.g. for a DSN named “HIVE32”:
    dsn=HIVE32
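
The UseNativeQuery setting matters for Hive-specific syntax that has no SQL-92 equivalent. As a rough sketch (using the movies table defined in the schema below), the following query relies on Hive's LATERAL VIEW explode() and therefore needs the native HiveQL connection string:

-- Count movies per genre by exploding the genres array;
-- LATERAL VIEW is Hive-specific, so this needs UseNativeQuery=1
SELECT genre, COUNT(*) AS movie_count
FROM movies
LATERAL VIEW explode(genres) g AS genre
GROUP BY genre
ORDER BY movie_count DESC;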

Links from the demo

Dataset Preparation

The MovieLens 1M dataset (one million ratings) was used for the demo. Please refer to the following page for the additional steps used to prepare the dataset for import (changing the delimiters and adding an occupations table).

If you use a different MovieLens dataset, you may need to pre-process the files differently and adjust the Hive schema definition.
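
As an alternative to changing the delimiters, newer Hive versions ship a SerDe that handles multi-character delimiters. A minimal sketch for the original '::'-delimited ratings file, assuming Hive 0.14+ where MultiDelimitSerDe is available (the class's package name varies between Hive versions):

-- Hypothetical table over the unmodified '::'-delimited ML-1M ratings file
CREATE EXTERNAL TABLE ratings_raw (
  userid INT,
  movieid INT,
  rating INT,
  tstamp STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim' = '::')
LOCATION '/dataset/movielens/ratings_raw'; -- hypothetical upload location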

Hive schema definition

CREATE DATABASE IF NOT EXISTS movielens;
USE movielens;

CREATE EXTERNAL TABLE ratings (
  userid INT,
  movieid INT,
  rating INT,
  tstamp STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
LOCATION '/dataset/movielens/ratings';

CREATE EXTERNAL TABLE movies (
  movieid INT,
  title STRING,
  genres ARRAY<STRING>
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
COLLECTION ITEMS TERMINATED BY '|'
LOCATION '/dataset/movielens/movies';

CREATE EXTERNAL TABLE users (
  userid INT,
  gender STRING,
  age INT,
  occupation INT,
  zipcode STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
LOCATION '/dataset/movielens/users';

CREATE EXTERNAL TABLE occupations (
  id INT,
  occupation STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
LOCATION '/dataset/movielens/occupations';
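
Once the data is loaded, a query along these lines (a sanity-check sketch, not part of the demo itself) confirms that joins across the schema behave as expected:

-- Average rating and number of ratings per occupation
SELECT o.occupation,
       AVG(r.rating) AS avg_rating,
       COUNT(*) AS num_ratings
FROM ratings r
JOIN users u ON r.userid = u.userid
JOIN occupations o ON u.occupation = o.id
GROUP BY o.occupation
ORDER BY avg_rating DESC;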

Bash script to upload/refresh the dataset in Hadoop

#!/bin/bash

# folder where the data files are located
DATA_FOLDER=/Data/MovieLens/ml-1m

# data file extension (without the leading dot), e.g. csv or t
EXT=t

# hadoop dataset name e.g. /dataset/movielens
DS=/dataset/movielens

echo ------------------------------------------------------------
echo Loading .$EXT files from $DATA_FOLDER into $DS...
echo Make sure the Hive schema exists before loading the data
echo

# hdfs dfs -rmdir --ignore-fail-on-non-empty $DS
hdfs dfs -mkdir -p $DS

for filename in "$DATA_FOLDER"/*."$EXT"; do
	name=$(basename "$filename" ".$EXT")
	echo "Reloading $name..."
	# -f suppresses the error when the target does not exist yet (first load)
	hadoop fs -rm -r -f "$DS/$name"
	hadoop fs -put "$filename" "$DS/$name"
	echo
done

echo All done
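
After the script completes, a quick row count in Hive (again, a sanity check rather than part of the demo) confirms the external tables can see the uploaded files:

USE movielens;
-- the ML-1M dataset should yield roughly one million rows here
SELECT COUNT(*) AS rating_count FROM ratings;
SELECT COUNT(*) AS movie_count FROM movies;
SELECT COUNT(*) AS user_count FROM users;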