With changes in big data and the continuing trend towards separation of storage & compute, Amazon Simple Storage Service (S3) is gaining prominence. S3 is an object storage service and is used by customers of all sizes and industries to store and protect any amount of data for a range of use cases, such as websites, static websites (using Angular or React JS), Android & iOS mobile applications, backup and restore, data archive, enterprise applications, IoT devices, data lakes and big data analytics. S3 is also used with ELT (extract, load, transform) approach which is changing the old ETL paradigm.
S3 itself does not have a query engine of its own. But there are multiple methods for accessing data & executing SQL queries on S3 data with BI applications using ODBC, some of which we will cover below.
- Presto query engine leads when it comes to BI type queries. Presto with the Hive metastore is widely used to access S3 data natively. Presto is interactive and can query faster than Hive if the query has multiple stages. Using the Simba Presto ODBC driver users can analyze data in S3 files without extraction, using their preferred BI application.
- If you are already using Hive, you could use it to connect & query S3 data. Hive uses MapReduce and can be used if throughput and support for large queries is a key requirement. The high-level steps to connect Hive to S3 are similar to the steps for connecting Presto using a Hive metastore. Further using the Hive ODBC driver BI apps can connect to & query data in S3 files.
- With AWS Redshift; you can store data in Redshift & also use Redshift spectrum to query data in S3. With this approach, typically live data is stored in Redshift and less frequently used data in S3. Redshift Spectrum allows you to run queries on external tables which can read from S3. Which means you can run standard SQL queries on data stored in format like CSV, TSV, Parquet in S3. Redshift external tables however do not support delete, updates only select, join, sort queries are supported. Using the Redshift ODBC driver you can then connect to a BI application.
- Athena which is an interactive query service could be used to query & analyze data in Amazon S3 using standard SQL. Athena itself uses both Presto for queries & Hive for create, alter tables. Athena charges by the amount of data scanned for each query. Scanned data can be reduced by partitioning, converting to columnar formats like Parquet. The Athena ODBC driver allows data connectivity to your BI application.
- If you would rather browse & extract files in S3 buckets; the S3 ODBC driver does provide multiple stored procedures including procedures for uploading, downloading files along with SQL access to buckets.
Presto on S3
We will cover the high-level steps & basic configuration needed to connect & query data in S3 files using AWS EMR Presto. If you are using on-prem Presto, the steps should be similar.
- Initiate a Presto cluster in AWS EMR with Hive Metastore.
- In the EC2 instance which runs the Hive Metastore process, you would need to manually start HiveServer2 instance. Sample command:
sudo nohup hive –service hiveserver2 &
- After hiveserver2 is started, verify by using beeline. hivesever2 uses 10000 port as the default.
beeline -u jdbc:hive2://localhost:10000 -n hive
- Now that hiveserver2 is started, next step is to add S3 credentials to corresponding hive & Hadoop config files.
Here we edit: /etc/hive/conf/hive-site.xml & /etc/hadoop/conf/hdfs-site.xml
<property> <name>fs.s3a.access.key</name> <value>AccessKey</value> <description>AWS access key ID. Omit for Role-based authentication. </description> </property> <property> <name>fs.s3a.secret.key</name> <value>secret key</value> <description>AWS secret key. Omit for Role-based authentication. </description> </property>
Alternatively AWS IAM Roles can be used and configured to grant the required access. You could add more configuration params to fine tune the connection.
- You may need to restart HDFS Namenode & HiveServer2 for the updated configuration to be effective:
sudo restart hadoop-hdfs-namenode
sudo nohup hive --service hiveserver2 &
- Verify if S3 files can be accessed using HDFS commands like ‘ls’
hdfs dfs -ls s3://testpresto123/test/
- Now that Hive is connected to S3. You need to create external tables. Using beeline create table/s corresponding to the S3 files. Here we create one table for CSV file in S3 which has Car data in City,County,Make format.
CREATE EXTERNAL TABLE cars(City STRING, County STRING, Make STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://testpresto123/test/';
- Verify if S3 data is accessible & can be queried using SQL: select * from cars;
Now that Presto can query S3 file content, you are all set to visualize data in your preferred BI application using the Presto ODBC driver.