Reducing application development time by connecting Presto to Apache Accumulo

L-R: Dan Sun, Adam Shook, Skand Gupta

Apache Accumulo is a scalable, sorted, distributed key/value store based on Google’s BigTable design. It is built on the Apache Hadoop, ZooKeeper, and Thrift projects and has strong support for data security built in. Accumulo stores relational rows of data as a collection of key/value pairs sorted by key, which makes retrieval very fast when a query specifies either an individual key or a small range of keys.
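To make the sorted key/value layout concrete, here is a minimal in-memory sketch in plain Java. It does not use the Accumulo API. The class name, the "rowId|column" key layout, and the sample data are all assumptions made for illustration, with a TreeMap standing in for a sorted table. It shows why a lookup by a single key or a small key range is cheap when entries are kept in key order.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: a TreeMap stands in for a sorted key/value table.
// The "rowId|column" key layout is a hypothetical choice for this sketch
// and assumes row IDs never contain the '|' character.
class SortedKeyValueSketch {
    private final TreeMap<String, String> table = new TreeMap<>();

    // Store one cell of a relational row as a single key/value pair.
    void put(String rowId, String column, String value) {
        table.put(rowId + "|" + column, value);
    }

    // Point lookup on an individual key.
    String get(String rowId, String column) {
        return table.get(rowId + "|" + column);
    }

    // Range scan: every cell of one row is contiguous in sorted order,
    // so the whole row is the key range ["rowId|", "rowId|\uffff").
    SortedMap<String, String> scanRow(String rowId) {
        return table.subMap(rowId + "|", rowId + "|\uffff");
    }

    public static void main(String[] args) {
        SortedKeyValueSketch t = new SortedKeyValueSketch();
        t.put("user2", "name", "Bob");
        t.put("user1", "name", "Alice");
        t.put("user1", "age", "30");
        System.out.println(t.get("user1", "name"));
        System.out.println(t.scanRow("user1"));
    }
}
```

Because the cells of a row sit next to each other in key order, reading one row touches only a small contiguous slice of the table rather than the whole data set.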

We use Accumulo in a number of applications at Bloomberg Vault: as a database of communication events, as a triple store for entity relationships, and as a file store abstraction over HDFS. Applications that use data in Accumulo are typically written in Java. The Accumulo API is relatively simple to use but lacks a robust query framework. Applications are limited to writing data with Accumulo’s Mutation object and reading it by iterating over Scanner or BatchScanner objects, then extracting information from the raw key/value entries in the Accumulo table. This requires a lot of complex Java code and low-level data management, much of it similar in structure across the many applications that use Accumulo for data storage and retrieval. We implemented an Accumulo connector for Presto to address these issues and to reduce application development time.
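The kind of low-level code described above can be sketched without the Accumulo API. The example below is a hedged illustration in plain Java: the Entry class, the assembleRows helper, and the message data are hypothetical stand-ins for the raw key/value entries a scan would return. It shows the repetitive work of folding flat entries back into relational rows, which each application otherwise has to write for itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: Entry is a hypothetical stand-in for one raw
// key/value entry (row ID, column, value) returned by a table scan.
class RowAssemblySketch {
    static final class Entry {
        final String row;
        final String column;
        final String value;

        Entry(String row, String column, String value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    // Fold a stream of flat entries into one column->value map per row,
    // preserving the order in which rows and columns are encountered.
    static Map<String, Map<String, String>> assembleRows(Iterable<Entry> entries) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (Entry e : entries) {
            rows.computeIfAbsent(e.row, r -> new LinkedHashMap<>())
                .put(e.column, e.value);
        }
        return rows;
    }
}
```

With a SQL layer in front of the table, this per-application grouping logic is the part a declarative query replaces.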

We recently published our Presto-Accumulo connector, and this whitepaper covers how it can be used to retrieve data from Accumulo using SQL. It also looks at some performance metrics from the TPC-H benchmark suite, and wraps up by providing an overview of the functionality supported by the Presto-Accumulo connector.

Presto is a distributed ANSI SQL query engine for running interactive queries over very large data sets, from gigabytes to petabytes. Originally built by Facebook, Presto supports a pluggable storage layer, allowing users to implement a connector to virtually any data storage system, big or small.

With the Presto connector for Apache Accumulo, users can execute efficient ANSI SQL queries against relational data sets for rapid exploration and production analytics. This drastically reduces the development time needed to extract data from Accumulo. The methodologies used by the Presto connector follow common design patterns used by Accumulo application developers today.

We’re working with the Presto community to integrate this connector into future Presto releases, so that all Accumulo users will have this functionality easily available.