PrestoDB: Running PrestoDB on Amazon EMR

A few weeks ago, Facebook released a new open source project called PrestoDB, which they billed as a marked improvement over Hive and Hadoop. According to the PrestoDB site, Presto is a real-time query engine that supports a SQL-like syntax, similar to Hive. However, unlike Hive, Presto doesn’t execute queries as MapReduce jobs but instead uses its own internal distribution mechanism. According to the Presto site and current users, most queries will see an order of magnitude speedup compared to Hive. And the best part? PrestoDB can read metadata from Hive’s metastore and read files off HDFS just like Hive – pretty wild.

Anyway, since I love new toys (who doesn’t!?), I decided to try setting up PrestoDB on Amazon EMR to see how difficult it was and to experience the speedups firsthand. Turns out, once you have an Amazon EMR cluster running, getting PrestoDB up is almost trivial. Just follow the PrestoDB deployment directions to get yourself situated. Make sure you create *all* the files or you’ll get some unnecessarily cryptic errors along the way.

The config files I ended up using were:

You’ll need to create the “/mnt/presto” directory and also make it accessible to whatever user you plan to run the daemon under.

The one huge gotcha I ran into was that I couldn’t figure out what port Hive’s Thrift service was running on. For some reason, it’s notably absent from Amazon’s documentation, and I couldn’t find the hive-site.xml file on the EMR EC2 instance. Completely by chance, I ran across this manual page from Jaspersoft enumerating which ports different versions of Hive run Thrift on when you use EMR. Turns out, it’s different per Hive version, but 0.11.0 uses 10004.
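If you can’t find a reference for your Hive version, one way to narrow it down is to probe a few candidate ports from the master node. The following is a minimal sketch, not something from the Presto or EMR docs: the function name and the candidate range are my own assumptions (10000 is the stock Hive default, 10004 is the value mentioned above). An open port only tells you that something is listening there, not that it is Hive’s Thrift service, so treat the result as a hint.

```python
import socket

# Candidate ports are an assumption for illustration: 10000 is the stock
# Hive server default and 10004 is the port reported for Hive 0.11.0 on EMR.
CANDIDATE_PORTS = [10000, 10001, 10002, 10003, 10004]


def find_open_ports(host, ports, timeout=1.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    open_ports = []
    for port in ports:
        try:
            # create_connection raises OSError if the port is closed,
            # unreachable, or the attempt times out.
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass
    return open_ports


if __name__ == "__main__":
    # Run this on the EMR master node itself.
    print(find_open_ports("localhost", CANDIDATE_PORTS))
```

Whatever port turns up is the one to try in the Thrift URI of your Hive catalog config.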

Once you have everything configured, just follow the docs to start the server and you’ll be ready to query. One thing to note, though, is that you’ll need to set up PrestoDB manually on the rest of your machines and also enable the discovery service for this to “really” work.
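For the multi-node case, each machine needs its own node ID while every node points at the same discovery URI. Here is a rough sketch of generating those files. The property names are the ones I remember from the Presto deployment docs, so check them against the version you install. The helper names, the "emr" environment name, and the hostname are made up for illustration, and a real config will need more settings than this (memory limits, catalogs, and so on).

```python
import uuid


def render_node_properties(environment, data_dir, node_id=None):
    """Build the text of etc/node.properties for one machine.

    Each node needs a distinct node.id, so a random UUID is generated
    unless one is passed in explicitly.
    """
    node_id = node_id or str(uuid.uuid4())
    return (
        f"node.environment={environment}\n"
        f"node.id={node_id}\n"
        f"node.data-dir={data_dir}\n"
    )


def render_config_properties(coordinator_host, is_coordinator, http_port=8080):
    """Build a minimal etc/config.properties for a coordinator or a worker."""
    lines = [
        f"coordinator={'true' if is_coordinator else 'false'}",
        f"http-server.http.port={http_port}",
    ]
    if is_coordinator:
        # Assumption: the coordinator runs the embedded discovery service.
        lines.append("discovery-server.enabled=true")
    # Every node, coordinator included, points at the same discovery URI.
    lines.append(f"discovery.uri=http://{coordinator_host}:{http_port}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "master.example" is a placeholder hostname.
    print(render_node_properties("emr", "/mnt/presto"))
    print(render_config_properties("master.example", is_coordinator=False))
```

The data directory here is the same “/mnt/presto” directory mentioned earlier.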

Anyway, happy querying!

Posted In: Big Data

  • DJElbow

    Any pointers on how to deploy and configure Presto specifically on EMR? Did you use a custom script and “Bootstrap Actions” to get Presto on all your EMR nodes? Or did you do this another way?

  • I actually only ran a single-node Presto instance. I used EMR to avoid having to set up HDFS and Hadoop manually.

    The best way to get it across all the nodes might be to do the following:
    – Set “Auto-terminate” to “No” when you launch your EMR cluster so that it persists
    – Grab the hostnames for your machines
    – SSH into the “master” and set up Presto there
    – Then, just write a bash script that replicates the install steps, and scp it over along with a config file to configure the other nodes
    – Use SSH to execute this bash script across your other nodes – http://stackoverflow.com/questions/305035/how-to-use-ssh-to-run-shell-script-on-a-remote-machine
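
    Roughly, the last two steps could be scripted like the sketch below. It assumes the other nodes accept SSH from wherever you run it (I haven’t verified that on EMR). The "hadoop" login, the hostnames, and the file names are placeholders. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_commands(host, script_path, config_path, user="hadoop", remote_dir="/tmp"):
    """Return the scp and ssh commands that push and run the install script."""
    target = f"{user}@{host}"
    script_name = script_path.rsplit("/", 1)[-1]
    remote_script = f"{remote_dir}/{script_name}"
    return [
        # Copy the install script and the config file to the node.
        ["scp", script_path, config_path, f"{target}:{remote_dir}/"],
        # Run the install script on the node.
        ["ssh", target, f"bash {shlex.quote(remote_script)}"],
    ]


def deploy(hosts, script_path, config_path, dry_run=True):
    """Print (or, with dry_run=False, execute) the commands for every host."""
    for host in hosts:
        for cmd in build_commands(host, script_path, config_path):
            if dry_run:
                print(" ".join(shlex.quote(part) for part in cmd))
            else:
                subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical hostnames and paths; substitute your own.
    deploy(["node1.example", "node2.example"],
           "/home/hadoop/install_presto.sh",
           "/home/hadoop/config.properties")
```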

    Let me know how it goes!

  • DJElbow

    In my experience, you can’t SSH into the other nodes of an EMR cluster, so I am guessing SCP won’t work either. I am still hoping that I am wrong about this :) However, I just ran across this interesting Ruby example of installing Presto on EMR using bootstrap actions: https://github.com/AmazonEMR/bootstrap.actions/blob/master/presto/install

  • Ahh gotcha. That script looks pretty nifty – thanks for sharing.