PrestoDB: Running PrestoDB on Amazon EMR

A weeks ago, Facebook released a new open source project called PrestoDB which they billed as a market improvement over Hive and Hadoop. According to the PrestoDB site, Presto is a real time query engine that supports a SQL like syntax, similar to Hive. However, unlike Hive, Presto doesn’t execute queries using MapReduce jobs but instead uses its own internal distribution mechanism. According to the Presto site and current users, most queries will see an order of magnitude speedup compared to Hive. And the best part? PrestoDB can read metadata from Hive’s metastore and read files off HDFS just like Hive – pretty wild.

Anyway, since I love new toys (who doesn’t!?) I decided to try setting up PrestoDB on Amazon EMR to see how difficult it was and also experience the speedups. Turns out, once you have an Amazon EMR cluster running getting PrestoDB up is almost trivial. Just follow the PrestoDB deploying directions to get yourself situated. Make sure you create *all* the files or you’ll get some necessarily cryptic errors along the way.

The config files I ended up using were:

You’ll need to create the “/mnt/presto” directory and also make it accessible to whatever user you plan to run the daemon under.

The one huge gotcha I ran into was that I couldn’t figure out what port Hive’s Thrift service was running on. For some reason, it’s notably absent from Amazon’s documentation and I couldn’t find the hive-site.xml file on the EMR EC2. Completely randomly, I ran across this manual page from Jaspersoft enumerating which ports different versions of Hive run Thrift on when you use EMR. Turns out, its different per Hive version but 0.11.0 will use 10004.

Once you have everything configured, just follow the docs to start the server and you’ll be ready to query. One thing to note though is that you’ll need to setup PrestoDB manually on the rest of your machines and also enable the discovery service for this to “really” work.

Anyway, happy querying!

Musing: Should everyone learn to code?

Last week, President Obama made headlines by suggesting that every American in school should learn how to code. Predictably, the comment sparked some heated discussion across the web from Fred Wilson’s blog to several threads on Hacker News. Surprisingly, some of the viewpoints were extremely polarized ranging from “its useless, some people will never get it” to “of course!”. Personally, I think everyone should definitely be exposed to some form of programming while they’re in school.

An inescapable reality is that in 2013 computers are a part of everyone’s personal and professional day to day. From non-technical roles in technical fields like account managers or project managers to traditionally non-technical jobs, like teachers, everyone is ultimately interacting with computers on a daily basis. With that in mind, having a basic understanding of how computing abstractions and programming work will benefit everyone. From being able to modify a VBA macro to construct a complex Gmail search query, having a basic understanding of how the pieces fit together certainly can’t hurt.

Looking back at high school, drawing an analogy between studying programming and studying a foreign language isn’t really accurate. A better analogy is really the general experience people have studying math in middle and high school. For people that don’t take a math class in college, that’ll normally be the last time they study math in an academic setting. Although most people forget most of the details they learned, they still retain the overarching fundamentals of how things like algebra and geometry work. Because of this, when people are faced with a basic math problem they generally know what they need to look up in order to solve it. Extending this, if people were introduced to basic programming early on they’d have a sense that there might be an easier way to approach certain tasks. Need to format a list of names in Excel? There might be a function for that.

So how can we make this happen? The good news is there’s already a push to make high quality, programming focused education material available to everyone. There are already dozens of masively online open course projects including Khan Academy, Coursera, and Code Academy providing free, interactive, computer science resource for everyone. The next step is pushing states and school systems to actively adopt CS education for their middle school and high school students. Hopefully it’ll prove and easy and effective step to keeping everyone competitive in an increasingly technology powered workplace.

Hive: Hive in 15 minutes on Amazon EMR

As far as “big data” solutions go, Hive is probably one of the more recognizable names. Hive basically offers the end user an abstraction layer to run “SQL like” queries as MapReduce jobs across data that they have in HDFS. Concretely, say you had several hundred million rows of data and you wanted to count the number of unique IDs Hive would let you do that. One of the issues with Hadoop and by proxy Hive is that it’s notably difficult to setup a cluster to try things out. Tools like Whirr exist to make things easier they’re, a bit rough around the edges and in my experience hit up against “version hell”. One alternative that I’m surprised isn’t more popular is using Amazon’s Elastic Map Reduce to bootstrap a Hadoop cluster to experiment with.

Fire up the cluster

The first thing you’ll need to do is fire up an EMR cluster from the AWS backend. It’s mostly just point and click but the settings I used were:

  • Termination protection? No
  • Logging? Disabled
  • Debugging? Off since no logging
  • Tags – None
  • AMI Version: 2.4.2 (latest)
  • Applications to be installed:
  • Hive 0.11.0.1
  • Pig 0.11.1.1
  • Hardware Configuration:
  • One m1.small for the master
  • Two m1.small for the cores

The “security and access” section is important, you need to select an existing key pair that you have access to so that you can SSH into your master node to use the Hive CLI client.

Then finally, under Steps since you’re not specifying any pre-determined steps make sure you mark “Auto-terminate” as “No” so that the cluster doesn’t terminate immediately after it boots.

Click “Create Cluster” and you’re off to the races.

Pull some data, and load HDFS

Once the cluster launches, you’ll see a dashboard screen with a bunch of information about the cluster including the public DNS address for the “Master”. SSH into this machine using the user “hadoop” and whatever key you launched the cluster with:

Once you’re in, you’ll want to grab some data to play with. I pulled down Wikipedia Page View data since it’s just a bunch of gzipped text files which are perfect for Hive. You can pull down a chunk of files using wget, be aware though that the small EC2s don’t have much storage so you’ll need to keep an eye on your disk space.

Once you have some data (grab a few GB), the next step is to push it over to HDFS, Hadoop’s distributed filesystem. As an aside, Amazon EMR is tightly integrated with Amazon S3 so if you already have a dataset in S3 you can copy directly from S3 to HDFS. Anyway, to push your files to HDFS just run:

Build some tables, query some data!

And finally, it’s time to query some of the pageview data using Hive. The first step is to let Hive know about your data and what format it’s stored in. To do this, you need to create an external table that points to the location of the files that you just pushed to HDFS. Start the Hive client by running “hive” and then do the following:

Now select some data from your newly created table!

Pretty sweet huh? Now feel free to run any arbitrary query against the data. Note: since we used m1.small EC2s the performance of Hive/Hadoop is going to be pretty abysmal. But hey, give it a shot:

Anyway, don’t forget to tear down the cluster once you’re done. As always, let me know if you run into any issues!

Hive: How to write a custom SerDe class

We’ve been using Hive a bit lately to help clients tackle some of their data needs and without a doubt one of the most powerful features is Hive’s SerDe functionality. Taking a step back, Hive is an open source Apache project that lets you run “SQL Like” queries using Hadoop on data that you have in HDFS. It’s a lot of moving pieces but what it fundamentally comes down to is that Hive will let you run what look like SQL queries across the text files that you have in HDFS. A typical use case would be using Hive to run ad-hoc queries across web server (like nginx) logs. Want to a breakdown of response times by frontend web server? Hive would let you do that.

SerDe?

SerDe is actually short for Serialize/Deserialize and its the mechanism that Hive uses to make sense of your text files in HDFS. Lets take a typical nginx log line:

Now the magic comes in how Hive uses a SerDe to translate a line like that into something that’s queryable. This is contrived but lets assume that for some reason we’re interested in querying on the client IP address and the request size of each log line. So we’d be interested in creating a table that looks like:

Turns out, Hive makes this particularly easy. You’d end up using the RegexSerDe to match a regular expression and then extract the two fields you’re interested in.

A custom one

The next step after extraction is to do some transformation during the extraction stage and this is where the custom SerDe comes in. For example, lets say that you wanted to geocode the client’s IP address and also convert your dates into Unix timestamps. So your table would be something like:

Your custom SerDe would let you do exactly this. You’d be able to use something like the MaxMind database to geocode your IP addresses and then use some extra Java to convert your timestamps.

Unfortunately, there doesn’t seem to be too much documentation on how to actually write a custom class so here’s a couple of tidbits I’ve picked up:

  • It looks like at some point the SerDe class was refactored so depending on what Hive version you’re using you’ll need to extend a different class. On Hive 0.11 the class you’ll want to extend is “org.apache.hadoop.hive.serde2.SerDe”
  • You’ll need to include a couple of JARs in order to get the class to build. I had to include commons-logging-1.0.4.jar, hadoop-0.20.1-core.jar, hive-contrib-0.10.0-cdh4.4.0.jar, hive-exec-0.10.0-cdh4.4.0.jar, junit-4.5.jar
  • As noted above, you need to pull the specific versions of the JARs that you’re going to end up running this SerDe against
  • Make sure you target the right Java JRE version. If your servers are running Java 1.6 and you target 1.7 you end up getting really cryptic error messages.
  • If you create a table using your SerDe, you’ll need to have that JAR available to drop that table

The best way I’ve found to bootstrap this is to create an Eclipse project, include the necessary JARs, and then get the RegExSerDe to build inside the project. Once that works, test the JAR by creating a table using it and then you’ll be able to modify the class from there.

Even with my awful Java, the RegexSerDe class was easy enough to grok and then modify as needed.

Stuck? Need Help?

Drop me a comment or shoot me an email and I’ll do my best to help you out.

SQL Join Checker, Making Sure Your Joins Are Right

Recently on a project we came across the need to generate a bunch of different reports from the database.  Due to different requirements we weren’t able to use the ORM (Doctrine2 on the specific project), so we wrote the queries by hand.  As we continued to build the different reports we noticed sometimes we’d typo a join, for example join something on `id` versus `user_id`.  These small typos would cause the reports to still run fine, however have the incorrect data, often it was difficult to pinpoint the exact issue in the given report, as only certain conditions could reproduce the results.  After a while Ashish said it’d be great if we had some sort of sanity checker to make sure the queries we were writing were going across the proper joins.  To me, this was:

challenge-accepted

 

At first I thought about just using Regular Expressions to parse out the join parts of the SQL queries.  However, I found http://code.google.com/p/php-sql-parser/ which appears to do the job.  I downloaded it and wrote a class which uses it and some expressions to discover FK’s in the database.  I ended up with something which, albeit not the most elegant, gets the job done.  Here is an example output of it:

Basically it will run through whatever query you give it, and make sure that the columns you are joining on are defined in the DB. If you are trying to join on a column that is defined as a constraint, it will output the part of the join that failed the check as well as what FK’s currently do exist. Another issue this may help with, is if your database is missing a constraint (FK) that should be defined it will point it out.

I wrote this really quickly, so let me know (or make a pull request) if you find any bugs. I’ve put the code up on Github. Let me know if it helps out!