Hive: Hive in 15 minutes on Amazon EMR

As far as “big data” solutions go, Hive is probably one of the more recognizable names. Hive basically offers the end user an abstraction layer to run “SQL-like” queries as MapReduce jobs across data they have in HDFS. Concretely, say you had several hundred million rows of data and you wanted to count the number of unique IDs: Hive would let you do that. One of the issues with Hadoop, and by proxy Hive, is that it’s notably difficult to set up a cluster just to try things out. Tools like Whirr exist to make things easier, but they’re a bit rough around the edges and in my experience run into “version hell”. One alternative that I’m surprised isn’t more popular is using Amazon’s Elastic MapReduce to bootstrap a Hadoop cluster to experiment with.

Fire up the cluster

The first thing you’ll need to do is fire up an EMR cluster from the AWS console. It’s mostly just point and click, but the settings I used were:

  • Termination protection? No
  • Logging? Disabled
  • Debugging? Off since no logging
  • Tags – None
  • AMI Version: 2.4.2 (latest)
  • Applications to be installed:
      • Hive 0.11.0.1
      • Pig 0.11.1.1
  • Hardware Configuration:
      • One m1.small for the master
      • Two m1.small for the cores

The “Security and Access” section is important: you need to select an existing key pair that you have access to so that you can SSH into your master node to use the Hive CLI client.

Then finally, under “Steps”, since you’re not specifying any pre-determined steps, make sure “Auto-terminate” is set to “No” so that the cluster doesn’t terminate immediately after it boots.

Click “Create Cluster” and you’re off to the races.

Pull some data, and load HDFS

Once the cluster launches, you’ll see a dashboard screen with a bunch of information about the cluster including the public DNS address for the “Master”. SSH into this machine using the user “hadoop” and whatever key you launched the cluster with:
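
Something along these lines will do it (the hostname and key path below are placeholders; use the “Master public DNS” shown on your cluster dashboard and your own .pem file):

  # placeholder hostname and key path; substitute your cluster's master DNS and your own key
  ssh -i ~/mykey.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com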

Once you’re in, you’ll want to grab some data to play with. I pulled down Wikipedia Page View data since it’s just a bunch of gzipped text files, which are perfect for Hive. You can pull down a chunk of files using wget; be aware, though, that the small EC2s don’t have much storage, so you’ll need to keep an eye on your disk space.
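
For example, to grab one hour’s worth of pagecounts (the URL below follows the layout of the dumps.wikimedia.org pagecounts-raw archive at the time of writing, so treat the exact path as an assumption and adjust the date and hour to taste):

  # one hour of pageview counts; repeat for as many hours as your disk can hold
  wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-12/pagecounts-20131201-000000.gz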

Once you have some data (grab a few GB), the next step is to push it over to HDFS, Hadoop’s distributed filesystem. As an aside, Amazon EMR is tightly integrated with Amazon S3 so if you already have a dataset in S3 you can copy directly from S3 to HDFS. Anyway, to push your files to HDFS just run:
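
Assuming the gzipped files are sitting in your current directory and you want them under a /pagecounts directory in HDFS (both of those are just the names I’m using for this walkthrough):

  hadoop fs -mkdir /pagecounts
  hadoop fs -put pagecounts-*.gz /pagecounts/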

Build some tables, query some data!

And finally, it’s time to query some of the pageview data using Hive. The first step is to let Hive know about your data and what format it’s stored in. To do this, you need to create an external table that points to the location of the files that you just pushed to HDFS. Start the Hive client by running “hive” and then do the following:
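
The pagecount files are space-delimited with four fields per line (project, page title, number of views, and bytes transferred), so an external table along these lines should work; the table and column names here are just my choices:

  CREATE EXTERNAL TABLE pagecounts (
    project STRING,
    page_title STRING,
    views BIGINT,
    bytes_transferred BIGINT
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ' '
  LOCATION '/pagecounts/';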

Now select some data from your newly created table!
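
For example, just to sanity check that the rows are parsing (again using the table from the sketch above):

  SELECT * FROM pagecounts LIMIT 10;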

Pretty sweet huh? Now feel free to run any arbitrary query against the data. Note: since we used m1.small EC2s the performance of Hive/Hadoop is going to be pretty abysmal. But hey, give it a shot:
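
For example, a per-project rollup of page views, again assuming the table sketched above (on m1.smalls, expect it to take a while):

  SELECT project, SUM(views) AS total_views
  FROM pagecounts
  GROUP BY project
  ORDER BY total_views DESC
  LIMIT 20;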

Anyway, don’t forget to tear down the cluster once you’re done. As always, let me know if you run into any issues!

Hive: How to write a custom SerDe class

We’ve been using Hive a bit lately to help clients tackle some of their data needs, and without a doubt one of its most powerful features is Hive’s SerDe functionality. Taking a step back, Hive is an open source Apache project that lets you run “SQL-like” queries using Hadoop on data that you have in HDFS. There are a lot of moving pieces, but what it fundamentally comes down to is that Hive will let you run what look like SQL queries across the text files that you have in HDFS. A typical use case would be using Hive to run ad-hoc queries across web server (like nginx) logs. Want a breakdown of response times by frontend web server? Hive would let you do that.

SerDe?

SerDe is actually short for Serialize/Deserialize, and it’s the mechanism that Hive uses to make sense of your text files in HDFS. Let’s take a typical nginx log line:
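
Something like this (nginx’s default “combined” log format; the IP, URL, and user agent are made up):

  10.0.0.1 - - [07/Dec/2013:21:20:22 +0000] "GET /index.html HTTP/1.1" 200 3820 "-" "Mozilla/5.0 (X11; Linux x86_64)"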

Now the magic comes in how Hive uses a SerDe to translate a line like that into something that’s queryable. This is contrived, but let’s assume that for some reason we’re interested in querying on the client IP address and the request size of each log line; in other words, a two-column table with just the client IP and the number of bytes sent.

Turns out, Hive makes this particularly easy. You’d end up using the RegexSerDe to match a regular expression and then extract the two fields you’re interested in.
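
Here’s a sketch of what that DDL could look like, assuming the hive-contrib RegexSerDe, a made-up table name, and a rough regex for the combined log format (treat the regex and the HDFS path as starting points rather than gospel):

  -- depending on your setup you may first need: ADD JAR /path/to/hive-contrib-<version>.jar;
  CREATE EXTERNAL TABLE nginx_requests (
    client_ip  STRING,
    bytes_sent STRING
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" \\d+ (\\d+).*",
    "output.format.string" = "%1$s %2$s"
  )
  LOCATION '/logs/nginx/';

Note that the RegexSerDe treats every column as a string, so you’d cast bytes_sent to an INT in your queries.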

A custom one

The next step after extraction is to do some transformation during the extraction stage, and this is where a custom SerDe comes in. For example, let’s say that you wanted to geocode the client’s IP address and also convert your dates into Unix timestamps. So your table would be something like:
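
Perhaps something along these lines, with the geo columns and a Unix timestamp added (the names and types are just for illustration; in practice the table’s ROW FORMAT SERDE clause would point at your custom class):

  CREATE TABLE nginx_requests_geo (
    client_ip    STRING,
    country_code STRING,
    city         STRING,
    request_ts   BIGINT,
    bytes_sent   INT
  );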

Your custom SerDe would let you do exactly this. You’d be able to use something like the MaxMind database to geocode your IP addresses and then use some extra Java to convert your timestamps.

Unfortunately, there doesn’t seem to be too much documentation on how to actually write a custom class so here’s a couple of tidbits I’ve picked up:

  • It looks like at some point the SerDe class was refactored, so depending on which Hive version you’re using you’ll need to extend a different class. On Hive 0.11, what you want to implement is “org.apache.hadoop.hive.serde2.SerDe” (there’s a bare-bones sketch of this right after this list)
  • You’ll need to include a couple of JARs in order to get the class to build. I had to include commons-logging-1.0.4.jar, hadoop-0.20.1-core.jar, hive-contrib-0.10.0-cdh4.4.0.jar, hive-exec-0.10.0-cdh4.4.0.jar, and junit-4.5.jar
  • As noted above, you need to pull the specific versions of the JARs that you’re going to end up running this SerDe against
  • Make sure you target the right Java JRE version. If your servers are running Java 1.6 and you target 1.7, you end up getting really cryptic error messages.
  • If you create a table using your SerDe, you’ll need to have that JAR available in order to drop that table
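
To make that a little more concrete, here’s a bare-bones, deserialization-only sketch against the Hive 0.11 interface. The package name, class name, and hard-coded columns are all made up for illustration; a real implementation would read its column list out of the table properties and do the actual parsing, geocoding, and timestamp conversion in deserialize():

  package com.example.hive; // hypothetical package/class names for this sketch

  import java.util.Arrays;
  import java.util.List;
  import java.util.Properties;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hive.serde2.SerDe;
  import org.apache.hadoop.hive.serde2.SerDeException;
  import org.apache.hadoop.hive.serde2.SerDeStats;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
  import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  public class MyNginxSerDe implements SerDe {

      private ObjectInspector rowInspector;

      @Override
      public void initialize(Configuration conf, Properties tbl) throws SerDeException {
          // A real SerDe would read the column names/types (and any SERDEPROPERTIES)
          // out of "tbl" here; this sketch just hard-codes two string columns.
          List<String> columnNames = Arrays.asList("client_ip", "raw_line");
          List<ObjectInspector> inspectors = Arrays.asList(
                  (ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                  (ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector);
          rowInspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, inspectors);
      }

      @Override
      public Object deserialize(Writable blob) throws SerDeException {
          // Hive hands you one raw line at a time; return a row (a List here) that
          // matches the struct inspector built in initialize().
          String line = ((Text) blob).toString();
          String[] parts = line.split(" ");
          String clientIp = parts.length > 0 ? parts[0] : null;
          // This is where the interesting work (geocoding IPs, converting timestamps, etc.) would go.
          return Arrays.asList(clientIp, line);
      }

      @Override
      public ObjectInspector getObjectInspector() throws SerDeException {
          return rowInspector;
      }

      @Override
      public SerDeStats getSerDeStats() {
          // No stats tracked in this sketch.
          return null;
      }

      @Override
      public Class<? extends Writable> getSerializedClass() {
          return Text.class;
      }

      @Override
      public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
          // Only needed if you plan on INSERTing into a table backed by this SerDe.
          throw new SerDeException("Serialization is not supported by this sketch");
      }
  }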

The best way I’ve found to bootstrap this is to create an Eclipse project, include the necessary JARs, and then get the RegexSerDe to build inside the project. Once that works, test the JAR by creating a table using it, and then you’ll be able to modify the class from there.

Even with my awful Java, the RegexSerDe class was easy enough to grok and then modify as needed.

Stuck? Need Help?

Drop me a comment or shoot me an email and I’ll do my best to help you out.

SQL Join Checker: Making Sure Your Joins Are Right

Recently on a project we came across the need to generate a bunch of different reports from the database. Due to different requirements we weren’t able to use the ORM (Doctrine2 on this particular project), so we wrote the queries by hand. As we continued to build the different reports, we noticed we’d sometimes typo a join, for example joining on `id` versus `user_id`. These small typos would still let the reports run fine, but with incorrect data, and it was often difficult to pinpoint the exact issue in a given report since only certain conditions could reproduce the results. After a while, Ashish said it’d be great if we had some sort of sanity checker to make sure the queries we were writing were going across the proper joins. To me, this was a challenge accepted.

At first I thought about just using regular expressions to parse out the join parts of the SQL queries. However, I found http://code.google.com/p/php-sql-parser/ which appears to do the job. I downloaded it and wrote a class which uses it, along with some expressions to discover FKs in the database. I ended up with something which, albeit not the most elegant, gets the job done. Here is an example of its output:

Basically, it will run through whatever query you give it and make sure that the columns you are joining on are defined as foreign keys in the DB. If you are trying to join on a column that isn’t defined as a constraint, it will output the part of the join that failed the check as well as which FKs currently exist. Another issue this may help with: if your database is missing a constraint (FK) that should be defined, it will point that out.
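
For illustration, this is the flavor of mistake it’s meant to catch (the schema and names are made up): the query below parses and runs, but the join doesn’t line up with the foreign key that’s actually defined.

  -- suppose the real FK is orders.user_id -> users.id
  SELECT u.email, o.total
  FROM orders o
  JOIN users u ON u.id = o.id;  -- typo: should be o.user_id, so the checker would flag this join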

I wrote this really quickly, so let me know (or make a pull request) if you find any bugs. I’ve put the code up on Github. Let me know if it helps out!

End of Movember, Just Under 2k Raised

November 30th marked the end of Movember for us. We raised a total of $1,848 for men’s health! I think over the month, those of us who participated (and our significant others) also learned that we just aren’t built for mustaches. Here are some before and after shots of each of us from the month. Thanks for donating!

[Before and after photos: Ashish, Jared, Daum]

Let us know if you have any suggestions for our next run at it! Until next Movember, stay healthy.

Big Data: Black Friday & Twitter Streaming API

It’s that time of year again. Lines form outside the most popular retailers, filled with turkey-gorged shoppers eagerly awaiting this year’s biggest Black Friday deals. In an effort to curb their boredom, these shoppers take to Twitter to pass the time in line and share their shopping experiences. Since we’re not big shoppers ourselves, and certainly not fans of waiting in lines, we took a different approach to participating in Black Friday.

We decided to flex our big data muscles and hook into Twitter’s streaming API sample, which represents a random sampling of Twitter’s 400 million tweets per day, and record all tweets mentioning Black Friday. In order to handle the streaming data from Twitter, we set up a Storm cluster which processed close to 1 million Black Friday related tweets and then saved the data in a MySQL database we spun up on AWS.

For those of you not familiar, Storm is an open source distributed real-time computation system which can be used to reliably process unbounded streams of data. If you’re interested in the technical details, stay tuned because we’ll be putting out a separate blog post that walks you through what we did. Also, if you’d like a copy of the MySQL table with the tweet data, you can download it here.

We put together the infographic below based on the data we collected over the 24-hour period from Thursday 8pm EST to Friday 8pm EST. We hope you enjoy it.

[Infographic: Black Friday on Twitter, by Setfive Consulting]