As far as “big data” solutions go, Hive is probably one of the more recognizable names. Hive basically offers the end user an abstraction layer to run “SQL like” queries as MapReduce jobs across data that they have in HDFS. Concretely, say you had several hundred million rows of data and you wanted to count the number of unique IDs Hive would let you do that. One of the issues with Hadoop and by proxy Hive is that it’s notably difficult to setup a cluster to try things out. Tools like Whirr exist to make things easier they’re, a bit rough around the edges and in my experience hit up against “version hell”. One alternative that I’m surprised isn’t more popular is using Amazon’s Elastic Map Reduce to bootstrap a Hadoop cluster to experiment with.
Fire up the cluster
The first thing you’ll need to do is fire up an EMR cluster from the AWS backend. It’s mostly just point and click but the settings I used were:
- Termination protection? No
- Logging? Disabled
- Debugging? Off since no logging
- Tags – None
- AMI Version: 2.4.2 (latest)
- Applications to be installed:
- Hive 0.11.0.1
- Pig 0.11.1.1
- Hardware Configuration:
- One m1.small for the master
- Two m1.small for the cores
The “security and access” section is important, you need to select an existing key pair that you have access to so that you can SSH into your master node to use the Hive CLI client.
Then finally, under Steps since you’re not specifying any pre-determined steps make sure you mark “Auto-terminate” as “No” so that the cluster doesn’t terminate immediately after it boots.
Click “Create Cluster” and you’re off to the races.
Pull some data, and load HDFS
Once the cluster launches, you’ll see a dashboard screen with a bunch of information about the cluster including the public DNS address for the “Master”. SSH into this machine using the user “hadoop” and whatever key you launched the cluster with:
Once you’re in, you’ll want to grab some data to play with. I pulled down Wikipedia Page View data since it’s just a bunch of gzipped text files which are perfect for Hive. You can pull down a chunk of files using wget, be aware though that the small EC2s don’t have much storage so you’ll need to keep an eye on your disk space.
Once you have some data (grab a few GB), the next step is to push it over to HDFS, Hadoop’s distributed filesystem. As an aside, Amazon EMR is tightly integrated with Amazon S3 so if you already have a dataset in S3 you can copy directly from S3 to HDFS. Anyway, to push your files to HDFS just run:
Build some tables, query some data!
And finally, it’s time to query some of the pageview data using Hive. The first step is to let Hive know about your data and what format it’s stored in. To do this, you need to create an external table that points to the location of the files that you just pushed to HDFS. Start the Hive client by running “hive” and then do the following:
Now select some data from your newly created table!
Pretty sweet huh? Now feel free to run any arbitrary query against the data. Note: since we used m1.small EC2s the performance of Hive/Hadoop is going to be pretty abysmal. But hey, give it a shot:
Anyway, don’t forget to tear down the cluster once you’re done. As always, let me know if you run into any issues!