Hive in 15 minutes on Amazon EMR

As far as “big data” solutions go, Hive is probably one of the more recognizable names. Hive basically offers the end user an abstraction layer to run “SQL-like” queries as MapReduce jobs across data stored in HDFS. Concretely, say you had several hundred million rows of data and wanted to count the number of unique IDs in them; Hive lets you do that without writing any MapReduce code. One of the issues with Hadoop, and by proxy Hive, is that it’s notoriously difficult to set up a cluster just to try things out. Tools like Whirr exist to make things easier, but they’re a bit rough around the edges and, in my experience, run up against “version hell”. One alternative that I’m surprised isn’t more popular is using Amazon’s Elastic MapReduce (EMR) to bootstrap a Hadoop cluster to experiment with.
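To make that concrete, the query side of Hive is just SQL. The table and column names below are hypothetical, but a distinct count over a few hundred million rows really is a one-liner that Hive compiles into a MapReduce job for you:

-- hypothetical table and column, just to show the flavor of HiveQL
SELECT COUNT(DISTINCT user_id) FROM events;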

Fire up the cluster

The first thing you’ll need to do is fire up an EMR cluster from the AWS console. It’s mostly just point and click, but the settings I used were:

  • Termination protection? No
  • Logging? Disabled
  • Debugging? Off, since there’s no logging
  • Tags: None
  • AMI Version: 2.4.2 (latest)
  • Applications to be installed:
      • Hive 0.11.0.1
      • Pig 0.11.1.1
  • Hardware Configuration:
      • One m1.small for the master
      • Two m1.smalls for the core nodes

The “Security and Access” section is important: you need to select an existing key pair that you have access to so that you can SSH into your master node and use the Hive CLI client.
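One practical note on that key: when you get to the SSH step later, the private key file needs tight permissions or ssh will refuse to use it. The file name and hostname here are placeholders:

chmod 400 ~/my-emr-key.pem                          # ssh ignores keys that are world-readable
ssh -i ~/my-emr-key.pem hadoop@<master-public-dns>  # or add the key to your ssh-agent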

Then finally, under “Steps”, since you’re not specifying any pre-determined steps, make sure “Auto-terminate” is set to “No” so that the cluster doesn’t terminate immediately after it boots.

Click “Create Cluster” and you’re off to the races.
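If you’d rather script the launch than click through the console, the AWS CLI can create roughly the same cluster. Treat this as a sketch rather than a copy-paste recipe: the cluster name and key pair name are placeholders, and flag details can vary between CLI versions.

aws emr create-cluster \
  --name "hive-playground" \
  --ami-version 2.4.2 \
  --applications Name=Hive Name=Pig \
  --ec2-attributes KeyName=my-emr-key \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.small \
                    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.small \
  --no-auto-terminate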

Pull some data, and load HDFS

Once the cluster launches, you’ll see a dashboard screen with a bunch of information about the cluster including the public DNS address for the “Master”. SSH into this machine using the user “hadoop” and whatever key you launched the cluster with:

ashish@ashish:~$ ssh hadoop@ec2-107-20-21-245.compute-1.amazonaws.com
Linux (none) 3.2.30-49.59.amzn1.x86_64 #1 SMP Wed Oct 3 19:54:33 UTC 2012 x86_64
--------------------------------------------------------------------------------
Welcome to Amazon Elastic MapReduce running Hadoop and Debian/Squeeze.
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.
The Hadoop UI can be accessed via the following commands:
JobTracker lynx http://localhost:9100/
NameNode lynx http://localhost:9101/
--------------------------------------------------------------------------------
Last login: Thu Dec 12 02:53:54 2013 from 50.136.18.114
hadoop@ip-10-29-191-137:~$

Once you’re in, you’ll want to grab some data to play with. I pulled down Wikipedia page view data since it’s just a bunch of gzipped text files, which are perfect for Hive. You can pull down a chunk of files using wget; be aware, though, that the small EC2 instances don’t have much storage, so you’ll need to keep an eye on your disk space (a quick way to do that is shown right after the wget command below).

hadoop@ip-10-29-191-137:~$ wget -r --no-parent --reject "index.html*" http://dumps.wikimedia.org/other/pagecounts-raw/2013/
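Two standard commands are enough to keep tabs on how much you’ve fetched and how much room is left:

du -sh ~/dumps.wikimedia.org   # total size of everything wget has pulled so far
df -h                          # free space remaining on the mounted volumes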

Once you have some data (grab a few GB), the next step is to push it over to HDFS, Hadoop’s distributed filesystem. As an aside, Amazon EMR is tightly integrated with Amazon S3, so if you already have a dataset in S3 you can copy it directly from S3 to HDFS (there’s a sketch of that after the commands below). Anyway, to push your files to HDFS just run:

hadoop@ip-10-29-191-137:~$ cd ~/dumps.wikimedia.org/other/pagecounts-raw/2013/
hadoop@ip-10-29-191-137:~/dumps.wikimedia.org/other/pagecounts-raw/2013$ hadoop dfs -mkdir /mnt/pageviews-2013-12
hadoop@ip-10-29-191-137:~/dumps.wikimedia.org/other/pagecounts-raw/2013$ hadoop dfs -put * /mnt/pageviews-2013-12/
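For reference, if your dataset were already sitting in S3, a distcp along these lines would copy it straight into HDFS instead; the bucket and path here are hypothetical:

# hypothetical bucket and path
hadoop distcp s3n://my-bucket/pagecounts-raw/2013/ /mnt/pageviews-2013-12/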

Build some tables, query some data!

And finally, it’s time to query some of the pageview data using Hive. The first step is to let Hive know about your data and what format it’s stored in. To do this, you need to create an external table that points to the location of the files that you just pushed to HDFS. Start the Hive client by running “hive” and then do the following:

hadoop@ip-10-29-191-137:~/dumps.wikimedia.org/other/pagecounts-raw/2013$ hive
Logging initialized using configuration in file:/home/hadoop/.versions/hive-0.11.0/conf/hive-log4j.properties
Hive history file=/mnt/var/lib/hive_0110/tmp/history/hive_job_log_hadoop_21902@ip-10-29-191-137.ec2.internal_201312120412_1458262789.txt
hive> CREATE EXTERNAL TABLE page_views (
> project STRING, title STRING,
> req_count STRING, pg_size STRING
> ) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ' '
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '/mnt/pageviews-2013-12';
OK
Time taken: 0.313 seconds
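Note that nothing gets copied or moved at this point: because the table is external, Hive just records the schema and the location. A quick sanity check that the metadata landed:

hive> SHOW TABLES;
hive> DESCRIBE page_views;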

Now select some data from your newly created table!

hive> select * from page_views limit 10;
OK
*.b Ingl\xC3\x01\x00\x00\x00\x00Z\xBB\xB9\x01\x00\x00\x00\x00;kGS\xF6\xC3aH\x0D\x00\x00 1 325
* 100013-11-30T23:59:57.3 1 325
* 100\x00\x00\x00\x00\x00\x00\x00\xA1\x0D\xA1\x01\x00\x00\x00\x00\x10\x00\x00\x00\x00 1 325
* \x5Cx-\x22\x5Cx/(\x5Cx/*\x5Cx/(\x5Cx/\x5Cx/0\x82\x01\xB3\x06\x09`\x86H\x01\x86\xFDl\x01\x010\x82\x01\xA40:\x06\x08+\x06\x01\x05\x05\x07\x02\x01\x16.http://www.digicert.com/ss 1 325
* \x5Cx-\x22\x5Cx/(\x5Cx/*\x5Cx/(\x5Cx/\x5Cx/\x00\x00\x00\x00\xDE~\xC4\xD1\x91\xA5\x09\x00\x0A\x00\x00\x00\x00\x00\x00\x00\x19%~\x01\x00\x00\x00\x00\x0A\x00\x00\x00\x00\x00\x00\x00%%~\x01\x00\x00\x00\x004Qm\x01\x00\x00\x00\x00\x08\xB5'\xAB\x00\x00\x00 1 325
AR %D9%82%D8%A7%D8%A6%D9%85%D8%A9_%D8%A3%D9%84%D8%B9%D8%A7%D8%A8_%D8%A5%D9%8A_%D8%A2%D9%8A%D9%87_%D9%84%D9%88%D8%B3_%D8%A3%D9%86%D8%AC%D9%84%D9%88%D8%B3 1 44480
De.mw De 1 10302
De Customer-Relationship-Management 3 96858
De Include 3 24679
EN.mw EN 1 4693
Time taken: 14.766 seconds, Fetched: 10 row(s)

Pretty sweet, huh? Now feel free to run any arbitrary query against the data. Note: since we used m1.small EC2 instances, the performance of Hive/Hadoop is going to be pretty abysmal. But hey, give it a shot:

hive> select count(*) AS c, title from page_views group by title order by c desc limit 1000;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 6
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
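Those hints are usable as-is; for example, to pin the reducer count for the current session before re-running the query (4 is just an arbitrary pick for a cluster this small):

hive> set mapred.reduce.tasks=4;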

Anyway, don’t forget to tear down the cluster once you’re done. As always, let me know if you run into any issues!
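P.S. If you scripted the launch with the AWS CLI, tearing down is a one-liner too; the cluster ID below is a placeholder (yours is shown on the cluster’s dashboard page):

aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX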