I apologize for the buzzword heavy title but it was the best I could do. I couldn’t find a good quick start explaining how to get started with Hive so I thought I’d share my experiences.

Anyway, a client of ours came to us needing to analyze a dataset that was about ~200 million rows over 6 months and is currently growing at about 10 million rows a week and increasing. From a reporting standpoint, they were looking to run aggregate counts and group bys over the data and then display the results on charts. Additionally, they were also looking to select subsets of the data and use them later – basically SELECT * FROM table WHERE x AND y AND z.

Obviously, doing the calculations in real time was out of the question so we knew we were looking for a solution that would be easy to use, support the necessary requirements and that would predictably scale with the increasing generation rate of data.

On the surface, MySQL looks like a decent approach but it presents a couple of issues pretty quickly:

  • In order for the SUM, GROUP BY, and COUNT queries to be at all useful the MySQL tables would have to be heavily indexed. Unfortunately, due to the write heavy workload of the app this would mean having to copy data into an indexed MySQL database before running any reports.
  • Even with indexes, MySQL was pretty awful at selecting subsets of the data from a performance perspective.
  • And probably the biggest issue with MySQL is that it doesn’t scale linearly in the sense that if the data is growing at 500 million rows a week you can’t simply “throw more hardware” at it and be done with it.

With requirements in hand we hit the Internet and finally arrived at Hive running on top of Hadoop. Per Wikipedia,

From our perspective, this stack fits our requirements nicely since it doesn’t rely on keeping a second “reporting” MySQL database available, it will handle both sum/count/group by and selecting subets, and probably most importantly it will allow us at least in the near term to scale with the increasing rate of data generation.

“Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.”

To grossly over simplify, Hadoop provides a framework that allows you to break up a data intensive task into discrete pieces, run the pieces in a distributed fashion, and then combine the results giving you the results of the completed task. The quintessential example of a task that can be parallelized in this fashion is sorting a *really* big list since the list can be sorted in pieces and then the results can be combined at the end. See Merge Sort

The second piece of the tool chain is Hive. Again via Wikipedia,

“Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.”

Basically, Hive is a tool that leverages the Hadoop framework to provide reporting and query capabilities using a syntax similar to SQL.

That just leaves Sqoop, the app with a funny name and no Wikipedia entry. Sqoop was originally developed by Cloudera and basically serves as an import tool for Hadoop. For my purposes, it allowed me to easily import the data from my MySQL database into Hadoop’s HDFS so I could use it in Hive.

The rest of this post walks you through setting up Hadoop+Hive and analyzing some MySQL data.

Now that you know the players, lets figure out what we’re actually trying to do.

  1. We want to start a Hadoop cluster to use Hive on.
  2. Load our data from a MySQL database into this Hadoop cluster.
  3. Use Hive to run some reports on this data.
  4. Warehouse the results of this data in MySQL so we can graph it (not that exciting).

Starting the cluster

Theres actually one more tool you’ll need to get this to work – Apache Whirr. Whirr is actually really cool, it lets you automatically start cluster services (Hadoop, Voldermont, etc.) at a handful of cloud platforms (AWS, Rackspace, etc.)

NOTE: We exclusively use AWS for our hosting so everything described here is specific to AWS.

Fisrt, download the latest copy of Whirr – http://www.fightrice.com/mirrors/apache//incubator/whirr/ to your local machine. Whirr should work everywhere but these directions will match up against Linux/OSX the best.

The first thing you’ll need is a Whirr configuration file describing the cluster you want to build. Create a file called hadoop.properties and paste in the following:

whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop

whirr.provider=aws-ec2
whirr.identity=[REPLACE THIS WITH YOUR AWS ID]
whirr.credential=[REPLACE THIS WITH YOUR AWS SECRET]

whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-68ad5201
whirr.location-id=us-east-1

whirr.private-key-file=~/.ssh/id_rsa
whirr.public-key-file=~/.ssh/id_rsa.pub

There isn’t a ton going on in the file but you’ll need to switch out the credential lines for your AWS credentials. Also, you’ll need to double check that the ssh paths are accurate for your account.

The next step, is to actually launch the cluster. To do this run this command – double check the path to your hadoop.properties file is accurate:

./bin/whirr launch-cluster --config hadoop.properties

Just give it a few minutes, you’ll see a bunch of debug info scrolling across your terminal and hopefully a success message once its done. At this point, you’ll have a fully built Hadoop cluster with 3 nodes as described in your properties file ( 1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker ).

You can see all your nodes by checking out your Whirr cluster directory.

cat ~/.whirr/hadoop/instances

Prepping and loading the cluster

Now that the cluster is up, you’ll need to prep it and then load your data with Sqoop.

One of the most irritating “gotchas” I stumbled across was that Whirr adds the firewall rules necessary for Hadoop to its AWS security group.

Before you do anything, open your EC2 control panel and modify the new Whirr security group (#jcloud-something) so that all of your nodes can connect to each other on port 3306 (MySQL)

The next step is to install mysql-client across the entire cluster since Sqoop uses mysqldump to get at your data. You could manually ssh into every machine but Whirr provides a convenient “run-script” command to do just that.

Create a file called “prepCluster.sh” and put “sudo apt-get -q -y install mysql-client” in it. Then make sure the paths are right and run,

./bin/whirr run-script --script prepCluster.sh --config hadoop.properties

Once its done, you’ll see the aptitude output from all your nodes as they downloaded the MySQL client.

The next step is to install mysql-server, hive, and sqoop on the jobtracker. Doing this is pretty straightforward, look at the .whirr/hadoop/instances file from above and copy the namenode hostname.

Next, ssh in to that machine using your current username as the username. Once you’re in, just run the following to install everything:

sudo apt-get -q -y install sqoop
sudo apt-get -q -y install hadoop-hive
sudo apt-get -q -y install screen

NOTE: You’ll also need the MySQL ConnectorJ library so that Sqoop can connect to MySQL. Download it here and place it in “/usr/lib/sqoop/lib/”

Once everything is done installing, you’ll most likely want to move your MySQL data directories from their default location onto the /mnt partition since it’s much larger. Check out this article for a good walk through. Don’t forget to update AppArmour or MySQL won’t start. Once MySQL is setup, load the data you want to crunch.

Now, you’ll need to use Sqoop to load the data from the MySQL database into Hadoop’s HDFS. While logged into the jobtracker node you can just run the following to do that. You’ll need to swap out the placeholders in the command and change the u/p.

sqoop-import-all-tables --connect jdbc:mysql://[IP of your jobtracker]/[your db_name] --username root --password root --verbose --hive-import

Once it completes, Sqoop will have copied all your MySQL data into Hadoop’s HDFS file system and initialized Hive for you.

Crunching the data

Run “hive” on the jobtracker and you’ll be ready to start crunching your data.

Check out the Hive language manual for more info on exactly what queries you can write.

Once you’ve narrowed down how to write your queries, you can use Hive’s “INSERT OVERWRITE LOCAL DIRECTORY” command to output the results of your query into a local directory.

Then, the next step would be to TAR up these results and use scp to copy the results back to your local machine to analyze or warehouse.

Shutting it down

The final thing you’ll need to do is shut down the cluster. Whirr makes it pretty easy:

./bin/whirr destroy-cluster --config hadoop.properties

Give it a few minutes and Whirr will shutdown the cluster and clean up the EC2 security group as well.

Anyway, hope this walk through proves useful for someone. As always, feedback, questions, and comments are all more than welcome.

Earlier this week I was completely dumbfounded by a PHP script that launched a Java app that seemed to work fine when it was run from the command line but kept failing when it was run from a cron.

The Java app in question was “ec2-describe-group” out of the Amazon EC2 API Tools package.  Basically, the ec2-describe-group tool hits the EC2 API and returns information about your account’s currently configured security groups.

The issue I was having was that when the PHP script was launched from a cron ec2-describe-group would keep returning an empty string, but when the script was launched from the CLI ec2-describe-group behaved normally.

After some poking around, I found this StackOverflow post which points out that most the environment variables your shell has aren’t available in a cronjob.

With that in mind, I tried adding JAVA_HOME as well as EC2_HOME to my crontab. Doing this is pretty straight forward, just add these two lines above any of your scheduled jobs:

EC2_HOME=/opt/ec2-api-tools-1.3.36506
JAVA_HOME=/etc/java-config-2/current-system-vm

Unfortunately, this still didn’t resolve the issue. On a whim, I decided to check what type of file ec2-describe-group actually is and discovered that its a Bash script not a Java JAR. Looking at the Bash, the file is actually just executing “EC2_HOME/bin/ec2-cmd DescribeGroups” but it utilizes other environment variables that my cron didn’t have.

For simplicity’s sake, I decided to just switch the PHP script to run ec2-cmd directly and finally everything started working as expected.

For a recent project, I had to extract the text out of a PDF so that I could save it into a database table.

Normally, I would of used the popular pdftotext program but it wasn’t available in the particular environment I was working in. I contact support and they advised that the XPDF package has several X windows dependencies and that’s why they had not installed it. Fair enough.

I poked around a bit and found Apache’s PDFBox library. I downloaded the package and looked at the examples. Sure enough there was a program called “ExtractText” that did exactly what I wanted.

Using ExtractText is similar to pdftotext – just pass in the PDF file and the text comes back. Awesome.

Anyway, hats off to Ben Litchfield who wrote the ExtractText example. I rebuilt the ExtactText.java file as a standalone project and packaged it as a JAR.

I’ve attached the JAR and Eclipse project if anyone wants a copy of either.

The JAR

The Eclipse Project

The other night at a bar, we started talking about evolution which somehow sparked a discussion about the law of large numbers and the probability that humanity is just a cosmic fluke. Eventually, someone brought up the “monkeys on a typewriter” argument which caused uproar among the philosophers in the group.

This morning, I decided to see what Wikipedia had to say about monkeys and typewriters and eventually stumbled across an article about the “Weasel program” which Richard Dawkins wrote to demonstrate “random variation and non-random cumulative selection in natural and artificial evolutionary systems.” Basically, it simulates the monkeys on a typewriter to produce a line from Hamlet. At this point, I was hooked – I wanted to make one.

I’d experimented with genetic algorithms in a class I took at Tufts and I’ve been increasingly curious since the “evolving Mona Lisa” code got out on the web.

Anyway, I decided to use the Jenes library to whip up some code to “evolve” strings. The Jenes library is absolutely fantastic. It is easy to setup, easy to use, and the documentation is well written and easy to follow.

My implementation is online at: http://setfive.com/evolve.php

And it evolves Dawkin’s Hamlet line in about 3 seconds – link

The code to run the genetic algorithms is written in Java and uses a Jetty container to accept and processes HTTP requests. Using an embedded Jetty container proved to be seamless and the application server seems to running pretty smoothly.

A zip file containing an Eclipse project for the code is available here.

Additionally, a self contained JAR for the server is available here . Start it with java -jar wordga-jetty.jar

As always, questions and comments are welcome.

In the last few weeks the battle and buzz over the smart phone market seems to have seriously intensified.

First there was the usual iPhone buzz, news about the Android powered HTC Magic, the Windows Mobile marketplace, and of course the obligatory ridiculousness at Microsoft.

I’d been considering experimenting with a mobile platform for sometime and finally decided to take the plunge. I decided to give Android a whirl primarily because I don’t have easy access to OSX or Visual Studio and my Java is less rusty than my .NET.

Anyway, getting going with Android was deliciously simple – download the SDK+Emulator and Eclipse plugin and you’re off.

After the necessary “Hello World” application I tried to write something a bit more substantive. Personally, one of the coolest facets of mobile development is the ability for applications to be location aware (GPS). Mix this together with some openly available geo tagged data and the result is probably going to be interesting.

With this in mind, the plan became to mash together Android’s GPS coordinates with flickr’s geotagged photos.

Getting access to Android’s location service is fairly straightforward. You basically register to receive updates either when the device moves a certain distance or on some time interval:

The biggest “gotcha” with this is that you NEED to remember to modify the default Application security settings to allow you to access the device’s location. In Eclipse, edit AndroidManifest.xml and add “UsesPermission” for the following: android.permission.ACCESS_MOCK_LOCATION, android.permission.ACCESS_COARSE_LOCATION, android.permission.ACCESS_FINE_LOCATION

So on to part II – using the device’s location to pull down Flickr photos. I’d used the Flickr API before so I knew how to do it but I’d never used it from Java. I tried loading the JAR for the flickrj client library but the Android JVM was having some strange issues with it. I was under the impression you can link to external JARs from Eclipse but I may be wrong (anyone?).

Anyway, the Flickr requests were un-authenticated and pretty straightforward so I decided to use Java’s URL class. Accessing sockets was another “gotcha” – Android requires your application to have the “UsesPermission” android.permission.INTERNET to use sockets. The exception when the permission isn’t set is notably cryptic – “unknown error”.

I decided to download all the Flickr photos to the device so that the UX would be generally smoother. This introduced threading to the project so that the UI wouldn’t freeze up while the photos were downloading. Android threads work just like traditional Java threads and the process was generally painless:

With the photos pulled down the final task was displaying them. After poking around the Android documentation I discovered the Gallery widget. It basically allows you to display a set of items in a list and specify a “renderer” for the gallery. I’m not sure if there is a default way to make it “fisheye” (like on an iPhone) but I rolled a quick n dirty solution for that. I also couldn’t get it to look really sexy but that’s also probably possible.

So that’s about it. Here are some screen shots of the application running in the emulator:

And without further a due here is the code as an Eclipse project.
geoflickr

Anyway, before the bashing starts – I know I’m a terrible Java programmer and that this project isn’t really engineered beautifully. It was just supposed to be a way to get my (and anyone else’s) feet wet with Android. Any comments/thoughts/improvements are of course welcome!