Hive: How to write a custom SerDe class - {5} Setfive

We’ve been using Hive a bit lately to help clients tackle some of their data needs and without a doubt one of the most powerful features is Hive’s SerDe functionality. Taking a step back, Hive is an open source Apache project that lets you run “SQL Like” queries using Hadoop on data that you have in HDFS. It’s a lot of moving pieces but what it fundamentally comes down to is that Hive will let you run what look like SQL queries across the text files that you have in HDFS. A typical use case would be using Hive to run ad-hoc queries across web server (like nginx) logs. Want to a breakdown of response times by frontend web server? Hive would let you do that.

SerDe?

SerDe is actually short for Serialize/Deserialize and its the mechanism that Hive uses to make sense of your text files in HDFS. Lets take a typical nginx log line:

- - 66.249.66.164 - - [09/Dec/2013:22:11:49 -0500] "GET /feed/ HTTP/1.1" 200 5281 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

Now the magic comes in how Hive uses a SerDe to translate a line like that into something that’s queryable. This is contrived but lets assume that for some reason we’re interested in querying on the client IP address and the request size of each log line. So we’d be interested in creating a table that looks like:

ip, request_size
123.125.71.51, 16060
54.224.19.57, 30075
180.76.5.18, 6971

Turns out, Hive makes this particularly easy. You’d end up using the RegexSerDe to match a regular expression and then extract the two fields you’re interested in.

A custom one

The next step after extraction is to do some transformation during the extraction stage and this is where the custom SerDe comes in. For example, lets say that you wanted to geocode the client’s IP address and also convert your dates into Unix timestamps. So your table would be something like:

ip, city, state, country, request_size, timestamp
123.125.71.51, Cambridge, MA, US, 16060, 1386645919
54.224.19.57, Cambridge, MA, US, 30075, 1386645919
180.76.5.18, Cambridge, MA, US, 6971, 1386645919

Your custom SerDe would let you do exactly this. You’d be able to use something like the MaxMind database to geocode your IP addresses and then use some extra Java to convert your timestamps.

Unfortunately, there doesn’t seem to be too much documentation on how to actually write a custom class so here’s a couple of tidbits I’ve picked up:

It looks like at some point the SerDe class was refactored so depending on what Hive version you’re using you’ll need to extend a different class. On Hive 0.11 the class you’ll want to extend is “org.apache.hadoop.hive.serde2.SerDe”
You’ll need to include a couple of JARs in order to get the class to build. I had to include commons-logging-1.0.4.jar, hadoop-0.20.1-core.jar, hive-contrib-0.10.0-cdh4.4.0.jar, hive-exec-0.10.0-cdh4.4.0.jar, junit-4.5.jar
As noted above, you need to pull the specific versions of the JARs that you’re going to end up running this SerDe against
Make sure you target the right Java JRE version. If your servers are running Java 1.6 and you target 1.7 you end up getting really cryptic error messages.
If you create a table using your SerDe, you’ll need to have that JAR available to drop that table

The best way I’ve found to bootstrap this is to create an Eclipse project, include the necessary JARs, and then get the RegExSerDe to build inside the project. Once that works, test the JAR by creating a table using it and then you’ll be able to modify the class from there.

Even with my awful Java, the RegexSerDe class was easy enough to grok and then modify as needed.

Stuck? Need Help?

Drop me a comment or shoot me an email and I’ll do my best to help you out.

Questions or comments about this? Email us at contact@setfive.com.

SerDe?

A custom one

Stuck? Need Help?

Read more

The $4,000 Polling Loop

MySQL: Composite indexes in 5 minutes

An Edge Case of Time in AWS PHP SDK