Hive: How to write a custom SerDe class

We’ve been using Hive a bit lately to help clients tackle some of their data needs, and without a doubt one of its most powerful features is its SerDe functionality. Taking a step back: Hive is an open source Apache project that lets you run SQL-like queries using Hadoop over data you have in HDFS. There are a lot of moving pieces, but what it fundamentally comes down to is that Hive lets you run what look like SQL queries across the text files you have in HDFS. A typical use case would be using Hive to run ad-hoc queries across web server (like nginx) logs. Want a breakdown of response times by frontend web server? Hive will let you do that.

SerDe?

SerDe is actually short for Serialize/Deserialize, and it’s the mechanism that Hive uses to make sense of your text files in HDFS. Let’s take a typical nginx log line:
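Something like this made-up line (trimmed to the common-log fields; the IP, path, and sizes are all invented):

```
203.0.113.7 - - [30/May/2014:00:57:58 +0000] "GET /index.html HTTP/1.1" 200 4523
```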

Now the magic comes in how Hive uses a SerDe to translate a line like that into something that’s queryable. This is contrived, but let’s assume that for some reason we’re interested in querying on the client IP address and the request size of each log line. So we’d be interested in creating a table that looks like:
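A minimal sketch, say (the table and column names are illustrative):

```sql
CREATE TABLE access_log (
  client_ip    STRING,
  request_size INT
);
```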

Turns out, Hive makes this particularly easy. You’d end up using the RegexSerDe with a regular expression whose capture groups extract the two fields you’re interested in.
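A sketch of the DDL, assuming the contrib RegexSerDe (the table name and regex are illustrative; note that this SerDe expects every column to be a STRING, and the pattern has to be escaped as if it were a Java string literal):

```sql
-- Illustrative: group 1 captures the client IP, group 2 the trailing response size.
CREATE TABLE access_log (
  client_ip    STRING,
  request_size STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) .* ([0-9]+)$'
);
```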

A custom one

The next step beyond extraction is to do some transformation during the extraction stage, and this is where the custom SerDe comes in. For example, let’s say that you wanted to geocode the client’s IP address and also convert your dates into Unix timestamps. So your table would be something like:
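For example (hypothetical column names; the types are the point here: geocoded strings plus a numeric timestamp):

```sql
CREATE TABLE access_log_geo (
  client_country STRING,   -- geocoded from the client IP
  client_city    STRING,
  request_ts     BIGINT,   -- Unix timestamp converted from the log's date
  request_size   INT
);
```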

Your custom SerDe would let you do exactly this: you could use something like the MaxMind database to geocode the IP addresses and a bit of extra Java to convert the timestamps.

Unfortunately, there doesn’t seem to be much documentation on how to actually write a custom class, so here are a few tidbits I’ve picked up:

  • It looks like at some point the SerDe class was refactored, so depending on which Hive version you’re using you’ll need to extend a different class. On Hive 0.11, the class you’ll want is “org.apache.hadoop.hive.serde2.SerDe”
  • You’ll need to include a couple of JARs in order to get the class to build. I had to include commons-logging-1.0.4.jar, hadoop-0.20.1-core.jar, hive-contrib-0.10.0-cdh4.4.0.jar, hive-exec-0.10.0-cdh4.4.0.jar, junit-4.5.jar
  • As noted above, you need to pull the specific versions of the JARs that you’re going to end up running this SerDe against
  • Make sure you target the right Java JRE version. If your servers are running Java 1.6 and you target 1.7, you’ll end up getting really cryptic error messages.
  • If you create a table using your SerDe, you’ll need to have that JAR available to drop that table
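Putting those tidbits together, here’s a rough skeleton against Hive 0.11, where SerDe is an interface you implement (the class name, columns, and the parsing/geocoding helpers below are all placeholders, not a working implementation):

```java
package com.example.hive; // hypothetical package

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class GeoLogSerDe implements SerDe {

  private ObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // One row = (client_country STRING, request_ts BIGINT)
    List<String> columns = Arrays.asList("client_country", "request_ts");
    List<ObjectInspector> inspectors = Arrays.<ObjectInspector>asList(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector,
        PrimitiveObjectInspectorFactory.javaLongObjectInspector);
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(columns, inspectors);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    String line = blob.toString();
    // Placeholder parsing: first token is the IP; a real version would
    // geocode it (e.g. via MaxMind) and parse the date into a Unix timestamp.
    String ip = line.split(" ")[0];
    String country = lookupCountry(ip);
    long ts = 0L; // placeholder for the converted timestamp
    return Arrays.<Object>asList(country, ts);
  }

  private String lookupCountry(String ip) {
    return "unknown"; // placeholder: call the MaxMind database here
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException {
    return inspector;
  }

  @Override
  public SerDeStats getSerDeStats() {
    return null; // acceptable for a simple read-only SerDe
  }

  @Override
  public Class<? extends Writable> getSerializedClass() {
    return Text.class;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    throw new SerDeException("deserialize-only SerDe");
  }
}
```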

The best way I’ve found to bootstrap this is to create an Eclipse project, include the necessary JARs, and then get the RegexSerDe to build inside the project. Once that works, test the JAR by creating a table using it, and then you’ll be able to modify the class from there.

Even with my awful Java, the RegexSerDe class was easy enough to grok and then modify as needed.

Stuck? Need Help?

Drop me a comment or shoot me an email and I’ll do my best to help you out.

Posted In: Big Data


  • mungeol

    Sample of MySQL Slow query log

    # Time: 140530 0:57:58
    # User@Host: root[root] @ localhost []
    # Query_time: 2.887626 Lock_time: 0.000109 Rows_sent: 871 Rows_examined: 871
    SET timestamp=1401379078;
    show tables;
    # Time: 140530 0:58:32
    # User@Host: root[root] @ localhost []
    # Query_time: 6.112399 Lock_time: 0.000089 Rows_sent: 1 Rows_examined: 2749891
    SET timestamp=1401379112;
    select count(*) from test;

    My question is: is there any way to load this kind of log into Hive?

    Do I have to write a custom ‘RegexSerDe’ class?

    And can Pig handle this problem?

    I tried using the ‘RegexSerDe’ class as shown below, but it doesn’t work, even though the regular expressions I used were tested at ‘http://www.regexplanet.com/advanced/java/index.html’ and worked. So I checked the source code and figured out that it only matches one row at a time.

    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      'input.regex' = 'regular expression as a java string'
    )

  • You should be able to do this with the normal RegEx SerDe that’s included with Hive.

    One “gotcha” is that you need to escape your RegEx as if it were being used directly in a piece of Java source. See http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-load-data.html for an example.

    Hive/Pig both ingest data by mapping a single line to a single row. You’d have to combine the lines/rows in a separate transformation step once it’s already in HDFS.

  • mungeol

    Thank you for your quick response.

    “You should be able to do this with the normal RegEx SerDe that’s included with Hive.

    One “gotcha” is that you need to escape your RegEx as if it were being used directly in a piece of Java source. See http://docs.aws.amazon.com/get… for an example.”

    Actually, I succeeded in parsing an Apache access log with Hive’s RegexSerDe, as described at ‘http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-load-data.html’.

    That worked because the Apache access log produces one log line at a time, so I was able to match each line with a regular expression.

    “Hive/Pig both ingest data by mapping a single line to a single row. You’d have to combine the lines/rows in a separate transformation step once it’s already in HDFS.”

    Does this mean I have to combine several lines into one line?

    ==========
    # Time: 140530 0:57:58
    # User@Host: root[root] @ localhost []
    # Query_time: 2.887626 Lock_time: 0.000109 Rows_sent: 871 Rows_examined: 871
    SET timestamp=1401379078;
    show tables;
    # Time: 140530 0:58:32
    # User@Host: root[root] @ localhost []
    # Query_time: 6.112399 Lock_time: 0.000089 Rows_sent: 1 Rows_examined: 2749891
    SET timestamp=1401379112;
    select count(*) from test;
    ==========

    E.g. I need to transform the log shown above into the log shown below before using RegexSerDe to parse it, so that I can use a one-line match like I used for the Apache access log.
    Do I understand your point correctly?

    ==========
    # Time: 140530 0:57:58 # User@Host: root[root] @ localhost [] # Query_time: 2.887626 Lock_time: 0.000109 Rows_sent: 871 Rows_examined: 871 SET timestamp=1401379078; show tables;
    # Time: 140530 0:58:32 # User@Host: root[root] @ localhost [] # Query_time: 6.112399 Lock_time: 0.000089 Rows_sent: 1 Rows_examined: 2749891 SET timestamp=1401379112; select count(*) from test;
    ==========

  • mungeol

    Maybe, this question makes more sense to you. “http://stackoverflow.com/questions/17935200/hive-regexserde-multiline-log-matching/24926906#24926906”

  • Combining the lines beforehand so that you can use the “regular” RegEx class is definitely an option. Although at that point, you might as well transform the lines into a CSV/TSV format so you can use them directly in HDFS.

    If you have too much data for that to be feasible, and you’re sticking to just the Hive/Pig universe, I think your only option is going to be to load the lines into a table using the RegEx SerDe and then run something to combine the lines and create a new table.

    You have a few options here:

    = Assuming that the lines you want to combine all have say a query ID, you could use a Hive JOIN to just combine the matching lines and then insert them into a table

    = Write a “normal” Hadoop MapReduce job to combine the rows and then output the results into HDFS
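As a rough sketch of that separate transformation step (plain Python, outside Hadoop; it assumes every entry starts with a “# Time:” header line, which isn’t always true of slow logs, since MySQL only emits the header when the second changes):

```python
def combine_slow_log(lines):
    """Collapse each multi-line MySQL slow-log entry (delimited by a
    '# Time:' header line) into a single space-joined line."""
    entries, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # A new '# Time:' header starts the next entry; flush the current one.
        if line.startswith("# Time:") and current:
            entries.append(" ".join(current))
            current = []
        current.append(line)
    if current:
        entries.append(" ".join(current))
    return entries

# The slow-log sample from the comment above:
log = """\
# Time: 140530 0:57:58
# User@Host: root[root] @ localhost []
# Query_time: 2.887626 Lock_time: 0.000109 Rows_sent: 871 Rows_examined: 871
SET timestamp=1401379078;
show tables;
# Time: 140530 0:58:32
# User@Host: root[root] @ localhost []
# Query_time: 6.112399 Lock_time: 0.000089 Rows_sent: 1 Rows_examined: 2749891
SET timestamp=1401379112;
select count(*) from test;
"""

combined = combine_slow_log(log.splitlines())
# combined now holds one line per entry, each starting with "# Time:"
```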

    What exactly are you trying to do? How big is the slow query log?

  • mungeol

    My original goal is the same as in this question: ‘http://stackoverflow.com/questions/17935200/hive-regexserde-multiline-log-matching/24926906#24926906’. Thank you for your suggestions. I will try them if I run out of other options.