Hive: How to write a custom SerDe class

We’ve been using Hive a bit lately to help clients tackle some of their data needs, and without a doubt one of its most powerful features is its SerDe functionality. Taking a step back, Hive is an open source Apache project that lets you run “SQL-like” queries using Hadoop on data that you have in HDFS. There are a lot of moving pieces, but what it fundamentally comes down to is that Hive will let you run what look like SQL queries across the text files that you have in HDFS. A typical use case would be using Hive to run ad-hoc queries across web server (like nginx) logs. Want a breakdown of response times by frontend web server? Hive would let you do that.

SerDe?

SerDe is actually short for Serializer/Deserializer, and it’s the mechanism that Hive uses to make sense of your text files in HDFS. Let’s take a typical nginx log line:

Now the magic is in how Hive uses a SerDe to translate a line like that into something that’s queryable. This is contrived, but let’s assume that for some reason we’re interested in querying on the client IP address and the request size of each log line. So we’d be interested in creating a table that looks like:

Turns out, Hive makes this particularly easy. You’d end up using the RegexSerDe to match a regular expression and then extract the two fields you’re interested in.
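To make the regex step concrete, here is a minimal Java sketch of the kind of matching RegexSerDe performs: one capture group per column. The log line, the pattern, and the class name NginxLineParser are all assumptions made up for illustration (the line follows nginx’s default “combined” layout, and the byte count after the status code stands in for the size column). It isn’t the expression from any real table definition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NginxLineParser {
    // Hypothetical pattern: group 1 = client IP, group 2 = byte count after the status code.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" \\d{3} (\\d+).*$");

    // Returns {ip, size}, or null when the line does not match the pattern.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Made-up sample line in nginx's default "combined" log format.
        String line = "203.0.113.7 - - [10/Oct/2013:13:55:36 -0400] "
            + "\"GET /index.html HTTP/1.1\" 200 2326 \"-\" \"Mozilla/5.0\"";
        String[] cols = parse(line);
        System.out.println(cols[0] + "\t" + cols[1]);
    }
}
```

With RegexSerDe you’d hand Hive just the pattern and let it do this matching for every row, so a class like this is mostly useful for testing your expression before you put it in a table definition.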

A custom one

The next step beyond plain extraction is to transform the data as it’s being extracted, and this is where a custom SerDe comes in. For example, let’s say that you wanted to geocode the client’s IP address and also convert your dates into Unix timestamps. So your table would be something like:

Your custom SerDe would let you do exactly this. You’d be able to use something like the MaxMind database to geocode your IP addresses and then use some extra Java to convert your timestamps.
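As a rough idea of what that “extra Java” might look like, here’s a small sketch that converts an nginx-style timestamp into a Unix timestamp. The class name LogTimestamps and the sample date are assumptions for illustration, and the sketch uses java.time, which needs Java 8 or newer; on an older JVM you’d have to fall back to something like SimpleDateFormat. Geocoding is left out since it depends on whichever lookup library you choose.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class LogTimestamps {
    // Layout of nginx's default $time_local field, e.g. 10/Oct/2013:13:55:36 -0400
    private static final DateTimeFormatter NGINX_TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Parses the timestamp (including its UTC offset) and returns seconds since the epoch.
    static long toUnixSeconds(String timeLocal) {
        return OffsetDateTime.parse(timeLocal, NGINX_TIME).toEpochSecond();
    }

    public static void main(String[] args) {
        System.out.println(toUnixSeconds("10/Oct/2013:13:55:36 -0400"));
    }
}
```

Inside a custom SerDe you’d call a helper like this from the deserialize step, right after the regex has pulled the raw timestamp string out of the line.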

Unfortunately, there doesn’t seem to be much documentation on how to actually write a custom class, so here are a couple of tidbits I’ve picked up:

  • It looks like at some point the SerDe class was refactored so depending on what Hive version you’re using you’ll need to extend a different class. On Hive 0.11 the class you’ll want to extend is “org.apache.hadoop.hive.serde2.SerDe”
  • You’ll need to include a couple of JARs in order to get the class to build. I had to include commons-logging-1.0.4.jar, hadoop-0.20.1-core.jar, hive-contrib-0.10.0-cdh4.4.0.jar, hive-exec-0.10.0-cdh4.4.0.jar, junit-4.5.jar
  • As noted above, you need to pull the specific versions of the JARs that you’re going to end up running this SerDe against
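  • Make sure you target the right Java version. If your servers are running Java 1.6 and you compile for 1.7, you’ll end up with really cryptic error messages (typically an UnsupportedClassVersionError complaining about an “unsupported major.minor version”).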
  • If you create a table using your SerDe, you’ll need to have that JAR available to drop that table
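If you’re not sure which Java version a JAR was compiled for, one way to check is to read the version number out of a class file’s header. The sketch below is a hypothetical helper (the name ClassVersion is made up) that you’d point at a .class file extracted from your JAR. It relies on the standard class file layout: a 4-byte magic number, then a 2-byte minor and a 2-byte major version, where major 50 means Java 6 and 51 means Java 7.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

class ClassVersion {
    // Returns the major version stored in a class file's 8-byte header.
    static int majorVersion(byte[] classBytes) {
        ByteBuffer header = ByteBuffer.wrap(classBytes); // big-endian by default
        if (classBytes.length < 8 || header.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a Java class file");
        }
        header.getShort();                 // minor version, ignored here
        return header.getShort() & 0xFFFF; // major version: 50 = Java 6, 51 = Java 7
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassVersion <path-to-.class-file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("major version: " + majorVersion(bytes));
    }
}
```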

The best way I’ve found to bootstrap this is to create an Eclipse project, include the necessary JARs, and then get the RegexSerDe to build inside the project. Once that works, test the JAR by creating a table using it and then you’ll be able to modify the class from there.

Even with my awful Java, the RegexSerDe class was easy enough to grok and then modify as needed.

Stuck? Need Help?

Drop me a comment or shoot me an email and I’ll do my best to help you out.

Doctrine2: Using ResultSetMapping and MySQL temporary tables

Note: I haven’t actually tried this in production, it’s probably a terrible idea.

We’ve been using MySQL temporary tables to run some analytics lately and it got me wondering how difficult it would be to hydrate Doctrine2 objects from these tables. We’ve primarily been using MySQL temporary tables to allow us to break apart complicated SQL queries, cache intermediate steps, and generally make debugging analytics a bit easier. Anyway, given that use case this is a bit of a contrived example, but it’s still an interesting look inside Doctrine.

For argument’s sake, let’s say we’re using the FOSUserBundle and we have a table called “be_user” that looks something like:

Now, for some reason we’re going to end up creating a separate MySQL table (temporary or otherwise) with a subset of this data but identical columns:

So now how do we load data from this secondary table into Doctrine2 entities? Turns out it’s relatively straightforward. By using Doctrine’s createNativeQuery along with ResultSetMapping you’ll be able to pull data out of the alternative table and return regular User entities. One key point is that by using DisconnectedClassMetadataFactory it’s actually possible to introspect your Doctrine entities at runtime so that you can add the ResultSetMapping fields dynamically.

Anyway, my code inside a Command to test this out ended up looking like:

Musings: Could you leverage Twitter to make some money this holiday season?

A few days ago, I was browsing my Feedly dashboard and ran across this AdWeek post describing how big retailers are gearing up to poach their competitors’ customers this holiday season. The article goes into some specifics, but the idea is basically that brands are planning to monitor Twitter for relevant conversations and then “at” message potential customers with special offers, product details, or even local store inventory information.

So imagine @MikeBruins65 from Boston tweeting “Wtf! @BestBuy offering 25% off all 4K TVs in-store…except nothing in stock.” and then @target replying “Cheer up @MikeBruins65! We have 4K TVs in-stock in Everett, MA! Grab coupons at http://bit.ly/target-4k-ma”. Since these brands are certainly leveraging powerful tools like Radian6 or even the full Twitter Firehose, it seems like it would be straightforward for them to execute strategies like this around high value markets. But what about as an individual, could you employ a similar strategy to make a few bucks?

Amazon Associates Links

The most obvious, least risky, and least lucrative approach would be to monitor Twitter for tweets that sound like they’re from frustrated buyers and then message them Amazon Associates links for the product they’re looking for. Looking at Amazon’s fee structure, you’d want to target high margin categories with moderately expensive products and then hopefully end up doing a decent amount of volume. So imagine searching for tweets from users frustrated that they can’t check out on a small eCommerce site, finding the product they’re searching for on Amazon, and then tweeting them the link to buy with your Associates link.

Dropshipping

More risky and potentially more upside. I’m not entirely sure how feasible this would be, but I think the idea would be to use a SaaS eCommerce platform like Shopify to set up an eCommerce shop and then dynamically list items which you’ll later dropship. The challenge would be twofold: using Twitter to identify which previously obscure items are starting to trend, and then figuring out how to introduce enough margin so that you end up profiting on the sale. It might be feasible though. With the explosion of small, boutique eCommerce sites, it might be possible to negotiate an “I’ll buy 400 for 50% off!” type deal quickly enough to introduce a profitable sale. The bigger challenge would probably be identifying these items as they start trending, but that could be solved by….

Pinterest

Pinterest is a recent member of the billion dollar boys club and a frequent target of “haters”, but its current traction and latent purchase intent potentially make it the perfect place for affiliate marketing. Beyond that, the wealth of potential gift pins and the follower/repin graph might hold the key to identifying relatively obscure products right before they begin to go viral. Anyway, I don’t have any concrete ideas on how you could leverage Pinterest, but it definitely seems like the ingredients for success are there.

Totally coincidentally, this article just came across TechCrunch – A Pin On Pinterest Is Worth 25% More In Sales Than Last Year, Can Drive Visits & Orders For Months

Anyway, are any of these actually feasible? Who knows, but I’d love to hear any other ideas.

First LinkedIn Intro, then BonziBuddy 2.0

Last week, LinkedIn published an in-depth technical explanation of how their new LinkedIn Intro mobile product works on iOS. What Intro does is basically display LinkedIn data about your contacts directly in your email client, similar to what Rapportive did for Gmail. It’s a cool app, but the implementation details LinkedIn shared ignited an Internet firestorm, especially among the startup/hacker crowd.

Intro works by modifying the user’s normal iOS email client so that it connects through a LinkedIn proxy server instead of interacting with their webmail provider directly. This allows LinkedIn to dynamically modify a user’s email before it reaches their mail client, depending on whether the user is connected to the sender on LinkedIn. From an IT security standpoint, introducing a third party that sits between a user and the mail server they’re connecting to undoubtedly introduces a new attack vector, but what really caught my interest was how LinkedIn was achieving this. In order to smoothly update the user’s proxy settings, LinkedIn is using an iOS feature known as Configuration Profiles.

I’m not familiar with the iOS SDK or APIs so this was the first time I’d heard about Configuration Profiles. In short, what they allow an app to do is install a set of settings on an iOS device – from email and web proxy settings to additional credentials and SSL keys. Configuration profiles are typically used in enterprise environments to allow a company’s IT department to quickly configure the settings on an employee’s iOS device. When provisioning a new device, IT would basically use the configuration profile to install things like a VPN, internal credentials, etc. So what’s the problem?

Well according to the LinkedIn post and comments from users that have used profiles before, the user experience of installing a profile which radically alters your iOS system settings is surprisingly unassuming. As a user, you click through a couple of prompts and boom, all of a sudden Safari is using a proxy server to fetch websites. So what nefarious things could you do by routing iOS mobile traffic through a proxy server? Unsolicited injected display advertising.

On the desktop web, unscrupulous extension developers have been monetizing their install base by injecting display ads into the browsing experience of their users for years. From companies like Bonzi Buddy to newer companies like PageRage, the model is tried, true, and profitable. However, on mobile there isn’t an obvious opportunity to inject ads and get access to the rapidly growing number of mobile web impressions. It seems like using configuration profiles would be the perfect vector to change this. Crapware iOS developers could quietly prompt their users to install a configuration profile to get access to “hot new features” and then surreptitiously start injecting display ads into websites on the proxy server.

I’m not familiar enough with iOS development to speak to how easy developing an app like this would be or if it would get past the app store approval process, but if it’s feasible someone is certainly going to do it. If anyone is familiar with an app already doing this, I’d love to know about it.

Symfony2: Using FOSUserBundle with multiple EntityManagers

Last week, we were looking to set up one of our Symfony2 projects to use a master/slave MySQL configuration. We’d looked into using the MasterSlaveConnection Doctrine2 connection class, but unfortunately it doesn’t really work the way you’d expect. Anyway, the “next best” way to set up master/slave connections seemed to be creating two separate EntityManagers, one pointing at the master and one at the slave. Setting up the Doctrine configurations for this is pretty straightforward; you’ll end up with YAML that looks like:

At face value, it looked like everything was working fine, but it turns out it wasn’t: the FOSUserBundle entities weren’t getting properly set up on the slave connection. Turns out, because FOSUserBundle uses Doctrine2 superclasses to set up its fields, there’s no way to natively use FOSUserBundle with multiple entity managers. The key issue is that since the UserProvider checks the class of a user being refreshed, you can’t just copy the FOSUserBundle fields directly into your entity:

So how do you get around this? Turns out, you need to add a custom UserProvider to bypass the instance class check. My UserProvider ended up looking like:

And then the additional YAML configurations you need are:

The last step is copying all the FOSUserBundle fields directly into your User entity and updating it so it no longer extends the FOSUserBundle base class. Anyway, that’s it: two EntityManagers and one FOSUserBundle.