FanFeedr Widgets Are Live!

Over the past few weeks we had the opportunity to work with FanFeedr to put together some widgets for their sports news platform. Previously, FanFeedr had been using Sprout to build their widgets, but this required someone to hand-build a Flash widget for every “resource” on FanFeedr (there are a lot). In addition, since the Sprout widgets are Flash, they aren’t easily crawled by search engines.

Our widgets are different. They allow FanFeedr to generate widgets on the fly for any of their pages and allow users to customize the color schemes. Check out a widget builder for the NY Yankees here.

Basically, our widget builder lets users customize the size and colors used in the widget. This data is serialized as a JSON object and then base64 encoded so that it can be sent to the “generator” on the server. The server then unpacks the payload and builds a widget according to the data specified in the JSON object. In addition, our embed code includes noscript tags so that search engines pick up the links in the widget as well.
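Here is a minimal Python sketch of that round trip. It is not our production code: the field names (`width`, `height`, `background`, `link`) and the function names are made up for illustration, and the URL-safe base64 variant is an assumption so the payload can ride along in a query string.

```python
import base64
import json


def pack_widget_config(config: dict) -> str:
    """Serialize the widget options as JSON, then base64 encode them."""
    raw = json.dumps(config, separators=(",", ":"), sort_keys=True).encode("utf-8")
    # URL-safe alphabet is an assumption; plain base64 works the same way.
    return base64.urlsafe_b64encode(raw).decode("ascii")


def unpack_widget_config(payload: str) -> dict:
    """What the server-side "generator" would do: decode, then parse the JSON."""
    raw = base64.urlsafe_b64decode(payload.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


# Hypothetical options a user might pick in the builder.
config = {"width": 300, "height": 250, "background": "#ffffff", "link": "#003087"}
payload = pack_widget_config(config)
restored = unpack_widget_config(payload)
```

The nice part of this scheme is that the server stays stateless: everything it needs to build the widget travels in the payload.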

Anyway, working with FanFeedr was a great experience and we hope to continue our relationship moving forward. Go build yourself a widget!

The Redline Challenge

For one reason or another we decided to sponsor a pub crawl this weekend. The plan was hatched over some beers at Underbones on Thursday night for a Saturday morning go time. We knew we basically needed three things: a list of bars, some swag (T-shirt?), and obviously a website. We decided that the route of the crawl should follow the MBTA Redline so that we could start downtown and then finish in Somerville. This made picking bars pretty simple, gave us some branding, and of course we registered REDLINECHALLENGE.COM.

We wanted the website to have some useful information, live location updates, and of course pictures of the debauchery. The biggest problem was that neither Daum nor I have location-aware phones. To get around this, we decided to update Twitter with our current location along with a “#loc” hashtag and then have the site update based on that. Since we were already using Twitter, we decided to use Twitpic to post pictures to Twitter on the fly. Additionally, we took advantage of Verizon Wireless’s email-to-SMS service and allowed people to contact us via the website. All told, we built the site in about 3 hours and it proved to be pretty useful. People used it to find us on the crawl and to contact us while we were out. Everyone also got a kick out of seeing a live photo stream.
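For the curious, the “#loc” trick boils down to something like the Python sketch below. This is an illustration rather than the code that ran on the site: the function names are hypothetical, and it assumes the tweets have already been fetched as plain strings, newest first.

```python
import re

# Match the "#loc" hashtag on its own, not as a prefix of e.g. "#location".
LOC_TAG = re.compile(r"#loc\b", re.IGNORECASE)


def extract_location(tweet_text):
    """Return the tweet text minus the #loc tag, or None if it is not tagged."""
    if not LOC_TAG.search(tweet_text):
        return None
    location = LOC_TAG.sub("", tweet_text)
    # Collapse whitespace left behind by removing the tag.
    return " ".join(location.split()) or None


def latest_location(tweets):
    """Given tweets ordered newest first, return the most recent tagged location."""
    for text in tweets:
        location = extract_location(text)
        if location:
            return location
    return None


# Made-up sample timeline, newest first.
timeline = ["Great beer here", "Kendall Square #loc", "Park Street #loc"]
current = latest_location(timeline)
```

Untagged tweets are simply skipped, so regular chatter on the same account doesn’t clobber the location shown on the site.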

What’s next? Clearly, The Greenline Challenge.

FOSS Saturday: sfFbConnectGuardPlugin – sfGuard meets FB Connect

I was slaving over a hot keyboard all Friday!

But at last it is done – FBConnect for sfGuard.

Get it here: http://www.symfony-project.org/plugins/sfFbConnectGuardPlugin

A detailed explanation of how to install it and use it is on the Symfony site.

Anyway, the plugin basically just introduces a new table to keep track of Facebook IDs <---> sfGuardUserIds

Here’s a fun nugget. One of the problems with using FB Connect is that you can’t mug a user’s email address from Facebook. Obviously this is a smart move on Facebook’s part but it makes life hard for my Nigerian spammer friends. If you want to snag a user’s email address (or anything else for that matter) while still using Facebook Connect here’s a sketch of how to do it.

Everything is the same except you can’t use Facebook’s FBML to render the FB Connect button. What you want to do instead is trigger the “connect” event by hand. Here is basically how we do it:

  1. The user requests to sign up.
  2. We pop up a Lightbox using Thickbox.
  3. We ask the user for their email address and verify that it is valid and unique via AJAX in the background.
  4. The validation routine sets an attribute on the user using setAttribute() that contains the entered email address.
  5. We close the Lightbox and initiate a Facebook Connect request with FB.Connect.requireSession.
  6. In our createFbUser() method we get the attribute back and save it with the new user.

Bam. Got the user’s email address and logged them in via FB Connect.

Words of Congress: Fun with Hadoop

For the last few weeks we’ve been working on a project that involved dealing with bills in the US House and Senate. Naturally, I decided it was time to make a word cloud from the frequencies of the words in the bills!

Check out the final product here.

I decided to use only the bills from the 111th Congress (the current one). All 6703 of them were downloaded from the THOMAS library at http://thomas.loc.gov/home/gpoxmlc111/ as XML documents containing the full text of each bill along with some metadata.

That’s not really too many files, but I decided to use Hadoop and try to Map/Reduce the bills to count up the word frequencies. Getting Hadoop to run locally was pretty straightforward – I just told it where JAVA_HOME was and I was off to the races. Fortunately, one of the pre-canned examples was a word frequency counter, so I decided to modify that for what I wanted.

The example map/reduce was written to process plain text files so I had to modify it to work with the XML documents. What this involved was writing a custom InputFormat class to open each bill, extract the appropriate plain text from the XML, and then pass this back as the “data”. I also modified the word counter to ignore words shorter than 6 characters.
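The real job was Java running on Hadoop with a custom InputFormat, but the logic is easy to see in a plain Python sketch of the same map and reduce steps. The XML element names in the sample are invented for illustration (the actual THOMAS schema is different), and the function names are my own.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

MIN_WORD_LENGTH = 6  # ignore words shorter than 6 characters
WORD = re.compile(r"[a-z]+")


def bill_text(xml_string):
    """Stand-in for the custom InputFormat: pull the plain text out of the XML."""
    root = ET.fromstring(xml_string)
    return " ".join(root.itertext())


def map_words(text):
    """Map step: emit (word, 1) for every word that is long enough."""
    for word in WORD.findall(text.lower()):
        if len(word) >= MIN_WORD_LENGTH:
            yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


# Hypothetical, heavily simplified bill document.
sample = (
    "<bill><title>Energy Security Act</title>"
    "<text>Energy policy for national security.</text></bill>"
)
counts = reduce_counts(map_words(bill_text(sample)))
```

Hadoop’s contribution is running the map step on many bills in parallel and shuffling the pairs to the reducers; the per-record work is no more complicated than this.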

I tested locally with a small subset of bills and everything seemed to be working fine. The trouble started when I tried to bring up Daum’s machine as a slave to my machine. After some finagling and hair pulling I finally got it working. The takeaways were:

  • You can’t run your DataNode on localhost; it needs to be your computer’s hostname to accept connections.
  • Hostnames are important. If you don’t have a DNS server, make sure your hostnames are aliased in /etc/hosts.
  • If your HDFS setup is showing 100% utilization but you know it isn’t true, try rm’ing the data file and then re-formatting your namenode.
  • If a copy or reduce step fails in distributed mode, the error messages are usually really cryptic – check the actual logs.
  • When something throws an exception during a map or reduce operation, the error won’t be reported to STDOUT.

Anyway, it was a slightly frustrating but rewarding experience – I even got to code some Java! The visualization of the word frequencies is here.

Might be about time to process one of the Amazon datasets with EC2.