Fun: The AVC Word Cloud

Happy 2014! In between celebrating Christmas, hanging out with family, and ringing in the New Year, I managed to put together a visualization of the words used on avc.com. AVC, written by Fred Wilson, is probably one of the most popular “startup” blogs on the Internet. It covers a wide array of topics, from “MBA Mondays” and USV portfolio companies to general startup and technology news. Given that range of topics, and the fact that the blog has been active since 2003, it seemed like generating a word cloud would produce interesting results. With that goal in mind, I set off the day after Christmas.

Check out the finished product at http://symf.setfive.com/d3_avc_blog_cloud/. I actually decided to use Scala to scrape and process the data – look for a follow-up post on coming to Scala from PHP.

Taking a quick glance at the clouds, a few things do jump out:

  • “Android” enters the top 100 in 2010 and has remained there since.
  • Amazon is surprisingly absent past 2007.
  • Apple hasn’t made the top 100 in any year.
  • It’s interesting to see when USV portfolio companies like Disqus and Zemanta enter and exit.
  • Bitcoin makes the list for 2013.
  • Blackberry: one and done.
  • Facebook peaked in 2007 and then steadily declined until dropping off the list this year.
  • Google hits the list every year.
  • Twitter shows up in 2007 and sticks around through this year.

Words of Congress: Fun with Hadoop

For the last few weeks we’ve been working on a project that involved dealing with bills in the US House and Senate. Naturally, I decided it was time to make a word cloud from the frequencies of the words in the bills!

Check out the final product here.

I decided to use only the bills from the 111th Congress (the current one); all of the bills (6,703 of them) were downloaded from the THOMAS library at http://thomas.loc.gov/home/gpoxmlc111/. The files are XML documents that contain the full text of the bills along with some metadata.

Not really too many files, but I decided to use Hadoop and try to Map/Reduce the bills to count up the word frequencies. Getting Hadoop to run locally was pretty straightforward – just tell it where JAVA_HOME is and I was off to the races. Fortunately enough, one of the pre-canned examples was a word frequency counter, so I decided to modify that for what I wanted.
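
For reference, getting a standalone install going really is just a one-line edit to conf/hadoop-env.sh (the JDK path below is only an example – point it at whatever JDK your machine actually has):

    # conf/hadoop-env.sh – the only setting standalone mode really needs
    export JAVA_HOME=/usr/lib/jvm/java-6-sun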

The example map/reduce was written to process plain text files, so I had to modify it to work with the XML documents. This involved writing a custom InputFormat class to open each bill, extract the plain text from the XML, and pass that back as the record “data”. I also modified the word counter to ignore words shorter than 6 characters.
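
The original code isn’t in the post, but a rough sketch of both changes might look like the snippet below: a non-splittable FileInputFormat whose RecordReader reads a whole bill and strips the XML tags with a naive regex (a real XML parser would do a better job), plus the stock word count mapper with a minimum-length check added. Class names like BillXmlInputFormat and LongWordMapper are made up for illustration.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.*;

    // Hypothetical InputFormat: one bill file becomes one (filename, plain text) record.
    public class BillXmlInputFormat extends FileInputFormat<Text, Text> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false; // each bill is read whole, as a single record
      }

      @Override
      public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new BillXmlRecordReader();
      }

      public static class BillXmlRecordReader extends RecordReader<Text, Text> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final Text key = new Text();
        private final Text value = new Text();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
          fileSplit = (FileSplit) split;
          conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (processed) {
            return false;
          }
          Path path = fileSplit.getPath();
          FileSystem fs = path.getFileSystem(conf);
          byte[] contents = new byte[(int) fileSplit.getLength()];
          FSDataInputStream in = fs.open(path);
          try {
            IOUtils.readFully(in, contents, 0, contents.length);
          } finally {
            IOUtils.closeStream(in);
          }
          // Crude tag stripping – fine for counting words, not for real XML work.
          String plainText = new String(contents, "UTF-8").replaceAll("<[^>]*>", " ");
          key.set(path.getName());
          value.set(plainText);
          processed = true;
          return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
      }
    }

    // The stock word count mapper, tweaked to skip words shorter than 6 characters
    // (would normally live in its own file).
    class LongWordMapper extends Mapper<Text, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().toLowerCase().split("\\W+")) {
          if (token.length() >= 6) {
            word.set(token);
            context.write(word, ONE);
          }
        }
      }
    }

Making the files non-splittable keeps each bill in a single record, which is fine here since an individual bill is tiny compared to an HDFS block.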

I tested locally with a small subset of bills and everything seemed to be working fine. The trouble started when I tried to bring up Daum’s machine as a slave to my machine. After some finagling and hair pulling I finally got it working. The takeaways were:

  • You can’t run your DataNode on localhost; it needs to bind to your computer’s hostname to accept connections.
  • Hostnames are important. If you don’t have a DNS server, make sure your hostnames are aliased in /etc/hosts (see the snippet after this list).
  • If your HDFS setup is showing 100% utilization but you know that isn’t true, try rm’ing the data directory and then re-formatting your namenode.
  • If a copy or reduce step fails in distributed mode, the error messages are usually really cryptic – check the actual logs.
  • When something throws an exception during a map or reduce operation, the error won’t be reported to STDOUT.
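
On the hostname aliasing point, the /etc/hosts entries on both boxes looked something like the lines below (the IPs and hostnames here are made up for illustration) – the important part is that every node can resolve every other node by the same name:

    192.168.1.10    hadoop-master
    192.168.1.11    hadoop-slave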

Anyway, it was a slightly frustrating but rewarding experience – I even got to code some Java! The visualization of the word frequencies is here.

Might be about time to process one of the Amazon datasets with EC2.