For the last few weeks we’ve been working on a project that involved dealing with bills in the US House and Senate. Naturally, I decided it was time to make a word cloud from the frequencies of the words in the bills!
Checkout the final product here.
I decided to use only the bills from the 111th congress (the current one), all the bills (6703 of them) were downloaded from the THOMAS library at http://thomas.loc.gov/home/gpoxmlc111/ The files are XML documents that have the full text of the bills along with some meta data.
Not really to many files but I decided to use Hadoop and try and Map/Reduce the bills to count up the word frequencies. Getting Hadoop to run locally was pretty straightforward – just tell it where JAVA_HOME is and I was off to the races. Fortunately enough, one of the pre-canned examples was a word frequency counter so I decided to modify that for what I wanted.
The example map/reduce was written to process plain text files so I had to modify it to work with the XML documents. What this involved was writing a custom InputFormat class to open each bill, extract the appropriate plain text from the XML, and then pass this back as the “data”. I also modified the word counter to ignore words shorter than 6 characters.
I tested locally with a small subset of bills and everything seemed to be working fine. The trouble started when I tried to bring up Daum’s machine as a slave to my machine. After some finagling and hair pulling I finally got it working. The takeaways were:
- You can’t run your DataNode on localhost, it needs to be your computer’s hostname to accept connections.
- Hostnames are important. If you don’t have a DNS server make sure your hostnames are aliased in /etc/hosts
- If your HDFS set up is showing 100% utilization but you know it isn’t true, try rm’ing the data file and then re-formatting your namenode.
- If a copy or reduce step fails in distributed mode the error messages are usually really cryptic – check the actual logs.
- When something throws an exception during a map or reduce operation, the error won’t be reported to STDOUT
Anyway, it was a slightly frustrating but rewarding experience – I even got to code some Java! The visualization of the word frequencies is here.
Might be about time to process one of the Amazon datasets with EC2…