Fun: The AVC Word Cloud

Happy 2014! In between celebrating Christmas, hanging out with family, and ringing in the New Year, I managed to put together a visualization of the words used on avc.com. AVC, written by Fred Wilson, is probably one of the most popular “startup” blogs on the Internet. It covers a wide array of topics, from “MBA Mondays” and USV portfolio companies to general startup and technology news. Given the range of topics and the fact that the blog has been active since 2003, generating word clouds seemed likely to produce interesting results. With that goal in mind, I set off the day after Christmas.

Check out the finished product at http://symf.setfive.com/d3_avc_blog_cloud/. I actually decided to use Scala to scrape and process the data, so look for a follow-up post on coming to Scala from PHP.
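For a rough idea of the processing step, here is a minimal Python sketch (not the Scala code used for the project) that counts words per year and keeps the most frequent ones. The `top_words_by_year` function, the tiny `STOP_WORDS` set, and the sample posts are all made up for illustration; the real pipeline may tokenize and filter quite differently.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real run would need a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "for"}


def top_words_by_year(posts, n=100):
    """Take (year, text) pairs and return {year: [(word, count), ...]}."""
    counts = defaultdict(Counter)
    for year, text in posts:
        # Lowercase the text and pull out runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts[year].update(w for w in words if w not in STOP_WORDS)
    # Keep only the n most frequent words for each year.
    return {year: counter.most_common(n) for year, counter in counts.items()}


# Made-up sample data standing in for scraped posts.
sample_posts = [
    (2010, "Android is growing and Android is on more phones"),
    (2013, "Bitcoin and Twitter and Google"),
]
top = top_words_by_year(sample_posts, n=3)
```

Each year's list can then be handed to whatever draws the cloud, with the count driving the font size.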

Taking a quick glance at the clouds, a few things do jump out:

  • “Android” enters the top 100 in 2010 and has remained there since.
  • Amazon is surprisingly absent past 2007.
  • Apple hasn’t made the top 100 in any year.
  • It’s interesting to see when USV portfolio companies like Disqus and Zemanta enter and exit.
  • Bitcoin makes the list for 2013.
  • Blackberry, one and done.
  • Facebook peaked in 2007 and then steadily declined until it dropped out this year.
  • Google hits the list for every year.
  • Twitter gets in at 2007 and sticks through this year.

Visualizing the Startup Institute Spring ’13 Class

Last week, we got our hands on the class list for the Spring ’13 Startup Institute class. I had some time to burn so I decided to throw together a visualization using the names and email addresses of the members of the class. You can check it out at http://symf.setfive.com/d3_startup_school/

How it works

Basically, the visualization represents every student with a 3×3 color grid by using various attributes of their names and email addresses. The various squares are calculated with the following formulas:

  • Top left: Calculated by taking the first letter of the first name (say C) and converting it to a % for how far down the alphabet the letter is. So C would come out to 3 / 25 = 12%. Then, this percentage value is applied to the “lightness” component of the HSL color tuple “hsl(40,100%,92%)”.
  • Top middle: Calculated by taking the length of the first name and then calculating a % for how long it is compared to the other names in the list. So basically, it finds the length of the longest name and then divides the length of the current student’s name by that value for a %. The % is then used in the lightness component of “hsl(340,100%,73%)”.
  • Top right: A color generated using the metaphone of the first name. The metaphone is generated and split up into 3 pieces, and the ASCII values of the characters in each piece are summed. Then, the 3 sums are mapped to HSL values depending on the % of the maximum they are for the entire sample.
  • The second row is identical to the first except using the last name.
  • Bottom left: Calculated depending on the “track” that the user is in.
  • Bottom middle and right: These use the same metaphone algorithm except using the email address and email domain name respectively.
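The formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the page: all function names are made up, “applied to the lightness” is read here as scaling the base lightness, the three sums are assumed to map to hue, saturation, and lightness in that order, and the phonetic keys are taken as input rather than computed, since the exact metaphone implementation isn’t specified.

```python
def letter_fraction(name):
    """Alphabet position of the first letter divided by 25 (C -> 3 / 25)."""
    position = ord(name[0].upper()) - ord("A") + 1
    # Assumption: clamp to 1.0 so that Z (26 / 25) does not exceed 100%.
    return min(1.0, position / 25)


def length_fraction(name, all_names):
    """Length of this name relative to the longest name in the list."""
    return len(name) / max(len(n) for n in all_names)


def scaled_hsl(hue, saturation, base_lightness, fraction):
    """Assumption: the percentage scales the lightness of the base color."""
    return f"hsl({hue},{saturation}%,{round(base_lightness * fraction)}%)"


def three_part_sums(key):
    """Split a phonetic key into 3 pieces and sum the character codes of each."""
    size = -(-len(key) // 3)  # ceiling division, so short keys still split
    pieces = [key[i * size:(i + 1) * size] for i in range(3)]
    return [sum(ord(ch) for ch in piece) for piece in pieces]


def keys_to_hsl(keys):
    """Map each key's 3 sums to H, S, L as a share of the per-part maximum."""
    sums = [three_part_sums(k) for k in keys]
    maxima = [max(s[i] for s in sums) or 1 for i in range(3)]
    colors = []
    for s in sums:
        h, sat, light = (s[i] / maxima[i] for i in range(3))
        colors.append(
            f"hsl({round(h * 360)},{round(sat * 100)}%,{round(light * 100)}%)"
        )
    return colors


# Made-up names; "KRS" is a placeholder key, not a real metaphone result.
names = ["Chris", "Alexandra", "Bo"]
top_left = scaled_hsl(40, 100, 92, letter_fraction("Chris"))
top_middle = scaled_hsl(340, 100, 73, length_fraction("Chris", names))
top_right = keys_to_hsl(["KRS"])
```

Because every value is a share of the maximum in the sample, the colors are relative: adding or removing a student can shift everyone else’s squares.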

Technically, the squares are drawn using the d3 library and the page layout is done with Bootstrap.

Anyway, as always, comments and feedback are welcome.

D3: Taking a dip

D3 is a “newish” visualization library that has been getting a lot of attention recently. The New York Times has been using it extensively to create visualizations, and in fact its creator is currently employed by the NYT. I’d been meaning to take D3 for a spin for a while but couldn’t find a dataset I wanted to play with until a few weeks ago.

At the end of November, the LA Times published a dataset titled “Capital appreciation bonds”, which highlighted how California school districts were funding various projects with extremely high interest rate bonds. The LA Times described the data as:

Hundreds of California school and community college districts have financed construction projects with capital appreciation bonds that push repayment far into the future and ultimately cost many times what the district borrowed. Government finance experts consider bonds imprudent if the total cost is more than four times the money borrowed or the maturity period is greater than 25 years.
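The rule of thumb in that description is easy to express as a check. Here is a small Python sketch; the `is_imprudent` function name and the sample figures are made up for illustration and are not taken from the dataset.

```python
def is_imprudent(amount_borrowed, total_cost, maturity_years):
    """True if total cost exceeds 4x the principal or maturity exceeds 25 years."""
    return total_cost > 4 * amount_borrowed or maturity_years > 25


# Made-up example bonds: (amount borrowed, total repayment, years to maturity).
cheap_short = is_imprudent(10_000_000, 25_000_000, 20)   # 2.5x over 20 years
costly = is_imprudent(10_000_000, 60_000_000, 20)        # 6x the principal
long_dated = is_imprudent(10_000_000, 30_000_000, 40)    # 40-year maturity
```

Either condition alone is enough to flag a bond, which is why the last two examples both come out as imprudent.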

Anyway, you can check out my attempt at a visualization here.