Net Neutrality: A recap and some cliffnotes

Net Neutrality has been all over the news lately and I’ve been fielding a couple of questions related to it. At Setfive, we think it’s a critically important issue, both to startups and the technology infrastructure of the United States as a whole. Because of that, we decided to pull together an overview, some history, and key outcomes surrounding the Net Neutrality debate. As always, questions or comments welcome!

What is Net Neutrality?
Network neutrality, or net neutrality for short, is a term first coined by Columbia Law professor Tim Wu. It is the principle that internet service providers (such as Verizon and Comcast) and governments should give you equal access to content and data regardless of where it comes from. Under it, internet service providers (ISPs) are not allowed to discriminate by slowing speeds for one company in favor of its competitor.

Essentially, net neutrality maintains a free, open, and fair internet.

The Lead Up To January 14, 2014

  • In 2002, the FCC had the opportunity to regulate ISPs as it had done for the phone companies. Ultimately, though, the FCC chose not to, reasoning that ISPs provide “information services”, which are completely different from the telecommunication services phone companies provide.
  • However, a few years later, the FCC began to notice the enormous power that ISPs had accumulated. In an attempt to curb and regulate them, the FCC created the Open Internet Rules in 2010.

The Open Internet Rules:

  • Enforced transparency in how ISPs operate and manage their networks
  • Prohibited ISPs from obstructing access to legal content and applications
  • Maintained an equal and fair playing field online by preventing ISPs from giving preference to one company over another. This last rule is essentially the core of net neutrality.

In response to these rules, Verizon took the FCC to court in 2013, arguing that the agency had no authority to use the Open Internet Rules to regulate ISPs.

Fast forward to January 14, 2014

  • On this day, the D.C. Circuit Court of Appeals ruled in Verizon Communications Inc. v. FCC that portions of the Open Internet Rules, especially those pertaining to an equal and fair internet, could not be applied to ISPs.
  • The reasoning was that those portions of the rules apply only to common carriers, which provide telecommunication services. Since the FCC classifies ISPs as providers of information services, the law does not treat them as common carriers.

What does this ruling mean?
It eliminated the only existing rules protecting net neutrality. As a result, ISPs can now:

  • Charge companies fees for “premium” access to their consumers. Think Verizon charging Netflix to stream to their customers at better rates.
  • Selectively prioritize one source of traffic over another. Think Comcast prioritizing delivering its Xfinity onDemand service over HBO Go.
  • And of course, create “slow lanes” and “fast lanes”, paving the way to charging for à la carte Internet packages, just like TV. Imagine seeing errors like: “Sorry! You need to subscribe to the ‘social package’ to access this site.”

What’s the president’s stance on all this?
He’s pro net neutrality and has urged the FCC to establish strong rules that would protect it. However, since the FCC is an independent government agency, Obama has no direct influence over its decision. Additionally, in a bitterly divided Congress, some hardline Republicans are taking an anti-Net Neutrality stance to pander to their base. See The Oatmeal on Ted Cruz.

What’s next?
The FCC does have the power to reclassify ISPs as telecommunication service providers and thus subject them to the Open Internet Rules. Instead, it decided to create a new net neutrality framework that would hold up in court while satisfying both sides.

Right now, everyone is in a holding pattern waiting for the FCC to make a final announcement.

Friday Links: Apple Pay, SaaS, and Net Neutrality

Welcome to the weekend! We’ve rounded up some interesting reading to carry you through till Monday. Fire up your iPad, grab some cider, and snuggle up with a blanket:

Boston: Who’s hiring PHP developers?

Last week, I was catching up with some friends when one of them asked an interesting question – Which Boston area companies are currently hiring PHP developers? Surprisingly, I didn’t really have a good answer so I decided to find out. To figure this out, I searched job posts that were specifically looking for PHP developers and started pulling together a spreadsheet about the posts. As I was looking at the data, I decided to put together a graphic which is available below along with the list of companies. As always, questions or comments welcome!

Company | City
Acquia | Burlington
ADTRAN | Burlington
Allen & Gerritsen | Boston
Applause | Framingham
Arbor Networks | Burlington
Berklee College Of Music | Boston
Biogen Idec | Cambridge
Black Duck Software | Burlington
Blue State Digital | Boston
Brafton Inc. | Boston
Brigham And Women’s Hospital | Wellesley
Brightcove | Boston
Catalina Marketing | Boston
Comsol | Burlington
Constant Contact | Waltham
ContentLEAD | Boston
D50 Media | Wellesley
Demandware | Burlington
Desire2Learn (D2L) | Boston
Dew Softech (contract Position) | Boston
Digital Bungalow | Salem
Dynatrace | Waltham
Egenerationmarketing | Boston
FASTHockey | Boston
Flipkey, Inc. | Boston
Genscape, Inc. | Boston
Harvard Medical School | Boston
Harvard School Of Public Health | Boston
Hill Holliday | Boston
Hubspot, Inc. | Cambridge
Integrated Computer Solutions | Bedford
Intersystems | Cambridge
Mediamath | Cambridge
Medtouch | Cambridge
MIT | Cambridge
Modo Labs | Cambridge
Motus (crs) | Boston
Namemedia | Waltham
Nanigans | Boston
Northeastern University | Boston
Northpoint Digital | Boston
Nutraclick | Boston
Pegasystems | Cambridge
Placester | Boston
Polar Design | Woburn
Sevone, Inc. | Boston
Silversky | Boston
Smartertravel.com | Boston
Source Of Future Technology, Inc. | Cambridge
Studypoint | Boston
Surfmerchants LLC | Boston
Tatto Media | Boston
Tufts University | Boston
Umass Boston | Boston
Unitrends | Burlington
Wayfair | Boston
Zipcar | Boston

Friday Links: Fitness^3

It’s been a long week but you’ve made it, it’s Friday! Nothing goes better with Fridays than a couple of fresh links for your ride home and of course a cold beer. We can’t help you with that beer but we’ve got you covered on those links. A slew of new wearable health products were released this week and here they are:

Planning on picking up a fitness tracker? Let us know in the comments!

PHP: Using Gearman for a MapReduce inspired workflow

Over the last few weeks we’ve been using Gearman to help us do some realtime stream processing. In production, we’ve basically been reading messages off an Amazon Kinesis stream, creating Gearman jobs for anything that’s computationally expensive, and then gathering up the processed data for a batched insert into Amazon Redshift, which also runs as a Gearman job. Conceptually, this workflow is reasonably similar to how MapReduce works, where a series of inputs is transformed by “mappers” and the results are then collected in a “reduce” step.
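As a rough illustration of the “gather up and batch insert” half of that workflow, here is a minimal sketch in Python (not the PHP we run in production). The `BatchBuffer` class, the `expensive_transform` stand-in, and the `flush` callback are all hypothetical names invented for this example; a real sink would be something like a bulk insert into the warehouse.

```python
from typing import Callable, List


class BatchBuffer:
    """Collects processed rows and hands them off in fixed-size batches."""

    def __init__(self, batch_size: int, flush: Callable[[List[dict]], None]):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.flush_fn = flush  # hypothetical sink, e.g. a bulk insert
        self.rows: List[dict] = []

    def add(self, row: dict) -> None:
        """Buffer one processed row and flush once the batch is full."""
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, if anything, to the sink."""
        if self.rows:
            batch, self.rows = self.rows, []
            self.flush_fn(batch)


def expensive_transform(message: str) -> dict:
    """Stand-in for the computationally expensive "map" work."""
    return {"message": message, "length": len(message)}


if __name__ == "__main__":
    buffer = BatchBuffer(batch_size=2, flush=lambda rows: print("insert", len(rows)))
    for message in ["a", "bb", "ccc"]:
        buffer.add(expensive_transform(message))
    buffer.flush()  # push the final partial batch
```

The explicit final `flush()` matters: without it, a partial batch at the end of the stream would never be written.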

From a practical point of view, using Gearman like this offers some interesting benefits:

  • Adding additional “map” capacity is relatively straightforward since you can just add additional machines that connect to the Gearman server.
  • Developing and testing the “map” and “reduce” functionality is easy since nothing is shared and you can run the code directly, independently of Gearman.
  • In our experience so far, the Gearman server can handle a high volume of jobs: we’ve pushed ~300 jobs/sec without a problem.
  • Since Gearman clients exist for dozens of languages, you could write different pieces of the system in whatever language fits best.

Overview

OK, so how does all of this actually work? For the purposes of a demonstration, let’s assume you’ve been tasked with scraping the META keywords and descriptions from a few hundred thousand sites and counting up word frequencies across all the sites. Assuming you were doing this in straight PHP, you’d end up with code that looks something like this.
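For readers who want something concrete, here is a minimal sketch of that sequential approach, written in Python rather than PHP and not taken from the project’s code. The names `MetaParser`, `extract_words`, `fetch`, and `count_keywords` are made up for this example, and the 10 second timeout is an arbitrary assumption.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen


class MetaParser(HTMLParser):
    """Collects the content of the keywords and description META tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.chunks.append(attributes.get("content") or "")


def extract_words(html):
    """Return the lowercased words found in the page's META tags."""
    parser = MetaParser()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    return re.findall(r"[a-z0-9']+", text)


def fetch(url):
    """Download a page; the timeout value is an arbitrary choice."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def count_keywords(urls, fetch_page=fetch):
    """Fetch each URL in turn and keep a running word count."""
    totals = Counter()
    for url in urls:  # one request at a time: this loop is the bottleneck
        try:
            totals.update(extract_words(fetch_page(url)))
        except OSError:
            continue  # skip sites that fail to load
    return totals
```

Passing `fetch_page` as a parameter is only there so the counting logic can be exercised without hitting the network.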

The problem is that since you’re making the requests sequentially, scraping a significant number of URLs is going to take an intractable amount of time. What we really want to do is fetch the URLs in parallel, extract the META keywords, and then combine all that data in a single data structure.
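The shape of that idea can be sketched in a few lines of Python. This is only an analogy: threads inside one process stand in for Gearman workers, which would normally be separate processes or machines, and `process_url` is a hypothetical placeholder rather than a real scraper.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def process_url(url):
    """Placeholder "map" step: a real one would fetch the URL and return its keywords."""
    return url.lower().replace(".", " ").split()


def count_in_parallel(urls, worker=process_url, max_workers=10):
    """Run the map step concurrently and merge the results in one place."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # "map": each URL is handled independently, so the calls can overlap
        for words in pool.map(worker, urls):
            # "reduce": a single loop merges every result into one structure
            totals.update(words)
    return totals


if __name__ == "__main__":
    print(count_in_parallel(["Example.com", "example.org", "test.com"]))
```

Note that the merge still happens in a single place, which is the same constraint the “reduce” side of the Gearman setup has.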

To keep the amount of code down, I used the Symfony2 Console component, Guzzle, and Monolog to provide infrastructure around the project. Walking through the files of interest:

  • GearmanCommand.php: Command to execute either the “node” or the “master” Gearman workers.
  • StartScrapeCommand.php: Command to create the Gearman jobs to start the scrapers
  • Master.php: The code to gather up all the extracted keywords and maintain a running count.
  • Node.php: Worker code to extract the meta keywords from a given URL

Setup

Taking this for a spin is straightforward enough. Fire up an Ubuntu EC2 instance and then run the following:

OK, now that everything is set up, let’s run the normal PHP implementation.

Looks like about 10-12 seconds to process 100 URLs. Not terrible, but assuming linear growth, processing 100,000 URLs would take 10,000-12,000 seconds, or roughly 3 hours, which is a bit painful. You can verify it worked by looking at the “bin/nogearman_keyword_results.json” file.

Now, let’s look at the Gearman version. Running it is straightforward; just run the following:

You’ll eventually get an output from the “master” when it finishes with the total elapsed time. It’ll probably come in somewhere around 15ish seconds again because we’re still just using a single process to fetch the URLs.

Party in parallel

But now here’s where things get interesting: we can start adding multiple “worker” processes to do some of the computation in parallel. In my experience, the easiest way to handle this is using Supervisor, since it makes starting and stopping groups of processes easy and also handles collecting their output. Run the following to copy the config file, restart Supervisor, and verify the workers are running:

And now, you’ll want to run “php application.php setfive:gearman master” in one terminal and, in another, run “php application.php setfive:start-scraper 100sites.txt” to kick off the jobs.

Boom! Much faster. We’re still only doing 100 URLs so the effect of processing in parallel isn’t that dramatic. Again, you can check out the results by looking at “bin/keyword_results.json”.

The effects of using multiple workers will be more apparent when you’ve got a larger number of URLs to scrape. Inside the “bin” directory there’s a file named “quantcast_site_lists.tar.gz” which has site lists of different sizes up to the full 1 million from Quantcast.

I ran some tests on the lists using different numbers of workers and the results are below.

URLs | 0 Workers | 10 Workers | 25 Workers
100 | 12 sec. | 12 sec. | 5 sec.
1000 | 170 sec. | 34 sec. | 33 sec.
5000 | 1174 sec. | 195 sec. | 183 sec.
10000 | 2743 sec. | 445 sec. | 424 sec.
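To read those timings as speedups, here is a small Python sketch that divides the 0-worker time by the 10- and 25-worker times. The numbers are copied straight from the table; the `TIMINGS` and `speedups` names are just for this example.

```python
# Elapsed seconds copied from the table above: {urls: (0, 10, 25 workers)}
TIMINGS = {
    100: (12, 12, 5),
    1000: (170, 34, 33),
    5000: (1174, 195, 183),
    10000: (2743, 445, 424),
}


def speedups(timings):
    """Return {urls: (speedup with 10 workers, speedup with 25 workers)}."""
    result = {}
    for urls, (baseline, ten, twenty_five) in timings.items():
        result[urls] = (round(baseline / ten, 1), round(baseline / twenty_five, 1))
    return result


if __name__ == "__main__":
    for urls, (ten, twenty_five) in speedups(TIMINGS).items():
        print(f"{urls:>6} URLs: {ten}x with 10 workers, {twenty_five}x with 25 workers")
```

On these numbers, 10 workers give roughly a 5x to 6x speedup from 1,000 URLs upward, and going to 25 workers adds very little on top of that.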

One thing to note is that if you run:

and notice that “processUrl” has zero jobs but there are a lot waiting for “countKeywords”, you’re actually saturating the “reducer”, and adding more worker nodes in Supervisor isn’t going to increase your speed. Testing on an m3.small, I was seeing this happen with 25 workers.
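If you want to automate that check, a minimal Python sketch might look like the following. It assumes the queue information comes from Gearman’s administrative “status” command, which lists one function per line as tab-separated name, queued jobs, running jobs, and available workers, ending with a lone “.”. The `SAMPLE` snapshot and its counts are invented for illustration, and treating any non-empty “countKeywords” queue as saturation is a simplification.

```python
def parse_status(text):
    """Parse Gearman "status" output into {function: (queued, running, workers)}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":  # a lone "." ends the listing
            continue
        name, queued, running, workers = line.split("\t")
        stats[name] = (int(queued), int(running), int(workers))
    return stats


def reducer_is_saturated(stats, map_fn="processUrl", reduce_fn="countKeywords"):
    """True when no map jobs are queued but reduce jobs are still waiting."""
    map_queued = stats.get(map_fn, (0, 0, 0))[0]
    reduce_queued = stats.get(reduce_fn, (0, 0, 0))[0]
    return map_queued == 0 and reduce_queued > 0


# Hypothetical snapshot; the counts are invented for illustration.
SAMPLE = "processUrl\t0\t0\t25\ncountKeywords\t812\t1\t1\n.\n"

if __name__ == "__main__":
    print(reducer_is_saturated(parse_status(SAMPLE)))
```

In practice you would probably compare the “countKeywords” backlog against a threshold instead of zero, since a handful of waiting jobs is normal.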

Another powerful feature of Gearman is that it makes running jobs on remote hosts really easy. To add a “remote” to the job server, you’d just need to start a second machine, update the IP address in Base.php, and use the same Supervisor config to start a group of workers. They’d automatically register with your Gearman server and start processing jobs.

Anyway, as always questions and comments appreciated and all the code is on GitHub.