Boston: Who’s hiring PHP developers?

Last week, I was catching up with some friends when one of them asked an interesting question – which Boston area companies are currently hiring PHP developers? Surprisingly, I didn’t really have a good answer, so I decided to find out. I searched job posts that were specifically looking for PHP developers and pulled the results together in a spreadsheet. As I was looking at the data, I decided to put together a graphic, which is available below along with the list of companies. As always, questions or comments welcome!

Company                               City
Acquia                                Burlington
ADTRAN                                Burlington
Allen & Gerritsen                     Boston
Applause                              Framingham
Arbor Networks                        Burlington
Berklee College of Music              Boston
Biogen Idec                           Cambridge
Black Duck Software                   Burlington
Blue State Digital                    Boston
Brafton Inc.                          Boston
Brigham and Women’s Hospital          Wellesley
Brightcove                            Boston
Catalina Marketing                    Boston
Comsol                                Burlington
Constant Contact                      Waltham
ContentLEAD                           Boston
D50 Media                             Wellesley
Demandware                            Burlington
Desire2Learn (D2L)                    Boston
Dew Softech (contract position)       Boston
Digital Bungalow                      Salem
Dynatrace                             Waltham
eGenerationMarketing                  Boston
FASTHockey                            Boston
FlipKey, Inc.                         Boston
Genscape, Inc.                        Boston
Harvard Medical School                Boston
Harvard School of Public Health       Boston
Hill Holliday                         Boston
HubSpot, Inc.                         Cambridge
Integrated Computer Solutions         Bedford
InterSystems                          Cambridge
MediaMath                             Cambridge
MedTouch                              Cambridge
MIT                                   Cambridge
Modo Labs                             Cambridge
Motus (CRS)                           Boston
NameMedia                             Waltham
Nanigans                              Boston
Northeastern University               Boston
Northpoint Digital                    Boston
NutraClick                            Boston
Pegasystems                           Cambridge
Placester                             Boston
Polar Design                          Woburn
SevOne, Inc.                          Boston
SilverSky                             Boston
SmarterTravel.com                     Boston
Source of Future Technology, Inc.     Cambridge
StudyPoint                            Boston
SurfMerchants LLC                     Boston
Tatto Media                           Boston
Tufts University                      Boston
UMass Boston                          Boston
Unitrends                             Burlington
Wayfair                               Boston
Zipcar                                Boston

Friday Links: Fitness^3

It’s been a long week, but you’ve made it: it’s Friday! Nothing goes better with a Friday than a couple of fresh links for your ride home and, of course, a cold beer. We can’t help you with the beer, but we’ve got you covered on the links. A slew of new wearable health products were released this week and here they are:

Planning on picking up a fitness tracker? Let us know in the comments!

PHP: Using Gearman for a MapReduce-inspired workflow

Over the last few weeks we’ve been using Gearman to help us do some real-time stream processing. In production, what we’ve basically been doing is reading messages off an Amazon Kinesis stream, creating Gearman jobs for anything that’s computationally expensive, and then gathering up the processed data for a batched insert into Amazon Redshift, which is also handled by a Gearman job. Conceptually, this workflow is reasonably similar to MapReduce, where a series of input jobs is transformed by “mappers” and the results are then collected in a “reduce” step.

From a practical point of view, using Gearman like this offers some interesting benefits:

  • Adding additional “map” capacity is relatively straightforward since you can just add additional machines that connect to the Gearman server.
  • Developing and testing the “map” and “reduce” functionality is easy since nothing is shared and you can run the code directly, independently of Gearman.
  • In our experience so far, the Gearman server can handle a high volume of jobs – we’ve pushed ~300 jobs/sec without a problem.
  • Since Gearman clients exist for dozens of languages, you could write different pieces of the system in whatever language fits best.

Overview

OK, so how does all of this actually work? For the purposes of a demonstration, let’s assume you’ve been tasked with scraping the META keywords and descriptions from a few hundred thousand sites and counting up word frequencies across all the sites. In straight PHP, you’d end up with code that looks something like this:
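(The original snippet isn’t reproduced here; below is a minimal sketch of the sequential approach, using plain PHP and DOMDocument rather than the Guzzle-based code in the repo.)

    <?php
    // Sequential sketch: fetch each URL, pull the META keywords/description,
    // and tally word counts in a single array.
    $urls = file("100sites.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $counts = [];

    foreach ($urls as $url) {
        $html = @file_get_contents("http://" . $url);
        if ($html === false) {
            continue; // skip hosts that don't respond
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings from malformed markup

        foreach ($doc->getElementsByTagName("meta") as $meta) {
            $name = strtolower($meta->getAttribute("name"));
            if ($name !== "keywords" && $name !== "description") {
                continue;
            }
            $words = preg_split('/[\s,]+/', strtolower($meta->getAttribute("content")));
            foreach ($words as $word) {
                if ($word !== "") {
                    $counts[$word] = ($counts[$word] ?? 0) + 1;
                }
            }
        }
    }

    file_put_contents("nogearman_keyword_results.json", json_encode($counts));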

The problem is that since you’re making the requests sequentially, scraping a significant number of URLs is going to take an impractically long time. What we really want to do is fetch the URLs in parallel, extract the META keywords, and then combine the results in a single data structure.

To keep the amount of code down, I used the Symfony2 Console component, Guzzle, and Monolog to provide infrastructure around the project. Walking through the files of interest:

  • GearmanCommand.php: Command to execute either the “node” or the “master” Gearman workers.
  • StartScrapeCommand.php: Command to create the Gearman jobs that kick off the scrapers.
  • Master.php: Worker code to gather up the extracted keywords and maintain a running count.
  • Node.php: Worker code to extract the META keywords from a given URL.
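
To give a feel for the worker side, here’s a heavily trimmed sketch using the pecl Gearman extension. The real files are structured differently, and extractMetaKeywords() is a hypothetical stand-in for the DOM-parsing code shown earlier:

    <?php
    // Node.php, roughly: the "map" worker. It registers processUrl, fetches
    // the page, and hands the keywords to the reducer as a background job.
    $worker = new GearmanWorker();
    $worker->addServer("127.0.0.1", 4730); // default gearmand port

    $worker->addFunction("processUrl", function (GearmanJob $job) {
        $html = @file_get_contents("http://" . $job->workload());
        $keywords = ($html === false) ? [] : extractMetaKeywords($html); // hypothetical helper

        $client = new GearmanClient();
        $client->addServer("127.0.0.1", 4730);
        $client->doBackground("countKeywords", json_encode($keywords));
    });

    while ($worker->work()); // block and process jobs forever

    // Master.php, roughly: the "reduce" worker, running as a separate process.
    // It registers countKeywords and folds each batch into a running tally.
    $master = new GearmanWorker();
    $master->addServer("127.0.0.1", 4730);
    $counts = [];

    $master->addFunction("countKeywords", function (GearmanJob $job) use (&$counts) {
        foreach (json_decode($job->workload(), true) as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
        file_put_contents("keyword_results.json", json_encode($counts));
    });

    while ($master->work());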

Setup

Taking this for a spin is straightforward enough. Fire up an Ubuntu EC2 instance and then run the following:
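
The exact commands were in the original post; the setup amounts to roughly the following (package names vary by Ubuntu release, and the clone URL is whatever the GitHub link at the end of the post points to):

    # install the Gearman job server plus the PHP extension
    sudo apt-get update
    sudo apt-get install -y gearman-job-server php5-cli php5-dev libgearman-dev
    sudo pecl install gearman
    # then enable "extension=gearman.so" in your php.ini

    # grab the project and its Composer dependencies
    git clone <repo-from-the-github-link> gearman-demo && cd gearman-demo
    composer install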

OK, now that everything is set up, let’s run the normal PHP implementation:
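
(The exact invocation was in the original post; it’s something along these lines, with the command name here being a guess:)

    php bin/application.php setfive:scrape-nogearman 100sites.txt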

Looks like about 10–12 seconds to process 100 URLs. Not terrible, but assuming linear growth, processing 100,000 URLs would take around three hours, which is a bit painful. You can verify it worked by looking at the “bin/nogearman_keyword_results.json” file.

Now, let’s look at the Gearman version. Running it is straightforward, just run the following:
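
Based on the commands referenced later in the post, the single-worker run looks roughly like this (the bin/ path is an assumption):

    # terminal 1: the "reduce" worker
    php bin/application.php setfive:gearman master

    # terminal 2: a single "map" worker
    php bin/application.php setfive:gearman node

    # terminal 3: queue one job per URL
    php bin/application.php setfive:start-scraper 100sites.txt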

You’ll eventually get output from the “master” when it finishes, along with the total elapsed time. It’ll probably come in around 15 seconds again, because we’re still just using a single process to fetch the URLs.

Party in parallel

But now here’s where things get interesting: we can start adding multiple “worker” processes to do some of the computation in parallel. In my experience, the easiest way to manage this is with Supervisor, since it makes starting and stopping groups of processes easy and also handles collecting their output. Run the following to copy the config file, restart Supervisor, and verify the workers are running:
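
The original commands aren’t reproduced here, but with a Supervisor config along these lines (filenames and paths are assumptions), it’s just a copy and a reload:

    ; gearman_workers.conf – run ten "map" workers under Supervisor
    [program:gearman_node]
    command=php /home/ubuntu/gearman-demo/bin/application.php setfive:gearman node
    process_name=%(program_name)s_%(process_num)02d
    numprocs=10
    autostart=true
    autorestart=true

    # copy it into place, reload, and confirm the workers are RUNNING
    sudo cp gearman_workers.conf /etc/supervisor/conf.d/
    sudo supervisorctl reread && sudo supervisorctl update
    sudo supervisorctl status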

And now, run “php application.php setfive:gearman master” in one terminal and, in another, “php application.php setfive:start-scraper 100sites.txt” to kick off the jobs.

Boom! Much faster. We’re still only doing 100 URLs so the effect of processing in parallel isn’t that dramatic. Again, you can check out the results by looking at “bin/keyword_results.json”.

The effects of using multiple workers will be more apparent when you’ve got a larger number of URLs to scrape. Inside the “bin” directory there’s a file named “quantcast_site_lists.tar.gz” which has site lists of different sizes up to the full 1 million from Quantcast.

I ran some tests on the lists using different numbers of workers; the results are below.

URLs      0 Workers    10 Workers    25 Workers
100       12 sec.      12 sec.       5 sec.
1,000     170 sec.     34 sec.       33 sec.
5,000     1,174 sec.   195 sec.      183 sec.
10,000    2,743 sec.   445 sec.      424 sec.

One thing to note: keep an eye on how jobs are queuing up on the Gearman server while this runs.
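A quick way to check is the gearadmin tool that ships with the Gearman server:

    gearadmin --status
    # one line per function: name, jobs queued, jobs running, workers available, e.g.
    # processUrl     0     0     10
    # countKeywords  1843  1     1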

If you notice that “processUrl” has zero jobs queued but a large backlog for “countKeywords”, you’re actually saturating the “reducer”, and adding additional worker nodes in Supervisor isn’t going to increase your speed. Testing on an m3.small, I was seeing this happen with 25 workers.

Another powerful feature of Gearman is that it makes running jobs on remote hosts really easy. To add a “remote” node to the job server, you’d just need to start a second machine, update the IP address in Base.php, and use the same Supervisor config to start a group of workers. They’d automatically register with your Gearman server and start processing jobs.
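
For illustration, the only change on the new machine is pointing the workers at the shared job server instead of localhost; the Base.php details here are assumptions:

    <?php
    // Base.php (sketch): swap localhost for the job server's private IP
    const GEARMAN_SERVER_HOST = "10.0.1.23"; // example IP of the gearmand machine
    const GEARMAN_SERVER_PORT = 4730;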

Anyway, as always questions and comments appreciated and all the code is on GitHub.

Three projects we’d love to build!

I was catching up with a friend of mine recently and she was asking what I “wanted” to build. As we started talking about it, I realized I didn’t really have a “go to” list of what I’d love to try building. At this point, web apps have become a bit boring; I mean, how many times can you really write:
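
(The snippet from the original post was lost; presumably it was the usual CRUD boilerplate, something like this Doctrine-flavored sketch:)

    <?php
    // yet another create-and-save handler
    $user = new User();
    $user->setEmail($request->get('email'));
    $user->setName($request->get('name'));
    $em->persist($user); // $em: the Doctrine entity manager
    $em->flush();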

Inspired by Y Combinator’s Startup Ideas We’d Like to Fund and more recently Spotify’s Design Lead on Why Side Projects Should Be Stupid here’s a list of projects we’d love to build.

Bitcoin Hardware Integration

Inspired by Bitcoin: How would you build a parlor game?, I think it would be awesome to build some sort of hardware Bitcoin integration. An interesting angle would be to take something familiar like a casino game, jukebox, or vending machine and then make it “Bitcoin powered”. Also, with the MIT Bitcoin airdrop coming in the fall, it seems like Central Square and Kendall Square are the perfect places to roll something like this out.

Real Time Stream Processing

From NodeJS to Websockets, “real time” seems to be all the rage but most applications aren’t really dealing with processing streaming data at scale. Processing data streams, even at moderate scale, seems like it would be a fun challenge and would open up the development of interesting solutions. In the last few years, projects like Apache Storm and Akka have significantly lowered the barriers to development so it seems like the perfect time to jump in. We’d love to leverage these tools to analyze click streams, sensor data, or financial “tick” data.

Badass Visualizations

As the price of storage has decreased, organizations are recording and retaining more data than ever before. Unfortunately, most companies are hesitant to experiment with different visualization options and end up with a handful of charts and tables. I’d love to have the opportunity to really take d3js out for a spin and build some awesome visualizations. From visualizing multi-modal data to helping uncover patterns, I’d love to help organizations get the most out of their data.

Anyway, if anyone wants to build any of these definitely get in touch! Would also love to hear everyone else’s ideas in the comments.

Four takeaways from Mary Meeker’s 2014 Internet trends

As has become tradition, Mary Meeker of Kleiner Perkins published her annual “Internet trends” report a couple of days ago. As always, it’s a great read and you should check it out here. Digging through the slides, a few things jumped out at me:

Tablets are far from dead

Over the past few months, there’s been a slew of blog posts bemoaning the death of the tablet. The arguments range from tablet hardware being too immature to most people not having a strong tablet use case. Looking at the report, though, the number of tablets shipped is clearly still growing rapidly. That said, the report doesn’t break out what kinds of tablets are being shipped – we’ll have to check in with Ben Evans for that.

Print media spend is heavily over indexed

If I worked in print media, slide 15 would be giving me some serious heartburn. Compared to other channels, print receives a disproportionate share of ad spend relative to how much time consumers actually spend with print media. The obvious outcome is that advertising dollars will continue to shift to digital, with mobile picking up the biggest chunk. The buying spree around mobile ad networks is already signalling this shift, the clearest example being Twitter buying MoPub.

Music is changing

There was a flurry of attention and armchair quarterbacking around Apple’s acquisition of Beats and slide 50 highlights how music is changing. With consumers favoring streaming over purchasing, the market opportunity for “winning” music is certainly shrinking. Couple that with streaming services being available on multiple devices and the Apple/Beats deal looks even stranger. In any case, the future will definitely be streamed.

The battle for the living room is heating up

Looks like the hotly anticipated “battle for the living room” is heating up. With the proliferation of set-top boxes like the Roku and Amazon Fire TV, along with the release of next-gen game consoles, consumers are changing how they consume traditional TV. Coupled with “apps” like HBO Go, there are fewer reasons to avoid “cutting the cord” and abandoning traditional cable service. Obviously the content owners still hold the keys, but as Netflix proved with House of Cards, making awesome TV as an outsider isn’t impossible.

Anyway, as always would love any thoughts or comments!