PHP: Using Gearman for a MapReduce inspired workflow

Over the last few weeks we’ve been utilizing Gearman to help us do some realtime stream processing. In production, what we’ve basically been doing is reading messages off an Amazon Kinesis stream, creating jobs in Gearman for anything that’s computationally expensive, and then gathering up the processed data for a batched insert into Amazon Redshift on a Gearman job as well. Conceptually, this workflow is reasonably similar to how MapReduce works where a series of input jobs is transformed by “mappers” and then results are collected in a “reduce” step.

From a practical point of view, using Gearman like this offers some interesting benefits:

  • Adding additional “map” capacity is relatively straightforward since you can just add additional machines that connect to the Gearman server.
  • Developing and testing the “map” and “reduce” functionality is easy since nothing is shared and you can run the code directly, independently of Gearman.
  • In our experience so far, the Gearman server can handle a high volume of jobs/minute – we’ve pushed ~300/sec without a problem.
  • Since Gearman clients exist for dozens of languages, you could write different pieces of the system in whatever language fits best.

Overview

OK, so how does all of this actually work. For the purposes of a demonstration, lets assume you’ve been tasked with scraping the META keywords and descriptions from a few hundred thousand sites and counting up word frequencies across all the sites. Assuming you were doing this in straight PHP, you’d end up with code that looks something like this.

The problem is that since you’re making the requests sequentially, scraping a significant number of URLs is going to take an intractable amount of time. What we really want to do is fetch the URLs in parallel, extract the META keywords, and then combine all that data in a single data structure.

To keep the amount of code down, I used the Symfony2 Console component, Guzzle and Monolog to provide infastructure around the project. Walking through the files of interest:

  • GearmanCommand.php: Command to execute either the “node” or the “master” Gearman workers.
  • StartScrapeCommand.php: Command to create the Gearman jobs to start the scrapers
  • Master.php: The code to gather up all the extracted keywords and maintain a running count.
  • Node.php: Worker code to extract the meta keywords from a given URL

Setup

Taking this for a spin is straightforward enough. Fire up an Ubuntu EC2 and then run the following:

OK, now that everything is setup lets run the normal PHP implementation.

Looks like about 10-12 seconds to process 100 URLs. Not terrible but assuming linear growth that means processing 100,000 URLs would take almost 2.5 hours which is a bit painful. You can verify it worked by looking at the “bin/nogearman_keyword_results.json” file.

Now, lets look at the Gearman version. Running the Gearman version is straightforward, just run the following:

You’ll eventually get an output from the “master” when it finishes with the total elapsed time. It’ll probably come in somewhere around 15ish seconds again because we’re still just using a single process to fetch the URLs.

Party in parallel

But now here’s where things get interesting, we can start adding multiple “worker” processes to do some of the computation in parallel. In my experience, the easiest way to handle this is using Supervisor since it makes starting and stopping groups of processes easy and also handles collecting their output. Run the following to copy the config file, restart supervisor, and verify the workers are running:

And now, you’ll want to run “application.php setfive:gearman master” in one terminal and in another run “php setfive:start-scraper 100sites.txt” to kick off the jobs.

Boom! Much faster. We’re still only doing 100 URLs so the effect of processing in parallel isn’t that dramatic. Again, you can check out the results by looking at “bin/keyword_results.json”.

The effects of using multiple workers will be more apparent when you’ve got a larger number of URLs to scrape. Inside the “bin” directory there’s a file named “quantcast_site_lists.tar.gz” which has site lists of different sizes up to the full 1 million from Quantcast.

I ran a some tests on the lists using different numbers of workers and the results are below.

0 Workers 10 Workers 25 Workers
100 URLs 12 sec. 12 sec. 5 sec.
1000 URLs 170 sec. 34 sec. 33 sec.
5000 URLs 1174 sec. 195 sec. 183 sec.
10000 URLs 2743 sec. 445 sec. 424 sec.

One thing to note, is if you run:

And notice that “processUrl” has zero jobs but there’s a lot waiting for “countKeywords”, you’re actually saturating the “reducer” and adding additional worker nodes in Supervisor isn’t going to increase your speed. Testing on a m3.small, I was seeing this happen with 25 workers.

Another powerful feature of Gearman is that it makes running jobs on remote hosts really easy. To add a “remote” to the job server, you’d just need to start a second machine, update the IP address in Base.php, and user the same Supervisor config to start a group of workers. They’d automatically register to your Gearman server and start processing jobs.

Anyway, as always questions and comments appreciated and all the code is on GitHub.

Three projects we’d love to build!

I was catching up with a friend of mine recently and she was asking what I “wanted” to build. As we started talking about it, I realized I didn’t really have a “go to” list of what I’d love to try building. At this point, web apps have become a bit boring, I mean how many times can you really write:

Inspired by Y Combinator’s Startup Ideas We’d Like to Fund and more recently Spotify’s Design Lead on Why Side Projects Should Be Stupid here’s a list of projects we’d love to build.

Bitcoin Hardware Integration

Inspired by Bitcoin: How would you build a parlor game?, I think it would be awesome to build some sort of hardware Bitcoin integration. Any interesting angle would be to take something familiar like a casino game, jukebox, or vending machine and then make it “Bitcoin powered”. Also, with the MIT Bitcoin airdrop in the Fall it seems like Central Square and Kendall Square are the perfect places to roll something like this out.

Real Time Stream Processing

From NodeJS to Websockets, “real time” seems to be all the rage but most applications aren’t really dealing with processing streaming data at scale. Processing data streams, even at moderate scale, seems like it would be a fun challenge and would open up the development of interesting solutions. In the last few years, projects like Apache Storm and Akka have significantly lowered the barriers to development so it seems like the perfect time to jump in. We’d love to leverage these tools to analyze click streams, sensor data, or financial “tick” data.

Badass Visualizations

As the price of storage has decreased, organizations are recording and retaining more data than ever before. Unfortunately, most companies are hesitant to experiment with different visualization options and end up with a handful of charts and tables. I’d love to have the opportunity to really take d3js out for a spin and build some awesome visualizations. From visualizing multi-modal data to helping uncover patterns, I’d love to help organizations get the most out of their data.

Anyway, if anyone wants to build any of these definitely get in touch! Would also love to hear everyone else’s ideas in the comments.

Four takeaways from Mary Meeker’s 2014 Internet trends

As has become tradition, Mary Meeker of Kleiner Perkins published her anual “Internet trends” report a couple of days ago. As always, its a great read and you should check it out here. Digging through the slides, a couple of things jumped out to me personally:

Tablets are far from dead

Over the past few months, there’s been a slew of blog posts bemoaning the death of the tablet. Arguments basically ranged from the point that tablet hardware is too immature to the fact that most people don’t have a strong table usecase. Looking at the report though, the # of shipped tablets is clearly exploding and is trending upwards. That said, the report doesn’t break out what kinds of tablets are being shipped – we’ll have to check in with Ben Evans for that.

Print media spend is heavily over indexed

If I worked in print media, slide 15 would be giving me some serious heartburn. Compared to other channels, print is receiving a disproportionate percentage of ad spend compared to how much time consumer spend consuming print media. Obvious outcome here is that advertising dollars are going to continue to shift digital, with mobile picking up the biggest chunk. The buying spree around mobile ad networks is already signalling this shift with the clear example being Twitter buying MoPub.

Music is changing

There was a flurry of attention and armchair quarterbacking around Apple’s acquisition of Beats and slide 50 highlights how music is changing. With consumers favoring streaming over purchasing, the market opportunity for “winning” music is certainly shrinking. Couple that with streaming services being available on multiple devices and the Apple/Beats deal looks even stranger. In any case, the future will definitely be streamed.

The battle for the living room is heating up

Looks like the hotly anticipated “battle for the living room” is heating up. With the proliferation of set top boxes like the Roku and AmazonTV along with release of next gen game consoles, consumers are changing how they consume traditional TV. Coupled with “apps” like HBO Go, there are fewer reasons to avoid “cutting the cord” and abandoning traditional cable TV service. Obviously the content owners still hold the keys but as Netflix proved with House of Cards, making awesome TV as an outsider isn’t impossible.

Anyway, as always would love any thoughts or comments!

Bitcoin: How would you build a “parlor game”?

Last weekend, the boys and I were at a bar and we started playing on one of those “parlor game” machines. If you’re unfamiliar, they’re basically a computer with a touch screen in a glorified case. They’re loaded up with dozens of games like “spot the difference”, traditional casino games, and of course X-rated versions of everything. As we were burning cash playing, we started thinking out loud that the game would be “cooler” if you could actually win money back. Games like this do exist in Vegas but they’re consequently heavily regulated by gambling related laws. As we were discussing this, someone mentioned building the game using Bitcoin in hopes of skirting gambling regulations. So, how could you go about building a real world game using Bitcoin for payments?

Hardware

The natural choice here would be to run with a WiFi enabled RaspberryPi along with a capacitive touchscreen. I’ve never done any work with touchscreens but it looks like TigerDirect has 22″ Planar ones for under $300. Couple that with a WiFi RaspberryPi kit for $50 and we should be in business for <$500. You could obviously grab a smaller screen but go big or go home right?

Authentication

Now things get interesting. Philosophically, I think you do want to enforce authentication so that users can save scores, carry a BTC balance, and generally make the experience “stickier”. The straightforward approach would be to offer a traditional sign up using an email address and password but that would totally sacrifice leveraging social media. Allowing users to easily “sign in” with Twitter or Facebook would allow you leverage their social graphs to find their friends and drive awareness of the game. An issue I’ve run into in the past is that “kiosks” always ask you to enter you Facebook or Twitter credentials directly on the kiosk – which is a non-starter for most people.

So how can we avoid this? The big idea is that you really just need the user’s OAuth tokens for whatever service they want to use – not their credentials. To do this, you’d need to either distribute an app or have a website where the user could “sign in” while already logged in on Facebook or Twitter on their mobile phone. Then, you’d be able to capture their OAuth tokens and log them in to the kiosk.

Getting the Bitcoin

Ok, so the user is logged in, now we need to get them some Bitcoin to play with. I think what you’d want to do is generate a wallet for each authenticated user and let them transfer in to an address as need be. So since their account is unfunded, they’d be presented with a QR code for an address that they can use to fund it. Another interesting idea would be to let users buy BTC in cash from the bartender which would in turn automatically fund their account. To support this, you’d need an app or website where the bartender could transfer BTC into the user’s wallet after receiving cash on site.

In terms of software to facilitate this, it seems like bitcoinj would be the best option. Written in Java, bitcoinj is a “client node only” implementation of the Bitcoin protocol that allows it to run without a full local copy of the Blockchain. Because of this, bitcoinj will run in resource constrained environments like the RaspberryPi. In addition, bitcoinj supports “simplified payment verification” so you’d be able to clear transactions instantly without waiting for confirmations to settle.

The Games

Graphically and conceptually, the games are all pretty straightforward so your best bet is probably implementing with JavaScript with HTML and CSS. The benefits would be you’re development cycles would be short and you’d be able to develop the games independently of the device itself. With HTML5 hitting it’s stride, you’d also be able to enable “multiplayer” versions using WebSockets.

One interesting consideration is that without being able to definitely vouch for the physical security of the kiosk how can you verify that the running code is authentic?

Sending the Bitcoin

The final step in the dance (and what makes it special) is allowing the user to cash BTC out of their account. I think you’d want to offer the option of allowing the use to generate a “send to” address if they have a mobile wallet and also allow them to cash out in person. If they can generate a “send to” address, they’d have to use the kiosk to scan a QR code where their BTC balance would get sent. To facilitate this, you’d need a webcam attached to the RaspberryPi and then use something like ZXing to decode the data in the QR code.

For the “in-person” option, you’d need to build out functionality to allow the bartender to initiate a transfer from the user’s wallet back to the bar from the web or mobile app. After it settles, they’d simply hand the user back some cash and the account would be settled.

Is this legal?

Therein lays the rub. It probably is but I’m not 100% sure. Eliminating the option to exchange BTC for cash might help the case for legality. Along with that, if you can side step “making odds” as well it might help your case. Would love any thoughts on this!

Musing: Thoughts on Seattle vs. Uber and an AirBnB orgy

In the last two weeks, there’s been two frontpage stories coming out of the “sharing economy” space. First up, news broke that Seattle passed new regulation to limit the number of drivers that Uber or Lyft can have on the road at any time, effectively hamstringing both services. Then, over the weekend a story started circulating about an Airbnb stay gone awry involving an orgy, Twitter, and of course the cops. While the stories are wildly different, they both sit at the intersection of the new “sharing economy” and the role of government regulation. Because of this, both stories sparked an intense debate everywhere from The Wall Street Journal to Hacker News. The attitudes and viewpoints of the discourse were interesting and revealing about people’s attitudes towards regulation in the taxi and hospitality space.

The opinion towards the new regulations in Seattle were overwhelmingly negative, ranging from claims of stifling innovation to accusations of outright corruption. Empirically, it seems that most people, even outside of early adopters, have generally positive feelings about Uber and Lyft. It might be because of the horrible experiences people have had in normal cabs or because of the perception of hackney companies as entrenched monopolies but I think it’s primarily due to people’s perception of risk surrounding Uber or Lyft. As a non-user, the perception of how “risky” it is to have Uber drivers operating in your city is probably near zero. Given how small a percentage of total drivers they’ll make up and that “licensed cab drivers” typically already operate in the area, I don’t think many people feel threatened by having additional drivers potentially ferrying people around their city. Because of this, it’s been easier for people to take hold of the “sharing economy” narrative where Lyft or Uber who were empowered to make a living are suddenly shut out by corrupt, entrenched interests.

Contrast this with the reactions from the same people to the Airbnb story, which ranged from disgust that someone would rent out their condo to strangers through feelings that Airbnb should be held liable for the actions of an independent third party. There’s no argument that the Airbnb story is significantly more disturbing, but it isn’t the first time something like this has happened and it certainly won’t be the last. So why such a different, visceral reaction? It’s perceived risk. As a non-user, the perceived risk to having Airbnb operate in your area is undeniably significant. Take a typical condo apartment building where every occupant is either a member of a condo association or a vetted rental tenant. Introducing the possibility of short term, “random occupants” certainly sounds risky and unnerving to anyone living in the building. Anyone considering the situation immediately evaluates worst case scenarios, “what if they’re criminals?” or “drug dealers?” and so on. Compared to driving, where interactions with strangers is a given, introducing “random interactions” into a scenario where it’s unexpected seems to push people towards favoring legislation.

As a whole, the emergence of the new “sharing economy” is probably a net positive. Despite that, companies are certainly thumbing their nose at government regulation and entrenched players which is going to cause ruffled feathers along the way. To win the hearts and minds, companies will definitely need to manage the perception of how risky their services are both to users and non-users alike.