Friday Links: Fitness^3

It’s been a long week but you’ve made it, it’s Friday! Nothing goes better with Fridays than a couple of fresh links for your ride home and of course a cold beer. We can’t help you with that beer but we’ve got you covered on those links. A slew of new wearable health products were released this week and here they are:

Planning on picking up a fitness tracker? Let us know in the comments!

Symfony2: Outputting form checkboxes in a hierarchy

Recently when I was working on a client project we had a bunch of permissions which had a hierarchy (or tree structure). For example, you needed Permission 1 to have Permission 1a and Permission 1b. In the examples below lets assume `$choices` is equal to the following:

At first, I used the built in in optgroups of a the select box to output the form, so it was clear what permissions fell where. My form would look similar to:

Multiple select boxes aren’t the easiest to work with as we all know. Also, it isn’t as easy to visually see the difference as the height of the select box could not be long enough to show you what an optgroup’s title is. Instead, I decided to use the checkbox approach. Issue with this, the current Symfony2 form themes don’t output checkboxes in groups or with any visual indication of the hierarchy. I ended up creating my own custom field type so I could customize the way it renders globally via the form themeing. My custom type just always set the choice options to expanded and multiple as true. For the actual rendering, below is what I ended up with.

The above is assuming you are using bootstrap to render your forms as it has those classes. My listless class just sets the ul list style to none. The code should be fairly easy to follow, basically it goes through and any sub-array (an optgroup) it will nest in the list from the previous option. This method does assume that you have the ‘parent’ node before the nested array. I also in the bottom have some javascript that basically makes sure that you can’t check off a sub-group if the parent is not checked. When you first check the parent, it selects all the children. For the example I just put the javascript in there, it uses and id attribute, so you can only have one of these per page. If you were using this globally, I’d recommend tagging the UL with a data attribute and moving the javascript into a global JS file.

Since a picture is worth a thousand words, here is an example of what it looks like working:

permissions-example

Let me know if you have any questions! Happy Friday.

PHP: Using Gearman for a MapReduce inspired workflow

Over the last few weeks we’ve been utilizing Gearman to help us do some realtime stream processing. In production, what we’ve basically been doing is reading messages off an Amazon Kinesis stream, creating jobs in Gearman for anything that’s computationally expensive, and then gathering up the processed data for a batched insert into Amazon Redshift on a Gearman job as well. Conceptually, this workflow is reasonably similar to how MapReduce works where a series of input jobs is transformed by “mappers” and then results are collected in a “reduce” step.

From a practical point of view, using Gearman like this offers some interesting benefits:

  • Adding additional “map” capacity is relatively straightforward since you can just add additional machines that connect to the Gearman server.
  • Developing and testing the “map” and “reduce” functionality is easy since nothing is shared and you can run the code directly, independently of Gearman.
  • In our experience so far, the Gearman server can handle a high volume of jobs/minute – we’ve pushed ~300/sec without a problem.
  • Since Gearman clients exist for dozens of languages, you could write different pieces of the system in whatever language fits best.

Overview

OK, so how does all of this actually work. For the purposes of a demonstration, lets assume you’ve been tasked with scraping the META keywords and descriptions from a few hundred thousand sites and counting up word frequencies across all the sites. Assuming you were doing this in straight PHP, you’d end up with code that looks something like this.

The problem is that since you’re making the requests sequentially, scraping a significant number of URLs is going to take an intractable amount of time. What we really want to do is fetch the URLs in parallel, extract the META keywords, and then combine all that data in a single data structure.

To keep the amount of code down, I used the Symfony2 Console component, Guzzle and Monolog to provide infastructure around the project. Walking through the files of interest:

  • GearmanCommand.php: Command to execute either the “node” or the “master” Gearman workers.
  • StartScrapeCommand.php: Command to create the Gearman jobs to start the scrapers
  • Master.php: The code to gather up all the extracted keywords and maintain a running count.
  • Node.php: Worker code to extract the meta keywords from a given URL

Setup

Taking this for a spin is straightforward enough. Fire up an Ubuntu EC2 and then run the following:

OK, now that everything is setup lets run the normal PHP implementation.

Looks like about 10-12 seconds to process 100 URLs. Not terrible but assuming linear growth that means processing 100,000 URLs would take almost 2.5 hours which is a bit painful. You can verify it worked by looking at the “bin/nogearman_keyword_results.json” file.

Now, lets look at the Gearman version. Running the Gearman version is straightforward, just run the following:

You’ll eventually get an output from the “master” when it finishes with the total elapsed time. It’ll probably come in somewhere around 15ish seconds again because we’re still just using a single process to fetch the URLs.

Party in parallel

But now here’s where things get interesting, we can start adding multiple “worker” processes to do some of the computation in parallel. In my experience, the easiest way to handle this is using Supervisor since it makes starting and stopping groups of processes easy and also handles collecting their output. Run the following to copy the config file, restart supervisor, and verify the workers are running:

And now, you’ll want to run “application.php setfive:gearman master” in one terminal and in another run “php setfive:start-scraper 100sites.txt” to kick off the jobs.

Boom! Much faster. We’re still only doing 100 URLs so the effect of processing in parallel isn’t that dramatic. Again, you can check out the results by looking at “bin/keyword_results.json”.

The effects of using multiple workers will be more apparent when you’ve got a larger number of URLs to scrape. Inside the “bin” directory there’s a file named “quantcast_site_lists.tar.gz” which has site lists of different sizes up to the full 1 million from Quantcast.

I ran a some tests on the lists using different numbers of workers and the results are below.

0 Workers 10 Workers 25 Workers
100 URLs 12 sec. 12 sec. 5 sec.
1000 URLs 170 sec. 34 sec. 33 sec.
5000 URLs 1174 sec. 195 sec. 183 sec.
10000 URLs 2743 sec. 445 sec. 424 sec.

One thing to note, is if you run:

And notice that “processUrl” has zero jobs but there’s a lot waiting for “countKeywords”, you’re actually saturating the “reducer” and adding additional worker nodes in Supervisor isn’t going to increase your speed. Testing on a m3.small, I was seeing this happen with 25 workers.

Another powerful feature of Gearman is that it makes running jobs on remote hosts really easy. To add a “remote” to the job server, you’d just need to start a second machine, update the IP address in Base.php, and user the same Supervisor config to start a group of workers. They’d automatically register to your Gearman server and start processing jobs.

Anyway, as always questions and comments appreciated and all the code is on GitHub.

AWS: Using Kinesis with PHP for real time stream processing

Over the last few weeks we’ve been working with one of our clients to build out a real time data processing application. At a high level, the system ingests page view data, processes it in real time, and then ingests it into a database backend. In terms of scale, the system would need to start off processing roughly 30,000 events per minute at peak with the capability to scale out to 100,000 events per minute fairly easily. In addition, we wanted the data to become available to query “reasonably quickly” so that we could iterate quickly on how we were processing data.

To kick things off, we began by surveying the available tools to ingest, process, and then ultimately query the data. On the datawarehouse side, we had already had some positive experiences with Amazon Redshift so it was a natural choice to keep using it moving forward. In terms of ingestion and processing, we decided to move forward with Kinesis and Gearman. The fully managed nature of Kinesis made it the most appealing choice and Gearman’s strong PHP support would let us develop workers in a language everyone was comfortable with.

Our final implementation is fairly straightforward. An Elastic Load Balancer handles all incoming HTTP requests which are routed to any number of front end machines. These servers don’t do any computation and fire of messages into a Kinesis stream. On the backend, we have a consumer per Kinesis stream shard that creates Gearman jobs for pre-processing as well as Redshift data ingestion. Although it’s conceptually simple, there’s a couple of “gotchas” that we ran into implementing this system:

Slow HTTP requests are a killer: The Kinesis API works entirely over HTTP so anytime you want to “put” something into the stream it’ll require a HTTP request. The problem with this is that if you’re making these requests in real time in a high traffic environment you run the risk of locking up your php-fpm workers if the network latency to Kinesis starts to increase. We saw this happen first hand, everything would be fine and then all of a sudden the latency across the ELB would skyrocket when the latency to Kinesis increased. To avoid this, you need to make the Kinesis request in the background.

SSL certificate verification is REALLY slow: Kinesis is only available over HTTPs so by default the PHP SDK (I assume others as well) will perform an SSL key verification every time you use a new client. If you’re making Kinesis requests inside your php-fpm workers that means you’ll be verifying SSL keys on every request which turns out to be really slow. You can disable this in the official SDK using the “curl.options” parameter and passing in “CURLOPT_SSL_VERIFYHOST” and “CURLOPT_SSL_VERIFYPEER”

There’s no “batch” add operation: Interestingly Apache Kafka, which Kinesis is based on, supports batch operations but unfortunately Kinesis doesn’t. You have to make an HTTP request for every message you’re adding to the stream. What this means is that even if you’re queuing the messages in the background, you’ll still need to loop through them all firing off HTTP requests

Your consumer needs to be fast: In your consumer, you’ll basically end up with code that looks like – https://gist.github.com/adatta02/842531b3fe93097ee030 Because Kinesis shard iterators are only valid for 5 minutes, you’ll need to be cognizant of how long the inner for loop takes to run. Each “getRecords” call can return a max of 10,000 records so you’ll need to be able to process 10k records in less than 5 minutes. Our solution for this was to offload all the actual processing to Gearman jobs.

Anyway, we’re still fairly new to using Kinesis so I’m sure we’ll learn more about using it as the system is in production. A few things have already been positive including that it makes testing new code locally easy since you can just “tap” into the stream, scaling up looks like it means just adding additional shards, and since its managed we’ve got one less thing to worry about.

As always, questions and comments welcome!

Symfony2: Using “request_matcher” for custom firewall rules

Last week we were looking to leverage a set of JSON API endpoints in a Symfony2 project to power a single page Javascript app. The way the API had been setup, the routes were all secured with an HTTP Basic Auth firewall matching on “/api”. This worked great for the mobile apps but for a Javascript app it would be awkward to have the user re-enter their credentials to authorize the basic auth firewall. What we really wanted to do was to leave everything “as is” but have Symfony use the normal cookie based firewall when we passed in a special “isOnlineApp” parameters on the URL.

Unfortunately, setting something like this up with the default “pattern” setting in your security.yml file isn’t possible. The “pattern” setting only matches on the route URL, not the parameters so there’s no way to have it selectively trigger when a parameter is present on a URL. So how do you do it? Well as it turns out, there’s a firewall configuration called “reuqest_matcher” which lets you “match” a firewall using a service. Just create a service that extends the RequestMatcherInterface, implment a “matches” function, and then add the class as a service.

Our code for the service ended up looking like:

And then the actual firewall configuration ends up being:

You don’t need a “pattern” setting anymore since the “matches” function supersedes it. Anyway, let me know if you have any questions!