Ramblings on code, startups, and everything in between
Over the last few weeks we’ve been utilizing Gearman to help us do some realtime stream processing. In production, what we’ve basically been doing is reading messages off an Amazon Kinesis stream, creating jobs in Gearman for anything that’s computationally expensive, and then gathering up the processed data for a batched insert into Amazon Redshift on a Gearman job as well. Conceptually, this workflow is reasonably similar to how MapReduce works where a series of input jobs is transformed by “mappers” and then results are collected in a “reduce” step.
From a practical point of view, using Gearman like this offers some interesting benefits:
OK, so how does all of this actually work? For the purposes of a demonstration, let’s assume you’ve been tasked with scraping the META keywords and descriptions from a few hundred thousand sites and counting up word frequencies across all the sites. Assuming you were doing this in straight PHP, you’d end up with code that looks something like this.
The problem is that since you’re making the requests sequentially, scraping a significant number of URLs is going to take an intractable amount of time. What we really want to do is fetch the URLs in parallel, extract the META keywords, and then combine all that data in a single data structure.
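To make the “extract, then combine” split concrete, here’s a minimal sketch of the two steps as plain PHP functions. The function names are hypothetical (they aren’t taken from the repository), and the regex-based parsing is a simplification that a real scraper would probably replace with a proper HTML parser.

```php
<?php
// Hypothetical helpers for illustration; not the code from the original repository.

/** "Map" step: count the words in one page's META keywords and description. */
function extractKeywordCounts(string $html): array
{
    $counts = [];
    preg_match_all('/<meta\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        // Only look at <meta name="keywords"> and <meta name="description">.
        if (!preg_match('/\bname\s*=\s*["\'](keywords|description)["\']/i', $tag)) {
            continue;
        }
        if (!preg_match('/\bcontent\s*=\s*(["\'])(.*?)\1/is', $tag, $m)) {
            continue;
        }
        $words = preg_split('/[^a-z0-9]+/', strtolower($m[2]), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    return $counts;
}

/** "Reduce" step: fold one page's counts into the running totals. */
function mergeKeywordCounts(array $totals, array $pageCounts): array
{
    foreach ($pageCounts as $word => $count) {
        $totals[$word] = ($totals[$word] ?? 0) + $count;
    }
    return $totals;
}

// Made-up sample pages standing in for fetched HTML.
$pages = [
    '<meta name="keywords" content="php, gearman">',
    '<meta name="description" content="Gearman workers in PHP">',
];
$totals = array_reduce(array_map('extractKeywordCounts', $pages), 'mergeKeywordCounts', []);
```

Because `extractKeywordCounts` only depends on a single page, it’s the piece that can be farmed out to parallel workers, while `mergeKeywordCounts` is the piece that has to see every result.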
Taking this for a spin is straightforward enough. Fire up an Ubuntu EC2 instance and then run the following:
OK, now that everything is set up, let’s run the normal PHP implementation.
Looks like about 10-12 seconds to process 100 URLs. Not terrible, but assuming linear growth that means processing 100,000 URLs would take roughly 3 hours, which is a bit painful. You can verify it worked by looking at the “bin/nogearman_keyword_results.json” file.
Now, let’s look at the Gearman version. Running it is straightforward; just run the following:
You’ll eventually get an output from the “master” when it finishes with the total elapsed time. It’ll probably come in somewhere around 15ish seconds again because we’re still just using a single process to fetch the URLs.
But here’s where things get interesting: we can start adding multiple “worker” processes to do some of the computation in parallel. In my experience, the easiest way to handle this is using Supervisor, since it makes starting and stopping groups of processes easy and also handles collecting their output. Run the following to copy the config file, restart Supervisor, and verify the workers are running:
And now, you’ll want to run “application.php setfive:gearman master” in one terminal and in another run “php setfive:start-scraper 100sites.txt” to kick off the jobs.
Boom! Much faster. We’re still only doing 100 URLs so the effect of processing in parallel isn’t that dramatic. Again, you can check out the results by looking at “bin/keyword_results.json”.
The effects of using multiple workers will be more apparent when you’ve got a larger number of URLs to scrape. Inside the “bin” directory there’s a file named “quantcast_site_lists.tar.gz” which has site lists of different sizes up to the full 1 million from Quantcast.
I ran some tests on the lists using different numbers of workers and the results are below.
| URLs | 0 Workers | 10 Workers | 25 Workers |
|---|---|---|---|
| 100 | 12 sec. | 12 sec. | 5 sec. |
| 1000 | 170 sec. | 34 sec. | 33 sec. |
| 5000 | 1174 sec. | 195 sec. | 183 sec. |
| 10000 | 2743 sec. | 445 sec. | 424 sec. |
One thing to note is that if you run:
and notice that “processUrl” has zero jobs but there are a lot waiting for “countKeywords”, you’re actually saturating the “reducer”, and adding additional worker nodes in Supervisor isn’t going to increase your speed. Testing on an m3.small, I was seeing this happen with 25 workers.
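If you’d rather check for that condition from code, here’s a rough sketch. It assumes the status listing uses the layout of Gearman’s administrative “status” command (one line per function: name, total jobs, running jobs, available workers, separated by tabs, with a lone “.” at the end). The helper names and the sample numbers are made up for illustration.

```php
<?php
// Hypothetical helpers; the sample status text at the bottom is invented.

/** Parses status text into [functionName => ['total' => ..., 'running' => ..., 'workers' => ...]]. */
function parseGearmanStatus(string $status): array
{
    $functions = [];
    foreach (preg_split('/\R/', trim($status)) as $line) {
        $fields = explode("\t", $line);
        if (count($fields) !== 4) {
            continue; // skips the terminating "." line and anything malformed
        }
        [$name, $total, $running, $workers] = $fields;
        $functions[$name] = [
            'total'   => (int) $total,
            'running' => (int) $running,
            'workers' => (int) $workers,
        ];
    }
    return $functions;
}

/** True when the map queue is empty but reduce jobs are still piling up. */
function isReducerSaturated(array $functions, string $mapFn = 'processUrl', string $reduceFn = 'countKeywords'): bool
{
    $mapQueued     = $functions[$mapFn]['total'] ?? 0;
    $reduceWaiting = ($functions[$reduceFn]['total'] ?? 0) - ($functions[$reduceFn]['running'] ?? 0);
    return $mapQueued === 0 && $reduceWaiting > 0;
}

$sample = "processUrl\t0\t0\t25\ncountKeywords\t40\t1\t1\n.\n";
$parsed = parseGearmanStatus($sample);
$saturated = isReducerSaturated($parsed);
```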
Another powerful feature of Gearman is that it makes running jobs on remote hosts really easy. To add a “remote” to the job server, you’d just need to start a second machine, update the IP address in Base.php, and use the same Supervisor config to start a group of workers. They’d automatically register with your Gearman server and start processing jobs.
Anyway, as always questions and comments appreciated and all the code is on GitHub.
Over the last few weeks we’ve been working with one of our clients to build out a real time data processing application. At a high level, the system ingests page view data, processes it in real time, and then ingests it into a database backend. In terms of scale, the system would need to start off processing roughly 30,000 events per minute at peak with the capability to scale out to 100,000 events per minute fairly easily. In addition, we wanted the data to become available to query “reasonably quickly” so that we could iterate quickly on how we were processing data.
To kick things off, we began by surveying the available tools to ingest, process, and then ultimately query the data. On the data warehouse side, we’d already had some positive experiences with Amazon Redshift so it was a natural choice to keep using it moving forward. In terms of ingestion and processing, we decided to move forward with Kinesis and Gearman. The fully managed nature of Kinesis made it the most appealing choice and Gearman’s strong PHP support would let us develop workers in a language everyone was comfortable with.
Our final implementation is fairly straightforward. An Elastic Load Balancer handles all incoming HTTP requests, which are routed to any number of front end machines. These servers don’t do any computation and just fire off messages into a Kinesis stream. On the backend, we have a consumer per Kinesis stream shard that creates Gearman jobs for pre-processing as well as Redshift data ingestion. Although it’s conceptually simple, there are a couple of “gotchas” that we ran into implementing this system:
Slow HTTP requests are a killer: The Kinesis API works entirely over HTTP, so any time you want to “put” something into the stream it’ll require an HTTP request. The problem with this is that if you’re making these requests in real time in a high traffic environment, you run the risk of locking up your php-fpm workers if the network latency to Kinesis starts to increase. We saw this happen first hand: everything would be fine and then all of a sudden the latency across the ELB would skyrocket when the latency to Kinesis increased. To avoid this, you need to make the Kinesis request in the background.
SSL certificate verification is REALLY slow: Kinesis is only available over HTTPS, so by default the PHP SDK (I assume others as well) will perform an SSL certificate verification every time you use a new client. If you’re making Kinesis requests inside your php-fpm workers, that means you’ll be verifying certificates on every request, which turns out to be really slow. You can disable this in the official SDK using the “curl.options” parameter and passing in “CURLOPT_SSL_VERIFYHOST” and “CURLOPT_SSL_VERIFYPEER”.
There’s no “batch” add operation: Interestingly Apache Kafka, which Kinesis closely resembles, supports batch operations but unfortunately Kinesis doesn’t. You have to make an HTTP request for every message you’re adding to the stream. What this means is that even if you’re queuing the messages in the background, you’ll still need to loop through them all, firing off HTTP requests.
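As a rough illustration of that background loop, here’s a minimal sketch. The `$putRecord` callable is a hypothetical wrapper around whatever Kinesis client you’re using (it isn’t an SDK function), and the function name is made up as well. The only point is that each queued message costs its own call, so failures need to be tracked per message.

```php
<?php
/**
 * Sends each queued message with its own call to $putRecord and returns the
 * messages that failed so the caller can retry them later.
 *
 * $putRecord is a hypothetical callable: function (string $message): void,
 * expected to throw if the underlying HTTP request fails.
 */
function flushQueue(array $messages, callable $putRecord): array
{
    $failed = [];
    foreach ($messages as $message) {
        try {
            $putRecord($message); // one HTTP request per message
        } catch (Exception $e) {
            $failed[] = $message; // keep it around for the next flush
        }
    }
    return $failed;
}
```

Running something like this from a background worker keeps the slow requests out of php-fpm, which is the whole point of the first gotcha.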
Your consumer needs to be fast: In your consumer, you’ll basically end up with code that looks like – https://gist.github.com/adatta02/842531b3fe93097ee030 Because Kinesis shard iterators are only valid for 5 minutes, you’ll need to be cognizant of how long the inner for loop takes to run. Each “getRecords” call can return a max of 10,000 records so you’ll need to be able to process 10k records in less than 5 minutes. Our solution for this was to offload all the actual processing to Gearman jobs.
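Here’s a minimal sketch of that consumer shape with the time budget made explicit. It isn’t the code from the gist: `$getRecords` and `$dispatch` are hypothetical callables standing in for the Kinesis client call and the Gearman job submission, and the array keys (`records`, `next`) are assumptions for the example.

```php
<?php
/**
 * Walks a shard, handing each record to $dispatch, and returns how many records it handled.
 *
 * $getRecords: hypothetical callable, function (string $iterator): array
 *              returning ['records' => array, 'next' => ?string].
 * $dispatch:   hypothetical callable, function ($record): void. It should only
 *              queue work (for example as a background job), not do the work itself.
 */
function consumeShard(?string $iterator, callable $getRecords, callable $dispatch, float $budgetSeconds = 300.0): int
{
    $handled = 0;
    while ($iterator !== null) {
        $batch = $getRecords($iterator);
        $receivedAt = microtime(true); // the next iterator's clock starts about here

        foreach ($batch['records'] as $record) {
            $dispatch($record);
            $handled++;
        }

        // If the inner loop outlived the iterator, the next call would fail anyway,
        // so fail loudly with a clearer message.
        if (microtime(true) - $receivedAt > $budgetSeconds) {
            throw new RuntimeException('Batch took longer than the shard iterator is valid');
        }
        $iterator = $batch['next'];
    }
    return $handled;
}
```

A real consumer would also want to sleep briefly when a batch comes back empty; that’s left out here to keep the sketch short.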
Anyway, we’re still fairly new to using Kinesis so I’m sure we’ll learn more about using it as the system is in production. A few things have already been positive, including that it makes testing new code locally easy since you can just “tap” into the stream, scaling up looks like it means just adding additional shards, and since it’s managed we’ve got one less thing to worry about.
As always, questions and comments welcome!
Unfortunately, setting something like this up with the default “pattern” setting in your security.yml file isn’t possible. The “pattern” setting only matches on the route URL, not the parameters, so there’s no way to have it selectively trigger when a parameter is present on a URL. So how do you do it? Well, as it turns out, there’s a firewall configuration option called “request_matcher” which lets you “match” a firewall using a service. Just create a class that implements RequestMatcherInterface, implement a “matches” function, and then register the class as a service.
Our code for the service ended up looking like:
And then the actual firewall configuration ends up being:
You don’t need a “pattern” setting anymore since the “matches” function supersedes it. Anyway, let me know if you have any questions!
At the beginning of the summer we decided to redo our website. The design on the old site was looking a bit dated and more importantly the content didn’t really reflect the types of projects we’re looking to work on. From a technology perspective, our old site was built on WordPress with the explicit goal of being able to share the same WordPress theme as our blog. The two sites did in fact share the same theme but looking back, we never updated the main site to really make it “worth it”. With that experience in mind, we started looking around for what we could use to build setfive.com.
Stepping back and looking at our requirements, we really don’t need a CMS. I’d argue this holds true for most website projects when there are fewer than 20 pages, everyone who might edit it is technical, and the content isn’t updated frequently. Specifically looking at some major WordPress features, we don’t need the WYSIWYG editor, plugin ecosystem, media handling, or theming capabilities. So what capabilities do we need?
There are certainly more capabilities static websites could need but I think this is a decent list for the “general” case and it captures our requirements. After doing some research, it looks like there are currently a few options that would satisfy these requirements:
I ultimately chose Silex because our team has deep PHP experience, especially with Symfony2. Because of that we’d be right at home with the Routing component and of course Twig for templating.
OK, so how do you actually get this to work? I ran across Jonathan Petitcolas’s Building a static website with Silex post and used it as a guide. Here are the actual commands you’d need to get this all set up though:
Now, you just need to create a file named “index.php” which contains:
And finally, in the “views” directory add a file called “index.html.twig” which contains some content. If you have a web server set up, just point a vhost at the “web” directory, load it, and you should see the content of your index file.
If you don’t have a web server set up, here’s a nifty trick via Gonzalo Ayuso: create a file in the “web” directory named “router.php” containing:
And now, you can start the built-in PHP 5.4+ server by running:
You can load your Silex app by loading http://localhost:8888 in your browser.
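For reference, a router script for the built-in server generally follows the pattern sketched below: let the server handle requests for real files, and send everything else to the front controller. This is a sketch of that pattern rather than the original listing. The `isStaticFile` helper is a hypothetical name, the type hints need PHP 7 or newer, and it assumes index.php sits next to router.php in the “web” directory.

```php
<?php
// router.php: sketch of a router script for PHP's built-in web server.

/** Returns true when the request path maps to an existing file under $docRoot. */
function isStaticFile(string $docRoot, string $requestUri): bool
{
    // Drop the query string so "/css/site.css?v=2" resolves to the file on disk.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path) || $path === '' || $path === '/') {
        return false;
    }
    $root = realpath($docRoot);
    $candidate = realpath($docRoot . $path); // false when the file doesn't exist
    return $root !== false
        && $candidate !== false
        && strpos($candidate, $root . DIRECTORY_SEPARATOR) === 0 // stay inside the doc root
        && is_file($candidate);
}

if (php_sapi_name() === 'cli-server') {
    if (isStaticFile(__DIR__, $_SERVER['REQUEST_URI'])) {
        return false; // tells the built-in server to serve the file itself
    }
    require __DIR__ . '/index.php'; // everything else goes to the Silex app
}
```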
Anyway, as always questions and comments are welcome!
As far as type systems go, PHP’s is pretty schizophrenic. You’ve got primitive types, like strings and booleans, the ubiquitous “array” type, and then user defined classes. Most of the time, the type system is invisible since it barely enforces anything. Especially for basic types and the standard library, you can almost always use strings, booleans, and integers interchangeably without much complaining from the interpreter. Where things go sideways is when you start using user defined types, especially with type hinting.
Imagine we’ve got the following setup:
If you run that in a terminal, PHP will throw the following error:
PHP Catchable fatal error: Argument 1 passed to sayHello() must be an instance of Dog, instance of Pet given, called in /home/ashish/Downloads/dog.php on line 19 and defined in /home/ashish/Downloads/dog.php on line 14
Because even though every “Dog” is by definition a superset of the “Pet” class, PHP doesn’t see it that way. And now, our original problem. In most other object oriented languages, you’d be able to simply typecast the instance of Pet to a Dog and then call the function as expected. Unfortunately, PHP doesn’t natively support typecasting between classes so we’re stuck looking for a crazy workaround. Enter Reflection. PHP’s reflection library lets you do all sorts of nefarious things, like manipulating private properties and retrieving the source for an arbitrary object.
So how do you use it to do a bootleg typecast? It’s actually pretty straightforward:
The “copyShimmedObject” function is the money maker. It basically pulls the private properties out of the “from” object, makes each one accessible, and then sets them on the “to” object. If you run the sample you’ll get the expected output instead of the error above:
Hello: Fluffy of destroyer of worlds
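Here’s a minimal, self-contained sketch of the whole trick. The function name comes from the description above, but the Pet and Dog classes, their property names, and the shape of sayHello() are assumptions made up for this example rather than the original listing.

```php
<?php
// Assumed class layout for illustration: Dog extends Pet and adds nothing.
class Pet
{
    private $name;
    private $title;

    public function __construct($name = null, $title = null)
    {
        $this->name = $name;
        $this->title = $title;
    }

    public function getName() { return $this->name; }
    public function getTitle() { return $this->title; }
}

class Dog extends Pet
{
}

function sayHello(Dog $dog)
{
    return 'Hello: ' . $dog->getName() . ' of ' . $dog->getTitle();
}

/** Copies every property declared on $from's class onto $to, private ones included. */
function copyShimmedObject($from, $to)
{
    $reflection = new ReflectionObject($from);
    foreach ($reflection->getProperties() as $property) {
        // Required before PHP 8.1 to touch private properties; a no-op afterwards.
        $property->setAccessible(true);
        $property->setValue($to, $property->getValue($from));
    }
    return $to;
}

$pet = new Pet('Fluffy', 'destroyer of worlds');
$dog = copyShimmedObject($pet, new Dog());
echo sayHello($dog), PHP_EOL; // Hello: Fluffy of destroyer of worlds
```

Note that this only works cleanly because Dog inherits Pet’s properties; copying between unrelated classes would need extra checks.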