Symfony2 and Gearman: Parallel Processing and Background Processing

On a few of our projects we need either to queue items to be processed in the background or to have a single request process something in parallel. Generally we use Gearman and the GearmanBundle. Let me explain a few situations where we’ve found it handy to have Gearman around.

Background Processing

Often we’ll need to do something that takes a bit more time to process, anything from sending out a couple thousand push notifications to resizing several images. For this example let’s use sending push notifications. You could have a person sit around as each notification is sent out and hope the page doesn’t time out; however, beyond a certain number of notifications this approach will fail, not to mention that it makes for a terrible user experience. Enter Gearman. With Gearman you can queue the event that a user has triggered a bunch of notifications that need to be processed and sent.

What we’ve done above is send a job to the Gearman server to be processed in the background, which means we don’t have to wait for it to finish. At this point all we’ve done is queue a job on the Gearman server; Gearman itself doesn’t know how to run the actual job. For that we create a ‘worker’ which reads jobs and processes them:

The worker will consume the job and then process it as it sees fit. In this case we just loop over each user ID and send them a notification.

Parallel Processing

On one of our applications, users can associate their account with multiple databases. From there we go through each database and create different reports. On some of the application screens we let users poll each of their databases, and we aggregate the data to create a real-time report. The problem with doing this synchronously is that you have to go to each database one by one, meaning if you have 10 databases and each one takes 1 second to get the data from, the user is waiting around for at least ten seconds; this doesn’t go well when you have 20 databases and so on. Instead, we use Gearman to farm out the task of going to each database and pulling the data. From there, we have the request process total up all the aggregated data and display it. Now instead of waiting 10 seconds to get through all ten databases, we farm out the work to 10 workers, wait 1 second and then can do any final processing and show it to the user. In the example below for brevity we’ve just done the totaling in a controller.

What we’ve done here is create a job for each connection. This time we add them as tasks, which means we’ll wait until they’ve completed. The worker side is similar to the background worker, except you return some data, i.e. `return json_encode(array('total' => 50000));` at the end of the function.

What this allows us to do is farm out the work in parallel to all the databases. Each worker runs queries on its database, computes some local data and passes it back. From there you can add it all together (if you want) and then display it to the user. With the jobs running in parallel, the number of databases you can process is no longer limited by your request, but by how many workers you have running in the background. The beauty of Gearman is that the workers don’t need to live on the same machine, so you could have a cluster of machines acting as ‘workers’ and be able to process more database connections in this scenario.
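To make the timing argument concrete, here is a minimal sketch of the fan-out-and-aggregate pattern. It is not Gearman code: it uses in-process JavaScript promises and a timer in place of real workers and databases, and every name in it (fetchTotal, aggregateInParallel, the sample connections) is made up for illustration.

```javascript
// Hypothetical stand-in for a worker that queries one database.
// A real setup would hand this off to a Gearman worker instead of a timer.
function fetchTotal(connection, delayMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ name: connection.name, total: connection.rows }), delayMs);
  });
}

// One database at a time: the wait is roughly (number of databases) x delayMs.
async function aggregateSequentially(connections, delayMs) {
  let total = 0;
  for (const connection of connections) {
    const result = await fetchTotal(connection, delayMs);
    total += result.total;
  }
  return total;
}

// Fan out, then wait for every result: the wait is roughly a single delayMs.
async function aggregateInParallel(connections, delayMs) {
  const results = await Promise.all(
    connections.map((connection) => fetchTotal(connection, delayMs))
  );
  return results.reduce((sum, result) => sum + result.total, 0);
}

// Made-up sample data: 10 databases reporting 100 rows each.
const connections = Array.from({ length: 10 }, (_, i) => ({ name: `db${i}`, rows: 100 }));

// Runs both strategies and reports the totals and elapsed time of each.
async function compare(delayMs) {
  let start = Date.now();
  const sequentialTotal = await aggregateSequentially(connections, delayMs);
  const sequentialMs = Date.now() - start;

  start = Date.now();
  const parallelTotal = await aggregateInParallel(connections, delayMs);
  const parallelMs = Date.now() - start;

  return { sequentialTotal, parallelTotal, sequentialMs, parallelMs };
}
```

Calling compare(1000) should show the sequential run taking around ten seconds and the parallel run around one, with the same total either way. The difference with Gearman is that the fan-out crosses process and machine boundaries, which is what lets the work scale with the number of workers.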

Anyways, Gearman has really made parallel processing and farming out work much easier. As the workers are also written in PHP, it is very easy to reuse code between the frontend and the workers. Often, we’ll start a new report without Gearman; getting the logic right and fixing bugs is easier in a single request without the worker. After we’re happy with how the code works, we’ll move the code we wrote into the worker and have it just return the final result.

Good luck! Feel free to drop us a line if you need any help.

JavaScript: Using PhantomJS-node with Deferreds

Earlier this week, a buddy of mine reached out asking for a good solution to programmatically taking screenshots of a few thousand URLs. For whatever reason, this question seems to come up every so often, so I pointed him towards PhantomJS and figured he’d be on his way. Wrong. Not one to pass up free beer and the opportunity to learn something, I agreed to write up the script to generate screenshots from a list of URLs.

Looking at PhantomJS, it seems relatively straightforward but it’s clear you’d really need something “else” to orchestrate this entire process. After some poking around, NodeJS, everyone’s favorite hipster runtime, seemed to be the obvious choice. There’s a handful of node modules that basically “bridge” node with phantom and allow a node script to asynchronously manipulate a PhantomJS instance. Based solely on the funny description I decided to run with phantomjs-node and was off to the races.

Getting everything set up was straightforward enough, but as I started looking at the phantomjs-node examples I realized this was a one-way trip to callback soup. I’ve been doing some PhoneGap work recently, and using jQuery’s Deferreds has significantly helped keep the project from becoming a mess of callbacks. On the NodeJS side, it looks like there are two functionally equivalent implementations, but I decided to run with Q since the “wrapper” function names are shorter.

The Code

Anyway, the main problem we’re trying to address is that with multiple nested callbacks code becomes particularly difficult to follow. It’s hard to read, hard to trace control flow, and obviously hard to debug. Take for example, the phantomjs-node example:

It’s already THREE callbacks deep and all it’s done is initialize PhantomJS and load a page. Imagine layering on a few more asynchronous operations, doing all of this in a loop, and then some post-processing. Enter Deferreds. How To Node has an eloquent explanation of what deferreds are and how Node implements them, but in a nutshell they’re useful for making asynchronous code easier to follow.

The main issue I ran into using Q was that the “Q.ninvoke” and “Q.npost” wrapper functions kept causing exceptions when used with phantomjs-node. What I ended up doing instead was creating my own Deferreds in separate functions and then resolving them inside a normal callback.

My PhantomJS-node code ended up looking like:

It’s without a doubt easier to follow and would also make it much easier to do more complicated PhantomJS related tasks.
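For readers who want to see the shape of that pattern (one function per asynchronous step, each creating a Deferred and resolving it inside an ordinary callback), here is a rough sketch. It is not the original script: the small defer() helper is built on native Promises to mirror the shape of Q.defer(), and fakeBrowser is an invented callback-style stand-in, not the phantomjs-node API.

```javascript
// Minimal Deferred built on native Promises, mirroring the shape of Q.defer().
function defer() {
  const deferred = {};
  deferred.promise = new Promise((resolve, reject) => {
    deferred.resolve = resolve;
    deferred.reject = reject;
  });
  return deferred;
}

// Hypothetical callback-style API standing in for a browser bridge.
// These names are invented for illustration; they are not phantomjs-node calls.
const fakeBrowser = {
  create(callback) {
    setTimeout(() => callback({ id: 1 }), 0);
  },
  openPage(instance, url, callback) {
    setTimeout(() => callback(url.startsWith('http') ? 'success' : 'fail'), 0);
  },
};

// Each asynchronous step lives in its own function, returns a promise,
// and resolves its Deferred inside a normal callback.
function createInstance() {
  const deferred = defer();
  fakeBrowser.create((instance) => deferred.resolve(instance));
  return deferred.promise;
}

function openPage(instance, url) {
  const deferred = defer();
  fakeBrowser.openPage(instance, url, (status) => {
    if (status === 'success') {
      deferred.resolve({ url, status });
    } else {
      deferred.reject(new Error(`Could not open ${url}`));
    }
  });
  return deferred.promise;
}

// The control flow now reads top to bottom instead of nesting.
function visit(url) {
  return createInstance().then((instance) => openPage(instance, url));
}
```

Adding another step, such as rendering a screenshot, would mean writing one more small function and one more .then() in the chain rather than another level of nesting. Failures from any step end up in a single rejection handler at the end of the chain.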

So how did screenshotting a few thousand domains go? Well there’s a story about that as well…

Drupal 7 body content going blank? An obscure PCRE configuration setting may be the culprit.

Recently, one of our clients noticed that when they added text to the body field of a node that already had a lot of content, the changes would appear to save on the back-end edit screen, but the body content of the page would disappear on the front end without a trace and with no errors. At first, we thought it was a character or word count restriction placed on the body field, or that a text-format filter/HTML combination was throwing things off. After checking a bunch of settings on the admin screen and testing different combinations of words, characters and text-format filters, we came up empty-handed.

Turns out it was an obscure setting within sites/default/settings.php. If you open this file and search for ‘pcre.backtrack_limit’ you’ll find a surprisingly accurate description of the problem at hand:

/**
* If you encounter a situation where users post a large amount of text, and
* the result is stripped out upon viewing but can still be edited, Drupal’s
* output filter may not have sufficient memory to process it. If you
* experience this issue, you may wish to uncomment the following two lines
* and increase the limits of these variables. For more information, see
* http://php.net/manual/en/pcre.configuration.php.
*/

# ini_set('pcre.backtrack_limit', 200000);
# ini_set('pcre.recursion_limit', 200000);

So once you uncomment these two lines and increase the limits, you’ll find that the body content reappears on the front end.
Since everyone’s server setup is different, you’ll have to experiment with what values work best for you. Here’s a link to the php.net manual for this configuration setting: http://php.net/manual/en/pcre.configuration.php.

Hope this saves you some time and frustration!

Bitcoin: One vulnerability, two interesting questions

Over the last two weeks, there have been two high-profile negative Bitcoin incidents. First up was Mt. Gox announcing that they were temporarily halting withdrawals, and then soon after Silk Road 2.0 announced that they had been hacked and ~$2 million of BTC had been stolen. In both situations, the sites are blaming “transaction malleability”, supposedly a well-known Bitcoin exploit, as the root cause of the issues. Predictably, most of the commentary surrounding both of these incidents has been that they’re in fact cover-ups for the site admins stealing the “lost” bitcoin. Regardless of what turns out to be true, both incidents are raising some interesting questions about Bitcoin.

As I understand it, the “transaction malleability” vulnerability is an implementation-specific issue that’s already been fixed in the “official” bitcoin client. This is directly contradictory to what Mt. Gox announced, and one of the lead Bitcoin developers actually went as far as calling out Mt. Gox in “Why Mt. Gox is full of shit”. It isn’t clear if Mt. Gox is being intentionally dishonest, but this spat does raise an interesting issue of trusting the software that you’re using. Looking at the software we use on a daily basis, there’s a remarkable lack of transparency into how systems are built, if they’ve been audited, and if they’re composed of independently verifiable open source components. From the software that switches trains on tracks to the code that powers your cell phones, we generally don’t really know how the sausage was ultimately made. In general, things seem to work “OK” without consumers knowing these details, but for people to be confident in Bitcoin payment systems they’ll ultimately demand transparency into the underlying implementations.

Another interesting point surfaced by this issue is the irreversibility of Bitcoin transactions. The Silk Road 2.0 announcement really highlights this, since they’re basically pleading with whoever stole the coins to “give them back”. It’s pretty clear that the inability to roll back transactions is going to make combating Bitcoin fraud a herculean task as the volume of transactions grows. Without a mechanism to “undo” a transaction, the majority of fraud prevention will have to rely on blocking transactions up front as opposed to mediating them after the fact. There are certainly benefits to not being able to reverse transactions, but Bitcoin will definitely need a strategy to combat issues like this.

Anyway, I’m still bullish on Bitcoin. The community has shown that it’s resilient, and overall it’s definitely better to work out the kinks with $2 million instead of $200 million at stake. It looks like Mt. Gox is close to resuming normal activity and Silk Road 2.0 has recently announced that it’ll reimburse coins to everyone that was affected by the hack. Now if only the price would get back to $1000/coin…

Big Data: Amazon Redshift vs. Hive

In the last few months there’s been a handful of blog posts basically themed “Redshift vs. Hive”. Companies from Airbnb to FlyData have been broadcasting their success in migrating from Hive to Redshift in terms of both performance and cost. Unfortunately, a lot of casual observers have interpreted these posts to mean that Redshift is a “silver bullet” in the big data space. For some background, Hive is an abstraction layer that executes MapReduce jobs using Hadoop across data stored in HDFS. Amazon’s Redshift is a managed “petabyte scale” data warehouse solution that provides managed access to a ParAccel cluster and exposes a SQL interface that’s roughly similar to PostgreSQL. So where does that leave us?

From the outside, Hive and Redshift look oddly similar. They both promise “petabyte” scale, linear scalability, and expose an SQL’ish query syntax. On top of that, if you squint, they’re both available as Amazon AWS managed services through Elastic MapReduce and of course Redshift. Unfortunately, that’s really where the similarities end, which makes the “Hive vs. Redshift” comparisons along the lines of “apples to oranges”. Looking at Hive, its defining characteristic is that it runs across Hadoop and works on data stored in HDFS. Removing the acronym soup, that basically means that Hive runs MapReduce jobs across a bunch of text files that are stored in a distributed file system (HDFS). In comparison, Redshift uses a data model similar to PostgreSQL, so data is structured in terms of rows and tables, with sort and distribution keys filling the role that indexes play in a traditional database.

OK so who cares?

Well, therein lies the rub that everyone seems to be missing. Hadoop, and by extension Hive (and Pig), are really good at processing text files. So imagine you have 10 million x 1MB XML documents or 100GB worth of nginx logs: this would be a perfect use case for Hive. All you would have to do is push them into HDFS or S3, write a RegEx to extract your data and then query away. Need to add another 2 million documents or 20GB of logs? No problem, just get them into HDFS and you’re good to go.
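To give a feel for the “write a RegEx” step, here is a small sketch of a pattern that pulls fields out of a line in nginx’s default “combined” access log format. It is plain JavaScript rather than a Hive table definition, and the function name, field names and sample line are all made up for illustration; a custom log format would need a different pattern.

```javascript
// Pattern for nginx's default "combined" access log format:
// remote_addr - remote_user [time_local] "request" status bytes "referer" "user_agent"
const COMBINED_LOG = /^(\S+) - (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$/;

// Returns the extracted fields, or null when the line does not match.
function parseLogLine(line) {
  const match = COMBINED_LOG.exec(line);
  if (match === null) {
    return null;
  }
  const [, remoteAddr, remoteUser, timeLocal, request, status, bytesSent, referer, userAgent] = match;
  return {
    remoteAddr,
    remoteUser,
    timeLocal,
    request,
    status: Number(status),
    bytesSent: Number(bytesSent),
    referer,
    userAgent,
  };
}

// Made-up sample line using a documentation-range IP address.
const sampleLine =
  '203.0.113.7 - - [10/Feb/2014:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"';

const parsed = parseLogLine(sampleLine);
```

The point is that the raw text stays untouched in storage and the structure is imposed at read time, so new log files only need to be dropped in alongside the old ones.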

Could you do this with Redshift? Sure, but you’d need to pre-process 10 million XML documents and 100GB of logs to extract the appropriate fields, and then create CSV files or SQL INSERT statements to load into Redshift. Given the available options, you’re probably going to end up using Hadoop to do this anyway.

Where Redshift is really going to excel is in situations where your data is basically already relational and you have a clear path to actually get it into your cluster. For example, if you were running three x 15GB MySQL databases with unique, but related data, you’d be able to regularly pull that data into Redshift and then ad-hoc query it with regular SQL. In addition, since the data is already structured you’d be able to use the existing format to create keys in Redshift to improve performance.

Hammers, screws, etc

When it comes down to it, this is the old “right tool for the right job” aphorism. As an organization, you’ll have to evaluate how your data is structured, the types of queries you’re interested in running, and what level of abstraction you’re comfortable with. What’s definitely true is that “enterprise” data warehousing is being commoditized and the “old guard” had better innovate or die.