One of the nice things about nodejs is that since the majority of its libraries are asynchronous it boasts strong support for concurrently performing IO heavy workloads. Even though node is single threaded the event loop is able to concurrently progress separate operations because of the asynchronous style of its libraries. A canonical example would something like fetching 10 web pages and extracting all the links from the fetched HTML. This fits into node’s computational model nicely since the most time consuming part of an HTTP request is waiting around for the network during which node can use the CPU for something else. For the sake of discussion, let’s consider this sample implementation:
Request debugging is enabled so you’ll see that node starts fetching all the URLs at the same time and then the various events fire at different times for each URL:
So we’ve demonstrated that node will concurrently “do” several things at the same time but what happens if computationally intensive code is tying up the event loop? As a concrete example, imagine doing something like compressing the results of the HTTP request. For our purposes we’ll just throw in a while(1) so it’s easier to see what’s going on:
If you run the script you’ll notice it takes much longer to finish since we’ve now introduced a while() loop that causes each URL to take at least 5 seconds to be processed:
And now back to the original problem, how can we fetch the URLs in parallel so that our script completes in around 5 seconds? It turns out it’s possible to do this with node with the child_process module. Child_process basically lets you fire up a second nodejs instance and use IPC to pass messages between the parent and it’s child. We’ll need to move a couple of things around to get this to work and the implementation ends up looking like:
What’s happening now is that we’re launching a child process for each URL we want to process, passing a message with the target URL, and then passing the links back to the parent. And then running along with a timer results in:
It isn’t exactly 5 seconds since there’s a non-trivial amount of time required to start each of the child processes but it’s around what you’d expect. So there you have it, we’ve successfully demonstrated how you can achieve parallelism with nodejs.
Earlier this week, a buddy of mine reached out asking for a good solution to programmatically taking screenshots of a few thousand URLs. For whatever reason, this question seems to come up ever so often so I pointed him towards PhantomJS and figured he’d be on his way. Wrong. Not one to pass up free beer and the opportunity to learn something I agreed to write up the script to generate screenshots from a list of URLs.
Looking at PhantomJS, it seems relatively straightforward but it’s clear you’d really need something “else” to orchestrate this entire process. After some poking around, NodeJS, everyone’s favorite hipster runtime, seemed to be the obvious choice. There’s a handful of node modules that basically “bridge” node with phantom and allow a node script to asynchronously manipulate a PhantomJS instance. Based solely on the funny description I decided to run with phantomjs-node and was off to the races.
Getting everything setup was straightforward enough but then as I started looking at the phantomjs-node examples I started realizing this was a one way trip to callback soup. I’ve been doing some PhoneGap work recently and using jQuery’s Deferreds has significantly help keep the project from becoming a mess of callbacks. On the NodeJS side, it looks like there’s two functionally equivalent implementations but I decided to run with Q since the “wrapper” function names are shorter.
Anyway, the main problem we’re trying to address is that with multiple nested callbacks code becomes particularly difficult to follow. It’s hard to read, hard to trace control flow, and obviously hard to debug. Take for example, the phantomjs-node example:
It’s already THREE callbacks deep and all it’s done is initialize PhantomJS and load a page. Imagine layering on a few more asynchronous operations, doing all of this in a loop, and then some post-processing. Enter Deferreds. How To Node has an eloquent explanation of what deferreds are and how Node impliments them but in a nutshell they’re useful for making asynchronous code easier to follow.
The main issue I ran into using Q was that “Q.ninvoke” and “Q.npost” wrapper functions kept causing exceptions while using them with phantomjs-node. What I ended up doing instead was creating my own Deferreds in separate functions and then resolving them inside a normal callback.
My PhantomJS-node code ended up looking like:
It’s without a doubt easier to follow and would also make it much easier to do more complicated PhantomJS related tasks.
So how did screenshotting a few thousand domains go? Well there’s a story about that as well…