NodeJS: Running code in parallel with child_process

One of the nice things about nodejs is that, since the majority of its libraries are asynchronous, it boasts strong support for concurrently performing IO-heavy workloads. Even though node is single threaded, the event loop is able to make progress on several operations concurrently because of the asynchronous style of its libraries. A canonical example would be something like fetching 10 web pages and extracting all the links from the fetched HTML. This fits nicely into node's computational model since the most time-consuming part of an HTTP request is waiting around for the network, during which node can use the CPU for something else. For the sake of discussion, let's consider this sample implementation:
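What follows is a minimal sketch of what that implementation might look like rather than the exact original; it assumes the request module for HTTP and cheerio for parsing the HTML:

```javascript
var request = require("request");
var cheerio = require("cheerio");

// Turn on request's debug output so we can watch the events fire per URL
request.debug = true;

var urls = [
  "http://www.google.com/",
  "http://www.bing.com/",
  "http://www.yahoo.com/"
  // ...and so on, up to 10 URLs
];

urls.forEach(function(url) {
  request(url, function(error, response, body) {
    if (error) {
      return console.error(url + " failed: " + error);
    }

    // Extract the href of every link on the page
    var $ = cheerio.load(body);
    var links = $("a").map(function() {
      return $(this).attr("href");
    }).get();

    console.log(url + " has " + links.length + " links");
  });
});
```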

Request debugging is enabled so you’ll see that node starts fetching all the URLs at the same time and then the various events fire at different times for each URL:

So we've demonstrated that node will concurrently "do" several things at once, but what happens if computationally intensive code is tying up the event loop? As a concrete example, imagine doing something like compressing the results of the HTTP request. For our purposes we'll just throw in a while(1) so it's easier to see what's going on:
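A sketch of the blocking version, reusing the request callback from the sketch above and busy-waiting for roughly 5 seconds to stand in for the CPU-heavy compression step:

```javascript
request(url, function(error, response, body) {
  // Simulate CPU-intensive work (e.g. compression) with a busy loop that
  // holds the event loop hostage for ~5 seconds
  var start = Date.now();
  while (1) {
    if (Date.now() - start > 5000) {
      break;
    }
  }

  var $ = cheerio.load(body);
  // ...extract the links as before
});
```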

If you run the script you’ll notice it takes much longer to finish since we’ve now introduced a while() loop that causes each URL to take at least 5 seconds to be processed:

And now back to the original problem: how can we fetch the URLs in parallel so that our script completes in around 5 seconds? It turns out it's possible to do this in node with the child_process module. child_process basically lets you fire up a second nodejs instance and use IPC to pass messages between the parent and its child. We'll need to move a couple of things around to get this to work and the implementation ends up looking like:
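Here's a sketch of what the parent and child might look like; the file names (parent.js, worker.js) and the message shape are placeholders rather than the original code:

```javascript
// parent.js -- forks one worker per URL, sends it the URL, and collects the links
var childProcess = require("child_process");

var urls = [
  "http://www.google.com/",
  "http://www.bing.com/",
  "http://www.yahoo.com/"
];

var start = Date.now();
var finished = 0;

urls.forEach(function(url) {
  // fork() starts a second node instance with an IPC channel back to the parent
  var worker = childProcess.fork(__dirname + "/worker.js");

  worker.on("message", function(message) {
    console.log(message.url + " returned " + message.links.length + " links");
    worker.kill();

    finished++;
    if (finished === urls.length) {
      console.log("All done in " + (Date.now() - start) + "ms");
    }
  });

  // Kick the worker off by telling it which URL to process
  worker.send({url: url});
});
```

```javascript
// worker.js -- fetches its URL, blocks for ~5 seconds to simulate CPU work,
// extracts the links, and messages them back to the parent
var request = require("request");
var cheerio = require("cheerio");

process.on("message", function(message) {
  request(message.url, function(error, response, body) {
    // Simulated CPU-intensive work
    var start = Date.now();
    while (1) {
      if (Date.now() - start > 5000) {
        break;
      }
    }

    var $ = cheerio.load(body);
    var links = $("a").map(function() {
      return $(this).attr("href");
    }).get();

    process.send({url: message.url, links: links});
  });
});
```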

What's happening now is that we're launching a child process for each URL we want to process, passing a message with the target URL, and then passing the links back to the parent. Running the whole thing along with a timer results in:

It isn’t exactly 5 seconds since there’s a non-trivial amount of time required to start each of the child processes but it’s around what you’d expect. So there you have it, we’ve successfully demonstrated how you can achieve parallelism with nodejs.

Scala: Building with Eclipse and Maven

We’ve been writing a bit of Scala lately (more on that later) and one of the “gotchas” we ran into was adding a Maven project in the Scala IDE (Eclipse). We wanted to use Maven because we needed to manage some Java dependencies, are generally more familiar with it, and didn’t want to deal with figuring out sbt. It turns out, there’s an existing Maven archetype for building Scala projects but it takes a bit of finagling to get it to work in Eclipse.

From Eclipse

The first thing you'll need to do is add a "Remote Catalog" to your list of available Maven archetypes. To do this, click through Window > Preferences and then on the left navigate to Maven > Archetypes > Add Remote Catalog. From there, you'll need to add a "Remote Catalog" with the catalog file set to http://repo1.maven.org/maven2/archetype-catalog.xml.

Once this is done, you'll be able to go to File > New > Other and select Maven > Maven Project. On the archetype selection screen you'll now be able to search for "net.alchim31.maven", which is what you'll want to select.

When I tested this, there were a couple of problems with the project that the archetype created. To solve these issues I had to do the following:

  • The pom.xml was generated with a placeholder for my Scala version so I had to replace all the instances of “${scala.version}” in the pom with “2.11.7”. You’ll want to match this with the version of Scala you have installed.
  • JUnit wasn't importing properly so the classes in test/ were throwing compile errors. I didn't have any immediate testing needs, so I deleted the entire test/ directory and removed the test-related dependencies: junit, org.specs2, and org.scalatest
  • The pom passes an invalid “-make:transitive” option to scalac which I just removed. It’s around line 51 inside the “args” block for scala-maven-plugin
  • The archetype also sets the compiler version to 1.6 which I bumped to 1.8

Creating a runnable JAR

Another common "gotcha" with Scala and Maven is creating a runnable JAR, so basically something you can run with "java -jar yourjar.jar". This is a bit tricky with Scala since you have to package the Scala library in along with your dependencies. And then on the Maven side, it seems like there are a dozen ways to accomplish this. I ended up using the maven-assembly-plugin with the following configuration:
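Here's a sketch of the plugin block; the plugin version and the mainClass value are placeholders you'll want to adjust for your own project:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.6</version>
  <configuration>
    <descriptorRefs>
      <!-- Bundles your classes, the Scala library, and all other
           dependencies into a single "fat" JAR -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <!-- Placeholder: point this at your own main class/object -->
        <mainClass>com.example.Main</mainClass>
      </manifest>
    </archive>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```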

And then you can compile and run like any other Maven project:
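For example (the JAR name is a placeholder; the assembly plugin appends "jar-with-dependencies" to your artifact name):

```
mvn clean package
java -jar target/yourproject-1.0-SNAPSHOT-jar-with-dependencies.jar
```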

A working pom.xml

Copied below is the pom.xml file in all of its glory. Let me know if you run into any issues.
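The full file isn't reproduced here, but a condensed sketch of the key pieces (with placeholder group and artifact IDs) looks roughly like this:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>scala-example</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <!-- Hard-coded Scala version, replacing the archetype's placeholder -->
    <scala.version>2.11.7</scala.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
        <!-- The archetype's "args" block lives here; just make sure the
             invalid "-make:transitive" option has been removed -->
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <!-- Bumped from the archetype's 1.6 -->
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <!-- Plus the maven-assembly-plugin block shown earlier -->
    </plugins>
  </build>
</project>
```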

ML: Taking AWS machine learning for a spin

I’ll preface this by saying that I know just enough about machine learning to be dangerous and get myself into trouble. That said, if anything is inaccurate or misleading let me know in the comments and I’ll update it. Last April Amazon announced Amazon Machine Learning, a new AWS service aimed at developers to help them build and deploy machine learning solutions. We’ve been excited to experiment with AWS ML since it launched but haven’t had a chance until just now.

A bit of background

So what is "machine learning"? According to Wikipedia, machine learning "is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a 'Field of study that gives computers the ability to learn without being explicitly programmed'." In practice, that definition translates to using a computer to solve problems like regression or classification. Machine learning powers dozens of the products that internet users interact with every day, from spam filtering to product recommendations to Siri and Google Now.

Looking at the Wikipedia article, ML as a field has existed since the late 1980s, so what's been driving its recent growth in popularity? I'd argue a key driving factor has been compute resources getting cheaper, especially storage, which has allowed companies to store orders of magnitude more data than they could 5 or 10 years ago. This data, along with elastic public cloud resources and the increasing maturity of open source packages, has made ML accessible and worthwhile for an increasingly large number of companies. Additionally, there's been an explosion of venture capital funding into ML-focused startups, which has certainly also helped boost its popularity.

Kicking the tires

The first thing we needed to do before testing out Amazon ML was to pick a good machine learning problem. Unfortunately, we didn't have any internal data to test with, so I headed over to Kaggle to find a suitable challenge. After some exploring I settled on Digit Recognizer since it's a "known problem", the Kaggle challenge has benchmark solutions, and no additional data transformations would be necessary. The goal of the Digit Recognizer problem is to accept bitmap representations of handwritten numerals and correctly output what number was written.

The dataset is a modified version of the MNIST (Modified National Institute of Standards and Technology) dataset, which is well known and often used for training image processing systems. Unlike the original MNIST images, the Kaggle dataset has already been flattened into grayscale bitmap arrays, so individual pixels are represented by an integer from 0-255. In ML parlance, the Digit Recognizer challenge falls under the umbrella of classification problems, since the goal is to correctly "classify" unknown inputs with a label, in this case a digit from 0-9. Another nice feature of the MNIST dataset is that Wikipedia provides benchmark performance for a variety of approaches, so we can get a sense of how AWS ML stacks up.

At a high level, the big steps we're going to take are to train our model using "train.csv", evaluate it against a subset of known data, and then predict labels for the rows in "test.csv". Amazon ML makes this whole process pretty easy using the AWS Console UI, so there isn't really any magic. One thing worth noting is that Amazon doesn't let you select which algorithm will be used in the model you build; it selects one automatically based on the type of ML problem. After around 30 minutes your model should be built and you'll be able to explore its performance. This is actually a really interesting feature of Amazon ML, since you wouldn't get these insights and visualizations "out of the box" from most open source packages.

Performance

With the model built the last step is to use it to predict unknown values from the “test.csv” dataset. Similar to generating the model, running a “batch prediction” is pretty straightforward on the AWS ML UI. After the prediction finishes you’ll end up with a results file in your specified S3 bucket that looks similar to:

Because there are ten possible classifications for a digit, the ML model generates a probability per class, with the largest number being the most likely. Individual probabilities are great, but what we really want is a single digit per input sample. Running the input through the following PHP will produce that, along with a header row for Kaggle:
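Here's a sketch of what that script might look like; the file names and the assumption that the last ten columns of each row hold the per-digit probabilities are placeholders to adjust for your actual results file:

```php
<?php
$in  = fopen('batch_prediction_results.csv', 'r');
$out = fopen('kaggle_submission.csv', 'w');

// Kaggle's Digit Recognizer submission format: ImageId,Label
fputs($out, "ImageId,Label\n");

$imageId = 1;
while (($row = fgetcsv($in)) !== false) {
    // Assume the last ten columns are the per-digit probabilities (0-9)
    $scores = array_slice($row, -10);

    // The predicted digit is the index of the largest probability
    $label = array_search(max($scores), $scores);

    fputs($out, $imageId . ',' . $label . "\n");
    $imageId++;
}

fclose($in);
fclose($out);
```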

And finally, the last step of the evaluation is uploading our results file to Kaggle to see how our model stacks up. Uploading my results produced a score of 0.91671, so right around 92% accuracy. Interestingly, looking at the Wikipedia entry for MNIST, an 8% error rate is right around what's been achieved academically using a linear classifier. So overall, not a bad showing!

Takeaways

Comparing the model's performance to the Kaggle leaderboard and Wikipedia benchmarks, AWS ML performed decently well, especially considering we took the defaults and didn't pre-process the data. One of the downsides of AWS ML is the lack of visibility into which algorithms are being used, along with not being able to select specific algorithms. In my experience, solutions that mask complexity like this work great for "typical" use cases but quickly break down for more complicated tasks. Another downside of AWS ML is that it can currently only process text data that's formatted into CSVs with one record per row. The result is that you'll have to do any data transformations with your own code running on your own compute infrastructure or on AWS EC2.

Anyway, all in all I think Amazon's Machine Learning product is definitely an interesting addition to the AWS suite. At the very least, I can see it being a powerful tool for quickly testing out ML hypotheses, which can then be implemented and refined using an open source package like scikit-learn or Apache Mahout.

TxtyJukebox: Powering the soundtrack of your night

Picture the scene, it’s Friday night, you’ve got friends over and everyone wants to listen to some great music. The problem is everyone wants to jam to something different and you’re not thrilled to sit by your laptop all night. Enter, the TxtyJukebox.

TxtyJukebox lets you set up an event, which gets you a unique phone number that guests can text song requests to. As TxtyJukebox receives song requests, it searches YouTube for music videos and then places the videos into your event's queue. If you hook TxtyJukebox up to a TV, you'll be able to jam to videos on a big screen with big room sound. But wait, there's more! If you have a Chromecast, you can connect TxtyJukebox to it via our app. The Chromecast app launches from within http://jukebox.setfive.com/ so there's nothing to download or set up.

So how does TxtyJukebox work under the hood? Well sit tight, technical details lie ahead. The webapp itself is a standard Symfony2 app along with the usual suspects – Bootstrap, Underscore, and a sprinkling of jQuery. Along with that, we're using Twilio's REST API to handle SMS, along with a "webhook" from Twilio to the webapp to receive messages. In addition, we're leveraging the YouTube API to search for and load videos, which are then loaded into an iframe. Finally, the Chromecast app is HTML/CSS/JS powered by jQuery and Underscore.
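As a purely hypothetical illustration (not the actual TxtyJukebox code), the Twilio SMS webhook side of a Symfony2 app might look roughly like this; the controller name, service id, and reply text are made-up placeholders:

```php
<?php

namespace AppBundle\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

class SmsWebhookController extends Controller
{
    public function incomingSmsAction(Request $request)
    {
        // Twilio POSTs the sender and message text as "From" and "Body"
        $from = $request->request->get('From');
        $body = $request->request->get('Body');

        // Hypothetical service that searches YouTube and queues the top result
        $this->get('app.jukebox_queue')->queueSongRequest($from, $body);

        // Reply with TwiML so the sender gets a confirmation text
        $twiml = '<?xml version="1.0" encoding="UTF-8"?>'
               . '<Response><Message>Got it! Your song is in the queue.</Message></Response>';

        return new Response($twiml, 200, array('Content-Type' => 'text/xml'));
    }
}
```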

Building TxtyJukebox was a lot of fun and we're thrilled that it's been positively received. An awesome surprise was that Ryan over at MakeUseOf.com found it and included it in his post on How to Share Music from Multiple Devices to a Chromecast. As always, let us know if you have any questions or comments.

Spring Boot: Authentication with custom HTTP header

For the last few months we've been working on a Spring Boot project and one of the more challenging aspects has been wrangling Spring's security component. For the project, we were looking to authenticate users using a custom HTTP header that contained a token generated from a third party service. There don't seem to be a whole lot of concrete examples of how to set something like this up, so here are some notes from the trenches. Note: I'm still new to Spring so if any of this is inaccurate, let me know in the comments.

Concretely, what we’re looking to do is authenticate a user by passing a value in an X-Authorization HTTP header. So for example using cURL or jQuery:
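With cURL it would look something like this (the endpoint and token are placeholders); with jQuery you'd pass the same header via the headers option of $.ajax:

```
curl -H "X-Authorization: your-token-here" https://yourapp.example.com/api/users/me
```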

In addition to ensuring that the token is valid, we also want to set up Spring Security so that we can access the user's details using "SecurityContextHolder.getContext().getAuthentication()". So how do you do this? Turns out, you need a couple of classes to make this work:

  • An Authentication Token: You need a class that extends AbstractAuthenticationToken so that you can let Spring know about your authenticated user. The UsernamePasswordAuthenticationToken class is a pretty good starting point.
  • The Filter: You'll need to create a filter to inspect requests that you want authenticated, grab the X-Authorization header, confirm the token is valid, and set the corresponding Authentication. Since we only want this to run once per request you can extend the OncePerRequestFilter class to set this up. You can see an example in the sketch after this list.
  • An Authentication Provider: The final piece is a class that implements AuthenticationProvider and handles retrieving a JPA entity from the database. By implementing an AuthenticationProvider instead of doing the database lookup in the filter, you keep your filter framework agnostic since you don't have to autowire in a JPA repository. My implementation looks similar to the second class in the sketch below.
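
Here's a minimal sketch of the filter and provider (not the exact classes from our project). TokenAuthFilter, TokenAuthProvider, User, and UserRepository are placeholder names, and for brevity it reuses Spring's built-in PreAuthenticatedAuthenticationToken rather than a fully custom token class; each class would live in its own file:

```java
import java.io.IOException;

import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.security.authentication.AuthenticationManager;
import org.springframework.security.authentication.AuthenticationProvider;
import org.springframework.security.authentication.BadCredentialsException;
import org.springframework.security.core.Authentication;
import org.springframework.security.core.AuthenticationException;
import org.springframework.security.core.context.SecurityContextHolder;
import org.springframework.security.web.authentication.preauth.PreAuthenticatedAuthenticationToken;
import org.springframework.web.filter.OncePerRequestFilter;

// The filter: pulls the X-Authorization header and hands an unauthenticated
// token to the AuthenticationManager.
public class TokenAuthFilter extends OncePerRequestFilter {

    private final AuthenticationManager authenticationManager;

    public TokenAuthFilter(AuthenticationManager authenticationManager) {
        this.authenticationManager = authenticationManager;
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain)
            throws ServletException, IOException {

        String token = request.getHeader("X-Authorization");

        if (token != null) {
            Authentication authRequest =
                new PreAuthenticatedAuthenticationToken(token, token);
            Authentication authResult = authenticationManager.authenticate(authRequest);
            SecurityContextHolder.getContext().setAuthentication(authResult);
        }

        filterChain.doFilter(request, response);
    }
}

// The provider: looks the token up in the database and returns a fully
// authenticated Authentication with the user entity as the principal.
public class TokenAuthProvider implements AuthenticationProvider {

    @Autowired
    private UserRepository userRepository; // hypothetical JPA repository

    @Override
    public Authentication authenticate(Authentication authentication)
            throws AuthenticationException {

        String token = (String) authentication.getCredentials();

        User user = userRepository.findByApiToken(token); // hypothetical lookup
        if (user == null) {
            throw new BadCredentialsException("Invalid API token");
        }

        return new PreAuthenticatedAuthenticationToken(user, token, user.getAuthorities());
    }

    @Override
    public boolean supports(Class<?> authentication) {
        return PreAuthenticatedAuthenticationToken.class.isAssignableFrom(authentication);
    }
}
```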

And finally, the last step is to wire this all up. You'll need a class that extends WebSecurityConfigurerAdapter with two overridden configure methods to register the filter and the authentication provider. For example, the following works at a bare minimum:
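Here's a bare-minimum sketch of that configuration, assuming the TokenAuthFilter and TokenAuthProvider classes from the sketch above:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.authentication.builders.AuthenticationManagerBuilder;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;
import org.springframework.security.config.http.SessionCreationPolicy;
import org.springframework.security.web.authentication.UsernamePasswordAuthenticationFilter;

@Configuration
@EnableWebSecurity
public class SecurityConfig extends WebSecurityConfigurerAdapter {

    @Autowired
    private TokenAuthProvider tokenAuthProvider;

    @Override
    protected void configure(AuthenticationManagerBuilder auth) throws Exception {
        // Register our provider so the AuthenticationManager knows how to
        // validate the token the filter extracts
        auth.authenticationProvider(tokenAuthProvider);
    }

    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http
            .csrf().disable()
            .sessionManagement().sessionCreationPolicy(SessionCreationPolicy.STATELESS)
            .and()
            // Run our filter before the standard username/password filter
            .addFilterBefore(new TokenAuthFilter(authenticationManager()),
                             UsernamePasswordAuthenticationFilter.class)
            .authorizeRequests().anyRequest().authenticated();
    }
}
```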

And then finally to access the authenticated user from a controller you’d do:
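Something along these lines, assuming the provider above set your User entity as the principal:

```java
User currentUser = (User) SecurityContextHolder.getContext()
        .getAuthentication()
        .getPrincipal();
```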

Anyway, hope this helps and as mentioned above if there’s anything inaccurate feel free to post in the comments.