We’ve been evaluating Apache Flume over the last few weeks as part of a client project we’re working on. At a high level, our goal was to get plain text data generated by one of our applications running in a non-AWS datacenter back into Amazon S3 so that we could load it into Redshift. Reading through the “Is Flume a good fit?” section of the docs, it perfectly describes this use case:

If you need to ingest textual log data into Hadoop/HDFS then Flume is the right fit for your problem, full stop

OK great, but what about writing into S3? It turns out you can use the HDFS sink to write into S3 if you use a “path” configuration formatted like ‘s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>’.
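
For example, the sink portion of an agent config ends up looking something like this (the agent and sink names are just placeholders):

    # HDFS sink pointed at S3 instead of an actual HDFS cluster
    agent1.sinks.s3Sink.type = hdfs
    agent1.sinks.s3Sink.hdfs.path = s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>/flume/events
    agent1.sinks.s3Sink.hdfs.fileType = DataStream
    agent1.sinks.s3Sink.hdfs.writeFormat = Text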

But wait! Unfortunately Flume doesn’t ship with “batteries included” for writing to HDFS and S3, so you’ll need to grab a couple more dependencies before you can get this working. Frustratingly, you need to grab version-compatible JARs of the Amazon S3 client, HDFS, and Hadoop with S3 compatibility. After flailing around downloading packages, hitting an error, downloading more JARs, and finally getting Flume working, I realized there had to be a better way to replicate the process.

Enter Maven! Since we’re just pulling down JARs, it’s actually possible to use a pom.xml to describe the dependencies we need, let Maven grab the JARs, and then copy them into a local folder. Here’s a pom.xml that should work against Flume 1.6:
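
The groupId and artifactId below are arbitrary, and the Hadoop and AWS SDK versions are the ones I’d expect to pair with Flume 1.6, so double check them against your setup:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <groupId>com.example</groupId>
      <artifactId>flume-s3-deps</artifactId>
      <version>1.0</version>

      <properties>
        <hadoop.version>2.7.1</hadoop.version>
      </properties>

      <dependencies>
        <!-- Hadoop pieces the HDFS sink needs on the classpath -->
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>${hadoop.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-hdfs</artifactId>
          <version>${hadoop.version}</version>
        </dependency>
        <!-- S3 filesystem support (s3n/s3a) -->
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-aws</artifactId>
          <version>${hadoop.version}</version>
        </dependency>
        <!-- AWS client libraries used by the S3 filesystems -->
        <dependency>
          <groupId>com.amazonaws</groupId>
          <artifactId>aws-java-sdk</artifactId>
          <version>1.7.4</version>
        </dependency>
      </dependencies>

      <build>
        <plugins>
          <!-- Copy every JAR, including transitive dependencies, into ./lib during process-sources -->
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <executions>
              <execution>
                <id>copy-dependencies</id>
                <phase>process-sources</phase>
                <goals>
                  <goal>copy-dependencies</goal>
                </goals>
                <configuration>
                  <outputDirectory>${basedir}/lib</outputDirectory>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </project>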

To use it, just run “mvn process-sources” and you’ll end up with all the JARs conveniently placed in a “lib/” folder in the current directory. Copy those JARs into the “lib/” folder of your Flume download and you should be off to the races. Note: this is very possibly more JARs than you strictly need to get Flume running, but expressed as Maven dependencies it’s the simplest set I could come up with.

Fleshing out the steps to get a working S3 sink, you should be able to do the following:
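
Roughly, the process looks like this (the mirror URL and paths are what I’d expect, so adjust as needed, and the pom.xml from above should be sitting in your working directory):

    # Grab and unpack Flume 1.6
    wget http://archive.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
    tar -xzf apache-flume-1.6.0-bin.tar.gz

    # Let Maven pull down the Hadoop/S3 JARs and copy them into Flume's lib/
    mvn process-sources
    cp lib/*.jar apache-flume-1.6.0-bin/lib/

    # Drop an agent1.conf (netcat source -> memory channel -> the S3 sink from above)
    # into the Flume directory and launch the agent
    cd apache-flume-1.6.0-bin
    ./bin/flume-ng agent --conf conf --conf-file agent1.conf --name agent1 -Dflume.root.logger=DEBUG,console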

Before you run the last command to launch Flume you’ll need to edit “agent1.conf” to enter your AWS access key, secret key, and S3 bucket location. You’ll also need to create the S3 bucket before trying to write to it with Flume. And then finally, to test that everything is working you can use netcat with the following:
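
Assuming your agent1.conf wires up Flume’s stock netcat source listening on port 44444, something like this will do it:

    # Push a test message at the netcat source
    echo "hello flume" | nc localhost 44444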

Back on the terminal with Flume you should see debug data about receiving the message and a notification about an S3 upload. So what’s next? Not much: you’ll need to pick an appropriate source and then tune your HDFS sink and channel parameters for the amount of throughput you need.

As always, questions and comments welcome!

Posted In: Big Data


One of the nice things about nodejs is that since the majority of its libraries are asynchronous it boasts strong support for concurrently performing IO-heavy workloads. Even though node is single threaded, the event loop is able to concurrently progress separate operations because of the asynchronous style of its libraries. A canonical example would be something like fetching 10 web pages and extracting all the links from the fetched HTML. This fits into node’s computational model nicely since the most time consuming part of an HTTP request is waiting around for the network, during which node can use the CPU for something else. For the sake of discussion, let’s consider this sample implementation:
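
A minimal version looks something like the following — it assumes the “request” module is installed, uses a handful of URLs to stand in for the ten, and pulls links out with a naive regex rather than a real HTML parser:

    // fetch.js: fetch several pages concurrently and count the links on each
    var request = require("request");
    request.debug = true; // surface request's internal debug events

    var urls = [
        "http://www.google.com/",
        "http://www.bing.com/",
        "http://www.yahoo.com/",
        "http://www.wikipedia.org/",
        "http://www.reddit.com/"
    ];

    console.time("total");

    var remaining = urls.length;
    urls.forEach(function(url) {
        request(url, function(error, response, body) {
            if (error) {
                console.error(url + " failed: " + error);
            } else {
                // Naive link extraction -- good enough for demonstration purposes
                var links = body.match(/href="[^"]*"/g) || [];
                console.log(url + " -> " + links.length + " links");
            }

            if (--remaining === 0) {
                console.timeEnd("total");
            }
        });
    });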

Request debugging is enabled so you’ll see that node starts fetching all the URLs at the same time and then the various events fire at different times for each URL:

So we’ve demonstrated that node will concurrently “do” several things at the same time but what happens if computationally intensive code is tying up the event loop? As a concrete example, imagine doing something like compressing the results of the HTTP request. For our purposes we’ll just throw in a while(1) so it’s easier to see what’s going on:
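A sketch of that change: the request callback from the example above picks up a busy loop that stands in for the compression work:

    // Same as before, except each response is followed by ~5 seconds of CPU-bound "work"
    request(url, function(error, response, body) {
        var start = Date.now();
        // Stand-in for compression: spin the CPU and tie up the event loop for 5 seconds
        while (Date.now() - start < 5000) { }

        var links = (body || "").match(/href="[^"]*"/g) || [];
        console.log(url + " -> " + links.length + " links");

        if (--remaining === 0) {
            console.timeEnd("total");
        }
    });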

If you run the script you’ll notice it takes much longer to finish since we’ve now introduced a while() loop that causes each URL to take at least 5 seconds to be processed:

And now back to the original problem: how can we fetch the URLs in parallel so that our script completes in around 5 seconds? It turns out it’s possible to do this in node with the child_process module. child_process basically lets you fire up a second nodejs instance and use IPC to pass messages between the parent and its child. We’ll need to move a couple of things around to get this to work and the implementation ends up looking like:
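
Split across a parent and a worker script, a sketch of this might look like the following (filenames are arbitrary):

    // parent.js: fork one worker per URL and collect the links over IPC
    var fork = require("child_process").fork;

    var urls = [
        "http://www.google.com/",
        "http://www.bing.com/",
        "http://www.yahoo.com/",
        "http://www.wikipedia.org/",
        "http://www.reddit.com/"
    ];

    console.time("total");

    var remaining = urls.length;
    urls.forEach(function(url) {
        var worker = fork(__dirname + "/worker.js");

        worker.on("message", function(msg) {
            console.log(msg.url + " -> " + msg.links.length + " links");
            if (--remaining === 0) {
                console.timeEnd("total");
            }
        });

        worker.send({ url: url });
    });

    // worker.js: fetch the URL, do the CPU-heavy part, send the links back to the parent
    var request = require("request");

    process.on("message", function(msg) {
        request(msg.url, function(error, response, body) {
            var start = Date.now();
            while (Date.now() - start < 5000) { }  // same stand-in for compression

            var links = (body || "").match(/href="[^"]*"/g) || [];
            process.send({ url: msg.url, links: links });
            process.exit(0);
        });
    });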

What’s happening now is that we’re launching a child process for each URL we want to process, passing it a message with the target URL, and then passing the extracted links back to the parent. Running this along with a timer results in:

It isn’t exactly 5 seconds since there’s a non-trivial amount of time required to start each of the child processes but it’s around what you’d expect. So there you have it, we’ve successfully demonstrated how you can achieve parallelism with nodejs.

Posted In: Javascript


We’ve been writing a bit of Scala lately (more on that later) and one of the “gotchas” we ran into was adding a Maven project in the Scala IDE (Eclipse). We wanted to use Maven because we needed to manage some Java dependencies, are generally more familiar with it, and didn’t want to deal with figuring out sbt. It turns out, there’s an existing Maven archetype for building Scala projects but it takes a bit of finagling to get it to work in Eclipse.

From Eclipse

The first thing you’ll need to do is add a “Remote Catalog” to your list of available Maven archetypes. To do this, click through Window > Preferences and then on the left navigate through Maven > Archetypes and click “Add Remote Catalog”. From there, set the catalog file to http://repo1.maven.org/maven2/archetype-catalog.xml.

Once this is done, you’ll be able to go to File > New > Other and select Maven > Maven Project. On the archetype selection screen you’ll now be able to search for “net.alchim31.maven”, which is the archetype you’ll want to select.

When I tested this, there were a couple of problems with the project that the archetype created. To solve these issues I had to do the following:

  • The pom.xml was generated with a placeholder for my Scala version so I had to replace all the instances of “${scala.version}” in the pom with “2.11.7”. You’ll want to match this with the version of Scala you have installed.
  • junit wasn’t properly importing so the classes in test/ were throwing a compile error. I didn’t have any immediate testing needs, so I deleted the entire test/ directory and removed the test-related dependencies: junit, org.specs2, and org.scalatest.
  • The pom passes an invalid “-make:transitive” option to scalac, which I just removed. It’s around line 51 of the generated pom, inside the “args” block for scala-maven-plugin.
  • The archetype also sets the Java compiler version to 1.6, which I bumped to 1.8.

Creating a runnable JAR

Another common “gotcha” with Scala and Maven is creating a runnable JAR, so basically something you can run with “java -jar yourjar.jar”. This is a bit tricky with Scala since you have to package the Scala library in along with your dependencies. And then on the Maven side, it seems like there are a dozen ways to accomplish this. I ended up using the maven-assembly-plugin with the following configuration:
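
Something along these lines (the mainClass is obviously project-specific):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <!-- Bundles the Scala library and all other dependencies into one JAR -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
          <manifest>
            <!-- Entry point for "java -jar" -->
            <mainClass>com.example.Main</mainClass>
          </manifest>
        </archive>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>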

And then you can compile and run like any other Maven project:
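
For example (the exact JAR name will depend on your artifactId and version):

    mvn clean package
    java -jar target/scala-maven-example-1.0-SNAPSHOT-jar-with-dependencies.jar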

A working pom.xml

Copied below is the pom.xml file in all of its glory. Let me know if you run into any issues.
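
This is the shape of the pom with all of the tweaks above applied — the groupId, artifactId, and mainClass are placeholders, and the scala-maven-plugin version is just the one I’d expect from that era, so adjust those for your project:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <groupId>com.example</groupId>
      <artifactId>scala-maven-example</artifactId>
      <version>1.0-SNAPSHOT</version>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>

      <dependencies>
        <!-- The Scala runtime library, matching the locally installed 2.11.7 -->
        <dependency>
          <groupId>org.scala-lang</groupId>
          <artifactId>scala-library</artifactId>
          <version>2.11.7</version>
        </dependency>
      </dependencies>

      <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
          <!-- Compiles the Scala sources (the "-make:transitive" arg removed) -->
          <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.1</version>
            <executions>
              <execution>
                <goals>
                  <goal>compile</goal>
                </goals>
              </execution>
            </executions>
          </plugin>

          <!-- Target Java 1.8 instead of the archetype's 1.6 -->
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
              <source>1.8</source>
              <target>1.8</target>
            </configuration>
          </plugin>

          <!-- Builds the runnable jar-with-dependencies described above -->
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
              <archive>
                <manifest>
                  <mainClass>com.example.Main</mainClass>
                </manifest>
              </archive>
            </configuration>
            <executions>
              <execution>
                <phase>package</phase>
                <goals>
                  <goal>single</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </project>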

Posted In: Scala


I’ll preface this by saying that I know just enough about machine learning to be dangerous and get myself into trouble. That said, if anything is inaccurate or misleading let me know in the comments and I’ll update it. Last April Amazon announced Amazon Machine Learning, a new AWS service aimed at developers to help them build and deploy machine learning solutions. We’ve been excited to experiment with AWS ML since it launched but haven’t had a chance until just now.

A bit of background

So what is “machine learning”? Looking at Wikipedia’s definition, machine learning “is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a ‘Field of study that gives computers the ability to learn without being explicitly programmed’.” That definition in turn translates to using a computer to solve problems like regression or classification. Machine learning powers dozens of the products that internet users interact with every day, from spam filtering to product recommendations to Siri and Google Now.

Looking at the Wikipedia article, ML as a field has existed since at least the 1950s, so what’s been driving its recent growth in popularity? I’d argue a key driving factor has been compute resources getting cheaper, especially storage, which has allowed companies to store orders of magnitude more data than they could 5 or 10 years ago. This data, along with elastic public cloud resources and the increasing maturity of open source packages, has made ML accessible and worthwhile for an increasingly large number of companies. Additionally, there’s been an explosion of venture capital funding into ML-focused startups which has certainly also helped boost its popularity.

Kicking the tires

The first thing we needed to do before testing out Amazon ML was to pick a good machine learning problem to tackle. Unfortunately, we didn’t have any internal data to test with, so I headed over to Kaggle to find a suitable problem. After some exploring I settled on Digit Recognizer since it’s a “known problem”, the Kaggle challenge has benchmark solutions, and no additional data transformations would be necessary. The goal of the Digit Recognizer problem is to accept bitmap representations of handwritten numerals and then correctly output what number was written.

The dataset is a version of the MNIST (Modified National Institute of Standards and Technology) database, a well known dataset often used for training image processing systems. Unlike the original MNIST images, the Kaggle dataset has already been converted to a grayscale bitmap array so individual pixels are represented by an integer from 0-255. In ML parlance, the “Digit Recognizer” challenge falls under the umbrella of a classification problem, since the goal is to correctly “classify” unknown inputs with a label, in this case a 0-9 digit. Another nice feature of the MNIST dataset is that Wikipedia provides benchmark performance figures for a variety of approaches, so we can get a sense of how AWS ML stacks up.

At a high level, the big steps we’re going to take are to train our model using “train.csv”, evaluate it against a subset of known data, and then predict labels for the rows in “test.csv”. Amazon ML makes this whole process pretty easy using the AWS Console UI, so there’s not really any magic. One thing worth noting is that Amazon doesn’t let you select which algorithm will be used in the model you build; it selects one automatically based on the type of ML problem. After around 30 minutes your model should be built and you’ll be able to explore its performance. This is actually a really interesting feature of Amazon ML since you wouldn’t get these insights and visualizations “out of the box” from most open source packages.

Performance

With the model built the last step is to use it to predict unknown values from the “test.csv” dataset. Similar to generating the model, running a “batch prediction” is pretty straightforward on the AWS ML UI. After the prediction finishes you’ll end up with a results file in your specified S3 bucket that looks similar to:

Because there are ten possible classifications for a digit, the ML model generates a probability per classification, with the largest number being the most likely. Individual probabilities are great but what we really want is a single digit per input sample. Running the prediction results through the following PHP will produce that, along with a header row for Kaggle:
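
Here’s a rough version of that script. The layout of the results file is an assumption on my part (an id column followed by the ten class probabilities, no header row), so adjust the filenames and offsets to match what actually lands in your bucket:

    <?php
    // Convert the AWS ML batch prediction output into a Kaggle submission file
    $in  = fopen("batch-prediction-results.csv", "r");   // hypothetical filename
    $out = fopen("kaggle-submission.csv", "w");

    // Kaggle expects an ImageId,Label header
    fwrite($out, "ImageId,Label\n");

    $imageId = 1;
    while (($row = fgetcsv($in)) !== false) {
        // Assume the ten per-class probabilities sit in columns 1-10; the predicted
        // digit is simply the index of the largest probability
        $scores = array_slice($row, 1, 10);
        $label  = array_search(max($scores), $scores);

        fwrite($out, $imageId . "," . $label . "\n");
        $imageId++;
    }

    fclose($in);
    fclose($out);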

And finally, the last step of the evaluation is uploading our results file to Kaggle to see how our model stacks up. Uploading my results produced a score of 0.91671, so right around 92% accuracy. Interestingly, looking at the Wikipedia entry for MNIST, an 8% error rate is right around what was academically achieved using a linear classifier. So overall, not a bad showing!

Takeaways

Comparing the model’s performance to the Kaggle leaderboard and Wikipedia benchmarks, AWS ML performed decently well, especially considering we took the defaults and didn’t pre-process the data. One of the downsides of AWS ML is the lack of visibility into which algorithms are being used and, additionally, not being able to select specific algorithms. In my experience, solutions that mask complexity like this work great for “typical” use cases but quickly break down for more complicated tasks. Another downside of AWS ML is that it can currently only process text data that’s formatted into CSVs with one record per row. The result is that you’ll have to do any data transformations with your own code running on your own compute infrastructure or on AWS EC2.

Anyway, all in all I think Amazon’s Machine Learning product is definitely an interesting addition to the AWS suite. At the very least, I can see it being a powerful tool for quickly testing out ML hypotheses which can then be implemented and refined using an open source package like scikit-learn or Apache Mahout.

Posted In: Amazon AWS, Machine learning


Picture the scene: it’s Friday night, you’ve got friends over, and everyone wants to listen to some great music. The problem is everyone wants to jam to something different and you’re not thrilled about sitting by your laptop all night. Enter TxtyJukebox.

TxtyJukebox lets you set up an event, which gives you a unique number that users can text song requests to. As TxtyJukebox receives song requests, it searches YouTube for music videos and places them into your event’s queue. And then if you hook TxtyJukebox up to a TV you’ll be able to jam to videos on a big screen with big room sound. But wait, there’s more! If you have a Chromecast you can connect TxtyJukebox to it via our app. The Chromecast app launches from within http://jukebox.setfive.com/ so there’s nothing to download or set up.

So how does TxtyJukebox work under the hood? Well sit tight, technical details lie ahead. The webapp itself is a standard Symfony2 app along with the usual suspects – Bootstrap, Underscore, and a sprinkling of jQuery. On top of that, we’re using Twilio’s REST API to handle SMS along with a “webhook” from Twilio to the webapp to receive messages. In addition, we’re leveraging the YouTube API to search for and load videos, which are then loaded into an iframe. Finally, the Chromecast app is HTML/CSS/JS powered by jQuery and Underscore.
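
To illustrate the Twilio side, the inbound SMS “webhook” ends up being a plain Symfony2 controller action along these lines (the class, route, and reply text here are hypothetical):

    <?php

    namespace Setfive\JukeboxBundle\Controller;

    use Symfony\Bundle\FrameworkBundle\Controller\Controller;
    use Symfony\Component\HttpFoundation\Request;
    use Symfony\Component\HttpFoundation\Response;

    class TwilioWebhookController extends Controller
    {
        public function incomingSmsAction(Request $request)
        {
            // Twilio POSTs the incoming SMS text as "Body" and the sender as "From"
            $songRequest = $request->request->get('Body');
            $from        = $request->request->get('From');

            // Look up the event tied to this number and queue a YouTube search
            // for the requested song (persistence and the YouTube call omitted here)

            // Reply with TwiML so the sender gets a confirmation text back
            $twiml = '<Response><Message>Got it! Your request is in the queue.</Message></Response>';

            return new Response($twiml, 200, array('Content-Type' => 'text/xml'));
        }
    }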

Building TxtyJukebox was a lot of fun and we’re thrilled that it’s been positively received. An awesome surprise was that Ryan over at MakeUseOf.com found it and included it in his post on How to Share Music from Multiple Devices to a Chromecast. As always, let us know if you have any questions or comments.

Posted In: Launch
