I’ll preface this by saying that I know just enough about machine learning to be dangerous and get myself into trouble. That said, if anything is inaccurate or misleading let me know in the comments and I’ll update it. Last April Amazon announced Amazon Machine Learning, a new AWS service aimed at developers to help them build and deploy machine learning solutions. We’ve been excited to experiment with AWS ML since it launched but haven’t had a chance until just now.
So what is “machine learning”? Looking at Wikipedia’s definition machine learning is ‘is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a “Field of study that gives computers the ability to learn without being explicitly programmed”.’ That definition in turn translates to using a computer to solve problems like regression or classification. Machine learning powers dozens of the products that internet users interact with everyday from spam filtering to product recommendations to Siri and Google Now.
Looking at the Wikipedia article, ML as a field has existed since the late 1980s so what’s been driving its recent growth in popularity? I’d argue key driving factors have been compute resources getting cheaper, especially storage, which has allowed companies to store orders of magnitude more data than they were 5 or 10 years ago. This data along with elastic public cloud resources and the increasing maturity of open source packages has made ML accessible and worthwhile for an increasingly large number of companies. Additionally, there’s been an explosion of venture capital funding into ML focussed startups which has certainly also helped boost its popularity.
The first thing we need to do before testing out Amazon ML was to pick a good machine learning problem to tackle. Unfortunately, we didn’t have any internal data to test with so I headed over to Kaggle to find a good problem to tackle. After some exploring I settled on Digit Recognizer since its a “known problem”, the Kaggle challenge had benchmark solutions, and no additional data transformations would be neccessary. The goal of the Digit Recognizer problem is to accept bitmap representations of handwritten numerals and then correctly output what number was written.
The dataset is a modified version of the Mixed National Institute of Standards and Technology which is a well known dataset often used for training image processing systems. Unlike the original MNIST images, the Kaggle dataset has already been converted to a grayscale bitmap array so individual pixels are represented by an integer from 0-255. In ML parlance, the “Digit Recognizer” challenge would fall under the umbrella of a classification problem since the goal would be to correctly “classify” unknown inputs with a label, in this case a 0-9 digit. Another interesting feature of the MNIST dataset is that the Wikipedia provides benchmark performance for a variety of approaches so we can have a sense of how AWS ML stacks up.
At a high level, the big steps we’re going to take are to train our model using “train.csv”, evaluate it against a subset of known data, and then predict labels for the rows in “test.csv”. Amazon ML makes this whole process pretty easy using the AWS Console UI so there’s not really any magic. One thing worth noting is that Amazon doesn’t let you select which algorithm will be used in the model you build, it selects it automatically based on the type of ML problem. After around 30 minutes your model should be built and you’ll be able to explore the model’s performance. This is actually a really interesting feature of Amazon ML since you wouldn’t get these insights with visualizations “out of the box” from most open source packages.
With the model built the last step is to use it to predict unknown values from the “test.csv” dataset. Similar to generating the model, running a “batch prediction” is pretty straightforward on the AWS ML UI. After the prediction finishes you’ll end up with a results file in your specified S3 bucket that looks similar to:
Because there are several possible classifications of a digit the ML model generates a probability per classification with the largest number being the most likely. Individual probabilities are great but what we really want is a single digit per input sample. Running the input through the following PHP will produce that along with a header for Kaggle:
And finally the last step of the evaluation is uploading our results file to Kaggle to see how our model stacks up. Uploading my results produced a score of 0.91671 so right around 92% accuracy. Interestingly, looking at the Wikipedia entry for MNIST a 8% error rate is right around what was academically achieved using a linear classifier. So overall, not a bad showing!
Comparing the model’s performance to the Kaggle leaderboard and Wikipedia benchmarks, AWS ML performanced decently well especially considering we took the defaults and didn’t pre-process the data. One of the downside of AWS ML is the lack of visibility into what algorithms are being used and additionally not being able to select specific algorithms. In my experience, solutions that mask complexity like this work great for “typical” use cases but then quickly breakdown for more complicated tasks. Another downside of AWS ML is that it can currently only process text data that’s formatted into CSVs with one record per row. The result of this is that you’ll have to do any data transformations with your own code running on your own compute infrastructure or AWS EC2.
Anyway, all in all I think Amazon’s Machine Learning product is definitely an interesting addition to the AWS suite. At the very least, I can see it being a powerful tool to be able to quickly test out ML hypothesis which can then be implemented and refined using an open source package like skit-learn or Apache Mahout.