AWS: Looking back on a year with AWS Elastic Container Service

At the beginning of last year we tried something different and deployed an application on Amazon’s Elastic Container Service (ECS). This is in contrast to our normal approach of deploying applications directly on EC2 instances. So from a devops perspective using ECS involved two challenges, working with Docker containers and using ECS as a service. We’ll focus on ECS here since enough ink has been spilled about Docker.

For a bit of background, ECS is a managed services that allows you to run Docker containers on AWS cloud infrastructure (EC2s). Amazon extended the abstraction with “Fargate for ECS” which launches your Docker containers on AWS managed EC2s so you don’t have to manage or maintain any underlying hardware. With Fargate you define a “Task” consisting of a Docker image, # of vCPUs, and amount of RAM which AWS uses to launch a Docker container on a EC2 which you don’t have any access to. And then naturally if you need to provision additional capacity you can just tick the 1 to 2 and AWS will launch an additional container.

The Setup

The app we deployed on ECS is one we inherited from another team. The app is a consumer facing Vert.x (Java) web app that provides a set of API endpoints for content focussed consumer sites. Before taking over the app we had learned that it had some “unique” (aka bugs) scaling requirements which was one of the motivators to use ECS. On the AWS side, our setup consisted of a Fargate ECS cluster connected to an application load balancer (ALB) which handled SSL termination and health checks for the ECS Tasks. In addition, we connected CircleCI for continuous integration and continuous deployment. We’re usingecs-deploy to handle CD on ECS. ecs-deploy handles creating a new ECS task definition, bringing up a new container with that definition, and cycling out the old container if everything went well. So a year in here are some takeaways of using Fargate ECS.

Reinforces cattle, not pets

There’s a cloud devops mantra that you should treat your servers like a herd of cattle, not the family pets. The thinking being that, especially for horizontally scalable cloud servers, you want the servers to be easy to bring up and not “special” in any way. When using EC2s you can convince yourself that you’re adopting this philosophy but eventually something will leak through. Sure you use a configuration management tool but during an emergency someone will surely manually install something. And your developers aren’t supposed to use the local disk but at some point someone will rely on files in “/tmp” always being there.

In contrast, deploying on ECS reinforces thinking of individual servers as disposable because the state of your container is destroyed every time you launch a new task. So each time you deploy new code a new container will be launched without retaining any previous state. At the server level, this dynamic actually makes it impossible to make ad hoc server changes since it’s impossible to SSH to your ECS so any changes would have to present in your Dockerfile. Similarly at the app level since your disks don’t persist between deployments you’d quickly stop writing anything important just to disk.

Logs are…challenging

When using Fargate on ECS the only way to access output from your container is through a CloudWatch log group through the CloudWatch UI. At first look this is great, you can view your logs right in the AWS console UI without having to SSH into any servers! But as time goes on you’ll start to miss being able to see and manipulate logs in a regular terminal.

The first stumbling block is actually the UI itself. It’s not uncommon for the UI to be a couple of minutes delayed which ends up being a significant pain point when “shit is broken”. Related to the delays, it seems like sometimes logs are available in the ECS Task view before CloudWatch which ends up being confusing to members of the team as they debugged issues.

Additionally, although the UI has search and filtering capabilities they’re fairly limited and difficult to use. Compounding this, there frustratingly isn’t an easy way to download the log files locally. This makes it difficult to use common Linux command line tools to parse and analyze logs which you’d normally be able to do. It is possible to export your CloudWatch logs to S3 via the console and then download them locally but the process involves a lot of clicks. You could automate this via the API but it feels like something you shouldn’t have to build since for example the load balancer automatically delivers logs into S3.

You’ll (probably) still need an EC2

The ECS/container dream is that you’ll be able to run all of your apps on managed, abstract infrastructure where all you have to worry about is a Dockerfile. This might be true for some people but for typical organizations you’re probably going to need to run an EC2. The biggest pain points for us were running scheduled tasks (crontabs) and having the flexibility to run ad hoc commands inside AWS.

It is possible to run scheduled tasks on ECS but after experimenting with it we didn’t think it was a great fit. Instead, the approach we took was setting up Jenkins on an EC2 to run scheduled jobs which consisted of running some command in a Docker container. So ultimately our scheduled jobs shared Docker images with our ECS tasks since the images are hosted by AWS ECR. Because of this, the same CircleCI build process that updates the ECS task will also update the image that Jenkins runs so the presence of Jenkins is mostly transparent to developers.

Not having an “inside AWS” environment to run commands is one of the most limiting aspects of using ECS exclusively. At some point an engineer on every team is going to find themselves needing to run a database dump or analyze some log files both of which will simply be orders of magnitude faster if run within AWS vs. over the internet. We took a pretty typical approach here with an EC2 configured via Ansible in an autoscale group.

Is it worth it?

After a year with ECS+Fargate I think it’s definitely a good solution for a set of deployment scenarios. Specifically, if you’re dealing with running a dynamic set of web frontends that are easily containerized it’ll probably be a great fit. The task scaling is “one click” (or API call) as advertised and it feels much snappier than bringing up a whole EC2, even with a snapshotted AMI. One final dimension to evaluate is naturally cost. ECS is billed roughly at the same rate as a regular EC2 but if you’re leaving tasks underutilized you’ll be paying for capacity you aren’t using. As noted above, there are some operational pain points with ECS but on the whole I think it’s a good option to evaluate when using AWS.

An Edge Case of Time in AWS PHP SDK

When Amazon Web Services rolled out their version 4 signature we started seeing sporadic errors on a few projects when we created pre-authenticated link to S3 resources with a relative timestamp. Trying to track down the errors wasn’t easy. It seemed that it would occur rarely while executing the same exact code. Our code was simply to get a pre-authenticated URL that would expire in 7 days, the max duration V4 signatures are allowed to be valid. The error we’d get was “The expiration date of a signature version 4 presigned URL must be less than one week”. Weird, we kept passing in “7 days” as the expiration time. After the error occurred a couple of times over a few weeks I decided to look into it.

The code throwing the error was located right in the SignatureV4 class. The error is thrown when the end timestamp minus the start timestamp for the signature was greater than a week. Looking through the way the timestamps were generated it went something like this:

  1. Generate the start timestamp as current time for the signature assuming one is not passed.
  2. Do a few other quick things not related to this problem.
  3. Do a check to insure that the end minus start timestamp is less than a week in seconds.

So a rough example with straight PHP could of the above steps for a ‘7 days’ expiration would be as follows:

Straight forward enough, right? the problem lies when a second “rolls” between generating the `$start` and the end timestamp check. For example, if you generate the `$start` at `2017-08-20 12:01:01.999999`. Let’s say this gets assigned the timestamp of `2017-08-20 12:01:01`. Then the check for the 7 weeks occurs at `2017-08-27 12:01:02.0000` it’ll throw an exception as duration between the start and end it’s actually now for 86,401 seconds total. It turns outs triggering this error is easier than you’d think. Run this script locally:

That will throw an exception within a few seconds of running most likely.

After I figured out the error, the next step was to submit a issue to make sure I’m not misunderstanding how the library should be used. The simplest fix for me was to generate the end expiration timestamp before generating the start timestamp. After I made the PR, Kevin S. from AWS pointed out that while this fixed the problem, the duration still wasn’t guaranteed to always be the same for the same relative time period. For example, if you created 1000 presigned URLs all with ‘+7 days’ as the valid period, some may be 86400 in duration others may be 86399. This isn’t a huge problem, but Kevin made a great point that we could solve the problem by locking the relative timestamp for the end based on the start timestamp. After adding that to the PR it was accepted. As of release 3.32.4 the fix is now included in the SDK.

Apache Flume: Setting up Flume for an S3 sink

We’ve been evaluating Apache Flume over the last few weeks as part of a client project we’re working on. At a high level, our goal was to get plain text data generated by one of our applications running in a non-AWS datacenter back into Amazon S3 so that we could load it into Redshift. Reading through the Is Flume a good fit? section of their docs it perfectly describes this use case:

If you need to ingest textual log data into Hadoop/HDFS then Flume is the right fit for your problem, full stop

OK great, but what about writing into S3? It turns out you can use the HDFS sink to write into S3 if you use a “path” configuration formatted like ‘s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>’.

But wait! Unfortunately Flume doesn’t ship with “batteries included” for writing to HDFS and S3 so you’ll need to grab a couple more dependencies before you can get this working. Frustratingly, you need to grab version compatible JARs of the Amazon S3 client, HDFS, and Hadoop with S3 compatibility. After flailing around downloading packages, hitting an error, downloading more JARs, and finally getting Flume working I realized there had to be a better way to replicate the process.

Enter Maven! Since we’re just grabbing down JARs it’s actually possible to use a pom.xml to describe what dependencies we need, let Maven grab the JARs, and then copy the JARs into a local folder. Here’s a working pom.xml file against Flume 1.6:

To use it, just run “mvn process-sources” and you’ll end up with all the JARs conveniently in a “lib/” folder in the current directory. Copy those JARs into the “lib/” folder of your Flume download and you should be off to the races. Note: These are very possibly more JARs than you need to get Flume running but as Maven dependencies this is the simplest I could come up with.

Flushing out the steps to getting a working S3 sink you should be able to do the following:

Before you run the last command to launch Flume you’ll need to edit “agent1.conf” to enter your Amazon token, secret key, and S3 bucket location. You’ll need to create the S3 bucket before trying to write to it with Flume. And then finally, to test that everything is working you can use netcat with the following:

Back on the terminal with Flume you should see debug data about receiving the message an a notification about an S3 upload. So what’s next? Not much, you’ll need to pick an appropriate source and then tune your HDFS and channel parameters for the amount of throughput you need.

As always, questions and comments welcome!

ML: Taking AWS machine learning for a spin

I’ll preface this by saying that I know just enough about machine learning to be dangerous and get myself into trouble. That said, if anything is inaccurate or misleading let me know in the comments and I’ll update it. Last April Amazon announced Amazon Machine Learning, a new AWS service aimed at developers to help them build and deploy machine learning solutions. We’ve been excited to experiment with AWS ML since it launched but haven’t had a chance until just now.

A bit of background

So what is “machine learning”? Looking at Wikipedia’s definition machine learning is ‘is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a “Field of study that gives computers the ability to learn without being explicitly programmed”.’ That definition in turn translates to using a computer to solve problems like regression or classification. Machine learning powers dozens of the products that internet users interact with everyday from spam filtering to product recommendations to Siri and Google Now.

Looking at the Wikipedia article, ML as a field has existed since the late 1980s so what’s been driving its recent growth in popularity? I’d argue key driving factors have been compute resources getting cheaper, especially storage, which has allowed companies to store orders of magnitude more data than they were 5 or 10 years ago. This data along with elastic public cloud resources and the increasing maturity of open source packages has made ML accessible and worthwhile for an increasingly large number of companies. Additionally, there’s been an explosion of venture capital funding into ML focussed startups which has certainly also helped boost its popularity.

Kicking the tires

The first thing we need to do before testing out Amazon ML was to pick a good machine learning problem to tackle. Unfortunately, we didn’t have any internal data to test with so I headed over to Kaggle to find a good problem to tackle. After some exploring I settled on Digit Recognizer since its a “known problem”, the Kaggle challenge had benchmark solutions, and no additional data transformations would be neccessary. The goal of the Digit Recognizer problem is to accept bitmap representations of handwritten numerals and then correctly output what number was written.

The dataset is a modified version of the Mixed National Institute of Standards and Technology which is a well known dataset often used for training image processing systems. Unlike the original MNIST images, the Kaggle dataset has already been converted to a grayscale bitmap array so individual pixels are represented by an integer from 0-255. In ML parlance, the “Digit Recognizer” challenge would fall under the umbrella of a classification problem since the goal would be to correctly “classify” unknown inputs with a label, in this case a 0-9 digit. Another interesting feature of the MNIST dataset is that the Wikipedia provides benchmark performance for a variety of approaches so we can have a sense of how AWS ML stacks up.

At a high level, the big steps we’re going to take are to train our model using “train.csv”, evaluate it against a subset of known data, and then predict labels for the rows in “test.csv”. Amazon ML makes this whole process pretty easy using the AWS Console UI so there’s not really any magic. One thing worth noting is that Amazon doesn’t let you select which algorithm will be used in the model you build, it selects it automatically based on the type of ML problem. After around 30 minutes your model should be built and you’ll be able to explore the model’s performance. This is actually a really interesting feature of Amazon ML since you wouldn’t get these insights with visualizations “out of the box” from most open source packages.

Performance

With the model built the last step is to use it to predict unknown values from the “test.csv” dataset. Similar to generating the model, running a “batch prediction” is pretty straightforward on the AWS ML UI. After the prediction finishes you’ll end up with a results file in your specified S3 bucket that looks similar to:

Because there are several possible classifications of a digit the ML model generates a probability per classification with the largest number being the most likely. Individual probabilities are great but what we really want is a single digit per input sample. Running the input through the following PHP will produce that along with a header for Kaggle:

And finally the last step of the evaluation is uploading our results file to Kaggle to see how our model stacks up. Uploading my results produced a score of 0.91671 so right around 92% accuracy. Interestingly, looking at the Wikipedia entry for MNIST a 8% error rate is right around what was academically achieved using a linear classifier. So overall, not a bad showing!

Takeaways

Comparing the model’s performance to the Kaggle leaderboard and Wikipedia benchmarks, AWS ML performanced decently well especially considering we took the defaults and didn’t pre-process the data. One of the downside of AWS ML is the lack of visibility into what algorithms are being used and additionally not being able to select specific algorithms. In my experience, solutions that mask complexity like this work great for “typical” use cases but then quickly breakdown for more complicated tasks. Another downside of AWS ML is that it can currently only process text data that’s formatted into CSVs with one record per row. The result of this is that you’ll have to do any data transformations with your own code running on your own compute infrastructure or AWS EC2.

Anyway, all in all I think Amazon’s Machine Learning product is definitely an interesting addition to the AWS suite. At the very least, I can see it being a powerful tool to be able to quickly test out ML hypothesis which can then be implemented and refined using an open source package like skit-learn or Apache Mahout.

AWS: Using Kinesis with PHP for real time stream processing

Over the last few weeks we’ve been working with one of our clients to build out a real time data processing application. At a high level, the system ingests page view data, processes it in real time, and then ingests it into a database backend. In terms of scale, the system would need to start off processing roughly 30,000 events per minute at peak with the capability to scale out to 100,000 events per minute fairly easily. In addition, we wanted the data to become available to query “reasonably quickly” so that we could iterate quickly on how we were processing data.

To kick things off, we began by surveying the available tools to ingest, process, and then ultimately query the data. On the datawarehouse side, we had already had some positive experiences with Amazon Redshift so it was a natural choice to keep using it moving forward. In terms of ingestion and processing, we decided to move forward with Kinesis and Gearman. The fully managed nature of Kinesis made it the most appealing choice and Gearman’s strong PHP support would let us develop workers in a language everyone was comfortable with.

Our final implementation is fairly straightforward. An Elastic Load Balancer handles all incoming HTTP requests which are routed to any number of front end machines. These servers don’t do any computation and fire of messages into a Kinesis stream. On the backend, we have a consumer per Kinesis stream shard that creates Gearman jobs for pre-processing as well as Redshift data ingestion. Although it’s conceptually simple, there’s a couple of “gotchas” that we ran into implementing this system:

Slow HTTP requests are a killer: The Kinesis API works entirely over HTTP so anytime you want to “put” something into the stream it’ll require a HTTP request. The problem with this is that if you’re making these requests in real time in a high traffic environment you run the risk of locking up your php-fpm workers if the network latency to Kinesis starts to increase. We saw this happen first hand, everything would be fine and then all of a sudden the latency across the ELB would skyrocket when the latency to Kinesis increased. To avoid this, you need to make the Kinesis request in the background.

SSL certificate verification is REALLY slow: Kinesis is only available over HTTPs so by default the PHP SDK (I assume others as well) will perform an SSL key verification every time you use a new client. If you’re making Kinesis requests inside your php-fpm workers that means you’ll be verifying SSL keys on every request which turns out to be really slow. You can disable this in the official SDK using the “curl.options” parameter and passing in “CURLOPT_SSL_VERIFYHOST” and “CURLOPT_SSL_VERIFYPEER”

There’s no “batch” add operation: Interestingly Apache Kafka, which Kinesis is based on, supports batch operations but unfortunately Kinesis doesn’t. You have to make an HTTP request for every message you’re adding to the stream. What this means is that even if you’re queuing the messages in the background, you’ll still need to loop through them all firing off HTTP requests

Your consumer needs to be fast: In your consumer, you’ll basically end up with code that looks like – https://gist.github.com/adatta02/842531b3fe93097ee030 Because Kinesis shard iterators are only valid for 5 minutes, you’ll need to be cognizant of how long the inner for loop takes to run. Each “getRecords” call can return a max of 10,000 records so you’ll need to be able to process 10k records in less than 5 minutes. Our solution for this was to offload all the actual processing to Gearman jobs.

Anyway, we’re still fairly new to using Kinesis so I’m sure we’ll learn more about using it as the system is in production. A few things have already been positive including that it makes testing new code locally easy since you can just “tap” into the stream, scaling up looks like it means just adding additional shards, and since its managed we’ve got one less thing to worry about.

As always, questions and comments welcome!