AWS: What are the key Amazon Web Services components?

Over the last couple of years, the popularity of “cloud computing” has grown dramatically and, along with it, so has the dominance of Amazon Web Services (AWS) in the market. Unfortunately, AWS doesn’t do a great job of explaining exactly what AWS is, how its pieces work together, or what typical use cases for its components may be. This post is an effort to address this by providing a whip-around overview of the key AWS components and how they can be effectively used.

Great, so what is AWS? Generally speaking, Amazon Web Services is a loosely coupled collection of “cloud” infrastructure services that allows customers to “rent” computing resources. What this means is that using AWS, you as the client are able to flexibly provision various computing resources on a “pay as you go” pricing model. Expecting a huge traffic spike? AWS has you covered. Need to flexibly store anywhere between 1 GB and 100 GB of photos? AWS has you covered. Additionally, the components that make up AWS are generally loosely coupled, meaning that they can work independently or in concert with other AWS resources.

Since AWS components are loosely coupled, you can mix and match only what you need, but here is an overview of the key services.

Route53

What is it? Route53 is a highly available, scalable, and feature-rich Domain Name System (DNS) web service. What a DNS service does is translate a domain name like “setfive.com” into an IP address like 64.22.80.79 which allows a client’s computer to “find” the correct server for a given domain name. In addition, Route53 also has several advanced features normally only available in pricey enterprise DNS solutions. Route53 would typically replace the DNS service provided by your registrar like GoDaddy or Register.com.

Should you use it? Definitely. Although it isn’t free, after last year’s prolonged GoDaddy outage it’s clear that DNS is a critical component and using a company that treats it as such is important.

Simple Email Service

What is it? Simple Email Service (SES) is a hosted transactional email service. It allows you to easily send highly deliverable emails using a RESTful API call or via regular SMTP without running your own email infrastructure.

Should you use it? Maybe. SES is comparable to services like SendGrid in that it offers a highly deliverable email service. Although it is missing some of the features that you’ll find on SendGrid, its pricing is attractive and the integration is straightforward. We normally use SES for application emails (think “Forgot your password”) but then use MailChimp or SendGrid for marketing blasts and that seems to work pretty well.
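To give a feel for the SMTP route, here is a minimal Python sketch that uses only the standard library. The sender and recipient addresses, the reset URL, and the function names are made up for illustration, and the us-east-1 endpoint is an assumption, so substitute the SMTP endpoint and SMTP credentials for your own region and account.

```python
import smtplib
from email.message import EmailMessage

# Assumption: the SES SMTP endpoint for the us-east-1 region; yours may differ.
SES_SMTP_HOST = "email-smtp.us-east-1.amazonaws.com"
SES_SMTP_PORT = 587  # STARTTLS port


def build_password_reset_email(sender, recipient, reset_url):
    """Build a plain-text "Forgot your password" message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Reset your password"
    msg.set_content(f"Follow this link to reset your password:\n{reset_url}\n")
    return msg


def send_via_ses(msg, smtp_username, smtp_password):
    """Send a message over SMTP using SES SMTP credentials (not your AWS login)."""
    with smtplib.SMTP(SES_SMTP_HOST, SES_SMTP_PORT) as server:
        server.starttls()
        server.login(smtp_username, smtp_password)
        server.send_message(msg)
```

Building the message and sending it are kept separate so the same message could just as easily be handed to a different provider.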

Identity and Access Management

What is it? Identity and access management (IAM) provides enhanced security and identity management for your AWS account. In addition, it allows you to enable “multi factor” authentication to enhance the security of your AWS account.

Should you use it? Definitely. If you have more than one person accessing your AWS account, using IAM will allow everyone to get a separate account with fine grained permissions. Multi factor authentication is also critically important since a compromise at the infrastructure level would be catastrophic for most businesses. Read more about IAM here.

Simple Storage Service

What is it? Simple storage service (S3) is a flexible, scalable, and highly available storage web service. Think of S3 like having an infinitely large hard drive where you can store files which are then accessible via a unique URL. S3 also supports access control, expiration times, and several other useful features. Additionally, the payment model for S3 is “pay as you go” so you’ll only be billed for the amount of data you store and how much bandwidth you use to transfer it in and out.
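To make the “pay as you go” model concrete, here is a back-of-the-envelope sketch in Python. The two prices are hypothetical placeholders and not Amazon’s actual rates, and the single flat transfer rate is a simplification, so check the current S3 pricing page before relying on any numbers.

```python
# Hypothetical prices for illustration only; these are NOT real S3 rates.
STORAGE_PRICE_PER_GB_MONTH = 0.10
TRANSFER_PRICE_PER_GB = 0.12


def estimate_monthly_cost(stored_gb, transferred_gb):
    """Estimate a monthly bill from storage used plus data transferred."""
    storage_cost = stored_gb * STORAGE_PRICE_PER_GB_MONTH
    transfer_cost = transferred_gb * TRANSFER_PRICE_PER_GB
    return round(storage_cost + transfer_cost, 2)
```

With these made-up rates, storing 100 GB and transferring 50 GB in a month would come to 16.00, and a month with nothing stored or transferred would cost nothing.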

Should you use it? Definitely. S3 is probably the most widely used AWS service because of its attractive pricing and ease of use. If you’re running a site with lots of static assets (images, CSS assets, etc.), you’ll probably get a “free” performance boost by hosting those assets on S3. Additionally, S3 is an ideal solution for incremental backups, both data and code. We use S3 extensively, usually for hosting static files, frequently backing up MySQL databases, and backing up git repositories. The new AWS S3 Console also makes administering S3 and using it non-programmatically much easier.

Elastic Compute Cloud

What is it? Elastic Compute Cloud (EC2) is the central piece of the AWS ecosystem. EC2 provides flexible, on-demand computing resources with a “pay as you go” pricing model. Concretely, what this means is that you can “rent” computing resources for as long as you need them and process any workload on the machines you’ve provisioned. Because of its flexibility, EC2 is an attractive alternative to buying traditional servers for unpredictable workloads.

Should you use it? Maybe. Whether or not to use EC2 is always a controversial discussion because the complexity it introduces doesn’t always justify its benefits. As a rule of thumb, if you have unpredictable workloads like sporadic traffic, using EC2 to run your infrastructure is probably a worthwhile investment. However, if you’re confident that you can predict the resources you’ll need, you might be better served by a “normal” VPS solution like Linode.

Elastic Block Store

What is it? Elastic block store (EBS) provides persistent storage volumes that attach to EC2 instances, allowing you to keep data past the lifespan of a single EC2 instance. Due to the architecture of elastic compute cloud, all the storage systems on an instance are ephemeral. This means that when an instance is terminated all the data stored on that instance is lost. EBS addresses this issue by providing persistent storage that appears on instances as a regular hard drive.

Should you use it? Maybe. If you’re using EC2, you’ll have to weigh the choice between using only ephemeral instance storage or using EBS to persist data. Beyond that, EBS has well documented performance issues so you’ll have to be cognizant of that while designing your infrastructure.

CloudWatch

What is it? CloudWatch provides monitoring for AWS resources including EC2 and EBS. CloudWatch enables administrators to view and collect key metrics and also set a series of alarms to be notified in case of trouble. In addition, CloudWatch can aggregate metrics across EC2 instances which provides useful insight into how your entire stack is operating.
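The idea of aggregating a metric across instances and alarming on it can be sketched in a few lines of Python. This is only a toy model of the concept and not the CloudWatch API; the instance IDs, the CPU readings, and the 80% threshold are all made up.

```python
from statistics import mean


def average_cpu(cpu_by_instance):
    """Aggregate a per-instance CPU metric into a single stack-wide average."""
    return mean(cpu_by_instance.values())


def alarm_triggered(cpu_by_instance, threshold_percent=80.0):
    """Return True when the aggregated CPU usage exceeds the alarm threshold."""
    return average_cpu(cpu_by_instance) > threshold_percent


# Hypothetical readings keyed by made-up instance IDs.
readings = {"i-aaa": 95.0, "i-bbb": 85.0, "i-ccc": 90.0}
```

Here the three instances average out to 90% CPU, so an alarm with an 80% threshold would fire even though no single reading was inspected on its own.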

Should you use it? Probably. CloudWatch is significantly easier to set up and use than tools like Nagios but it’s also less feature rich. We’ve had some success coupling CloudWatch with PagerDuty to provide alerts in case of critical service interruptions. You’ll probably need additional monitoring on top of CloudWatch but it’s certainly a good baseline to start with.

Anyway, the AWS ecosystem includes several additional services but these are the ones that I felt are key to getting started on AWS. We haven’t had a chance to use it yet but Redshift looks like it’s an exciting addition which will probably make this list soon. As always, comments and feedback welcome.

Symfony2: Configuring VichUploaderBundle and Gaufrette to use AmazonS3

Last week, I was looking to install the VichUploaderBundle into a Symfony2 project to automatically handle file uploads. As I was looking through the Vich documentation I ran across a chunk describing being able to use Gaufrette to skip the local filesystem and push files directly to Amazon S3. Since we’d eventually need to load balance the app and push uploaded files to S3 anyway, I decided to set it up out of the gate. Unfortunately, the documentation for setting up Vich with Gaufrette is a bit opaque so here’s a step by step guide to getting it going.

Install Everything

The first thing you’ll want to do is install all the required packages. If you’re using Composer, the following will work:

Once all the packages are installed, you’ll need to configure *both* Gaufrette and Vich. This is where the documentation broke down a bit for me. You’ll need your Amazon AWS “Access Key ID” and “Secret Key” which are both available at https://portal.aws.amazon.com/gp/aws/securityCredentials if you’re logged into AWS.

Configure It

Once everything is configured at the YAML level, the final step is adding the Vich annotations to your entities.

Make sure you add the “@Vich\Uploadable” annotation to your Entity or Vich will fail silently.

The “mapping” specified in @Vich\UploadableField(mapping="logo", fileNameProperty="logo") needs to match the value under “vich_uploader.mappings” which you defined in config.yml.

Finally, one last “gotcha” to be cognizant of is this bug: https://github.com/dustin10/VichUploaderBundle/issues/123. Since Vich uses Doctrine lifecycle callbacks to manage files, if no Doctrine fields are changed then the Vich code isn’t executed. The easiest way to get around this (and what we used) is to manually update the “updated_at” column every time a form is submitted to ensure that the upload handling code is executed.

Anyway, as always, questions and comments are welcome.

Amazon Web Services: Using AWS? You Should Enable IAM

Most of our clients are using Amazon Web Services for most, if not all, of their infrastructure needs. They’re doing things like using EC2 for servers, S3 for storage and backups, Route53 for DNS, and SES for sending transactional email. For the most part, everything works pretty well and the overall experience is pretty solid. One issue that does come up is that with this strong reliance on Amazon, a lot of people within an organization end up needing to log in to the AWS Console. Doing things like pulling data off S3, managing EC2 instances, and creating email addresses all ultimately require logging in to Amazon. Unfortunately, as an organization grows they’ll usually end up passing around a single “master password” for their single Amazon account. Passing around a password like this poses a huge operational risk but AWS actually has built in functionality to mitigate this called Amazon IAM, which helps you administer access rights on your account.

What is it?

Amazon IAM is AWS’s identity and access management solution. It allows you to add additional authorized users to your Amazon account, organize them in groups, and then grant the individual groups various permissions on your account. IAM would allow you to do something like set up a group called “access backup only”, add 3 users to it, and then only allow them to download files from S3. From an operational perspective, IAM will allow every user that needs access to have their own account with its own set of permissions which can be revoked at any time.
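Permissions in IAM are expressed as JSON policy documents that you attach to a group or user. As a rough sketch, here is how a download-only policy for a group like “access backup only” could be generated in Python. The bucket name “example-backups” and the function name are hypothetical, and you’d want to review the exact actions against your own needs.

```python
import json


def build_read_only_s3_policy(bucket_name):
    """Return a policy document that allows listing and downloading from one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Downloading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# "example-backups" is a hypothetical bucket name.
policy_json = json.dumps(build_read_only_s3_policy("example-backups"), indent=2)
```

Since nothing else is allowed, users in the group can’t upload or delete files, and they can’t touch any other AWS service.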

Why you should use it

The biggest direct benefit to using IAM is that you’ll be able to give every authorized user a separate account which they can access AWS with. This means if you have to terminate an employee or stop working with an agency you won’t have to do a “fire drill” and change your AWS password or worry about which access keys they have. On top of this, since each group has limited permissions you can be confident that inexperienced users won’t accidentally do something inappropriate.

The other big benefit to implementing IAM is that you’ll be able to take advantage of multi-factor authentication. Multi-factor authentication basically means that instead of *just* needing a password to log in, you’ll also need a one-time use secure token. MFA tokens can be generated in several ways, from an RSA token to a smartphone app. If you’re already using Google’s Authenticator app for your Google Account (and you should) you can just link it in with your IAM account.
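If you’re curious how a smartphone app comes up with those tokens, authenticator apps typically implement the time-based one-time password algorithm (TOTP, RFC 6238). Here is a minimal sketch in Python using only the standard library. It takes the shared secret as raw bytes, whereas real apps usually receive it base32 encoded, and the function name and defaults are my own choices.

```python
import hashlib
import hmac
import struct
import time


def totp(secret, for_time=None, step=30, digits=6):
    """Compute a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    if for_time is None:
        for_time = time.time()
    # The moving factor is the number of 30 second steps since the Unix epoch.
    counter = int(for_time // step)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick a 4 byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Both the server and the app hold the same secret and the same clock, so they arrive at the same code without ever sending the secret over the wire.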

Anyway, enable Amazon IAM and you’ll sleep better at night.

S3: Using Amazon S3 for large file transfers

A few days ago, a friend of mine reached out asking for a good solution for securely transferring a relatively large (~1GB) file to several of her prospective clients. Strangely, even in 2013 the options for transferring such a large file in a reliable manner are pretty limited. I looked into services like YouSendIt, WeTransfer, and SendThisFile but they all suffer from similar limitations. Most of them have a <1GB file size limit, their payment plans are monthly subscription based instead of pay as you go, and they don’t offer custom domains or access control. Apart from these services, there is also the trusty old school option of using an FTP server but that raises the issue of having to maintain your own FTP server, using a non-intuitive FTP client, and still being locked into paying a monthly fee instead of “pay as you go”.

Stepping back and looking at the issue from a different angle, it then became clear that the S3 component of Amazon’s Web Service offering is actually an ideal solution for this problem. The S3 piece of AWS is basically a flexible “cloud based” storage solution that lets you programmatically upload files, store them indefinitely, and then serve them as you please.

Looking at the issues we’re trying to overcome, S3 satisfies all of them out of the box. S3 has a single file size limit of 5 Terabytes, files can be served off a custom domain like archives.setfive.com, billing is pay as you go depending on the resources you use, and S3 supports access control so you have fine grained control over who can download files and for how long. So how do you actually use S3?

Setting up and using S3

  • The first thing you’ll need is an Amazon account that has S3 enabled. If you already have an Amazon account, just head over to http://aws.amazon.com/s3/ to activate S3 for your account.
  • Next, there are several ways to actually use S3 but the easy way is probably using Amazon’s own Web Console. Just head over to https://console.aws.amazon.com/s3/home?region=us-east-1 to load the console.
  • In AWS parlance, you’ll need to create a “bucket” which is the root organizational structure on S3. You can map a “bucket” to a custom domain name so think of it like the “drive” that you upload files to. Go ahead and create a bucket!
  • Next, click the name of your bucket and you’ll get “into” the bucket where you should see a notice telling you the bucket is empty. This is where you can upload and delete files or create additional organizational folders. To upload a file, click the “Actions” menu in the header and select “Upload”. Click upload, and then in the popup select “Add Files” to add some files and “Start Upload” to kick off the upload.
  • When the upload finishes, in the left panel you’ll see the file you just uploaded. Congratulations, you’re using the cloud! If you want to make the file PUBLIC, just right click on it and click “Make Public”. This will let you access the file without any special URL arguments, like https://s3.amazonaws.com/big-bertha/logo_horizontal.png
  • To get the link for your file, click it to see the properties and then on the right panel you’ll see the link.
  • To delete a file, just right click on it and select “Delete”
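Public links follow a predictable pattern, so if you’re sending out several files you can build the URLs yourself instead of copying each one from the console. Here is a small Python sketch of that, assuming the path-style URL format shown in the steps above. The function name and the second example key are made up.

```python
from urllib.parse import quote


def public_object_url(bucket, key):
    """Build the path-style URL for a publicly readable S3 object."""
    # quote() escapes spaces and other unsafe characters but keeps "/" intact.
    return f"https://s3.amazonaws.com/{bucket}/{quote(key)}"
```

For example, public_object_url("big-bertha", "logo_horizontal.png") gives back the same link as the one in the steps above. The link will only work if the file has been made public.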

Anyway, that’s a quick rundown of how to use Amazon’s S3 service for file transfers. The pricing is also *very* cheap compared to traditional “large file transfer” services.


Big Data: What is “Big Data”?

Last week, I was catching up with a friend of mine and we started chatting about his most recent project. As we were chatting, he made an offhand comment about how some of the business guys on the team love to refer to what they are working on as a “big data” play, even though it really wasn’t. This stuck with me because the vague definitions around “big data” make it easy to shoehorn problems into a “big data” play. Because of this, I think it’s worth taking a step back and discussing what big data really is and what tools are available to work with it.

It’s all just data

At the end of the day, data is data. It doesn’t really matter if it’s stored in a CSV text file, a MySQL database, or a NoSQL datastore like Cassandra or MongoDB. Typically though, web applications tend to use a relational database like MySQL or Postgres to persist data. Relational databases store data in a series of tables which are in turn arranged as a series of rows and columns. As an abstraction, think of a series of Excel worksheets which can have links between the rows of each sheet.

For most applications, this works out fine. The database ends up managing say a few thousand customer accounts, each with a few hundred thousand objects associated with them, and the total dataset fits conveniently into the server’s RAM. Since the dataset is relatively small, things like retrieving information, updating records, and running ad-hoc analytics queries are all easy to implement and relatively fast. But what happens if your dataset doesn’t fit into the memory of even the beefiest of servers? Therein lies the “big data” problem.

Certain applications generate an enormous amount of data on a daily basis. For example, look at Mixpanel: tracking discrete user interactions is going to produce hundreds of thousands of datapoints every day even with just a few clients. With this volume of data, typical relational databases quickly start performing sluggishly and eventually stop being effective entirely. Even simple queries like counting the “# of clicks by user” start to take hours to run, effectively becoming intractable. Although specialized relational databases like Vertica and Oracle 11g do exist to help solve this problem, they’re expensive and proprietary.
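For reference, this is what a “# of clicks by user” query looks like on a dataset that comfortably fits on one machine. The sketch uses Python’s built-in sqlite3 module with an in-memory database, and the “events” table, its columns, and the sample rows are all made up for illustration.

```python
import sqlite3

# A throwaway in-memory database with a hypothetical "events" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, action) VALUES (?, ?)",
    [(1, "click"), (2, "click"), (1, "click"), (1, "view"), (2, "view")],
)


def clicks_by_user(connection):
    """Count click events per user with a plain GROUP BY query."""
    rows = connection.execute(
        "SELECT user_id, COUNT(*) FROM events "
        "WHERE action = 'click' GROUP BY user_id ORDER BY user_id"
    )
    return rows.fetchall()
```

On five rows this returns instantly. The query itself stays the same as the table grows; what changes is how long the database takes to answer it.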

Enter the elephant

One of the first companies to publicly discuss their big data strategies was Google, in papers like “MapReduce: Simplified Data Processing on Large Clusters” and “Bigtable: A Distributed Storage System for Structured Data”, which described its MapReduce programming model and its BigTable data storage system. Although these are proprietary solutions, the MapReduce research was used as the basis for Apache Hadoop, an open source framework for running MapReduce style jobs over large datasets.

At this point, Hadoop has distinguished itself as the most popular open source big data solution with a rich ecosystem of tools and several companies providing professional services and support including Cloudera and Hortonworks. What Hadoop provides is a low level framework for allowing computation jobs to be distributed across several servers within a cluster. This allows tools to split up very large datasets into smaller chunks, distribute computational tasks across the cluster, and finally assemble the result. So with the Hadoop framework in place, you still need specific tools built to leverage the distributed framework.
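The split, distribute, and assemble pattern is easier to see in code. Below is a toy, single-process Python sketch of a MapReduce style job that counts clicks per user. It is not Hadoop code: the function names and sample events are made up, and on a real cluster each chunk would be handed to a different server instead of being processed in a loop.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Split a large list of records into smaller chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count clicks per user within a single chunk."""
    return Counter(user for user, action in chunk if action == "click")


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_clicks(records, chunk_size=2):
    """Run the map step over every chunk, then assemble the final result."""
    partials = [map_chunk(chunk) for chunk in split_into_chunks(records, chunk_size)]
    return reduce(reduce_counts, partials, Counter())


# Hypothetical (user, action) events.
events = [
    ("alice", "click"), ("bob", "click"), ("alice", "view"),
    ("alice", "click"), ("bob", "view"),
]
```

Because each chunk is counted independently and the partial counts can be merged in any order, the work can be spread over as many machines as you have chunks.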

The toolbox

There are several tools that effectively leverage Hadoop but here are some of my favorites for quickly building out a cluster:

Apache Whirr – Automates deploying, bootstrapping, and configuring a Hadoop cluster. Whirr will save you hours of time because, instead of manually starting 4 EC2 instances and configuring them all, you can kickstart a cluster with a single command.

Apache HBase – A column store database that is similar to Google’s original BigTable system. Great for storing billions of records across a Hadoop HDFS file system.

Apache Hive – A data warehousing solution that allows you to run “SQL like” queries using Hadoop. It also has native support for pulling data out of MySQL, making it a convenient addition to a stack that includes MySQL.

Apart from these, there are dozens of other Hadoop powered tools but it’s impossible to recommend a single silver bullet without knowing the details of your “big data” problem.