Amazon S3 Archives - {5} Setfive - Talking to the World

When Amazon Web Services rolled out their version 4 signature we started seeing sporadic errors on a few projects when we created pre-authenticated link to S3 resources with a relative timestamp. Trying to track down the errors wasn’t easy. It seemed that it would occur rarely while executing the same exact code. Our code was simply to get a pre-authenticated URL that would expire in 7 days, the max duration V4 signatures are allowed to be valid. The error we’d get was “The expiration date of a signature version 4 presigned URL must be less than one week”. Weird, we kept passing in “7 days” as the expiration time. After the error occurred a couple of times over a few weeks I decided to look into it.

The code throwing the error was located right in the SignatureV4 class. The error is thrown when the end timestamp minus the start timestamp for the signature was greater than a week. Looking through the way the timestamps were generated it went something like this:

Generate the start timestamp as current time for the signature assuming one is not passed.
Do a few other quick things not related to this problem.
Do a check to insure that the end minus start timestamp is less than a week in seconds.

So a rough example with straight PHP could of the above steps for a ‘7 days’ expiration would be as follows:

Straight forward enough, right? the problem lies when a second “rolls” between generating the `$start` and the end timestamp check. For example, if you generate the `$start` at `2017-08-20 12:01:01.999999`. Let’s say this gets assigned the timestamp of `2017-08-20 12:01:01`. Then the check for the 7 weeks occurs at `2017-08-27 12:01:02.0000` it’ll throw an exception as duration between the start and end it’s actually now for 86,401 seconds total. It turns outs triggering this error is easier than you’d think. Run this script locally:

That will throw an exception within a few seconds of running most likely.

After I figured out the error, the next step was to submit a issue to make sure I’m not misunderstanding how the library should be used. The simplest fix for me was to generate the end expiration timestamp before generating the start timestamp. After I made the PR, Kevin S. from AWS pointed out that while this fixed the problem, the duration still wasn’t guaranteed to always be the same for the same relative time period. For example, if you created 1000 presigned URLs all with ‘+7 days’ as the valid period, some may be 86400 in duration others may be 86399. This isn’t a huge problem, but Kevin made a great point that we could solve the problem by locking the relative timestamp for the end based on the start timestamp. After adding that to the PR it was accepted. As of release 3.32.4 the fix is now included in the SDK.

On a project we were working on recently it appeared that we had data coming into our Extract, Transform, Load (ETL) processes which should have been filtered out. In this particular case the files which we imported only would exist at max up to 7 days and on any given day we’d have tens of thousands of files that would be created and imported. This presented a difficult problem to trace down if something inside our ETL had gone awry or if we were being fed bad data. Furthermore as the files always would be deleted after importing we didn’t keep where a data point was created from.

Instead of updating our ETL process to track where a specific piece of data originated from we wanted to basically ‘grep’ the files in S3. After looking around it doesn’t look like anyone has built a “Grep for S3”, so we built one. The reason we didn’t simply download the files locally and then process them one at a time is it’d take forever to transfer, then grep each one individual sequentially. Instead we wanted to do the search in parallel and not hold the entire files on the local disk.

With this we came up with our simple S3Grep java app (a pre-built jar is located in the releases) which will search all files in a specific bucket for a specific string. It currently supports both regex or non-regex search strings. You can specify how many threads you want it to use to process the files or it by default will try to use the same number of CPU’s on your machine. It utilizes the S3 Java adapter to read the files as a stream rather than a single transfer, than read from disk. Using the tool is very simple:

A the s3grep.properties file is a config file where you setup what you are searching for. An example:

For the most part this is self explanatory. The log level will default to INFO, however if you specify DEBUG it will output some more information such as what file’s it is currently checking. The logger_pattern parameter defaults to “%d{dd MMM yyyy HH:mm:ss} [%p] %m%n” and can be any pattern you want. For more information on the formatting visit the PatternLayout Documentation.

The default output format would look something like this:

If you want a little less verbose and more of just log lines you can update the logger_pattern to be just %m%n and end up with something similar to:

The format of the output is FILE:LINE_NUMBER:matching_string.

Anyways hope this helps you if you are trying to hunt down what file contains a text string in your S3 buckets. Let us know if you have any questions or if we can help!

We use Amazon Web Services quite a bit here. We not only use it to host most of our clients’ applications, but also for backups. We like to use S3 to store our backups as it is reliable, secure and very cheap. S3 stands for Amazon’s Simple Storage Service, it is more or less a limitless place to store data. You can mount S3 as a network hard drive but it’s main use is to store objects, or data, that you can retrieve at a low cost. It has 99.999999999% durability, so you most likely won’t lose anything, but even if you do, we use produce multiple backups for every object.

One thing we’ve noticed is that some people have issues interacting with S3, so here are a few things to help you out. First, if you are just looking to browse your S3 you can do so via your AWS Console or I like to use S3Fox. However, when you are looking to write some scripts or access it from the command line it can be difficult if you don’t use some pre-built tools. The best one we’ve found is s3cmd.

s3cmd allows you to list, update, create, delete objects and buckets in your S3. It’s really easy to install. Depending on your distribution of linux you can most likely get it from your package manager. Once you’ve done that you can configure it easily via ‘s3cmd –configure’. You’ll just need access credentials from your AWS account. Once you’ve set it up lets go through some useful commands.

To list your available buckets:

To create a bucket:

To list the contents of a bucket:

To put a file in the bucket it is very easy, just run (ie move tester-1.jpg to the bucket):

To delete the file you can run:

These are the basics. Probably the most common uses that we see are doing backups of data from a server to S3. An example of a bash script for this is as follows:

In this script it will just output the the console any errors. As you are most likely not running this by hand every day you’d want to change the “echo” statements to be mail commands or another way to alert administrators of an error on the backup. If you want to backup more than once a day all you need to change is the way the SQL_FILE variable is named to include hours for example.

This is a very simple backup script for MySQL. One thing that it doesn’t do is remove any old files, there is no reason for this to happen in the script. Amazon now has object lifecycles which allows you to automatically expire files in a bucket that are older than 60 days for example.

One thing that many people forget to do when they are making backups is to make sure that they actually work. We highly suggest that you once a month have a script which will check that whatever you are backing up is valid. This means if you are backing up a database that it checks to make sure that the database will reimport and that the data is valid (ie a row that should always exist does). The worst thing is finding out when you need a backup that your backup failed ages ago and you have no valid ones.

Make sure that your backups are not deleted quicker than it would take you to discover a problem. For example, if you only check your blog once a week, don’t have your backups delete after 5 days as you may discover a problem too late and your backups will also have the problem. Storage is cheap, keep backups for a long time.

Hope s3cmd makes your life easier and if you have any questions leave us a comment below!

Over the last couple of years, the popularity of the “cloud computing” has grown dramatically and along with it so has the dominance of Amazon Web Services (AWS) in the market. Unfortunately, AWS doesn’t do a great job of explaining exactly what AWS is, how its pieces work together, or what typical use cases for its components may be. This post is an effort to address this by providing a whip around overview of the key AWS components and how they can be effectively used.

Great, so what is AWS? Generally speaking, Amazon Web Services is a loosely coupled collection of “cloud” infrastructure services that allows customers to “rent” computing resources. What this means is that using AWS, you as the client are able to flexibly provision various computing resources on a “pay as you go” pricing model. Expecting a huge traffic spike? AWS has you covered. Need to flexibly store between 1 GB or 100 GB of photos? AWS has you covered. Additionally, each of the components that makes up AWS is generally loosely coupled meaning that they can work independently or in concert with other AWS resources.

Since AWS components are loosely coupled, you’d be able to mix and match only what you need but here is an overview of the key services.

Route53

What is it? Route53 is a highly available, scalable, and feature rich domain name service (DNS) web service. What a DNS service does is translate a domain name like “setfive.com” into an IP address like 64.22.80.79 which allows a client’s computer to “find” the correct server for a given domain name. In addition, Route53 also has several advanced features normally only available in pricey enterprise DNS solutions. Route53 would typically replace the DNS service provided by your registrar like GoDaddy or Register.com.

Should you use it? Definitely. Allow it isn’t free, after last year’s prolonged GoDaddy outage it’s clear that DNS is a critical component and using a company that treats it as such is important.

Simple Email Service

What is it? Simple Email Service (SES) is a hosted transactional email service. It allows you to easily send highly deliverable emails using a RESTful API call or via regular SMTP without running your own email infrastructure.

Should you use it? Maybe. SES is comparable to services like SendGrid in that it offers a highly deliverable email service. Although it is missing some of the features that you’ll find on SendGrid, its pricing is attractive and the integration is straightforward. We normally use SES for application emails (think “Forgot your password”) but then use MailChimp or SendGrid for marketing blasts and that seems to work pretty well.

Identity and Access Management

What is it? Identity and access management (IAM) provides enhanced security and identity management for your AWS account. In addition, it allows you to enable “multi factor” authentication to enhance the security of your AWS account.

Should you use it? Definitely. If you have more than 1 person accessing your AWS account using IAM will allow everyone to get a separate account with fine grained permissions. Multi factor authentication is also critically important since a compromise at the infrastructure level would be catastrophic for most businesses. Read more about IAM here.

Simple Storage Service

What is it? Simple storage service (S3) is a flexible, scalable, and highly available storage web service. Think of S3 like having an infinitely large hard drive where you can store files which are then accessible via a unique URL. S3 also supports access control, expiration times, and several other useful features. Additionally, the payment model for S3 is “pay as you go” so you’ll only be billed for the amount of data you store and how much bandwidth you use to transfer it in and out.

Should you use it? Definitely. S3 is probably the most widely used AWS service because of its attractive pricing and ease of use. If you’re running a site with lots of static assets (images, CSS assets, etc.), you’ll probably get a “free” performance boost by hosting those assets on S3. Additionally, S3 is an ideal solution for incremental backups, both data and code. We use S3 extensively, usually for hosting static files, frequently backing up MySQL databases, and backing up git repositories. The new AWS S3 Console also makes administering S3 and using it non-programmatically much easier.

Elastic Compute Cloud

What is it? Elastic Compute Cloud (EC2) is the central piece of the AWS ecosystem. EC2 provides flexible, on-demand computing resources with a “pay as you go” pricing model. Concretely, what this means is that you can “rent” computing resources for as long as you need them and process any workload on the machines you’ve provisioned. Because of its flexibility, EC2 is an attractive alternative to buying traditional servers for unpredictable workloads.

Should you use it? Maybe. Whether or not to use EC2 is always a controversial discussion because the complexity it introduces doesn’t always justify its benefits. As a rule of thumb, if you have unpredictable workloads like sporadic traffic using EC2 to run your infrastructure is probably a worthwhile investment. However, if you’re confident that you can predict the resources you’ll need you might be better served by a “normal” VPS solution like Linode.

Elastic Block Store

What is it? Elastic block store (EBS) provides persist storage volumes that attach to EC2 instances to allow you to persist data past the lifespan of a single EC2. Due to the architecture of elastic compute cloud, all the storage systems on an instance are ephemeral. This means that when an instance is terminated all the data stored on that instance is lost. EBS addresses this issue by providing persistent storage that appears on instances as a regular hard drive.

Should you use it? Maybe. If you’re using EC2, you’ll have to weigh the choice between using only ephemeral instance storage or using EBS to persist data. Beyond that, EBS has well documented performance issues so you’ll have to be cognizant of that while designing your infrastructure.

CloudWatch

What is it? CloudWatch provides monitoring for AWS resources including EC2 and EBS. CloudWatch enables administrators to view and collect key metrics and also set a series of alarms to be notified in case of trouble. In addition, CloudWatch can aggregate metrics across EC2 instances which provides useful insight into how your entire stack is operating.

Should you use it? Probably. CloudWatch is significantly easier to setup and use than tools like Nagios but its also less feature rich. We’ve had some success coupling CloudWatch with PagerDuty to provide alerts in case of critical service interruptions. You’ll probably need additional monitoring on top of CloudWatch but its certainly a good baseline to start with.

Anyway, the AWS ecosystem includes several additional services but these are the ones that I felt are key to getting started on AWS. We haven’t had a chance to use it yet but Redshift looks like it’s an exciting addition which will probably make this list soon. As always, comments and feedback welcome.

Last week, I was looking to install the VichUploaderBundle into a Symfony2 project to automatically handle file uploads. As I was looking through the Vich documentation I ran across a chunk describing being able to use Gaufrette to skip the local filesystem and push files directly to Amazon S3. Since we’d eventually need to load balance the app and push uploaded files to S3 anyway, I decided to set it up out of the gate. Unfortunately, the documentation for setting up Vich with Gaufrette is a bit opaque so here’s a step by step guide to getting it going.

Install Everything

The first thing you’ll want to do is install all the required packages. If you’re using Composer, the following will work:

Once all the packages are installed, you’ll need to configure *both* Gaufrette and Vich. This is where the documentation broke down a bit for me. You’ll need your Amazon AWS “Access Key ID” and “Secret Key” which are both available at https://portal.aws.amazon.com/gp/aws/securityCredentials if you’re logged into AWS.

Configure It

Once everything is configured at the YAML level, the final step is adding the Vich annotations to your entities.

Make sure you add the “@Vich\Uploadable” annotation to your Entity or Vich will fail silently.

The “mapping” specified in “@Vich\UploadableField(mapping=”logo”, fileNameProperty=”logo”)” needs to match the value under “vich_uploader.mappings” which you defined in config.yml

Finally, one last “gotcha” to be cognizant of is this bug – https://github.com/dustin10/VichUploaderBundle/issues/123. Since Vich uses Doctrine lifecycle callbacks to manage files, if no Doctrine fields are changed then the Vich code isn’t executed. The easiest way to get around this (and what we used), is just to manually update the “updated_at” column every time a form is submitted to ensure that the upload handling code is executed.

Anyway, as always, questions and comments are welcome.

Tag: Amazon S3

An Edge Case of Time in AWS PHP SDK

S3Grep – Searching S3 Files and Buckets

Using s3cmd to make interactaction with Amazon S3 easier, including simple backups

AWS: What are the key Amazon Web Services components?