S3Grep – Searching S3 Files and Buckets

On a recent project, data that should have been filtered out appeared to be coming into our Extract, Transform, Load (ETL) processes. In this case the files we imported existed for at most 7 days, and on any given day tens of thousands of files would be created and imported. That made it difficult to tell whether something inside our ETL had gone awry or whether we were being fed bad data. Furthermore, because the files were always deleted after importing, we kept no record of which file a data point came from.

Instead of updating our ETL process to track where a specific piece of data originated, we wanted to basically ‘grep’ the files in S3. After looking around, it didn’t look like anyone had built a “Grep for S3”, so we built one. We didn’t simply download the files locally and process them one at a time because it would take forever to transfer them and then grep each one sequentially. Instead, we wanted to do the search in parallel without holding entire files on the local disk.
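As a rough illustration of that idea, here is a minimal sketch of searching many streams in parallel, one line at a time. It is not the S3Grep source: the class and method names are hypothetical, and in-memory byte streams stand in for the S3 object streams so the example runs without any AWS dependency.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical sketch: names and structure are assumptions, not S3Grep's code.
class ParallelStreamSearchSketch {

    // Reads the stream line by line and stops at the first line containing
    // the needle, so the whole file is never held in memory or on disk.
    static boolean streamContains(InputStream in, String needle) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Submits one search task per source to a fixed-size thread pool and
    // returns the names of the sources that contain the needle.
    static List<String> search(Map<String, Supplier<InputStream>> sources,
                               String needle, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, Supplier<InputStream>> entry : sources.entrySet()) {
                Supplier<InputStream> opener = entry.getValue();
                Callable<Boolean> task = () -> streamContains(opener.get(), needle);
                pending.put(entry.getKey(), pool.submit(task));
            }
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, Future<Boolean>> entry : pending.entrySet()) {
                if (entry.getValue().get()) {
                    hits.add(entry.getKey());
                }
            }
            return hits;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-ins for S3 objects (assumption for the sketch).
        Map<String, Supplier<InputStream>> sources = new LinkedHashMap<>();
        sources.put("imports/a.csv", () -> new ByteArrayInputStream(
            "id,status\n1,ok\n2,bad\n".getBytes(StandardCharsets.UTF_8)));
        sources.put("imports/b.csv", () -> new ByteArrayInputStream(
            "id,status\n3,ok\n".getBytes(StandardCharsets.UTF_8)));

        // One worker per available CPU.
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(search(sources, "bad", threads));
    }
}
```

Each task opens its own stream only when a worker picks it up, so the number of files in flight at once is bounded by the size of the thread pool.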

With this in mind we came up with our simple S3Grep Java app (a pre-built jar is located in the releases), which searches all files in a specific bucket for a specific string. It currently supports both regex and plain-text search strings. You can specify how many threads it should use to process the files; by default it will try to use one thread per CPU on your machine. It utilizes the S3 Java adapter to read each file as a stream, rather than transferring the whole file and then reading it from disk. Using the tool is very simple:

The s3grep.properties file is a config file where you set up what you are searching for. An example:

For the most part this is self-explanatory. The log level defaults to INFO; if you specify DEBUG, it will output more information, such as which file it is currently checking. The logger_pattern parameter defaults to “%d{dd MMM yyyy HH:mm:ss} [%p] %m%n” and can be any pattern you want. For more information on the formatting, visit the PatternLayout Documentation.

The default output format would look something like this:

If you want less verbose output that contains just the log lines, you can update the logger_pattern to be just %m%n and end up with something similar to:

The format of the output is FILE:LINE_NUMBER:matching_string.
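To make that format concrete, here is a small sketch of how a matching line could be turned into a FILE:LINE_NUMBER:matching_string entry for either a regex or a plain-text search. It is an illustration only, not the S3Grep source: the class and method names are hypothetical, and it assumes that line numbers start at 1 and that the whole matching line is printed, as grep does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch: names and behaviour are assumptions, not S3Grep's code.
class GrepOutputSketch {

    // Builds a line test for either a regex or a plain-text search string.
    static Predicate<String> matcher(String search, boolean isRegex) {
        if (isRegex) {
            Pattern pattern = Pattern.compile(search);
            // find() matches anywhere in the line, like grep.
            return line -> pattern.matcher(line).find();
        }
        return line -> line.contains(search);
    }

    // Reads line by line and formats each hit as FILE:LINE_NUMBER:line.
    static List<String> grep(String fileName, BufferedReader reader,
                             Predicate<String> matches) throws IOException {
        List<String> out = new ArrayList<>();
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++; // assumed to be 1-based
            if (matches.test(line)) {
                out.add(fileName + ":" + lineNumber + ":" + line);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String content = "id=1 status=ok\nid=2 status=bad\nid=3 status=ok\n";
        BufferedReader reader = new BufferedReader(new StringReader(content));
        for (String hit : grep("imports/a.txt", reader, matcher("status=b.d", true))) {
            System.out.println(hit);
        }
    }
}
```

Keeping the regex and plain-text cases behind the same Predicate means the read loop does not need to know which kind of search string was configured.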

Anyway, we hope this helps if you are trying to hunt down which file contains a text string in your S3 buckets. Let us know if you have any questions or if we can help!

Amazon Web Services: Using AWS? You Should Enable IAM

Most of our clients are using Amazon Web Services for most, if not all, of their infrastructure needs. They’re doing things like using EC2 for servers, S3 for storage and backups, Route53 for DNS, and SES for sending transactional email. For the most part, everything works pretty well and the overall experience is pretty solid. One issue that does come up is that with this strong reliance on Amazon, a lot of people within an organization end up needing to log in to the AWS Console. Doing things like pulling data off S3, managing EC2 instances, and creating email addresses all ultimately require logging in to Amazon. Unfortunately, as an organization grows, it will usually end up passing around a single “master password” for its single Amazon account. Passing around a password like this poses a huge operational risk, but AWS has built-in functionality to mitigate it: Amazon IAM, which helps you administer access rights on your account.

What is it?

Amazon IAM is AWS’s identity and access management solution. It allows you to add additional authorized users to your Amazon account, organize them in groups, and then grant the individual groups various permissions on your account. For example, IAM would allow you to set up a group called “access backup only”, add 3 users to it, and then only allow them to download files from S3. From an operational perspective, IAM allows every user who needs access to have their own account with its own set of permissions, which can be revoked at any time.

Why you should use it

The biggest direct benefit of using IAM is that you’ll be able to give every authorized user a separate account for accessing AWS. This means that if you have to terminate an employee or stop working with an agency, you won’t have to do a “fire drill” and change your AWS password or worry about which access keys they have. On top of this, since each group has limited permissions, you can be confident that inexperienced users won’t accidentally do something inappropriate.

The other big benefit of implementing IAM is that you’ll be able to take advantage of multi-factor authentication (MFA). Multi-factor authentication basically means that instead of *just* needing a password to log in, you’ll also need a secure one-time token. MFA tokens can be generated in several ways, from an RSA token to a smartphone app. If you’re already using Google’s Authenticator app for your Google Account (and you should), you can just link it to your IAM account.

Anyway, enable Amazon IAM and you’ll sleep better at night.