Deleting files older than specified time with s3cmd and bash

Update: Amazon has now made it so you can set expiration times on objects in S3, see more here: https://forums.aws.amazon.com/ann.jspa?annID=1303

Recently I was working on a project where we upload backups to Amazon S3. I wanted to keep the files around for a certain duration and remove any files that were older than a month. We use the s3cmd utility for most of our command line based calls to S3, but it doesn’t have a built-in “delete any file in this bucket that is older than 30 days” function. After googling around a bit, we found some Python-based scripts, but nothing that was a simple bash script doing what I was looking for. I whipped this one up real quick – it may not be the best looking but it gets the job done:
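(A quick sketch along those lines – the bucket name below is a placeholder, and it assumes s3cmd is already configured plus GNU date for the “30 days ago” math.)

#!/bin/bash

BUCKET="s3://your-backup-bucket"
CUTOFF=$(date -d "30 days ago" +%s)

s3cmd ls "$BUCKET" | while read -r file_date file_time size path; do
    # Skip anything that isn't a plain object listing (e.g. DIR lines)
    [[ "$path" == s3://* ]] || continue

    # Compare each object's last-modified timestamp against the cutoff
    file_ts=$(date -d "$file_date $file_time" +%s)
    if [ "$file_ts" -lt "$CUTOFF" ]; then
        echo "Deleting $path (last modified $file_date $file_time)"
        s3cmd del "$path"
    fi
done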

Getting started with Hadoop, Hive, and Sqoop

I apologize for the buzzword-heavy title but it was the best I could do. I couldn’t find a good quick-start explaining how to get going with Hive so I thought I’d share my experiences.

Anyway, a client of ours came to us needing to analyze a dataset of about 200 million rows spanning 6 months, currently growing at about 10 million rows a week and accelerating. From a reporting standpoint, they were looking to run aggregate counts and group bys over the data and then display the results on charts. They were also looking to select subsets of the data for later use – basically SELECT * FROM table WHERE x AND y AND z.

Obviously, doing the calculations in real time was out of the question, so we knew we were looking for a solution that would be easy to use, meet the requirements above, and scale predictably with the increasing rate of data generation.

On the surface, MySQL looks like a decent approach but it presents a couple of issues pretty quickly:

  • In order for the SUM, GROUP BY, and COUNT queries to be at all useful, the MySQL tables would have to be heavily indexed. Unfortunately, due to the write-heavy workload of the app, this would mean copying the data into a separately indexed MySQL database before running any reports.
  • Even with indexes, MySQL was pretty awful at selecting subsets of the data from a performance perspective.
  • And probably the biggest issue with MySQL is that it doesn’t scale linearly in the sense that if the data is growing at 500 million rows a week you can’t simply “throw more hardware” at it and be done with it.

With requirements in hand we hit the Internet and finally arrived at Hive running on top of Hadoop. Per Wikipedia,

“Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.”

From our perspective, this stack fits our requirements nicely: it doesn’t rely on keeping a second “reporting” MySQL database available, it handles both the sum/count/group by queries and selecting subsets, and, probably most importantly, it will allow us – at least in the near term – to scale with the increasing rate of data generation.

To grossly oversimplify, Hadoop provides a framework that lets you break a data-intensive task into discrete pieces, run those pieces in a distributed fashion, and then combine the partial outputs into the result of the completed task. The quintessential example of a task that can be parallelized this way is sorting a *really* big list, since the list can be sorted in pieces and the sorted pieces merged at the end (see merge sort).
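As a toy, single-machine illustration of that map → sort → combine flow, the classic word-count pipeline does the same thing with nothing but shell pipes (input.txt is a stand-in for whatever data you have lying around):

# "map": emit one word per line
# "shuffle/sort": bring identical words next to each other
# "reduce": collapse each run of identical words into a count, then show the most common
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c | sort -rn | head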

The second piece of the tool chain is Hive. Again via Wikipedia,

“Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.”

Basically, Hive is a tool that leverages the Hadoop framework to provide reporting and query capabilities using a syntax similar to SQL.

That just leaves Sqoop, the app with a funny name and no Wikipedia entry. Sqoop was originally developed by Cloudera and basically serves as an import tool for Hadoop. For my purposes, it allowed me to easily import the data from my MySQL database into Hadoop’s HDFS so I could use it in Hive.

The rest of this post walks you through setting up Hadoop+Hive and analyzing some MySQL data.

Now that you know the players, let’s figure out what we’re actually trying to do.

  1. Start a Hadoop cluster to run Hive on.
  2. Load our data from a MySQL database into this Hadoop cluster.
  3. Use Hive to run some reports on this data.
  4. Warehouse the results back in MySQL so we can graph them (not that exciting).

Starting the cluster

There’s actually one more tool you’ll need to get this to work – Apache Whirr. Whirr is really cool: it lets you automatically start cluster services (Hadoop, Voldemort, etc.) on a handful of cloud platforms (AWS, Rackspace, etc.).

NOTE: We exclusively use AWS for our hosting so everything described here is specific to AWS.

First, download the latest copy of Whirr – http://www.fightrice.com/mirrors/apache//incubator/whirr/ – to your local machine. Whirr should work everywhere, but these directions will match up best with Linux/OS X.

The first thing you’ll need is a Whirr configuration file describing the cluster you want to build. Create a file called hadoop.properties and paste in the following:

whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop

whirr.provider=aws-ec2
whirr.identity=[REPLACE THIS WITH YOUR AWS ID]
whirr.credential=[REPLACE THIS WITH YOUR AWS SECRET]

whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-68ad5201
whirr.location-id=us-east-1

whirr.private-key-file=~/.ssh/id_rsa
whirr.public-key-file=~/.ssh/id_rsa.pub

There isn’t a ton going on in the file but you’ll need to switch out the credential lines for your AWS credentials. Also, you’ll need to double check that the ssh paths are accurate for your account.

The next step is to actually launch the cluster. To do this, run the command below – double-check that the path to your hadoop.properties file is accurate:

./bin/whirr launch-cluster --config hadoop.properties

Just give it a few minutes – you’ll see a bunch of debug info scrolling across your terminal and hopefully a success message once it’s done. At this point, you’ll have a fully built Hadoop cluster with 3 nodes as described in your properties file (1 hadoop-namenode+hadoop-jobtracker, 2 hadoop-datanode+hadoop-tasktracker).

You can see all your nodes by checking out your Whirr cluster directory.

cat ~/.whirr/hadoop/instances

Prepping and loading the cluster

Now that the cluster is up, you’ll need to prep it and then load your data with Sqoop.

One of the most irritating “gotchas” I stumbled across is that Whirr only adds the firewall rules necessary for Hadoop to its AWS security group – nothing else (like MySQL) gets opened up.

Before you do anything, open your EC2 control panel and modify the new Whirr security group (#jcloud-something) so that all of your nodes can connect to each other on port 3306 (MySQL).
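If you’d rather script that than click through the console, something along these lines with the aws CLI should do it – the group name below is just a guess at what Whirr created, so check your EC2 console for the real one:

# Let every node in the Whirr security group reach MySQL (3306) on every other node
aws ec2 authorize-security-group-ingress \
    --group-name "jclouds#hadoop" \
    --protocol tcp \
    --port 3306 \
    --source-group "jclouds#hadoop"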

The next step is to install mysql-client across the entire cluster since Sqoop uses mysqldump to get at your data. You could manually ssh into every machine but Whirr provides a convenient “run-script” command to do just that.

Create a file called “prepCluster.sh” and put “sudo apt-get -q -y install mysql-client” in it. Then make sure the paths are right and run,

./bin/whirr run-script --script prepCluster.sh --config hadoop.properties

Once it’s done, you’ll see the aptitude output from all of your nodes as they download the MySQL client.

The next step is to install mysql-server, hive, and sqoop on the jobtracker. Doing this is pretty straightforward: look at the ~/.whirr/hadoop/instances file from above and copy the namenode hostname (in this cluster the namenode and the jobtracker are the same machine).

Next, ssh into that machine using your current username as the username. Once you’re in, just run the following to install everything:

sudo apt-get -q -y install mysql-server
sudo apt-get -q -y install sqoop
sudo apt-get -q -y install hadoop-hive
sudo apt-get -q -y install screen

NOTE: You’ll also need the MySQL Connector/J library so that Sqoop can connect to MySQL. Download it here and place it in “/usr/lib/sqoop/lib/”.
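Assuming you’ve downloaded the Connector/J tarball onto the jobtracker, getting the driver where Sqoop can see it looks roughly like this (wildcards because the exact version will vary):

# Unpack the Connector/J download and drop the driver jar into Sqoop's lib directory
tar -xzf mysql-connector-java-*.tar.gz
sudo cp mysql-connector-java-*/mysql-connector-java-*.jar /usr/lib/sqoop/lib/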

Once everything is done installing, you’ll most likely want to move your MySQL data directories from their default location onto the /mnt partition since it’s much larger. Check out this article for a good walkthrough. Don’t forget to update AppArmor or MySQL won’t start. Once MySQL is set up, load the data you want to crunch.
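For reference, the move usually boils down to something like this – a rough sketch that assumes a stock Debian/Ubuntu MySQL install and /mnt/mysql as the new home, so lean on the linked article for the details:

# Stop MySQL and copy its data directory onto the roomier /mnt partition
sudo service mysql stop
sudo cp -rp /var/lib/mysql /mnt/mysql

# Point the datadir at the new location
sudo sed -i 's|/var/lib/mysql|/mnt/mysql|g' /etc/mysql/my.cnf

# Update the AppArmor profile to match, reload it, and bring MySQL back up
sudo sed -i 's|/var/lib/mysql|/mnt/mysql|g' /etc/apparmor.d/usr.sbin.mysqld
sudo service apparmor reload
sudo service mysql start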

Now you’ll need to use Sqoop to load the data from the MySQL database into Hadoop’s HDFS. While logged into the jobtracker node, you can just run the following to do that. You’ll need to swap out the placeholders in the command and change the username/password.

sqoop-import-all-tables --connect jdbc:mysql://[IP of your jobtracker]/[your db_name] --username root --password root --verbose --hive-import

Once it completes, Sqoop will have copied all your MySQL data into Hadoop’s HDFS file system and initialized Hive for you.
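To sanity-check the import before you start writing queries, you can peek at what Hive and HDFS ended up with (table names will obviously depend on your schema):

# List the tables Hive created from the imported MySQL schema
hive -e 'SHOW TABLES;'

# Look at the files Sqoop wrote under Hive's default warehouse directory
hadoop fs -ls /user/hive/warehouse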

Crunching the data

Run “hive” on the jobtracker and you’ll be ready to start crunching your data.

Check out the Hive language manual for more info on exactly what queries you can write.

Once you’ve narrowed down how to write your queries, you can use Hive’s “INSERT OVERWRITE LOCAL DIRECTORY” command to output the results of your query into a local directory.
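As a contrived example – the “events” table and its columns are placeholders for whatever your schema actually looks like – a daily count report dumped to a local directory on the jobtracker would look something like this, either at the hive prompt or via hive -e:

# Hypothetical report: count rows per day and write the results to /tmp/daily_counts
hive -e "
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/daily_counts'
  SELECT to_date(created_at), COUNT(*)
  FROM events
  GROUP BY to_date(created_at);
"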

Then the next step is to tar up the results and use scp to copy them back to your local machine to analyze or warehouse.
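Continuing the made-up example above, that boils down to a couple of commands (swap in your username and the jobtracker hostname from ~/.whirr/hadoop/instances):

# On the jobtracker: bundle up the query output
tar -czf daily_counts.tar.gz /tmp/daily_counts

# From your local machine: pull the archive down to analyze or warehouse
scp your-username@jobtracker-hostname:daily_counts.tar.gz .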

Shutting it down

The final thing you’ll need to do is shut down the cluster. Whirr makes it pretty easy:

./bin/whirr destroy-cluster --config hadoop.properties

Give it a few minutes and Whirr will shutdown the cluster and clean up the EC2 security group as well.

Anyway, hope this walk through proves useful for someone. As always, feedback, questions, and comments are all more than welcome.

Upload directly to S3 with SWFUpload

I was working on an application earlier today that required allowing a user to upload a large file (several hundred MB) which would eventually be stored on Amazon S3. After reviewing the requirements, I realized it made sense to just upload the file directly to S3 instead of having to first stage the file on a server and then use PHP to push the file to S3.

Amazon has a nice walk through of using a plain HTML form to upload a file directly to S3 here.

I had already been using SWFUpload to upload files to the server, so I decided to look into using it to upload directly to S3. After some head banging, I finally got it to work – here’s the quick n dirty.

  1. Download SWFUpload 2.5
  2. Get SWFUpload ready to use in your project. Copy the SWF file somewhere accessible and include their swfupload.js Javascript file. More info here
  3. Set up an S3 bucket. You’ll need to set the policy to allow uploads from your own user (it’s the default).
  4. Place a crossdomain.xml file in the root of your S3 bucket. This file “authorizes” flash player to upload files into this host. The content of the file is below.
  5. Initialize the SWFUpload object (example below).
  6. Before beginning the upload, you need to set the appropriate postParams in the SWFUpload object. This is really the “magic” of this process. Example is below.
  7. Start the upload with startUpload()

That’s it! It’s pretty straightforward once you have things going. As an FYI, you can put SWFUpload into “debug” mode by adding debug: true as a property to the initialization object. You can also debug the responses from Amazon by using a packet sniffer like Wireshark.

crossdomain.xml
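A wide-open policy along these lines is the simplest thing that works:

<?xml version="1.0"?>
<!DOCTYPE cross-domain-policy SYSTEM "http://www.adobe.com/xml/dtds/cross-domain-policy.dtd">
<cross-domain-policy>
  <allow-access-from domain="*" secure="false" />
</cross-domain-policy>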

You probably want to make this file a little less permissive. More details here. Also note, there are differences in the implementation of the file between various versions of Flash player.

Initialize SWFUpload

Set SWFUpload postParams

The HMAC signature MUST be calculated on the server because it uses your S3 secret. You MUST keep that value secret in order to maintain the security of your S3 buckets. I’m using Donovan Schönknecht’s S3 PHP library to calculate the HMAC signatures, but you could just as easily do it in straight PHP.
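For the curious, the signature itself is just a base64-encoded HMAC-SHA1 of the base64-encoded policy document, which you can reproduce on the command line with openssl – policy.json and S3_SECRET_KEY are placeholders here, and base64 -w0 is the GNU flag:

# Base64-encode the policy document; this string is also what goes in the "policy" post param
POLICY=$(base64 -w0 policy.json)

# Sign it with your S3 secret key - this is the part that has to stay on the server
SIGNATURE=$(echo -n "$POLICY" | openssl dgst -sha1 -hmac "$S3_SECRET_KEY" -binary | base64)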

99designs and Amazon. Design. Crowd sourced.

A week or so ago my Dad asked if we could have our designer put together a logo for him. Unfortunately, our guy was buried under a mound of work and generally couldn’t help us out. We haven’t always had the best luck with Craigslist so I was ready to try something new.

Over the last few months, I’ve been seeing a good amount of chatter surrounding “spec” design sites, especially 99designs.com. After taking a look around the site, I figured now was a good time to give it a shot. We were on a tight budget and a tight timeline, and my Dad didn’t have much direction for the logo.

I posted up a contest last Sunday here and we were looking at entries by Monday afternoon. Now things got more difficult. We were having a hard time coming up with “star ratings” and constructive feedback in general. My Dad’s staff was having a hard time not getting pigeonholed by the submitted designs, and Setfive wasn’t doing a great job helping them along.

We did our best and we felt like the entries were moving in the right direction. Then the contest closed. In the last 8 hours of the contest the number of entries nearly doubled. With 70 entries, we now had the problem of objectively picking a winning logo.

At this point, I wanted some more input on what people thought about the logos. I decided to create a set of Amazon Mechanical Turk tasks to get some feedback.

After about a day, I had 200 responses asking for users’ top three logos and any additional comments they had.

Some of the comments I got back were insightful and moving:

  • Don’t pick any of the logos on the second page. They all look terrible.
  • Due to nature of your business I would prefer a sober and serious looking logo.
  • I chose these three because they are visually appealing, and convey a sense of being able to ease pain.
  • I suffer from cronic pain. I wish you the best of luck in finding your logo. People that do your type of work are a life line for people like me. Hope I hope have helped.

I tallied up the results by weighting +3, +2, +1 for first, second, and third choices respectively. The results were interesting.

  • Every logo received at least one vote.
  • The top ten logos accounted for just about 41% of all the votes.
  • Only counting the top choice caused 3 logos to fall out of the top ten.

The top ten logos as voted by the Amazon Mechanical Turks were:

Entry ID   Votes   URL
88         76      http://99designs.com/contests/24619/entries/88
78         65      http://99designs.com/contests/24619/entries/78
75         60      http://99designs.com/contests/24619/entries/75
87         55      http://99designs.com/contests/24619/entries/87
91         51      http://99designs.com/contests/24619/entries/91
58         48      http://99designs.com/contests/24619/entries/58
47         42      http://99designs.com/contests/24619/entries/47
90         41      http://99designs.com/contests/24619/entries/90
43         36      http://99designs.com/contests/24619/entries/43

Personally, I like the top ten logos and my Dad’s staff seems to like many of the same logos that were voted up. It’s been an interesting experiment almost exclusively using the “crowd” to design and then select a logo. I’m not sure if we’ll use 99designs in the future but it has been a pleasant experience.

We still haven’t picked a winning logo but I’ll update once we do!

Update:

We finally picked a winner! We decided to go with the crowd and selected http://99designs.com/contests/24619/entries/88 as the winning logo.

Artificial artificial intelligence – our experience with Mechanical Turk

So Amazon Web Services (AWS) has a pretty neat service called the Mechanical Turk. The name of the service is inspired by this story where a human was hidden inside a chess playing “robot” which impressed crowds during the late 18th century. Amazon’s service doesn’t hide the fact that it’s powered by humans, but the end result is the same – humans performing tasks in lieu of a computer.

The task we needed “turked” was the transcription of various text labels that sat on top of overlays on a particular large image. The goal of all of this was a Google Maps product, so we already had the overlays loaded into a Google Maps UI. To recap:

  • We had a Google Maps application which loaded an image layer with about 3000 discrete overlays.
  • We had lat/lng coordinates for all of these overlays.
  • Each of these images had a text label that uniquely described it.
  • We wanted the Turks to transcribe these labels.

Next, we began to investigate how the Mechanical Turk service actually works. The process is delightfully simple. The idea is to break the tasks out into discrete and reproducible actions called human intelligence tasks (HITs). This pattern aligned well with our problem because all of the overlays were independent and we had coordinates for each of them. Amazon displays each of these HITs inside a template that the “requester” constructs.

The templates are HTML pages (Amazon allows JavaScript) which plug in variables for each HIT when the tasks go live. They also capture the data that you want to save from the task – in our case the labels. Designing templates is pretty straightforward, and JavaScript is a nice touch.

For our task, we embedded Google Maps with our image overlaid and used HIT variables to display 3 markers per task. This is where we hit our first and only snag. Because of directory restrictions on Google Maps API keys, our HIT template kept generating JavaScript errors because it was being executed on unpredictable domain names. We didn’t think this was a huge deal so we plowed ahead as planned.

We ran the first set of HITs with a 2 cent/HIT payment and a minimum approval rating of 95%.

The results were less than stellar – within an hour or so we received this email:

A strange error message pops up for your/these HITS.
“The Google Maps API key used on this web site was registered for a different web site. You can generate a new key for this web site at http://code.google.com/apis/maps/.”
Also, the mouseover that displays at screen bottom is “javascript: void (0)”…and no label appears below the marker either.  I do NOT have any javascript blockers operational, so no problem there.
Also, I would appreciate it if you NOT deny me credit for this HIT, as the technical error message was not my fault and not explained or warned in the instructions.
I would like to do several, possibly many of these HITS, if you can tell me what to do to overcome this problem.
Thanks,
[REDACTED]

People were obviously not getting the instructions and were scared away by the JS error and fear of HIT rejection. One of the problems with the service is that you can’t modify your HIT template once a set has been published. You have to cancel the batch and re-start after your edits. At this point, we decided to cancel the run and modify the instructions. With the new instructions people seemed to “get it” and were generally more willing to forgive the JS error.

At the end of the run, the Turks’ accuracy on transcribing the labels was around 90%. We got about 25 man-hours of work accomplished in about 42 hours of actual time at a cost of about $25. All told, we were really impressed with the service as well as the Turks themselves. We definitely recommend the service for any discrete and repeatable tasks.

Things we learned:

  • Be EXTREMELY clear in your instructions – try to be as unambiguous as possible.
  • The Turks live and die by their approval rating so be nice (we just accepted every task).
  • Unexpected popups and JavaScript errors probably scare people away so try to avoid them.
  • Obviously higher reward rates are going to attract more Turks – something to take into account.

We’ve also recently become fascinated by other things that could be tried out or experimented with on the Turks. Notably, it seems like an ideal venue to experiment with some game theory topics.