In the last few months there’s been a handful blog posts basically themed “Redshift vs. Hive”. Companies from Airbnb to FlyData have been broadcasting their success in migrating from Hive to Redshift in both performance and cost. Unfortunately, a lot of casual observers have interpreted these posts to mean that Redshift is a “silver bullet” in the big data space. For some background, Hive is an abstraction layer that executes MapReduce jobs using Hadoop across data stored in HDFS. Amazon’s Redshift is a managed “petabyte scale” data warehouse solution that provides managed access to a ParAccel cluster and exposes a SQL interface that’s roughly similar to PostgreSQL. So where does that leave us?
From the outside, Hive and Redshift look oddly similar. They both promise “petabyte” scale, linear scalability, and expose an SQL’ish query syntax. On top of that, if you squint, they’re both available as Amazon AWS managed services through Elastic Mapreduce and of course Redshift. Unfortunately, that’s really where the similarities end which makes the “Hive vs. Redshift” comparisons along the lines of “apples to oranges”. Looking at Hive, its defining characteristic is that it runs across Hadoop and works on data stored in HDFS. Removing the acronym soup, that basically means that Hive runs MapReduce jobs across a bunch of text files that are stored in a distribued file system (HDFS). In comparison, Redshift uses a data model similar to PostgreSQL so data is structured in terms of rows and tables and includes the concept of indexes.
Well therein lays the rub that everyone seem to be missing. Hadoop, and by extension Hive (and Pig) are really good at processing text files. So imagine you have 10 million x 1mb XML documents or 100GB worth of nginx logs, this would be a perfect use case for Hive. All you would have to do is push them into HDFS or S3, write a RegEx to extract your data and then query away. Need to add another 2 million documents or 20GB of logs? No problem, just get them into HDFS and you’re good to go.
Could you do this with Redshift? Sure, but you’d need to pre-process 10 million XML documents and 100GB of logs to extract the appropriate fields, and then create CSV files or SQL INSERT statements to load into Redshift. Given the available options, you’re probably going to end up using Hadoop to do this anyway.
Where Redshift is really going to excel is in situations where your data is basically already relational and you have a clear path to actually get it into your cluster. For example, if you were running three x 15GB MySQL databases with unique, but related data, you’d be able to regularly pull that data into Redshift and then ad-hoc query it with regular SQL. In addition, since the data is already structured you’d be able to use the existing format to create keys in Redshift to improve performance.
When it comes down it, it’ll come down to the old “right tool for the right job” aphorism. As an organization, you’ll have to evaluate how your data is structured, the types of queries you’re interested in running, and what level of abstraction you’re comfortable with. What’s definitely true is that “enterprise” data warehousing is being commoditized and the “old guard” better innovate or die.
Recently we were getting ready to deploy a new project which functions only over SSL. The project is deployed on AWS using the Elastic Load Balancers (ELB). We have the ELB doing the SSL termination to reduce the load on the server and to help simply management of the SSL certs. Anyways the the point of this short post. One of the developers noticed that on some of the internal links she kept getting a link something like “https://dev.app.com:80/….”, it was properly generating the link to HTTPS but then specify port 80. Of course your browser really does not like that as its conflicting calls of port 80 and 443. After a quick look into the project we found that we had yet to enable the proxy headers and specify the proxy(s), it was we had to turn on `trust_proxy_headers`. However, doing this did not fix the issue. You must in addition to enable the headers specify which ones you trust. This can be easily done via the following:
Here is a very simple example of how you could specify them. You just let it know the IP’s of the proxy(s) and it will then properly generate your links.
You can read up on this more in the Symfony documentation on trusting proxies.
Anyways just wanted to put throw this out there incase you see this and realize you forgot to configure the proxy in your app!
We use Amazon Web Services quite a bit here. We not only use it to host most of our clients’ applications, but also for backups. We like to use S3 to store our backups as it is reliable, secure and very cheap. S3 stands for Amazon’s Simple Storage Service, it is more or less a limitless place to store data. You can mount S3 as a network hard drive but it’s main use is to store objects, or data, that you can retrieve at a low cost. It has 99.999999999% durability, so you most likely won’t lose anything, but even if you do, we use produce multiple backups for every object.
One thing we’ve noticed is that some people have issues interacting with S3, so here are a few things to help you out. First, if you are just looking to browse your S3 you can do so via your AWS Console or I like to use S3Fox. However, when you are looking to write some scripts or access it from the command line it can be difficult if you don’t use some pre-built tools. The best one we’ve found is s3cmd.
s3cmd allows you to list, update, create, delete objects and buckets in your S3. It’s really easy to install. Depending on your distribution of linux you can most likely get it from your package manager. Once you’ve done that you can configure it easily via ‘s3cmd –configure’. You’ll just need access credentials from your AWS account. Once you’ve set it up lets go through some useful commands.
To list your available buckets:
To create a bucket:
To list the contents of a bucket:
To put a file in the bucket it is very easy, just run (ie move tester-1.jpg to the bucket):
To delete the file you can run:
These are the basics. Probably the most common uses that we see are doing backups of data from a server to S3. An example of a bash script for this is as follows:
In this script it will just output the the console any errors. As you are most likely not running this by hand every day you’d want to change the “echo” statements to be mail commands or another way to alert administrators of an error on the backup. If you want to backup more than once a day all you need to change is the way the SQL_FILE variable is named to include hours for example.
This is a very simple backup script for MySQL. One thing that it doesn’t do is remove any old files, there is no reason for this to happen in the script. Amazon now has object lifecycles which allows you to automatically expire files in a bucket that are older than 60 days for example.
One thing that many people forget to do when they are making backups is to make sure that they actually work. We highly suggest that you once a month have a script which will check that whatever you are backing up is valid. This means if you are backing up a database that it checks to make sure that the database will reimport and that the data is valid (ie a row that should always exist does). The worst thing is finding out when you need a backup that your backup failed ages ago and you have no valid ones.
Make sure that your backups are not deleted quicker than it would take you to discover a problem. For example, if you only check your blog once a week, don’t have your backups delete after 5 days as you may discover a problem too late and your backups will also have the problem. Storage is cheap, keep backups for a long time.
Hope s3cmd makes your life easier and if you have any questions leave us a comment below!
We’ve heard some people are having a few small issues with getting AWS up and running. I’ve whipped up a quick guide to get you up and running, for FREE, on AWS within a few minutes. Let’s get started.
First you need to get signed up on Amazon in order to use their accounts. Head on over to http://aws.amazon.com/ to create an account. Creating an account is 100% free, even though they do ask for your credit card. Click on the ‘Get started for free’ button in the middle of the page. From there you’ll be taken through a quick registration.
There are are tons of different instances you can choose from. For this tutorial we’ll just give you a simple Ubuntu 12 image. Click https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-3bec7952 this will take you to the launch instance screen:
Click on “Continue”. On this screen for now just make sure that the Instance type (top right of screen) is “T1 Micro…”. This is their free tier. You get 720 hours of run time for free on it. The other options on this screen allow you to customize the number of instances and their location, but for now just click “Continue”.
This screen is the advanced options screen where you can select some extra options such as the kernel and monitoring for the instance. The defaults here are fine, so just click “Continue”.
This screen will let you configure the storage for your instance, again the defaults are fine just click “Continue”.
This screen you can put different tags on your instance. If you have a ton of instances it can be helpful to tag them, however as this is your first and only instance, no need to do anything other than click “Continue”.
This screen is important. You are going to setup your SSH keys to access the server here. Amazon does not launch the server with passwords, instead is uses https://wiki.archlinux.org/index.php/SSH_KeysSSH Keys. These let you identify with the server without having to specify a password. Read up on them, their really helpful.
You’ll want to click “Create a new Key Pair”. Amazon does not currently let you upload your own public ssh key, you must use one stored on your account.” Enter whatever name you want for the pair and click “Create & Download your Key Pair”. This will download a file to your computer.
You’ll be automatically advanced to the next screen when is has downloaded. Here you’ll configure which security group you want the server to be in. A security group is pretty much just a set of firewall settings. Use the “default” group. Click “Continue.”
Your instance is now being launched. You’ll see a “pending” on the screen under state until it is fully up and running.
Once it is running the state will change to “running”. Click on the server. At the bottom of the screen you’ll see information about the server. At the top of it under the top line “EC2 Instance ….” there is a URL. This is your servers public DNS record. You’ll use this to connect.
Before you can connect to your server you need to update your default security group to allow SSH. On the left side of the window click on “Security Groups”. Click default. In the bottom pane click “Inbound”. Select “SSH” in the dropdown for “Create a new rule:”. Click the “Add rule” button. Then do the same but for “HTTP”. At this point click “Apply Rule Changes”. If you do not do this, it will NOT save your updates.
Now open your terminal. Navigate to where you downloaded the file from earlier. Now it is time to SSH into your server. You may encounter a permission error, if you do run the chmod command from the gist below.
Congratulations, you’re now on your own server!
Now that you are on your server you need to install the LAMP stack. The next steps we’ll do is have you become the super user, run apt-get and install the LAMP software. apt-get is a package/software manager.
You now have a fully functional LAMP web server. To modify the files that are being served you’ll need to go to the webroot on the filesystem at “/var/www”.
Don’t forget to turn your instance off, as once your free tier runs out they will charge you. When you turn off your instance you will not be able to recover anything on it, so make sure if you have any files you want to keep you download them first.
Congrats on launching a LAMP server on AWS. Good luck and let us know if we can help you out on AWS or your next project!
Want to learn how to do other things on AWS? Leave us a comment and we’ll do our best to help out!
Posted In: Amazon AWS
Over the last couple of years, the popularity of the “cloud computing” has grown dramatically and along with it so has the dominance of Amazon Web Services (AWS) in the market. Unfortunately, AWS doesn’t do a great job of explaining exactly what AWS is, how its pieces work together, or what typical use cases for its components may be. This post is an effort to address this by providing a whip around overview of the key AWS components and how they can be effectively used.
Great, so what is AWS? Generally speaking, Amazon Web Services is a loosely coupled collection of “cloud” infrastructure services that allows customers to “rent” computing resources. What this means is that using AWS, you as the client are able to flexibly provision various computing resources on a “pay as you go” pricing model. Expecting a huge traffic spike? AWS has you covered. Need to flexibly store between 1 GB or 100 GB of photos? AWS has you covered. Additionally, each of the components that makes up AWS is generally loosely coupled meaning that they can work independently or in concert with other AWS resources.
Since AWS components are loosely coupled, you’d be able to mix and match only what you need but here is an overview of the key services.
What is it? Route53 is a highly available, scalable, and feature rich domain name service (DNS) web service. What a DNS service does is translate a domain name like “setfive.com” into an IP address like 18.104.22.168 which allows a client’s computer to “find” the correct server for a given domain name. In addition, Route53 also has several advanced features normally only available in pricey enterprise DNS solutions. Route53 would typically replace the DNS service provided by your registrar like GoDaddy or Register.com.
Should you use it? Definitely. Allow it isn’t free, after last year’s prolonged GoDaddy outage it’s clear that DNS is a critical component and using a company that treats it as such is important.
What is it? Simple Email Service (SES) is a hosted transactional email service. It allows you to easily send highly deliverable emails using a RESTful API call or via regular SMTP without running your own email infrastructure.
Should you use it? Maybe. SES is comparable to services like SendGrid in that it offers a highly deliverable email service. Although it is missing some of the features that you’ll find on SendGrid, its pricing is attractive and the integration is straightforward. We normally use SES for application emails (think “Forgot your password”) but then use MailChimp or SendGrid for marketing blasts and that seems to work pretty well.
What is it? Identity and access management (IAM) provides enhanced security and identity management for your AWS account. In addition, it allows you to enable “multi factor” authentication to enhance the security of your AWS account.
Should you use it? Definitely. If you have more than 1 person accessing your AWS account using IAM will allow everyone to get a separate account with fine grained permissions. Multi factor authentication is also critically important since a compromise at the infrastructure level would be catastrophic for most businesses. Read more about IAM here.
What is it? Simple storage service (S3) is a flexible, scalable, and highly available storage web service. Think of S3 like having an infinitely large hard drive where you can store files which are then accessible via a unique URL. S3 also supports access control, expiration times, and several other useful features. Additionally, the payment model for S3 is “pay as you go” so you’ll only be billed for the amount of data you store and how much bandwidth you use to transfer it in and out.
Should you use it? Definitely. S3 is probably the most widely used AWS service because of its attractive pricing and ease of use. If you’re running a site with lots of static assets (images, CSS assets, etc.), you’ll probably get a “free” performance boost by hosting those assets on S3. Additionally, S3 is an ideal solution for incremental backups, both data and code. We use S3 extensively, usually for hosting static files, frequently backing up MySQL databases, and backing up git repositories. The new AWS S3 Console also makes administering S3 and using it non-programmatically much easier.
What is it? Elastic Compute Cloud (EC2) is the central piece of the AWS ecosystem. EC2 provides flexible, on-demand computing resources with a “pay as you go” pricing model. Concretely, what this means is that you can “rent” computing resources for as long as you need them and process any workload on the machines you’ve provisioned. Because of its flexibility, EC2 is an attractive alternative to buying traditional servers for unpredictable workloads.
Should you use it? Maybe. Whether or not to use EC2 is always a controversial discussion because the complexity it introduces doesn’t always justify its benefits. As a rule of thumb, if you have unpredictable workloads like sporadic traffic using EC2 to run your infrastructure is probably a worthwhile investment. However, if you’re confident that you can predict the resources you’ll need you might be better served by a “normal” VPS solution like Linode.
What is it? Elastic block store (EBS) provides persist storage volumes that attach to EC2 instances to allow you to persist data past the lifespan of a single EC2. Due to the architecture of elastic compute cloud, all the storage systems on an instance are ephemeral. This means that when an instance is terminated all the data stored on that instance is lost. EBS addresses this issue by providing persistent storage that appears on instances as a regular hard drive.
Should you use it? Maybe. If you’re using EC2, you’ll have to weigh the choice between using only ephemeral instance storage or using EBS to persist data. Beyond that, EBS has well documented performance issues so you’ll have to be cognizant of that while designing your infrastructure.
What is it? CloudWatch provides monitoring for AWS resources including EC2 and EBS. CloudWatch enables administrators to view and collect key metrics and also set a series of alarms to be notified in case of trouble. In addition, CloudWatch can aggregate metrics across EC2 instances which provides useful insight into how your entire stack is operating.
Should you use it? Probably. CloudWatch is significantly easier to setup and use than tools like Nagios but its also less feature rich. We’ve had some success coupling CloudWatch with PagerDuty to provide alerts in case of critical service interruptions. You’ll probably need additional monitoring on top of CloudWatch but its certainly a good baseline to start with.
Anyway, the AWS ecosystem includes several additional services but these are the ones that I felt are key to getting started on AWS. We haven’t had a chance to use it yet but Redshift looks like it’s an exciting addition which will probably make this list soon. As always, comments and feedback welcome.
Posted In: Amazon AWS