In the last few months there’s been a handful blog posts basically themed “Redshift vs. Hive”. Companies from Airbnb to FlyData have been broadcasting their success in migrating from Hive to Redshift in both performance and cost. Unfortunately, a lot of casual observers have interpreted these posts to mean that Redshift is a “silver bullet” in the big data space. For some background, Hive is an abstraction layer that executes MapReduce jobs using Hadoop across data stored in HDFS. Amazon’s Redshift is a managed “petabyte scale” data warehouse solution that provides managed access to a ParAccel cluster and exposes a SQL interface that’s roughly similar to PostgreSQL. So where does that leave us?
From the outside, Hive and Redshift look oddly similar. They both promise “petabyte” scale, linear scalability, and expose an SQL’ish query syntax. On top of that, if you squint, they’re both available as Amazon AWS managed services through Elastic Mapreduce and of course Redshift. Unfortunately, that’s really where the similarities end which makes the “Hive vs. Redshift” comparisons along the lines of “apples to oranges”. Looking at Hive, its defining characteristic is that it runs across Hadoop and works on data stored in HDFS. Removing the acronym soup, that basically means that Hive runs MapReduce jobs across a bunch of text files that are stored in a distribued file system (HDFS). In comparison, Redshift uses a data model similar to PostgreSQL so data is structured in terms of rows and tables and includes the concept of indexes.
Well therein lays the rub that everyone seem to be missing. Hadoop, and by extension Hive (and Pig) are really good at processing text files. So imagine you have 10 million x 1mb XML documents or 100GB worth of nginx logs, this would be a perfect use case for Hive. All you would have to do is push them into HDFS or S3, write a RegEx to extract your data and then query away. Need to add another 2 million documents or 20GB of logs? No problem, just get them into HDFS and you’re good to go.
Could you do this with Redshift? Sure, but you’d need to pre-process 10 million XML documents and 100GB of logs to extract the appropriate fields, and then create CSV files or SQL INSERT statements to load into Redshift. Given the available options, you’re probably going to end up using Hadoop to do this anyway.
Where Redshift is really going to excel is in situations where your data is basically already relational and you have a clear path to actually get it into your cluster. For example, if you were running three x 15GB MySQL databases with unique, but related data, you’d be able to regularly pull that data into Redshift and then ad-hoc query it with regular SQL. In addition, since the data is already structured you’d be able to use the existing format to create keys in Redshift to improve performance.
When it comes down it, it’ll come down to the old “right tool for the right job” aphorism. As an organization, you’ll have to evaluate how your data is structured, the types of queries you’re interested in running, and what level of abstraction you’re comfortable with. What’s definitely true is that “enterprise” data warehousing is being commoditized and the “old guard” better innovate or die.