In the last few months there’s been a handful blog posts basically themed “Redshift vs. Hive”. Companies from Airbnb to FlyData have been broadcasting their success in migrating from Hive to Redshift in both performance and cost. Unfortunately, a lot of casual observers have interpreted these posts to mean that Redshift is a “silver bullet” in the big data space. For some background, Hive is an abstraction layer that executes MapReduce jobs using Hadoop across data stored in HDFS. Amazon’s Redshift is a managed “petabyte scale” data warehouse solution that provides managed access to a ParAccel cluster and exposes a SQL interface that’s roughly similar to PostgreSQL. So where does that leave us?
From the outside, Hive and Redshift look oddly similar. They both promise “petabyte” scale, linear scalability, and expose an SQL’ish query syntax. On top of that, if you squint, they’re both available as Amazon AWS managed services through Elastic Mapreduce and of course Redshift. Unfortunately, that’s really where the similarities end which makes the “Hive vs. Redshift” comparisons along the lines of “apples to oranges”. Looking at Hive, its defining characteristic is that it runs across Hadoop and works on data stored in HDFS. Removing the acronym soup, that basically means that Hive runs MapReduce jobs across a bunch of text files that are stored in a distribued file system (HDFS). In comparison, Redshift uses a data model similar to PostgreSQL so data is structured in terms of rows and tables and includes the concept of indexes.
Well therein lays the rub that everyone seem to be missing. Hadoop, and by extension Hive (and Pig) are really good at processing text files. So imagine you have 10 million x 1mb XML documents or 100GB worth of nginx logs, this would be a perfect use case for Hive. All you would have to do is push them into HDFS or S3, write a RegEx to extract your data and then query away. Need to add another 2 million documents or 20GB of logs? No problem, just get them into HDFS and you’re good to go.
Could you do this with Redshift? Sure, but you’d need to pre-process 10 million XML documents and 100GB of logs to extract the appropriate fields, and then create CSV files or SQL INSERT statements to load into Redshift. Given the available options, you’re probably going to end up using Hadoop to do this anyway.
Where Redshift is really going to excel is in situations where your data is basically already relational and you have a clear path to actually get it into your cluster. For example, if you were running three x 15GB MySQL databases with unique, but related data, you’d be able to regularly pull that data into Redshift and then ad-hoc query it with regular SQL. In addition, since the data is already structured you’d be able to use the existing format to create keys in Redshift to improve performance.
When it comes down it, it’ll come down to the old “right tool for the right job” aphorism. As an organization, you’ll have to evaluate how your data is structured, the types of queries you’re interested in running, and what level of abstraction you’re comfortable with. What’s definitely true is that “enterprise” data warehousing is being commoditized and the “old guard” better innovate or die.
A weeks ago, Facebook released a new open source project called PrestoDB which they billed as a market improvement over Hive and Hadoop. According to the PrestoDB site, Presto is a real time query engine that supports a SQL like syntax, similar to Hive. However, unlike Hive, Presto doesn’t execute queries using MapReduce jobs but instead uses its own internal distribution mechanism. According to the Presto site and current users, most queries will see an order of magnitude speedup compared to Hive. And the best part? PrestoDB can read metadata from Hive’s metastore and read files off HDFS just like Hive – pretty wild.
Anyway, since I love new toys (who doesn’t!?) I decided to try setting up PrestoDB on Amazon EMR to see how difficult it was and also experience the speedups. Turns out, once you have an Amazon EMR cluster running getting PrestoDB up is almost trivial. Just follow the PrestoDB deploying directions to get yourself situated. Make sure you create *all* the files or you’ll get some necessarily cryptic errors along the way.
The config files I ended up using were:
You’ll need to create the “/mnt/presto” directory and also make it accessible to whatever user you plan to run the daemon under.
The one huge gotcha I ran into was that I couldn’t figure out what port Hive’s Thrift service was running on. For some reason, it’s notably absent from Amazon’s documentation and I couldn’t find the hive-site.xml file on the EMR EC2. Completely randomly, I ran across this manual page from Jaspersoft enumerating which ports different versions of Hive run Thrift on when you use EMR. Turns out, its different per Hive version but 0.11.0 will use 10004.
Once you have everything configured, just follow the docs to start the server and you’ll be ready to query. One thing to note though is that you’ll need to setup PrestoDB manually on the rest of your machines and also enable the discovery service for this to “really” work.
Anyway, happy querying!
Posted In: Big Data
As far as “big data” solutions go, Hive is probably one of the more recognizable names. Hive basically offers the end user an abstraction layer to run “SQL like” queries as MapReduce jobs across data that they have in HDFS. Concretely, say you had several hundred million rows of data and you wanted to count the number of unique IDs Hive would let you do that. One of the issues with Hadoop and by proxy Hive is that it’s notably difficult to setup a cluster to try things out. Tools like Whirr exist to make things easier they’re, a bit rough around the edges and in my experience hit up against “version hell”. One alternative that I’m surprised isn’t more popular is using Amazon’s Elastic Map Reduce to bootstrap a Hadoop cluster to experiment with.
The first thing you’ll need to do is fire up an EMR cluster from the AWS backend. It’s mostly just point and click but the settings I used were:
The “security and access” section is important, you need to select an existing key pair that you have access to so that you can SSH into your master node to use the Hive CLI client.
Then finally, under Steps since you’re not specifying any pre-determined steps make sure you mark “Auto-terminate” as “No” so that the cluster doesn’t terminate immediately after it boots.
Click “Create Cluster” and you’re off to the races.
Once the cluster launches, you’ll see a dashboard screen with a bunch of information about the cluster including the public DNS address for the “Master”. SSH into this machine using the user “hadoop” and whatever key you launched the cluster with:
Once you’re in, you’ll want to grab some data to play with. I pulled down Wikipedia Page View data since it’s just a bunch of gzipped text files which are perfect for Hive. You can pull down a chunk of files using wget, be aware though that the small EC2s don’t have much storage so you’ll need to keep an eye on your disk space.
Once you have some data (grab a few GB), the next step is to push it over to HDFS, Hadoop’s distributed filesystem. As an aside, Amazon EMR is tightly integrated with Amazon S3 so if you already have a dataset in S3 you can copy directly from S3 to HDFS. Anyway, to push your files to HDFS just run:
And finally, it’s time to query some of the pageview data using Hive. The first step is to let Hive know about your data and what format it’s stored in. To do this, you need to create an external table that points to the location of the files that you just pushed to HDFS. Start the Hive client by running “hive” and then do the following:
Now select some data from your newly created table!
Pretty sweet huh? Now feel free to run any arbitrary query against the data. Note: since we used m1.small EC2s the performance of Hive/Hadoop is going to be pretty abysmal. But hey, give it a shot:
Anyway, don’t forget to tear down the cluster once you’re done. As always, let me know if you run into any issues!
Posted In: Big Data
A few days ago, I was browsing my Feedly dashboard and ran across this AdWeek post describing how big retailers are gearing up to poach their competitors customers this holiday season. The article goes into some specifics, but the idea is basically that brands are planning to monitor Twitter for relevant conversations and then “at” message potential customers with special offers, product details, or even local store inventory information.
So imagine @MikeBruins65 from Boston tweeting “Wtf! @BestBuy offering 25% off all 4K TVs in-store…except nothing in stock.” and then @target replying “Cheer up @MikeBruins65! We have 4K TVs in-stock in Everett, MA! Grab coupons at http://bit.ly/target-4k-ma”. Since these brands are certainly leveraging powerful tools like Radian6 or even the full Twitter Firehose, it seems like it would be straightforward for them to execute strategies like this around high value markets. But what about as an individual, could you employ a similar strategy to make a few bucks?
The most obvious, least risky, and least lucrative approach would be to monitor Twitter for tweets that sounded like they were from frustrated buyers and then message them Amazon associates links for the product they’re looking for. Looking at Amazon’s fee structure, you’d want to target high margin categories with moderately expensive products and then hopefully end up doing a decent amount of volume. So imagine searching for Tweets from users frustrated that they can’t checkout on a small eCommerce site, finding the product they’re searching for on Amazon, and then Tweeting them the link to buy with your Associates link.
More risky and potentially more upside. I’m not entirely sure how feasible this would be, but I think the idea would be to use a SaaS eCommerce platform like Shopify to setup an eCommerce shop and then dynamically list items which you’ll later dropship. The challenge would be two fold, using Twitter to identify which previously obscure items are starting to trend and then figuring out how to introduce enough margin so that you end up profiting on the sale. It might be feasible though, with the explosion of small, boutique eCommerce sites it might be possible to negotiate a “I’ll buy 400 for 50% off!” type deal quickly enough to introduce a profitable sale. The bigger challenge would probably be identifying these items as they start trending, but that could be solved by….
Recent member of the billion dollar boys club and frequent target of “haters”, it’s current traction and latent purchase intent potentially make it the perfect place for affiliate marketing. Beyond that, the wealth of potential gift pins and the follower/repin graph might hold the key to identifying relatively obscure products right before they begin to go viral. Anyway, I don’t have any concrete ideas on how you could leverage Pinterest but it definitely seems like the ingredients for success are there.
Totally coincidentally, this article just came across TechCrunch – A Pin On Pinterest Is Worth 25% More In Sales Than Last Year, Can Drive Visits & Orders For Months
Anyway, are any of these actually feasible? Who knows, but I’d love to hear any other ideas.
Posted In: General
We’ve heard some people are having a few small issues with getting AWS up and running. I’ve whipped up a quick guide to get you up and running, for FREE, on AWS within a few minutes. Let’s get started.
First you need to get signed up on Amazon in order to use their accounts. Head on over to http://aws.amazon.com/ to create an account. Creating an account is 100% free, even though they do ask for your credit card. Click on the ‘Get started for free’ button in the middle of the page. From there you’ll be taken through a quick registration.
There are are tons of different instances you can choose from. For this tutorial we’ll just give you a simple Ubuntu 12 image. Click https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-3bec7952 this will take you to the launch instance screen:
Click on “Continue”. On this screen for now just make sure that the Instance type (top right of screen) is “T1 Micro…”. This is their free tier. You get 720 hours of run time for free on it. The other options on this screen allow you to customize the number of instances and their location, but for now just click “Continue”.
This screen is the advanced options screen where you can select some extra options such as the kernel and monitoring for the instance. The defaults here are fine, so just click “Continue”.
This screen will let you configure the storage for your instance, again the defaults are fine just click “Continue”.
This screen you can put different tags on your instance. If you have a ton of instances it can be helpful to tag them, however as this is your first and only instance, no need to do anything other than click “Continue”.
This screen is important. You are going to setup your SSH keys to access the server here. Amazon does not launch the server with passwords, instead is uses https://wiki.archlinux.org/index.php/SSH_KeysSSH Keys. These let you identify with the server without having to specify a password. Read up on them, their really helpful.
You’ll want to click “Create a new Key Pair”. Amazon does not currently let you upload your own public ssh key, you must use one stored on your account.” Enter whatever name you want for the pair and click “Create & Download your Key Pair”. This will download a file to your computer.
You’ll be automatically advanced to the next screen when is has downloaded. Here you’ll configure which security group you want the server to be in. A security group is pretty much just a set of firewall settings. Use the “default” group. Click “Continue.”
Your instance is now being launched. You’ll see a “pending” on the screen under state until it is fully up and running.
Once it is running the state will change to “running”. Click on the server. At the bottom of the screen you’ll see information about the server. At the top of it under the top line “EC2 Instance ….” there is a URL. This is your servers public DNS record. You’ll use this to connect.
Before you can connect to your server you need to update your default security group to allow SSH. On the left side of the window click on “Security Groups”. Click default. In the bottom pane click “Inbound”. Select “SSH” in the dropdown for “Create a new rule:”. Click the “Add rule” button. Then do the same but for “HTTP”. At this point click “Apply Rule Changes”. If you do not do this, it will NOT save your updates.
Now open your terminal. Navigate to where you downloaded the file from earlier. Now it is time to SSH into your server. You may encounter a permission error, if you do run the chmod command from the gist below.
Congratulations, you’re now on your own server!
Now that you are on your server you need to install the LAMP stack. The next steps we’ll do is have you become the super user, run apt-get and install the LAMP software. apt-get is a package/software manager.
You now have a fully functional LAMP web server. To modify the files that are being served you’ll need to go to the webroot on the filesystem at “/var/www”.
Don’t forget to turn your instance off, as once your free tier runs out they will charge you. When you turn off your instance you will not be able to recover anything on it, so make sure if you have any files you want to keep you download them first.
Congrats on launching a LAMP server on AWS. Good luck and let us know if we can help you out on AWS or your next project!
Want to learn how to do other things on AWS? Leave us a comment and we’ll do our best to help out!
Posted In: Amazon AWS