A weeks ago, Facebook released a new open source project called PrestoDB which they billed as a market improvement over Hive and Hadoop. According to the PrestoDB site, Presto is a real time query engine that supports a SQL like syntax, similar to Hive. However, unlike Hive, Presto doesn’t execute queries using MapReduce jobs but instead uses its own internal distribution mechanism. According to the Presto site and current users, most queries will see an order of magnitude speedup compared to Hive. And the best part? PrestoDB can read metadata from Hive’s metastore and read files off HDFS just like Hive – pretty wild.
Anyway, since I love new toys (who doesn’t!?) I decided to try setting up PrestoDB on Amazon EMR to see how difficult it was and also experience the speedups. Turns out, once you have an Amazon EMR cluster running getting PrestoDB up is almost trivial. Just follow the PrestoDB deploying directions to get yourself situated. Make sure you create *all* the files or you’ll get some necessarily cryptic errors along the way.
The config files I ended up using were:
You’ll need to create the “/mnt/presto” directory and also make it accessible to whatever user you plan to run the daemon under.
The one huge gotcha I ran into was that I couldn’t figure out what port Hive’s Thrift service was running on. For some reason, it’s notably absent from Amazon’s documentation and I couldn’t find the hive-site.xml file on the EMR EC2. Completely randomly, I ran across this manual page from Jaspersoft enumerating which ports different versions of Hive run Thrift on when you use EMR. Turns out, its different per Hive version but 0.11.0 will use 10004.
Once you have everything configured, just follow the docs to start the server and you’ll be ready to query. One thing to note though is that you’ll need to setup PrestoDB manually on the rest of your machines and also enable the discovery service for this to “really” work.
Anyway, happy querying!
Posted In: Big Data
As far as “big data” solutions go, Hive is probably one of the more recognizable names. Hive basically offers the end user an abstraction layer to run “SQL like” queries as MapReduce jobs across data that they have in HDFS. Concretely, say you had several hundred million rows of data and you wanted to count the number of unique IDs Hive would let you do that. One of the issues with Hadoop and by proxy Hive is that it’s notably difficult to setup a cluster to try things out. Tools like Whirr exist to make things easier they’re, a bit rough around the edges and in my experience hit up against “version hell”. One alternative that I’m surprised isn’t more popular is using Amazon’s Elastic Map Reduce to bootstrap a Hadoop cluster to experiment with.
The first thing you’ll need to do is fire up an EMR cluster from the AWS backend. It’s mostly just point and click but the settings I used were:
The “security and access” section is important, you need to select an existing key pair that you have access to so that you can SSH into your master node to use the Hive CLI client.
Then finally, under Steps since you’re not specifying any pre-determined steps make sure you mark “Auto-terminate” as “No” so that the cluster doesn’t terminate immediately after it boots.
Click “Create Cluster” and you’re off to the races.
Once the cluster launches, you’ll see a dashboard screen with a bunch of information about the cluster including the public DNS address for the “Master”. SSH into this machine using the user “hadoop” and whatever key you launched the cluster with:
Once you’re in, you’ll want to grab some data to play with. I pulled down Wikipedia Page View data since it’s just a bunch of gzipped text files which are perfect for Hive. You can pull down a chunk of files using wget, be aware though that the small EC2s don’t have much storage so you’ll need to keep an eye on your disk space.
Once you have some data (grab a few GB), the next step is to push it over to HDFS, Hadoop’s distributed filesystem. As an aside, Amazon EMR is tightly integrated with Amazon S3 so if you already have a dataset in S3 you can copy directly from S3 to HDFS. Anyway, to push your files to HDFS just run:
And finally, it’s time to query some of the pageview data using Hive. The first step is to let Hive know about your data and what format it’s stored in. To do this, you need to create an external table that points to the location of the files that you just pushed to HDFS. Start the Hive client by running “hive” and then do the following:
Now select some data from your newly created table!
Pretty sweet huh? Now feel free to run any arbitrary query against the data. Note: since we used m1.small EC2s the performance of Hive/Hadoop is going to be pretty abysmal. But hey, give it a shot:
Anyway, don’t forget to tear down the cluster once you’re done. As always, let me know if you run into any issues!
Posted In: Big Data
A few days ago, I was browsing my Feedly dashboard and ran across this AdWeek post describing how big retailers are gearing up to poach their competitors customers this holiday season. The article goes into some specifics, but the idea is basically that brands are planning to monitor Twitter for relevant conversations and then “at” message potential customers with special offers, product details, or even local store inventory information.
So imagine @MikeBruins65 from Boston tweeting “Wtf! @BestBuy offering 25% off all 4K TVs in-store…except nothing in stock.” and then @target replying “Cheer up @MikeBruins65! We have 4K TVs in-stock in Everett, MA! Grab coupons at http://bit.ly/target-4k-ma”. Since these brands are certainly leveraging powerful tools like Radian6 or even the full Twitter Firehose, it seems like it would be straightforward for them to execute strategies like this around high value markets. But what about as an individual, could you employ a similar strategy to make a few bucks?
The most obvious, least risky, and least lucrative approach would be to monitor Twitter for tweets that sounded like they were from frustrated buyers and then message them Amazon associates links for the product they’re looking for. Looking at Amazon’s fee structure, you’d want to target high margin categories with moderately expensive products and then hopefully end up doing a decent amount of volume. So imagine searching for Tweets from users frustrated that they can’t checkout on a small eCommerce site, finding the product they’re searching for on Amazon, and then Tweeting them the link to buy with your Associates link.
More risky and potentially more upside. I’m not entirely sure how feasible this would be, but I think the idea would be to use a SaaS eCommerce platform like Shopify to setup an eCommerce shop and then dynamically list items which you’ll later dropship. The challenge would be two fold, using Twitter to identify which previously obscure items are starting to trend and then figuring out how to introduce enough margin so that you end up profiting on the sale. It might be feasible though, with the explosion of small, boutique eCommerce sites it might be possible to negotiate a “I’ll buy 400 for 50% off!” type deal quickly enough to introduce a profitable sale. The bigger challenge would probably be identifying these items as they start trending, but that could be solved by….
Recent member of the billion dollar boys club and frequent target of “haters”, it’s current traction and latent purchase intent potentially make it the perfect place for affiliate marketing. Beyond that, the wealth of potential gift pins and the follower/repin graph might hold the key to identifying relatively obscure products right before they begin to go viral. Anyway, I don’t have any concrete ideas on how you could leverage Pinterest but it definitely seems like the ingredients for success are there.
Totally coincidentally, this article just came across TechCrunch – A Pin On Pinterest Is Worth 25% More In Sales Than Last Year, Can Drive Visits & Orders For Months
Anyway, are any of these actually feasible? Who knows, but I’d love to hear any other ideas.
Posted In: General
We’ve heard some people are having a few small issues with getting AWS up and running. I’ve whipped up a quick guide to get you up and running, for FREE, on AWS within a few minutes. Let’s get started.
First you need to get signed up on Amazon in order to use their accounts. Head on over to http://aws.amazon.com/ to create an account. Creating an account is 100% free, even though they do ask for your credit card. Click on the ‘Get started for free’ button in the middle of the page. From there you’ll be taken through a quick registration.
There are are tons of different instances you can choose from. For this tutorial we’ll just give you a simple Ubuntu 12 image. Click https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-3bec7952 this will take you to the launch instance screen:
Click on “Continue”. On this screen for now just make sure that the Instance type (top right of screen) is “T1 Micro…”. This is their free tier. You get 720 hours of run time for free on it. The other options on this screen allow you to customize the number of instances and their location, but for now just click “Continue”.
This screen is the advanced options screen where you can select some extra options such as the kernel and monitoring for the instance. The defaults here are fine, so just click “Continue”.
This screen will let you configure the storage for your instance, again the defaults are fine just click “Continue”.
This screen you can put different tags on your instance. If you have a ton of instances it can be helpful to tag them, however as this is your first and only instance, no need to do anything other than click “Continue”.
This screen is important. You are going to setup your SSH keys to access the server here. Amazon does not launch the server with passwords, instead is uses https://wiki.archlinux.org/index.php/SSH_KeysSSH Keys. These let you identify with the server without having to specify a password. Read up on them, their really helpful.
You’ll want to click “Create a new Key Pair”. Amazon does not currently let you upload your own public ssh key, you must use one stored on your account.” Enter whatever name you want for the pair and click “Create & Download your Key Pair”. This will download a file to your computer.
You’ll be automatically advanced to the next screen when is has downloaded. Here you’ll configure which security group you want the server to be in. A security group is pretty much just a set of firewall settings. Use the “default” group. Click “Continue.”
Your instance is now being launched. You’ll see a “pending” on the screen under state until it is fully up and running.
Once it is running the state will change to “running”. Click on the server. At the bottom of the screen you’ll see information about the server. At the top of it under the top line “EC2 Instance ….” there is a URL. This is your servers public DNS record. You’ll use this to connect.
Before you can connect to your server you need to update your default security group to allow SSH. On the left side of the window click on “Security Groups”. Click default. In the bottom pane click “Inbound”. Select “SSH” in the dropdown for “Create a new rule:”. Click the “Add rule” button. Then do the same but for “HTTP”. At this point click “Apply Rule Changes”. If you do not do this, it will NOT save your updates.
Now open your terminal. Navigate to where you downloaded the file from earlier. Now it is time to SSH into your server. You may encounter a permission error, if you do run the chmod command from the gist below.
Congratulations, you’re now on your own server!
Now that you are on your server you need to install the LAMP stack. The next steps we’ll do is have you become the super user, run apt-get and install the LAMP software. apt-get is a package/software manager.
You now have a fully functional LAMP web server. To modify the files that are being served you’ll need to go to the webroot on the filesystem at “/var/www”.
Don’t forget to turn your instance off, as once your free tier runs out they will charge you. When you turn off your instance you will not be able to recover anything on it, so make sure if you have any files you want to keep you download them first.
Congrats on launching a LAMP server on AWS. Good luck and let us know if we can help you out on AWS or your next project!
Want to learn how to do other things on AWS? Leave us a comment and we’ll do our best to help out!
Posted In: Amazon AWS
Over the last couple of years, the popularity of the “cloud computing” has grown dramatically and along with it so has the dominance of Amazon Web Services (AWS) in the market. Unfortunately, AWS doesn’t do a great job of explaining exactly what AWS is, how its pieces work together, or what typical use cases for its components may be. This post is an effort to address this by providing a whip around overview of the key AWS components and how they can be effectively used.
Great, so what is AWS? Generally speaking, Amazon Web Services is a loosely coupled collection of “cloud” infrastructure services that allows customers to “rent” computing resources. What this means is that using AWS, you as the client are able to flexibly provision various computing resources on a “pay as you go” pricing model. Expecting a huge traffic spike? AWS has you covered. Need to flexibly store between 1 GB or 100 GB of photos? AWS has you covered. Additionally, each of the components that makes up AWS is generally loosely coupled meaning that they can work independently or in concert with other AWS resources.
Since AWS components are loosely coupled, you’d be able to mix and match only what you need but here is an overview of the key services.
What is it? Route53 is a highly available, scalable, and feature rich domain name service (DNS) web service. What a DNS service does is translate a domain name like “setfive.com” into an IP address like 220.127.116.11 which allows a client’s computer to “find” the correct server for a given domain name. In addition, Route53 also has several advanced features normally only available in pricey enterprise DNS solutions. Route53 would typically replace the DNS service provided by your registrar like GoDaddy or Register.com.
Should you use it? Definitely. Allow it isn’t free, after last year’s prolonged GoDaddy outage it’s clear that DNS is a critical component and using a company that treats it as such is important.
What is it? Simple Email Service (SES) is a hosted transactional email service. It allows you to easily send highly deliverable emails using a RESTful API call or via regular SMTP without running your own email infrastructure.
Should you use it? Maybe. SES is comparable to services like SendGrid in that it offers a highly deliverable email service. Although it is missing some of the features that you’ll find on SendGrid, its pricing is attractive and the integration is straightforward. We normally use SES for application emails (think “Forgot your password”) but then use MailChimp or SendGrid for marketing blasts and that seems to work pretty well.
What is it? Identity and access management (IAM) provides enhanced security and identity management for your AWS account. In addition, it allows you to enable “multi factor” authentication to enhance the security of your AWS account.
Should you use it? Definitely. If you have more than 1 person accessing your AWS account using IAM will allow everyone to get a separate account with fine grained permissions. Multi factor authentication is also critically important since a compromise at the infrastructure level would be catastrophic for most businesses. Read more about IAM here.
What is it? Simple storage service (S3) is a flexible, scalable, and highly available storage web service. Think of S3 like having an infinitely large hard drive where you can store files which are then accessible via a unique URL. S3 also supports access control, expiration times, and several other useful features. Additionally, the payment model for S3 is “pay as you go” so you’ll only be billed for the amount of data you store and how much bandwidth you use to transfer it in and out.
Should you use it? Definitely. S3 is probably the most widely used AWS service because of its attractive pricing and ease of use. If you’re running a site with lots of static assets (images, CSS assets, etc.), you’ll probably get a “free” performance boost by hosting those assets on S3. Additionally, S3 is an ideal solution for incremental backups, both data and code. We use S3 extensively, usually for hosting static files, frequently backing up MySQL databases, and backing up git repositories. The new AWS S3 Console also makes administering S3 and using it non-programmatically much easier.
What is it? Elastic Compute Cloud (EC2) is the central piece of the AWS ecosystem. EC2 provides flexible, on-demand computing resources with a “pay as you go” pricing model. Concretely, what this means is that you can “rent” computing resources for as long as you need them and process any workload on the machines you’ve provisioned. Because of its flexibility, EC2 is an attractive alternative to buying traditional servers for unpredictable workloads.
Should you use it? Maybe. Whether or not to use EC2 is always a controversial discussion because the complexity it introduces doesn’t always justify its benefits. As a rule of thumb, if you have unpredictable workloads like sporadic traffic using EC2 to run your infrastructure is probably a worthwhile investment. However, if you’re confident that you can predict the resources you’ll need you might be better served by a “normal” VPS solution like Linode.
What is it? Elastic block store (EBS) provides persist storage volumes that attach to EC2 instances to allow you to persist data past the lifespan of a single EC2. Due to the architecture of elastic compute cloud, all the storage systems on an instance are ephemeral. This means that when an instance is terminated all the data stored on that instance is lost. EBS addresses this issue by providing persistent storage that appears on instances as a regular hard drive.
Should you use it? Maybe. If you’re using EC2, you’ll have to weigh the choice between using only ephemeral instance storage or using EBS to persist data. Beyond that, EBS has well documented performance issues so you’ll have to be cognizant of that while designing your infrastructure.
What is it? CloudWatch provides monitoring for AWS resources including EC2 and EBS. CloudWatch enables administrators to view and collect key metrics and also set a series of alarms to be notified in case of trouble. In addition, CloudWatch can aggregate metrics across EC2 instances which provides useful insight into how your entire stack is operating.
Should you use it? Probably. CloudWatch is significantly easier to setup and use than tools like Nagios but its also less feature rich. We’ve had some success coupling CloudWatch with PagerDuty to provide alerts in case of critical service interruptions. You’ll probably need additional monitoring on top of CloudWatch but its certainly a good baseline to start with.
Anyway, the AWS ecosystem includes several additional services but these are the ones that I felt are key to getting started on AWS. We haven’t had a chance to use it yet but Redshift looks like it’s an exciting addition which will probably make this list soon. As always, comments and feedback welcome.
Posted In: Amazon AWS