How will HHVM influence PHP?

Over the last few weeks, there’s been a slew of HHVM-related news, from the “We are the 98.5%” post from the HHVM team to the “Our HHVM Roadmap” post from the Doctrine team. With the increasing excitement around HHVM, it’s becoming clear that the project is going to play an important role in the evolution of the PHP ecosystem. Even though it’s still in its early stages, what influence will the HHVM project have on the PHP ecosystem as a whole?

Force the creation of a language spec

In contrast with other languages like Python or JavaScript, PHP has no formal language specification. There’s some extended discussion on this StackOverflow thread, with links to PHP internals posts, but the consensus is that there isn’t a defined EBNF grammar for PHP or a “specification” for how things should work. Instead of a spec, the behavior of PHP has effectively been defined by how the Zend interpreter works, since it’s been the only viable implementation of the language to date. HHVM changes this situation by introducing another runtime for the language: a developer won’t necessarily know whether their code will be deployed on Zend or HHVM. Because of this, the community will have to develop a language specification to ensure that any language changes are implemented identically in both runtimes.

Willingness to introduce BC breakage

One of the hallmarks of PHP has been its strong adherence to backwards compatibility: five-year-old code written to target PHP4 will generally run today on PHP5+ without any modifications. This has been possible because changes to PHP the language avoided altering behavior in ways that would have broken previously working code. Because of this, many of PHP’s long-standing syntax issues haven’t been fixed and changes to the standard library have been largely avoided. If HHVM emerges as an alternative runtime, some of this hesitation should go away: if only “newer” code is expected to run on HHVM, it becomes conceivable to introduce an “HHVM compat” mode into the Zend implementation that could include BC-breaking syntax changes.

JIT into Zend

Just-in-time compilation has been shown to dramatically increase the execution speed of interpreted programming languages, and it’s one of the key benefits of the HHVM runtime. HHVM identifies “hot” code paths that are repeatedly executed and then compiles those blocks to native code. The result is that those hot sections execute faster since they’re running native code instead of interpreted opcodes. As HHVM becomes more popular, I think we’ll see cross-pollination of JIT into Zend, similar to how Firefox adopted JIT after Google released Chrome.

Anyway, it’s still early, but the emergence of HHVM as an alternative PHP runtime will definitely have a positive influence on the PHP ecosystem. From technology sharing to increased competition, the future is bright and I’m excited to see how PHP evolves in the next few years.

Fun: The AVC Word Cloud

Happy 2014! In between celebrating Christmas, hanging with family, and ringing in the New Year, I managed to put together a visualization of the words used on avc.com. AVC, written by Fred Wilson, is probably one of the most popular “start up” blogs on the Internet. It covers a wide array of topics, from “MBA Mondays” to USV portfolio companies to, of course, general startup and technology news. Given the range of topics and the fact that the blog has been active since 2003, it seemed like generating a word cloud would produce interesting results. With that goal in mind, I set off the day after Christmas.

Check out the finished product at http://symf.setfive.com/d3_avc_blog_cloud/. I actually decided to use Scala to scrape and process the data, so look for a follow-up post on coming to Scala from PHP.

Taking a quick glance at the clouds, a few things do jump out:

  • “Android” enters the top 100 in 2010 and has remained there since.
  • Amazon is surprisingly absent after 2007.
  • Apple hasn’t made the top 100 in any year.
  • It’s interesting to see when USV portfolio companies like Disqus and Zemanta enter and exit.
  • Bitcoin makes the list for 2013.
  • Blackberry: one and done.
  • Facebook peaked in 2007 and then steadily declined until it dropped off the list this year.
  • Google makes the list every year.
  • Twitter gets in in 2007 and sticks around through this year.

PrestoDB: Running PrestoDB on Amazon EMR

A few weeks ago, Facebook released a new open source project called PrestoDB, which they billed as a marked improvement over Hive and Hadoop. According to the PrestoDB site, Presto is a real-time query engine that supports a SQL-like syntax similar to Hive’s. However, unlike Hive, Presto doesn’t execute queries as MapReduce jobs but instead uses its own internal distribution mechanism. According to the Presto site and current users, most queries will see an order-of-magnitude speedup compared to Hive. And the best part? PrestoDB can read metadata from Hive’s metastore and read files off HDFS just like Hive – pretty wild.

Anyway, since I love new toys (who doesn’t!?), I decided to try setting up PrestoDB on Amazon EMR to see how difficult it was and also to experience the speedups firsthand. Turns out, once you have an Amazon EMR cluster running, getting PrestoDB up is almost trivial. Just follow the PrestoDB deployment directions to get yourself situated. Make sure you create *all* the files or you’ll run into some fairly cryptic errors along the way.

The config files I ended up using were:
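
Roughly speaking, the etc/ directory on the coordinator ends up with a handful of small files, sketched together below. The property names and values here are from memory and vary a bit between Presto releases, so double check them against the deployment docs for the version you’re running:

    # etc/node.properties: node.id is any unique ID, data-dir is the directory created below
    node.environment=production
    node.id=presto-node-1
    node.data-dir=/mnt/presto

    # etc/jvm.config: size the heap to your instance type
    -server
    -Xmx1G
    -XX:+UseConcMarkSweepGC

    # etc/config.properties: a single coordinator that also runs work
    coordinator=true
    datasources=jmx,hive
    http-server.http.port=8080
    task.max-memory=1GB
    discovery-server.enabled=true
    discovery.uri=http://localhost:8080

    # etc/log.properties
    com.facebook.presto=INFO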

You’ll need to create the “/mnt/presto” directory and also make it accessible to whatever user you plan to run the daemon under.
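
On the EMR boxes that’s just a couple of commands; I’m assuming here that the daemon runs as the default “hadoop” user, so swap in whatever user you actually use:

    sudo mkdir -p /mnt/presto
    sudo chown hadoop:hadoop /mnt/presto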

The one huge gotcha I ran into was that I couldn’t figure out what port Hive’s Thrift service was running on. For some reason it’s notably absent from Amazon’s documentation, and I couldn’t find the hive-site.xml file on the EMR EC2s. Completely by chance, I ran across this manual page from Jaspersoft enumerating which ports the different versions of Hive on EMR run Thrift on. Turns out it’s different per Hive version, but 0.11.0 uses 10004.
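
That port goes into Presto’s Hive catalog file. Mine looked roughly like the snippet below; “localhost” assumes you’re running the Presto coordinator on the EMR master node, and the connector name depends on the Hadoop version your AMI ships, so check it against the Presto docs:

    # etc/catalog/hive.properties
    connector.name=hive-hadoop1
    hive.metastore.uri=thrift://localhost:10004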

Once you have everything configured, just follow the docs to start the server and you’ll be ready to query. One thing to note, though, is that you’ll need to set up PrestoDB manually on the rest of your machines and also enable the discovery service for this to “really” work.
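
For reference, starting the daemon and poking at it from the CLI looked something like this (paths depend on where you unpacked the server tarball and the CLI executable):

    # start the Presto daemon in the background
    bin/launcher start

    # connect with the CLI against the hive catalog configured above
    ./presto-cli --server localhost:8080 --catalog hive --schema default
    presto:default> SHOW TABLES;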

Anyway, happy querying!

Musing: Should everyone learn to code?

Last week, President Obama made headlines by suggesting that every American in school should learn how to code. Predictably, the comment sparked some heated discussion across the web, from Fred Wilson’s blog to several threads on Hacker News. Surprisingly, some of the viewpoints were extremely polarized, ranging from “it’s useless, some people will never get it” to “of course!”. Personally, I think everyone should definitely be exposed to some form of programming while they’re in school.

An inescapable reality is that in 2013 computers are a part of everyone’s personal and professional day-to-day. From non-technical roles in technical fields, like account managers or project managers, to traditionally non-technical jobs, like teachers, everyone ultimately interacts with computers on a daily basis. With that in mind, having a basic understanding of how computing abstractions and programming work will benefit everyone. From being able to modify a VBA macro to constructing a complex Gmail search query, knowing how the pieces fit together certainly can’t hurt.

Looking back at high school, drawing an analogy between studying programming and studying a foreign language isn’t really accurate. A better analogy is the general experience people have studying math in middle and high school. For people who don’t take a math class in college, that will normally be the last time they study math in an academic setting. Although most people forget the details they learned, they still retain the overarching fundamentals of how things like algebra and geometry work. Because of this, when they’re faced with a basic math problem they generally know what they need to look up in order to solve it. Extending this, if people were introduced to basic programming early on, they’d have a sense that there might be an easier way to approach certain tasks. Need to format a list of names in Excel? There might be a function for that.

So how can we make this happen? The good news is there’s already a push to make high-quality, programming-focused education material available to everyone. There are dozens of massive open online course projects, including Khan Academy, Coursera, and Codecademy, providing free, interactive computer science resources for everyone. The next step is pushing states and school systems to actively adopt CS education for their middle school and high school students. Hopefully it’ll prove an easy and effective step toward keeping everyone competitive in an increasingly technology-powered workplace.

Hive: Hive in 15 minutes on Amazon EMR

As far as “big data” solutions go, Hive is probably one of the more recognizable names. Hive basically offers the end user an abstraction layer to run “SQL-like” queries as MapReduce jobs across data they have in HDFS. Concretely, if you had several hundred million rows of data and wanted to count the number of unique IDs, Hive would let you do that. One of the issues with Hadoop, and by proxy Hive, is that it’s notably difficult to set up a cluster to try things out. Tools like Whirr exist to make things easier, but they’re a bit rough around the edges and in my experience run up against “version hell”. One alternative that I’m surprised isn’t more popular is using Amazon’s Elastic MapReduce to bootstrap a Hadoop cluster to experiment with.
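
For a flavor of what that looks like, a “count the unique IDs” query in Hive is just familiar-looking SQL; the table and column names here are made up for illustration:

    SELECT COUNT(DISTINCT user_id) FROM events;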

Fire up the cluster

The first thing you’ll need to do is fire up an EMR cluster from the AWS console. It’s mostly just point and click, but the settings I used were:

  • Termination protection? No
  • Logging? Disabled
  • Debugging? Off, since no logging
  • Tags: None
  • AMI Version: 2.4.2 (latest)
  • Applications to be installed:
      • Hive 0.11.0.1
      • Pig 0.11.1.1
  • Hardware Configuration:
      • One m1.small for the master
      • Two m1.smalls for the cores

The “security and access” section is important: you need to select an existing key pair that you have access to so that you can SSH into the master node and use the Hive CLI client.

Finally, under Steps, since you’re not specifying any pre-determined steps, make sure you mark “Auto-terminate” as “No” so that the cluster doesn’t terminate immediately after it boots.

Click “Create Cluster” and you’re off to the races.

Pull some data, and load HDFS

Once the cluster launches, you’ll see a dashboard screen with a bunch of information about the cluster including the public DNS address for the “Master”. SSH into this machine using the user “hadoop” and whatever key you launched the cluster with:
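
Something along these lines, with your own key file and the master’s public DNS swapped in:

    ssh -i ~/my-emr-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com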

Once you’re in, you’ll want to grab some data to play with. I pulled down Wikipedia page view data since it’s just a bunch of gzipped text files, which are perfect for Hive. You can pull down a chunk of files using wget; be aware, though, that the small EC2s don’t have much storage, so you’ll need to keep an eye on your disk space.
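
For example, grabbing a few of the hourly pagecount dumps (the URLs below are illustrative; check dumps.wikimedia.org for the current layout):

    # grab a few hourly pagecount files to play with
    wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-12/pagecounts-20131201-000000.gz
    wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-12/pagecounts-20131201-010000.gz
    wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-12/pagecounts-20131201-020000.gz

    # keep an eye on free space
    df -h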

Once you have some data (grab a few GB), the next step is to push it over to HDFS, Hadoop’s distributed filesystem. As an aside, Amazon EMR is tightly integrated with Amazon S3, so if you already have a dataset in S3 you can copy it directly from S3 to HDFS. Anyway, to push your files to HDFS, just run:
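
Something like the following, adjusting the paths for wherever you downloaded the files:

    # create an HDFS directory and copy the gzipped files into it
    hadoop fs -mkdir /pagecounts
    hadoop fs -put pagecounts-*.gz /pagecounts/
    hadoop fs -ls /pagecounts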

Build some tables, query some data!

And finally, it’s time to query some of the pageview data using Hive. The first step is to let Hive know about your data and what format it’s stored in. To do this, you need to create an external table that points to the location of the files that you just pushed to HDFS. Start the Hive client by running “hive” and then do the following:
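
A sketch of the table definition, assuming the standard pagecounts format of space-separated “project page views bytes” lines; Hive reads the gzipped text files transparently:

    CREATE EXTERNAL TABLE pagecounts (
      project STRING,
      page STRING,
      views BIGINT,
      bytes_served BIGINT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    LOCATION '/pagecounts';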

Now select some data from your newly created table!
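
For example, to eyeball a few rows from the table defined above:

    SELECT * FROM pagecounts LIMIT 10;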

Pretty sweet, huh? Now feel free to run arbitrary queries against the data. Note: since we used m1.small EC2s, the performance of Hive/Hadoop is going to be pretty abysmal. But hey, give it a shot:
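
Something like a per-project rollup will kick off a real MapReduce job (the column names match the table sketch above):

    SELECT project, SUM(views) AS total_views
    FROM pagecounts
    GROUP BY project
    ORDER BY total_views DESC
    LIMIT 20;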

Anyway, don’t forget to tear down the cluster once you’re done. As always, let me know if you run into any issues!