Big Data: What is “Big Data”?

Last week, I was catching up with a friend of mine and we started chatting about his most recent project. As we were chatting, he made an offhand comment about how some of the business guys on the team love to refer to what they are working on as a “big data” play, even though it really wasn’t. This stuck with me, since because of the vague definitions around “big data”, it’s easy to shoe horn problems into a “big data” play. Because of this, I think its worth taking a step back and discussing what big data really is and what tools are available to work with it.

It’s all just data

At the end of the day, data is data. It doesn’t really matter if its stored in a CSV text file, a MySQL database, or a NoSQL datastore like Cassandra or MongoDB. Typically though, web applications tend to use a relational database like MySQL or Postgres to persist data. Relational databases store data in a series of tables which are in turn arranged as a series of rows and columns. As an abstraction, think of a series of Excel worksheets which can have links between the rows of each sheet.

For most applications, this works out fine, the database ends up managing say a few thousand customer accounts, each with a few hundred thousand objects associated with them and the total dataset fits conveniently into the server’s RAM. Since the dataset is relatively small, things like retrieving information, updating records, and running ad-hoc analytics queries are all easy to implement and relatively fast. But what happens if your dataset doesn’t fit into memory of even the beefiest of servers? Therein lies the “big data” problem.

Certain applications generate an enormous amount of data on a daily basis. For example, look at Mixpanel, tracking discreet user interactions is going to produce hundreds of thousands of datapoints every day even with just a few clients. With this volume of data, typical relational databases quickly start performing sluggishly and eventually stop being effective entirely. Even simple queries like counting the “# of clicks by user” start to take hours to run, effectively becoming intractable. Although specialized relational databases like Vertica and Oracle 11g do exist to help solve this problem, they’re expensive and proprietery.

Enter the elephant

One of the first companies to publicly discuss their big data strategies was Google in Bigtable: A Distributed Storage System for Structured Data which described their BigTable datastorage system. Although a proprietary solution, the research paper was used as the basis for Apache Hadoop, an open source framework for running MapReduce style jobs over large datasets.

At this point, Hadoop has distinguished itself as the most popular open source big data solution with a rich ecosystem of tools and several companies providing professional services and support including Cloudera and Hortonworks. What Hadoop provides is a low level framework for allowing computation jobs to be distributed across several servers within a cluster. This allows tools to split up very large datasets into smaller chunks, distribute computational tasks across the cluster, and finally assemble the result. So with the Hadoop framework in place, you still need specific tools built to leverage the distributed framework.

The toolbox

There are several tools that effectively leverage Hadoop but here are some of my favorites for quickly building out a cluster:

Apache Whirr – Automates deploying, bootstrapping, and configuring a Hadoop cluster. Whirr will save you hours of time because instead of manually starting 4 EC2s and configuring them all you can kickstart a cluster with a single command.

Apache HBase – A column store database that is similar to Google’s original BigTable system. Great for storing billions of records across a Hadoop HDFS file system.

Apache Hive – A datawharehousing solution that allows you to run “SQL like” queries using Hadoop. It also has native support for pulling data out of MySQL, making it a convenient addition to a stack includes MySQL.

Apart from these, there are dozens of other Hadoop powered tools but its impossible to recommend a single silver bullet without knowing the details of your “big data” problem.

PHP: Quick and dirty CLI tasks

Something that comes up every so often in a sufficiently large PHP project is having to write helper scripts that run on the command line to complete various tasks. It might be periodically processing some images, updating cached analytics, etc. If the project is a Symfony project, it’s usually easy enough to add a Symfony task and be able to leverage the Symfony infrastructure to manage the individual “scripts” as tasks. This is equally true with Drupal, using Drush tasks to manage the individual scripts works well and lets you have a single, central spot for all your “helpers”. But what if its a vanilla PHP project or WordPress?

A technique I’ve started using is to create a class and then add each of the tasks as static functions. This allows you to keep all the tasks in one place, reuse code and configurations, and generally mimic how Symfony tasks and Drush work. From there, the file pulls off $argv to figure out what function to call and just passes $argv in as an argument as well.

Here’s a stub of a class to set something like this up:

SwiftMailer: Expected response code 250 but got code 421

Last week we deployed a background script for a client which was used intermitently to batch send a couple of hundred emails. We were using SwiftMailer but weren’t able to use the “Spool” strategy to send because the messages contained Unicode characters which was breaking the serialization. Anyway, we ended up with code that looked something like the following:

Nothing to crazy going on. We were also sending the emails through Amazon SES which is why we introduced the sleep(..) to keep ourselves below the sending limits.

Things seemed like they were fine but then we’d seemingly randomly get the following exception:

PHP Fatal error:  Uncaught exception 'Swift_TransportException' with message 'Expected response code 250 but got code "421", with message "421 Timeout waiting for data from client.
"' in /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/AbstractSmtpTransport.php:406
Stack trace:
#0 /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/AbstractSmtpTransport.php(290): Swift_Transport_AbstractSmtpTransport->_assertResponseCode('421 Timeout wai...',$
#1 /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/EsmtpTransport.php(197): Swift_Transport_AbstractSmtpTransport->executeCommand('MAIL FROM: <adm...', Array, Arra$
#2 /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/EsmtpTransport.php(267): Swift_Transport_EsmtpTransport->executeCommand('MAIL FROM: <adm...', Array)
#3 /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/AbstractSmtpTransport.php(441): Swift_Transport_EsmtpTransport->_doMailFromCommand('admin@chatthrea...')
#4 /usr/share/php/symfony/ve in /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/AbstractSmtpTransport.php on line 406

After doing some digging around, it turns out Amazon’s SES service has a connection timeout which SwiftMailer was tripping up on. I couldn’t actually find an official published timeout limit but looking at the SwiftMailer code it seemed like it was possible to set a timeout inside Swift. We added a “timeout: 5” setting to our Symfony factories.yml file inside the SwiftMailer settings and it seemed to fix our issues.

Twitter Bootstrap: What it (really) is

Early on Tuesday Bootstrap 2.2.0 was released which included a handful of improvements, a couple of bug fixes, and some documentation updates. News of the update made the front page of Hacker News and generated a heated debate surrounding the usefulness of Bootstrap itself. The top comment basically railed against Bootstrap saying it’s useless since it just introduces a load of boilerplate CSS without any added benefit. It struck me that looking through the Bootstrap documentation, there isn’t a straightforward explanation of what it really is and more importantly, when you should use it.

At a high level, Bootstrap is a “CSS framework” which contains a set of CSS classes to help you develop CSS for a project. Bootstrap wasn’t the first and isn’t the only CSS framework, projects like BlueprintCSS, 960gs, and Foundation are also competing CSS frameworks. What makes Bootstrap stand apart though, is the tight integration between “base” classes and “high level” classes along with the number of components included in the standard distribution. Using Bootstrap, you can effectively use a single toolkit to take your project from a layout grid, through styled HTML forms, and finally stylish Javascript plugins.

The next thing to consider is give your requirements and team, is Bootstrap an appropriate choice for your project. As CSS frameworks go, Bootstrap is pretty heavy and it’s going to introduce conventions and assumptions into your project that if you don’t use, you’ll end up fighting against. Given that, when is a good time to Bootstrap? In my opinion, Bootstrap will end up being the most useful when you have a fast moving, new project that needs a good “default” style for prototyping as well as a style guide for developers to follow. So concretely, a typical team of 3 engineers starting a new project will probably benefit from Bootstrap. Meanwhile, a digital agency designing a micro site for a client with exiting assets and branding probably isn’t going to. So what does Bootstrap get you?

Consistency and re-use

With a single developer using a single CSS file, its pretty easy to keep class names consistent and effectively re-use styles. However, as additional developers and additional CSS files are added to a project it becomes increasingly difficult to effectively re-use classes and often prevent near duplicate definitions. Bootstrap helps mitigate this by introducing classes for styles you’ll probably need right off the bat. Need a button? Use “btn”. Need a bordered table? Add a “table-striped”. Unfortunately, Bootstrap isn’t a silver bullet but it’s better than nothing.

Looks good out of the box

This one will be a bit controversial but it’s important for a lot of people. Out of the box, Bootstrap looks pretty good which gives you a more flexibility in rapidly developing prototypes, proof of concepts, and MVPs because you’re free to focus on the functionality instead of the design. If something ends up moving into production, you should obviously customize Bootstrap away from the stock theme. Bootstrap actually makes this significantly easier because it uses LESS to generate its CSS files. LESS introduces several pre-processors on top of CSS which makes it easy to make a cascading edit throughout the framework. Need to change the colors across your site? Just edit variables.less and re-generate the CSS.

It’s modern

As a framework, Bootstrap leverages several modern development techniques including responsive design, HTML5 data-*, and several others. Taken individually, none of these techniques are particularly notable but as part of an integrated framework they’ll help you write cleaner, more maintainable, and more compatible code.

This isn’t an exhaustive list by any means but hopefully it’ll serve as a good basis for what Bootstrap is and why you should consider using it.

HTML5: Masking file inputs with CSS3

This morning, I was implimenting an AJAX “profile photo uploader” using the new HTML5 File object. With the new File object, you’re able to do a file upload using an AJAX request instead of requiring a form submission to upload the file.

This works great with the caveat that there still needs to be a <input type="file" /> on the page in order for the user to select a file. This presents a UI/UX issue because there isn’t much leway in styling HTML file inputs and consequently the input was looking out of place on our form.

After poking around for a bit I ran across jQuery File Upload which seems to solve this problem. Inspecting the CSS, the CSS “masks” the file input by using a CSS3 “transform” and then setting the opacity value to 0 so the input is visually hidden.

Implimenting the transform+opacity trick was pretty straightforward but the issue I ran into was that I couldn’t get the mouse cursor being displayed to appear as a “hand”, I kept seeing the “text selector” cursor. I had changed the transform being used and unfortunately the jQuery File Upload CSS didn’t have any comments explaining what it was actually doing to get the cursor part of the CSS to work.

It turns out what you want to do is use the CSS3 “transform” function to skew the file input so that the “Browse” button is covering the element that you want to use to trigger the “select file” behavior and then just set the opacity to 0 to make the file input invisible.

You can see the progression here http://jsfiddle.net/jpvWh/2/ The first container has no styling applied on the file input, the second one has the “Browse” button skewed into the right spot and finally the last one has the opacity set to 0. If you change the “cursor” CSS directive on the file input you’ll notice that in the 2nd and 3rd examples the cursor will change wherever your mouse is in the container. Also, you’ll notice that if you click anywhere in the container the “select a file” dialog will get triggered as expected.

Anyway, as always comments/feedback/questions are welcome!