{5} Setfive - Talking to the World - Page 43 of 66 - Ramblings on code, startups, and everything in between

Last week, one of our projects hit a pretty odd limit that I’d never expected to reach. The project is an analytics platform that allows admins to “pull” data from another, third party application. To accomplish this, the application allows admin users to dynamically add and remove columns from SQL tables and then dynamically chart these columns. Because of this, one of the tables had gotten over 350 columns which had all been created dynamically at runtime.

Anyway, things were working fine until last week when the application started throwing the following fatal error: “Fatal error: Uncaught exception ‘Doctrine_Table_Exception’ with message ‘Invalid expression found: ‘ in /usr/share/php/symfony/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Table.php:2746″ Looking at the error, I noticed a warning was actually getting thrown right before the fatal error: “Warning: preg_replace(): Compilation failed: regular expression is too large at offset 32594 in /usr/share/php/symfony/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Table.php on line 2745″ Looking through the code of Table.php, its clear that because “preg_replace” fails the $expression is subsequently blank which causes Doctrine to throw an error. I wanted to see how bad the regex was so I updated the Table.php to dump the expression. Here is what Doctrine was trying to run:

/(lsc_calculated_promoterscore_delta_innovativeness_formula|Lsc_calculated_promoterscore_delta_innovativeness_formula|lsc_calculated_promoterscore_delta_innovativeness_order|lsc_calculated_promoterscore_delta_favorability_formula|Lsc_calculated_promoterscore_delta_favorability_formula|Lsc_calculated_promoterscore_delta_innovativeness_order|Lsc_calculated_promoterscore_delta_favorability_order|

[Lots of columns...]

|Grid_mac_total|GridTotalTotal|GridTotalWinpc|Grid_total_mac|GridWindowsMac|grid_both_mac|grid_mac_ipad|audience_type|GridBothWinpc|GridBothTotal|Grid_both_mac|GridTotalIpad|Audience_type|WindowsosFy13|Grid_mac_ipad|grid_mac_mac|Grid_mac_mac|Program_type|program_type|AudienceType|GridMacTotal|GridBothIpad|GridTotalMac|GridMacWinpc|venue_child|ProgramType|GridMacIpad|Venue_child|GridBothMac|Program_id|program_id|VenueChild|GridMacMac|Fiscalyear|fiscalyear|is_locked|Is_locked|ProgramId|IsLocked|Country|country|venue|Venue|os|OS|id|Os|Id)(Or|And)?/

Looking at the php.net documentation for preg_replace and preg_match neither actually mention a hard limit on the size of a regex that can be compiled. Obviously there is a limit though and I imagine it must depend on the underlying RegExp library that your PHP is compiled against so it might be platform dependent.

As for solutions for this problem? The best solution for an extreme case like this is probably to just manually fill those in with real methods in the Doctrine_Table classes:

Last week, I was catching up with a friend of mine and we started chatting about his most recent project. As we were chatting, he made an offhand comment about how some of the business guys on the team love to refer to what they are working on as a “big data” play, even though it really wasn’t. This stuck with me, since because of the vague definitions around “big data”, it’s easy to shoe horn problems into a “big data” play. Because of this, I think its worth taking a step back and discussing what big data really is and what tools are available to work with it.

It’s all just data

At the end of the day, data is data. It doesn’t really matter if its stored in a CSV text file, a MySQL database, or a NoSQL datastore like Cassandra or MongoDB. Typically though, web applications tend to use a relational database like MySQL or Postgres to persist data. Relational databases store data in a series of tables which are in turn arranged as a series of rows and columns. As an abstraction, think of a series of Excel worksheets which can have links between the rows of each sheet.

For most applications, this works out fine, the database ends up managing say a few thousand customer accounts, each with a few hundred thousand objects associated with them and the total dataset fits conveniently into the server’s RAM. Since the dataset is relatively small, things like retrieving information, updating records, and running ad-hoc analytics queries are all easy to implement and relatively fast. But what happens if your dataset doesn’t fit into memory of even the beefiest of servers? Therein lies the “big data” problem.

Certain applications generate an enormous amount of data on a daily basis. For example, look at Mixpanel, tracking discreet user interactions is going to produce hundreds of thousands of datapoints every day even with just a few clients. With this volume of data, typical relational databases quickly start performing sluggishly and eventually stop being effective entirely. Even simple queries like counting the “# of clicks by user” start to take hours to run, effectively becoming intractable. Although specialized relational databases like Vertica and Oracle 11g do exist to help solve this problem, they’re expensive and proprietery.

Enter the elephant

One of the first companies to publicly discuss their big data strategies was Google in Bigtable: A Distributed Storage System for Structured Data which described their BigTable datastorage system. Although a proprietary solution, the research paper was used as the basis for Apache Hadoop, an open source framework for running MapReduce style jobs over large datasets.

At this point, Hadoop has distinguished itself as the most popular open source big data solution with a rich ecosystem of tools and several companies providing professional services and support including Cloudera and Hortonworks. What Hadoop provides is a low level framework for allowing computation jobs to be distributed across several servers within a cluster. This allows tools to split up very large datasets into smaller chunks, distribute computational tasks across the cluster, and finally assemble the result. So with the Hadoop framework in place, you still need specific tools built to leverage the distributed framework.

The toolbox

There are several tools that effectively leverage Hadoop but here are some of my favorites for quickly building out a cluster:

– Apache Whirr – Automates deploying, bootstrapping, and configuring a Hadoop cluster. Whirr will save you hours of time because instead of manually starting 4 EC2s and configuring them all you can kickstart a cluster with a single command.

– Apache HBase – A column store database that is similar to Google’s original BigTable system. Great for storing billions of records across a Hadoop HDFS file system.

– Apache Hive – A datawharehousing solution that allows you to run “SQL like” queries using Hadoop. It also has native support for pulling data out of MySQL, making it a convenient addition to a stack includes MySQL.

Apart from these, there are dozens of other Hadoop powered tools but its impossible to recommend a single silver bullet without knowing the details of your “big data” problem.

Something that comes up every so often in a sufficiently large PHP project is having to write helper scripts that run on the command line to complete various tasks. It might be periodically processing some images, updating cached analytics, etc. If the project is a Symfony project, it’s usually easy enough to add a Symfony task and be able to leverage the Symfony infrastructure to manage the individual “scripts” as tasks. This is equally true with Drupal, using Drush tasks to manage the individual scripts works well and lets you have a single, central spot for all your “helpers”. But what if its a vanilla PHP project or WordPress?

A technique I’ve started using is to create a class and then add each of the tasks as static functions. This allows you to keep all the tasks in one place, reuse code and configurations, and generally mimic how Symfony tasks and Drush work. From there, the file pulls off $argv to figure out what function to call and just passes $argv in as an argument as well.

Here’s a stub of a class to set something like this up:

Last week we deployed a background script for a client which was used intermitently to batch send a couple of hundred emails. We were using SwiftMailer but weren’t able to use the “Spool” strategy to send because the messages contained Unicode characters which was breaking the serialization. Anyway, we ended up with code that looked something like the following:

Nothing to crazy going on. We were also sending the emails through Amazon SES which is why we introduced the sleep(..) to keep ourselves below the sending limits.

Things seemed like they were fine but then we’d seemingly randomly get the following exception:

PHP Fatal error:  Uncaught exception 'Swift_TransportException' with message 'Expected response code 250 but got code "421", with message "421 Timeout waiting for data from client.
"' in /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/AbstractSmtpTransport.php:406
Stack trace:
#0 /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/AbstractSmtpTransport.php(290): Swift_Transport_AbstractSmtpTransport->_assertResponseCode('421 Timeout wai...',$
#1 /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/EsmtpTransport.php(197): Swift_Transport_AbstractSmtpTransport->executeCommand('MAIL FROM: <adm...', Array, Arra$
#2 /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/EsmtpTransport.php(267): Swift_Transport_EsmtpTransport->executeCommand('MAIL FROM: <adm...', Array)
#3 /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/AbstractSmtpTransport.php(441): Swift_Transport_EsmtpTransport->_doMailFromCommand('admin@chatthrea...')
#4 /usr/share/php/symfony/ve in /usr/share/php/symfony/vendor/swiftmailer/classes/Swift/Transport/AbstractSmtpTransport.php on line 406

After doing some digging around, it turns out Amazon’s SES service has a connection timeout which SwiftMailer was tripping up on. I couldn’t actually find an official published timeout limit but looking at the SwiftMailer code it seemed like it was possible to set a timeout inside Swift. We added a “timeout: 5” setting to our Symfony factories.yml file inside the SwiftMailer settings and it seemed to fix our issues.

Early on Tuesday Bootstrap 2.2.0 was released which included a handful of improvements, a couple of bug fixes, and some documentation updates. News of the update made the front page of Hacker News and generated a heated debate surrounding the usefulness of Bootstrap itself. The top comment basically railed against Bootstrap saying it’s useless since it just introduces a load of boilerplate CSS without any added benefit. It struck me that looking through the Bootstrap documentation, there isn’t a straightforward explanation of what it really is and more importantly, when you should use it.

At a high level, Bootstrap is a “CSS framework” which contains a set of CSS classes to help you develop CSS for a project. Bootstrap wasn’t the first and isn’t the only CSS framework, projects like BlueprintCSS, 960gs, and Foundation are also competing CSS frameworks. What makes Bootstrap stand apart though, is the tight integration between “base” classes and “high level” classes along with the number of components included in the standard distribution. Using Bootstrap, you can effectively use a single toolkit to take your project from a layout grid, through styled HTML forms, and finally stylish Javascript plugins.

The next thing to consider is give your requirements and team, is Bootstrap an appropriate choice for your project. As CSS frameworks go, Bootstrap is pretty heavy and it’s going to introduce conventions and assumptions into your project that if you don’t use, you’ll end up fighting against. Given that, when is a good time to Bootstrap? In my opinion, Bootstrap will end up being the most useful when you have a fast moving, new project that needs a good “default” style for prototyping as well as a style guide for developers to follow. So concretely, a typical team of 3 engineers starting a new project will probably benefit from Bootstrap. Meanwhile, a digital agency designing a micro site for a client with exiting assets and branding probably isn’t going to. So what does Bootstrap get you?

Consistency and re-use

With a single developer using a single CSS file, its pretty easy to keep class names consistent and effectively re-use styles. However, as additional developers and additional CSS files are added to a project it becomes increasingly difficult to effectively re-use classes and often prevent near duplicate definitions. Bootstrap helps mitigate this by introducing classes for styles you’ll probably need right off the bat. Need a button? Use “btn”. Need a bordered table? Add a “table-striped”. Unfortunately, Bootstrap isn’t a silver bullet but it’s better than nothing.

Looks good out of the box

This one will be a bit controversial but it’s important for a lot of people. Out of the box, Bootstrap looks pretty good which gives you a more flexibility in rapidly developing prototypes, proof of concepts, and MVPs because you’re free to focus on the functionality instead of the design. If something ends up moving into production, you should obviously customize Bootstrap away from the stock theme. Bootstrap actually makes this significantly easier because it uses LESS to generate its CSS files. LESS introduces several pre-processors on top of CSS which makes it easy to make a cascading edit throughout the framework. Need to change the colors across your site? Just edit variables.less and re-generate the CSS.

It’s modern

As a framework, Bootstrap leverages several modern development techniques including responsive design, HTML5 data-*, and several others. Taken individually, none of these techniques are particularly notable but as part of an integrated framework they’ll help you write cleaner, more maintainable, and more compatible code.

This isn’t an exhaustive list by any means but hopefully it’ll serve as a good basis for what Bootstrap is and why you should consider using it.

Doctrine 1.2: To many columns causes findBy* to fail

Big Data: What is “Big Data”?

It’s all just data

Enter the elephant

The toolbox

PHP: Quick and dirty CLI tasks

SwiftMailer: Expected response code 250 but got code 421

Twitter Bootstrap: What it (really) is

Consistency and re-use

Looks good out of the box

It’s modern