Just for fun: Some tech predictions for 2013

With 2012 wrapping up, I thought it would be fun to make some predictions for 2013. I'm normally not a huge fan of these lists since they're usually a bit dull, so I'll try to make this one a bit scandalous. So here we go, looking at startups, technology, and a bit of everything else.

Series A Crunch

It's overhyped but still probably worth mentioning. There was a huge uptick in angel activity in the last few years, and as a result there's a glut of start-ups looking to raise a Series A round. Unfortunately, there won't be enough Series A deals to go around, and with the general cooling in the consumer web sector, VCs aren't scrambling to make them happen. The net result will probably be an increased number of "zombie" start-ups, acqui-hires, and, obviously, deadpooled companies. Fred Destin from Atlas has a full post discussing this, Series A Crunch, Seed Blues.

Breakout Health Company

The health and wellness vertical has been heating up for a couple of years, but we still haven't seen a company have its "mint.com" moment in the space. I think we'll see one in 2013 because of a combination of the penetration of smartphones and tablets, the growing popularity of health data interoperability, and the federal mandate that electronic medical record keeping be implemented across the board. Deep-pocketed companies like athenahealth have also hinted that they'd be open to acquiring startups in this space, which should up the ante.

Meta Cloud APIs

Looking back at 2012, it could be affectionately nicknamed the “year of the outage”. Several providers were hit by downtime, but the Amazon Web Services disruptions were arguably more serious and frequent than they had been in the past. For some companies, the AWS issues highlighted the fact that relying on a single cloud provider was too risky given the potential of extended downtimes, even across regions. I think because of this, in 2013 we’ll start seeing services that make it easy to provision and configure identical environments at multiple cloud providers. This would essentially allow you to have live instances running on AWS but then have an identical setup running at Rackspace. The foundations for something like this already exist in the jclouds library.

One JS MV* To Rule Them All

There's been an explosion in the number of JavaScript MV* libraries in the last couple of years. Libraries like BackboneJS, AngularJS, and KnockoutJS are all meeting similar needs for the same group of developers. I think in 2013 we'll see at least some consolidation in the number of libraries, and maybe even the emergence of a clear favorite. Hopefully this will lead to some clear patterns for JavaScript MVC development, just like the paradigms used by backend frameworks like Rails or Symfony.

Google Releases a Play API

It's a bit surprising that in 2012 mobile app discovery is still a disaster. Google Now can automagically tell me my flight is delayed, but Google Play can't seem to offer up even basic recommendations. What's even more surprising is that there isn't an official Google Play or iTunes App Store API. It's also a bit odd that even if I "sign in" with a Google Account on a site, the site can't automatically download apps to my phone. I think Google is going to change this in 2013 and open up a wave of innovation around mobile app discovery and monetization. This should shake up the app store distribution problem and also reduce the friction of converting a web user into a mobile app user.

Second Screen Showdown

With the rising popularity of tablets, the second screen TV experience will become increasingly popular in 2013. You'll be able to download network-branded apps (Showtime, HBO, NFL, etc.) that display enhanced content as you're watching your favorite show or team. Look for networks to use this to boost engagement and drive revenue through sales of content-related items.

Anyway, these are all a bit off the cuff but I’d love to discuss in the comments.

D3: Taking a dip

D3 is a “newish” visualization library that has been getting a lot of attention recently. The New York Times has been using it extensively to create visualizations, and in fact its creator is currently employed by the NYT. I’d been meaning to take D3 for a spin for a while but couldn’t find a dataset I wanted to play with until a few weeks ago.

At the end of November, the LA Times published a dataset titled Capital appreciation bonds, which highlighted how various California school districts were funding construction projects with extremely high interest rate bonds. The LA Times described the data as:

Hundreds of California school and community college districts have financed construction projects with capital appreciation bonds that push repayment far into the future and ultimately cost many times what the district borrowed. Government finance experts consider bonds imprudent if the total cost is more than four times the money borrowed or the maturity period is greater than 25 years.

Anyway, you can check out my attempt at a visualization here.

Doctrine 1.2: Too many columns causes findBy* to fail

Last week, one of our projects hit a pretty odd limit that I'd never expected to reach. The project is an analytics platform that allows admins to "pull" data from another, third party application. To accomplish this, the application allows admin users to dynamically add and remove columns from SQL tables and then dynamically chart those columns. Because of this, one of the tables had grown to over 350 columns, all created dynamically at runtime.

Anyway, things were working fine until last week, when the application started throwing the following fatal error:

Fatal error: Uncaught exception 'Doctrine_Table_Exception' with message 'Invalid expression found: ' in /usr/share/php/symfony/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Table.php:2746

Looking at the error, I noticed a warning was actually getting thrown right before it:

Warning: preg_replace(): Compilation failed: regular expression is too large at offset 32594 in /usr/share/php/symfony/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Table.php on line 2745

Looking through the code of Table.php, it's clear that when preg_replace() fails, $expression ends up blank, which is what causes Doctrine to throw the exception. I wanted to see how bad the regex was, so I updated Table.php to dump the expression. Here is what Doctrine was trying to run:

/(lsc_calculated_promoterscore_delta_innovativeness_formula|Lsc_calculated_promoterscore_delta_innovativeness_formula|lsc_calculated_promoterscore_delta_innovativeness_order|lsc_calculated_promoterscore_delta_favorability_formula|Lsc_calculated_promoterscore_delta_favorability_formula|Lsc_calculated_promoterscore_delta_innovativeness_order|Lsc_calculated_promoterscore_delta_favorability_order|

[Lots of columns...]

|Grid_mac_total|GridTotalTotal|GridTotalWinpc|Grid_total_mac|GridWindowsMac|grid_both_mac|grid_mac_ipad|audience_type|GridBothWinpc|GridBothTotal|Grid_both_mac|GridTotalIpad|Audience_type|WindowsosFy13|Grid_mac_ipad|grid_mac_mac|Grid_mac_mac|Program_type|program_type|AudienceType|GridMacTotal|GridBothIpad|GridTotalMac|GridMacWinpc|venue_child|ProgramType|GridMacIpad|Venue_child|GridBothMac|Program_id|program_id|VenueChild|GridMacMac|Fiscalyear|fiscalyear|is_locked|Is_locked|ProgramId|IsLocked|Country|country|venue|Venue|os|OS|id|Os|Id)(Or|And)?/

Looking at the php.net documentation for preg_replace and preg_match, neither actually mentions a hard limit on the size of a regex that can be compiled. Obviously there is a limit though, and since PHP's preg_* functions are backed by the PCRE library, the ceiling depends on how the PCRE your PHP was compiled against was built, so it may well be platform dependent.
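If you want to see the failure in isolation, here's a quick sketch that reproduces it by building a huge alternation, similar to the one Doctrine generated above. The dummy column names and the count of 20,000 are arbitrary; depending on your PCRE build, you may need more or fewer alternatives to trip the limit.

<?php

// Build an enormous alternation out of fake column names. 20,000 is
// enough to blow past the compiled-pattern size limit on many PCRE
// builds, but the exact threshold is platform dependent, so bump the
// count if this happens to compile fine on your machine.
$alternatives = array();
for ($i = 0; $i < 20000; $i++) {
    $alternatives[] = 'column_' . $i;
}
$pattern = '/(' . implode('|', $alternatives) . ')(Or|And)?/';

// On an affected build this emits "Compilation failed: regular
// expression is too large" and returns false instead of 0 or 1.
var_dump(preg_match($pattern, 'column_19999'));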

As for solutions? For an extreme case like this, the best fix is probably to skip the magic finders entirely and write the findBy* methods you actually need as real methods on the model's Doctrine_Table class.
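Here's a minimal sketch of that approach, assuming a hypothetical LscData model with a couple of the columns from the dump above:

<?php

// LscDataTable.php -- a sketch only; "LscData" and the column names are
// stand-ins for the real dynamically generated model in this project.
class LscDataTable extends Doctrine_Table
{
    // Hand-written equivalent of the magic findByAudienceType().
    public function findByAudienceType($value)
    {
        return $this->createQuery('t')
            ->where('t.audience_type = ?', $value)
            ->execute();
    }

    // Hand-written equivalent of the magic findOneByProgramId().
    public function findOneByProgramId($id)
    {
        return $this->createQuery('t')
            ->where('t.program_id = ?', $id)
            ->fetchOne();
    }
}

Since PHP only falls back to __call() when no real method exists, defining the finders you actually use this way means Doctrine's giant preg_replace() never runs for them; the rest of the 350+ columns stay out of the picture.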

Big Data: What is “Big Data”?

Last week, I was catching up with a friend of mine and we started chatting about his most recent project. As we were chatting, he made an offhand comment about how some of the business guys on the team love to refer to what they're working on as a "big data" play, even though it really isn't. This stuck with me because, given the vague definitions around "big data", it's easy to shoehorn almost any problem into a "big data" play. So I think it's worth taking a step back and discussing what big data really is and what tools are available to work with it.

It’s all just data

At the end of the day, data is data. It doesn't really matter if it's stored in a CSV text file, a MySQL database, or a NoSQL datastore like Cassandra or MongoDB. Typically though, web applications tend to use a relational database like MySQL or Postgres to persist data. Relational databases store data in a series of tables, which are in turn arranged as a series of rows and columns. As an abstraction, think of a series of Excel worksheets that can have links between the rows of each sheet.

For most applications, this works out fine: the database ends up managing, say, a few thousand customer accounts, each with a few hundred thousand objects associated with them, and the total dataset fits conveniently into the server's RAM. Since the dataset is relatively small, things like retrieving information, updating records, and running ad-hoc analytics queries are all easy to implement and relatively fast. But what happens if your dataset doesn't fit into the memory of even the beefiest of servers? Therein lies the "big data" problem.

Certain applications generate an enormous amount of data on a daily basis. For example, look at Mixpanel: tracking discrete user interactions is going to produce hundreds of thousands of datapoints every day, even with just a few clients. At this volume, typical relational databases quickly start performing sluggishly and eventually stop being effective entirely. Even simple queries like counting the "# of clicks by user" start to take hours to run, effectively becoming intractable. Although specialized relational databases like Vertica and Oracle 11g do exist to help solve this problem, they're expensive and proprietary.

Enter the elephant

One of the first companies to publicly discuss its big data strategies was Google, notably in Bigtable: A Distributed Storage System for Structured Data, the paper describing its BigTable datastorage system. Although the system itself stayed proprietary, that paper, along with Google's earlier MapReduce and Google File System papers, served as the basis for Apache Hadoop, an open source framework for running MapReduce style jobs over large datasets.

At this point, Hadoop has distinguished itself as the most popular open source big data solution, with a rich ecosystem of tools and several companies providing professional services and support, including Cloudera and Hortonworks. What Hadoop provides is a low level framework for distributing computation jobs across several servers within a cluster. This allows tools to split very large datasets into smaller chunks, spread computational tasks across the cluster, and finally assemble the results. Even with the Hadoop framework in place, though, you still need purpose-built tools that leverage the distributed framework to actually get anything done.
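To make that concrete, here's a minimal sketch of a Hadoop Streaming job, written in PHP, that computes the "# of clicks by user" query from earlier. Hadoop Streaming lets any executable that reads STDIN and writes STDOUT act as a mapper or reducer; the log format here (tab separated, user id in the first column) is just an assumption for illustration.

#!/usr/bin/env php
<?php
// mapper.php -- reads raw click log lines from STDIN and emits "user\t1"
// for each click. Assumes the user id is the first tab-separated column.
while (($line = fgets(STDIN)) !== false) {
    $fields = explode("\t", rtrim($line, "\n"));
    echo $fields[0], "\t1\n";
}

#!/usr/bin/env php
<?php
// reducer.php -- Hadoop Streaming hands the reducer its input sorted by
// key, so a simple running total per user is all we need.
$currentUser = null;
$count = 0;
while (($line = fgets(STDIN)) !== false) {
    list($user, $n) = explode("\t", rtrim($line, "\n"));
    if ($user !== $currentUser) {
        if ($currentUser !== null) {
            echo $currentUser, "\t", $count, "\n";
        }
        $currentUser = $user;
        $count = 0;
    }
    $count += (int) $n;
}
if ($currentUser !== null) {
    echo $currentUser, "\t", $count, "\n";
}

You'd launch this with the hadoop-streaming jar, passing mapper.php and reducer.php as the -mapper and -reducer arguments, and Hadoop takes care of splitting the input, scheduling tasks across the cluster, and collecting the output.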

The toolbox

There are several tools that effectively leverage Hadoop, but here are some of my favorites for quickly building out a cluster:

Apache Whirr – Automates deploying, bootstrapping, and configuring a Hadoop cluster. Whirr will save you hours of time because instead of manually launching four EC2 instances and configuring each one, you can kickstart a cluster with a single command.

Apache HBase – A column store database that is similar to Google's original BigTable system. Great for storing billions of records on top of the Hadoop Distributed File System (HDFS).

Apache Hive – A data warehousing solution that allows you to run "SQL like" queries using Hadoop. Companion tools like Apache Sqoop also make it easy to pull data out of MySQL and into Hive, making it a convenient addition to a stack that already includes MySQL.

Apart from these, there are dozens of other Hadoop powered tools, but it's impossible to recommend a single silver bullet without knowing the details of your "big data" problem.

PHP: Quick and dirty CLI tasks

Something that comes up every so often in a sufficiently large PHP project is having to write helper scripts that run on the command line to complete various tasks. It might be periodically processing some images, updating cached analytics, etc. If the project is a Symfony project, it's usually easy enough to add a Symfony task and leverage the Symfony infrastructure to manage the individual "scripts" as tasks. The same is true with Drupal: using Drush tasks to manage the individual scripts works well and gives you a single, central spot for all your "helpers". But what if it's a vanilla PHP project or WordPress?

A technique I've started using is to create a class and then add each of the tasks as static functions. This allows you to keep all the tasks in one place, reuse code and configuration, and generally mimic how Symfony tasks and Drush work. From there, the script reads $argv to figure out which function to call and passes $argv along as an argument as well.

Here’s a stub of a class to set something like this up:
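This is a minimal sketch of the idea; the task names and config values are made up, but the structure is exactly what's described above:

#!/usr/bin/env php
<?php

// tasks.php -- one class, one static method per CLI task. Run it like:
//   php tasks.php processImages
class Tasks
{
    // Shared configuration every task can reuse.
    private static $config = array('upload_dir' => '/tmp/uploads');

    public static function processImages($argv)
    {
        echo 'Processing images in ' . self::$config['upload_dir'] . "\n";
        // ... actual work goes here ...
    }

    public static function updateAnalytics($argv)
    {
        echo "Updating cached analytics...\n";
        // ... actual work goes here ...
    }
}

// Dispatch: the first argument names the task; the full $argv is passed
// through so each task can read its own options.
$task = isset($argv[1]) ? $argv[1] : null;

if ($task !== null && method_exists('Tasks', $task)) {
    call_user_func(array('Tasks', $task), $argv);
} else {
    echo "Usage: php tasks.php <taskName>\n";
    exit(1);
}

Adding a new helper is then just a matter of adding another static method, and everything stays in one file instead of being scattered across a pile of one-off scripts.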