PHP: Does “big-o” complexity really matter?

Last week, a client of ours as us to look at some code that was running particularly slowly. The code was powering an autocompleter that searched a list of high schools in the US and returned the schools that matched and an identifying code. We took a look at the code, and it turns out the original developers had implemented a naivete solution that was choking up since the list had gotten to ~45k elements and I imagine they had only tested with a dozen or so. During the process of implementing a slicker solution, we decided to benchmark a couple of different approaches to see how much the differences in “big-o” complexity really mattered.

The Problem

What we were looking at was the following:

– There is a CSV file that looks something like:

ID, STATE, SCHOOL NAME
2,NMSC DEPT OF ED & SVCS,IL
3,MY SCHOOL IS NOT LISTED DOMEST,NY
4,MY SCHOOL IS NOT LISTED-INTRNT,NY
8,DISTRICT COUNCIL 37 AFSCME,NY
20,AMERICAN SAMOA CMTY COLLEGE,AS
81,LANDMARK COLLEGE,VT

With data for about 45k schools.

  • On the frontend, there was a vanilla jQuery UI autocompleter that passed a state as well as “school name part” to the backend to retrieve autocomplete results.
  • The endpoint basically takes the state and school part, parses the available data, and returns the results as a JSON array.
  • So as an example, the function accepts something like {state: “MA”, query: “New”} and returns:
[
  {name: "New School", code: 1234}.
  {name: "Newton South", code: 1234},
  {name: "Newtown High", code: 1234},
]

The Solutions

In the name of science, we put together a couple of solutions, benchmarked them by running them 1000 times and calculating the min/max/average times, and those values are graphed below. Each of the solutions is briefly described below along with how they’re referenced in the graph.

The initial solution that our client had been running read the entire CSV into a PHP array, then searched the PHP array for schools that matched the query. (readMemoryScan)

A slightly better approach is doing the search “in-place” without actually reading the entire file into memory. (unsortedTableScan)

But can we take advantage of how the data is structured? Turns out we can. Since we’re looking for schools in a specific state whose name’s start with a search string we can sort the file by STATE then SCHOOL NAME which will let us abort the search early. (sortedTableScan)

Since we’re always searching by STATE and SCHOOL NAME can we exploit this to cut down on the number of elements that need to be searched even further?

Turns out we can by transforming the CSV file into a PHP array indexed by state and then writing that out as a serialized PHP object. Another detail we can exploit is that the autocompleter has a minimum search length of 3 characters so we can actually build sub-arrays inside the list of schools keyed on the first 3 letters of their name (serializednFileScan).

So the data structure we’d end up creating looks something like:

{
...
  "MA": {
  ...
   "AME": [...list of schools in MA starting with AME...],
   "NEW": [...list of schools in MA starting with NEW...],
  ...
  },
  "NJ": {
  ...
   "AME": [...list of schools in NJ starting with AME...],
   "NEW": [...list of schools in NJ starting with NEW...],
  ...
  },
  "CA": {
  ...
   "AME": [...list of schools in CA starting with AME...],
   "NEW": [...list of schools in CAA starting with NEW...],
  ...
  },
...
}

The results

Running each function 1000 times, recording the elapsed time between results, and calculating the min / max / and average times we ended up with these numbers:

test_name min (sec.) max (sec.) average (sec.)
readMemoryScan .662 .690 .673
unsortedTableScan .532 .547 .536
sortedTableScan .260 .276 .264
serializednFileScan .149 .171 .154

And then graphing the averages gets you a graphic that looks like:

The most interesting metric is how the different autocompleters actually “feel” when you use them. We setup a demo at http://symf.setfive.com/autocomplete_test/ Turns out, a few hundred milliseconds makes a huge difference

The conclusion

Looking at our numbers, even with relatively small data sets (<100k elements), the complexity of your algorithms matter. Even though the actual number differences are small, the responsiveness of the autocompleter between the three implementations varies dramatically. Anyway, so long story short? Pay attention in algorithms class.

Symfony2: Using kernel events like preExecute to log requests

A couple of days ago, one of our developers mentioned wanting to log all the requests that hit a specific Symfony2 controller. Back in Symfony 1.2, you’d be able to easily accomplish this with a “preExecute” function in the specific controller that you want to log. We’d actually set something similar to this up and the code would end up looking like:


Symfony2 doesn’t have a “preExecute” hook in the same fashion as 1.2 but using the event system you can accomplish the same thing. What you’ll basically end up doing is configuring an event listener for the “kernel.controller” event, inject the EntityManager (or kernel) and then log the request.

The pertinent service configuration in YAML looks like:

And then the corresponding class looks something like:

And thats about it.

Symfony2 forms without an entity and with a conditional validator

Recently on a project I had a situation where I was using the Symfony2 forms component without an entity. In addition to each field’s constraints, I needed to something similar to symfony 1.4’s conditional validator so that I could make sure that the form on the whole was valid. There are a bunch of docs out there on how to use callback functions on an entity to do this, however I didn’t see much on how to get the entire form that has no entity to do a callback. After reading some of the source code, found that you can set up some ‘form level’ constraints in the setDefaultOptions method. So it will look something like this:

You pass the Callback constraint an array methods which it can call. If you pass one of those methods is an array it is parsed as class::method. In my case by passing $this it uses the currently instantiated form, rather than trying to call the method statically.

From there you can do something like this:

The first parameter is the form’s data fields. From there you can add global level errors to the form, such as if a combination of fields are not valid.

Good luck out there.

PHP: How do you evaluate PHP skill?

We’re always on the hunt for talented LAMP developers and as a consequence we end up evaluating a decent amount of fairly diverse PHP code. We always ask potential employees for a code sample so that we can get a sense of their style and generally make sure they have their heads screwed on right. Because of this, we’ve been evaluating PHP samples from everything from Drupal modules, to batch processing scripts, and even “hardware hacks”.

During this process, one of the issues we’ve had is coming up with an objective rubric to evaluate the relative skill of a PHP developer. Although there are several broad criteria for evaluating code, I’ve been interested in coming up with PHP-centric benchmarks since they’re more directly applicable. Here’s a list of criteria that we’ve been working on to help us identify how familiar an engineer is with PHP.

Negative Signals

Unfortunately, it’s sometimes easier to spot negative signals so here are a few PHP specific “code smells” we’ve identified.

Re-inventing standard library functions

Often, inexperienced engineers won’t search for standard functions that’ll do exactly what they’re looking for. Not always a bad thing but it’s a sign of inexperience. An example would be:

Not really understanding the ORM (or SQL)

This one isn’t really PHP specific but we’ve noticed it a couple of times anyway. We’ll often see code that looks something like:

As you can see, the code returns all the results and then evaluates the criteria as opposed to passing the selection criteria to the ORM or SQL.

Everything is a global

Due to PHP’s global keyword it’s unfortunately really easy to throw encapsulation to the wind and just make everything a global. Because of that, we’ve seen code that looks like:

Positive Signals

What we look for next is usually positive signals that an engineer is generally familiar with PHP. These are generally things you’d pick up after you’ve written a fair amount of PHP.

Has implemented __toString() somewhere

I know this one will be controversial, but my sense is that if an engineer implements __toString() somewhere in PHP they probably have a decent familiarity with the language since its something you have to “seek out”. A canonical example would be something like:

Uses output buffering

Output buffering is a bit exotic but it’s indispensable when building web apps without a framework. It also certainly demonstrates a level of familiarity with PHP. A good example would be capturing output from a template and returning it inside some JSON:

Voodoo Positive Signals

Finally, the last couple of things we’ve been looking for are “exotic” techniques that really demonstrate that someone “gets” PHP. Granted, some (most?) of these are bad ideas in production they do convey a certain level of understanding.

Classes that implement the ArrayAccess interface

Since PHP relies so heavily on arrays, building classes that implement the ArrayAccess interface makes them work more naturally in PHP and definitely demonstrates a strong level of familiarity with the language. An example would be:

Using __set, __get, or __call

Although, using any of these in production is questionable at best, they undeniably do convey a sense that whoever used them knows their way around PHP. An example (from the documentation) would be:

Anyway, like everything I’d take this list with a big grain of salt. We’d also love any input or feedback.