PHP: Does “big-o” complexity really matter?

Last week, a client of ours as us to look at some code that was running particularly slowly. The code was powering an autocompleter that searched a list of high schools in the US and returned the schools that matched and an identifying code. We took a look at the code, and it turns out the original developers had implemented a naivete solution that was choking up since the list had gotten to ~45k elements and I imagine they had only tested with a dozen or so. During the process of implementing a slicker solution, we decided to benchmark a couple of different approaches to see how much the differences in “big-o” complexity really mattered.

The Problem

What we were looking at was the following:

– There is a CSV file that looks something like:

ID, STATE, SCHOOL NAME
2,NMSC DEPT OF ED & SVCS,IL
3,MY SCHOOL IS NOT LISTED DOMEST,NY
4,MY SCHOOL IS NOT LISTED-INTRNT,NY
8,DISTRICT COUNCIL 37 AFSCME,NY
20,AMERICAN SAMOA CMTY COLLEGE,AS
81,LANDMARK COLLEGE,VT

With data for about 45k schools.

  • On the frontend, there was a vanilla jQuery UI autocompleter that passed a state as well as “school name part” to the backend to retrieve autocomplete results.
  • The endpoint basically takes the state and school part, parses the available data, and returns the results as a JSON array.
  • So as an example, the function accepts something like {state: “MA”, query: “New”} and returns:
[
  {name: "New School", code: 1234}.
  {name: "Newton South", code: 1234},
  {name: "Newtown High", code: 1234},
]

The Solutions

In the name of science, we put together a couple of solutions, benchmarked them by running them 1000 times and calculating the min/max/average times, and those values are graphed below. Each of the solutions is briefly described below along with how they’re referenced in the graph.

The initial solution that our client had been running read the entire CSV into a PHP array, then searched the PHP array for schools that matched the query. (readMemoryScan)

A slightly better approach is doing the search “in-place” without actually reading the entire file into memory. (unsortedTableScan)

But can we take advantage of how the data is structured? Turns out we can. Since we’re looking for schools in a specific state whose name’s start with a search string we can sort the file by STATE then SCHOOL NAME which will let us abort the search early. (sortedTableScan)

Since we’re always searching by STATE and SCHOOL NAME can we exploit this to cut down on the number of elements that need to be searched even further?

Turns out we can by transforming the CSV file into a PHP array indexed by state and then writing that out as a serialized PHP object. Another detail we can exploit is that the autocompleter has a minimum search length of 3 characters so we can actually build sub-arrays inside the list of schools keyed on the first 3 letters of their name (serializednFileScan).

So the data structure we’d end up creating looks something like:

{
...
  "MA": {
  ...
   "AME": [...list of schools in MA starting with AME...],
   "NEW": [...list of schools in MA starting with NEW...],
  ...
  },
  "NJ": {
  ...
   "AME": [...list of schools in NJ starting with AME...],
   "NEW": [...list of schools in NJ starting with NEW...],
  ...
  },
  "CA": {
  ...
   "AME": [...list of schools in CA starting with AME...],
   "NEW": [...list of schools in CAA starting with NEW...],
  ...
  },
...
}

The results

Running each function 1000 times, recording the elapsed time between results, and calculating the min / max / and average times we ended up with these numbers:

test_name min (sec.) max (sec.) average (sec.)
readMemoryScan .662 .690 .673
unsortedTableScan .532 .547 .536
sortedTableScan .260 .276 .264
serializednFileScan .149 .171 .154

And then graphing the averages gets you a graphic that looks like:

The most interesting metric is how the different autocompleters actually “feel” when you use them. We setup a demo at http://symf.setfive.com/autocomplete_test/ Turns out, a few hundred milliseconds makes a huge difference

The conclusion

Looking at our numbers, even with relatively small data sets (<100k elements), the complexity of your algorithms matter. Even though the actual number differences are small, the responsiveness of the autocompleter between the three implementations varies dramatically.

Anyway, so long story short? Pay attention in algorithms class.

Posted In: General, PHP

Tags: ,

  • http://dinhe.net/~aredridel Aria Stewart

    Even more so, algorithmic complexity affects more than computation. If you put units on the notation, you start to see why it matters: “O(parents * children) HTTP requests” starts looking pretty scary even if “parents” = 1 and “children” = 10 when your RTT is 1s.

    Units matter a lot. I’m not scared of O(n^2) CPU operations unless N is sizable. Even O(n) scares me when we’re talking about HTTP requests.

  • http://www.setfive.com/ Ashish Datta

    True – at the end of the day as web engineers we all live and die by latency. https://gist.github.com/jboner/2841832

  • http://www.nonuby.com/ nonuby

    Even with the best solution here you deserializing on each request, that sucks from a perf perspective regardless of algorithm, anything on a GET request shouldnt be doing too much and certainly not be touching disk (even if kernel caches save you ass). I know you are limited with PHP (build the world on each request) but if you’re going to put this amount of effort into it perhaps find a stack that can save some context between requests (i.e. anything apart from apache/php)

  • http://www.setfive.com/ Ashish Datta

    Fair point. We could have popped the unserialized PHP object into memcache and then just retrieved it on each GET request but that seemed a bit like “cheating” for this discussion.

  • Nicholas Andre

    Have you heard of SQLite? It does this wondrous thing called “querying a database” and it’s extremely fast at it due to many many hours of time spent developing a program in a compiled language to parse non-human-readable files.

    What you’ve done is re-implement a simple database in a scripting language.

  • http://photogabble.co.uk/ Simon

    While I agree with your sentiment and indeed the first thing on my mind was why not use an actual database as the data-store – I believe the point of this article was to show us that our choice of algorithm is important and that we should take future scaling into consideration.

  • http://dinhe.net/~aredridel Aria Stewart

    Indeed: SQLite actually has the same performance character (though the constant factors may be much smaller). Indexes are equivalent to re-arranging the data for faster access. If you don’t index your table, then it’s the old linear case just like the original.

    Algorithmic complexity doesn’t change just because you made a different component do the job.

  • Pingback: PHP: How fast is HipHop PHP? « {5} Setfive – Talking to the World()