Making Doctrine and Symfony Logged Queries Runnable

On many of our projects we use Gearman to do background processing.  One of problems with doing things in the background is that the web debug toolbar isn’t available to help with debugging problems, including queries.  Normally when you want to see your queries you can look at the debug toolbar and get a runnable version of the query quickly.  However, when its running in the background, you have to look at the application logs to see what the query is.  The logs don’t contain a runnable format of the query, for example they may look like this:

Problem is you can’t quickly take that to your database and run it to see the results. Plugging in the parameters is easy enough, but it takes time. I decided to quickly whip up a script that will take what is in the gist above and convert it to a runnable format. I’ve posted this over at http://code.setfive.com/doctrine-query-log-converter/ . This hopefully will save you some time when you are trying to debug your background processes.

It should work with both Doctrine 1.x/symfony 1.x and Doctrine2.x/Symfony2.x. If you find any issues with it let me know.

Good luck debugging!

SQL: 5 Tips and tricks to impress your DBA friends

I was poking around the ThoughtBot blog a couple of days ago and ran across a post titled Refactoring Ruby Iteration Patterns to the Database. At a high level, the post was summarizing how you can take an ActiveRecord aggregation (a sum in this case) and run it in directly in your RDMS with SQL. Not really rocket science, but it was a keen reminder of how ORMs often mask over much of the power of “regular” SQL. This isn’t a specific criticism of ActiveRecord, it’s an issue with every ORM from Doctrine to Hybernate.

We’ve actually been writing some straight SQL lately, mostly for analytics work, so I had the team shoot over their favorite “maybe hidden” SQL feature. Since life is better with examples, the sample use cases and queries are written against a schema describing the “items” found on a “receipt” which are optionally related to a “category”. The SQL to create the schema is:

Note: SQLFiddle is unfortunately down right now or I’d add this as a fiddle. Anyway, if you have the schema setup feel free to run the queries as you go down the list.

Order rows by fixed ordering of a column

At some point, you might find yourself needing to sort a list of rows by a column in an arbitrarily enforced order. For example, say on our item table, you needed to sort the rows by the “quadrant” column such that WE was first, followed by AX, and finally BT.

Turns out, it’s possible to specify an arbitrary ordering using the ORDER BY FIELD statement:

Check that a LEFT JOIN relation exists

If you’re only running JOINs on columns with foreign key relations this isn’t an issue, but what happens if you need to run a JOIN where a FK doesn’t necessarily exist? In our example, lets say you wanted to select only the items which had a corresponding row in the “receipt” table.

The most straightforward way to accomplish this is generally to check that the JOIN’ed column on the related table isn’t NULL:

Assign an aggregate value to a variable and re-use

One of the SQL features that’s usually glossed over or ignored in web development is the ability to create variables and then reuse them in subsequent statements. With this schema, an example would be calculating the “% of total spend” for the individual items – most people would run one query to generate the total and then a separate query to calculate the % of spend. For something trivial like this it doesn’t matter but if you were involving complex WHERE predicates it could be a nice performance boost.

The syntax for variables is relatively easy and it’s actually a powerful concept:

Add synthetic “pseudo” columns using variables

This one is a Matt Daum favorite and pretty handy. Looking at the example, say that you wanted to assign a “sequence” value to each item depending on their rank order based on “total” within their “category_id”. In plain English, for each “category_id” you want to assign the most expensive item a “1”, the second most a “2”, and so on.

This seems straightforward, but try and construct a result set using only a GROUP BY or some combination of sub queries, I’ll wait. Turns out, the easiest way to accomplish this is to use variables to construct a “pseudo” column that increments and resets when the category changes.

Select only GROUP’ed rows that fulfill a second clause

Sorry for the terrible description, an example will make it clearer. Given our schema, lets say you wanted to select *only* the most expensive items per category, how could you set about doing it? The obvious approach would be using some combination of GROUP BY and MAX but unfortunately because of the semantics of GROUP BY that wont work as expected.

A better approach, is to leverage an INNER JOIN along with MAX() to only select the rows that match the max total per category:

The caveat here is that you’re really selecting the highest total, so if two rows have the same total you’re not guaranteed which one you’ll end up with. This approach also scales out, in the sense that you can add additional INNER JOINs to limit the resultset in situations where you’re getting tripped up by GROUP BYs and ORDER BYs.

Anyway, as always, we’d love to hear your favorite tips and tricks in the comments!

SQL Join Checker, Making Sure Your Joins Are Right

Recently on a project we came across the need to generate a bunch of different reports from the database.  Due to different requirements we weren’t able to use the ORM (Doctrine2 on the specific project), so we wrote the queries by hand.  As we continued to build the different reports we noticed sometimes we’d typo a join, for example join something on `id` versus `user_id`.  These small typos would cause the reports to still run fine, however have the incorrect data, often it was difficult to pinpoint the exact issue in the given report, as only certain conditions could reproduce the results.  After a while Ashish said it’d be great if we had some sort of sanity checker to make sure the queries we were writing were going across the proper joins.  To me, this was:

challenge-accepted

 

At first I thought about just using Regular Expressions to parse out the join parts of the SQL queries.  However, I found http://code.google.com/p/php-sql-parser/ which appears to do the job.  I downloaded it and wrote a class which uses it and some expressions to discover FK’s in the database.  I ended up with something which, albeit not the most elegant, gets the job done.  Here is an example output of it:

Basically it will run through whatever query you give it, and make sure that the columns you are joining on are defined in the DB. If you are trying to join on a column that is defined as a constraint, it will output the part of the join that failed the check as well as what FK’s currently do exist. Another issue this may help with, is if your database is missing a constraint (FK) that should be defined it will point it out.

I wrote this really quickly, so let me know (or make a pull request) if you find any bugs. I’ve put the code up on Github. Let me know if it helps out!

sfPropelPager and GROUP BY criteria

So for one reason or another (actually a few bad ones) I ended up having to use a Criteria object looking like this:

It makes SQL that looks something like this:

“SELECT tag_id, tag_model, id, COUNT(*) AS the_count FROM sf_tagging WHERE tag_id IN (1,2,3) GROUP BY taggable_model, taggable_id HAVING the_count > int”

The query sucks but whatever it works.

My issue came when I tried to use it with a sfPropelPager. I set up the pager per usual but for some reason the results that were coming back weren’t correct. For some reason, the COUNT being returned by the sfPropelPager was completely wrong. It turns out the offending lines are here in sfPropelPager.class.php :

For whatever reason, sfPropelPager clears the GROUP BY clauses before it calculates the COUNT for a criteria object. I’m not sure why it does this – but it certainly is unexpected and breaks my query in particular.

There are a handful of posts about this on the Symfony forums and it looks like the Propel people know about the issue to.

The solution to this is to use the setPeerCountMethod() from sfPropelPager. The setPeerCountMethod() function allows you to specify a custom COUNT() method inside the peer for your Criteria. I went ahead and added a new function to put the GROUP BY columns back in:

This solution works but it is extremely rigid. Since the custom count function has to be static you’d really be out of luck if you had variable columns or other dynamic requirements.

I’d love to know if someone has a cleaner/better/more elegant solution for this.