I was poking around the ThoughtBot blog a couple of days ago and ran across a post titled Refactoring Ruby Iteration Patterns to the Database. At a high level, the post was summarizing how you can take an ActiveRecord aggregation (a sum in this case) and run it directly in your RDBMS with SQL. Not really rocket science, but it was a keen reminder of how ORMs often mask much of the power of “regular” SQL. This isn’t a specific criticism of ActiveRecord; it’s an issue with every ORM from Doctrine to Hibernate.
We’ve actually been writing some straight SQL lately, mostly for analytics work, so I had the team shoot over their favorite “maybe hidden” SQL feature. Since life is better with examples, the sample use cases and queries are written against a schema describing the “items” found on a “receipt” which are optionally related to a “category”. The SQL to create the schema is:
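A minimal sketch of such a schema, assuming MySQL. Only `item`, `receipt`, `category`, `quadrant`, `total`, and `category_id` come from the text; every other table and column name here is an assumption made for illustration:

```sql
-- Hypothetical schema: items on a receipt, optionally tied to a category.
-- No FOREIGN KEY constraints are declared, so item.receipt_id may point
-- at a receipt row that does not exist.
CREATE TABLE category (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(64)  NOT NULL
);

CREATE TABLE receipt (
  id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  purchased_at DATETIME     NOT NULL
);

CREATE TABLE item (
  id          INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
  receipt_id  INT UNSIGNED  NOT NULL,
  category_id INT UNSIGNED  NULL,       -- NULL means "no category"
  name        VARCHAR(128)  NOT NULL,
  quadrant    CHAR(2)       NOT NULL,   -- e.g. 'WE', 'AX', 'BT'
  total       DECIMAL(10,2) NOT NULL
);
```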
Note: SQLFiddle is unfortunately down right now or I’d add this as a fiddle. Anyway, if you have the schema set up, feel free to run the queries as you go down the list.
At some point, you might find yourself needing to sort a list of rows by a column in an arbitrarily enforced order. For example, say on our item table, you needed to sort the rows by the “quadrant” column such that WE was first, followed by AX, and finally BT.
Turns out, it’s possible to specify an arbitrary ordering by using MySQL’s FIELD() function in the ORDER BY clause:
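A minimal sketch of that ordering, assuming MySQL and an `item` table with `id`, `quadrant`, and `total` columns (the `id` column is an assumption):

```sql
-- FIELD() returns the 1-based position of its first argument in the list
-- that follows, so 'WE' sorts as 1, 'AX' as 2, and 'BT' as 3.
SELECT id, quadrant, total
FROM item
WHERE quadrant IN ('WE', 'AX', 'BT')
ORDER BY FIELD(quadrant, 'WE', 'AX', 'BT');
```

FIELD() returns 0 for any value that isn’t in the list, so without the WHERE clause those rows would sort ahead of WE.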
If you’re only running JOINs on columns with foreign key relations this isn’t an issue, but what happens if you need to run a JOIN where a FK doesn’t necessarily exist? In our example, let’s say you wanted to select only the items which had a corresponding row in the “receipt” table.
The most straightforward way to accomplish this is generally to check that the JOIN’ed column on the related table isn’t NULL:
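One way this check might look, assuming `item` carries a `receipt_id` column that is matched against `receipt.id` (both column names are assumptions):

```sql
-- LEFT JOIN keeps every item; r.id is NULL wherever no receipt matched,
-- so the WHERE clause filters those unmatched items out.
SELECT i.*
FROM item AS i
LEFT JOIN receipt AS r ON r.id = i.receipt_id
WHERE r.id IS NOT NULL;
```

Flipping the predicate to `r.id IS NULL` gives the opposite set: the items whose receipt is missing.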
One of the SQL features that’s usually glossed over or ignored in web development is the ability to create variables and then reuse them in subsequent statements. With this schema, an example would be calculating the “% of total spend” for the individual items. Most people would run one query to generate the total and then a separate query to calculate the % of spend. For something trivial like this it doesn’t matter, but if you were involving complex WHERE predicates it could be a nice performance boost.
The syntax for variables is relatively easy and it’s actually a powerful concept:
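A minimal sketch of the “% of total spend” calculation, assuming MySQL user-defined variables and an `item` table with `id` and `total` columns; the variable name `@total_spend` is made up for this example:

```sql
-- Compute the grand total once and keep it in a session variable.
SET @total_spend := (SELECT SUM(total) FROM item);

-- Reuse the stored value for every row instead of recomputing the sum.
SELECT id,
       total,
       ROUND(total / @total_spend * 100, 2) AS pct_of_spend
FROM item;
```

The variable lives for the rest of the session, so later statements on the same connection can reuse it as well.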
This one is a Matt Daum favorite and pretty handy. Looking at the example, say that you wanted to assign a “sequence” value to each item based on its rank by “total” within its “category_id”. In plain English, for each “category_id” you want to assign the most expensive item a “1”, the second most a “2”, and so on.
This seems straightforward, but try to construct that result set using only a GROUP BY or some combination of subqueries. I’ll wait. Turns out, the easiest way to accomplish this is to use variables to construct a “pseudo” column that increments and resets when the category changes.
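A sketch of that pseudo column, assuming MySQL and an `item` table with `id`, `category_id`, and `total` columns. The variable names `@seq` and `@prev_category` are made up here, and the pattern leans on MySQL evaluating the select expressions left to right in sorted order, which the server does not formally guarantee:

```sql
SELECT i.id,
       i.category_id,
       i.total,
       -- Same category as the previous row: increment. New category: reset to 1.
       -- <=> is the NULL-safe equality operator, so uncategorized items
       -- (category_id IS NULL) are numbered together as one group.
       @seq := IF(@prev_category <=> i.category_id, @seq + 1, 1) AS seq,
       -- Remember this row's category for the comparison on the next row.
       @prev_category := i.category_id AS prev_category
FROM item AS i,
     (SELECT @seq := 0, @prev_category := NULL) AS vars  -- initialize both variables
ORDER BY i.category_id, i.total DESC;
```

On MySQL 8.0 and later, `ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY total DESC)` produces the same numbering without variables, and assigning variables inside expressions is deprecated there.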
Sorry for the terrible description, an example will make it clearer. Given our schema, let’s say you wanted to select *only* the most expensive items per category. How could you set about doing it? The obvious approach would be using some combination of GROUP BY and MAX, but unfortunately that won’t work as expected: the non-aggregated columns in a GROUP BY query aren’t tied to the row that holds the MAX value.
A better approach is to leverage an INNER JOIN along with MAX() to only select the rows that match the max total per category:
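One way to write that join, assuming an `item` table with `category_id` and `total` columns; the alias `top` and the column `max_total` are made up for this sketch:

```sql
SELECT i.*
FROM item AS i
INNER JOIN (
  -- One row per category, carrying that category's highest total.
  SELECT category_id, MAX(total) AS max_total
  FROM item
  GROUP BY category_id
) AS top
  ON  top.category_id = i.category_id
  AND top.max_total   = i.total;
```

Because the join compares `category_id` with `=`, items whose `category_id` is NULL never match and drop out of the result.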
The caveat here is that you’re really selecting the highest total, so if two rows in a category share the same top total, the join matches both of them, and without an explicit tie-breaker you can’t count on getting exactly one row back per category. This approach also scales out, in the sense that you can add additional INNER JOINs to limit the result set in situations where you’re getting tripped up by GROUP BYs and ORDER BYs.
Anyway, as always, we’d love to hear your favorite tips and tricks in the comments!