Despite how important they are, MySQL indexes are a bit of a dark art. Sure everyone knows indexes are important but details on how they’re implemented and when they’ll be used are hard to come by. Beyond regular indexes, MySQL’s composite indexes are especially opaque in regards to how and when they’ll be used. As the name suggests composite indexes are an index constructed across two columns versus a regular index on a single column. So when might a composite index come in handy? Let’s take a look!
We’ll look at a table “client_order” that captures some fictional orders from our fictional clients:
And we’ll fill it up with 5 million fictional orders with dates spanning the last 10 years. You can grab the data from https://setfive-misc.s3.amazonaws.com/client_order.sql.gz if you want to follow along locally.
To get started, let’s figure out the total amount spent for a couple of clients:
~1.5 seconds to calculate the sums and according to the EXPLAIN MySQL had to use a temporary table and a filesort. Will an index help here? Lets add one and find out.
~0.2 seconds and looking at the EXPLAIN we’ve cut down the number of rows MySQL has to look at to 424, much better. OK great, but now what if we’re only interested in looking at data from Christmas Eve in 2016?
(Note: Details on why we’re querying with full timestamps below)
As you can see, MySQL is still using the client_id index but we’re left still scanning 281,308 rows even though only 335 are actually relevant to us. So how do we fix this? Enter, the composite index! Let’s add one on (client_id, created_at) and see if it helps our query:
It helps but we’re clearly still looking a lot more rows than we need. So what gives? It turns out the order of the composite index is actually critically important since that dictates how MySQL assembles the b-tree for the index. Let’s flip the order of our index and try again:
And there you go! MySQL only has to look at 1360 rows as expected.
So what’s up with having to query with the full timestamps vs. just using DATE(created_at)? It turns out MySQL can’t use datetime indexes when you apply functions to the column you’re querying on. And beyond that, even certain ranges cause MySQL to not select indexes that would work fine:
Which then leads to the unintuitive conclusion that if you actually needed to implement any sort of aggregation by day you’d be better off adding a “date” column calculated from the “created_at” and indexing on that:
Anyway, as always comments and feedback welcome!
Posted In: Big Data
I was poking around the ThoughtBot blog a couple of days ago and ran across a post titled Refactoring Ruby Iteration Patterns to the Database. At a high level, the post was summarizing how you can take an ActiveRecord aggregation (a sum in this case) and run it in directly in your RDMS with SQL. Not really rocket science, but it was a keen reminder of how ORMs often mask over much of the power of “regular” SQL. This isn’t a specific criticism of ActiveRecord, it’s an issue with every ORM from Doctrine to Hybernate.
We’ve actually been writing some straight SQL lately, mostly for analytics work, so I had the team shoot over their favorite “maybe hidden” SQL feature. Since life is better with examples, the sample use cases and queries are written against a schema describing the “items” found on a “receipt” which are optionally related to a “category”. The SQL to create the schema is:
Note: SQLFiddle is unfortunately down right now or I’d add this as a fiddle. Anyway, if you have the schema setup feel free to run the queries as you go down the list.
At some point, you might find yourself needing to sort a list of rows by a column in an arbitrarily enforced order. For example, say on our item table, you needed to sort the rows by the “quadrant” column such that WE was first, followed by AX, and finally BT.
Turns out, it’s possible to specify an arbitrary ordering using the ORDER BY FIELD statement:
If you’re only running JOINs on columns with foreign key relations this isn’t an issue, but what happens if you need to run a JOIN where a FK doesn’t necessarily exist? In our example, lets say you wanted to select only the items which had a corresponding row in the “receipt” table.
The most straightforward way to accomplish this is generally to check that the JOIN’ed column on the related table isn’t NULL:
One of the SQL features that’s usually glossed over or ignored in web development is the ability to create variables and then reuse them in subsequent statements. With this schema, an example would be calculating the “% of total spend” for the individual items – most people would run one query to generate the total and then a separate query to calculate the % of spend. For something trivial like this it doesn’t matter but if you were involving complex WHERE predicates it could be a nice performance boost.
The syntax for variables is relatively easy and it’s actually a powerful concept:
This one is a Matt Daum favorite and pretty handy. Looking at the example, say that you wanted to assign a “sequence” value to each item depending on their rank order based on “total” within their “category_id”. In plain English, for each “category_id” you want to assign the most expensive item a “1”, the second most a “2”, and so on.
This seems straightforward, but try and construct a result set using only a GROUP BY or some combination of sub queries, I’ll wait. Turns out, the easiest way to accomplish this is to use variables to construct a “pseudo” column that increments and resets when the category changes.
Sorry for the terrible description, an example will make it clearer. Given our schema, lets say you wanted to select *only* the most expensive items per category, how could you set about doing it? The obvious approach would be using some combination of GROUP BY and MAX but unfortunately because of the semantics of GROUP BY that wont work as expected.
A better approach, is to leverage an INNER JOIN along with MAX() to only select the rows that match the max total per category:
The caveat here is that you’re really selecting the highest total, so if two rows have the same total you’re not guaranteed which one you’ll end up with. This approach also scales out, in the sense that you can add additional INNER JOINs to limit the resultset in situations where you’re getting tripped up by GROUP BYs and ORDER BYs.
Anyway, as always, we’d love to hear your favorite tips and tricks in the comments!
Posted In: Tips n' Tricks
Recently for a client we had a very peculiar problem: a nightly script which checked if a person has done a certain action that day was always flagging everyone. We tested the script multiple times on our servers and it always worked fine. The query had something similar to this:
SELECT * FROM action_table WHERE DATE(action_table.time)=DATE(NOW())
Well after a while of trouble shooting we found out that the system clock on the clients server was skewed and in the wrong time zone. We synced the machines clock and update its time zone. Before it had been in UTC time and we switched it over to EST. What is interesting is that MySQL did not respect the new time zone of the clock, it was still reporting as if the system was set to UTC. We then did a soft restart(reload) on the MySQL service, but it still maintained that it was in a UTC timezone and not EST. It took a hard restart of the MySQL service to have it respect the EST timezone.
All in all, as far as we can tell you need to restart the MySQL service to have it respect a new timezone.
Posted In: General