Despite how important they are, MySQL indexes are a bit of a dark art. Sure everyone knows indexes are important but details on how they’re implemented and when they’ll be used are hard to come by. Beyond regular indexes, MySQL’s composite indexes are especially opaque in regards to how and when they’ll be used. As the name suggests composite indexes are an index constructed across two columns versus a regular index on a single column. So when might a composite index come in handy? Let’s take a look!
We’ll look at a table “client_order” that captures some fictional orders from our fictional clients:
And we’ll fill it up with 5 million fictional orders with dates spanning the last 10 years. You can grab the data from https://setfive-misc.s3.amazonaws.com/client_order.sql.gz if you want to follow along locally.
To get started, let’s figure out the total amount spent for a couple of clients:
~1.5 seconds to calculate the sums and according to the EXPLAIN MySQL had to use a temporary table and a filesort. Will an index help here? Lets add one and find out.
~0.2 seconds and looking at the EXPLAIN we’ve cut down the number of rows MySQL has to look at to 424, much better. OK great, but now what if we’re only interested in looking at data from Christmas Eve in 2016?
(Note: Details on why we’re querying with full timestamps below)
As you can see, MySQL is still using the client_id index but we’re left still scanning 281,308 rows even though only 335 are actually relevant to us. So how do we fix this? Enter, the composite index! Let’s add one on (client_id, created_at) and see if it helps our query:
It helps but we’re clearly still looking a lot more rows than we need. So what gives? It turns out the order of the composite index is actually critically important since that dictates how MySQL assembles the b-tree for the index. Let’s flip the order of our index and try again:
And there you go! MySQL only has to look at 1360 rows as expected.
So what’s up with having to query with the full timestamps vs. just using DATE(created_at)? It turns out MySQL can’t use datetime indexes when you apply functions to the column you’re querying on. And beyond that, even certain ranges cause MySQL to not select indexes that would work fine:
Which then leads to the unintuitive conclusion that if you actually needed to implement any sort of aggregation by day you’d be better off adding a “date” column calculated from the “created_at” and indexing on that:
Anyway, as always comments and feedback welcome!
Posted In: Big Data