Despite how important they are, MySQL indexes are a bit of a dark art. Sure everyone knows indexes are important but details on how they’re implemented and when they’ll be used are hard to come by. Beyond regular indexes, MySQL’s composite indexes are especially opaque in regards to how and when they’ll be used. As the name suggests composite indexes are an index constructed across two columns versus a regular index on a single column. So when might a composite index come in handy? Let’s take a look!
We’ll look at a table “client_order” that captures some fictional orders from our fictional clients:
CREATE TABLE client_order ( | |
id int(11) NOT NULL AUTO_INCREMENT, | |
client_id int(11) DEFAULT NULL, | |
total decimal(6, 2) NOT NULL, | |
created_at datetime DEFAULT NULL, | |
PRIMARY KEY (`id`) | |
); |
And we’ll fill it up with 5 million fictional orders with dates spanning the last 10 years. You can grab the data from https://setfive-misc.s3.amazonaws.com/client_order.sql.gz if you want to follow along locally.
To get started, let’s figure out the total amount spent for a couple of clients:
https://gist.github.com/adatta02/f675b2c7b0659ab960d791b44ee02861
~1.5 seconds to calculate the sums and according to the EXPLAIN MySQL had to use a temporary table and a filesort. Will an index help here? Lets add one and find out.
mysql> ALTER TABLE client_order ADD INDEX client_idx (client_id); | |
Query OK, 0 rows affected (15.44 sec) | |
Records: 0 Duplicates: 0 Warnings: 0 | |
mysql> SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) GROUP BY client_id; | |
+------------+-----------+ | |
| SUM(total) | client_id | | |
+------------+-----------+ | |
| 23342.81 | 111929 | | |
| 23182.22 | 180565 | | |
| 23186.94 | 372530 | | |
+------------+-----------+ | |
3 rows in set (0.02 sec) | |
mysql> EXPLAIN SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) GROUP BY client_id; | |
+----+-------------+--------------+------------+-------+---------------+------------+---------+------+------+----------+-----------------------+ | |
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | | |
+----+-------------+--------------+------------+-------+---------------+------------+---------+------+------+----------+-----------------------+ | |
| 1 | SIMPLE | client_order | NULL | range | client_idx | client_idx | 5 | NULL | 424 | 100.00 | Using index condition | | |
+----+-------------+--------------+------------+-------+---------------+------------+---------+------+------+----------+-----------------------+ | |
1 row in set, 1 warning (0.00 sec) |
~0.2 seconds and looking at the EXPLAIN we’ve cut down the number of rows MySQL has to look at to 424, much better. OK great, but now what if we’re only interested in looking at data from Christmas Eve in 2016?
mysql> SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) AND created_at >= "2016-12-24 00:00:00" AND created_at <= "2016-12-24 24:59:59" GROUP BY client_id; | |
+------------+-----------+ | |
| SUM(total) | client_id | | |
+------------+-----------+ | |
| 2280.06 | 111929 | | |
| 1403.08 | 180565 | | |
| 2841.30 | 372530 | | |
+------------+-----------+ | |
3 rows in set, 1 warning (1.08 sec) | |
mysql> EXPLAIN SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) AND created_at >= "2016-12-24 00:00:00" AND created_at <= "2016-12-24 24:59:59" GROUP BY client_id; | |
+----+-------------+--------------+------------+-------+---------------+------------+---------+------+--------+----------+------------------------------------+ | |
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | | |
+----+-------------+--------------+------------+-------+---------------+------------+---------+------+--------+----------+------------------------------------+ | |
| 1 | SIMPLE | client_order | NULL | range | client_idx | client_idx | 5 | NULL | 281308 | 11.11 | Using index condition; Using where | | |
+----+-------------+--------------+------------+-------+---------------+------------+---------+------+--------+----------+------------------------------------+ | |
1 row in set, 2 warnings (0.00 sec) | |
mysql> SELECT COUNT(*) FROM client_order WHERE client_id IN (111929, 372530, 180565) AND created_at >= "2016-12-24 00:00:00" AND created_at <= "2016-12-31 24:59:59"; | |
+----------+ | |
| COUNT(*) | | |
+----------+ | |
| 335 | | |
+----------+ | |
1 row in set, 1 warning (1.03 sec) |
(Note: Details on why we’re querying with full timestamps below)
As you can see, MySQL is still using the client_id index but we’re left still scanning 281,308 rows even though only 335 are actually relevant to us. So how do we fix this? Enter, the composite index! Let’s add one on (client_id, created_at) and see if it helps our query:
mysql> ALTER TABLE client_order ADD INDEX client_id_created_at_idx (client_id, created_at); | |
Query OK, 0 rows affected (20.10 sec) | |
Records: 0 Duplicates: 0 Warnings: 0 | |
mysql> EXPLAIN SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) AND created_at >= "2016-12-24 00:00:00" AND created_at <= "2016-12-24 24:59:59" GROUP BY client_id; | |
+----+-------------+--------------+------------+-------+-------------------------------------+--------------------------+---------+------+-------+----------+-----------------------+ | |
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | | |
+----+-------------+--------------+------------+-------+-------------------------------------+--------------------------+---------+------+-------+----------+-----------------------+ | |
| 1 | SIMPLE | client_order | NULL | range | client_idx,client_id_created_at_idx | client_id_created_at_idx | 11 | NULL | 11100 | 100.00 | Using index condition | | |
+----+-------------+--------------+------------+-------+-------------------------------------+--------------------------+---------+------+-------+----------+-----------------------+ | |
1 row in set, 3 warnings (0.00 sec) |
It helps but we’re clearly still looking a lot more rows than we need. So what gives? It turns out the order of the composite index is actually critically important since that dictates how MySQL assembles the b-tree for the index. Let’s flip the order of our index and try again:
mysql> ALTER TABLE client_order DROP INDEX client_id_created_at_idx; | |
Query OK, 0 rows affected (0.06 sec) | |
Records: 0 Duplicates: 0 Warnings: 0 | |
mysql> ALTER TABLE client_order ADD INDEX created_at_client_id_idx (created_at, client_id); | |
Query OK, 0 rows affected (20.21 sec) | |
Records: 0 Duplicates: 0 Warnings: 0 | |
mysql> EXPLAIN SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) AND created_at >= "2016-12-24 00:00:00" AND created_at < "2016-12-25 00:00:00" GROUP BY client_id; | |
+----+-------------+--------------+------------+-------+----------------------------------+-----------------------+---------+------+------+----------+--------------------------------------------------------+ | |
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | | |
+----+-------------+--------------+------------+-------+----------------------------------+-----------------------+---------+------+------+----------+--------------------------------------------------------+ | |
| 1 | SIMPLE | client_order | NULL | range | client_idx,created_at_client_idx | created_at_client_idx | 11 | NULL | 1360 | 5.64 | Using index condition; Using temporary; Using filesort | | |
+----+-------------+--------------+------------+-------+----------------------------------+-----------------------+---------+------+------+----------+--------------------------------------------------------+ |
And there you go! MySQL only has to look at 1360 rows as expected.
So what’s up with having to query with the full timestamps vs. just using DATE(created_at)? It turns out MySQL can’t use datetime indexes when you apply functions to the column you’re querying on. And beyond that, even certain ranges cause MySQL to not select indexes that would work fine:
mysql> EXPLAIN SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) AND created_at >= "2016-12-24 00:00:00" AND created_at <= "2016-12-25 24:59:59" GROUP BY client_id; | |
+----+-------------+--------------+------------+-------+----------------------------------+------------+---------+------+--------+----------+------------------------------------+ | |
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | | |
+----+-------------+--------------+------------+-------+----------------------------------+------------+---------+------+--------+----------+------------------------------------+ | |
| 1 | SIMPLE | client_order | NULL | range | client_idx,created_at_client_idx | client_idx | 5 | NULL | 281308 | 15.22 | Using index condition; Using where | | |
+----+-------------+--------------+------------+-------+----------------------------------+------------+---------+------+--------+----------+------------------------------------+ | |
mysql> EXPLAIN SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) AND created_at >= "2016-01-01 00:00:00" AND created_at < "2016-01-10 00:00:00" GROUP BY client_id; | |
+----+-------------+--------------+------------+-------+----------------------------------+-----------------------+---------+------+-------+----------+-------------------------------------------------------------------+ | |
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | | |
+----+-------------+--------------+------------+-------+----------------------------------+-----------------------+---------+------+-------+----------+-------------------------------------------------------------------+ | |
| 1 | SIMPLE | client_order | NULL | range | client_idx,created_at_client_idx | created_at_client_idx | 11 | NULL | 22828 | 5.64 | Using index condition; Using MRR; Using temporary; Using filesort | | |
+----+-------------+--------------+------------+-------+----------------------------------+-----------------------+---------+------+-------+----------+-------------------------------------------------------------------+ | |
1 row in set, 1 warning (0.01 sec) |
Which then leads to the unintuitive conclusion that if you actually needed to implement any sort of aggregation by day you’d be better off adding a “date” column calculated from the “created_at” and indexing on that:
mysql> ALTER TABLE client_order ADD COLUMN date date default null; | |
Query OK, 0 rows affected (46.72 sec) | |
Records: 0 Duplicates: 0 Warnings: 0 | |
mysql> UPDATE client_order SET `date` = STR_TO_DATE(DATE(created_at), "%Y-%m-%d"); | |
Query OK, 5000000 rows affected (9 min 52.98 sec) | |
Rows matched: 5000000 Changed: 5000000 Warnings: 0 | |
mysql> ALTER TABLE client_order ADD INDEX date_client_idx (date, client_id); | |
Query OK, 0 rows affected (29.43 sec) | |
Records: 0 Duplicates: 0 Warnings: 0 | |
mysql> EXPLAIN SELECT SUM(total), client_id FROM client_order WHERE client_id IN (111929, 372530, 180565) AND `date` = "2016-12-24" GROUP BY client_id; | |
+----+-------------+--------------+------------+-------+--------------------------------------------------+-----------------+---------+------+------+----------+-----------------------+ | |
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | | |
+----+-------------+--------------+------------+-------+--------------------------------------------------+-----------------+---------+------+------+----------+-----------------------+ | |
| 1 | SIMPLE | client_order | NULL | range | client_idx,created_at_client_idx,date_client_idx | date_client_idx | 9 | NULL | 44 | 100.00 | Using index condition | | |
+----+-------------+--------------+------------+-------+--------------------------------------------------+-----------------+---------+------+------+----------+-----------------------+ | |
1 row in set, 1 warning (0.00 sec) |
Anyway, as always comments and feedback welcome!