MySQL — Using index in join query

6 min readOct 21, 2016

Some friends asked me how MySQL uses index in join queries. Absolutely, this is a truly interesting question, which many people concern. Some days ago, I discovered that MySQL Workbench can generate a very informative graph of execution plan for each query executed [1]. So I decided to write this article to share some experience about using index in join queries.

Database

First things first, following is the database, which we will do some experiments on it later.

Figure 1: The database has three tables: `user`, `question` and `answer`.

`user` table has 512.038 rows. Each user is identified by an unique ID in `id` column. One user may have several questions or answers.

`question` table has 262.924 rows. One question belongs to one user indicated by `user_id` column and one question can have more than one answer.

`answer` table has 528.066 rows. One answer belongs to one question represented by `question_id` column and one user denoted by `user_id`.

`user` table has a `PRIMARY` key on `id` column and a unique key `UNIQUE_username` on `username` column.

`question` table has a `PRIMARY` on `id` column and an index `IDX_SCORE` on `score` column.

`answer table has a `PRIMARY` on `id` column, one single-column index `IDX_score` on `score` column and one multi-column index `IDX_userid_score` on `user_id` and `score` column.

Experiments

Join and group by

Assume that we want to get top 5 users who have at least one answer with high scores. Let take a look on the following queries to see how where conditions and forcing index affect to performance.

Because one user may have several answers along with different scores, we need to group answers by `username`.

The explain result shows that MySQL chooses `answer` as a base table. With each row in `answer` table, it finds exactly one corresponding row in `user` table based on the primary key then joins the row into `answer` table. Consequently, the above query takes about 4.74 seconds.

In fact, GROUP BY is also a time-consuming task beside JOIN. Because MySQL uses only one index for each table in one execution, we need to decide using index for join or group by. Let give GROUP BY a chance and see what happens.

According to the explain chart, MySQL will join `answer` table into `user` table. In the row of `user` table, value of type column is “index” and value of Extra column is “Using temporary; Using filesort”. That means a full table scan is performed using reads from the index to take data rows in index order for group by [2]. As a result, the query now takes 1.83 seconds. Needless to say, this query is 2 times faster than the previous query at least because MySQL used `UNIQUE_username` index to speed up GROUP BY.

In summary, we need to decide which heavy task we should use index for, JOIN or GROUP BY. But is there any way to use index for both JOIN and GROUP BY in one query? The answer is YES. Let’s try to optimize the above query a little bit.

In the new query, we simply change the column of GROUP BY from `username` in `user` table to `user_id` in `answer` table. Of course, we will get the same result from this query.

Value in Extra column is “Using index; Using temporary; Using filesort”. This means the index is a covering one and it is used to satisfy all data required from the table, only the index tree is scanned. An index-only scan usually is faster than ALL because the size of the index usually is smaller than the table data. [2]. Fortunately, execution time of the optimized query is 0.68 second, 6.5 times faster than the original query.

Figure 2: This is the execution plan graph of the optimized query, which is generated by MySQL Workbench. The graph demonstrates clearly how MySQL executes the query step by step. First of all, MySQL does a full index scan on `IDX_userid_score` to get pairs of `user_id` and `max_score` of the `user_id`. With each pair, it finds exactly one appropriate row in `user` table based on the primary key for joining. After this step, MySQL already got GROUP BY result without any other extra tasks on temporary tables. The last, MySQL does a filesort for `max_score` on a temporary table to get the final result.

Auto key

Sometimes, MySQL does not only use existing indices to join two tables, It also generates auto indices to join temporary tables. The cost of generating an auto index on a temporary table is negligible compared to the cost of execution without the index. This process is called materialization [3]. MySQL always try to use materialization to improve sub-queries as much as possible. Let take a look on the following query.

select good_question.id, good_answer.id
from (
    (select * from question where question.score > 5000) as good_question
    inner join (select * from answer where answer.score > 1000) as good_answer
    on good_answer.question_id = good_question.id
)

This query will retrieve top questions in which one question has a score bigger than 5000 and has at least one answer which has score bigger than 1000. Definitely, this query is not an optimized one. However, it is a good example to show how MySQL generates an auto_key automatically to improve a join query.

<auto_key0> is an index which is generated automatically on `id` column of the temporary table <derived3>. Benefits of this index is similar to benefits of a normal index. It will help to reduce the cost of finding an appropriate row in a temporary table. [4]

Figure 3: MySQL does an Index Range Scan on `IDX_SCORE` to get satisfied rows from `question` table and creates a temporary `good_question` table (materialized). It does the same way to generate `good_answer` table (materialized). Beside that, it also makes an auto key on `question_id` column in `good_answer` table. Once the derived tables are created, MySQL will join them together based on the auto key.

Conclusion

The experiments show that using index in join queries is extremely important. Without index, execution time of a join query can be very slow. That is the reason why MySQL automatically generate auto keys in materialization. We also note that the cost of auto key generation is significantly less than the cost of joining without auto key.

The optimized query in the first experiment still takes much time because it must generate a temporary table and does file sort for ORDER BY. Readers are welcome to leave comments here if you have any idea to make the query better.

References

[1] https://dev.mysql.com/doc/workbench/en/wb-performance-explain.html

[2] https://dev.mysql.com/doc/refman/5.5/en/explain-output.html#explain-join-types

[3] http://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html

[4] http://dev.mysql.com/doc/refman/5.7/en/explain-extended.html

[5] http://mysqlserverteam.com/better-performance-for-joins-not-using-indexes/

[6] http://mysqlserverteam.com/mysql-5-7-improved-performance-of-queries-with-derived-tables/