title

An overview of the

Since learning MySQL, we have heard and seen a lot of optimization advice, such as do not use select * query, use the same field query; It is recommended to use auto-increment primary keys for tables, and so on. These suggestions sound like common sense in MySQL development, but with these optimization suggestions, have we ever wondered why we do them? In this blog post we will start from the principles of MySQL to explain why there are these optimization recommendations.

u

The experimental environment of this article is MySQL 5.7.25

Preliminary knowledge

B + tree index

MySQL’s default storage engine, InnoDB, uses B+ trees to store data, so it is important to understand the basic principles of B+ tree indexes before analyzing optimization recommendations.

B+ tree index schematic

The diagram above shows a B+ tree index (the definition of a B+ tree can be seen here). Each node represents a disk block, which can also be understood as a page in a database. If I want to query data with primary key 35, what will happen to the index? First, it will judge that 35 is less than the root node 37, continue to query the left subtree, judge that 35 is greater than 22 and 33, then enter the right subtree, find the leaf node 33, continue to traverse to find 35, and finally extract its data. In the case of indexes, query 35 takes only 3 IO operations, which is very efficient. In a real world scenario, a 3-tier B + tree can represent millions of data. If millions of data lookups take only three IO, the performance improvement will be huge. If there is no index and each data item takes one IO, the total will be millions of IO, which is obviously very, very expensive. The figure above also shows that as long as the height of the maintenance tree is low enough, the I/O operations will be small enough, and the NUMBER of I/OS will be small, and the query performance will be high.

Explain execution plan

The execution plan

This is an explian execution plan. What do the fields above mean?

  • Id: Number of the Query in the execution plan selected by the Query Optimizer.

  • Select_type: specifies the query type used. There are several query types:

    Type the name instructions
    SIMPLE Queries other than subqueries or UNION queries.
    PRIMARY The outermost query in the subquery,Note that this is not a primary key query.
    UNION SELECT PRIMARY from the first SELECT in the UNION statement.
    DEPENDENT UNION The UNION in the subquery, and all subsequent SELECT from the second SELECT in the UNION, also depend on the result set of the external query.
    UNION RESULT The result of the merge in UNION.
    SUBQUERY The first SELECT of a subquery’s inner query, the result of which is independent of the external query result set.
    DEPENDENT SUBQUERY The first SELECT of a subquery inner query whose results depend on the external query result set.
    DERIVED The driver table is the main table
    MATERIALIZED The results of subqueries are saved as virtual temporary tables
    UNCACHEABLE SUBQUERY A subquery whose result set cannot be cached
    UNCACHEABLE UNION A UNION query whose result set cannot be cached
  • Table: Displays the names of tables in the database accessed by performing this step.

  • Partitions: query for matching partitions in partitioned tables and display NULL for non-partitioned tables.

  • Type: indicates the type used to query the table. The types are as follows:

    Type the name instructions
    all Full table scan.
    const Read constants, at most one record matches, because it is constant, so it really only needs to be read once.
    system System table, which contains only one piece of data, which is of special const type.
    eq_ref At most, only one result will be matched, typically accessed by primary key or unique index.
    ref The index query of the driven table in the Join statement
    full_text Use the full_TEXT index
    ref_or_null The only difference from ref is that a null-value query is added to the indexed query.
    index_merge Use two (or more) indexes in a query, merge the index results, and then read the table data.
    unique_subquery The result field combination in a subquery is either a primary key or a unique index
    index_subquery The result field combination in a subquery is an index, but not a primary key or unique index
    range Index range scan, which often appears in comparison conditions such as <, >,BETWEEN, etc
    Index Full index scan

    Their performance from good to bad is: system > const > eq_ref > ref > full_text > ref_or_null > unique_subquery > index_subquery > range > index_merge > index > all.

  • Possible_keys: Possible index in the query.

  • Key_len: indicates the length of the index used.

  • Ref: Shows which columns or constants are compared to hit indexes.

  • Rows: Number of rows to perform this query scan.

  • When 100 is displayed, it indicates that no rows are filtered. Rows shows the estimated number of rows checked. Multiplied by the filtered percentage, it shows the number of rows connected to the table below. For example, if the number of rows is 1000 and the filter condition is 50.00 (50%), the number of rows joined with the table below is 1000× 50% = 500.

  • Extra: Perform query for additional criteria, detailed criteria can be viewed here.

With this knowledge in hand, let’s use this knowledge to analyze some of the MySQL optimizations that we often recommend.

Why is autoincrement primary key recommended

What if every time we create a table we’re thinking about using the table’s auto-increment primary key? Or UUID? But for performance reasons, we still recommend using an incremented Id. Why? MySQL > update B+ tree index; MySQL > update B+ tree index; In other words, whichever is faster and cheaper to update the B+ tree with an autoincrement Id or a non-autoincrement Id is the better choice. Let’s simulate auto-increment Id insertion and non-auto-increment Id insertion.

Auto-increment Id insertion case: We insert two data points, 10 and 11, into a B + number that already has 10 data points, and see how the tree changes.

On the id

We can find two characteristics here:

1. The scope of influence of self-increasing data insertion is always only the right-most subtree. Either the node is directly inserted into the subtree, or the subtree is split and its parent node is affected.

2. All subtrees except the rightmost are full.

What are the effects of these two features? According to the previous B+ tree index diagram, each point is a disk block, and the operation of each node is equivalent to an IO. Since the nodes affected by each insertion are only the right-most subtree, the range of disk IO can be controlled. The most important point is that the nodes of all subtrees except the rightmost subtree are full. In this case, the physical continuity of data of leaf nodes is better, and the query performance is also higher according to the principle of locality.

Insert non-increment Id:

Not since the id

Non-auto-increment Id insert features comparison with auto-increment Id insert we can easily know:

1. Insert the influence node is uncontrollable and unpredictable.

2. Every subtree has unsatisfactory leaf nodes.

Following the previous analysis, we can see the performance disadvantages of non-auto-increment Id inserts. Nodes cannot be controlled due to data insertion, leading to more frequent node splitting. Node splitting is also an I/O operation, and performance is affected. If the leaf nodes in the subtree are not satisfied, the physical continuity of the leaf nodes is not good. Finally, if we are UUID, if the Id is too long, the node space will be occupied, the number of nodes per page will be reduced, the page will be split more, and the performance will be affected. This is why it is recommended to use auto-increment primary keys.

Why not use a SELECT * query

We often hear query table, as long as you want to query the field, do not need to query the field, strictly prohibit the use of SELECT *, we can think of a very intuitive reason is that the database to help you translate into each field name to query, then query the excess field will occupy memory, bandwidth and other resources. That’s one reason, and it’s an important one, but I want to talk about another reason, overwriting indexes. My previous index article also covered covered indexes, so those interested can click here. Overwriting an index means that a query overwrites the field to be queried using a federated index so that the database does not have to run back to the table, thereby reducing IO and improving performance.

Here I use MySQL official list data to conduct an experiment, the data address can be downloaded here.

I chose the employees table data for the demonstration. By default, there is no federated index, so let’s add a federated index:

- the employee table structure -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- + | Field | Type | Null | Key | Default | Extra | +------------+---------------+------+-----+---------+-------+ | emp_no | int(11) | NO | PRI | NULL | |  | birth_date | date | NO | | NULL | | | first_name | varchar(14) | NO | MUL | NULL | | | last_name | varchar(16) | NO |  | NULL | | | gender | enum('M','F') | NO | | NULL | | | hire_date | date | NO | | NULL | | +------------+---------------+------+-----+---------+-------+Copy the code
ALTER TABLE employees.employees add Index `idx_first_name_last_name` (first_name,last_name);
Copy the code

To check the index of this table, run the show index from employees.employees command.

YxQ4yQ.png

The table has successfully created matching indexes for first_name and last_name, and profiles is enabled to monitor SQL execution.

SET SESSION profiling = 1;
Copy the code

Then execute the following SQL separately

SELECT first_name,last_name  FROM employees.employees WHERE first_name='Eric';
SELECT *  FROM employees.employees WHERE first_name='Eric';
Copy the code

View Profiles;

profiles

Here we see that using the SELECT * field is about 4 times slower than using the SELECT field. Why is this? Let’s look at the execution plan.

The execution plan

Select * from first_name where last_name = ‘last_name’; select * from last_name where last_name = ‘first_name’; select * from last_name where last_name = ‘last_name’; Select * from gender; hire_date; gender; hire_date; gender; hire_date

conclusion

In this article, we analyze two common suggestions in principle. Why do we recommend using autoincrement primary keys? Why not use a select * query? In fact, the main reason is related to index, since we use index to improve our efficiency, we should make full use of it.

1. The efficiency of B+ tree query is affected by the height of the tree. The lower the height of the tree, the fewer I/O query times and the higher the performance.

2. The types of execution plans from best to worst are: system > const > eq_ref > ref > full_text > ref_or_null > unique_subquery > index_subquery > range > index_merge > index > all.

3. The advantage of self-increasing primary keys is continuity, relatively low cost of insertion and maintenance, and most of the leaf nodes of the subtree are full nodes, which has good physical continuity and better query performance.

4. The length of the UUID primary key is too long, resulting in fewer primary keys stored in a single child node and more common start page splitting, which affects performance. This is why it is recommended not to have too long indexes.

5, overwrite index is a good optimization technique, can let the query directly through the index to return data, rather than back to the table, reduce IO, improve performance.

reference

1, https://dev.mysql.com/doc/refman/5.7/en/explain-output.html

2, https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain-join-types

3, http://blog.codinglabs.org/articles/theory-of-mysql-index.html


Pay attention to the public number, enjoy the first time update.