Index usage strategy and optimization

The optimization of MySQL is mainly divided into Scheme optimization and Query optimization. The high-performance indexing strategies discussed in this chapter are mainly structural optimization. The content of this chapter is entirely based on the theory above, but in fact, once you understand the mechanism behind indexing, choosing high-performance strategies becomes pure reasoning, and you can understand the logic behind those strategies.

Sample database

To discuss indexing strategies, you need an example of a database with a relatively small amount of data. This article uses one of the sample databases provided in the official MySQL documentation: Employees. This database relationship is moderately complex and has a large amount of data. Here is the e-R diagram of the database (quoted from the MySQL official manual) :

​​

​​

Figure 12

MySQL dev.mysql.com/doc/employe is the official documentation about this database page…

Left-most prefix principle and related optimization

The first condition for efficient use of indexes is to know which queries will use the index. This problem is related to the “left-most prefix principle” in B+Tree. The following example illustrates the left-most prefix principle.

Here’s the concept of a federated index. In this article, we assumed that an index refers to a single column. In fact, an index in MySQL can refer to multiple columns in a certain order. This type of index is called a joint index. An >, where each element is a column of the data table, and actually to define the index rigorously you need to use relational algebra, but I don’t want to talk too much about relational algebra here, because that would be boring, so I’m not going to do it rigorously here. In addition, a single-column index can be regarded as a special case where the number of elements in a joint index is 1.

Using employees.titles as an example, let’s take a look at what indexes are on it:

​​

​​

From the result, the primary index to the titles table is < EMP_no, title, from_date>, and a secondary index < EMP_no >. In order to avoid complex things with multiple indexes (MySQL SQL optimizer behaves more complex with multiple indexes), here we drop secondary indexes:

​​​

​​​

This allows you to focus on the behavior of the PRIMARY index.

Case 1: All columns match.

​​

​​

Obviously, an index can be used when an exact match (meaning “=” or “IN” match) is performed against all columns IN the index. One thing to note here is that indexes are theoretically order sensitive, but since MySQL’s query optimizer automatically adjusts the order of conditions in the WHERE clause to use the appropriate index, for example we reversed the order of conditions in where:

​​

​​

The effect is the same.

Case two: The leftmost prefix matches.

​​

​​

When a query condition matches exactly one or more consecutive columns to the left of the index, such as <emp_no> or <emp_no, title>, it can be used, but only in part, the left-most prefix of the condition. The above query uses the PRIMARY index, but key_len is 4, indicating that only the first column prefix of the index is used.

Case 3: The query criteria use an exact match for the columns in the index, but one of the intermediate criteria is not provided.

​​

​​

Index usage at this time and the situation is the same, because the title is not provided, so queries using only the first column of the index, and then the from_date has in the index, but can not connect and left prefix due to the title does not exist, so need to scan the results to filter from_date (as emp_no only here, So there is no scan). If you want from_date to also use indexes instead of where filtering, you can add a secondary index < emp_NO, from_date> that will be used in the above query. In addition, you can use an optimization called “quarantine column” to fill in the “pit” between EMP_NO and from_date.

First let’s take a look at the different values of title:

​​​

​​​

There are only seven. IN cases where the number of columns called pits is small, consider filling the pit with “IN” to form the left-most prefix:

​​

​​

This time key_len is 59, indicating that the index is fully used, but from the type and rows we can see that IN actually performs a range query, where seven keys are checked. Take a look at the performance comparison of the two queries:

​​

​​

Performance improved a bit after “pit filling”. If there is a lot of data left over after emp_NO filtering, the latter performance advantage becomes even more pronounced. Of course, if the title value is too large, pit padding is not appropriate and a secondary index must be created.

Case 4: The query condition does not specify the first index column.

​​

​​

Since it is not the left-most prefix, a query like index obviously does not use an index.

Case 5: Matches the prefix string of a column.

​​

​​

You can use an index at this point, but you cannot use an index if the wildcard does not appear only at the end. If the wildcard % character does not appear at the beginning, you can use the index, but only one of the prefixes may be used depending on the situation.

Case 6: Range query.

​​

​​

A range column can use an index (it must be the left-most prefix), but the column after the range column cannot use an index. At the same time, the index can be used for at most one range column, so if there are two range columns in the query criteria, the index cannot be used for all of them.

​​

​​

You can see that the index does nothing for the second range index. One interesting aspect of MySQL in particular is that it may not be possible to distinguish range indexes from multi-value matches using explain alone, because both are displayed as range in Type. Also, using “between” does not mean a range query, as in the following query:

​​

​​

It looks like two range queries are being used, but “BETWEEN” on EMP_no is actually equivalent to “IN”, which means emp_no is actually a multi-value exact match. You can see that the query uses all three columns of the index. So be careful to distinguish between multi-value matching and range matching in MySQL, otherwise you will get confused about MySQL’s behavior.

Case 7: Query conditions contain functions or expressions.

Unfortunately, MySQL does not use indexes for columns that contain functions or expressions in the query condition (although some can be used in the mathematical sense). Such as:

​​

​​

Although this query performs the same function as in Case 5, the title column cannot be indexed by using the function left, as case 5 does by using LIKE. Such as:

​​

​​

Obviously this query is equivalent to querying emp_no for 10001, but since the query condition is an expression, MySQL cannot use an index for it. MySQL is not smart enough to automatically optimize constant expressions, so when writing a query, try to avoid expressions in the query.

Index selectivity and prefix index

Since indexes can speed up queries, should we build indexes whenever a query statement needs them? The answer is no. Because indexes speed up queries, they also come at a cost: index files themselves consume storage space, and indexes increase the burden of inserting, deleting, and modifying records. In addition, MySQL also consumes resources to maintain indexes at runtime, so more indexes are not always better. In general, indexes are not recommended in two cases.

The first case is that the table records are relatively small, such as 1000 or even only a few hundred records of the table, there is no need to build an index, let the query do the full table scan. As for how many records is too many, I have my own opinion. My personal experience is to take 2000 as the dividing line. If the number of records does not exceed 2000, we can consider not building indexes, and if the number of records exceeds 2000, we can consider indexes as appropriate.

Another case where indexes are not recommended is when they are less selective. Selectivity of indexes refers to the ratio of non-repeating index values (Cardinality) to the number of table records (#T) :

Index Selectivity = Cardinality / #T

Obviously, the selectivity range is (0, 1), and the higher the selectivity, the greater the index value, which is determined by the nature of B+Tree. For example, if the title field on employees.titles is frequently queried separately, does it need to be indexed? Let’s look at its selectivity:

​​​

​​​

Title has less than 0.0001 selectivity (precise value 0.00001579), so there’s really no need to index it separately.

There is an index optimization strategy related to index selectivity called prefix index, which is to replace the whole column with the prefix of the column as the index key. When the prefix length is appropriate, the selectivity of the prefix index can be close to the whole column index, and the index file size and maintenance cost can be reduced because the index key is shorter. The following uses the employees.employees table as an example to describe how to select and use a prefix index.

As you can see from Figure 12, the employees table has only one index <emp_no>, so if we wanted to search for a person by name, we would have to do a full table scan:

​​

​​

Searching for employees by name frequently is obviously inefficient, so we can consider indexing. <first_name> or <first_name, last_name>

​​

​​

<first_name> <first_name> <first_name, last_name> <first_name, last_name> Select first_name, left(last_name, 3)> from first_name and last_name.

​​​

​​​

The selectivity is good, but it is still a bit far from 0.9313, so add the last_name prefix to 4:

​​​

​​​

<first_name, last_name> <first_name, last_name> <first_name, last_name>

​​​

​​​

SQL > select * from ‘name’ where ‘name’ = ‘name’;

​​

​​

The performance improvement was significant, with query speeds up more than 120 times.

Prefix indexes have both index size and query speed, but their disadvantage is that they cannot be used for ORDER BY and GROUP BY operations, and they cannot be used for Covering indexes (i.e. the data file itself is not accessed when the index itself contains all the data required for the query).

InnoDB primary key selection and insert optimization

When using the InnoDB storage engine, always use a business-independent increment field as the primary key unless you have a specific need.

It is common to see posts or blogs discussing the choice of primary keys. Some people suggest using an auto-increment primary key that is irrelevant to the business. Some people think it is unnecessary to use a unique field such as student number or ID number as the primary key. Whichever argument is supported, most of the arguments are business. Using the InnoDB engine without auto-increment primary keys is definitely a bad idea from a database index optimization perspective.

InnoDB’s index implementation is discussed above. InnoDB uses clustered indexes, and the data records themselves are stored on the leaf node of the main index (a B+Tree). This request within the same leaf node (the size of a memory or disk pages) of the individual data records in the primary key order, so every time when we have a new record into the MySQL will according to its primary key node and insert it into the appropriate position, if the page to load factor (InnoDB default for 15/16), then create a new page (nodes).

If the table uses auto-increment primary keys, each time a new record is inserted, the records are sequentially added to the place following the current index node, and when a page is full, a new page is automatically opened. As shown below:

​​

​​

Figure 13

This results in a compact index structure that is approximately sequentially filled. Since there is no need to move the existing data with each insert, this is efficient and does not add much overhead to maintaining the index.

If you use a non-increment primary key (if id number, student number, etc.), each new record is inserted somewhere in the middle of the existing index page because the value of the primary key is approximately random:

​​

​​

Figure 14

The MySQL had to in order to insert a new record to the appropriate location and mobile data, or even the target page might have been back to disk and clear it from the cache, at this time again want to read from the disk back, it cost a lot of, also frequently mobile, paging are responsible for a large number of pieces, got enough compact index structure, The OPTIMIZE TABLE then had to be used to rebuild the TABLE and OPTIMIZE the fill page.

Therefore, whenever possible, use auto-increment field primary keys on InnoDB.

Afterword.

This article has been written intermittently for half a month, and the main content is the above. Admittedly, this article is somewhat armchair, as I am a novice user of MySQL and do not have much experience with database tuning, so it is a bit of a lie to talk about database index tuning here. Think of it as my personal study notes.

In fact, database index tuning is a technical work, not only by theory, because the actual situation is changing, and MySQL itself has a very complex mechanism, such as query optimization strategy and various engine implementation differences will make the situation more complicated. But at the same time these theories are the basis of index tuning, only on the basis of understanding the theory, can we reasonably infer the tuning strategy and understand the mechanism behind it, and then combine the practice of continuous experiment and exploration, so as to truly achieve the purpose of efficient use of MySQL index.

In addition, MySQL indexes and their optimizations cover a wide range of topics, of which this article only covers a portion. Such as and ORDER BY related index optimization and Covering index (Covering index) topic is not involved in this article, at the same time in addition to b-tree index MySQL also according to different engine support hash index, full text index and so on this article is not involved. If there is a chance, I hope to supplement the parts not covered in this article.

reference

[1] Baron Scbwartz et al. Translated by Wang Xiaodong et al. MySQL (High Performance MySQL); Publishing House of Electronics Industry, 2010

[2] Michael Kofler, Translated by Yang Xiaoyun et al. The Definitive Guide to MySQL5; Posts and Telecommunications Press, 2006

[3] Jiang Chengyao; InnoDB storage Engine; China Machine Press, 2011

[4] D Comer, Ubiquitous B-tree; ACM Computing Surveys (CSUR), 1979

[5] Codd, E. F. (1970). “A relational model of data for large shared data banks”. Communications of the ACM, , Vol. 13, No. 6, pp. 377-387

This is the end of today’s sharing, you must learn C++ yo ~

For those of you who are ready to learn C/C++ programming, if you want to improve your core programming skills, start now!

Wechat official account: C language programming Learning base