Since I learned about index two years ago, I have always wanted to write an article about index. But I’m a procrastinator with cancer, and I’ve been putting it off for two years. This article is my own note, welcome criticism.

An overview of the

What is the index? Many books and articles use the analogy of a book’s table of contents. The purpose of the table of contents is to make it easy for us to find the location of specific content, the scope of specific chapters. Similarly, indexes in MySQL are used to help speed up queries and sorts.

Index types in InnoDB include hash index, B+ tree index, and full-text index. Hash indexes are designed to be adaptive in InnoDB and are not discussed. After InnoDB1.12, we have full-text index, which is also used in reverse order. We haven’t stepped into the pit yet (it is said that we don’t support Chinese), you can study it if you have time.

This article focuses on B+ tree indexes.

B + tree

To learn about MySQL indexes, you must first understand how they work, otherwise many questions will be confused. The principles of index data structures are described below, rather than their complex implementation.

Before we talk about B+ trees, why do indexes speed up queries? Why use B+ tree as index instead of B tree? Why not use hash indexes when the query complexity is O(1)?

BST and AVL

Before we look at B trees and B+ trees, let’s look at binary search trees (BST) and balanced binary trees (AVL).

In sequential storage structures (such as arrays), the fastest case is when the first value is the target value, and the worst case is when the last value is the target value. Assuming n elements, the average time complexity using large O notation is O(n).

Using binary search tree can effectively optimize the search time. Using the characteristics of binary search tree, the left child node is smaller than the parent node, and the right child node is larger than the parent node, which can be very convenient for binary search and effectively optimize the search time.

Under normal circumstances, we can use binary search tree, but if the following situation occurs, the binary search tree will not work, the average time complexity of the search is still O(n).

The balanced binary tree is introduced, the depth difference is not more than 1, so as to ensure the non-tilt, or shorter, to ensure its search efficiency.

B tree and B plus tree

Balanced binary trees already speed up queries, but InnoDB doesn’t actually use them. When thinking about B tree and B+ tree related problems, can not do without a problem – disk IO. Index files are stored on disk, and given a balanced binary tree with tree height of 30, you might have to scan the disk 30 times to complete the search.

When disk IO is required, using balanced binary trees is still bad, so you need to introduce multi-path trees, namely B trees and B+ trees, to make the tree “shorter”.

If the above B tree is changed to a binary tree, the height of the tree will be much higher, in other words, more disk I/OS will be required.

The B+ tree is a variant of the B tree. Non-leaf nodes of a B+ tree do not store data, and all leaf nodes are connected in a bidirectional linked list.

The current index model is basically B+ tree.

Compared with B trees, B+ trees are more stable because some of the data in B trees is distributed on non-leaf nodes.

The leaf nodes of THE B-tree are connected in the form of a linked list and arranged in accordance with the rules. The range data can be obtained more conveniently through the B+ index.

This is why hash indexes are not used. Although the time of hash index search is O(1), most of the time we do not query only one record, in which case the use of hash index is not sufficient.

Clustered index

Clustered indexes, also known as primary key indexes. A table has only one clustered index.

A clustered index is a B+ tree sorted by its primary key, searched by its primary key.

The data of the whole record is stored on the leaf node.

InnoDB B + tree is stored in the disk in the form of data page, in the middle of the tree to insert and delete operation involves column page “split” and “pages” of the complex process (personally, I also don’t understand on this, but can go to understand the rotation of the AVL tree analogy), very consumption performance, and directly insert the tail is a fast way, So in many specifications, it is strongly recommended to use a business-independent increment ID as the primary key when using the InnoDB engine.

In addition, the same is true of deletes, which are often required for pseudo-deletes, not only for data analysis, but also for index performance.

Nonclustered index

A non-clustered index, also known as a secondary index, is built with non-primary key columns. There can be multiple non-clustered indexes. The difference between a non-clustered index and a clustered index is that the leaf node of a non-clustered index does not store the data for the entire record, but instead stores a pointer to the primary key. Therefore, when searching with a non-primary key index, you also need to retrieve the entire data through the primary key index.

Single value index

A single-value index is the creation of a single value on each column of a data form.

CREATE INDEX index_name ON table_name(column);
Copy the code

Like a primary key index, a single-valued index builds a binary tree sorted by the specified column. When a single value is used to search for the target, the primary key index is used to read the entire data.

The only index

A unique index differs little from a single-valued index, except that the value of a unique index does not repeat.

In addition to improving some efficiency, unique indexes are sometimes used to ensure the uniqueness of columns, such as the user’s mobile phone number and ID card. I won’t go into too much detail here.

Joint index

Specify multiple columns when creating a federated index.

CREATE INDEX index_name ONtable_name(column1, columm2, column3 [,...] )Copy the code

A federated index sorts each field in the order in which the index was created. The first field is sorted, then the second field, then the third field…

Cover index

As mentioned earlier, the primary key index is also required to search for records in a non-clustered index, but if the column is just the same as the joint index field, there is no need to search for the primary key index, just take the leaf node value, which is the overridden index.

That’s why you don’t use SELECT *, not only to reduce the overhead of reading more columns, but also to be able to use overwrite indexes. Using overwrite indexes can reduce disk IO and improve performance.

More details on federated indexes will be covered below.

Left-most prefix rule

We know that the node data of the joint index is sorted according to the order in which the index is built. Therefore, we introduce the principle of left-most prefix. In the joint index, if the index field is used, the preceding field cannot be skipped. If the example above, suppose to find column2= “CCC” records, the approximate SQL is as follows

SELECT some_column FROM table_name WHERE column2="ccc"
Copy the code

In this case, the index is not useful, because the index is column1, then column2, and the B+ tree doesn’t know how to search column2.

Index of the failure

In addition to index invalidation under the left-most prefix principle described above, there are other index invalidation cases.

  • Column indexes that are evaluated using MySQL’s built-in functions are invalidated.

  • Use! =, is null, is not null.

    Let’s say you look for id! = 500 records, similar to scan id<500, and id >500 records, essentially full table scan no difference.

  • The column after the range query is unavailable.

    Again using the example above, suppose the query column1 <= 4

    SELECT some_column FROM table_name WHERE column1 <= 4
    Copy the code

    Since column1 is sorted, the joint index column1 can still be used, but column1 is a range data, in which column2 is not ordered.

  • Fuzzy queries starting with wildcards (LIKE “%string”).

    It is worth noting that LIKE “string%” can be indexed, similar to a range query, from the smallest string beginning with string to the largest string beginning with stirng. If you know that LIKE “string%” can use an index, you can understand why LIKE “%string” can’t use an index.

There are other cases that can be analyzed by MySQL’s query analyzer.

The locks used by InooDB are row locks, but if the index fails during an update, row locks become table locks and should be avoided in development.

Index using TIP

  • Fields commonly used for grouping and sorting can be indexed.

    The index is used to query and sort the order by and group by columns.

  • Fields that are frequently queried can be indexed.

  • Do not index fields that are frequently updated.

    If frequently updated fields are created to be indexed, not only the data will be updated, but also the B+ tree of the index will change, which costs a lot and is not worth the loss.

  • Do not index columns with low selectivity.

    For example, if you have a gender field, only male or female or unknown, there are only three values in millions of data, there is no point in building an index.

  • Indexes use equivalence matching whenever possible.

  • Use overwrite indexes whenever possible.

summary

Through the establishment of index, can effectively speed up the database query and sorting. When it comes to database optimization, index optimization is a no-go. The use of index has the skills summarized by the industry leaders, but many things are not absolute, can not be generalized, in the big data and complex business, the maintenance of index is metaphysical, need to constantly find the best index scheme.