What is the index

  • An index is a data structure that helps MySQL retrieve data efficiently. More generally, a database index is like a table of contents at the front of a book, which speeds up database queries.
  • Indexes themselves are generally too large to be stored in memory, so indexes are often stored in files on disk (either in separate index files or in data files with data).
  • We usually refer to the index, including clustered index, overwrite index, composite index, prefix index, unique index, etc., without special description, the default is using B+ tree structure (multi-way search tree, not necessarily binary) index.

Advantages and disadvantages of indexes

Advantage:

  • It can improve the efficiency of data retrieval and reduce the IO cost of database, similar to the catalog of books.
  • Sorting data through indexed columns reduces the cost of sorting data and CPU consumption.
    • Indexed columns are sorted automatically, including single-column indexes and composite indexes, but the sorting of composite indexes is more complicated.
    • Sorting by index column order is much more efficient for the order by statement.

Disadvantage:

  • Indexes take up disk space
  • Indexes improve query efficiency, but reduce the efficiency of updating tables. For example, every time a table is added or deleted, MySQL not only saves the data, but also saves or updates the corresponding index file.

The index type

The primary key index

Values in index columns must be unique and null values are not allowed.

Normal index

The basic index type in MySQL, which has no restrictions, allows the insertion of duplicate and null values in the column where the index is defined.

The only index

Values in index columns must be unique, but null values are allowed.

The full text indexing

Full-text indexes can only be created on fields of the TEXT types CHAR,VARCHAR, or TEXT. If the field length is large and creating a common index is inefficient in like fuzzy query, you can create a full-text index. Full-text indexes can be used in both MyISAM and InnoDB.

Spatial index

MySQL supports spatial indexing in versions after 5.7 and also supports the OpenGIS geometric data model. MySQL follows the OpenGIS geometric data model rules for spatial indexing.

The prefix index

When creating an index on a TEXT type such as CHAR,VARCHAR, or TEXT, you can specify the length of the index column, but not the numeric type.

Others (by number of indexed columns)

  1. Single index

  2. Composite index

    The use of composite indexes must follow the leftmost prefix matching rule (leftmost prefix matching rule). In general, composite indexes are used instead of multiple single-column indexes when conditions permit.

The data structure of the index

The Hash table

TreeMap is a Hash table structure that stores data as key-value pairs. We use Hash tables to store table data. Keys can store index columns, and values can store row records or row disk addresses. Hash table is efficient in equivalent query, and the time complexity is O(1). However, fast range lookup is not supported, and range lookup can only be performed by scanning the full table.

Obviously, this is not suitable for database indexes that require frequent lookups and range lookups.

Binary search tree

Binary trees, I think you all have a picture in your head.

Binary tree features: Each node has a maximum of two forks. The data sequence of the left and right subtrees is smaller on the left and larger on the right.

This feature is to ensure that each search can be halved to reduce the NUMBER of IO, but binary trees are very difficult to test the value of the first root node, because it is easy to happen under this feature we think “tree does not fork”, which is very uncomfortable and unstable.

Obviously this is an unstable situation and we have to design to avoid this situation

Balanced binary trees

Balanced binary tree adopts dichotomy thinking. In addition to having the characteristics of binary tree, the most important feature of balanced binary search tree is that the hierarchy difference between the left and right subtrees is at most 1. When inserting and deleting data, the binary tree is balanced by left-rotation/right-rotation operations, so that the left subtree is not high and the right subtree is short.

The performance of balanced binary search tree query is close to binary search with O(log2n) time complexity. Query id=6, only need two IO.

So with that in mind, you might think that’s good enough to get to the ideal situation of a binary tree. However, some problems remain:

  1. Time complexity is related to tree height. The height of the tree is determined by how many times it needs to be retrieved. For each node read, there is a disk IO operation. The height of the tree is equal to the number of DISK I/O operations performed per data query. The disk search time is 10ms at a time. Therefore, the query performance deteriorates when the table data is large. (1 million data volume, log2n is approximately equal to 20 disk I/OS, time 20*10=0.2s)
  2. The balanced binary tree does not support quick search for range query. Therefore, the range query requires multiple traversal from the root node, resulting in low query efficiency.

B tree: transform binary tree

MySQL data is stored in disk files. When querying and processing data, data on disk needs to be loaded into memory first. Disk I/O operations are time-consuming, so we focus on minimizing disk I/O operations. One I/O occurs for each node accessing the binary tree. If you want to reduce disk I/O operations, you need to reduce the tree height as much as possible. So how do you lower the height of the tree?

If key is bigint=8 bytes, each node has two Pointers, each pointer is 4 bytes, and each node occupies 16 bytes of space (8+4*2=16).

Because the InnoDB storage engine of MySQL will read one page of data (default page 16K) in one IO, while the effective data amount of binary tree in one IO is only 16 bytes, the space utilization is very low. To maximize IO space at a time, a simple idea is to store multiple elements per node, storing as much data as possible on each node. Each node can store 1000 indexes (16K /16=1000), thus transforming a binary tree into a multi-fork tree, changing the tree from tall and thin to short and fat by increasing the number of branches of the tree. To build 1 million pieces of data, the height of the tree only needs 2 layers (1000*1000=1 million), which means it only takes 2 disk I/OS to query the data. Disk I/O times are reduced, and data query efficiency is improved.

This data structure is called B-tree. B-tree is a multi-fork balanced search tree. The main features are as follows:

  1. The nodes of a B-tree store multiple elements, and each inner node has multiple branches.
  2. The elements in the node contain key values and data, and the key values in the node are arranged from largest to smallest. That is, data is stored at all nodes.
  3. Elements in the parent node do not appear in the child node.
  4. All leaf nodes are in the same layer, leaf nodes have the same depth, and there are no pointer connections between leaf nodes.

For example, when querying data in a B tree:

Suppose we query for data with a value equal to 10. Query the path Disk block 1 > Disk Block 2 > Disk Block 5.

First disk I/O: Load disk block 1 into memory. Compare disk block 1 in memory. 10<15, go left, and address disk block 2.

Second disk IO: Load disk block 2 into memory, compare the disk from the beginning in memory, 7<10, address disk to locate disk block 5.

Third disk I/O: Load disk block 5 into the memory, compare 10 in the memory, find 10, fetch data, if data stores row records, fetch data, the query ends. If the disk address is stored, you need to retrieve data from the disk based on the disk address, and the query ends.

Compared with binary balanced search tree, although the number of data comparison is not significantly reduced in the whole search process, disk I/O times will be greatly reduced. Also, since our comparison is in memory, the time of comparison is negligible. The height of b-tree is usually two or three layers, which can meet most application scenarios. Therefore, b-tree index construction can improve the efficiency of query.

The process is shown as follows:

B trees are ideal, but your predecessors will tell you that there are still areas where you can optimize:

  1. B tree does not support fast lookup of range query. If you think about a situation like this, if we want to find the data between 10 and 35, after finding 15, we need to go back to the root node and iterate again. We need to iterate from the root node many times, so the query efficiency needs to be improved.
  2. If data stores row records, the size of rows will increase as the number of columns increases. In this case, the amount of data that can be stored in a page is reduced, the tree is correspondingly higher, and disk I/OS are increased.

B+ Tree: Transform the B tree

MySQL uses the B+ tree to build indexes. MySQL uses the B+ tree to build indexes. The main difference between B+ trees and B trees is whether non-leaf nodes store data

  • B tree: Both non-leaf and leaf nodes store data.
  • B+ tree: Only leaf nodes store data, non-leaf nodes store key values. Leaf nodes are connected by bidirectional Pointers, and the lowest leaf node forms a bidirectional ordered linked list.

B+ tree data structure

The lowest leaf node of the B+ tree contains all the indexes. As can be seen from the figure, when B+ tree searches for data, the data is stored in the lowest leaf node, so the data can be queried only when the leaf node is retrieved.

So in case of need to query data of each disk IO have direct relationship with tree height, but on the other hand, since the data is on the leaf node, index of disk blocks for storage index is will increase with the number of, relative to the B tree, tree height of B + tree is theoretically is shorter than B tree.

There is also the case of index overwriting the query, in which the data in the index meets all the data required by the current query statement. At this time, the index can be returned immediately after it is found, without the need to retrieve the lowest leaf node.

Take an example: equivalent query

Suppose we query for data equal to 9. Query the path Disk block 1 > Disk Block 2 > Disk Block 6.

First disk I/O: Load disk block 1 into memory, compare the disk from the beginning in memory, 9<15, go left to disk address disk block 2.

Second disk I/O: Load disk block 2 into memory, compare the disk from the beginning in memory, 7<9<12, address disk to locate disk block 6.

Third disk I/O: Load disk block 6 into the memory, compare the data in the memory from the beginning, find 9 in the third index, fetch data, if data stores row records, fetch data, the query ends. If the disk address is stored, you need to retrieve data from the disk based on the disk address, and the query ends. (The difference here is that Data in InnoDB stores row Data, while MyIsam stores disk addresses.)

The process is shown as follows:

B+ tree query process based on index equivalence

Range query:

Let’s say we want to find something between 9 and 26. The search path is Disk Block 1 > Disk Block 2 > Disk Block 6 > Disk Block 7.

We first look for data with a value equal to 9 and cache the data with a value equal to 9 into the result set. This step is the same as the previous equivalent query process, and three disk I/OS occur.

After finding 15, the underlying leaf node is an ordered list, and we filter all the data that meet the filtering conditions by traversing backwards from disk block 6 and key value 9.

Fourth disk I/O: Locate disk block 7 based on the following pointer to disk 6, load disk 7 into the memory, compare the data in the memory from the beginning, 9<25<26, 9<26<=26, cache data in the result set.

The primary key is unique (there is no data <=26 behind it), so there is no need to look back and the query terminates. Return the result set to the user.

It can be seen that B+ tree can guarantee the fast lookup of equivalence and range query. MySQL index adopts the data structure of B+ tree.

Mysql index implementation

MyISAM index and InnoDB index are two storage engines for Mysql

MyIsam index

Take a simple user table for example. The user table has two indexes: the ID column is the primary key index and the AGE column is the normal index

CREATE TABLE `user`
(
  `id`       int(11) NOT NULL AUTO_INCREMENT,
  `username` varchar(20) DEFAULT NULL,
  `age`      int(11)     DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_age` (`age`) USING BTREE
) ENGINE = MyISAM
  AUTO_INCREMENT = 1
  DEFAULT CHARSET = utf8;
Copy the code

MyIsam_user Queries data

MyISAM data files and index files are stored separately. When MyISAM uses B+ tree to build an index tree, the key value stored in the leaf node is the value of the index column, and the data is the disk address of the index row.

The primary key index

MyIsam Primary key index

The index of the table user is stored in the index file user.myi, and the data file is stored in the data file user.myd.

The following is a brief analysis of disk I/O status during query:

Query data according to primary key equivalent value:

select * from user where id = 28;
Copy the code
  1. Search the primary key tree from the root node, load the root node into memory, compare 28<75, go left. (1 disk I/O)
  2. Load the left subtree node into memory, compare 16<28<47, and search down. (1 disk I/O)
  3. The leaf node was retrieved, and the node was loaded into memory for traversal, comparing 16<28, 18<28, 28=28. The index entry whose value is 30 was found. (1 disk I/O)
  4. Get the disk address from the index entry, and then get the entire row of records from the data file user.myd. (1 disk I/O)
  5. Returns the record to the client.

Disk I/O count: 3 index searches + record data searches.

Query data by primary key range:

select * from user where id between 28 and 47;
Copy the code
  1. Search the primary key tree from the root node, load the root node into memory, compare 28<75, go left. (1 disk I/O)

  2. Load the left subtree node into memory, compare 16<28<47, and search down. (1 disk I/O)

  3. The leaf node was retrieved, and the node was loaded into memory for traversal comparison 16<28, 18<28, 28=28<47. Find the index entry whose value is 28.

    Get row records from data files based on disk addresses and cache them in the result set. (1 disk I/O)

    Our query statement is a range search, which needs to traverse the bottom leaf linked list backward until the last one does not meet the filtering condition.

  4. The bottom leaf linked list is traversed backwards, and the next node is loaded into memory. The traversal comparison is 28<47=47, and the row records are obtained from the data file according to the disk address and cached in the result set. (1 disk I/O)

  5. Finally, two matching filtering conditions are obtained and the query result sets are returned to the client.

Disk I/O count: 4 index searches + record data searches.

MyIsam index range query procedure

Note: The above analysis is for reference only, MyISAM will cache the index node in the MySQL cache during query, and the data cache depends on the operating system’s own cache, so it is not always the disk, here is just to analyze the process of using the index.

Secondary index

In MyISAM, secondary indexes are structured the same as primary key indexes without any difference, and the data stored in leaf nodes is the disk address of row records. However, the key value of the primary key index is unique, while the key value of the secondary index can be repeated.

When querying data, because the key value of the secondary index is not unique, there may be multiple records with the same value. Therefore, even for equivalent query, data needs to be retrieved in the secondary index tree in the way of range query.

InnoDB index

Primary key index (cluster index)

Each InnoDB table has a clustered index, which is built using B+ trees, and leaves store data as whole rows of records. In general, a clustered index is equivalent to a primary key index. When a table does not have a primary key index, InnoDB automatically creates a ROWID field to build the clustered index. InnoDB index creation rules are as follows:

  1. The PRIMARY KEY is defined on the table. InnoDB uses the PRIMARY KEY index as the cluster index.
  2. If the table does not have a primary key defined, InnoDB selects the first non-null unique index column as the cluster index.
  3. If neither of these is present, InnoDB uses an implicit 6-byte integer ROWID field to build the clustered index. The ROWID field is automatically incremented when a new row is inserted.

All indexes other than clustered indexes are called secondary indexes. In middle InnoDB, the leaf node in the secondary index stores data that is the primary key of the row. InnoDB uses this primary key value to search for row records in the clustered index at retrieval time.

User_innodb is used as an example. The ID column of user_InnoDB is the primary key and the age column is the normal index.

CREATE TABLE `user_innodb`
(
  `id`       int(11) NOT NULL AUTO_INCREMENT,
  `username` varchar(20) DEFAULT NULL,
  `age`      int(11)     DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_age` (`age`) USING BTREE
) ENGINE = InnoDB;
Copy the code

The user data

InnoDB data and indexes are stored in a file t_user_Innodb.ibd. InnoDB organizes data in a clustered index.

Leaf nodes of primary key indexes store rows, while secondary indexes store only primary key values.

InnoDB primary key index

Equivalent query data:

select * from user_innodb where id = 28;
Copy the code
  1. Search the primary key tree from the root node, load the root node into memory, compare 28<75, go left. (1 disk I/O)

    Load the left subtree node into memory, compare 16<28<47, and search down. (1 disk I/O)

    The leaf node was retrieved, and the node was loaded into memory for traversal, comparing 16<28, 18<28, 28=28. If the value of the index is 28, the entire row can be retrieved. Returns the change record to the client. (1 disk I/O)

    Number of disk I/OS: 3

    \

Secondary index

All indexes except clustered indexes are called secondary indexes and InnoDB secondary indexes only store primary key values rather than disk addresses.

Take the age column of user_Innodb as an example. The age index results are shown in the following figure.

InnoDB secondary index

The bottom leaf nodes are sorted according to the order of (age, ID). First, they are sorted according to the age column from small to large, and at the same time, they are sorted according to the ID column from small to large.

Using a secondary index requires retrieving the index twice: first retrieving the secondary index for the primary key, and then using the primary key to retrieve the record from the primary index.

Drawing analysis of equivalent query:

select * from t_user_innodb where age=19;
Copy the code

InnoDB assists index queries

The process of retrieving data into a primary key index tree based on the primary key ID retrieved in the secondary index tree is called a back-table query.

Number of disk I/OS: Secondary index 3 + record retrieval 3 times

Composite index

Table abc_innodb, id primary key index, create a joint index idx_ABC (a,b,c).

CREATE TABLE `abc_innodb`
(
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a`  int(11)     DEFAULT NULL,
  `b`  int(11)     DEFAULT NULL,
  `c`  varchar(10) DEFAULT NULL,
  `d`  varchar(10) DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_abc` (`a`, `b`, `c`)
) ENGINE = InnoDB;
Copy the code

select * from abc_innodb order by a, b, c, id;

Data structures for composite indexes:

Composite index structure 1

Combined index query process:

select * from abc_innodb where a = 13 and b = 16 and c = 4;
Copy the code

The query process for composite indexes

Leftmost matching principle:

The principle of left-most prefix matching is related to index storage structure and retrieval method of joint index.

In the composite index tree, the lowest leaf nodes are arranged in ascending order from left to right according to the first column A, but columns B and C are disordered. Column B can only increase order in a small range when columns A are equal, while column C can only increase order in a small range when columns A and B are equal.

Like the query above, the B+ tree first compares column A to determine which direction to search next, left or right. If column A is the same, compare column B. But if the query condition does not have a column, the B+ tree does not know which node to look up first.

Idx_abc (a,b,c); idx_ABC (a,b,c) ,

Rule of left-most prefix matching for composite indexes: When using composite indexes, mysql will keep matching to the right until it encounters a range query (>, <, between, like).

See here, you are not for your OWN SQL statements in the index of more optimization ideas.

Such as:

To avoid back to the table

In the storage engine of InnoDB, the secondary index query is used, because the data saved by the secondary index leaf node is not the data of the current record, but the primary key index of the current record. If the index needs to obtain the complete data of the current record, it must continue to query from the primary key index according to the primary key value. We do this in bits. Think back to the table is bound to cost performance and affect performance. So how do you avoid it?

Use index overwrites, for example: existing User tables (ID (PK),name(key), Sex, Address,hobby…)

Select id,name,sex from user where name =’zhangsan’; This statement is used more frequently than the other columns in the user table. In this case, if we create an index for the name field, instead of a single index, we use a joint index (name, Sex) if the query is executed, the complete data of the current statement can be obtained based on the secondary index.

This can effectively avoid the table to retrieve sex data.

Here is a typical case of using an overwrite index optimization strategy to reduce back to the table.

Use of federated indexes

Joint index, in the establishment of the index, as far as possible in multiple single-column index to judge whether the use of joint index. The use of federated indexes not only saves space, but also makes it easier to use index coverage.

Think about it. The more fields you index, the easier it is to meet the requirements of the query. For example, the joint index (a_b_c) is equivalent to having three indexes: A, A_b, and a_b_c, which saves space. Of course, the space saved is not three times as much as that saved by (A, a_b, and a_b_c), because the data in the index tree does not change, but the data in the index data field is really saved.

Joint index creation principle, in the create index of time due to the frequent use of the column, high degree of differentiation of the column in front, frequent use of representative indexes utilization rate is high, high degree of differentiation on behalf of the filter size is large, these are all in the optimization of index creation needs to consider the scenario, can also be on the fields often need to be as a query returns to joint index, If an overwrite index is used by adding a field to the federated index, I recommend using the federated index in this case.

Use of federated indexes

  1. Consider whether there are already multiple single-column indexes that can be merged, and if so, create the current multiple single-column indexes as a single federated index.
  2. If the current index has columns that are frequently used as return fields, you can consider whether the current column can be added to the existing index so that the query statement can use the overwriting index.

So that’s my own summary of the index, a little bit long and boring, for bookmarking.