Database index

A database index is a data structure that speeds up data retrieval operations on database tables at the expense of additional write and storage space.

MySQL officially defines an index as a data structure used by the storage engine to quickly find records.

  • An index is a physical data Page, and the database Page Size determines how many index rows can be stored on a Page and how many pages are needed to store an index of a given Size.

Since the indexes are so good, can we create an index for each column?

  • Index can speed up the retrieval speed, but also reduce the index column insert, delete, update speed, index maintenance costs.

MySQL has Hash index, B+Tree index.

A Hash index

A Hash index is a data structure that uses a Hash table to store indexes. The Hash table uses the Hash function to calculate the index column into the bucket. The Hash table uses the Hash function to convert the key into a specific position and stores the value into this position. When accessing the bucket, the user can quickly locate the storage position based on the key. As shown below, yesnameFields form hash indexes. Buckets store the array subscripts calculated by key using the hash function, and stores the location of this record in buckets.

  • How is hash indexing implemented?

Hash index Calculates the data in the hash index column through the hash function to obtain the hash code of the corresponding index column. The hash index consists of the hash code and the pointer to the real data row that the hash code points to. Hash indexes are used in scenarios where only equivalent queries for hash index value columns are valid.

Hash index only stores hash value and row pointer, do not store the actual field value, its structure is compact, query speed is very fast, access hash index time complexity is O(1), but hash index is only suitable for equivalent query, including=.IN().< = >, does not support range queries, for example>.<And so on. The performance of a hash index is inversely proportional to the number of hash conflicts. The more hash conflicts there are, the worse the performance of a hash index query is.

  • What about hash collisions?

There are usually open address method and chain address method to solve hash collision. The open address method is to use the probe function to find the next hash address in case of a collision, until an empty hash address is found, and then store the element. Reference: open address method. The linked address method is to store a list node instead of a single element at the conflicting location, so that when a conflict occurs, the next empty list node can be found along the list node at the conflicting location, and then the element can be stored in the list. Regardless of the open address method or chained address method, the search efficiency of hash index will be reduced when hash conflicts occur. Therefore, hash conflicts are inevitable. You need to minimize hash conflicts. Hash conflicts are related to the selected hash algorithm.

To reduce the probability of Hash collisions, select Hash algorithms that avoid Hash collisions. For example, the Percona Server function FNV64(), whose Hash value is 64 bits, has a lower probability of Hash collisions than CRC32. Secondly, considering performance, numeric Hash algorithms are preferred because string Hash algorithms waste space and are inconvenient for comparison.

  • How to use hash index in MySQL?

In MySQL, there are three types of hash index supported by Memory storage engine, InnoDB adaptive hash index and NDB cluster hash index. While InnoDB cannot create hash indexes explicitly natively, it can forge hash indexes to speed up the performance of fixed-value queries. For example, to obtain the hash index function, create a B+Tree index for the url_hash field of a very long text (such as a website URL).

The InnoDB storage engine monitors queries on each index page of a table. When InnoDB notices that certain index values are accessed frequently, The system creates a hash index based on the B+Tree index in memory. In this way, the B+Tree index in memory has the hash index function, that is, it can quickly and accurately access frequently accessed index pages.

InnoDB builds AHI entries for the entire block when three criteria are met:

  1. Analyze whether the number of successful queries using an adaptive hash index (AHI) exceeds 1/16 of the number of records on the block;
  2. Btr_search_info ::n_hash_potential is greater than or equal to BTR_SEARCH_BUILD_LIMIT (100), indicating that the SQL query can use AHI successfully 100 times in a row;
  3. No index has been constructed for the current block or an AHI index has been built on the current block and block-> n_hash_HELPS is greater than twice the number of records on the page or the recommended prefix index column on the current block has changed.
  • Why create an adaptive hash index twice for the B+Tree index page? This is because the query efficiency of the B+Tree index depends on the height of the B+Tree, which is usually 3 in a database systemLayer 4, so access data needs to do 34 queries. However, hash index access requires only one query without hash conflict. The efficiency of hash index is higher than that of B+Tree index in equivalent query.

The creation of adaptive hash indexes enables InnoDB storage engine to automatically create hash indexes for some hot pages based on the frequency and pattern of page access to speed up access. In addition, users can only choose to turn on or off the function of InnoDB’s adaptive hash index, without manual intervention.

B + Tree index

Most indexes in the database are implemented through B+Tree. B+Tree is an upgraded version of B-tree. B-Tree is a balanced multi-path lookup Tree. In MySQL, indexing is implemented at the storage engine layer rather than the server layer, so different storage engines can support different types of indexes. For example, although MyISAM and InnoDB both use B+Tree for their indexes, their actual data storage structures are quite different. When searching for records using B+Tree indexes, the storage engine does not need to scan all tables. Instead, the storage engine performs binary search starting from the root node of the index. The slots of the root node store Pointers to child nodes.

Cluster index and secondary index

A clustered index is a data storage method. It indicates that the data in a table is stored in the order of the primary key and organizes the table by index. InnoDB’s clustered index is to build a B+Tree according to the primary key order. The leaf nodes of the B+Tree are row records. Data rows and primary key values are stored together in a compact way. This also means that InnoDB’s primary key index is the table itself, which stores the entire table in primary key order. InnoDB secondary index (also known as secondary index) only builds A B+Tree based on index columns, but stores primary key information in each row of the B+Tree to speed up table operations. The space occupied by a clustered index is the size of the data volume of the entire table, while secondary indexes are much smaller than clustered indexes. Secondary indexes are usually created to improve query efficiency. InnoDB can only create one clustered index, but it can create multiple secondary indexes. In contrast to indexed organized tables, there is another type of heap table, which is stored directly on disk based on the order in which data is written. For a heap table, the only difference between the primary key and secondary index is whether the key value is unique. Both of them build a B+Tree based on the index column sorting. Heap tables are also widely used in all kinds of databases. MyISAM storage engine’s tables are heap tables.

The index type

Indexes in MySQL are implemented at the storage engine layer, not the server layer, so different storage engines support indexes differently. Common index types supported by different storage engines in MySQL are as follows:

The index type The storage engine
The hash index Memory/InnoDB Adaptive Hash Index/NDB
B + Tree index MyISAM/InnoDB
The full text indexing MyISAM/InnoDB
Spatial index MyISAM R-Tree
Fractal tree index TokuDB Fractal Tree Index

In MySQL InnoDB, indexes can be divided into two categories: primary key index (i.e. clustered index) and secondary index (non-clustered index). For tables that do not specify a primary key, InnoDB selects the appropriate primary key in the following order:

  1. Display primary key;
  2. The first unique index (requiring the unique index columns to be non-null);
  3. Built-in 6-byte ROWID.

It is recommended to create a primary key using an UNSIGNED autoincrement column display. Why is that?

UNSIGNED has one less sign bit than SIGNED, and since you don’t need negative numbers to increment a primary key, you can use UNSIGNED, which stores more data.

Based on the number of index columns and function description, indexes can also be classified into union indexes and overwrite indexes.

  • A joint index is one that combines multiple fields to form an index.
  • When all records can be queried through an index without the need to return the table to the clustered index, this type of index is also called an overwrite index.
  • Primary key queries are natural override indexes, and union indexes can be override indexes.

How do I check if an overwrite index is used in my SQL statement? Usually when viewing the execution plan, the Extra column Using index indicates that the optimizer uses an overwrite index. Overwrite indexes are generally recommended because if SQL needs to query a column that is not included in the secondary index, it will need to find the primary key value through the secondary index, and then go back to the table to query another column through the primary key (i.e., back to the table query), requiring two queries. An overwrite index can directly obtain all the data required by the query from the index, avoiding the secondary search back to the table, saving IO and high efficiency.

Why use B+Tree index instead of red black Tree, AVL Tree?

This should be combined with the application scenario of database index. Usually, the amount of data stored in the database is very large. Due to the memory limitation, the data cannot be loaded into the memory for sorting at one time. So you have to choose external sort, sort the data. External sort requires I/O operations. Because of the large amount of data stored in disks, I/O operations take longer than sort operations. If the red-black tree and AVL tree is ordered for the scene so that can be competent, but sorting out, due to the memory limit, if still using the red-black tree, or AVL tree, which would cause the height of the tree is very high, when to sort data of each layer in the data from the database read into memory to carry on the merge sort, The sorted data is then written to disk. Because of the height of the tree, there are more merging trips and more I/O operations. It takes more time, so choose B+Tree. The number of layers of B+Tree can already process billions of data at four levels. This means that using B+Tree to query billions of data requires only 4 I/ OS.

Reference:High performance MySQL practice – Zhou Yanwei