What is an index

An index is a structure that sorts the values of one or more columns of a database table. Using an index, you can quickly access specific information in a database table.

The main purpose of index is to speed up the retrieval of data in the table, is a data structure to assist query.

Second, index model, implementation

There are many data structures that can be used to improve read and write efficiency. Three common data structures are hash table, ordered array and search tree.

Hash table

A structure that stores data in key-value pairs. You can find the value of the key directly according to the key to be searched.

Hash implementation ideas store the value in the array, through a hash function to calculate the key into a certain position of the array, and then put the value in the position, that is, the storage position = F (key), F is the hash function.

Hash collision, also known as hash collision, is when multiple keys are computed by the hash function to obtain the same value (address).

Hashing conflict solutions mainly include:

  1. Open addressing – In the event of a conflict, continue to find the next unoccupied storage address.
  2. Rehash function.
  3. Chain address method. HashMap takes this approach, that is, array + linked list (single linked list).

Application Scenario The hash table applies to the scenario where only equivalent query is performed. For interval query, scan all entries.

Orderly array

Application Scenario Ordered arrays perform very well in the equivalent query and range query scenarios, but ordered array indexes are only suitable for static storage engines, because the update operation may involve moving all subsequent data records, which is too expensive.

Search tree

Binary search trees can also be used as indexed data models. The time complexity of the query is O(log(N)). To maintain the query complexity, it is necessary to keep the tree as a balanced binary tree, and to ensure this, the update time complexity is also O(log(N)).

Binary search trees are the most efficient, but in practice most database storage does not apply to binary trees because indexes are not only stored in memory, but also written to disk.

In order for a query to read as little disk as possible, it is necessary to allow the query to access as few blocks as possible, which should apply to an N-tree rather than a binary tree. N tree is widely used in database engine because of its advantages in read/write performance and disk access mode.

InnoDB index model

In MySQL, indexes are implemented at the engine level, so there is no uniform index standard, that is, indexes of different storage engines work differently.

Index-organized tables In InnoDB, tables are stored as indexes according to the order of primary keys. Such tables are called index-organized tables.

InnoDB uses a B+ tree index model, so all data is stored in B+ trees. Each index in InnoDB corresponds to a B+ tree.

The index type

The indexes are classified into primary key indexes, non-primary key indexes (unique indexes and common indexes).

Leaf nodes indexed by primary keys hold entire rows of data. In InnoDB, it is also called a clustered index.

Leaf node contents that are not primary key indexes are primary key values. In InnoDB, it is also called secondary index.

Primary key indexes differ from normal index queries

  • Primary key index query, only need to search the primary key index of B+Tree.
  • For a common index query, search the common index tree to obtain the primary key column value, and then search the primary key index tree again. This process is called back to the table.

In summary, queries based on non-primary key indexes require an additional index tree scan, so primary key queries should be used as much as possible.

InnoDB index structure diagram is shown below, where ID is the primary key and K is the non-primary key index column.

4. Index maintenance

B+ trees need to perform necessary maintenance when inserting new values in order to maintain index order.

Page splitting If new data is inserted into the data and the corresponding data page is full, according to the B+ tree algorithm, a new data page needs to be applied for and part of the data is moved. This process is called page splitting.

Page splitting effects:

  1. Performance deteriorates.
  2. The utilization of data pages is affected. The data that could be placed on one page is divided into two pages, and the overall space utilization is reduced by about 50%.

Page merge If the utilization of two adjacent pages is low due to data deletion, the data pages are merged. The process of page merge is called page merge.

Application scenario of the auto-add primary key

In which scenarios should auto-increment primary keys be used and in which scenarios business field primary keys should be used? (PS: The primary key of the business field must be unique)

From the point of view of performance, the auto-increment primary key mode is consistent with the incremental insertion scenario, which is an append operation without data moving or triggering the splitting of leaf nodes. However, the mode of business field master key is not easy to ensure orderly insertion, and the cost of writing data is relatively high.

In terms of storage space, each leaf node that is not a primary key index contains the value of the primary key. The smaller the primary key length is, the smaller the leaf node of the common index is, and the smaller the space occupied by the common index is. So in your choice, consider the length of the business field and the increment primary key field itself.

In summary, auto-increment of primary keys is often a more reasonable choice for performance and storage space.

Scenarios where business fields are directly home keys

  1. There is only one index;
  2. The index must be unique.

That is, when there is no other index, there is no need to worry about the size of the leaf node of the other index.

Five, the summary

References:

1. Simple index (1)

2. One of the Java collections – HashMap

This article is published by OpenWrite!