Index design and how it works

Let’s take a look at index design and how it works. To create a high-performance index, you first need to understand what an index is. A database index is a data structure that speeds up data retrieval operations on database tables at the expense of additional write and storage space. In layman’s terms, an index is similar to a table of contents of a book, which can be found quickly by the page numbers recorded in it.

An Index is a data structure used by the storage engine to quickly find records.

An index is a physical data Page, and the database Page Size determines how many index rows a Page can store and how many pages are needed to store an index of a given Size.

  • Index can speed up the retrieval speed, but also reduce the index column insert, delete, update speed, index maintenance costs.
  • The theoretical knowledge involved in indexing includes binary search, hash table and B+Tree.

Binary search method

Binary search is also called split – half search, it is in the ordered array to find the specified data search algorithm. Wikipedia’s definition of binary lookup is shown below. Its advantage is the equivalent query, the range query performance is excellent, the disadvantage is the update data, new data, delete data maintenance cost is high.

For example, the ordered array [1-71] has 17 values, that is, in the ordered array [A0-a16], we want to find the location of Target(7), first make sure that the subscript L is 0, the subscript R is 16, and the subscript m is floor [(L+R)/2].

  1. First Query

Subscript L=0, R=16, m= floor[(0+16)/2]=8, obtaining a value of 14 for A8, since A8(14) >Target(7) set R= M-1 =7, as shown below.

  1. Second Query

Subscript L=0, R=7, m=floor[(0+7)/2]=3, obtain A3 value 6, A3(6) < Target(7) set subscript L= M +1=4, as shown in the figure below.

  1. Third Query

Subscript L=4, R=7, m=floor[(4+7)/2]=5, the value of A5 obtained is 8, then A5(8) > Target(7) set subscript R= M-1 =4, as shown in the figure below.

  1. Fourth Query

Subscript L=4, R=4, m=floor[(4+4)/2]=4, obtain A4 value of 7, A4(7) = Target(7), end of query, as shown in the figure below.

The query finds the target data 7 after four binary searches. If the subscript L>R appears in the query process, it means that the target element is not in the ordered array, and the query ends.

Binary search is the theoretical basis of index implementation, which is very helpful for the following index learning.

The index principle

We all know that database query is the core function of database, and index is an important technical means to speed up the query. The essence of the selection of index data structure is to choose an excellent data structure for data storage and traversal according to the current hardware environment of data reading and writing. Most indexes in the database are realized by B+Tree. In addition to B+Tree indexes, we also need to pay attention to Hash indexes.

Next, we will study Hash index and B+Tree index one by one. Since most of the rest of the content is about B trees, to make the content of B trees more coherent, we will talk about Hash indexes first.

A Hash index

A hash table is the basis of a hash index in a database. It is a structure that stores data according to the key value <key,value>. To put it simply, a hash table is an array that calculates index columns to buckets or slots using hash functions. Actual storage is to convert keys into certain storage locations based on hash functions and store values in the array locations. When accessing, you only need to input the key to be searched, and then you can calculate the storage location and read the data through the hash function.

As shown in the figure below, the name is used as the key, and the hash function is used to calculate the name field data, and the hash code is obtained and stored in the bucket or slot array. At the same time, Pointers to real data rows are stored as values to form a hash table.

Next, we will learn how Hash indexes are implemented, how Hash collisions are handled, and how MySQL uses Hash.

How do you implement hash indexes first? The Hash index in the database is realized based on the Hash table. The Hash code of the corresponding index column is calculated by the Hash algorithm to form the Hash table. The Hash index is composed of the Hash code and the pointer to the real data row that the Hash code points to. Hash indexes are only valid for equivalent queries of hash index columns.

As shown in the following figure, the Hash index is constructed based on the name field in the table, and the Hash algorithm is used to calculate the data in each row of the name field to obtain the Hash code. Hash indexes consist of Hash codes and Pointers to real data rows.

Because hash indexes only store hash values and row Pointers, not actual field values, they have a compact structure and very fast query speed. In the case of no hash conflict, a hash index can be hit once. However, hash indexes are only applicable to equivalent queries, including =, IN(), <=> (safe equals, select NULL <=> NULL and SELECT NULL = NULL are different results), and range queries are not supported.

In addition, the performance of hash index is inversely proportional to the number of hash conflicts. The more hash conflicts, the more maintenance cost, the lower performance.

Now let’s look at how do Hash collisions work? A Hash collision means that the same Hash code is computed for different index column values. As shown in the figure above, the name field of the table is John Smith and Sandra Dee, and the Hash code calculated by the Hash algorithm is 152, indicating that a Hash collision has occurred. The general method for dealing with Hash collisions is to use a linked list, which forms a linked list of Hash collision elements. When a collision occurs, the linked list is traversed twice to find data.

  • Hash collisions are related to the Hash algorithm selected. To reduce the probability of Hash collisions, the Hash algorithm that avoids Hash collisions is preferred. For example, the function FNV64() of Percona Server, whose Hash value is 64 bits, The probability of Hash collisions is much lower than CRC32.

  • Secondly, considering performance, numeric Hash algorithms are preferred because string Hash algorithms waste space and are inconvenient for comparison.

The following figure shows the return values generated by the common CRC32, SHA1, and MD5 Hash functions.

The Hash algorithm priority is as follows: FNV64 > CRC32 (Hash conflict probability is high in a large amount of data) > MD5 > SHA1.

Finally, how to use Hash index in MySQL? In MySQL, it is mainly divided into Hash index supported by Memory storage engine, InnoDB adaptive Hash index and Hash index of NDB cluster.

Memory Hash indexes natively supported by the Memory storage engine. As shown in the figure above, the Memory storage engine can create and use Hash indexes natively and explicitly when creating tables.

Compared to InnoDB, although it cannot create Hash index natively, it can forge Hash index to speed up the performance of fixed value query. For example, to obtain the Hash index function, create a B+Tree index for the url_hash field of a very long text (such as a website URL).

About hash index, InnoDB provides the powerful function of InnoDB adaptive hash index, next focuses on InnoDB adaptive hash index.

InnoDB’s adaptive hash index is designed to improve query efficiency. InnoDB’s storage engine monitors queries on each index page of the table. When InnoDB notices that some index values are accessed frequently, it creates a hash index in memory based on the B+Tree index. Enable the B+Tree index in memory to have the function of hash index, that is, to quickly and fixed-value access to frequently accessed index pages.

B + Tree index

Most indexes in the database are implemented through B+Tree. For details about B+Tree, refer to data Structures and other related books. If the type of an index is not specified, the default value is B+Tree, which is equivalent to B+Tree, B-tree, or BTREE (don’t be surprised if the index statement is BTREE, which is equivalent to B+Tree).

The following figure shows a simple, standard B+tree with K keys and K+1 Pointers per node.

For the MySQL storage engine, the actual USE of B+Tree index is to meet the data read and write performance and adapt to the optimized data structure of disk access mode. Each leaf node contains a pointer to the next leaf node.

In MySQL, indexing is implemented at the storage engine layer rather than the server layer, so different storage engine layers can support different types of indexes. For example, although MyISAM and InnoDB both use B+Tree for their indexes, their actual data storage structures are quite different. Each page in the figure has been randomly numbered (the number can be identified as the page number). The page with page number 20 is the root page of the B+Tree (the root page is usually stored in memory). The root page stores <key+pageno>. Pageno is the page number pointing to a specific leaf node. Other pages are leaf nodes that store specific data <key+data>.

B + Tree index to fast access to data, it is because the storage engine can no longer need to get the data through a full table scan, but from the index of root nodes (usually in memory) to binary search, the root node of the slot in the deposit is a pointer to the child node, the storage engine based on these Pointers to quickly iterate through the data. For example, the root node with the page number of 20 can quickly know that the data with Key<10 is in pageno 33, and the data with Key in the range of [10,16) is in Pageno 56.

The <key+data> stored in leaf nodes depends on whether the B+Tree is a Clustered Index or a Secondary Index.

Cluster index and secondary index

A clustered index is a data storage method. It indicates that the data in a table is stored in the order of the primary key and organizes the table by index. InnoDB’s clustered index is to build a B+Tree according to the primary key order. The leaf nodes of the B+Tree are row records. Data rows and primary key values are stored together in a compact way. This also means that InnoDB’s primary key index is the table itself, which stores the entire table in primary key order.

InnoDB secondary index (also known as secondary index) only builds A B+Tree based on index columns, but stores primary key information in each row of the B+Tree to speed up table operations.

The space occupied by a clustered index is the size of the data volume of the entire table, while secondary indexes are much smaller than clustered indexes. Secondary indexes are usually created to improve query efficiency

InnoDB can only create one clustered index, but it can create multiple secondary indexes.

In contrast to indexed organized tables, there is another type of heap table, which is stored directly on disk based on the order in which data is written. For a heap table, the only difference between the primary key and secondary index is whether the key value is unique. Both of them build a B+Tree based on the index column sorting. Heap tables are also widely used in all kinds of databases. MyISAM storage engine’s tables are heap tables.