MySQL 10 - Indexed data structures, from binary trees to B+ trees
Hello, everyone, I am Wang Lao shi, index enables us to learn a very core database ability, but also for our SQL optimization and improve the efficiency of the query has a great relationship. So how does indexing work? What does his data structure look like? Why use such a data structure? Let's take a look at index data structures with some questions in mind.
MySQL index data structure
1.1. What is an index
An Index is a data structure that helps MySQL obtain data efficiently. You get the essence of an index: an index is a data structure.
The above understanding is quite abstract. For example, when reading any book, the first thing you see is the catalogue. It will be very quick to search the contents of the book through the catalogue.
Each node is the ID of the primary key. When we use the ID to query the content, we first check the index library. When we find the index, we can quickly locate the specific location of the data according to the index.
1.2. Common classification of indexes
InnoDB storage engine supports the following common indexes: B+ tree index, full-text index, hash index, among which the key is the B+ tree index.
B+ tree index is the traditional index, which is the most common and effective index in the current relational database system. The structure of a B+ tree index is similar to a binary tree, which can quickly find data according to the Key Value. Note that the B ina B+ tree does not stand for binary, but for balance, since B+ trees evolved from the earliest balanced binary trees, but B+ trees are not binary trees.
Before understanding B+ tree index, we should first know some algorithms and data structures closely related to it, which helps us better understand the working mode of B+ tree index, because B+ tree is through binary search tree, and then by balancing binary tree, B tree evolution.
Binary search, also known as binary search, is used to find a particular record in an ordered array of records.
Its basic idea is: the records are arranged in order (increasing or decreasing), and in the search process, the jump way is used to search, that is, the midpoint position of the ordered sequence is taken as the comparison object first. If the element value to be found is smaller than the midpoint element, the sequence to be searched is reduced to the left half, otherwise it is the right half. The search interval is halved by a single comparison.
Select * from '48' where '48' = 5, 10, 19, 21, 31, 37, 42, 48, 50, 52
Subscripts: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
- Step 1: Set low to the minimum subscript value 0 and high to the maximum subscript value 9.
- Step 2: Obtain mid from low and high, mid= (low + high) / 2, initial mid is subscript 4 (can also =5, depending on the specific algorithm implementation);
- Step 3: Mid =4 = 31, 31 48 (the number we are looking for);
- Step 4: Through the idea of binary search, set low to the corresponding subscript 4 of 31, keep high not changed to 9, and then mid is 6.
- Step 5: Mid =6 = 42, 42 48;
- Step 6: Through the idea of binary search, set low to the subscript 6 corresponding to 42, keep high not changed to 9, then mid is 7;
- Step 7: The corresponding data value of mid=7 is 48, 48 == 48 (the number we are looking for), end of search; It took three binary lookups to find the number we wanted, as opposed to eight sequential lookups.
So binary search is more efficient (on average) than sequential search. But if you look up 5, it only takes one sequential search, as opposed to four binary searches. Let's see, for the above 10 numbers, the average number of sequential searches is (1+2+3+4+5+6+7+8+9+ 10)/10=5.5 times. The binary search is (4+3+2+4+3+ 1+4+3+2+3)/10=2.9 times. In the worst case, the number of sequential lookups is 10, and the number of binary lookups is 4.
3. Evolution of tree structure
According to the characteristics of arrays, we know that there are optimal search efficiency and space requirements; In order to combine the flexibility of linked list insertion and high efficiency of ordered array search, we introduce binary search tree. So first let's see what a binary tree is.
3.1. The binary tree
Each node has at most two subtrees, left and right, and the order cannot be reversed. Logically there are five basic forms of binary trees: empty binary trees, binary trees with only one root, only left subtrees, only right subtrees, and complete binary trees (full binary trees are the special cases). Traversal is the most basic operation of tree. The so-called traversal of binary tree means to walk through all nodes of binary tree according to certain rules and order, so that each node is visited once and only once, including pre-order, middle-order and post-order traversal, as shown in the following figure.
3.2. Binary lookup (search) trees
A binary search tree must first be a binary tree. In addition, it is also stipulated that in a binary search tree, the node value of the left subtree is always less than the node value of the root, and the node value of the right subtree is always greater than the node value of the root, namely, the node value of the left subtree the node value of the root the node value of the right subtree. Therefore, the sorting output of node values can be obtained through middle-order traversal. Take a sorted array:
10 、25 、34 、48 、61 、73 、81
The array can be converted to binary tree structure, middle element of the array as the root node of the tree, in the middle of the left half of the element as the root node of the left child, right in the middle of the element as the root node of the children the right and the second layer of nodes and so on, the last of the tree layer will be more and more, so that we will build up a binary tree.
And this tree, for each node, the left child is smaller than its parent, and the right child is greater than or equal to its parent. So we call such a binary tree also a binarysearch tree, which has the same search time complexity as binarysearch, O(logN).
But a binary lookup tree, if poorly designed, can become an extremely unbalanced binary lookup tree:
Search efficiency and sequential search basically no difference. Therefore, to construct a binary search tree with maximum performance, the binary search tree needs to be balanced, which leads to a new definition -- balanced binary tree, or AVL tree.
3.3. Balanced Binary Tree (AVL-tree)
A balanced binary tree is defined as follows: first it conforms to the definition of a binary search tree, and second it must satisfy that the height difference between the two subtrees of any node is at most 1.
The lookup performance of balanced binary trees is relatively high, but not the highest, just close to the highest. The best performance requires the establishment of an optimal binary tree, but the establishment and maintenance of the optimal binary tree requires a lot of operation, so users generally only need to establish a balanced binary tree.
The query speed of balanced binary tree is very fast, but the cost of maintaining a balanced binary tree is very high. In general, one or more left and right rotations are required to achieve tree balance after insertion, update, and deletion.
What's the problem with balancing binary trees for a database? Because each node of a binary tree has only two direct child nodes at most, the height of the binary tree increases rapidly when the number of nodes is large. For example, when there are 1000 nodes, the height of the tree is almost 9 to 10 layers. We know that the database is persistent, data is to be read from the disk, the general mechanical disk can do at least 100 IO per second, the time of one IO is basically in 0.01 seconds, 1000 nodes in the search will need 0.1 seconds, if it is 10000 nodes, 100000 nodes? So for database, in order to reduce the height of the tree, the data structure of B+ tree is proposed.
3.4. B + 树
Definition of B+ tree: B+ tree, like binary tree and balanced binary tree, is a classical data structure. B+ tree is evolved from B tree and sequential access method, but there is almost no use of B tree in reality.
The definition of B+ tree can be found in many data structure books, it is very complicated, we outline its definition: B+ tree is a deformed form of B tree, the leaf node of B+ tree stores the key word and the corresponding record address, the layer above the leaf node is used as the index. A B+ tree of order m is defined as follows:
- Each node can have up to m elements;
- Each node has at least (m/2) elements except the root node;
- If the root node is not a leaf node, then it has at least 2 child nodes;
- All the leaf nodes are in the same layer;
- A non-leaf node with K child nodes has (K-1) elements, in ascending order;
- All the elements in the left subtree of an element are smaller than it, and all the elements in the right subtree are greater than or equal to it;
- Non-leaf nodes only store keywords and indexes pointing to the next child node. Records are only stored in leaf nodes.
- Adjacent leaf nodes are connected by Pointers.
The variation of B+ tree is B tree, which adds Pointers to brothers in non-root and non-leaf nodes of B+ tree. B tree defines that the number of keywords of non-leaf nodes is at least (2/3)*M, that is, the minimum usage of block is 2/3 (replacing 1/2 of B+ tree). So let's look briefly at B trees and B+ trees.
A B+ tree is a balanced lookup tree designed for disks or other direct access assistive devices. In B+ tree, all record nodes are stored on leaf nodes of the same layer in order of key values, which are connected by Pointers of leaf nodes. Such as:
From the above figure, we can summarize several characteristics of B+ tree, which are actually included in the brief definition of B+ tree:
- When the number of nodes is the same, the height of B+ tree is much lower than that of balanced binary tree.
- Non-leaf nodes only store index information and pointer information of next-layer nodes, but do not store actual data records.
- Each LeafPage stores actual data. For example, in the figure above, each LeafPage stores 4 data records, of course there can be more. Leaf nodes are connected in series from small to large (orderly), and the data in leaf pages are also in good order.
- An index node indicates that the left subtree of the node is smaller than the index value and the right subtree is greater than or equal to the index value.
Note: The data in leaf nodes can be physically unordered, just logically ordered (strung together by Pointers).
3.5. Disks and B+ trees
Why relational databases choose B+ trees has a lot to do with the nature of the disk.
So if we simplify this, we can view it this way
A disk consists of circular disks of the same size, which can be rotated (all disks must be rotated synchronously). On one side of the disk is a magnetic head holder, which holds a group of magnetic heads, each responsible for accessing the contents of a disk. The head does not rotate, but moves in the direction of the disk radius.
The disk is divided into a series of concentric rings with the center being the center of the disk. Each concentric ring is called a track, and all tracks of the same radius form a cylinder. The track is divided into small segments along the radius line. Each segment is called a sector. Each sector is the smallest storage unit of the disk and the smallest read/write unit. Currently, the disk sector is generally 512 to 4k bytes.
Data on the disk must be uniquely identified with a three-dimensional address: cylinder number, disk number, sector number.
To read/write a specified data on a disk, perform the following steps:
(1) First, the moving arm moves the magnetic head to the required cylinder according to the cylinder number. This process is called positioning or search.
(2) After all the magnetic heads are located on the track, the specific track on the specified disk is determined according to the disk number.
(3) After the disk surface is determined, the disk starts to rotate and moves the track segment of the specified block number to the magnetic head.
After the above steps, the storage location of the specified data is found. This is when you can start reading/writing.
As you can see, disk reading depends on mechanical movement, which is divided into three parts: seek time, rotation delay, and transfer time. The sum of the three parts is the time of a disk IO, which is generally about 9ms. Seek time is the time required to move the read/write head to the correct track, which is the most expensive; Rotation is the time required by the disk rotation to move the target sector below the read/write head, depending on the disk rotation speed. Data transfer is the time required to complete data transmission, which depends on the data transmission rate of the interface. At the nanosecond level, it is much smaller than the consumption time of the first two parts. Disk read time costs anywhere from hundreds to tens of thousands of times more than accessing memory.
For efficiency, minimize disk I/O. To achieve this goal, the disk does not read data strictly on demand. Instead, the disk prereads data every time. Even if only one byte is required, the disk reads a certain length of data backward from this position and stores the data into the memory. The theory for doing this is based on the famous locality principle in computer science:
When a piece of data is used, nearby data is usually used immediately. The data required during the running of a program is usually concentrated.
Due to the high efficiency of sequential disk reads (no seek time required, very little rotation time required), sequential disk reads are generally 40 to 400 times more efficient than random reads, and sequential writes are 10 to 100 times more efficient than on-machine reads (the difference is much smaller on SSDS. Sequential reads and writes are 7 to 10 times more efficient than random reads and writes, but some evaluations show that sequential writes on mechanical drives are slightly better than SSDS. In general, Mysql database performance can be greatly improved if the disk is replaced by SSD. Therefore, preread can improve I/O efficiency for local programs.
The prefetch length is an integer multiple of a page. A page is a logical block of computer management memory. Hardware and operating systems often divide main memory and disk storage into contiguous blocks of equal size. Each block is called a page, and the page size is usually 4K but also 16K. When the data the program is trying to read is not in main memory, a page missing exception will be triggered. At this time, the system will send a disk read signal to the magnetic disk. The disk will find the starting position of the data and read one or more pages back to load into memory.
For example, InnoDB defines the size of a B+ tree node as 16KB by default. This means that if a Key is 8 bytes, So a node can hold about 1000 keys, which means that B+ numbers can have 1000 forks. Meanwhile, InnoDB reads 16KB multiples of data for every disk I/O. In other words, InnoDB can make full use of the high-speed read and write feature of sequential DISK I/O on nodes.
At the same time, according to the logical structure of B+ tree, the primary keys of all records are arranged in order from small to large at the leaf node level, and a bidirectional linked list is formed. Non-leaf nodes in the same layer are also connected in series to form a bidirectional linked list. In this case, in actual read/write operations, adjacent nodes are likely to be placed on adjacent pages, and high-speed read/write features of sequential DISK I/OS can be fully utilized. So one of the ways we optimize MySQL is to make data read and write more sequentially and less randomly.
4. Summarize the function of B+ tree
- On block devices, data can be stored efficiently through B+ tree.
- All records are stored on leaf nodes, while non-leaf stores keys information. And the records are sorted by the value of the index column from smallest to largest.
- B+ trees have very high fanouts, usually more than 100, which can effectively reduce IO operations when looking up a record;
Tips: Fan out is a pointer of each non-LeafPage index node to each LeafPage;
Number of Fans = Maximum number of keywords that can be stored on a non-LeafPage + 1