Why can MySQL support fast query of tens of millions of data scale?

takeaway

In this article, we will start by designing a user-centered table structure. I will create a user table, user, as follows:

CREATE TABLE `user` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(8) DEFAULT NULL COMMENT 'user id',
  `user_name` varchar(29) DEFAULT NULL COMMENT 'Username',
  `user_introduction` varchar(498) DEFAULT NULL COMMENT 'User Profile',
  `sex` tinyint(1) DEFAULT NULL COMMENT 'gender',
  `age` int(3) DEFAULT NULL COMMENT 'age',
  `birthday` date DEFAULT NULL COMMENT 'birthday'.PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;
Copy the code

All the fields in this table have been annotated so that you can clearly understand the meaning of each field. At the same time, I insert 8 more records:

INSERT INTO `user` (`id`, `user_id`, `user_name`, `user_introduction`, `sex`, `age`, `birthday`)
VALUES
	(1.10001.'Jack'.'I\'m Jack', 1, 25, '1998- 01.'), (2, 10002, 'Nancy', 'I\'m Nancy'.0.15.'2008-02-03'),
	(3.10003.'Bruce'.'I\'m Bruce', 1, 17,2006- 03- 03'), (4, 10004, 'John', 'I\'m John'.1.18.'2005-03-05'),
	(5.10005.'Amy'.'I\'m Amy', 0, 15, '2008.- 06'), (6, 10006, 'Lucy', 'I\'m Lucy'.0.16.'2007-06-06'),
	(7.10007.'Mike'.'I\'m Mike', 1, 17,200607 -- 01'), (8, 10008, 'Henry', 'I\'m Henry'.1.16.'2007-06-07');
Copy the code

Then I create a federated index index_age_birth as follows:

ALTER TABLE `user` ADD INDEX `index_age_birth` (`age`, `birthday`);
Copy the code

Now, let’s start to analyze why MySQL can support the scale of tens of millions of data fast query? There are many factors that can affect the performance of MySQL queries, such as indexes, optimizer, Query Cache, Lock, various buffers, etc. In this article, I will focus on indexes because they are the most commonly used in our daily work.

In fact, there is a lot of information on MySQL index structure on the Internet, but most of it is not detailed enough, so I will focus on the parts that may be overlooked on the Internet, and explain the MySQL index structure in more detail.

We all know that MySQL’s index structure is inseparable from its storage engine. In daily work, the most commonly used storage engines are MyISAM and InnoDB. Memory, Federated, Merge, and so on are less frequently used. InnoDB is the most popular storage engine, so I will explain in detail InnoDB’s index structure in MySQL.

Index structure

InnoDB engine index structure is divided into two main types: clustered index and auxiliary index. What do they look like? Using the user table above as an example, I’ll see what these two indexes look like:

Clustering index

Above is the cluster index of the User table, which is stored in InnoDB as a B+Tree.

(1) Non-leaf node: The uppermost node is called the root node, among other nodes, the node initiating the pointer is called the parent node, and the node pointed to by the pointer is called the child node. They are all non-leaf nodes. This node contains multiple < primary key id, page_no > tuples of records, at the same time, it also has a page number, including page_no element has a pointer, pointing to the following a non-leaf nodes or leaf nodes, multiple tuples are connected by a singly linked list pointer, each non-leaf nodes are connected by two two-way chain table pointer.
- As shown in the figure above, the root node page 1 contains two tuple records <1, 2> and <5, 3>, with its own page number 1. The page_no element 2 in the first tuple record points to the node on page 2, and the page_no element 3 in the second tuple record points to the node on page 3.
- As shown in the figure above, the page 2 node contains two tuple records <1, 4> and <3, 5>. The page_no element 4 in the first tuple record points to the node on page 4, and the page_no element 5 in the second tuple record points to the node on page 5. The two tuples are connected by a one-way pointer to form a one-way linked list.
- The page 2 and page 3 nodes are connected by two bidirectional list Pointers
(2) Leaf node: this node contains multiple < primary key ID, non-primary key field value > tuples, and it also has a page number, the tuples are connected by a one-way list pointer, and each leaf node is connected by two two-way list Pointers.
- As shown in the figure above, I only have two columns for non-primary key field values: user_id and user_name, and use…… for the rest <7 10007, Mike, I’m Mike, 1, 17, 2006-07-01> and <8 10008, Henry, I’m Henry, 1,16, 2007-06-07>, the two tuples are connected by a unidirectional linked list pointer.
- The nodes on page 6 and page 7 are connected by two two-way list Pointers.

The clustered index has one characteristic: the records in all nodes are arranged in ascending order of primary key ID.

Secondary index

The figure above shows the structure of the index index_age_birthday, which is also called the secondary index. The secondary index in InnoDB is also stored in a B+Tree.

(1) Non-leaf nodes: Among them, the top node called the root node, other nodes, a pointer is called a parent node, called the child pointer to the node, they are leaf node, the node contains multiple < index column values, page_no > tuples of records, at the same time, it also has a page number, including page_no element has a pointer, Points to a non-leaf node or leaf node below. Multiple tuples are connected by a one-way list pointer, and each non-leaf node is connected by two two-way list Pointers.

Therefore, the tuple structure in B+Tree non-leaf is

.
,>
- As shown in the figure above, the root tuple page 1 contains two tuple records <15, 2008-02-03, 2> and <17, 2006-03-03, 3>, and its own page number is 1. The page_no element 2 in the first tuple record points to the page 2 node, and the page_no element 3 in the second tuple record points to the page 3 node.
- As shown in the figure above, the page 2 node contains two tuple records <15, 2008-02-03, 4> and <16, 2007-06-06, 5>. The page_no element 4 in the first tuple records points to the page 4 node, and the page_no element 5 in the second tuple records points to the page 5 node. Two tuples are connected by a one-way pointer.
- The page 2 node and page 3 node are connected by two bidirectional linked list Pointers.
(2) Leaf node: this node contains a record composed of multiple < index column value, primary key ID > tuples. At the same time, it has a page number. The tuples are connected by a one-way list pointer, and each leaf node is connected by two two-way list Pointers.

Also, since index index_age_birth has an index column (age, birthday), the tuple structure in the B+Tree leaf in figure B is

.
,>
- As shown in the figure above, the page 7 node contains two tuple records <18, 2005-03-05, 4> and <25, 1998-01-02, 1>, which are connected by a unidirectional linked list pointer.
- The nodes on page 6 and page 7 are connected by two two-way list Pointers.

Secondary indexes have one feature: the records in all nodes are arranged in ascending order of index column values. For example, for the index_age_birth index, first, the records are arranged in ascending order by age, and then by birthday if the ages are the same.

Search algorithm

InnoDB’s index structure can be inferred from the above structure why MySQL queries one or more records so quickly.

Both the cluster index and the auxiliary index are B+Tree, so we can quickly locate the record to be queried by binary search.

Primary key query: Binary search of the cluster index tree based on the primary key ID can quickly locate the records on the leaf node. There’s a lot of information on the Internet about this, so I’m not going to give you a lot of examples.
Select (age=25, age=25, age=25);
- Page 1 -> page 3: iterates through the tuple record in page 1. Since the age element of the second record in page 1 is 17, and the query condition age= 25,25 > 17, go to the right branch, so, navigate to the node of page 3 by following the pointer to page 3 from page 1.
- Page 3 -> page 7: walk through the tuple record in page 3, because the age element of the second record in page 3 is 18, the query condition age= 25,25 > 18, go to the right branch, so, follow the pointer of page 3 to page 7, locate to the leaf node page 7.
- In the node of page 7, the tuple record in page 7 is traversed. The value of age element in the second tuple record is 25, which satisfies the query condition of age=25. Therefore, the record <25, 1998-01-02, 1> is located as the record to be searched in the auxiliary index.
- Finally, take the primary key ID =1 to the cluster index leaf node traversal search, locate the complete record on the leaf node primary key ID =1. Do InnoDB search b-tree leaves sequentially? Explain in detail.
Because the records inside the nodes on the auxiliary index B+Tree are arranged in ascending order, and the unidirectional linked list is formed between the records and the bidirectional linked list is formed between the nodes. Therefore, I can quickly complete a range query by linear search (that is, search by the order of column values). For example, I now want to execute the following SQL:
```
SELECT * FROM user WHERE age > = 15
Copy the code
```
See the green arrow above:
- (age>=15, 15 >=15); (age>=15); (age>=15); (age>=15);
- Page 2 -> Page 4: Iterate over the tuple record in page 2. Since the age element of the first tuple record in the node of page 2 is 15, the query condition age>=15, the same 15 >=15, so, follow the pointer of page 2 to page 4 to locate the node of page 4.
- Page 4 -> Page 5 -> Page 6 -> Page 7 The nodes on page 4, page 5, page 6 and page 7 are connected by a two-way pointer (forward and reverse) to form a two-way linked list. All records inside each node are connected by a one-way pointer (forward) to form a one-way linked list, and all records are arranged in ascending order according to the column value in index Index_AGe_birth. In other words, the age element of all records in the nodes from page 4 to page 7 must be greater than or equal to 15 and arranged in ascending order. Therefore, we only need to start from the first record in page 4 and traverse all the subsequent records connected by its pointer to find the primary key ID of these records with age >= 15, that is, 2, 5, 6, 8, 3, 7, 4, 1, and finally, Do InnoDB look for b-tree leaves sequentially? In detail.

In summary, we can see why MySQL queries one or more records so quickly:

Binary search: excludes nodes that do not need to be traversed during the search.
Linear search: there is no need to search for the node record that meets the condition repeatedly from the root node, but directly traverse all the subsequent nodes of the first record in the leaf node that meets the query condition.

summary

Finally, I will summarize the contents of this article. In this article, I will focus on the index structure of InnoDB, including cluster indexes and auxiliary indexes.

What both have in common: B+Tree structure.

The difference:

	The leaf nodes	A leaf node
Clustering index	Stores primary key value + child node page pointer	Stores a complete row record
Secondary index	Store index column value + child node page pointer	Stores index column value + primary key

At the same time, based on B+Tree structure, two algorithms to improve the performance of query index structure are obtained: binary search and linear search.

Why can MySQL support fast query of tens of millions of data scale?

takeaway

Index structure

Search algorithm

summary

Related Posts

PowerJob self-implementation of high availability solutions, wonderful wonderful!

12 images to give you a thorough understanding of distributed transaction generation scenarios and solutions!!

Use Caddy as the API gateway for microservices