The article directories

  • One, foreword
  • Clustered index and secondary index
    • 2.1 Cluster index
    • 2.2 Secondary Index
    • Back to the table 2.3
      • 2.3.1 back to the table
      • 2.3.2 Solution 1: Overwrite the index
      • 2.3.3 Table Back Solution 2: Index push down
  • Compound index and prefix matching principle
    • 3.1 Composite Index
    • 3.2 Prefix Matching
      • 3.2.1 Underlying Principle: leftmost matching
      • 3.2.2 Prefix Index (Applicable to Secondary Index + Compound Index)
        • 3.2.2.1 Determine the prefix index by index selectivity. The prefix index should take the first few characters
        • 3.2.2.2 Prefix Indexes May increase the number of scanned rows
        • 3.2.2.3 There are two processing methods for the same prefix
        • 3.2.2.4 Set the order of the union index by index
  • 4. Index related knowledge
  • Five, the end

One, foreword

Index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index: create a B+ tree index No two rows are allowed to have the same index value, and one is allowed to be NULL

Primary key = primary key; primary key = primary key; increment = increment; any int can be set to increment, not just primary key. Physical primary keys are generally incremented, not unordered, because primary keys are clustered indexes and cause frequent merging and splitting of B+ tree nodes.

Clustered index: a unique key. The logical order of the table is exactly the same as the physical storage order of the B+ tree. The leaf nodes stored in the B+ tree are the actual row records, that is, the data is the index, and the index is the data.

Secondary index: non-unique key. The logical order of the table is different from the physical storage order of the B+ tree. The leaf nodes in the B+ tree are the index field value + clustered index field value.

Note: the secondary index leaf node stores the “index field value + clustered index field value”, not the address of the row record, because the address is subject to change.

Single index: This index covers only one field. Composite index/composite index: This index only covers more than one field, follows the leftmost matching principle, does not allow skipping.

Indexes are divided into single indexes and compound indexes according to the number of fields. Indexes are divided into common indexes, unique indexes, and primary key indexes according to different properties. These two methods are used to represent specific actual indexes, such as actual B+ trees. According to the storage of leaf nodes, it is divided into clustered index and secondary index, which does not exist. An actual index is used as the clustered index or secondary index of the table.

Clustered index and secondary index

2.1 Cluster index

Records that are logically adjacent are not physically adjacent on disk because of indexes.

A database index is a sorted data structure in a database management system to help query and update data in a database table quickly. Indexes are usually implemented using B trees and their variant B+ trees.

In addition to the data, the database system maintains data structures that satisfy specific lookup algorithms, and that reference (point to) the data in a way that makes it possible to implement advanced lookup algorithms on these data structures. This data structure is called an index.



The figure above shows one possible way to index. On the left is the data table, with two columns and seven records, and on the far left is the physical address of the data records (Goldfinger: Note that logically adjacent records are not physically adjacent on disk either). To speed up Col2 lookup, a binary lookup tree can be maintained as shown on the right. Each node contains the index key value and a pointer to the physical address of the corresponding data record, so that binary lookup can be used to obtain the corresponding data within O(log2n) complexity.

If the logical order in the table is consistent with the physical storage order of the B+ tree, then the clustered index (unique key) is consistent, and the secondary index (non-unique key) is inconsistent. Try it and execute the construction sentence:

CREATE TABLE `student` (
  `id` BIGINT UNSIGNED AUTO_INCREMENT NOT NULL COMMENT 'primary key id',
  `student_no` VARCHAR(64) COMMENT 'student id',
  `name` VARCHAR(64) COMMENT 'Student name',
  `age` INT COMMENT 'Student age'ENGINE=InnoDB CHARSET= UTf8MB4 COMMENT='Student Information Form';
Copy the code

Insert 5 pieces of data:

insert into student(student_no,name,age) values(101,"Alice", 18); insert into student(student_no,name,age) values(102,"Bob"19); insert into student(student_no,name,age) values(104,"Brandt", 15); insert into student(student_no,name,age) values(105,"David"19); insert into student(student_no,name,age) values(109,"David", 18);Copy the code

MySQL maintains a B+ tree with the primary key you specify, in this case increasing primary key, and I used BPlusTree Visualization at THE University of San Francisco to simulate what the tree would look like, increasing primary key from 1 and inserting five, so from 1 to 5:



Note: MySQL creates the index before you execute the CREATE INDEX statement.

B+ tree in the process of insertion is how to maintain its several features: 1, B+ tree insertion guarantee order (order guarantee range search) : the left node is smaller than the right; 2, B+ tree insertion to ensure self-balance (one of the four measures to reduce tree height) : the number of left and right sides tends to be equal; 3. Ensure node splitting during B+ tree insertion: how the node splits into two when the number of elements exceeds the node capacity is also the principle of MySQL page splitting.

The leaf node of a B+ tree has all the data of the row, as shown in figure:



For B+ trees, non-leaf nodes play a retrieval role and leaf nodes hold table data.

In the retrieval of non-leaf nodes, three characteristics are guaranteed:

The left node is smaller than the right node.

The left and right sides tend to be equal,

When the number of elements in a node exceeds the capacity of the node, it is split to give you a node, because a node is the size of one, so this is also the principle of MySQL page splitting

In the figure above, the first-layer non-leaf node is 3, indicating that the left side is less than 3 and the right side is greater than or equal to 3.

The first non-leaf node in the second layer is 2, indicating that the left side is less than 2 and the right side is greater than or equal to 2.

The second non-leaf node in the second layer is 4, indicating that the left side is less than 4 and the right side is greater than or equal to 4

If you don’t have a B+ tree, you have to look up by primary key, for example

select * from student where id = 5;
Copy the code

Sorry, the data is out of order, you can only scan the full table, like waves washing sand.

Question: Aren’t primary keys increasing, so we can use dichotomy to find them? Answer: Not, although the primary key is increasing, but if you are written to disk, there is no such a data structure to maintain orderly array (for example, if you delete the 4, how to move the 5 to the front), the data in the disk is unordered, still looks can only random search, and if you maintain the orderly array data structure, index is built, It’s just indexing different data structures. 1. The B+ tree is the index, the index data structure. For the B+ tree, the non-leaf node plays the retrieval function, and the leaf node stores the table data. Of retrieval of leaf nodes, guarantee the left node number is smaller than the right, left and right sides tend to be equal, node, in case of not exceeding the number of element nodes, is how to split into two, this is also a principle of split MySQL page above, the first layer of the leaf node is 3, said on the left is less than 3, the right side is greater than or equal to 3; The first non-leaf node in the second layer is 2, indicating that the left side is less than 2 and the right side is greater than or equal to 2. The second non-leaf node of the second layer is 4, which means that the left side is less than 4, and the right side is greater than or equal to 4. Why does indexing speed up retrieval? 3, create table create table create table insert into data, the B+ tree is already out, the B+ tree index is already ready. Why did MySQL create an index before you executed the CREATE Index statement? Clustered indexes, for primary/unique keys, are created when a new table is created and data is inserted, but others, such as secondary indexes, for non-unique keys, must not be created until create Index is used

Now that we have this B+ tree, the data is stored regularly, looking for id=5, and instead of being a mess, we have a lot of rules:

1, top to bottom, find 3,5 is bigger than that, find the right node

2, then find 4, 5 is still larger, continue to find the right node

3, this time to reach the leaf node, the leaf node is an increment array, then use dichotomy, find the data id=5

Summary: B+ tree search + dichotomy to search an incrementing array, first use B+ tree search up to the leaf node (must go to the leaf node, because the specific table data must be stored in the leaf node, non-leaf nodes are only used for search), and then the leaf node is an incrementing array, then use dichotomy. So, the number of times you need to access the disk is determined by the number of layers in the tree, and the number of times you need to access the disk per specific retrieval is the number of layers of the leaf node.

Select primary key (primary key, primary key, primary key, primary key, primary key, primary key, primary key); Select * from table where primary key (s) is not specified; select * from table where unique key (s) is not specified; Mysql > select * from rowid; select * from rowiD; select * from ROwiD; select * from ROwiD; select * from ROwiD; Therefore, there is only one clustered index in any table in mysql (there must be one clustered index in any table in mysql, and only one clustered index in a table).

Clustered index limitations: Clustered indexes can only speed up primary/unique key retrieval, but not for non-unique keys, such as names, which are secondary index matters.

Question: why is there only one clustered index allowed in MySQL? Answer: Clustered index structure: A clustered index is a B+ tree, indexed by non-leaf nodes. Leaf nodes hold indexes and data, so a clustered index is a combination of indexes and data (this is the structure of clustered indexes). Clustered index has two functions: 1. Quick retrieval of non-leaf nodes determines the order of data; 2. 2. Data rows stored in leaf nodes. First, from the perspective of index, non-leaf nodes quickly retrieve data and determine the order of data: The cluster index represents the actual storage order of a table, there must be only one (that is, the class to set the queue, can not be sorted according to the child height and test scores, always choose the same, primary key and unique key can do clustered index). (2) If you copy the table and store it again in a different order (i.e., another cluster index), it is a different table. Second, in terms of rows, leaves hold rows: if there are two or more clustered indexes, rows are stored repeatedly, wasting disk space and not necessarily. In practice, primary keys of clustered indexes are independent of services to facilitate sequential reads and writes and reduce THE I/O overhead of random reads and writes.

2.2 Secondary Index

Clustered indexes can only help you speed up primary key queries, but what if you want to query by name?

Sorry, if you look at the tree above, the data is not organized by name, so you still have to do a full table scan.

Do not want full table scan, how to do? Add an index to the name field so that the data is organized by name:

create index idx_name on student(name); Create index (idx_name); // Create index (idx_name); // Create index (idx_name);Copy the code

MySQL will create a new B+ tree (not the same as the one where the index is clustered) :

You will find that the leaf node of the tree contains only the name and primary key ID fields, and there is no complete data for the row.

select * from student where name = "David";
Copy the code

MySQL > select * from ‘select *’; MySQL > select * from ‘select *’; select * from ‘select *’;

MySQL created a B+ tree for you in the first place. If you put these two trees next to each other, you can find the two primary key ids from the tree in the cluster index.

In summary, such index without complete information of row data is called secondary index, also known as secondary index.

Key: a clustered index finds the complete row information based on the unique key/primary key, and a secondary index finds the unique key/primary key based on the non-unique key. 2. Clustered index leaf nodes store complete row information, and secondary index leaf nodes store incomplete row information, that is, unique key/primary key. 3, clustered index non-leaf nodes store primary key/unique key as retrieval, comparison rule is int comparison; Secondary index non-leaf nodes store non-unique keys for retrieval, and the comparison rule is alphabetical, generally a string type. Select * from where id= XXX; Where name = “David”; First, find the unique key/primary key based on the non-unique key, then find the full row information based on the unique key/primary key.

Back to the table 2.3

2.3.1 back to the table

Select * from table where name = ‘CSDN’; select * from table where name = ‘CSDN’; Select name from CSDN where id = 2; select name from CSDN where id = 2; Second, go to the primary key index and find the value corresponding to ID 2. The second process, the search back to the primary key index tree, is back to the table. However, there is a way to avoid back tables, and that is to overwrite indexes.

2.3.2 Solution 1: Overwrite the index

Back table is a secondary index problem that finds the primary key first and then the record row information. Overwriting index is to solve the back table, overwriting index can reduce the search times of the tree, improve the performance, he is also in the actual development process we often used to optimize the efficiency of the query. Many syndication indexes are created to support overwriting indexes, which can greatly improve efficiency for specific businesses.

Solution:

Select id from student where idx_name =? , do not return to the table; Select idx_name_age from student where id,age from student where idx_name =? , will not return to the table.

Look at the graph above for secondary and composite indexes. Mysql may select an index from a table that is not indexed back to explain. Mysql may select an index that is not indexed back to explain.

2.3.3 Table Back Solution 2: Index push down

Index push-down is a built-in optimization in mysql to reduce the number of times tables are returned.

Suppose there are federated indexes (a,b,c). There are the following SQL statements:

Select * from itemCenter where a = ‘aa%’ and b=22 and c = 20;

According to the left matching rule, only “A” can be used to retrieve data, BC two fields cannot be used. Before MySQL 5.6, you could only go back to the primary key index to find rows, and then compare b and C values. The index condition pushdown feature introduced in MySQL 5.6 can be used to determine the fields contained in the index during index traversal and filter out the records that do not meet the conditions to reduce the number of table returns.

Compound index and prefix matching principle

3.1 Composite Index

Need: Query by name and age simultaneously?

select * from student where name = "David" and age = 18;
Copy the code

Again, the data is organized by name, but not by age, so we want to index both name and age:

create index idx_name_age on student(name,age); Alter table student create index idx_name_age; alter table student create index idx_name_age; create index idx_name_age; It's still a B+ tree.Copy the code

Create Index is used for all non-clustered indexes, secondary indexes, and composite indexes (not just on a single field).

Goldfinger: The link between clustered indexes and secondary indexes? It’s called a compound index because it’s on two fields at the same time, so it’s actually a two-field secondary index, and it’s still a secondary index, so it’s still going to use create Index, and it’s going to be a B+ tree.

MySQL > select * from B+ tree; select * from B+ tree;



Goldfinger: The link between clustered indexes and secondary indexes?

As mentioned above, a composite index is essentially a two-field secondary index, or secondary index, so it’s still going to find the unique key/primary key based on the non-unique key, and then find the full row based on the unique key/primary key.

Notice the two nodes that I’ve highlighted with the red dotted line, that’s the only difference between this tree and the one that just indexed name, because when you sort, you compare them by name, and you compare them by age if they’re the same name.

Again, this is a very small example, so you can imagine 10,000 students named David, randomly aged between 13 and 20, and if you didn’t store them regularly by age, you’d still have to scan 10,000 rows of data.

3.2 Prefix Matching

Question 1: What is the relationship between leftmost match and prefix match? Answer 1: Leftmost matching is a matching principle that exists at the bottom of the database and cannot be changed. Prefix matching is a matching method for long string fields, using the principle of leftmost matching.

Question 2: Why is left-most matching applicable to secondary and composite indexes, and prefix matching applicable to secondary and composite indexes? Answer 2: Left-most matching is an existing matching principle at the bottom of the database, so it is naturally suitable for secondary indexes and composite indexes, while prefix matching is a tuning method that makes use of the left-most matching principle, so it is also naturally suitable for secondary indexes and composite indexes.

3.2.1 Underlying Principle: leftmost matching

Leftmost matching principle:

Applicable scope: secondary index, including single column index and joint index two kinds (cluster index directly use primary key/unique key, do not need the most left match). Indexes can be as simple as a single column (a) or as complex as multiple columns (A, B, C, D). The left-most matching principle applies to both single-column indexes and multi-column indexes.

In the case of a federated index, the key also consists of multiple columns. At the same time, the index can only be used to check whether the key exists (equal). In the case of a range query (, <, between, like left match), the index cannot be further matched. Therefore, in a federated index, the order of columns determines the number of columns that can be hit.

Example: If there is index (a,b, C,d), query condition A =1 and b=2 and C <3 and d=4, then a, B, and C will be matched in each node in sequence, and D cannot be matched. (c is already a range query, d is certainly not sorted)

Question: Why is the order of the columns in a composite index important? Answer: one sentence explanation, index of the most left prefix matching principle.

Application of left-most matching in composite index

Alter table student select idx_name_age; alter table student select idx_name_age;

select * from student where name = "David"; Select * from student where idx_name_agewhereage = 18; Do not use idX_NAMe_age composite index, use full table scanCopy the code

3.2.2 Prefix Index (Applicable to Secondary Index + Compound Index)

1. Secondary index long string: Because of the left-most matching principle, long string fields need both good retrieval efficiency and small storage space. The balance between the two is index selectivity, so the prefix index is produced. Prefix index determines based on index selectivity that the prefix index should take the first few characters.

Prefix index advantages: the index occupies a small space and fast effect; Disadvantages: Cannot use prefix index to do ORDER BY and GROUP BY; Unable to use prefix index for overlay scan Prefix indexes may also increase the number of scanned rows.

2, the secondary index long string prefix is the same: in the long string field, the prefix is the same how to do (actual: ID number, student number, work number, mailbox, etc.), the first three, the database in reverse order storage of these prefix the same number, mailbox, remove the WWW began to search;

3, the joint index in the joint index which index to put in front, index selectivity determines, the fundamental basis is still the left-most matching principle (the joint index represents not only a field, is the extension of the secondary index)

3.2.2.1 Determine the prefix index by index selectivity. The prefix index should take the first few characters

We have said before, such as in a long string of fields (e.g., url), we can use a pseudo hash index form to create an index, in order to avoid the index became both large and slow, besides it can also use the prefix index (strings of characters) to achieve our goal, in the form of the prefix index should be how to choose? This involves a concept called index selectivity

Index selectivity definition: The ratio of non-repeating index values (also called cardinality) to the total number of records ina table. The higher the ratio, the more selective the index is. Unique indexes are the most selective, and the ratio is 1.

You can evaluate index design by viewing the cardinality value of each index by SHOW INDEXES FROM Table.

To select this ratio, we can take the first 3,4,5,6,7 prefix index, and then compare the selection of these prefix index, execute the following statement

SELECT 
 COUNT(DISTINCT LEFT(city,3))/COUNT(*) as sel3,
 COUNT(DISTINCT LEFT(city,4))/COUNT(*) as sel4,
 COUNT(DISTINCT LEFT(city,5))/COUNT(*) as sel5,
 COUNT(DISTINCT LEFT(city,6))/COUNT(*) as sel6,
 COUNT(DISTINCT LEFT(city,7))/COUNT(*) as sel7
FROM city_demo
Copy the code

The result is as follows

Sel3 sel4 sel5 sel6 sel7 0.0239 0.0293 0.0305 0.0309 0.0310Copy the code

It can be seen that when the prefix length is 7, the ratio of selective index promotion is very small, that is to say, the first six characters of city should be selected as the prefix index, as follows

ALTER TABLE city_demo ADD KEY(city(6)) ALTER TABLE city_demo ADD KEY(city(6))Copy the code

Our current is based on the average selectivity of index, sometimes it is not enough, also have to consider the worst case of selective, in this demo, for example, might some see choose 4, 5 prefix index and choice of the 6, 7 selectivity were similar, you’ll have to look at the choice of 4, 5 prefix index distribution is uniform

SELECT COUNT(*) AS  cnt, LEFT(city, 4) AS pref FROM city_demo 
  GROUP BY pref 
  ORDER BY cnt DESC LIMIT 5
Copy the code

The following results can occur

cnt	pref
305	Sant
200	Toul
90	Chic
20	Chan
Copy the code

It can be seen that the distribution is very uneven, and the number of indexes with Sant and Toul as prefixes is very large. The selectivity of these two indexes is not very ideal, so the worst selectivity should also be considered when selecting prefix indexes.

3.2.2.2 Prefix Indexes May increase the number of scanned rows

Prefix index advantages: the index occupies a small space and fast effect; Disadvantages: Cannot use prefix index to do ORDER BY and GROUP BY; Unable to use prefix index for overlay scan Prefix indexes may also increase the number of scanned rows.

Prefix indexes may also increase the number of rows scanned, explained, assuming the following table data and SQL to be executed

id	email
1	[email protected]
2	[email protected]
3	[email protected]
4	[email protected]
Copy the code
SELECT id,email FROM user WHERE email='[email protected]';
Copy the code

If we set the index of the whole field for email, then query the next record of the record after finding the relevant record according to “[email protected]” in the above table. If no record is found, the scan is stopped. At this time, only one record is scanned. If we used the previous six characters (email(6) zhangs as a prefix index, we would obviously have to scan four rows and then have to go back to the primary key index to determine the value of the email field. So using a prefix index to evaluate this overhead, the prefix index might increase the number of scanned rows.

3.2.2.3 There are two processing methods for the same prefix

Question: index design, advanced use of prefix index: how to do with the same prefix, such as ID number, student number, work number, mailbox, website, etc.?

Answer: ID number, student number, work number: prefix the same suffix is different, the database in reverse order storage of these prefix the same number, mailbox, website: prefix the same suffix is the same, the database directly does not store these prefixes (WWW, etc.).

For example, now we establish a population information table for the citizens of a city, although the id card of the population of this city is different, but the number of numbers in front of the ID card are the same, this situation how to establish prefix index.

The ID card in the database can be stored in reverse order, and can be queried as follows:

SELECT field_list FROM t WHERE id_card = reverse('input_id_card_string');
Copy the code

Mysql stored in reverse order id number to t id_card fields in a table, so that the input value input_id_card_string need to reverse, and then compared with id_card field in a database table, so that you can use the id card after six prefixes index.

3.2.2.4 Set the order of the union index by index

In a union index, which index comes first is determined by index selectivity, based on the left-most matching principle (a union index is not just a field, but an extension of a secondary index)

Index selectivity also applies to the design of a federated index. Unless there are special circumstances, we generally recommend that the most selective column be first when creating a federated index. For example, for the following statement:

SELECT * FROM payment WHERE staff_id = xxx AND customer_id = xxx;
Copy the code

For this statement alone, which of the two joint indexes (staff_id, customer_id) and (customer_id, staff_id) should we create?

SELECT 
 COUNT(DISTINCT staff_id)/COUNT(*) as staff_id_selectivity,
 COUNT(DISTINCT customer_id)/COUNT(*) as customer_id_selectivity,
 COUNT(*)
FROM payment
Copy the code

The results are:

Staff_id_selectivity: 0.0001 CUSTOMer_id_SELECTIVITY: 0.0373 COUNT(*): 16049Copy the code

You can see that customer_id is more selective, so you should select Customer_id as the first column.

create index idx_customer_staff  on student(customer_id,staff_id);  
Copy the code

4. Index related knowledge

Question: If YOU are using ORM, do you still need to care about indexes? Answer: Yes, you still need to care about indexes, even if you use the object Mapping (ORM) tool. ORM tools are capable of producing logical, legitimate queries (most of the time), and it is difficult to generate queries suitable for indexing unless they are only producing very basic queries (such as queries based on primary keys only). ORM tools, no matter how complex, are often overshadowed by sophisticated and complex indexes. Many times, even query optimization technologists, let alone ORM, have a hard time covering all the scenarios.

Question: In MySQL, indexing is implemented at the storage engine layer rather than the service layer (MySQL architecture is divided into two layers: MySQL service layer and storage engine layer)? Answer: First, because indexes depend on the storage engine layer, different storage engines provide different indexing standards, and there is no universal indexing standard: indexes in different storage engines work differently, and not all storage engines support all types of indexes. Second, even if multiple storage engines support the same type of index, their underlying data structures may be implemented differently. There are three common indexes: B+ tree index, hash index and full-text index. Even with B+ tree index, different storage engines implement it differently: (1) MyISAM uses prefix compression technology to make indexes smaller, while InnoDB stores them in raw data format; (2) MyISAM index refers to indexed rows by physical location of data, while InnoDB refers to indexed rows by primary key.

Question: How do I see all indexes defined for the table? Answer: Indexes are defined for tables by: SHOW INDEX FROM

Question: How many columns can be used to create an index? Answer: Any standard table can create up to 16 index columns.

Hash for long string fields?

Because there is a disk occupancy problem, the longer the index is selected, the more disk space is occupied, the fewer index values can be placed on the same data page, and the less efficient the search will be.

The solution is hash, which stores the hash field as another field, uses the hash field as the index, and verifies the hash every time. The hash index is not large

Five, the end

Play code every day, progress every day!