What is the index?

Indexing is to organize data by some data structures and algorithms, and ultimately guide users to quickly retrieve the data they need. Working mode: using B + tree, linked list, dichotomy search, achieve fast location of target data, fast range search.

Indexes have two features:

  • Use data structures and algorithms to organize raw data efficiently
  • With these effective organizations, users can be guided to quick retrieval of raw data

The essence of indexing: It filters out the desired results by constantly narrowing the range of data to be retrieved, while turning random events into sequential events. In other words, with this indexing mechanism, we can always lock the data in the same lookup.

Index data structure: B+ tree

InnoDB data is read and written on a data page basis. That is, when it is time to read a record, instead of reading the record itself from disk, the entire record is loaded into memory on a page-by-page basis, where there may be many records, and the page is retrieved in memory. In InnoDB, the default page size is 16KB.

B + tree features:

  • Each node has at most m children
  • Each node except the root has at least [m/2] children, and the root has at least two children
  • Nodes with k children must have k keywords
  • The parent node holds Pointers to child nodes
  • The keyword of the parent node is either the minimum value or the maximum value of the child node. If the keyword of the node is in ascending order, the keyword of the parent node is the minimum value of the child node
  • The lowest node is the leaf node
  • Except for leaf nodes, the other nodes do not store data, only keywords and Pointers
  • The leaf node contains all the keywords and data of the data, and the leaf nodes are connected with a linked list, which can be very convenient to support range search
  • The middle node of a k subtree contains one K element (k-1 element in a B tree). Each element does not store data, and all data is stored on the leaf node.
  • All leaf nodes contain information of all elements and Pointers to records containing these elements, and the leaf nodes themselves are connected in order from small to large in the size of keywords.
  • All intermediate node elements exist at the same time with the child node, in the child node element is the maximum or minimum value.

Advantages of B+ trees:

  • A single node stores more elements, resulting in fewer I/OS for queries.
  • All queries look for leaf nodes, and the query performance is stable.
  • All leaf nodes form an ordered linked list for easy range query.

The index classification

There are clustered indexes and non-clustered indexes.

  1. Clustered index

Each table has and must have a clustered index. The data of the entire table is stored in the clustered index. Mysql indexes are stored in files using B+ tree structure. When no primary key is specified in the table, mysql automatically adds a hidden ROWID field (default 4 bytes) to each record as the primary key, and roWID is used to build the clustered index. Clustered indexes are also called primary key indexes in mysql.

  1. Non-clustered index (secondary index)

It is also a B + tree structure, but unlike clustered indexes, non-clustered index leaf nodes store the value of the field (index field) and the corresponding primary key of the record, while other nodes store only the value of the field (index field). Each table can have multiple non-clustered indexes.

Mysql non-clustered indexes are divided into:

  • Single-column index: That is, an index contains only one column.
  • Multi-column index (also called compound index) : An index contains multiple columns.
  • Unique index: The value of the indexed column must be unique, allowing a null value.

Index FAQs:

  • Return table: select primary key ID from where condition, use primary key ID to query current data.
  • Index coverage: The index tree used in the query contains the values of all the fields required by the query, and there is no need to aggregate the index to retrieve data. This is called index coverage.
  • Index Condition Pushdown(ICP) is a new feature in MySQL 5.6. It is an optimized way to filter data using indexes in the storage engine layer. ICP can reduce the number of times the storage engine accesses the base table and the number of times the MySQL server accesses the storage engine.

Advice:

  1. Creating an index on a highly differentiated field can effectively use the index. If the index is too differentiated, it cannot be used effectively and may need to scan all data pages, which is similar to not using the index

  2. Note the leftmost matching principle: Mysql > select * from left to right; mysql > select * from left to right; mysql > select * from left to right; mysql > select * from left to right; mysql > select * from left to right; mysql > select * from left to right; D (a,b,d,c); d (a, B,d,c); d (a, B,d)

  3. When querying records, use index coverage instead of * to reduce back operations and improve efficiency

  4. Some queries can use federated indexes, which can then be used for index push-down (IPC), as well as to reduce back to table operations and improve efficiency

  5. Do not use functions or operators on an index field. Otherwise, the index will become invalid

  6. String field comparisons with numbers invalidate indexes

  7. A fuzzy query ‘% value %’ invalidates the index and turns it into a full table scan, but ‘value %’ makes efficient use of the index

  8. Use index fields to reduce sorting and improve query efficiency

Additional knowledge:

In the case of mechanical hard disks, a few concepts.

  1. Sector: the smallest unit of disk storage. The sector size is 512 bytes.

  2. Disk block: the smallest unit of interaction between a file system and a disk (the smallest unit for a computer system to read and write disks). A disk block consists of several consecutive (2^n) sectors. The block size is generally 4KB. Data reading from disk: Data reading from disk depends on mechanical movement. The time spent for each data reading can be divided into three parts: seek time, rotation delay, and transmission time. The seek time refers to the time required for a magnetic arm to move to a specified track. Rotation delay is the disk speed we often hear, for example, a disk 7200 revolution, means that it can rotate 7200 times per minute, that is, one second can rotate 120 times, the rotation delay is 1/120/2 = 4.17ms; Transfer time refers to the time it takes to read or write data from or to the disk, usually in the order of a few tenths of a millisecond and negligible compared to the first two. So the time to access a disk, namely the time to access a disk IO, is about 5+4.17 = 9ms, which sounds good, but remember that a 500-MIps machine can execute 500 million instructions per second, because instructions rely on electrical properties, in other words, the time to execute an IO can execute 400,000 instructions. A database with hundreds of millions or even tens of millions of data, nine milliseconds at a time, is a disaster.

  3. A page is an internal data structure defined by mysql. The default value of a page is 16kb, which is equivalent to four disk blocks. This value can be changed if mysql reads 16kb of data from the disk at a time.

  4. In the process of data retrieval, we do not optimize the data storage mode, and directly store the records of the tables in the database on disk. If a table has only one field, which is of int type, and int occupies 4 bytes, each disk block can store 1000 records, and 1 million records need 1000 disk blocks. If we need to retrieve the required records from the 1 million records, we need to read the data of 1000 disk blocks (1000 IO times), and each IO takes 9ms, then 1000 times requires 9000ms=9s, and a random query of 100 data takes 9 seconds. This situation is unacceptable, obviously not acceptable.