Warm-up

A motivational line before we begin: without the bone-chilling cold, how could the plum blossom smell so sweet? Learning can be dry, so stick with it! I learned this material from Mr. Dinky; anything you don't understand, go look it up yourself! This article takes about 35 minutes to read.

Hi, everybody. Last time we took an overview of transaction isolation and views. Today I'll talk about index selection, and how choosing the right kind of index improves performance!

Getting started

Suppose you are maintaining a citizen system in which everyone has a unique ID number, and the business code already ensures that no two duplicate ID numbers are written. If the citizen system needs to look up names by ID number, it executes a SQL statement like this:

select name from CUser where id_card = 'xxxxxxxyyyyyyzzzzz';

At this point id_card is clearly going to be one of the most frequently used fields, so you'll naturally want to add an index to it. But you must not pick the wrong kind of index here: a small system can get away with it, but a large one will probably suffer badly. Why?

id_card is the ID-number field: 18 characters long, which makes it a large field. So if you don't think the choice through, the impact is significant. First, rule out making it the primary key (and thus the clustered index): such a long key would bloat every secondary index.

That leaves two candidates: a normal index or a unique index. Since the ID numbers in the table never repeat, either one works; the two options are sketched below. Now let's examine the advantages and disadvantages of these two index types in terms of performance.
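For concreteness, here is what the two candidates look like in SQL. This is a minimal sketch: the index names are my own, and the CUser table is the one from the query above.

-- Option 1: a unique index; the engine itself enforces uniqueness
ALTER TABLE CUser ADD UNIQUE INDEX uk_id_card (id_card);

-- Option 2: a normal index; uniqueness is left to the business code
ALTER TABLE CUser ADD INDEX idx_id_card (id_card);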

The query process

As shown in the figure below, id is the primary key and k holds the value of id_card.

Figure 1 Index organization structure of InnoDB

Take the SQL statement select id from T where k=5. The query walks the B+ tree from the root node down to the leaf data page, then uses binary search within the page to locate the record (5,500).

  • For a normal index, after finding the record (5,500), the search must continue to the next record, until it hits the first record that does not satisfy k=5.
  • For a unique index, after finding the record (5,500), the search stops and returns immediately, since the index guarantees uniqueness.

The performance difference between these two lookups is minimal. Why is that?

InnoDB reads and writes data in units of data pages. That is, when a record needs to be read, the record is not fetched from disk by itself; instead the whole page containing it is read into memory. In InnoDB, the default page size is 16KB.
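You can confirm the page size on a running instance. SHOW VARIABLES is standard MySQL; note that innodb_page_size is read-only at runtime and can only be chosen when the instance is initialized.

SHOW VARIABLES LIKE 'innodb_page_size';  -- 16384 (16KB) by default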

Because the engine reads and writes by page, the data page is already in memory when the record with k=5 is found. So for a normal index, the extra "find and judge the next record" operation costs only one pointer move and one comparison.

The update process

To illustrate the impact of normal and unique indexes on update performance, we first need to look at what the change buffer is.

Change buffer: this is persistent data. In other words, the change buffer has a copy in memory and is also written to disk.

When a data page needs to be updated and it is already in memory, it is updated directly. If the page is not in memory, InnoDB records the update in the change buffer instead (without affecting data consistency), so the page does not need to be read from disk right away. The next time a query accesses that page, the page is read into memory and the change buffer operations for it are applied. This keeps the data logically correct.

Applying the operations recorded in the change buffer back to the data page is called a merge.

Besides the merge triggered by accessing the data page, background threads also merge periodically, and a merge is performed when the database shuts down normally.

Obviously, if an update can be recorded in the change buffer first, reducing disk reads, the statement executes noticeably faster. Moreover, since reading a page into memory consumes buffer pool space, this approach also avoids tying up memory and improves memory utilization.

When can the change buffer be used?

For a unique index, every update must first check whether the operation violates the uniqueness constraint. For example, to insert the record (4,400), we must first check whether a record with k=4 already exists in the table, and that check requires the data page to be in memory. And if the page has already been read into memory, updating it directly is faster anyway, so there is no point in using the change buffer.

Therefore, updates to a unique index cannot use the change buffer; in fact, only normal indexes can use it.
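To make that rule concrete, here is a hedged example against the CUser table from earlier (the name column comes from the original query; 'Alice' is just a placeholder value):

-- With a UNIQUE index on id_card, this insert must read the target page
-- into memory to check for duplicates, so the change buffer cannot help.
-- With a normal index, the change could simply be buffered instead.
INSERT INTO CUser (id_card, name) VALUES ('xxxxxxxyyyyyyzzzzz', 'Alice');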

The change buffer uses memory in the buffer pool, so it cannot grow without bound. Its size can be set dynamically via innodb_change_buffer_max_size; if this parameter is set to 50, the change buffer may occupy at most 50% of the buffer pool.
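Both statements below use the innodb_change_buffer_max_size variable just mentioned; the 50 mirrors the example above (MySQL's default is 25):

-- Check the current limit
SHOW VARIABLES LIKE 'innodb_change_buffer_max_size';

-- Allow the change buffer to occupy at most 50% of the buffer pool
SET GLOBAL innodb_change_buffer_max_size = 50;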

That is the general introduction to the change buffer. Now let's walk through what InnoDB does when we insert a new record into this table.

The first case: the data page to be updated is in memory.

  • For unique indexes, find the position between 3 and 5, determine that there is no conflict, insert this value, the statement is finished;
  • For normal indexes, find the position between 3 and 5, insert the value, and the statement completes.

Thus, when the page is in memory, the difference between a normal index and a unique index on update performance is just one uniqueness check, which costs a negligible amount of CPU time. That's not our focus.

The second case: the data page to be updated is not in memory.

  • For a unique index, the data page must be read into memory; if there is no conflict, the value is inserted and the statement completes.
  • For a normal index, the update is recorded in the change buffer, and the statement completes.

Reading data from disk into memory involves random I/O and is one of the most expensive operations in a database. Because the change buffer reduces random disk reads, it improves update performance significantly.
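A hedged sketch of the contrast, assuming table T from Figure 1, that (4,400) denotes k=4 for a row with id 400, and that the target page is not in memory:

-- Same statement, very different cost depending on the index type on k:
INSERT INTO T (id, k) VALUES (400, 4);
-- Unique index on k: the page must be read from disk (random I/O)
-- to verify that no record with k=4 already exists.
-- Normal index on k: the change is recorded in the change buffer and
-- no disk read happens now.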

Change buffer use cases

From the analysis above we now understand how the change buffer accelerates updates, and we know it applies only to normal indexes, not unique indexes. That raises a question: does the change buffer speed up every scenario involving a normal index?

Obviously not. For normal indexes, the change buffer greatly reduces the disk I/O needed for modifications. But what about workloads that read a lot and write a little? An update will then likely be followed soon by a read of the same page, which triggers a merge immediately.

So the real benefit depends on how many changes a page accumulates before it is merged: the more buffered changes per merge, the bigger the gain. The change buffer therefore suits write-heavy, read-light tables.

Note: for tables that are read often and written rarely, consider turning the change buffer off. In that access pattern, an update is recorded in the change buffer only to be merged almost immediately by the next read; random I/O is not reduced, and maintaining the change buffer just adds overhead.
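MySQL exposes this switch through the innodb_change_buffering variable, which can be set dynamically ('all' is the default):

-- Disable change buffering for a read-heavy, write-light workload
SET GLOBAL innodb_change_buffering = 'none';

-- Restore the default (buffer all applicable operations)
SET GLOBAL innodb_change_buffering = 'all';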

Index choice in practice

Back to the question at the start of this article: how do you choose between a normal index and a unique index? In terms of query capability there is no real difference between the two; the main consideration is the impact on update performance. Therefore, I recommend choosing the normal index whenever possible.

If every update is immediately followed by a query for the same record, you should turn the change buffer off. In all other cases, the change buffer improves update performance.

In practice, you will find that the combination of normal indexes and the change buffer makes a noticeable difference when updating large tables.

In particular, the change buffer mechanism shines on mechanical hard disks. So if you keep a library of historical data on mechanical disks for cost reasons, pay special attention to the indexes on those tables: use normal indexes wherever possible, and make the change buffer as large as you can, so that writes to the historical tables are as fast as possible.
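To see whether the change buffer is actually earning its keep, you can inspect InnoDB's status output; the section name in the comment is the one MySQL prints:

-- Look for the "INSERT BUFFER AND ADAPTIVE HASH INDEX" section, which
-- reports the change buffer's size, free list length, and merge counts.
SHOW ENGINE INNODB STATUS\G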

Change buffer and redo log

That wraps up the principle of the change buffer. Since a related mechanism does similar-looking work, let's extend the discussion to the redo log.

Redo log insert request

mysql> insert into t(id,k) values(id1,k1),(id2,k2);

Here, we assume the current state of the k index tree is as shown: after locating the insert positions, the data page for k1 is in memory (the InnoDB buffer pool), while the data page for k2 is not. The update flow with the change buffer is shown in Figure 2.

Figure 2 Update process with change buffer

This update statement involves four parts: memory, the redo log (ib_log_fileX), the data tablespace (t.ibd), and the system tablespace (ibdata1).

The update statement does the following (in the numbered order of the figure):

  • Page 1 is in memory, so update it directly in memory;
  • Page 2 is not in memory, so record the message "insert a row into Page 2" in the change buffer;
  • Log both of the above actions in the redo log (steps 3 and 4 in the figure).

After these operations, the SQL statement is done. Notice how cheap the modification is: two memory writes, plus a single disk write for the redo log (the two log records are written together), and even that write is sequential.

Meanwhile, the two dotted arrows in the figure are background operations and do not affect the response time of updates.

Redo log read requests

Now suppose we execute select * from t where k in (k1, k2). Here is a flowchart of these two read requests.

If the reads happen shortly after the update and the data is still in memory, they have nothing to do with the system tablespace (ibdata1) or the redo log (ib_log_fileX), so those two parts are omitted from the figure.

Figure 3 Read process with change buffer

As can be seen from the figure:

When Page 1 is read, it is returned directly from memory. Several readers asked in the comments on the previous article whether, under WAL, the data must first be read from disk, or the redo log must first be applied, before a result can be returned. The answer is no. Look at the state in Figure 3: although the on-disk data is still the old version, the result is returned directly from memory, and it is correct.

To read Page 2, the page must first be read from disk into memory, after which the operations recorded in the change buffer are applied to produce the correct version, and the result is returned.

As you can see, the data page is read into memory only when Page 2 actually needs to be read.

So, if you compare the benefits of these two mechanisms for update performance: the redo log mainly saves the I/O of random disk writes (by converting them to sequential writes), while the change buffer mainly saves the I/O of random disk reads.

Conclusion

We covered how to choose between normal and unique indexes in practice, and along the way touched on the redo log, WAL, the change buffer, and the query and update processes.

Next up: why MySQL sometimes picks the wrong index.