Preface

This article does not cover the basics of indexes; we assume you already know the difference between a unique index and a normal index. The question is: in a given business scenario, when you create an index on a table, should you choose a normal index or a unique index?

Suppose you maintain a user information system in which each person has a unique ID card number, and the business code already guarantees that two duplicate ID numbers are never written. If the system needs to look up names by ID number, it executes SQL statements like this:

select f_name from t_user_info where f_id_card = 'xxxxxx';

Therefore, you would consider building an index on the f_id_card field. Because the ID card number field is long, it is not a good choice for the primary key. That leaves two options:

  • Create a unique index on f_id_card; or
  • Create a normal index on f_id_card (both options are written out in the SQL sketch below).
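
As a concrete sketch, assuming the table from the query above (the index names are made up for illustration):

-- Option 1: a unique index on the ID card column
alter table t_user_info add unique index uk_id_card (f_id_card);

-- Option 2: a normal index on the same column
alter table t_user_info add index idx_id_card (f_id_card);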

The business code already guarantees that duplicate ID numbers are never written, so both options are logically valid. But from a performance perspective, should you choose the unique index or the normal index? And what is the basis for the choice?

Before answering, take a look at the index diagram below:

Next, we will analyze the performance impact of these two indexes on query and update statements.

The query process

Consider the query select id from T where k=5. It searches the index tree starting from the root of the B+ tree, descending level by level to a leaf node (the data page in the lower-right corner of the figure). Within that data page, the record is then located by binary search.
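
For reference, here is a plausible definition of the example table T, matching the (k, id) leaf records in the figure; the exact schema is an assumption, since it is not given here:

-- hypothetical schema for the example table T
create table T (
  id int primary key,
  k int not null,
  index (k)
) engine = InnoDB;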

  • For a normal index, after finding the first record that satisfies the condition, (5,500), InnoDB continues to the next record, and the next, until it encounters the first record that does not satisfy k=5.
  • For a unique index, since the index guarantees uniqueness, the search stops as soon as the first matching record is found.

So how much of a performance difference does this different retrieval bring? The answer is: very little.

We know that InnoDB reads and writes data in units of data pages. That is, when a record needs to be read, the record is not read from disk by itself; instead, the whole page containing it is read into memory.

In InnoDB, each data page is 16KB by default. Because the storage engine reads and writes whole pages, when the record with k=5 is found, its data page is already in memory. So, for a normal index, the extra “find and check the next record” operation costs only one pointer move and one comparison.
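
You can check the page size on your own server; innodb_page_size is a standard, read-only InnoDB variable:

show variables like 'innodb_page_size';
-- returns 16384 (16KB) by default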

Of course, if the record k=5 happens to be the last record on its data page, fetching the next record requires reading the next data page, which is slightly more involved. For an integer field, though, each index entry is only on the order of a dozen bytes, so a 16KB data page can hold nearly a thousand keys, and the probability of hitting this case is low.

So, averaged out, this extra cost is negligible for today’s CPUs.

The update process

To explain the impact of normal and unique indexes on the performance of update statements, we first need to introduce the change buffer.

When a data page needs to be updated and the page is already in memory, it is updated directly. If the page is not in memory, InnoDB caches the update operation in the change buffer, without compromising data consistency, so the data page does not have to be read from disk first.

The next time a query needs to access that data page, the page is read into memory and the change buffer operations related to it are applied. This guarantees the correctness of the data logic.

Note that despite its name, the change buffer is persistent data: it has an in-memory copy and is also written to disk.

The process of applying the operations in the change buffer to the original data page, producing the up-to-date result, is called merge. Besides the merge triggered by a query accessing the data page, the system also has background threads that merge periodically.

A merge is also performed when the database shuts down normally. Clearly, if update operations can be recorded in the change buffer first, reducing disk reads, statements execute noticeably faster. Moreover, reading a data page into memory consumes buffer pool space, so this approach also avoids taking up memory and improves memory utilization.

So, under what conditions can the change buffer be used?

For a unique index, every update must first check whether the operation violates the uniqueness constraint.

For example, to insert the record (3,300), InnoDB must first determine whether the table already contains a record with k=3, and the only way to do that is to read the data page into memory. And if the page is already in memory, it is faster to just update it directly, so there is nothing for the change buffer to do.

Therefore, updates to unique indexes cannot use the change buffer; in fact, only normal indexes can use it.

The change buffer uses memory in the buffer pool, so it cannot grow without bound. Its size can be set dynamically with the innodb_change_buffer_max_size parameter. Setting this parameter to 50 means the change buffer may occupy at most 50% of the buffer pool.
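
Checking and changing it is a one-liner each; innodb_change_buffer_max_size is a dynamic global variable (the default is 25):

show variables like 'innodb_change_buffer_max_size';
set global innodb_change_buffer_max_size = 50;  -- at most 50% of the buffer pool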

Walking through the update process

Now that you understand the change buffer mechanism, let’s walk through what InnoDB does when inserting a new record (3,300) into the table.
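
Written as SQL, and assuming the (k, id) layout of the leaf records shown earlier, i.e. k=3 and id=300 (this column mapping is an inference, not stated explicitly above):

-- hypothetical insert corresponding to the record (3,300)
insert into t(id, k) values(300, 3);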

In the first case, the target page into which the record is to be inserted is in memory. InnoDB’s processing flow is then:

  • For a unique index, find the position where 3 should go, check that there is no conflict, insert the value, and the statement completes;
  • For a normal index, find the position where 3 should go, insert the value, and the statement completes.

As you can see, the difference between a normal index and a unique index here is just one extra check, costing a tiny amount of CPU time. It is negligible.

In the second case, the target page into which the record is to be inserted is not in memory. InnoDB’s processing flow is then:

  • For a unique index, the data page must be read into memory; if there is no conflict, the value is inserted and the statement completes;
  • For a normal index, the update is simply recorded in the change buffer, and the statement completes.

Reading data from disk into memory involves random I/O and is one of the most expensive operations in a database. The change buffer significantly improves update performance precisely because it reduces random disk reads.
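
To see how much work the change buffer is doing on your server, SHOW ENGINE INNODB STATUS includes an “INSERT BUFFER AND ADAPTIVE HASH INDEX” section (the change buffer grew out of the older insert buffer) that reports its size and merge counts; the exact fields vary by version:

show engine innodb status\G
-- look for the "INSERT BUFFER AND ADAPTIVE HASH INDEX" section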

A DBA colleague of mine once reported that the memory hit rate of one of his business databases had suddenly dropped from 99% to 75%, the whole system was blocked, and all update statements were stalled. The cause, it turned out, was that this business performed a large number of inserts, and he had changed one of its normal indexes into a unique index the day before.

Usage scenarios of the change buffer

From the analysis above, it is clear that the change buffer speeds up the update process, and that it applies only to normal indexes, not to unique indexes.

So, one question remains: can the change buffer speed up every scenario that uses normal indexes?

Merge is when the data is actually updated, and the change buffer’s job is to cache changes until then. So the more changes the change buffer accumulates for a data page before that page is merged (that is, the more times the page would otherwise have had to be updated), the greater the benefit.

Therefore, for write-heavy, read-light workloads, the probability that a page is accessed immediately after being written is small, and the change buffer works best.

This access pattern is common in billing and logging systems. Conversely, suppose a business’s update pattern is to query a record immediately after writing it. Even though the conditions are met and the update is recorded in the change buffer first, the merge is triggered right away because the data page is accessed immediately. The number of random I/O accesses is not reduced, and the cost of maintaining the change buffer is added on top. For this access pattern, the change buffer is counterproductive.
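
A minimal sketch of this write-then-read pattern (the values are made up for illustration):

-- if the target page is not in memory, this insert is recorded in the change buffer...
insert into t(id, k) values(400, 4);
-- ...but the immediate read pulls the page into memory and triggers the merge anyway
select * from t where k = 4;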

Index selection

In fact, the two index types are equivalent in query performance; the main consideration is their impact on update performance. Therefore, I recommend choosing a normal index whenever possible.

If every update is immediately followed by a query for the same record, you should turn the change buffer off. In all other cases, the change buffer improves update performance.
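
Turning it off is done through the innodb_change_buffering variable (valid values include all, none, inserts, deletes, changes, and purges):

set global innodb_change_buffering = 'none';  -- disable the change buffer
set global innodb_change_buffering = 'all';   -- restore the usual default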

In practice, you will find that the optimization from combining normal indexes with the change buffer is quite noticeable when updating tables with large data volumes.

The change buffer mechanism pays off especially well on mechanical hard disks. So if you keep a historical-data database on mechanical disks for cost reasons, pay special attention to the indexes on those tables: use normal indexes wherever possible, and make the change buffer as large as you can, so the historical tables can absorb writes as fast as possible.

Change buffer and redo log

If you have understood how the change buffer works, you may be reminded of the redo log.

Now let’s execute this insert statement on the table:

mysql> insert into t(id,k) values(id1,k1),(id2,k2);

Assume the current state of the k index tree is such that, after the insert positions are located, the data page containing k1 is in memory (in the InnoDB buffer pool) while the data page containing k2 is not.

Analyzing this update statement, you can see that it involves four parts: memory, the redo log (ib_logfileX), the data tablespace (t.ibd), and the system tablespace (ibdata1).

The update statement does something like this:

  1. Page 1 is in memory, so update it directly in memory;
  2. Page 2 is not in memory, so record the message “insert a row into Page 2” in the change buffer;
  3. Record both of the above actions in the redo log.

Once this is done, the transaction is complete. As you can see, this update is cheap: two writes to memory, then one write to disk (the two actions are combined into a single disk write), and that disk write is sequential.

So what happens to read requests after that?

For example, we now want to execute select * from t where k in (k1, k2).

If the read happens shortly after the update, while the data is still in memory, these two reads have nothing to do with the system tablespace (ibdata1) or the redo log (ib_logfileX).

  1. Reading Page 1 returns the result directly from memory. Although the page on disk may still hold the old data, the result returned from memory is correct.
  2. Reading Page 2 requires reading the page from disk into memory and then applying the operations recorded in the change buffer, which produces the correct, up-to-date version to return.

As you can see, the data page is not read into memory until Page 2 actually needs to be read.

So, comparing the benefit each mechanism brings to update performance: the redo log mainly saves the I/O of random disk writes (by converting them into sequential writes), while the change buffer mainly saves the I/O of random disk reads.

Conclusion

Today we started from the choice between normal and unique indexes, walked through the data query and update processes, explained the mechanism and applicable scenarios of the change buffer, and finally discussed how to choose between the two index types.

Since unique indexes cannot take advantage of the change buffer optimization, my recommendation is: if the business can accept it, prefer non-unique indexes from a performance perspective.