Abstract: In the cloud service environment, how to solve the performance problem of customer index creation based on a large amount of data has become a challenge for cloud service vendors. Huawei cloud GaussDB(for MySQL) uses the parallel index creation technology to solve performance bottlenecks such as batch index creation and temporary index addition, helping users establish indexes more quickly. For more tips on how to quickly create an index, don’t miss this article.

This document is shared with Huawei cloud community how to Quickly Create an index on GaussDB(for MySQL) by Bin Su, a senior architect of Huawei Cloud Database.

Su Bin, senior architect of Huawei Cloud database, has 16 years of database kernel development experience. Previously, as the main r&d staff of MySQL official InnoDB team, he participated in and led the development and release of many important features. Currently, I am responsible for and participate in the design and development of kernel functions of RDS for MySQL and GaussDB(for MySQL), huawei cloud RDS products.

MySQL > update MySQL

As we all know, databases use indexing techniques to speed up the query of data. MySQL database also supports several kinds of index structure to improve the performance of the query (see the MySQL documentation: dev.mysql.com/doc/refman/…). , the most widely used index is B+ Tree index, because B+ Tree index has a good balance between query and modify performance, and its storage and maintenance costs are relatively good.

The MySQL table itself is represented by the clustered index (B+tree index), plus several secondary indexes, including B+tree index, together to form a MySQL independent table, can be said to be composed of a group of indexes. We all know that indexes are a double-edged sword. Sufficient indexes can better improve the performance of queries that can be adapted, but they need to be maintained to synchronize with the data, so more indexes can bring higher overhead during data modification operations. The tradeoff between index creation and not creation is usually dynamic. Users may not know which indexes need to be created at the beginning of the table definition, and they need to adjust the indexes as the business changes, which also brings some problems of dynamic index creation.

MySQL index creation logic

Let’s first look at the MySQL index creation logic. First, MySQL index creation can be implemented using two different DDL (Data Definition Language) algorithms. The first is the COPY algorithm, which is very inefficient, is between the two tables for data COPY, to complete the table structure related modification, especially it requires table lock, now basically not used. The second, INPLACE, does not require locks, so many DDL operations do not block DML (Data ManipulationLanguage) operations, such as creating indexes. The specific implementation of the algorithm is completed at the storage engine level, which can be more optimized. There is actually an INSTANT algorithm for DDL statements, but it does not support index creation and is not covered here.

For the INPLACE algorithm, prior to version 5.7, index records were continuously inserted into the established empty index. This approach leads to significant performance problems and potential space waste due to the disorderly nature of the inserted data. After version 5.7, MySQL optimized the index building step to a bottom-up batch insertion and compact assembly of sorted index records. If there are multiple indexes to be created, the same algorithm will be performed for each index individually. The new algorithm goes through the main steps of reading data, sorting data, and creating indexes.

In general, DDL operations such as index creation take longer than common DML operations. As a result, users need to wait a long time before dynamically adding indexes to accelerate query, which greatly affects services. In addition, the user’s MySQL instance has enabled Binlog replication. Time-consuming DDL operations may cause the standby database to lag behind for a long time.

MySQL index creation flowchart

Index creation in cloud scenarios

As more and more users host their data in the cloud and the volume of user data continues to grow, the aforementioned dynamic indexing issues can be very disruptive to the user experience. At the same time, the single table data of customers gradually reaches several TB or even dozens of TB, and customers increasingly complain about the performance problems caused by the slow index creation. Especially, if the index creation cycle is too long, it may be difficult to find a suitable period of low business peak to dynamically create indexes and avoid business fluctuations. Therefore, in the cloud service environment, how to solve the performance problem of customer index creation based on a large amount of data becomes a challenge for cloud service vendors.

In the cloud scenario, there is another major scenario that is important to the customer experience. We know that customers need to migrate their services to the cloud on a large scale (Huawei cloud provides DRS tool for data replication service to support various data migration scenarios). The most efficient data migration methods are as follows:

1) Logically export the source data

SQL > alter table select * from user where id = 1;

3) Insert the exported data from the source end to the target end

4) Create secondary indexes for the tables at the target end

If dynamic data synchronization is involved, the steps are a little more complicated and will not be expanded here because they are irrelevant to this topic. In the previous steps, pay attention to step 2 and step 4. Do not create secondary indexes when creating tables on the target end. This optimization can have a significant impact on performance, especially if a table has many secondary indexes. We know that if the insertion of Btree index is ordered, it is the best for the insertion performance and the space utilization of the result, because the splitting of Btree index will occur at the end of the insertion region, and the page filling rate generated by splitting will be relatively high due to the optimization of the splitting algorithm. On the contrary, if the insertion is random, especially if it is concurrent, it is easy to cause the Btree index to be split in different nodes, and the page fill rate after splitting is half full, resulting in the Btree expansion.

With this background, we can easily understand the above problem. When inserting table data, we shield the secondary index. When all data is ready, we create the secondary index by using batch index creation method, which is the most efficient for secondary index creation. If you don’t do this, every time you insert a record, you insert the corresponding secondary index, which is an unordered random insert, and the performance will be much worse concurrently.

Creating secondary indexes in batches after data synchronization is ready is an effective solution. However, creating secondary indexes in batches takes a long time if a large amount of data is generated. As a result, customers need to wait for a long time after data migration to start services. While concurrent index creation at the table level can be considered, this approach has significant disadvantages: limited application scenarios require multiple tables; And that table to table concurrency is not the most efficient form of concurrency, because it affects each other a lot.

How does GaussDB(for MySQL) quickly create an index?

To sum up, there are two performance bottlenecks at the index creation point: one is the batch index creation after users migrate data; The second is that the user temporarily needs to add a secondary index. No matter which point, we need to establish a good index faster, improve the user experience.

Huawei cloud GaussDB(for MySQL) introduces the parallel index creation technology, which improves the index creation efficiency of MySQL community edition by using only a single thread. In addition, the two pain points are solved together. The index creation logic of community edition mentioned above is single thread, which has the problem of insufficient resource utilization. Secondly, the process of index creation is the process of alternating CPU and I/O costs. When performing an operation, even if it is not a resource competition operation, it only has to wait. Multi-thread index creation can make full use of CPU and I/O resources, while some threads do CPU calculations, other threads can concurrently do I/O operations.

The parallel index creation used by GaussDB(for MySQL) is a full-link parallel technique. As mentioned earlier, creating an index involves several phases, and our parallel creation algorithm does parallel processing for each of these phases, from reading data, sorting, to creating an index, all in parallel, and each step is handled concurrently by the specified N threads. Its logic is shown below:

In particular, GaussDB(for MySQL) makes a variety of optimizations for data merge sort, so that our normal merge sort can be fully parallel and make full use of CPU, memory, and IO resources. The merge step after parallel index creation also uses a simplified set of algorithms to correctly handle scenarios with various index structures.

Supported indexes and scenarios

The parallel index creation function of GaussDB(for MySQL) supports Btree secondary indexes. Virtual Index secondary indexes will be fully supported in the near future, while MySQL spatial Index and fulltext index are not covered by this parallel index creation.

Note that primary key index creation is not supported in parallel, so if a SQL statement contains primary key index creation, or the spatial index and fulltext index mentioned above, then the client will receive an alarm indicating that this operation does not support parallel index creation. At the same time, the statement is executed in a single-thread index creation mode.

From the perspective of SQL statements, as mentioned earlier, different algorithms can be used to create indexes, and since the COPY ALGORITHM (ALGORITHM=COPY) does not use bulk inserts, it does not benefit from this parallel index creation optimization. The INPLACE algorithm can benefit from this optimization if the index is created in a non-rebuild mode. Once the rebuild index is created, the parallel index creation algorithm cannot be used because it involves the establishment of the primary key index.

The sample

Let’s take a look at a few examples of how we can use the parallel index creation algorithm to speed up the creation and how our conditional constraints work.

1. We use sysbench tables with 100 million entries

2. It takes 146.82 seconds to build an index in the K field of the table using the default single-thread in the community

Innodb_rds_parallel_index_creation_threads = 4 To enable index building for 4 threads, you can see that index building takes 38.72 seconds, which is 3.79 times faster.

4. If you want to modify the primary key index, you will receive a warning that the index can only be built using a single thread

Matters needing attention

The innodb_rDS_PARALLEL_INDEx_CREation_threads parameter controls the total number of threads available for all parallel DDLS in the system. The value ranges from 1 to 128. The value of this parameter is 1, indicating that the original single thread is used to create the index. The value of this parameter is N, indicating that the subsequent DDL is created using N threads. If one DDL is executing with 100 threads, the other DDL should also use the parallel DDL and only use the remaining 28 threads. If all 128 threads are occupied by parallel DDL statements, the new DDL can only follow the logic created by the original single thread.

Although this parallel index creation speeds up index creation, careful evaluation is required in specific usage scenarios. We know that after the parallel algorithm is applied, the DDL uses as much hardware resources as possible, which means that other operations don’t get as many resources. Therefore, specific analysis is required for different scenarios, which determines how we create indexes.

For the migration scenario, since there is no service access at this time, the user wants to complete the creation of all indexes as soon as possible, so we can set as many threads as possible. For example, if we are a 16-core instance, we can set the number of parallel threads to 16 to speed up the completion of the operation.

If the index is to be created during the user business run, we still don’t want DDL operations to have too much impact on the ongoing business such as DML operations. Therefore, index creation at this time can specify a relatively small number of threads, such as 2-4 (or, depending on CPU specifications and load, multiple DDL operations concurrently are discouraged). This can not only speed up the process of index creation, but also ensure the normal progress of DML.

In summary, GaussDB(for MySQL) supports parallel index creation. By shortening the index creation time, it solves two types of customer concerns and improves customer experience. However, the technology is endless, and there are other issues that need to be optimized in the area of index creation, such as how to reduce the impact of index creation steps on IO and so on. We will optimize these points in the future to bring customers more surprises.

At present, huawei cloud GaussDB (for MySQL) parallel create index optimization function has been online, welcome everybody to huawei cloud website experience: www.huaweicloud.com/product/gau…

Huawei cloud GaussDB(for MySQL) kernel expert series of articles

How powerful is the vertical integration of GaussDB(for MySQL) cloud stack?

Huawei cloud database kernel expert reveals: How fast is GaussDB(for MySQL) parallel query?

Click to follow, the first time to learn about Huawei cloud fresh technology ~