When the CPU load of Mysql is too high, we will look at the slow SQL log during the failure, and then find out the full table scan, index is not reasonable, and the SQL function calculation is too much, let the development students optimize. If not, upgrade the CPU hardware, replace the CPU with a higher frequency, upgrade the CPU 1 to 2, upgrade the CPU 2 to 4.

The problem this time is related to daily business processing, so many measures cannot be put in place at the first time, such as hardware procurement, so we can only consider the database side first, locate the specific bottleneck, and see if there is room for parameter optimization.

It took a lot of detours, but we finally solved the problem. Let’s review the whole process:

One. Problems arise

Too many Threads Running alarms are generated in the main library of a database cluster during service peak hours in the evening, and applications frequently report slow response alarms. At this time, the processList list of non-sleep SQL has reached 1000+, the database can not process, the subsequent SQL all accumulated in it. One minute later, the SQL in the running state reached 2700+.

When the kill command is used to kill the SQL, the SYSTEM is in the KILLED state and the SQL process does not end.

The top command sees CPU pressure, but not full, iowait is not high.

Under free, we can see that there is 20G spare memory, and swap is almost useless.

Iostat indicates that disk reads and writes are not high and there is no pressure.

Ii. Problem handling

For the first time, we should think from three perspectives:

Slow SQL optimization

Took out the slow SQL logs during the failure and analyzed the TOP30 slow SQL logs. Some SQL does scan the entire table (there is not much data in the table), and many SQL have count, order by logic, which does cost CPU.

The SQL is not very slow to execute in isolation, at around 1,200 milliseconds, which is actually acceptable. Moreover, our development resources are relatively tight. If we want to optimize these business core SQL, it is impossible to finish in two or three weeks.

Because of this daily task, THE priority of SQL optimization is pushed back.

2. Reduce concurrency

Directly reducing the number of concurrent transactions seems like the simplest and most effective measure, and we did it by cutting the number of concurrent transactions in half. Maxactive is also halved in some application server JDBC configurations.

3. Modify database parameters

Change the innodb_flush_method parameter from the default fDATASync to O_DIRECT.

The goal is to flush write operations directly from InnoDB buffer to disk without passing them through the OS cache. This results in longer response times for insert,update, and delte operations, but reduces IO and CPU overhead, and reduces physical memory usage.

  • fdatasync

    By default, data is written to the OS buffer and flushed to disk by the operating system.

  • O_DIRECT

    Data files are written directly from the buffer pool to disk without being transferred through the OS buffer.

Problem recurrence

The preparation is complete, but when the next task runs, the database behaves the same way it did last time.

This time, however, pStack logs were typed before restarting the database.

4. Problem repositioning

1. System information

In order to further locate the problem, the SAR logs were analyzed this time:

A lot of context switching occurred during the failure:

The CPU response queue reached 100+ and the load average reached 100+ :

Runq-sz: Run queue length (number of tasks waiting for Run time).

At this point, you can be sure that the problem is at the CPU level.

2. Pstack log

Btr_search_info_update & btr_search_info_update_slow are found in the pStack logs:

The circled functions are used to build up the AHI.

Take a look at the btr_search_info_update call chain:

Btr_cur_search_to_nth_level – btr_search_info_update – btr_search_info_update_slow – btr_search_build_page_hash_index

In btr_search_info_update_slow, a decision is made based on the statistics, and btr_search_build_page_hash_index is called to add the current page record to the AHI hash table.

3. AHI adaptive hash index

Now that you’ve tracked down the AHI, let’s take a closer look at what AHI is.

The innoDB storage engine monitors queries to secondary index pages on tables. If it is observed that creating hash indexes improves performance, hash indexes are created in the buffer pool, called Adaptive Hash Indexes (AHI)

The AHI is constructed from the buffer pool’s B+ Tree, using the prefix of the index key to build the hash index, which can be of any length. Therefore, the build is fast and there is no need to build a hash index on the entire table. The InnoDB storage engine automatically hashes some of the hot pages based on the frequency and mode of access.

According to InnoDB’s documentation, with AHI enabled, read and write speeds are up to two times faster and secondary index join performance is up to five times faster.

Query statements using AHI have the following advantages:

  • It can directly locate the leaf node from the query condition to reduce the time needed for a location.
  • In the case of insufficient buffer pool, cache can be established only for hot data pages to avoid frequent LRU of data pages.

But AHI has its problems, too, with this caption:

With some workloads, the speedup from hash index lookups greatly outweighs the extra work to monitor index lookups and maintain the hash index structure. Access to the adaptive hash index can sometimes become a source of contention under heavy workloads, such as multiple concurrent joins.

In a multi-concurrent connection scenario, hash indexes are used more frequently than they are monitored and hash structures are maintained, resulting in resource competition and additional CPU consumption.

  • AHI monitoring: Includes times of AHI queries (adaptive_hash_searches) and times of using BREE queries (adaptive_hash_searches_btree)
  • AHI maintenance: AHI also maintains physical records when their location or block address changes. Btr_search_update_hash_ref () &btr_search_drop_page_hash_index () &buf_lru_drop_page_hash_FOR_TABLESPACE ()

4. View the 5.6.26 source code

We are currently using the 5.6.26 version of the database, take a look at the source of the version:

Rw_lock_s_lock (&btr_search_latch) where btr_search_latch is the global lock.

V. AHI problem solving

Mysql 5.6.26 uses the innodb_adaptive_hash_index parameter to switch AHI on and off.

The following parameter is ok. There are still two questions that need to be confirmed:

  1. Has the global lock been optimized?
  2. 5.6.26. Does this parameter work?

1. Optimization of AHI global locking

AHI’s use of global read/write locks causes resource contention, a bug that was fixed in 5.7.7: bugs.mysql.com/bug.ph…

Fix bug bugs.mysql.com/bug.ph… (innodb adaptive hash index mutex contention) by partitioning btr_search_latch into a latch array indexed by index id. Add system variable innodb_adaptive_hash_index_partitions to control the number of partitions.

The solution is to split the locks on AHI. It is designed in the form of component areas, each of which has an independent latch for protection. The partition size is set by the innodb_adaptive_hash_index_parts parameter. The default value is 8 and the maximum value is 512.

In this case, the high version has indeed been optimized.

2. Check whether the shutdown parameter is valid in version 5.6

We are currently using version 5.6.26. Will innodb_adaptive_hash_index be disabled but Btr_search_latch still exist?

bugs.mysql.com/bug.ph…

Many uses of btr_search_latch with innodb_adaptive_hash_index are OFF

This issue has been fixed, merge vers: 5.6.1-M4, so it is valid to turn innodb_adaptive_hash_index off under 5.6.26.

Look again at the source code to see the implementation of closing AHI:

Btr_search_disable frees buffer_pool memory used by AHI.

6. Summary

To summarize this problem, a large number of concurrent connections were encountered after AHI was enabled because AHI held a global lock, resulting in a large amount of lock spin in the database and consuming a large amount of CPU.

At present, the database has been closed AHI, the business has been running for two days, and everything is normal in the database. The database version will need to be upgraded later, and 5.6 is a bit behind.

When a database encounters a similar CPU problem, you can perform the following operations to help trace the root cause of the problem:

  • vmstat 1 1000
  • top -Hu mysql
  • perf top -a -g
  • show engine innodb status G
  • show processlist
  • Type PStack logs before the restart (Type pStack logs only before the restart. Do not type pStack logs at other times)

The full text.