Abstract: How to locate a Redis big key problem in a distributed cache service, with a practical case to help you master the optimization method.

[Background]

Some ECS applications report the error "OOM command not allowed when used memory > 'maxmemory'" and cannot write data to the database, so services are affected. When set t2 s2 is executed, the database returns the OOM error, as shown below:

Topology:

Environment information:

Redis 5.0 cluster, 4 GB memory

DCS network segment: 192.168.1.0/24

Shard 1: master 192.168.1.12, slave 192.168.1.37

Shard 2: master 192.168.1.10, slave 192.168.1.69

Shard 3: master 192.168.1.26, slave 192.168.1.134

[Answer]

[Detailed steps]

1. Check the monitoring

The Redis instance monitoring shows that the cluster memory usage is 46.97%, with no obvious anomaly, as shown in the following figure:

Check the per-node memory monitoring. In Shard 2, the memory usage of the master node 192.168.1.10 reaches 100%, while the memory usage of the other two shards is around 20%, as shown in the figure below:

2. Confirm the abnormal shard

According to the monitoring above, the memory usage of Shard 2 in the Redis cluster reaches 100%, and only that shard's memory is abnormal.
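
The same per-node check can be scripted against each master with INFO memory. Below is a minimal Python sketch using redis-py; the port and the absence of a password are assumptions based on the topology above, not part of the original procedure.

# check_shard_memory.py: compare used_memory with maxmemory on each shard master (assumed sketch)
import redis

MASTERS = ["192.168.1.12", "192.168.1.10", "192.168.1.26"]  # shard masters from the topology

for host in MASTERS:
    r = redis.Redis(host=host, port=6379, socket_timeout=3)  # add password=... if AUTH is enabled
    mem = r.info("memory")
    used, maxmem = mem["used_memory"], mem["maxmemory"]
    usage = used / maxmem * 100 if maxmem else 0.0
    print(f"{host}: used={used / 2**20:.1f} MiB, maxmemory={maxmem / 2**20:.1f} MiB, usage={usage:.1f}%")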

3. Big KEY analysis

Online analysis

1. Console analysis: Use the Cache Analysis > Big Key Analysis tool on the Huawei Cloud management console. After the analysis task finishes, view the results (the top 20 string keys and the top 80 list/set/zset/hash keys are kept). The result is as follows:

For detailed usage, refer to the following link: support.huaweicloud.com/usermanual-…

2. Command analysis: Run redis-cli -h <IP> -p <port> --bigkeys. The tool lists information about the largest key of each data type. The results are shown below:

As shown in the preceding figure, the big key of the string type is NC_filed / _PK, with a size of 13283 bytes. No big key is found for the list, set, hash, or zset types.
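
If neither the console tool nor --bigkeys is convenient, a similar check can be scripted with SCAN plus MEMORY USAGE (available since Redis 4.0). The following is a minimal Python/redis-py sketch run against one master; the host, port, and 10 KB threshold are illustrative assumptions.

# find_big_keys.py: report keys whose serialized size exceeds a threshold (assumed sketch)
import redis

THRESHOLD_BYTES = 10 * 1024  # flag keys larger than 10 KB

r = redis.Redis(host="192.168.1.10", port=6379, socket_timeout=3)  # the abnormal shard master

for key in r.scan_iter(count=1000):      # SCAN iterates incrementally and does not block like KEYS
    size = r.memory_usage(key) or 0      # MEMORY USAGE, Redis 4.0+
    if size >= THRESHOLD_BYTES:
        print(f"{r.type(key).decode()}\t{size}\t{key.decode(errors='replace')}")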

Offline analysis

For offline analysis, use the open-source rdb_bigkeys tool to analyze RDB files. Tool address: github.com/weiyanwei41…

Compilation method:

yum install git go -y

mkdir /home/gocode/

cd /home/gocode/

git clone github.com/weiyanwei41…

cd rdb_bigkeys

go build

The rdb_bigkeys executable file is generated. Usage:

./rdb_bigkeys -bytes 1024 -file bigkeys.csv -sorted -threads 4 /home/redis/dump.rdb

Parameter Description:

-bytes 1024: filters keys larger than 1024 bytes

-file bigkeys.csv: saves the result to bigkeys.csv

-sorted: sorts the results from largest to smallest

-threads 4: the number of threads to use

/home/redis/dump.rdb: the path of the RDB file to analyze

The generated file style is as follows:

The columns are the database number, key type, key name, key size, number of elements, largest element name, element size, and key expiration time. Reference: www.cnblogs.com/yqzc/p/1242…
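
To post-process the report, the CSV can be loaded with a few lines of Python. The column order below follows the description above; whether the file contains a header row is an assumption, so the sketch simply skips non-numeric rows.

# top_bigkeys.py: print the largest keys from the rdb_bigkeys CSV report (assumed column order)
import csv

COLUMNS = ["db", "type", "key", "size", "elements", "max_element", "element_size", "expiry"]

rows = []
with open("bigkeys.csv", newline="") as f:
    for row in csv.reader(f):
        if len(row) >= len(COLUMNS) and row[3].isdigit():  # skip header or malformed rows
            rows.append(dict(zip(COLUMNS, row)))

rows.sort(key=lambda r: int(r["size"]), reverse=True)  # largest keys first
for r in rows[:20]:
    print(f'{r["type"]:<6} {r["size"]:>10}  {r["key"]}')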

4. Solutions

The root cause of this OOM problem is that a big key makes the data distribution across shards uneven. Once a shard reaches maxmemory, any write routed to that shard fails with OOM. Export a copy of that shard's RDB file for later optimization of the big key.

Interim plan:

To restore services as soon as possible, delete the big key found in the preceding steps. (If the big key is not a string, do not delete it with DEL; use HSCAN, SSCAN, or ZSCAN to remove its elements gradually, as in the sketch below.)
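
A minimal Python/redis-py sketch of this progressive deletion for a hash key is shown below; sets and zsets are handled analogously with sscan_iter/srem and zscan_iter/zrem. The host, key name, and batch size are placeholders, and for a string key a plain DEL (or UNLINK on Redis 4.0+) is enough.

# delete_big_hash.py: delete a big hash in batches instead of one blocking DEL (assumed sketch)
import redis

r = redis.Redis(host="192.168.1.10", port=6379)  # master of the shard holding the key
KEY = "big_hash_key"                             # placeholder: the big key found above
BATCH = 500

fields = []
for field, _ in r.hscan_iter(KEY, count=BATCH):  # incremental HSCAN does not block the server
    fields.append(field)
    if len(fields) >= BATCH:
        r.hdel(KEY, *fields)                     # drop this batch of fields
        fields = []
if fields:
    r.hdel(KEY, *fields)
r.unlink(KEY)                                    # UNLINK (Redis 4.0+) frees the remainder asynchronously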

Long-term plan:

Split the large key into multiple smaller keys (value1, value2, ..., valueN) so that the data is hashed to different shards, which avoids the uneven data distribution caused by data skew.

Other types of data can be split and reassembled in the same way to avoid the impact of large keys.
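
As an illustration, a big hash can be split into N sub-keys by hashing each field, so that the sub-keys map to different cluster slots. This is a minimal sketch assuming redis-py 4.x (for RedisCluster) and an arbitrary split factor of 16; the original article does not prescribe a specific splitting scheme.

# split_big_hash.py: spread one big hash across N smaller sub-keys on different slots (assumed sketch)
import zlib
from redis.cluster import RedisCluster

N = 16                                    # number of sub-keys to split into
BIG_KEY = "big_hash_key"                  # placeholder: the original big key
rc = RedisCluster(host="192.168.1.10", port=6379)  # any cluster node; the client routes commands

def sub_key(field: bytes) -> str:
    # Route each field to one of N smaller hashes; crc32 gives a stable, even distribution.
    return f"{BIG_KEY}:{zlib.crc32(field) % N}"

# Copy every field of the big hash into its sub-key; reads afterwards target the same sub-key.
for field, value in rc.hscan_iter(BIG_KEY, count=500):
    rc.hset(sub_key(field), field, value)

def hget_split(field: bytes):
    return rc.hget(sub_key(field), field)

After the copy is verified, the original big key can be removed with the batched deletion shown in the interim plan.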

5. Result verification

The shard monitoring shows that the memory usage of 192.168.1.10 drops to 24%, as shown in the following figure:

Running set t2 s2 succeeds again. Log in to the cluster and run a GET; normal data is returned, as shown below.

[Optimization and Suggestions]

  1. Configure node-level memory usage alarms. If a node holds a big key, its memory usage will be higher than that of the other nodes, and the alarm helps you discover the potential big key.

  2. Configure node-level alarms for maximum inbound bandwidth, maximum outbound bandwidth, and CPU usage. If a node holds a hot key, its bandwidth and CPU usage will be higher than those of the other nodes, and the alarms help you discover potential hot keys.

  3. Keep string values under 10 KB, and keep the number of elements in hash, list, set, and zset keys under 5000.

  4. Periodically run the big key and hot key analysis tools to check whether big keys exist in the cluster and identify risks as early as possible.

Click to follow and be the first to learn about Huawei Cloud's latest technologies.
