Abstract: This article introduces several common methods for investigating possible database network faults so that they can be located and resolved quickly.

1 Background

Among the various GaussDB fault scenarios, network faults are particularly difficult to locate and rectify. They may degrade database performance or even block services, with serious consequences. A network fault can involve the application side (GaussDB), the operating system (OS), the switch, and the underlying hardware. This document describes common methods for quickly locating and rectifying possible faults. For details about the parameters and views mentioned here, see the product documentation.

2 Symptom

Figure 1. gsar script running results

If performance is slow or database connections are abnormal, you are advised to run the gsar script to check the network status. If the retransmission rate or packet loss rate exceeds 0.01% (the red box in the last column of Figure 1), the network is faulty and needs further analysis and locating.
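If the gsar script is not at hand, a rough equivalent check can be made with standard tools (a sketch; ethx is a placeholder interface name):

netstat -s | grep -i retrans    # cumulative TCP retransmission counters
ip -s link show ethx            # per-interface RX/TX errors and drops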

3 Troubleshooting 1: NIC hardening on TaiShan servers

TaiShan servers (100/200) require compatible network cards and drivers; otherwise, network problems of this kind are likely to occur.

Strictly follow the hardening configuration guide when checking the server, including verifying the transparent huge page settings.
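For example, the transparent huge page setting mentioned in the hardening guide can be checked as follows (a sketch for a typical Linux kernel; the guide itself is the authoritative reference):

cat /sys/kernel/mm/transparent_hugepage/enabled   # the bracketed value is the active setting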

4 Troubleshooting 2: MTU consistency

The MTU (maximum transmission unit) must be consistent along the entire data link; otherwise, packets may be dropped because of size mismatches. You can run the ifconfig command to view and modify the MTU of each network adapter.

Figure 2. Modifying the MTU with ifconfig

As shown in Figure 2, the drawback of this approach is that the setting is lost after a restart. To make it persistent, modify the network configuration file; the exact file and procedure differ between operating systems, so consult the documentation for your distribution.
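For reference, a temporary check and change might look like this (a sketch; eth0 and 1500 are placeholder values):

ifconfig eth0              # the current MTU is shown in the output
ifconfig eth0 mtu 1500     # temporary change, lost after a reboot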

5 Troubleshooting 3: Network retransmission

1. View the retransmission count with netstat

After observing retransmissions with the gsar script, run the netstat command to view the retransmission status of individual connections.

Figure 3. Viewing the retransmission status with netstat

If the retransmission count has reached 12 (the red box in Figure 3; in the timer field, the first value is the time until the next retransmission and the second is the number of retransmissions that have already occurred, which adds up to roughly 9 minutes, after which keepalive should in theory detect the abnormal connection and close it), the connection is effectively unreachable. In that case, further check the process status on both ends and the network environment (for example, with ping).
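The same connection-level information can be listed directly (a sketch, assuming the net-tools netstat; the peer IP is taken from the later example):

netstat -anot | grep 192.168.2.101   # timer column: on (time_to_next_retransmit/retransmissions/keepalive_probes)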

2. View the send buffer status with netstat

When the send buffer is severely backlogged, retransmissions are clearly visible. Run the netstat command to check the buffer status:

Figure 4. Viewing the send buffer status with netstat
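To list only the backlogged connections, something like the following can be used (a sketch, assuming the net-tools netstat; Recv-Q and Send-Q are columns 2 and 3):

netstat -ant | awk 'NR > 2 && ($2 > 0 || $3 > 0)'   # connections with data stuck in a buffer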

The red box in Figure 4 shows the send buffer status on the sender; the backlog is severe and the receiver is 192.168.2.101. In this case, you can check the receiving status on the peer end according to the port number:

Figure 5. Viewing the peer receive buffer status with netstat

The red box in Figure 5 shows the receive buffer on port 44112 of the receiving end, where the backlog is also obvious. In this case, you can obtain the status of each thread from the related GaussDB views to analyze the cause of the blockage. The following uses a blocked connection as an example:

Figure 6. Finding the query_id on the DN based on client_port

Log in to the database through the port of the corresponding GaussDB node and query the query_id using the peer connection port number.
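A minimal sketch of this lookup, assuming the pg_stat_activity view (which exposes client_port and query_id in GaussDB) and a placeholder DN port:

gsql -d postgres -p <dn_port> -c "SELECT query_id, pid, state FROM pg_stat_activity WHERE client_port = 44112;"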

Figure 7. Querying the status of each thread on the CN based on query_id

Log in to the GaussDB CN node and query the CN thread ID based on the query_id. In this example the DN is sending data to the CN, so you can use gstack to print the CN stack:

1. Print the thread stack: gstack lwtid

2. Monitor the thread's interaction with the kernel: strace -p lwtid -tt -T -o strace.log

3. View the CPU usage of the thread: top -p pid -d 0.2

3. A known statement executes slowly

Some statements execute slowly, and printing the execution plan shows where most of the time is spent. In this case, you can check the CN and DN status based on the SQL statement being executed, locate the node and thread ID where the slowdown occurs, and print the stack for further analysis.

Figure 8. Querying the CN thread status based on the SQL statement
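A possible way to go from the slow SQL to the blocked thread, sketched under the assumption that the pg_stat_activity and pgxc_thread_wait_status views are available and that <cn_port>, <keyword>, and <query_id> are placeholders:

# find the query_id of the running statement
gsql -d postgres -p <cn_port> -c "SELECT query_id, query FROM pg_stat_activity WHERE query LIKE '%<keyword>%';"
# list the node, lwtid, and wait status of the threads serving that query
gsql -d postgres -p <cn_port> -c "SELECT node_name, lwtid, wait_status FROM pgxc_thread_wait_status WHERE query_id = <query_id>;"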

6 Troubleshooting 4: Network packet loss

1. Insufficient memory

Insufficient memory is one of the main causes of packet loss, and it is usually accompanied by other visible symptoms. You can check memory usage with commands such as free and top, or use the pv_total_memory_detail view to see how memory is being used by specific processes.
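A quick way to check, as a sketch (the gsql connection parameters are placeholders; pv_total_memory_detail is the view mentioned above):

free -g        # overall memory usage
top            # per-process memory and CPU usage
gsql -d postgres -p <port> -c "SELECT * FROM pv_total_memory_detail;"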

2. Insufficient CPU for soft interrupts

After the NIC receives data, the CPU handles a soft interrupt to move the data into the TCP buffer. If the CPU is busy and soft-interrupt usage is already high, the data is not processed in time and packets are dropped.
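Soft-interrupt pressure can be observed with standard tools (a sketch; mpstat comes from the sysstat package):

mpstat -P ALL 1               # the %soft column shows per-CPU soft-interrupt usage
cat /proc/net/softnet_stat    # the 2nd column counts packets dropped because the backlog was full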

Figure 9. speed_test pressure test: receiver

Figure 10. speed_test pressure test: sender

Figure 11. Network status during the speed_test pressure test

Figure 12. CPU soft-interrupt status during the speed_test pressure test

The speed_test tool was used to apply pressure and observe the behavior, with two machines acting as the receiver and the sender respectively, as shown in Figures 9-12. With no background load on the test cluster, the network traffic reached the upper limit of the NIC and occasional packet loss could be seen.

In addition, soft interrupts are related to I/O. You can run the iostat command to check the I/O status at the corresponding time. In some scenarios, binding the NIC interrupts and the service to separate cores can alleviate the problem. Run the get_irq_affinity2.sh script to check the current NIC core binding.

Figure 13. Viewing the NIC core binding status
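If the helper scripts are unavailable, the same information can be read from standard kernel interfaces (a sketch; ethx and <irq_number> are placeholders):

iostat -x 1                               # I/O load at the corresponding time
grep ethx /proc/interrupts                # IRQ numbers used by the NIC queues
cat /proc/irq/<irq_number>/smp_affinity   # CPU mask the IRQ is currently bound to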

Bind the NIC interrupts to cores using smart_irq_affi.sh:

Figure 14. Binding the NIC interrupts to cores

Bind GaussDB to cores using gs_cgroup:

Figure 15. Binding GaussDB to cores with gs_cgroup

7 Troubleshooting 5: Switch

The switch is an important part of the entire data transmission link. Contact the relevant experts to check its topology, flow control, and interface bandwidth.

8 Common Commands

1. Network pressure test tool: speed_test/iperf

./speed_test_xxx recv/send ip port

iperf -s / iperf -c ip -t time -P thread_num
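For example, a quick throughput test between the two hosts used above might look like this (a sketch; adjust the IP, duration, and stream count to your environment):

iperf -s                              # on the receiver
iperf -c 192.168.2.101 -t 30 -P 4     # on the sender: run for 30 seconds with 4 parallel streams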

2. NIC tool: ethtool

ethtool ethx // speed

ethtool -i ethx // driver

ethtool -k ethx // gro gso tso

ethtool -l ethx // channel

ethtool -S ethx // statistics

3. Packet capture tool: tcpdump

tcpdump -i ethx -w target.pcap tcp and host ip1 and host ip2 and port port1

9 Summary

Because data transmission links are complex, retransmission and packet loss problems are difficult to locate. However, by mastering the right tools and methods and keeping a clear troubleshooting approach, you can work from the symptom back to the root cause.

Attachments: GaussDB A Hardening Configuration Guide 04.pdf, Script tool.rar

This article is shared from the Huawei Cloud community post "Locating GaussDB Network Retransmission/Packet Loss Problems", written by Caesar.
