This article records the online troubleshooting of a problem in our DPDK-LVS cluster, for readers’ reference.



Background

DPDK-LVS, our in-house high-performance load balancer built on DPDK, has been deployed in several data centers and has been running normally. Recently, however, several finance-related services reported that after their packets were forwarded by DPDK-LVS, the service would hang.

The problems

1. DPDK-LVS has been live in several data centers for more than half a year. Why did services suddenly start reporting problems?

2. Why are most of the affected services in the financial zone (which, due to its sensitivity, applies extra security-hardening policies)?

3. Why do the problems manifest as service hangs?

Troubleshooting

At first we suspected DPDK-LVS itself or some finance-zone security policy, so we ran the following tests (running the same test code on the back end to simulate the server-side logic):

1. Client <—> DPDK-LVS <—> RS (financial zone): abnormal

2. Client <—> DPDK-LVS <—> RS (non-financial zone): normal

3. Client <—> LVS <—> RS (financial zone): normal

4. Client <—> LVS <—> RS (non-financial zone): normal

Comparing tests 1 and 2 suggests the problem is related to the financial zone, and that DPDK-LVS forwarding works.

Comparing tests 3 and 4 suggests the problem has nothing to do with the financial zone, and that kernel LVS forwarding works.

Comparing tests 1 and 3 suggests the problem is related to DPDK-LVS: requests through DPDK-LVS are abnormal.

Comparing tests 2 and 4 suggests the problem has nothing to do with DPDK-LVS or LVS: requests through both are normal.

These four conclusions contradict one another in pairs, so we could not determine whether the problem was related to DPDK-LVS or to the financial zone. The investigation was deadlocked: the fault point could not be located.

To investigate further, we captured packets on the client and on the back-end RS. All requests from the client reached the RS normally, and most of the RS's response data returned to the client normally, but certain specific packets were retransmitted over and over until they timed out. Below is the packet capture screenshot:

10.128.129.14 is the IP of the RS; 10.115.167.0/24 is the local IP range of DPDK-LVS. The capture on the RS clearly shows that the 184-byte packet sent by the RS to DPDK-LVS is transmitted correctly, while the 2-byte packet is retransmitted continuously and keeps failing until timeout. Meanwhile, the capture on the client shows that the client does receive the 2-byte packet but discards it because of a TCP checksum error, so it never reaches the upper-layer application. This explains why the fault shows up as a hang: the packet is retransmitted until timeout.

Modern NICs generally provide checksum offload, computing the checksum in hardware on our behalf. A fault in the NIC's offload function would affect every packet passing through the NIC, not just one particular packet, so we suspected the NIC was miscalculating the checksum only for this specific packet. We therefore captured and analyzed packets on DPDK-LVS itself. Below are the capture screenshots:

DPDK-LVS received the packet and its processing logic was completely normal; all that remained was for the NIC to compute the checksum and send the packet out. The original size of this packet is Ethernet header + IP header + TCP header + TCP data = 14 + 20 + 20 + 5 = 59 bytes. We know the minimum frame length on an Ethernet wire is 64 bytes; excluding the 4-byte FCS (which the NIC itself appends to the end of the frame), the minimum is 60 bytes. In other words, if a packet handed to the NIC is shorter than 60 bytes, the NIC pads the end of the packet with zeros to bring it up to 60 bytes. This packet therefore needs 1 byte of hardware padding to reach the minimum transmit length (a sketch of this arithmetic follows the list below). RFC 894 states: “If necessary, the data field should be padded (with octets of zero) to meet the Ethernet minimum frame size.”

So when a packet is shorter than 60 bytes, the RS NIC needs to do two things:

  • add padding so the frame reaches the 60-byte minimum (1 byte in this case);
  • fill that padding entirely with zeros.
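
As a quick check of the arithmetic above, the required padding can be computed as follows (a minimal sketch of ours; the constant and function names are illustrative, not from the DPDK-LVS code):

#include <stddef.h>

#define ETH_MIN_FRAME_NO_FCS 60   /* 64-byte minimum frame minus the 4-byte FCS */

/* Number of zero bytes the NIC must append for the frame to reach the
 * minimum transmit length. */
static size_t eth_padding_len(size_t frame_len)
{
    return frame_len < ETH_MIN_FRAME_NO_FCS
            ? ETH_MIN_FRAME_NO_FCS - frame_len : 0;
}

/* For the packet above: 14 + 20 + 20 + 5 = 59 bytes -> eth_padding_len(59) == 1. */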

As the capture shows, there is indeed 1 byte of padding at the end of the layer-2 frame: 0xEC. The padding is a non-zero value, not the all-zero value RFC 894 requires, and the DPDK-LVS NIC included this padding byte in its checksum offload calculation. As a result, the checksum written into the TCP header no longer matches the checksum the client computes over the pseudo IP header and TCP segment alone, so the client drops the packet instead of handing it to the upper-layer application.
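
For reference, the TCP checksum is a 16-bit one's-complement sum over the pseudo header and the TCP segment. The sketch below (our own illustration, not DPDK-LVS code) shows why including even a single extra non-zero byte such as 0xEC in the summed region changes the result:

#include <stdint.h>
#include <stddef.h>

/* 16-bit one's-complement sum as used by the TCP checksum. If the summed
 * region wrongly includes a non-zero padding byte, the sender's checksum
 * no longer matches the receiver's, which sums only the pseudo header
 * plus the TCP segment. */
static uint16_t ones_complement_csum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len)                       /* odd trailing byte, virtually zero-padded */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)              /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}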

(NIC manual, TCP/UDP checksum section)

At this point the cause of the problem is clear: when the NICs of certain machines add padding, they fill it with arbitrary values instead of all zeros as RFC 894 requires, and those padding bytes then take part in the checksum offload calculation done by the DPDK-LVS NIC.
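
For context, a DPDK application hands checksum work to the NIC on a per-packet basis through mbuf offload flags, roughly as sketched below. This is our own minimal illustration of the DPDK mbuf API, not code from DPDK-LVS (names follow older DPDK releases, contemporary with this work; newer releases rename them rte_ether_hdr/rte_ipv4_hdr and RTE_MBUF_F_TX_*):

#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_tcp.h>

/* Ask the NIC to fill in the IPv4 and TCP checksums when this mbuf is
 * transmitted; l2_len/l3_len tell the hardware where each header starts. */
static void request_tx_csum_offload(struct rte_mbuf *m,
                                    struct ipv4_hdr *ip,
                                    struct tcp_hdr *tcp)
{
    m->l2_len = sizeof(struct ether_hdr);   /* 14-byte Ethernet header */
    m->l3_len = sizeof(struct ipv4_hdr);    /* 20-byte IPv4 header, no options */
    m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;

    ip->hdr_checksum = 0;
    /* DPDK expects the TCP checksum field pre-filled with the
     * pseudo-header checksum when TCP checksum offload is used. */
    tcp->cksum = rte_ipv4_phdr_cksum(ip, m->ol_flags);
}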

Comparing the NIC hardware of the normal and abnormal RS machines, we found the same hardware model and the same driver version, but the firmware of the abnormal NICs differed from that of the normal ones, and the firmware could not be upgraded or downgraded by ourselves.

The whole fault sequence can be roughly depicted as:

client                 dpdk-lvs                 rs
  ----------------------- 1 --------------------->
                           <---------- 2 ---------
  <---------- 3 ----------

Step 1: the client sends its request; these packets are normal.

Step 2: some of the RS's response packets are initially shorter than 60 bytes, so the RS NIC must add padding. It computes the checksum and fills in the TCP header first, and only then appends the (non-zero) padding to the end of the packet, so the packet leaves the RS with a correct checksum.

Step 3: DPDK-LVS receives the packet from step 2 and hands it to its own NIC for checksum calculation and forwarding. Because the padding is not all zeros, the newly computed checksum is wrong, and the client discards the packet.

PS: the scenario above is the RS NIC filling padding with non-zero bytes; the other possible scenario is the client NIC doing the same on request packets. Both lead to the problem described above.

Problem solving

At this point we can answer the three questions raised at the beginning:

1. DPDK-LVS has been live in several data centers for more than half a year. Why did services suddenly start reporting problems? A: For this service, the problem appeared right after DPDK-LVS went live in one of its core data centers. DPDK-LVS had long been live in the other data centers, but those served as backups and were not actually taking traffic, so no problem surfaced for more than half a year.

2. Why are most of the affected services in the financial zone (which, due to its sensitivity, applies extra security-hardening policies)? A: The investigation found that the firmware bug existed on a batch of machines in the financial zone; it has nothing to do with the financial zone's own security policies.

3. Why do the problems manifest as service hangs? A: In essence, packets are lost while the service keeps waiting for a response, so the failure shows up as a hang.

We’ll solve the problem next:

If DPDK-LVS removes the padding before the packet reaches its own NIC, the padding will be regenerated by the DPDK-LVS NIC itself (after the checksum has been computed), so the checksum will be correct. Since DPDK-LVS is built on DPDK, packets are stored and processed as DPDK mbuf structures. The mbuf layout is as follows:

The frame data lives between the headroom and the tailroom (similar to an skb), with pkt_len = data_len = the length of the entire frame. What we need to do is trim the padding out of the data area (handing it back to the tailroom), so we add the following code on the packet receive path:

/* Padding length: bytes in the mbuf beyond the Ethernet header plus the IP total length. */
int padding_length = mbuf->data_len - (mbuf->l2_len + rte_be_to_cpu_16(ipv4_hdr->total_length));
/* Shrink the data area so the padding falls back into the tailroom. */
mbuf->data_len = mbuf->data_len - padding_length;
mbuf->pkt_len = mbuf->data_len;
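
The same trim can also be expressed with the standard mbuf helper rte_pktmbuf_trim(), which shrinks data_len and pkt_len in one call and lets us guard against packets that carry no padding. This is a sketch of ours, not the actual DPDK-LVS patch; it assumes mbuf->l2_len is already set and the packet is IPv4 (newer DPDK renames the header struct rte_ipv4_hdr):

#include <rte_byteorder.h>
#include <rte_ip.h>
#include <rte_mbuf.h>

/* Trim NIC padding so the TX checksum offload covers only real data.
 * Assumes an IPv4 packet whose mbuf->l2_len has already been set. */
static void trim_eth_padding(struct rte_mbuf *mbuf)
{
    struct ipv4_hdr *ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf,
            struct ipv4_hdr *, mbuf->l2_len);
    uint16_t real_len = mbuf->l2_len +
            rte_be_to_cpu_16(ipv4_hdr->total_length);

    if (mbuf->data_len > real_len)
        rte_pktmbuf_trim(mbuf, mbuf->data_len - real_len);
}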

After adding the above code, the test passes and the fault is resolved.

References

https://tools.ietf.org/html/rfc894

http://doc.dpdk.org/guides/prog_guide/mbuf_lib.html

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82599-10-gbe-controller-datasheet.pdf


This article was first published on the WeChat public account “Mi Operation and Maintenance”.