The author, Zhu Yujian, is a backend development engineer at Tencent Cloud. He is familiar with CNI container networking technologies, is responsible for building TKE's container network and for developing and maintaining the related network components, and is the main developer of TKE's next-generation container network solution.

1. Background

1.1 Problem Description

An internal customer reported that in a TKE cluster using the VPC-CNI independent network interface card (NIC) mode, Pods could not ping any other Pod or node. The node kernel logs showed the following error:

Neighbour: arp_cache: neighbor table overflow! (The figure below is a screenshot from a later reproduction of the issue.)

In addition, the cluster is large, with about 1,000 nodes and 30,000 Pods, so it was reasonable to suspect that the cluster's scale produced too many ARP entries, causing the ARP cache to overflow.

1.2 Glossary

TKE: Tencent Kubernetes Engine, Tencent Cloud's container service. Based on native Kubernetes, it provides a container-centric, highly scalable, high-performance container management service.
VPC-CNI mode: the container network capability that TKE provides based on CNI and VPC elastic NICs.
Pod: the basic resource management unit of Kubernetes. A Pod has an independent network namespace, and one Pod can contain multiple containers.

2. Preliminary analysis of the problem

From the error message above, the root cause of this problem is that the ARP cache table is full. This involves the kernel's ARP cache garbage collection mechanism: when there are too many ARP entries and none of them can be reclaimed, new entries cannot be inserted.

As a result, when a network packet is to be sent, the corresponding hardware (MAC) address cannot be resolved, and the packet cannot be sent.

So under exactly what conditions can a new entry not be inserted? To answer this question, we need to take a closer look at ARP cache aging and garbage collection.

3. ARP cache aging mechanism

3.1 ARP Cache Entry state machine

The figure below shows the entire life cycle of an ARP entry and its state machine.

As we know, when the network stack sends a TCP/IP packet, it needs the peer's MAC address so that the packet can be encapsulated into a layer-2 data structure, a frame, for transmission across the network. For a destination IP in a different broadcast domain, the peer MAC address is the gateway's, and the sender hands the packet to the gateway for forwarding. For a destination IP in the same broadcast domain, the peer MAC address is the one corresponding to that IP.

Finding the MAC address for an IP address is ARP's main job; the ARP protocol exchange itself is not described here. After ARP resolves the MAC address corresponding to an IP, the mapping is cached locally for a period of time to reduce ARP traffic and speed up packet transmission. This cached mapping is an ARP cache entry, and its life cycle is as follows:

  1. Initially, when a packet is to be sent, the kernel protocol stack looks up the peer MAC address for the destination IP. If no match is found in the ARP cache, an entry in the Incomplete state is inserted. In the Incomplete state, the kernel sends ARP requests to resolve the MAC address corresponding to the IP.
  2. If an ARP reply is received, the entry moves to the Reachable state.
  3. If no reply is received after a certain number of attempts, the entry moves to the Failed state.
  4. When a Reachable entry exceeds its timeout, it moves to the Stale state. A Stale entry is no longer considered valid.
  5. If a Stale entry is referenced to send a packet, it moves to the Delay state.
  6. An entry in the Delay state is not used to send packets either. However, if the entry receives a local confirmation (explained below) before the Delay timer expires, it moves back to the Reachable state.
  7. When the Delay timer expires, the entry moves to the Probe state, which behaves much like Incomplete: the kernel sends ARP requests to re-verify the neighbor.
  8. When a Stale entry exceeds its expiration time, it becomes eligible for garbage collection and is deleted.

You can run the following command to view the ARP entries in the current network namespace and their states:

ip neigh

For example:
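The listing below is illustrative (all addresses and devices are made up for demonstration); each line shows the IP address, the device, the MAC address (lladdr), and the entry state:

10.1.0.2 dev eth0 lladdr 52:54:00:6b:4c:f8 REACHABLE
10.1.0.3 dev eth0 lladdr 52:54:00:a1:22:87 STALE
10.1.0.1 dev eth0 lladdr 52:54:00:13:ab:cd DELAY
10.1.0.9 dev eth0 INCOMPLETE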

Local confirmation: when the local machine receives a network packet whose source MAC matches an entry, that packet proves that the machine with this MAC address was the last hop of the communication, i.e., the MAC is reachable. The kernel can therefore move the entry back to the Reachable state without sending ARP requests. Through this mechanism, the kernel reduces the amount of ARP traffic it needs to generate.
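To watch the aging happen, one can resolve a neighbor and check the entry's state before and after the Reachable timeout. A quick sketch (the peer IP 10.1.0.2 is illustrative):

ping -c 1 10.1.0.2         # resolves the neighbor; the new entry becomes REACHABLE
ip neigh | grep 10.1.0.2   # entry shows as REACHABLE
sleep 60                   # wait past the randomized Reachable timeout, sending no traffic
ip neigh | grep 10.1.0.2   # the entry has typically aged to STALE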

3.2 Kernel parameters involved

The main kernel parameters involved in this mechanism are listed below:

/proc/sys/net/ipv4/neigh/default/base_reachable_time: base expiration time of the Reachable state; each entry's actual timeout is randomized within [1/2 x base_reachable_time, 3/2 x base_reachable_time]. Default: 30 seconds.
/proc/sys/net/ipv4/neigh/default/base_reachable_time_ms: the same base expiration time of the Reachable state, expressed in milliseconds. Default: 30000 (30 seconds).
/proc/sys/net/ipv4/neigh/default/gc_stale_time: expiration time of the Stale state. Default: 60 seconds.
/proc/sys/net/ipv4/neigh/default/delay_first_probe_time: how long an entry stays in the Delay state before moving to Probe. Default: 5 seconds.
/proc/sys/net/ipv4/neigh/default/gc_interval: interval at which periodic garbage collection runs. Default: 30 seconds.
/proc/sys/net/ipv4/neigh/default/gc_thresh1: below this number of entries, garbage collection does not run. Default: 2048.
/proc/sys/net/ipv4/neigh/default/gc_thresh2: soft limit on the number of entries in the ARP table; the table may exceed this value for at most 5 seconds before collection is forced. Default: 4096.
/proc/sys/net/ipv4/neigh/default/gc_thresh3: hard limit on the number of entries in the ARP table; once exceeded, garbage collection starts immediately and reclaims forcibly. Default: 8192.
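On a node, the current values can be checked by reading these files directly, for example with a small loop like the one below (per-interface values live under the interface's own directory, e.g. /proc/sys/net/ipv4/neigh/eth0/):

# print this machine's GC-related neighbor-table parameters
for p in gc_thresh1 gc_thresh2 gc_thresh3 gc_interval gc_stale_time; do
    echo "$p = $(cat /proc/sys/net/ipv4/neigh/default/$p)"
done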

The GC-related kernel parameters take effect globally, for all NICs. The expiration-time settings, by contrast, take effect per NIC, and the values under default apply only to newly added interface devices.

3.3 ARP Cache Garbage Collection Mechanism

As the entry state machine shows, not all entries are reclaimed: only entries in the Stale or Failed state can be. Moreover, ARP cache garbage collection is triggered rather than continuous, so reclaimable entries are not necessarily reclaimed immediately. Garbage collection has four trigger conditions:

  1. If the number of ARP entries is less than gc_thresh1, garbage collection does not run.
  2. If gc_thresh1 <= number of ARP entries <= gc_thresh2, garbage collection runs periodically, every gc_interval.
  3. If gc_thresh2 < number of ARP entries <= gc_thresh3, garbage collection still runs periodically, and is additionally forced when a new entry must be inserted and the table has not been scanned within the last 5 seconds.
  4. If the number of ARP entries is greater than gc_thresh3, garbage collection starts immediately.

Unreclaimable entries are not collected even when garbage collection runs. Therefore, once the number of unreclaimable entries exceeds gc_thresh3, garbage collection can do nothing about it.
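As a side note, the kernel exposes neighbor-table statistics in /proc/net/stat/arp_cache, including counters for periodic and forced GC runs (values are hexadecimal, one row per CPU). Watching this file while the table fills up is one way to confirm that garbage collection is running but freeing nothing:

cat /proc/net/stat/arp_cache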

4. Explore further

4.1 Does the garbage collection threshold take effect per network namespace or per machine?

We know that each network namespace has its own complete network protocol stack. So, is ARP cache garbage collection handled separately for each namespace?

From the kernel parameters discussed above, the GC-related parameters apply to all interface devices, so we can hypothesize that the garbage collection thresholds also take effect at the machine level rather than per network namespace.

Here's a simple experiment to verify this:

  1. In the node's default network namespace, set gc_thresh1, gc_thresh2, and gc_thresh3 to 60.
  2. Create 19 Pods in independent NIC mode on the node.
  3. Pick any Pod and ping the other Pods to generate ARP cache entries.
  4. Use a shell script to scan all Pods on the node and sum their ARP entries (a sketch of such a script follows this list). The result:
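A minimal sketch of such a counting script, assuming every Pod's network namespace is visible to ip netns (CNI plugins do not always bind-mount namespaces under /var/run/netns; if not, you would enter them through the container runtime or /proc/<pid>/ns/net instead):

# sum the ARP entries of every named network namespace on this node
total=0
for ns in $(ip netns list | awk '{print $1}'); do
    count=$(ip netns exec "$ns" ip neigh | wc -l)
    total=$((total + count))
done
echo "total ARP entries across namespaces: $total"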

As the figure shows, the cumulative number of ARP entries across namespaces drops rapidly after reaching 60, which is when garbage collection runs. Repeating the experiment yields similar results. This indicates that when evaluating the trigger thresholds, garbage collection counts the cumulative ARP entries of all namespaces; in other words, the thresholds take effect at the machine level, not per namespace.

4.2 What happens when unreclaimable ARP entries reach gc_thresh3

As introduced earlier, garbage collection does not reclaim Reachable entries. So what happens when Reachable entries fill the ARP cache table, i.e., reach gc_thresh3? Presumably, old entries cannot be reclaimed, new entries cannot be inserted, and new packets cannot be sent; that is, the problem described in this article occurs.

To verify this, the experiment continues in the same environment:

  1. Change base_reachable_time in any two Pods to 1800 seconds, so that their ARP entries stay Reachable and cannot be reclaimed (illustrative commands for steps 1 and 2 are shown after this experiment).
  2. Set gc_thresh3 to 40 to make the problem easier to trigger.
  3. From one of the Pods with the extended aging time, ping the other Pods to generate ARP cache entries.
  4. When the number of entries reaches the threshold, the ping starts dropping packets and the kernel logs the error:

Neighbour: arp_cache: neighbor table overflow!

This experiment shows that when the table is full of unreclaimable ARP entries, new entries cannot be inserted and the network becomes unreachable.
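For reference, steps 1 and 2 of the setup can be reproduced with commands along these lines (a sketch only; the namespace name pod-ns-a and device eth0 are illustrative, and how you enter a Pod's namespace depends on the environment):

# inside a Pod's network namespace: extend the Reachable aging time of its NIC
ip netns exec pod-ns-a sh -c 'echo 1800 > /proc/sys/net/ipv4/neigh/eth0/base_reachable_time'
# on the node: lower the global hard limit so the overflow is easy to hit
echo 40 > /proc/sys/net/ipv4/neigh/default/gc_thresh3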

4.3 Why is this problem more likely in independent NIC mode than in TKE's global routing mode and shared NIC (single NIC, multi-IP) mode?

To answer this question, let's take a brief look at the principles behind TKE's network modes.

Global routing mode

In this network mode, container IP addresses are pre-allocated to each node. These addresses all belong to the same network, with each node holding a small subnet of it. Since ARP serves layer-2 communication, the ARP table in each Pod's network namespace can hold at most the entries of all the other Pods on the same node, so the maximum total number of ARP entries on a node is roughly the square of the node's subnet size. With a node subnet of 128 addresses, the maximum is 127 squared, about 16,000.

Shared NIC mode

In this network mode, each node is bound to a secondary elastic NIC, which the Pods on the node share. A Pod's namespace does not route packets itself and holds only ARP entries; the actual routing is done in the node's default namespace. The ARP cache is therefore effectively shared by the Pods on a node. Since each NIC belongs to a single subnet, the maximum number of ARP entries on a node is the sum of the IP counts of the subnets where its NICs reside. With subnets of roughly 1,000 addresses, that is on the order of 1,000 entries per NIC, and the maximum on a node usually stays under 10,000.

Next-generation network solution: independent NIC mode

The independent NIC mode is the next-generation "zero-loss" container network solution launched by the TKE team. Its basic principle is shown in the figure below:

That is, elastic NICs created on the host CVM are placed directly into containers, giving each container the same network communication and management capabilities as the CVM itself. This greatly improves the data-plane performance of the container network and achieves truly zero loss.

At present, the independent NIC solution is available in TKE as a whitelisted beta; internal and external customers are welcome to try it.

In this network mode, each Pod has an exclusive NIC, an independent namespace, and an independent ARP cache table, and each NIC can belong to a different subnet. Therefore, in independent NIC mode, the maximum number of ARP cache entries is the sum of the sizes of the subnets in the same availability zone. This can easily reach tens of thousands, far exceeding the default ARP cache settings, which is what triggered the problem.

5. Solutions

From the analysis above, the problem can be avoided by increasing the garbage collection thresholds. A temporary fix is therefore to raise the GC thresholds of the ARP cache table:

echo 8192 > /proc/sys/net/ipv4/neigh/default/gc_thresh1
echo 16384 > /proc/sys/net/ipv4/neigh/default/gc_thresh2
echo 32768 > /proc/sys/net/ipv4/neigh/default/gc_thresh3
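Note that values written under /proc do not survive a reboot. To persist them, the equivalent sysctl settings can be added to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and applied with sysctl -p:

net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768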

6. Summary

At first glance, Pod networking failing because the ARP cache is full looks simple, but the ARP cache aging and garbage collection mechanisms behind it are fairly involved. Much of the available material is unclear about whether the garbage collection thresholds apply to the cumulative ARP entries of all namespaces or to each namespace independently, which entries garbage collection reclaims, and how entries behave when the table is full. The author therefore verified the concrete behavior through several small experiments. Compared with reading the arcane kernel source directly, experimentation may also be a quicker way to investigate problems and understand mechanisms. I hope this helps.