This article walks through the whole process of troubleshooting an IPSec fault, shows how the fault was analyzed, and draws out the related knowledge learned along the way.


Review of previous article:
Tcp_recycle discarded

Due to business requirements, we set up a VPN on some overseas nodes to facilitate data exchange between them. One day, we set up a new VPN between two new nodes. After the launch, ping and traceroute tests showed nothing abnormal, traffic was observed passing through the tunnel, and the monitored metrics were normal. Half an hour later, however, the business side reported that the network between the two new nodes was abnormal. Once the fault was confirmed, the change was rolled back in production. Offline testing afterwards found that restarting the IPsec process cleared the problem.

Next, we reproduce the configuration and scenario from that time and explain the root cause of the problem.

Environment overview

As shown above, two nodes A and B are connected through IPsec. Network segment 10.0.0.0/24 on side A is connected to network segments 10.0.1.0/24 and 10.0.2.0/24 on side B. Data packets destined for network segments 10.0.1.0/24 and 10.0.2.0/24 are handed to the IPSec server on node B, and vice versa.

The configuration on one side is as follows:

conn Tunnel1
    authby=secret
    auto=start
    left=%defaultroute
    leftid=1.1.1.1    # local public IP address
    right=2.2.2.2     # peer public IP address
    type=tunnel
    ikelifetime=8h
    keylife=1h
    phase2alg=aes128-sha1;modp1024
    ike=aes128-sha1;modp1024
    auth=esp
    keyingtries=%forever
    keyexchange=ike
    leftsubnets={10.0.1.0/24,10.0.2.0/24}
    rightsubnet=10.0.0.0/24
    dpddelay=10
    dpdtimeout=30
    dpdaction=restart_by_peer
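A leftsubnets list is shorthand for one tunnel per left/right subnet pair, which is why this single conn section ends up producing more than one tunnel. The Python sketch below only enumerates those pairs for the subnets in this configuration. The helper name expand_tunnels is hypothetical, and the code models the expansion itself, not Openswan's internal implementation.

```python
import ipaddress
from itertools import product


def expand_tunnels(left_subnets, right_subnets):
    """Return every (left, right) subnet pair; each pair stands for one tunnel."""
    lefts = [ipaddress.ip_network(s) for s in left_subnets]
    rights = [ipaddress.ip_network(s) for s in right_subnets]
    return list(product(lefts, rights))


if __name__ == "__main__":
    # Subnets taken from the configuration above.
    for left, right in expand_tunnels(["10.0.1.0/24", "10.0.2.0/24"],
                                      ["10.0.0.0/24"]):
        print(f"{left} <-> {right}")
```

With two left subnets and one right subnet the sketch yields two pairs, while a single leftsubnet=0.0.0.0/0 yields only one.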

Symptoms and cause

The symptom was that the back-end service cluster on node A could not communicate with the back-end service on node B. Yet the IPSec service status looked normal when checked, and a packet capture even showed data flowing inside the IPSec tunnel. While communication between nodes A and B was failing, the IPSec server did not correctly forward the encrypted data packets to the back end. In the /proc/net/xfrm_stat file, the XfrmInTmplMismatch error counter kept increasing, which indicates that the IPSec SA and SP did not match. Using the ip xfrm monitor command, we found that the SPIs used for communication between nodes A and B were inconsistent.
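One way to confirm that a counter such as XfrmInTmplMismatch is actually growing is to compare two snapshots of /proc/net/xfrm_stat. The following Python sketch assumes each line of that file is a counter name followed by an integer; the function names parse_xfrm_stat and increased are hypothetical helpers, not part of any existing tool.

```python
def parse_xfrm_stat(text):
    """Parse the contents of /proc/net/xfrm_stat into a {name: value} dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Assumed layout: "<CounterName> <integer>" on every line.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def increased(before, after):
    """Return {name: delta} for every counter that grew between snapshots."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value > before.get(name, 0)
    }
```

On the gateway, one could read the file twice a few seconds apart, pass both texts through parse_xfrm_stat, and print the result of increased; a non-empty entry for XfrmInTmplMismatch would match the symptom described here.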

Figure 1

Figure 2

The tunnel used when the request is sent has SPI 0x198e7538, which belongs to reqid 16385. The tunnel used when the reply comes back has SPI 0x9ce44e77, which belongs to reqid 16389. Because the two directions use SAs from two different tunnels, the XfrmInTmplMismatch error occurs. The fix is simple: change leftsubnets={10.0.1.0/24,10.0.2.0/24} in the configuration file to leftsubnet=0.0.0.0/0, which avoids the forward and return paths being inconsistent (PS: because there is then only one path :)).
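The reqid comparison can also be done mechanically on saved ip xfrm state output. The Python sketch below assumes the usual layout in which a "src ... dst ..." line is followed by a "proto esp spi ... reqid ..." line; the helper names parse_states, reqid_of and same_tunnel are hypothetical, and the check simply follows the reasoning above that both directions of one tunnel should share a reqid.

```python
import re

# Assumed layout of one SA entry in `ip xfrm state` output:
#   src <ip> dst <ip>
#       proto esp spi 0x... reqid <n> mode tunnel
STATE_RE = re.compile(
    r"src (?P<src>\S+) dst (?P<dst>\S+)\s+"
    r"proto esp spi (?P<spi>0x[0-9a-fA-F]+) reqid (?P<reqid>\d+)"
)


def parse_states(text):
    """Extract src, dst, SPI and reqid for every ESP SA found in the text."""
    return [
        {
            "src": m["src"],
            "dst": m["dst"],
            "spi": m["spi"].lower(),
            "reqid": int(m["reqid"]),
        }
        for m in STATE_RE.finditer(text)
    ]


def reqid_of(states, spi):
    """Return the reqid of the SA with the given SPI, or None if absent."""
    spi = spi.lower()
    for state in states:
        if state["spi"] == spi:
            return state["reqid"]
    return None


def same_tunnel(states, out_spi, in_spi):
    """True only if both SPIs are known and carry the same reqid."""
    out_reqid = reqid_of(states, out_spi)
    in_reqid = reqid_of(states, in_spi)
    return out_reqid is not None and out_reqid == in_reqid
```

Fed with the two SPIs observed in this incident, same_tunnel would return False, because 16385 and 16389 are different reqids.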

Further analysis

We have a fix, but we still need to understand why there are two IPSec tunnels and why traffic ends up on the wrong one. Since the problem described above is caused by the forward and return paths differing, we looked at the source code to find the rule for generating SPIs:

https://github.com/xelerance/Openswan/blob/master/programs/pluto/kernel.h

https://github.com/xelerance/Openswan/blob/master/programs/pluto/kernel.c

The function to look at is get_ipsec_spi.

Based on the above, we can see that the peer subnet mask is not considered when generating the SPI.


In summary, these are the steps we took to troubleshoot this IPSec fault.


This article was first published on the public account “Mi Operation and Maintenance”.