Some time ago, our company's Android packaging service ran into a problem. When uploading an APK to the 360 hardening server, the upload very often got stuck partway through and eventually failed after a long series of retries. I investigated, found the cause, and fixed it. This long article reviews that investigation and touches on the following topics:

  • Docker bridge mode network model
  • Principles of Netfilter and NAT
  • Using SystemTap for kernel probing

Problem description

The packaging service is deployed as follows: the Android packaging environment is built into a Docker image that runs on a physical machine. The image handles code compilation, packaging, hardening, signing, and channel package generation, as shown in the figure below:

The problem lies in the APK upload step, which stalls after part of the file has been uploaded. The 360 SDK then reports timeouts and other exceptions, as shown in the picture below.

By capturing packets both on the host and inside the container, we observed the following.

Packet No. 881 is a late-arriving ACK with an ACK number of 530104. A larger ACK number, 532704, had already been confirmed earlier by packet No. 875. After receiving this stale ACK, the host sends an RST packet to the remote 360 hardening server.

After that, the sender keeps retransmitting data. The stalled upload corresponds to this retransmission phase, as shown in the following figure.

This RST does not appear in the packets captured on the container side, as shown in the figure below.

Because the container side did not perceive the connection exception, the service in the container kept retrying the upload, but still failed after several retries.

Preliminary analysis

At first, I wondered whether the host replied with an RST simply because it had received a late-arriving ACK.

According to the TCP specification, however, a late ACK that only acknowledges data already acknowledged can simply be ignored; no reply is needed.
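To make the rule concrete, here is a minimal sketch of how a sender might classify an incoming ACK number. The function name and return labels are my own, sequence-number wraparound is ignored, and the value used for snd_nxt is made up for illustration; only the two ACK numbers come from the capture described above.

```python
def classify_ack(seg_ack: int, snd_una: int, snd_nxt: int) -> str:
    """Classify an incoming ACK number against the sender's state.

    snd_una: oldest sequence number not yet acknowledged
    snd_nxt: next sequence number the sender will use
    Sequence-number wraparound is deliberately ignored in this sketch.
    """
    if seg_ack <= snd_una:
        # Stale or duplicate ACK: it acknowledges nothing new.
        return "ignore"
    if seg_ack > snd_nxt:
        # Acknowledges data that was never sent.
        return "ack-for-unsent-data"
    # snd_una < seg_ack <= snd_nxt: new data is acknowledged.
    return "accept"


# Packet 875 already acknowledged 532704, so packet 881 (ACK 530104) is stale.
# 540000 is a hypothetical snd_nxt value, not taken from the capture.
print(classify_ack(530104, snd_una=532704, snd_nxt=540000))
```

Under this reading, the late ACK by itself gives the TCP stack no reason to send an RST.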

Was the packet itself malformed in the first place? A careful look at the packet's contents turned up nothing unusual. My existing knowledge of how TCP works could not explain this behavior.

Using SystemTap to find where the RST comes from

Looking at the kernel code, two main functions send RST packets:

tcp_v4_send_reset@net/ipv4/tcp_ipv4.c

static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb) {
}

tcp_send_active_reset@net/ipv4/tcp_output.c

void tcp_send_active_reset(struct sock *sk, gfp_t priority) {
}


Next, use SystemTap to probe these two functions.

probe kernel.function("tcp_send_active_reset@net/ipv4/tcp_output.c").call {
    printf("\n%-25s %s<-%s\n", ctime(gettimeofday_s()), execname(), ppfunc());
    if ($sk) {
        src_addr = tcp_src_addr($sk);
        src_port = tcp_src_port($sk);
        dst_addr = tcp_dst_addr($sk);
        dst_port = tcp_dst_port($sk);
        if (src_port == 443 || dst_port == 443) {
            printf(">>>>>>>>>[%s->%s] %s<-%s %d\n",
                   str_addr(src_addr, src_port), str_addr(dst_addr, dst_port),
                   execname(), ppfunc(), dst_port);
            print_backtrace();
        }
    }
}

probe kernel.function("tcp_v4_send_reset@net/ipv4/tcp_ipv4.c").call {
    printf("\n%-25s %s<-%s\n", ctime(gettimeofday_s()), execname(), ppfunc());
    if ($sk) {
        src_addr = tcp_src_addr($sk);
        src_port = tcp_src_port($sk);
        dst_addr = tcp_dst_addr($sk);
        dst_port = tcp_dst_port($sk);
        if (src_port == 443 || dst_port == 443) {
            printf(">>>>>>>>>[%s->%s] %s<-%s %d\n",
                   str_addr(src_addr, src_port), str_addr(dst_addr, dst_port),
                   execname(), ppfunc(), dst_port);
            print_backtrace();
        }
    } else if ($skb) {
        header = __get_skb_tcphdr($skb);
        src_port = __tcp_skb_sport(header)
        dst_port = __tcp_skb_dport(header)
        if (src_port == 443 || dst_port == 443) {
            try {
                iphdr = __get_skb_iphdr($skb)
                src_addr_str = format_ipaddr(__ip_skb_saddr(iphdr), @const("AF_INET"))
                dst_addr_str = format_ipaddr(__ip_skb_daddr(iphdr), @const("AF_INET"))
                tcphdr = __get_skb_tcphdr($skb)
                urg = __tcp_skb_urg(tcphdr)
                ack = __tcp_skb_ack(tcphdr)
                psh = __tcp_skb_psh(tcphdr)
                rst = __tcp_skb_rst(tcphdr)
                syn = __tcp_skb_syn(tcphdr)
                fin = __tcp_skb_fin(tcphdr)
                printf("skb [%s:%d->%s:%d] ack:%d, psh:%d, rst:%d, syn:%d fin:%d %s<-%s %d\n",
                       src_addr_str, src_port, dst_addr_str, dst_port,
                       ack, psh, rst, syn, fin, execname(), ppfunc(), dst_port);
                print_backtrace();
            } catch { }
        }
    } else {
        printf("tcp_v4_send_reset else\n");
        print_backtrace();
    }
}

Running the script shows that the RST is sent by tcp_v4_send_reset, with the following call stack:

Tue Jun 15 11:23:04 2021 swapper/6<-tcp_v4_send_reset
skb [36.110.213.207:443->10.21.17.99:39700] ack:1, psh:0, rst:0, syn:0 fin:0 swapper/6<-tcp_v4_send_reset 39700
 0xffffffff99e5bc50 : tcp_v4_send_reset+0x0/0x460 [kernel]
 0xffffffff99e5d756 : tcp_v4_rcv+0x596/0x9c0 [kernel]
 0xffffffff99e3685d : ip_local_deliver_finish+0xbd/0x200 [kernel]
 0xffffffff99e36b49 : ip_local_deliver+0x59/0xd0 [kernel]
 0xffffffff99e364c0 : ip_rcv_finish+0x90/0x370 [kernel]
 0xffffffff99e36e79 : ip_rcv+0x2b9/0x410 [kernel]
 0xffffffff99df0b79 : __netif_receive_skb_core+0x729/0xa20 [kernel]
 0xffffffff99df0e88 : __netif_receive_skb+0x18/0x60 [kernel]
 0xffffffff99df0f10 : netif_receive_skb_internal+0x40/0xc0 [kernel]
 ...

The stack shows that tcp_v4_send_reset was called from tcp_v4_rcv. But from which line of tcp_v4_rcv exactly?

This is where a handy tool, faddr2line, comes in: it maps an entry in the stack trace back to the corresponding line of source code.

wget https://raw.githubusercontent.com/torvalds/linux/master/scripts/faddr2line

bash faddr2line /usr/lib/debug/lib/modules/`uname -r`/vmlinux tcp_v4_rcv+0x596/0x9c0

tcp_v4_rcv+0x596/0x9c0:
tcp_v4_rcv in net/ipv4/tcp_ipv4.c:1740

As you can see, the call to tcp_v4_send_reset sits at about line 1740 of tcp_ipv4.c (marked as line 1739 in the excerpt below):

int tcp_v4_rcv(struct sk_buff *skb) {
    struct sock *sk;

    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
    if (!sk)
        goto no_tcp_socket;
    ...
no_tcp_socket:
    if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
        goto discard_it;

    if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {
csum_error:
        TCP_INC_STATS_BH(net, TCP_MIB_CSUMERRORS);
bad_packet:
        TCP_INC_STATS_BH(net, TCP_MIB_INERRS);
    } else {
        tcp_v4_send_reset(NULL, skb); // line 1739
    }
}

The only path that reaches this call is the one where no socket is found for the packet: sk is NULL, so execution jumps to the no_tcp_socket label and then falls into the else branch.

How is that possible? The connection clearly exists. How can the kernel receive a late ACK and fail to find the socket for its connection? Next, let's look at the underlying implementation of __inet_lookup_skb, which ends up calling __inet_lookup_established.

struct sock *__inet_lookup_established(struct net *net,
				  struct inet_hashinfo *hashinfo,
				  const __be32 saddr, const __be16 sport,
				  const __be32 daddr, const u16 hnum,
				  const int dif)

Setting the current symptom aside, a very similar RST scenario is sending a packet to a port that no service is listening on. If no connection exists for the packet, the kernel replies with an RST, telling the sender that it cannot process the packet.
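This familiar case is easy to reproduce from user space. The sketch below is my own illustration and assumes a Linux-like loopback interface: it picks a free port, makes sure nothing is listening on it, and then tries to connect. The kernel's RST in reply to the SYN surfaces in Python as ConnectionRefusedError.

```python
import socket


def connect_to_closed_port() -> bool:
    """Return True if connecting to a port with no listener is refused."""
    # Bind to port 0 so the OS picks a free port, then close the socket
    # without ever calling listen(): nothing is listening on that port.
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2)
    try:
        client.connect(("127.0.0.1", port))
    except ConnectionRefusedError:
        # No socket matched the SYN, so the kernel answered with an RST.
        return True
    finally:
        client.close()
    return False


print(connect_to_closed_port())
```

The difference in our case is that the connection does exist, yet the kernel behaves as if it did not.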

At this point, the investigation hit a dead end. Why can't the kernel protocol stack find the connection when it is clearly still there?

Packet flow in Docker bridge mode

When the Docker daemon starts, it creates a virtual bridge named docker0 on the host, and the Docker containers on that host are connected to this bridge.

When a container starts, Docker creates a pair of veth interfaces (a veth pair), which is essentially an Ethernet link implemented in software. Docker uses the veth pair to connect eth0 inside the container to the docker0 bridge. Connectivity to the outside is provided through IP masquerading, a form of network address translation (NAT) built on IP forwarding and iptables rules.
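The masquerading idea can be sketched as a small translation table. This is a toy model of my own, not Docker or kernel code: the container address 172.17.0.2 is an assumed value from Docker's default subnet, the source port is kept unchanged for simplicity, and the host and remote addresses are the ones that appear in the SystemTap output earlier.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


class Masquerade:
    """Toy model of source NAT for outbound container traffic."""

    def __init__(self, host_ip: str) -> None:
        self.host_ip = host_ip
        # (remote endpoint, translated host endpoint) -> original container endpoint
        self.reverse: Dict[Tuple[Endpoint, Endpoint], Endpoint] = {}

    def outbound(self, src: Endpoint, dst: Endpoint) -> Tuple[Endpoint, Endpoint]:
        """Rewrite the container source address to the host address."""
        translated = (self.host_ip, src[1])  # port kept as-is in this sketch
        self.reverse[(dst, translated)] = src
        return translated, dst

    def inbound(self, src: Endpoint, dst: Endpoint) -> Tuple[Endpoint, Endpoint]:
        """Rewrite a reply's destination back to the container, if known."""
        original = self.reverse.get((src, dst))
        return (src, original) if original else (src, dst)


nat = Masquerade("10.21.17.99")
container = ("172.17.0.2", 39700)   # assumed container address
remote = ("36.110.213.207", 443)

print(nat.outbound(container, remote))            # leaves the host with the host IP
print(nat.inbound(remote, ("10.21.17.99", 39700)))  # reply is mapped back
```

The key point for what follows is that replies reach the container only if the reverse translation is actually applied.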

A deeper look at Netfilter and NAT

Netfilter is a Linux kernel framework that places several hook points in the kernel protocol stack to intercept, filter, or otherwise process packets. It can be used for everything from simple firewalls, to detailed analysis of network traffic, to complex stateful packet filters.

Docker relies on its Network Address Translation (NAT) feature to translate source and destination addresses according to a set of rules. iptables is the user-space tool for managing these Netfilter rules.

For the deployment structure in this scenario, it works as shown below.

Now look at the Netfilter connection-tracking code in net/netfilter/nf_conntrack_proto_tcp.c:

/* Returns verdict for packet, or -1 for invalid. */
static int tcp_packet(struct nf_conn *ct,
		      const struct sk_buff *skb,
		      unsigned int dataoff,
		      enum ip_conntrack_info ctinfo,
		      u_int8_t pf,
		      unsigned int hooknum,
		      unsigned int *timeouts) {
    
    // ...	
    	      
    if (!tcp_in_window(ct, &ct->proto.tcp, dir, index, skb, dataoff, th, pf)) {
        spin_unlock_bh(&ct->lock);
        return -NF_ACCEPT;
    }
}

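If tcp_in_window decides that a packet falls outside the expected window, tcp_packet returns -NF_ACCEPT and the packet is treated as invalid. My guess was that the late ACK had been judged invalid in exactly this way.

We can log invalid packets with an iptables rule.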

iptables -A INPUT -m conntrack --ctstate INVALID -m limit --limit 1/sec   -j LOG --log-prefix "invalid: " --log-level 7

After the preceding rules are added, the hardening upload script is run again and packets are captured.

Then view the corresponding log in dmesg.

Take the first line as an example. LEN=40 means a 20-byte IP header plus a 20-byte TCP header, and the ACK bit is set, so this is a pure ACK with no payload. It corresponds to the ACK packet immediately before the RST in the capture above. The packet details are shown below, and the window value of 187 matches as well.
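The arithmetic can be checked with a few lines of code. The log line below is an abbreviated, hypothetical sample that I assembled from the values mentioned in this article (the real dmesg output is only shown in the screenshot and contains more fields); the interface name eth0 is an assumption, and the header sizes assume no IP or TCP options.

```python
import re

# Abbreviated, hypothetical sample in the style of an iptables LOG entry.
SAMPLE = ("invalid: IN=eth0 OUT= SRC=36.110.213.207 DST=10.21.17.99 "
          "LEN=40 PROTO=TCP SPT=443 DPT=39700 WINDOW=187 ACK")

TCP_FLAGS = {"SYN", "ACK", "FIN", "RST", "PSH", "URG"}


def parse_log(line: str):
    """Split a LOG line into KEY=value fields and bare TCP flag tokens."""
    fields = dict(re.findall(r"(\w+)=(\S*)", line))
    flags = {token for token in line.split() if token in TCP_FLAGS}
    return fields, flags


def payload_len(total_len: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """Payload size, assuming headers without options (20 bytes each)."""
    return total_len - ip_header - tcp_header


fields, flags = parse_log(SAMPLE)
print(payload_len(int(fields["LEN"])), flags)
```

A payload length of zero together with the ACK flag is what identifies the entry as a bare acknowledgment.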

If a packet is in the INVALID state, Netfilter does not apply the NAT translation of IP address and port to it. The protocol stack then looks up the connection using the untranslated IP and port, finds nothing, and returns an RST, as shown in the following figure.

This also matches the code path we saw earlier: __inet_lookup_skb returns NULL, and an RST is sent.
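The whole chain can be condensed into a toy model. Everything here is a simplification of my own: the real container lives in its own network namespace, the container address 172.17.0.2 is assumed, and the tables are plain dictionaries rather than kernel structures. It only illustrates why skipping the reverse NAT turns a valid connection into a failed lookup.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (ip, port)

REMOTE: Endpoint = ("36.110.213.207", 443)
HOST: Endpoint = ("10.21.17.99", 39700)
CONTAINER: Endpoint = ("172.17.0.2", 39700)  # assumed container address

# Reverse NAT mapping kept by connection tracking: host endpoint -> container endpoint.
REVERSE_NAT: Dict[Endpoint, Endpoint] = {HOST: CONTAINER}

# Known sockets, keyed by (local endpoint, remote endpoint).
SOCKETS: Dict[Tuple[Endpoint, Endpoint], str] = {(CONTAINER, REMOTE): "ESTABLISHED"}


def deliver(src: Endpoint, dst: Endpoint, ct_state: str) -> str:
    """Decide what happens to an incoming packet in this toy model."""
    if ct_state != "INVALID":
        # Only packets accepted by connection tracking get translated.
        dst = REVERSE_NAT.get(dst, dst)
    if (dst, src) in SOCKETS:
        return "delivered"
    # No socket for this address/port pair: answer with a reset.
    return "RST"


print(deliver(REMOTE, HOST, "ESTABLISHED"))  # translated, socket found
print(deliver(REMOTE, HOST, "INVALID"))      # untranslated, lookup fails
```

This also explains why the RST shows up only in the host capture: in this model the untranslated packet never reaches the container at all.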

How to fix it

Once the cause is known, the fix is simple, and there are two options. The first is a bit blunt: use iptables to drop invalid packets so that they never trigger an RST.

iptables -A INPUT -m conntrack --ctstate INVALID -j DROP

After this modification, the problem was solved instantly. After dozens of tests, there was no upload timeout or failure.

This fix has a small drawback: it can also hit FIN packets as collateral damage, along with other packets that are genuinely invalid. A more elegant fix is to set the kernel option net.netfilter.nf_conntrack_tcp_be_liberal to 1:

sysctl -w "net.netfilter.nf_conntrack_tcp_be_liberal=1"
net.netfilter.nf_conntrack_tcp_be_liberal = 1

With this option enabled, tcp_in_window accepts packets that fail the window check instead of letting them be marked invalid. The relevant code is in net/netfilter/nf_conntrack_proto_tcp.c:

static bool tcp_in_window(const struct nf_conn *ct,
			  struct ip_ct_tcp *state,
			  enum ip_conntrack_dir dir,
			  unsigned int index,
			  const struct sk_buff *skb,
			  unsigned int dataoff,
			  const struct tcphdr *tcph,
			  u_int8_t pf) {
	...
	res = false;
	if (sender->flags & IP_CT_TCP_FLAG_BE_LIBERAL ||
	    tn->tcp_be_liberal)
		res = true;
	...
	return res;
}
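The effect of the option can be modeled in a few lines. This is a deliberately reduced sketch of the two excerpts shown in this article, not the kernel's actual logic: the real window check is replaced by a boolean input named strict_ok (my own name), and the verdict values only mirror the -NF_ACCEPT return seen in tcp_packet.

```python
NF_ACCEPT = 1  # Netfilter's accept verdict; tcp_packet returns -NF_ACCEPT for invalid


def tcp_in_window(strict_ok: bool, sender_liberal: bool, sysctl_liberal: bool) -> bool:
    """Reduced model: strict_ok stands in for the real window checks."""
    if strict_ok:
        return True
    # Out of window: accepted anyway if either liberal switch is on.
    return sender_liberal or sysctl_liberal


def tcp_packet(strict_ok: bool, sender_liberal: bool = False,
               sysctl_liberal: bool = False) -> int:
    """Return NF_ACCEPT, or -NF_ACCEPT when the packet counts as invalid."""
    if not tcp_in_window(strict_ok, sender_liberal, sysctl_liberal):
        return -NF_ACCEPT
    return NF_ACCEPT


# A packet that fails the window check, before and after enabling the sysctl.
print(tcp_packet(strict_ok=False))                       # -1: invalid
print(tcp_packet(strict_ok=False, sysctl_liberal=True))  # 1: accepted
```

Once the packet is no longer invalid, it goes through NAT as usual and reaches the socket inside the container.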

I’ll close this article with a screenshot of a silky-smooth upload.

Afterword

Read more code, and be willing to doubt what seems impossible. Some details above may be wrong; the method is what matters.