In this case, JD Cloud's technical engineers took on a customer's special requirement that went beyond the functional limits of the managed K8s product. Faced with this challenging task, did our siege lions (engineers) overcome the many difficulties and meet the requirement perfectly? Read on for this installment of K8s technical case sharing! (Friendly reminder: the article is long, so you may want to bookmark it first; remember to balance work and rest while reading and look after your health!)

Part One: “Distinctive” Needs

One day, our JD Cloud technical engineers received a request for help from a customer. The customer had deployed a test cluster using the managed K8s cluster product in their cloud environment. For business reasons, their R&D colleagues needed to access the ClusterIP services and back-end pods of the K8s cluster directly from the office network. Normally, K8s pods can only be accessed from other pods or cluster nodes inside the cluster, not directly from outside. When a pod provides services inside or outside the cluster, its access address and port are exposed through a Service. A Service not only acts as the access entry point for the pod application, but also probes the corresponding pod ports to perform health checks. When there are multiple pods at the back end, the Service forwards client requests to different pods according to its scheduling algorithm, providing load balancing. The commonly used Service types are as follows:

Service Type Introduction

1. ClusterIP type. A ClusterIP service can only be accessed via its cluster IP address by pods or nodes inside the cluster. This type is typically used for services that only need to be reachable from within the cluster and do not need to be exposed outside it, such as the cluster system service kubernetes.

2. NodePort type. To allow services to be accessed from outside the cluster, the NodePort type maps the service's port to a port on every node in the cluster. When the service is accessed from outside, the request reaches a node IP address on the specified port and is then forwarded to the back-end pods.

3. LoadBalancer type. This type usually calls the cloud vendor's API to create a load balancer on the cloud platform and to configure listeners according to the service settings. Inside K8s, a LoadBalancer service maps the service port to a fixed port on every node, just like the NodePort type; the nodes are then registered as the load balancer's back end, and the listeners forward client requests to the mapped service port on the back-end nodes, from where they are forwarded on to the back-end pods. Compared with NodePort, the LoadBalancer service spares clients from having to access the IP addresses of multiple nodes, and when it is used to expose a service externally, only the LB needs a public IP address rather than the K8s nodes, which improves node security and saves public IP resources. The LB's health checks on the back-end nodes also keep the service highly available, avoiding access failures caused by a single K8s node failure. (A minimal sketch of a ClusterIP and a LoadBalancer service follows.)
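To make the types above concrete, here is a minimal sketch (not the customer's actual manifests; the names, labels and ports are illustrative assumptions) of the same nginx back end exposed once as a ClusterIP service and once as a LoadBalancer service:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nginx-clusterip
spec:
  type: ClusterIP          # reachable only from pods and nodes inside the cluster
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-lb
spec:
  type: LoadBalancer       # asks the cloud platform for an LB; the nodes become its back end
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
EOF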

Summary of Part One

Given the K8s service types above, the LB service is the first recommendation for accessing services from outside the cluster. At present, nodes in this managed K8s cluster cannot be bound to public IP addresses, so NodePort services cannot be reached over the public network; customers can only reach NodePort services after connecting their office network to the cloud network via a dedicated line or IPSec. Pods can only be accessed from within the cluster, from other pods or cluster nodes. Moreover, ClusterIP services and pods are deliberately designed not to be reachable from outside the cluster, for security; breaking this restriction may introduce security problems. We therefore recommended that the customer expose services externally using LB services, or alternatively connect from the office network to the NAT host of the K8s cluster, jump from the NAT host to a K8s node, and access ClusterIP services or back-end pods from there.

The customer replied that there were already hundreds of ClusterIP services in the test cluster. Converting them all to LB services would mean creating hundreds of LB instances and binding hundreds of public IP addresses, which is clearly unrealistic. Jumping to cluster nodes via the NAT host, meanwhile, would require handing the system passwords of the NAT host and cluster nodes to the R&D colleagues, which is bad for operations management, and it is far less convenient than letting R&D reach the services and pods directly over the network.

Part Two: Always More Solutions than Difficulties?

Although the customer's access pattern goes against the design logic of a K8s cluster and looks rather “non-mainstream”, it is a genuine, strong requirement in the customer's scenario. As siege lions, we do our best to solve customers' technical problems! So we planned an implementation scheme around the customer's requirements and architecture.

To connect the two networks, the first step was to analyze the network architecture of the customer's office network and of the K8s cluster on the cloud. The customer's office network has a unified public-network egress device. The K8s cluster's network architecture is as follows: the master nodes of the cluster are invisible to users; there is a node subnet for K8s node communication, a NAT/LB subnet for the NAT host and the load balancer instances created by LB services, and a pod subnet for pod communication. The cluster nodes are built on cloud hosts, and the node subnet's route for accessing public IP addresses points to the NAT host as its next hop. In other words, cluster nodes cannot be bound to public IP addresses, and the NAT host serves as the unified public-network egress. The NAT host provides only SNAT, not DNAT, so a node cannot be reached from outside the cluster through the NAT host.

Why plan a separate pod subnet? To explain this, we need to introduce the pod network architecture on the node, shown below:

On a node, the containers in a pod are connected to the docker0 device via a veth pair, and docker0 is connected to the node's network card via the self-developed CNI network plug-in. To separate cluster control traffic from data traffic and improve network performance, each node in the cluster is bound to an elastic network interface dedicated to pod communication. When a pod is created, it is assigned an IP address on the elastic NIC. Each elastic NIC can hold up to 21 IP addresses; when one fills up, a new NIC is bound for subsequent pods. The subnet that the elastic NICs belong to is the pod subnet. This architecture reduces the load on the node's primary NIC eth0 and separates control traffic from data traffic. In addition, each pod IP corresponds to a real network port and IP address in the VPC, so pod addresses are routable within the VPC network.
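A few commands that can be run on a node to see this layout; this is only a rough sketch, and interface names such as eth1 and the cni* veth devices are examples from this cluster that will differ elsewhere:

ip addr show eth1             # elastic NIC used for pod traffic, carrying the secondary pod IPs
ip -o link show type veth     # veth pairs connecting pod network namespaces to the node
ip route | grep cni           # per-pod host routes pointing at the cni* virtual interfaces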

How to connect the two networks

After understanding the network architecture on both ends, we had to choose how to connect them. There are generally two ways to connect an on-premises network to a cloud network: a dedicated-line product, or a self-built VPN. With the dedicated-line product, a leased line is configured from the customer's office network to the cloud data center, and routes to each other are then configured on the office network's egress device and on the BGW border gateway of the cloud network, as shown below:

Due to a functional limitation of the BGW in the existing dedicated-line product, routes on the cloud side can only point to the VPC where the K8s cluster resides, not to a specific K8s node. But to reach ClusterIP services and pods, traffic must enter through nodes or pods in the cluster, which means the next hop of the routes to the service and pod segments must be a cluster node. The dedicated-line product therefore clearly cannot meet the requirement.

Next, the self-built VPN mode. A self-built VPN places an endpoint device with a public IP address in the customer's office network and another in the cloud network, and establishes an encrypted tunnel between the two; the underlying transport is still the public network. With this solution, the cloud-side endpoint can be a cloud host with a public IP address in a different subnet of the same VPC as the cluster nodes. Packets from the office network destined for services and pods are sent through the VPN tunnel to this cloud host; the host's subnet route table is then configured to route those packets on to a cluster node, while the node subnet (and likewise the pod subnet) is configured with a return route to the client's subnet whose next hop is the cloud endpoint host. As for the VPN implementation, after discussing it with the customer we chose an IPSec tunnel.

Having identified the solution, we needed to verify its feasibility in a test environment. Since we do not have an on-premises environment, we used a cloud host in a different region from the K8s cluster to stand in for the client's office-network endpoint device. We created a cloud host k8s-ipsec-bj with a public IP address in the NAT/LB subnet of the VPC of the K8s cluster k8s-bjtest01 in North China, to simulate the cloud-side IPSec endpoint in the customer scenario, and established an IPSec tunnel with the East China (Shanghai) cloud host office-ipsec-sh. In the NAT/LB subnet's route table, we added a route to the service segment whose next hop is the K8s cluster node k8s-node-vmlppp-bs9jq8pua, hereafter Node A. The pod subnet and the NAT/LB subnet belong to the same VPC, so no route to the pod segment is needed; traffic to a pod matches a local route and is forwarded to the corresponding elastic NIC. For return traffic, we configured routes to the Shanghai cloud host office-ipsec-sh on the node subnet and the pod subnet, with the next hop pointing to k8s-ipsec-bj. The complete architecture is shown in the figure below:
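For reference, the planned route entries can be sketched roughly as follows. In the real environment these are VPC subnet route-table entries configured through the console or API; the ip route syntax is used here only for readability, and the CIDRs are assumptions for this test:

# NAT/LB subnet route table: traffic for the service segment goes to cluster Node A
ip route add 10.0.58.0/24 via <Node-A-internal-IP>         # service CIDR is an assumption
# Node subnet and pod subnet route tables: return traffic for the office segment goes back to k8s-ipsec-bj
ip route add 172.16.0.0/24 via <k8s-ipsec-bj-internal-IP>  # office CIDR is an assumption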

Part Three: Practice Goes “Wrong”

With the solution in hand, we started building the environment. We created the k8s-ipsec-bj cloud host in the NAT/LB subnet of the K8s cluster and bound a public IP address to it, then established an IPSec tunnel with the Shanghai cloud host office-ipsec-sh. There are plenty of documents online on how to configure IPSec, so we will not go into detail here; interested readers can follow them to practice. Once the tunnel is up, pinging the peer's internal IP address successfully indicates that IPSec is working properly. We then configured the routes for the NAT/LB subnet, node subnet, and pod subnet as planned. In the K8s cluster we picked a service named nginx, whose ClusterIP is 10.0.58.158, as shown in the figure:

The pod behind this service is 10.0.0.13, serving the default nginx page and listening on port 80. From the Shanghai cloud host we pinged the service IP 10.0.58.158 and probed port 80 of the service with paping.
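The connectivity checks from the Shanghai client were roughly as follows (paping is a third-party TCP "ping" tool; the -c flag limits the number of probes):

ping 10.0.58.158                 # ICMP to the service clusterIP
paping 10.0.58.158 -p 80 -c 4    # TCP probes against the service port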

Use curl http://10.0.58.158 to make an HTTP request.

We then retested direct access to the back-end pod; no problem there either 🙂

Just when the siege lion thought the job was done, a test against another service came like a bucket of cold water. We picked a mysql service next and tested access to port 3306. Its ClusterIP is 10.0.60.80 and the back-end pod IP is 10.0.0.14.

Pinging the service ClusterIP directly from the Shanghai cloud host worked fine, but paping to port 3306 was, incredibly, blocked!

We then tested access to the back-end pod directly. Strangely enough, the back-end pod's IP could be pinged and port 3306 probed with paping without any problem.

What’s going on?

What was going on here? The only difference between the two services is this: the back-end pod 10.0.0.13 of the reachable nginx service is deployed on Node A, the node that client requests are forwarded to, while the back-end pod of the unreachable mysql service is not on Node A but on another node. To verify whether this was the cause, we modified the NAT/LB subnet route so that the next hop for the mysql service points to the node where its back-end pod resides, then tested again. Sure enough, port 3306 of the mysql service became reachable!

Part Four: Three Whys

At this moment, three questions were on the siege lion's mind:

(1) Why does the request connect when it is forwarded to the node where the service's back-end pod resides?

(2) Why does the request fail to connect when it is forwarded to a node where the service's back-end pod does not reside?

(3) Why can the service's IP address be pinged no matter which node the request is forwarded to?

Deep analysis to eliminate the question marks

To eliminate these question marks, we needed to analyze in depth, understand the cause of the problem, and then treat it accordingly, and so the packet-capture tool tcpdump makes its grand entrance. To keep the investigation focused, we adjusted the test architecture: the existing IPSec setup from Shanghai to Beijing stays unchanged, and we scaled out the K8s cluster with a new, empty node k8s-node-vmcrm9-bst9jq8pua that hosts no pods, hereafter Node B, which only performs request forwarding. We modified the NAT/LB subnet route so that the next hop of the route to the service address points to this node, and selected the nginx service 10.0.58.158 with back-end pod 10.0.0.13, as shown in the figure below:

Whenever we need to test the scenario where requests are forwarded to the node hosting the pod, we simply change the next hop of the service route back to Node A.

All set, let’s embark on our journey! Go Go Go!

First, we explore the scenario of question 1. On Node A, we run the following command to capture packets to and from the Shanghai cloud host 172.16.0.50 (the client dimension):

tcpdump -i any host 172.16.0.50 -w /tmp/dst-node-client.cap

Do you remember that in the managed K8s cluster, all pod data traffic is sent and received through the node's elastic NIC? On Node A, the elastic NIC used by the pod is eth1, so we first capture the pod's traffic on eth1 with the following command:

tcpdump -i eth1 host 10.0.0.13

The results are shown below:

No packets to or from 10.0.0.13 were captured on eth1, yet the curl request on the Shanghai cloud host succeeded, which means 10.0.0.13 did send packets back to the client, just not through eth1. We then expanded the capture to all interfaces with the following command:

tcpdump -i any host 10.0.0.13

The results are shown below:

This time the interaction packets between the pod and the client were captured. We also ran tcpdump -i any host 10.0.0.13 -w /tmp/dst-node-pod.cap to write the capture out as a cap file.

Then we ran tcpdump -i any host 10.0.58.158 to capture packets of the service IP dimension.

When 172.16.0.50 issues the curl request, packets are indeed captured, and 10.0.58.158 interacts only with 172.16.0.50. Since these packets also appear in the capture for 172.16.0.50, we will not analyze them separately.

We used Wireshark to analyze the captures for 172.16.0.50 and 10.0.0.13, as shown in the following figures:

The client 172.16.0.50 sends a packet to the service IP 10.0.58.158, and a packet with exactly the same ID and content then goes to the pod IP 10.0.0.13. Finally, pod 10.0.0.13 returns a packet to the client, and the service IP 10.0.58.158 returns a packet with the same ID and content. What causes this?

During this process, the client requests the service IP, and the service performs DNAT (destination-address-based NAT) to forward the request to the back-end pod IP. Although it looks as if the client sent the packet twice, once to the service and once to the pod, the client did not actually send it again; the service translated the destination address. Likewise, when the pod replies to the service, the service forwards the reply back to the client. Because the whole exchange happens on a single node, the process completes inside the node's internal virtual network, which is why we did not capture any packets interacting with the client on eth1, the NIC used by the pod. Combined with the pod-dimension capture, the HTTP GET request packets seen in the client-dimension capture also appear in the pod-dimension capture, which confirms our analysis.
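For readers less familiar with DNAT, the rule below is a generic iptables illustration of destination-address translation, rewriting a virtual service address to a back-end pod address. It is only an illustration of the concept, not the exact mechanism used in this managed cluster (which, as discussed later, relies on IPVS):

iptables -t nat -A PREROUTING -d 10.0.58.158 -p tcp --dport 80 \
  -j DNAT --to-destination 10.0.0.13:80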

So which network interface does the pod actually use to send and receive packets? Running netstat -rn on Node A to check the routing table, we found the following:

Inside the node, all routes to 10.0.0.13 point to the interface cni34f0b149874, clearly a virtual network device created by the CNI plug-in. To verify that all pod traffic goes through this interface, we requested the service address from the client again and captured both the client dimension and the pod dimension on Node A, but this time replacing -i any with -i cni34f0b149874 for the pod-dimension capture. Comparing the captures showed that, as expected, every packet between the client and the pod appears in the cni34f0b149874 capture, while no packets involving the client could be caught on any other interface in the system. This proves that our inference was correct.
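Gathering those checks in one place (a rough sketch; the interface name cni34f0b149874 is specific to this node):

netstat -rn | grep 10.0.0.13                 # host route for the pod IP
ip route get 10.0.0.13                       # resolves to the cni* veth interface
tcpdump -i cni34f0b149874 host 172.16.0.50   # client-dimension capture on that interface only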

To sum up, when the client request is forwarded to the node where POD is located, the data path is as follows:

Next, we explore the scenario of question 2, which concerns us most: the next hop of the NAT/LB subnet route to the service now points to the newly created Node B, as shown in the figure:

This time we need to capture packets on both Node B and Node A while using curl to request the service address. On the forwarding node, Node B, we ran tcpdump -i eth0 host 10.0.58.158 to capture service-dimension packets and found that the client's request packets to the service are captured, but the service never sends anything back, as shown in the figure:

Although the capture filter matches 10.0.58.158, the destination shown in the captured packets is the node name. This is actually down to the service implementation mechanism. After a service is created in the cluster, the cluster network component picks a random port on each node to listen on and then configures forwarding rules in the node's iptables: inside the node, all requests to the service IP are forwarded to that random port and handled by the cluster network component. So when you access a service from inside a node, you are really accessing a port on the node itself. If you export the capture as a cap file, you can see that the destination IP of the request is still 10.0.58.158, as shown in the figure:
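A rough way to look at this on a node is sketched below; the exact chains and listening behaviour depend on the proxy mode used by the cluster network component, so treat it as a sketch rather than the exact rules of this product:

iptables-save -t nat | grep 10.0.58.158   # NAT rules that match the service clusterIP
ss -lntp | grep -i proxy                  # ports the proxy component is listening on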

This also explains why a ClusterIP can only be accessed by nodes or pods inside the cluster: devices outside the cluster do not have the iptables rules created by the K8s network component that translate the requested service address into a node port, so even if packets are sent to the cluster, the service's ClusterIP does not exist anywhere on the node network and the packets are discarded. (Yet another odd bit of knowledge acquired.)

Back to the problem itself: service-related packets are captured on the forwarding node, but unlike the case where requests are forwarded to the pod's node, the service never sends packets back to the client. We then ran tcpdump -i any host 172.16.0.50 -w /tmp/fwd-node-client.cap to capture the client dimension. The packet contents are as follows:

After the client's request reaches the service on Node B, the service also performs DNAT and forwards the request to 10.0.0.13 on Node A. However, the forwarding node never receives any packet from 10.0.0.13 back to the client. The client then retransmits the request several times, with no response.

Does Node A receive the client's request packets at all? Does the pod send replies back to the client? We moved to Node A to capture packets. From the capture on Node B we know that on Node A there should only be interaction between the client IP and the pod IP, so we captured those two dimensions. From the earlier analysis, packets entering the node should interact with the pod through the virtual device cni34f0b149874. To verify this we ran tcpdump -i eth0 host 172.16.0.50 and tcpdump -i eth0 host 10.0.0.13, and captured nothing.

This shows the packets do not go through eth0. We then ran tcpdump -i eth1 host 172.16.0.50 -w /tmp/dst-node-client-eth1.cap and tcpdump -i cni34f0b149874 host 172.16.0.50 -w /tmp/dst-node-client-cni.cap to capture the client dimension. The contents of the two captures are identical, which shows the packets enter Node A on eth1 and are forwarded to cni34f0b149874 by the internal route. The packet contents are as follows:

We can see that after the client's packets reach the pod, the pod does send replies back to the client. Running tcpdump -i eth1 host 10.0.0.13 -w /tmp/dst-node-pod-eth1.cap and tcpdump -i cni34f0b149874 host 10.0.0.13 -w /tmp/dst-node-pod-cni.cap to capture the pod dimension again gave identical contents, which shows the pod's replies go out through cni34f0b149874 and then leave Node A on eth1. The pod keeps returning packets to the client but receives no response, which triggers retransmissions.

So if the pod's replies have been sent, why are they never seen on Node B or on the client? Let's check the route table of the pod subnet that the eth1 NIC belongs to.

Because the pod's reply to the client leaves Node A on eth1, under normal DNAT rules it should be sent back to the service port on Node B. However, the eth1 subnet's route table hijacks the packet straight to the host k8s-ipsec-bj. When the packet arrives there, its source address is pod IP 10.0.0.13 and its destination is 172.16.0.50, whereas the original request had source 172.16.0.50 and destination 10.0.58.158. The destination address of the request does not match the source address of the reply: k8s-ipsec-bj sees a reply from 10.0.0.13 to 172.16.0.50 without ever having seen a request from 172.16.0.50 to 10.0.0.13. The cloud platform's virtual network mechanism drops packets for which there is only a reply and no corresponding request, to prevent address-spoofing attacks. As a result, the client never receives anything from 10.0.0.13 and the service request cannot complete. The packet path in this scenario is shown below:

With this, the reason why the client can successfully request the pod directly also becomes clear. The data path for requesting the pod is as follows:

The request and the reply travel the same path, both passing through k8s-ipsec-bj with no change of source IP, so the pod is reachable.

Some readers may ask: what if the pod subnet route is changed so that packets destined for 172.16.0.50 are not sent to k8s-ipsec-bj but returned to Node B instead, letting the reply travel back along its original path? Yes, in our tests this does enable the client to successfully request the service. However, there is also the requirement that the client must be able to access the back-end pods directly. If the pod's replies go back to Node B, what is the data path when the client requests a pod?

As shown in the figure, when the client's request for the pod reaches k8s-ipsec-bj, it is forwarded directly to Node A's eth1 according to the local route. But when the pod replies to the client, the eth1 subnet route now sends the reply to Node B. Node B has never seen the client's request for the pod, so it hits the same "reply without request" problem: the reply is discarded, and the client cannot reach the pod.

At this point we know why a client request fails when it is forwarded to a node that does not host the service's back-end pod. That leaves question 3: the service's port cannot be reached, yet the service's address can still be pinged. Our inference was that, since the service performs DNAT and load balancing for its back-end pods, when a client pings the service IP it is the service itself that responds to the ICMP packets rather than the back-end pod. To verify this, we created a new, empty service in the cluster with no back end associated with it, as shown in the figure:
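A sketch of such an "empty" service: a ClusterIP service defined without a selector, so Kubernetes creates no Endpoints behind it (the name and port are illustrative assumptions):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: empty-svc-test       # illustrative name, not the service actually used in our test
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 80
EOF
kubectl get svc empty-svc-test   # note the assigned clusterIP (the test service got 10.0.62.200)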

We then pinged 10.0.62.200 from the client:

The result: the ping goes through even though there is no pod behind the service. This confirms that ICMP packets are answered by the service itself, and that pinging the service involves no actual request to a back-end pod.

Part Five: God Always Opens a Door

Now that we had found the cause of the failed access, we needed to solve it. In principle, the packet drop can be avoided by having the pod "hide" its own IP and present the service IP as the source when it replies to clients across nodes. The principle is similar to SNAT (source-address translation): like LAN devices without public IP addresses that each have their own internal address but access the public network through a unified egress, so the address seen by the other side is the egress public IP rather than the device's internal IP. To implement SNAT, our first thought was the iptables rules in the node's operating system. We ran iptables-save on Node A, where the pod resides, to inspect the existing rules.

Knock on the blackboard. Pay attention

You can see that the system has created nearly a thousand iptables rules, most of them related to K8s. We focused on the NAT rules in the figure above, and the following three caught our attention:

Let's look at the rule in the red box first:

-A KUBE-SERVICES -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP src,dst -j KUBE-MARK-MASQ

This rule means that if the source or destination address matches a cluster IP + port (the KUBE-CLUSTER-IP ipset), the packet jumps to the KUBE-MARK-MASQ chain for masquerade purposes. Address masquerading is a form of NAT.

Now let's look at the rule in the blue box:

-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000

This rule marks the packet with 0x4000/0x4000, flagging it for address masquerading.

Finally, the rule in the yellow box:

-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE

This rule says that packets marked 0x4000/0x4000, i.e. service traffic requiring SNAT, jump to the MASQUERADE target during POSTROUTING for address masquerading.

These three rules seem to do exactly what we need iptables to do for us, yet our earlier tests made it clear they were not taking effect. Why? Is there a parameter in the K8s network components that controls whether packets accessing a ClusterIP get SNATed?

We studied this starting from the working modes and parameters of kube-proxy, the component responsible for network proxying and forwarding between services and pods. We already know that a service load-balances and proxies traffic to its back-end pods; this relies on kube-proxy, which, as the name suggests, is a network proxy component that runs as a pod on every K8s node. When a service is accessed via its ClusterIP + port, iptables rules forward the request to the corresponding random port on the node, kube-proxy takes over, and through its internal routing and scheduling algorithm it forwards the request to the appropriate back-end pod. Initially, kube-proxy worked in userspace mode, in which the kube-proxy process is a real TCP/UDP proxy, similar to HAProxy. Since this mode was superseded by iptables mode starting with K8s 1.2, we will not go into it here; interested readers can study it on their own.

The iptables mode, introduced in version 1.2, became kube-proxy's default mode. In this mode kube-proxy no longer acts as a proxy itself; instead, it creates and maintains the corresponding iptables rules to forward traffic from services to pods. However, proxying via iptables rules has unavoidable drawbacks: as the number of services and pods in the cluster grows, the number of iptables rules grows dramatically, leading to a significant drop in forwarding performance and, in extreme cases, even lost rules.

To address the drawbacks of iptables mode, K8s introduced IPVS (IP Virtual Server) mode in version 1.8. IPVS mode is designed for high-performance load balancing and uses a more efficient hash-table data structure, offering better scalability and performance for large clusters, and it supports more sophisticated load-balancing scheduling algorithms than iptables mode. The kube-proxy in our managed clusters uses IPVS mode.
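If ipvsadm happens to be installed on a node, the IPVS virtual server that kube-proxy maintains for a service, together with its real-server (pod) back ends, can be inspected directly, roughly like this:

ipvsadm -Ln | grep -A 2 "10.0.58.158:80"   # the virtual service and its pod back ends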

However, IPVS mode cannot provide packet filtering, address masquerading, or SNAT, so in scenarios that need those functions IPVS must be used together with iptables rules. Wait: address masquerading and SNAT are exactly what we saw in the iptables rules earlier! That means those rules do not take effect while address masquerading and SNAT are disabled, but once some parameter enables them, the rules we saw will kick in. So we went to the official Kubernetes documentation to look up kube-proxy's working parameters, and made an exciting discovery:

What a turn of events! The siege lion's sixth sense told us that the --masquerade-all parameter was the key to solving our problem!

Part Six: The Truth · More Solutions than Difficulties

We decided to test enabling the --masquerade-all parameter. kube-proxy runs as a pod on every node in the cluster, and its parameter configuration is mounted into the pod as a ConfigMap. We ran kubectl get cm -n kube-system to locate the kube-proxy ConfigMap:

The kube-proxy ConfigMap is shown in the red box. We ran kubectl edit cm kube-proxy-config-khc289cbhd -n kube-system to edit it, as shown in the figure:

We found the masqueradeAll parameter, which defaults to false, changed it to true, and saved the change.
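A sketch of this step; the ConfigMap name is the one from this cluster, and the key layout inside it may differ in other environments:

kubectl -n kube-system get cm kube-proxy-config-khc289cbhd -o yaml | grep masqueradeAll
#   masqueradeAll: false
kubectl -n kube-system edit cm kube-proxy-config-khc289cbhd   # change it to masqueradeAll: true and save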

For the configuration to take effect, the existing kube-proxy pods have to be deleted one by one; the DaemonSet automatically recreates them, the new pods mount the modified ConfigMap, and masqueradeAll is now enabled. As shown in the figure:
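A sketch of the restart; the label selector is an assumption, so confirm the DaemonSet and its labels with kubectl -n kube-system get ds first:

kubectl -n kube-system get pod -l k8s-app=kube-proxy -o wide   # label is an assumption
kubectl -n kube-system delete pod <kube-proxy-pod-name>        # repeat per node; the DaemonSet recreates each pod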

Rubbing hands expectantly

The exciting moment arrived. We pointed the route for service access at Node B again and ran paping 10.0.58.158 -p 80 on the Shanghai client to see the result (rubbing our hands in anticipation):

This scene almost brought tears of joy to the siege lion's eyes…

curl http://10.0.58.158 also succeeds. How gratifying!

Accessing the back-end pod directly, and forwarding requests to the node where the pod resides, also worked without any problem. With that, the customer's requirement was finally fully resolved, and we breathed a long sigh of relief!

The Finale: Knowing Why It Works

The problem was solved, but we were not quite done. With masqueradeAll enabled, how exactly does the service SNAT the packets so they are no longer dropped? Again, we analyzed by capturing packets.

First, the scenario where requests are forwarded to a node that does not host the pod. When the client requests the service, capturing on the client IP on the node where the pod resides yields no packets at all.

This shows that once the parameter is enabled, requests to the back-end pods no longer carry the client's IP address.

Capturing on the pod IP, we can see interaction packets between the pod and the service port on the forwarding node.

This shows that the pod no longer sends packets directly back to 172.16.0.50; the client and the pod are unaware of each other's existence, and all interaction is relayed through the service.

We then captured packets on the forwarding node. The packet contents are as follows:

At the same time, we captured packets on the node where the pod resides. The contents are as follows:

The forwarding node receives the client's curl request packet with sequence number 708, and the pod's node receives a request packet with the same sequence number, but the source/destination addresses have changed from 172.16.0.50/10.0.58.158 to 10.0.32.23/10.0.0.13. Here 10.0.32.23 is an internal IP address of the forwarding node, which in practice corresponds to the service's random port on that node, so the translation can be understood as the source/destination becoming the service address and pod address, 10.0.58.158/10.0.0.13. The return path is the same: the pod sends a reply with sequence number 17178, and the forwarding node sends a packet with the same sequence number to the client, with the source/destination translated from 10.0.0.13/10.0.58.158 back to 10.0.58.158/172.16.0.50.

From these observations, the service performs SNAT towards both the client and the back end; it can be thought of as a load balancer with transparent pass-through of the client source IP disabled, so the client and the back end do not know of each other's existence and only know the service's address. The data path in this scenario is as follows:

Requests made directly to the pod do not involve any SNAT translation and behave the same as before masqueradeAll was enabled, so we will not analyze them again.

When a client request is forwarded to the node hosting the pod, the service still performs the SNAT translation, but entirely inside the node. As the earlier analysis showed, whether or not SNAT happens in this case has no influence on the access result.

Conclusion

With this, we could give the customer the best solution available at this stage. Of course, in production environments, for the sake of business security and stability, we do not recommend exposing ClusterIP services and pods directly outside the cluster. In addition, the impact of masqueradeAll on cluster network performance and other functions has not been tested here, so the risk of enabling it in production is unknown and it should be treated with caution. Through the process of solving this customer requirement, we gained a better understanding of the service and pod network mechanisms of a K8s cluster and learned about kube-proxy's masqueradeAll parameter, which should benefit our future learning and operations work.

Thank you for reading. If this article helped you, feel free to like and share it, and follow our official account; more great content is on the way!
