In the previous article on troubleshooting monitoring problems, the author analyzed how edge monitoring works, and how to troubleshoot it, when KubeSphere 3.1.0 is integrated with KubeEdge. While introducing the EdgeWatcher component, the author mentioned the constraint that "the internal IP of an edge node must be unique within the cluster". This article takes a deeper look at this constraint and offers edge developers some suggestions and ideas for working around it.

Normal scenario

When an edge node is added to the cloud cluster, you need to specify a node name and an internal IP address for it. The internal IP address is the subject of this article: it must be unique within the cluster.

KubeSphere provides a validation function in EdgeWatcher to check whether the specified internal IP address is already occupied. If validation fails (the IP address is already in use), the command line for joining the cluster is not generated for the edge node. The following two figures show the successful and failed validation scenarios.
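The check described above can be sketched as a simple lookup against the set of internal IPs that EdgeWatcher already knows about. This is a hypothetical illustration, not EdgeWatcher's actual implementation; the function and map names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// ValidateInternalIP sketches the kind of check EdgeWatcher performs:
// the join command is only generated when the requested internal IP is
// not already registered by another edge node. The "registered" map
// (internal IP -> node name) stands in for EdgeWatcher's stored state.
func ValidateInternalIP(ip string, registered map[string]string) error {
	if node, ok := registered[ip]; ok {
		return errors.New("internal IP " + ip + " is already used by node " + node)
	}
	return nil
}

func main() {
	registered := map[string]string{"192.168.1.10": "edge-node-a"}

	// Occupied IP: validation fails, no join command is issued.
	fmt.Println(ValidateInternalIP("192.168.1.10", registered))
	// Free IP: validation succeeds.
	fmt.Println(ValidateInternalIP("192.168.1.11", registered))
}
```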

Verification succeeded:

Verification failed:

KubeSphere has been very thoughtful here, providing both a "Validate" button in the UI and a back-end API, so the check can be used directly or in secondary development based on KubeSphere.

Invalid scenarios

As shown in the previous section, an internal IP address that is already occupied cannot be used to join the cluster: it has been registered with EdgeWatcher and cannot be reused by another edge node.

However, if an IP address has not yet been registered with EdgeWatcher, that is, the edge node has not actually connected to the cluster, the check can be bypassed: you can add two edge nodes with the same internal IP address to the same cluster, producing this invalid usage scenario.

The problem in this scenario is that, on the edge node that joined the cluster earlier, logs, exec, and metrics all fail. In other words, the operation and maintenance functions shown in the figure below return no data.

Previously, the author raised this issue in the KubeSphere developer community and discussed it with the community developers responsible for the edge modules. They confirmed that, in KubeSphere's product design, internal IP addresses are expected to be planned in advance by administrators or users, which is what guarantees there is no duplication.

A potential problem

In private deployment scenarios, unified IP address planning is relatively easy. But what happens when a KubeSphere-based edge solution runs in a public cloud scenario?

Public cloud users are not subject to planning restrictions, and the number of concurrent users is large, so it is very likely that two nodes with the same IP address will join a cluster. Logs, exec, and metrics would then fail for some users, resulting in a large number of support tickets and decreased user engagement. In the public cloud scenario, therefore, this problem must be solved. The following is a detailed analysis of its root cause and a proposed solution.

The root cause

Before solving the problem, we should identify its root cause, so that we can address it precisely rather than by trial and error.

In the previous article, we briefly introduced how metrics collection works in the KubeEdge scenario: requests from kube-apiserver are forwarded to cloudcore in the cloud by iptables rules, and cloudcore relays the messages and data to the edge through the WebSocket channel established between cloudcore and edgecore.

Logs and exec are implemented in the same way as metrics. The following diagram briefly describes how these functions work under KubeEdge.

Focus on cloudcore (KubeEdge's cloud component), shown in red in the figure above, to understand why internal IP addresses need to be unique within the cluster.

When an edge node (edgecore) connects to the cloud cluster, a WebSocket channel is established between edgecore and the cloud. For the cloud to communicate with the edge node through this channel, it must save the channel as a session on the cloud side. The data structure is a map keyed by the node's internal IP, with the session (WebSocket channel) as the value.

If two nodes share the same internal IP, the session record of the edge node that joined the cluster earlier is overwritten. When the cloud then looks up the pod monitoring and operations data of the node whose session was overwritten, it naturally finds nothing.
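The overwrite behavior above falls directly out of using a plain map keyed by internal IP. The sketch below is a simplified stand-in for cloudcore's session bookkeeping, not its real code; the type and function names are assumptions.

```go
package main

import "fmt"

// Session stands in for the WebSocket channel cloudcore keeps per edge node.
type Session struct {
	NodeName string
}

// SessionManager mimics the cloud-side map of internal IP -> session.
type SessionManager struct {
	sessions map[string]*Session
}

func NewSessionManager() *SessionManager {
	return &SessionManager{sessions: map[string]*Session{}}
}

// AddSession registers a tunnel under the node's internal IP. A second
// node with the same IP silently replaces the first entry.
func (m *SessionManager) AddSession(internalIP string, s *Session) {
	m.sessions[internalIP] = s
}

// GetSession is what logs/exec/metrics requests use to find the tunnel.
func (m *SessionManager) GetSession(internalIP string) (*Session, bool) {
	s, ok := m.sessions[internalIP]
	return s, ok
}

func main() {
	m := NewSessionManager()
	m.AddSession("192.168.1.10", &Session{NodeName: "edge-node-a"})
	// A later node registers with the same internal IP ...
	m.AddSession("192.168.1.10", &Session{NodeName: "edge-node-b"})

	// ... so requests meant for edge-node-a now reach edge-node-b's tunnel,
	// and edge-node-a's monitoring and operations data can no longer be found.
	s, _ := m.GetSession("192.168.1.10")
	fmt.Println(s.NodeName) // edge-node-b
}
```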

The root cause of the problem has been found, and the solution follows naturally. In the next section, the author briefly elaborates on it.

The following is a sequence diagram of the logs feature in KubeEdge's edge scenario, for developers interested in learning more.

Solution

In the previous section, the root cause was clarified and the solution became clear. In line with the principle of non-invasive modification (changing KubeSphere and KubeEdge as little as possible), enhancing and extending the upper business logic layer is, in the author's view, the best choice.

Since the root cause is an IP conflict that causes a session to be overwritten, the natural fix is a service that guarantees IP addresses do not repeat within the cluster, that is, IPAM (IP address management). An IPAM service is introduced in the cloud business logic layer to allocate a unique IP address to each user's edge node in the cluster.
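A minimal in-memory sketch of such an IPAM service is shown below, assuming allocation from a single private /24 range. A real service would persist its state and span a larger range; all names here are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// IPAM hands out unique IPs from a private /24 and reclaims them when a
// node is removed, so the same IP can never be held by two nodes at once.
type IPAM struct {
	mu        sync.Mutex
	prefix    string          // e.g. "10.200.0."
	next      int             // next host octet to try
	allocated map[string]bool // IPs currently in use
	released  []string        // reclaimed IPs, reused first
}

func NewIPAM(prefix string) *IPAM {
	return &IPAM{prefix: prefix, next: 1, allocated: map[string]bool{}}
}

// Allocate returns a cluster-unique internal IP for a new edge node.
func (p *IPAM) Allocate() (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	// Prefer reclaimed addresses before consuming fresh ones.
	if n := len(p.released); n > 0 {
		ip := p.released[n-1]
		p.released = p.released[:n-1]
		p.allocated[ip] = true
		return ip, nil
	}
	if p.next > 254 {
		return "", errors.New("address pool exhausted")
	}
	ip := fmt.Sprintf("%s%d", p.prefix, p.next)
	p.next++
	p.allocated[ip] = true
	return ip, nil
}

// Release reclaims an IP when its edge node leaves the cluster.
func (p *IPAM) Release(ip string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.allocated[ip] {
		delete(p.allocated, ip)
		p.released = append(p.released, ip)
	}
}

func main() {
	p := NewIPAM("10.200.0.")
	ip, _ := p.Allocate()
	fmt.Println(ip) // 10.200.0.1
}
```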

It is also important to note that the unique IP address assigned by the IPAM service is an internal implementation detail and should not be shown to users as the "internal IP". The internal IP address the user sees is still the one they planned and entered. After the modification, however, that user-entered IP is no longer used as the session key and no longer needs a conflict check; it is simply displayed on the page to help users find their nodes and to improve the usability of the product.

The following flow chart shows how nodes join the cluster under this design, for developers' reference.

Based on the above flow chart, the author lists the points that need to be modified to implement the solution:

  1. Create an IPAM service in the cluster to provide IP address allocation and reclamation.
  2. Create a node service at the business layer to persist each node's name, display IP address, and unique IP address.
  3. Modify keadm and edgecore to make the node IP optional.
  4. Modify cloudcore so that, during node registration, it queries the unique IP address by node name and registers it as the node's internal IP address.
  5. At the business layer, hide the unique IP address (the internal IP in Kubernetes) from the northbound interface and replace it with the display IP entered by the user.
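Steps 4 and 5 above can be sketched as two lookups against the record the node service persists: registration resolves the node name to the IPAM-assigned unique IP, while the northbound interface only ever returns the user-entered display IP. This is a hypothetical illustration of the design, not KubeSphere or KubeEdge code; all names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeRecord is what the hypothetical node service persists per edge node:
// the user-entered display IP plus the IPAM-assigned unique IP.
type NodeRecord struct {
	DisplayIP string // what the user planned and sees in the UI
	UniqueIP  string // IPAM-assigned; becomes the Kubernetes internal IP
}

// InternalIPForRegistration is the lookup cloudcore would perform at node
// registration time: resolve the node name to its unique IP and register
// that as the node's internal IP, never the user-entered display IP.
func InternalIPForRegistration(name string, store map[string]NodeRecord) (string, error) {
	rec, ok := store[name]
	if !ok {
		return "", errors.New("node not found in node service: " + name)
	}
	return rec.UniqueIP, nil
}

// DisplayIPForUI is what the northbound interface returns: the user-facing
// IP, with the internal unique IP hidden.
func DisplayIPForUI(name string, store map[string]NodeRecord) (string, error) {
	rec, ok := store[name]
	if !ok {
		return "", errors.New("node not found in node service: " + name)
	}
	return rec.DisplayIP, nil
}

func main() {
	store := map[string]NodeRecord{
		"edge-node-a": {DisplayIP: "192.168.1.10", UniqueIP: "10.200.0.1"},
	}
	internal, _ := InternalIPForRegistration("edge-node-a", store)
	display, _ := DisplayIPForUI("edge-node-a", store)
	fmt.Println(internal, display) // 10.200.0.1 192.168.1.10
}
```

With this split, two users can both enter 192.168.1.10 as the display IP without conflict, because sessions and registration key off the unique IP instead.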

Afterword

By analyzing the phenomena and underlying principles, we have proposed a solution to the IP conflict problem for KubeSphere-based edge nodes in a public cloud environment. Given the limits of my technical ability, there may be a simpler and more effective approach; I welcome your valuable suggestions, and let's make KubeSphere-based edge solutions bigger and stronger together.
