Author: Ten Sleep

Background

As the core component of service registration and discovery, the registry is an essential part of a microservices architecture. In terms of the CAP model, a registry can sacrifice a little consistency (C): the service addresses that different nodes see at the same moment are allowed to be temporarily inconsistent. But availability (A) must be guaranteed, because if the registry becomes unavailable, or a service cannot reach it, nodes that need to call that service cannot obtain its address, which can have a catastrophic impact on the entire system.

A real case

This article starts from a real case. A customer deployed many of their own microservices on a Kubernetes cluster on Alibaba Cloud. The network card of one ECS instance failed; although the card was quickly recovered, services became unavailable over a large area and for a sustained period, and the business was damaged.

Let’s take a look at how the chain of problems formed:

  1. The failed ECS node happened to run all the Pods of CoreDNS, the core basic component of the Kubernetes cluster, and this older cluster version lacked the NodeLocal DNSCache feature, so the whole cluster suffered DNS resolution problems (a quick in-cluster DNS check is sketched after this list).
  2. The customer’s services used a defective client version (nacos-client 1.4.1). The defect in this version is DNS-related: when a heartbeat request fails to resolve the domain name, the heartbeat thread stops renewing the heartbeat until the process is restarted.
  3. This defective version was a known problem. Alibaba Cloud pushed a notice about the serious bug in nacos-client 1.4.1 in May, but the customer did not receive the notice and went on to use this version in the production environment.
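
As a quick check on the first link in this chain, in-cluster DNS resolution can be verified from a throwaway Pod. A minimal sketch, assuming nothing about the cluster beyond the built-in kubernetes Service (busybox:1.28 is just a convenient image that ships nslookup):

    # Run a one-off Pod and try to resolve an in-cluster name
    kubectl run dns-test -it --rm --restart=Never --image=busybox:1.28 \
      -- nslookup kubernetes.default.svc.cluster.local

If CoreDNS is down, this lookup times out for every Pod in the cluster.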

The risks were interlinked, and every link was necessary for the failure to occur.

The final failure was that downstream services could not be called, so availability dropped and the business was damaged. The chain of causation, from client defect to failure, was as follows:

  1. A DNS exception occurred on the Provider client during heartbeat renewal.
  2. The heartbeat thread failed to handle the DNS exception correctly and exited unexpectedly, so heartbeats were no longer renewed.
  3. The registry’s normal mechanism is that when heartbeats are not renewed, the instance is automatically deregistered after 30 seconds. Because CoreDNS affected DNS resolution for the whole Kubernetes cluster, all instances of the Provider hit the same problem, and every instance of the service went offline.
  4. On the Consumer side, an empty address list was pushed and no downstream could be found, so exceptions propagated to the upstream callers (such as the gateway).

Looking back at the case as a whole, each risk looks like a small-probability event, but when they all occur together the effect is severe. High availability of service discovery is an important, and often overlooked, part of a microservices architecture. It has always been an essential part of Alibaba’s internal failure drills.

Design for failure

Batch service failures often happen because of network jitter, a CoreDNS exception, or the registry itself becoming unavailable for some reason. But this does not mean the business services themselves are unavailable. If our microservice framework can recognize this as an abnormal situation (a batch of instances going offline, or the pushed address list suddenly becoming empty), it should adopt a conservative strategy instead of blindly applying the push, so that a mistaken “no provider” problem does not spread to all services, leaving every microservice unavailable and hard to recover for a long time.

From the perspective of microservices, how can we cut this chain of problems? The case above looks like a problem caused by an old version of nacos-client, but what if we were using a registry like ZooKeeper or Eureka? Could we confidently say that none of the above would happen? The design-for-failure principle tells us that if the registry goes down, or our services cannot connect to it, we still need a way to keep our services callable and the business online.

This article introduces the high-availability mechanisms used in service discovery and considers how to solve the above problems thoroughly from the perspective of the service framework.

Analysis of high-availability principles in service discovery

Service discovery high availability – push-empty protection

Design for failure tells us that a service cannot completely trust the addresses pushed by the registry. When the registry pushes an empty address list, service invocation would certainly fail with a “no provider” error, so the client should ignore such empty pushes.

The Microservice Governance Center provides push-empty protection:

  • Non-intrusive by default; supports the Spring Cloud and Dubbo frameworks released over the last five years

  • Independent of the registry implementation; no client version upgrade required

Service discovery high availability – outlier instance removal

Heartbeat renewal is the basic way a registry senses instance availability. However, in certain cases heartbeat liveness is not the same as service availability; there are situations where the heartbeat is normal but the service is unavailable, for example:

  • The thread pool for request processing is full

  • Exceptions on a dependent RDS connection cause a lot of slow SQL

  • Load is high on certain machines because of full disks or host resource contention

In these cases, too, the service cannot completely trust the addresses pushed by the registry: the pushed list may contain providers with poor service quality. The client therefore needs to judge the availability and quality of each address from the results of its own calls, so that it can ignore some addresses.

The Microservice Governance Center provides outlier instance removal:

  • Non-intrusive by default; supports the Spring Cloud and Dubbo frameworks released over the last five years

  • Independent of the registry implementation; no client version upgrade required

  • Removal policies based on exception detection: network exceptions, and network exceptions plus service exceptions (HTTP 5xx)

  • Configurable exception threshold, QPS lower limit, and lower limit on the proportion of instances removed

  • Removal event notifications and DingTalk group alarms

Outlier instance removal complements push-empty protection by measuring service availability from the call exceptions of specific interfaces.

Hands-on practice

Prerequisites

  • A Kubernetes cluster has been created. See Creating a Kubernetes Managed Edition Cluster [1].

  • MSE Microservice Governance Professional Edition has been activated. For details, see Activating MSE Microservice Governance [2].

Preparation

Enable MSE microservice governance

1. Activate the Professional Edition of microservice governance:

  1. Click Activate MSE Microservice Governance [3].
  2. For Microservice Governance Version, select Professional Edition, select the service protocol, and click Activate Now. For details on microservice governance billing, see Price Description [4].

2. Install the MSE microservice governance component:

  1. In the left navigation pane of the Container Service console [5], choose Marketplace > App Catalog.
  2. On the App Catalog page, enter ack-mse-pilot in the search box, click the search icon, and then click the component.
  3. On the details page, select the cluster for which to enable the component and click Create. After the installation is complete, the ack-mse-pilot application appears in the mse-pilot namespace, indicating that the installation succeeded. A quick check is sketched below.
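
A quick way to confirm the component is healthy from the command line, assuming (as above) that it was installed into the mse-pilot namespace:

    # The ack-mse-pilot Pods should be in the Running state
    kubectl get pods -n mse-pilot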

3. Enable microservice governance for applications:

  1. Log in to the MSE Governance Center console [6].
  2. In the left navigation pane, choose Microservice Governance Center > Kubernetes Clusters.
  3. On the Kubernetes Clusters page, search for the target cluster, click the search icon, and then click Manage in the Actions column of the target cluster.
  4. In the namespace list on the cluster details page, click the switch in the Actions column next to the target namespace to enable microservice governance.
  5. In the Enable Microservice Governance dialog box, click OK.

Deploy the Demo application

  1. In the left navigation pane of the Container Service console [5], click Clusters.
  2. On the Clusters page, click the name of the target cluster, or click Details in the Actions column on the right of the target cluster.
  3. In the left navigation pane of the cluster management page, choose Workloads > Stateless.
  4. On the Stateless page, select the namespace, and then click Create Resource with YAML.
  5. Configure the template and click Create. In this example, sc-consumer, sc-consumer-empty, and sc-provider are deployed, using open source Nacos as the registry.

Deploying the sample applications (Spring Cloud)

YAML:

    # sc-consumer
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sc-consumer
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sc-consumer
      template:
        metadata:
          annotations:
            msePilotCreateAppName: sc-consumer
          labels:
            app: sc-consumer
        spec:
          containers:
            - name: sc-consumer
              image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
              imagePullPolicy: Always
              env:
                - name: JAVA_HOME
                  value: /usr/lib/jvm/java-1.8-openjdk/jre
                - name: spring.cloud.nacos.discovery.server-addr
                  value: nacos-server:8848
                # Enable push-empty protection on this consumer
                - name: profiler.micro.service.registry.empty.push.reject.enable
                  value: "true"
              ports:
                - containerPort: 18091
              livenessProbe:
                tcpSocket:
                  port: 18091
                initialDelaySeconds: 10
                periodSeconds: 30
    # sc-consumer-empty (no push-empty protection, for comparison)
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sc-consumer-empty
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sc-consumer-empty
      template:
        metadata:
          annotations:
            msePilotCreateAppName: sc-consumer-empty
          labels:
            app: sc-consumer-empty
        spec:
          containers:
            - name: sc-consumer-empty
              image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
              imagePullPolicy: Always
              env:
                - name: JAVA_HOME
                  value: /usr/lib/jvm/java-1.8-openjdk/jre
                - name: spring.cloud.nacos.discovery.server-addr
                  value: nacos-server:8848
              ports:
                - containerPort: 18091
              livenessProbe:
                tcpSocket:
                  port: 18091
                initialDelaySeconds: 10
                periodSeconds: 30
    # sc-provider
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sc-provider
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: sc-provider
      template:
        metadata:
          annotations:
            msePilotCreateAppName: sc-provider
          labels:
            app: sc-provider
        spec:
          containers:
            - name: sc-provider
              image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-provider-0.3
              imagePullPolicy: Always
              env:
                - name: JAVA_HOME
                  value: /usr/lib/jvm/java-1.8-openjdk/jre
                - name: spring.cloud.nacos.discovery.server-addr
                  value: nacos-server:8848
              ports:
                - containerPort: 18084
              livenessProbe:
                tcpSocket:
                  port: 18084
                initialDelaySeconds: 10
                periodSeconds: 30
    # Nacos Server
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nacos-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nacos-server
      template:
        metadata:
          labels:
            app: nacos-server
        spec:
          containers:
            - name: nacos-server
              image: nacos/nacos-server:latest
              imagePullPolicy: Always
              env:
                - name: MODE
                  value: standalone
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    # Nacos Server Service configuration
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: nacos-server
    spec:
      ports:
        - port: 8848
          protocol: TCP
          targetPort: 8848
      selector:
        app: nacos-server
      type: ClusterIP
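
Assuming the YAML above is saved locally (the file name mse-demo.yaml is illustrative), the same resources can also be deployed and checked from the command line instead of the console:

    # Deploy the demo and watch the Pods come up
    kubectl apply -f mse-demo.yaml
    kubectl get pods -l 'app in (sc-consumer,sc-consumer-empty,sc-provider,nacos-server)' -w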

Note that we added an environment variable profiler.micro.service.registry.empty.push.reject.enable=true to the Consumer to enable push-empty protection for the registry (no client or registry version upgrade is needed; MSE-hosted Nacos, Eureka, and ZooKeeper as well as self-built Nacos, Eureka, Consul, and ZooKeeper are all supported).

Add an SLB to each Consumer application for public network access.

In the following, {sc-consumer-empty} represents the public SLB address of the sc-consumer-empty application, and {sc-consumer} represents the public SLB address of the sc-consumer application.
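
On ACK, one way to provision these SLBs is to expose each Consumer Deployment through a Service of type LoadBalancer. A sketch, assuming the Deployments above (the Service names are arbitrary, and port 18091 matches the containers):

    # Expose each consumer through a cloud load balancer (an SLB on ACK)
    kubectl expose deployment sc-consumer --name=sc-consumer-slb --port=18091 --type=LoadBalancer
    kubectl expose deployment sc-consumer-empty --name=sc-consumer-empty-slb --port=18091 --type=LoadBalancer
    # The EXTERNAL-IP column gives the {sc-consumer} / {sc-consumer-empty} addresses
    kubectl get svc sc-consumer-slb sc-consumer-empty-slb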

Application scenarios

The following uses the demo prepared above to walk through several scenarios.

  • Write a test script:

vi curl.sh

    # Poll the given URL every 0.1 s and print each response with a timestamp;
    # responses containing "500" are the server errors we are watching for.
    while :
    do
            result=`curl $1 -s`
            echo `date +%F-%T` $result
            sleep 0.1
    done
  • Test: open two terminals and run the script against each application; output like the following is displayed.

% sh curl.sh {sc-consumer-empty}:18091/user/rest
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:12 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!

% sh curl.sh {sc-consumer}:18091/user/rest
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:13 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!
2022-01-19-11:58:14 Hello from [18084]10.116.0.142!

While the script is running, observe the applications in the MSE console.

  • Scale the CoreDNS component down to 0 replicas to simulate a DNS resolution failure, as sketched below.
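
A minimal sketch of this step, assuming the default CoreDNS Deployment name coredns in the kube-system namespace:

    # Simulate a cluster-wide DNS failure by removing all CoreDNS replicas
    kubectl scale deployment coredns -n kube-system --replicas=0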

We can see that the instances are disconnected from Nacos and the pushed service list becomes empty.
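
This can also be observed through the Nacos 1.x Open API: the provider's instance list comes back with an empty hosts array. A sketch, assuming the service name mse-service-provider seen in the error logs below, and run from inside the cluster since nacos-server is a ClusterIP Service:

    # Query Nacos for the provider's instances (expect an empty "hosts" array)
    curl "http://nacos-server:8848/nacos/v1/ns/instance/list?serviceName=mse-service-provider"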

  • Simulate DNS service recovery by scaling CoreDNS back up to 2 replicas, as sketched below.
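
The corresponding recovery step, under the same assumption about the CoreDNS Deployment:

    # Restore DNS resolution by scaling CoreDNS back up
    kubectl scale deployment coredns -n kube-system --replicas=2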

Results

With business traffic flowing continuously throughout the process, we found that the sc-consumer-empty application produced a large number of persistent errors:

2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.597+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}
2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.799+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}
2022-01-19-12:02:37 {"timestamp":"2022-01-19T04:02:37.993+0000","status":500,"error":"Internal Server Error","message":"com.netflix.client.ClientException: Load balancer does not have available server for client: mse-service-provider","path":"/user/feign"}

In contrast, the sc-consumer application reported no errors throughout the process.

  • sc-consumer-empty recovers only after the Provider is restarted.


Follow-up

When push-empty protection is triggered, we report events and send alarms to the DingTalk group. We recommend using it together with outlier instance removal: push-empty protection may cause Consumers to hold stale Provider addresses, and when a Provider address becomes invalid, outlier instance removal can logically isolate it, ensuring high availability of the service.

Conclusion

Keeping cloud businesses online at all times is the goal MSE has always pursued. This article shared the high-availability capabilities of service discovery based on design for failure, used MSE's service governance capabilities to quickly build a demo of highly available service discovery, simulated the impact of unpredictable service-discovery exceptions online and the means by which we can guard against them, and demonstrated how a simple open source microservice application can build high availability for service discovery.

Links

[1] Create a Kubernetes Managed Edition Cluster

help.aliyun.com/document_de…

[2] Activate MSE Microservice Governance

help.aliyun.com/document_de…

[3] Activate MSE Microservice Governance

common-buy.aliyun.com/?commodityC…

[4] Price Description

https://help.aliyun.com/document_detail/170443.htm#concept-2519524

[5] Container Service console

https://cs.console.aliyun.com

[6] MSE Governance Center console

https://mse.console.aliyun.com

Click here to see more on the MSE website!