Preface

We know that microservice governance can be achieved through a registry such as Nacos. But once Nacos is introduced, are all services really managed perfectly from then on? Too young, too simple!

Today's article looks at what happens when a service goes down unexpectedly and Nacos fails to notice in time, and what solutions are available.

Nacos health check

The story starts with a health check of the service instance by Nacos.

For temporary (ephemeral) instances, Nacos currently keeps instances alive through heartbeat reporting: the Nacos client runs a scheduled task that sends a heartbeat request every 5 seconds to show that the instance is still active.

If the Nacos server does not receive a heartbeat from the client within 15 seconds, it marks the instance as unhealthy; if it still receives no heartbeat within 30 seconds, the temporary instance is removed.
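
To picture the client side of this, here is a minimal sketch of such a heartbeat loop, assuming nothing more than a plain ScheduledExecutorService. The real Nacos 1.x client does this in its BeatReactor; the class below is illustrative only (the beat endpoint shown is the open API's /nacos/v1/ns/instance/beat):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatSketch {

    private static final long BEAT_INTERVAL_MS = 5_000; // default client heartbeat period

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Re-send a heartbeat for the registered instance every 5 seconds
        scheduler.scheduleWithFixedDelay(HeartbeatSketch::sendBeat, 0, BEAT_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    private static void sendBeat() {
        // In the real client this is an HTTP PUT to the Nacos server's
        // /nacos/v1/ns/instance/beat endpoint, carrying the instance details
        System.out.println("PUT /nacos/v1/ns/instance/beat");
    }
}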

If the service suddenly dies

In normal business scenarios, when a service instance shuts down, it will by default invoke the deregistration interface before exiting, clearing its registered instance from the Nacos server.

That only works if the instance gets the chance to deregister. Killing an application normally (kill) lets it finish the task at hand and shut down gracefully, but killing it with kill -9 terminates the process immediately, before it can deregister.

In that case, the deregistration interface is never invoked, and the health check mechanism is what eventually removes the instance.
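
Why the difference? A plain kill sends SIGTERM, which the JVM turns into a graceful shutdown that runs registered shutdown hooks; closing the Spring context there is what triggers deregistration. kill -9 sends SIGKILL, and the process dies before any hook can run. A minimal sketch of the idea, with deregister() as a hypothetical placeholder for the real call (e.g. NamingService#deregisterInstance):

public class GracefulShutdownSketch {

    public static void main(String[] args) throws InterruptedException {
        // Runs on SIGTERM (plain kill) and on normal exit; never runs on SIGKILL (kill -9)
        Runtime.getRuntime().addShutdownHook(new Thread(GracefulShutdownSketch::deregister));
        Thread.currentThread().join(); // simulate a long-running service
    }

    private static void deregister() {
        // Placeholder: in a real service this would call the registry's deregistration API
        System.out.println("Deregistering instance from Nacos...");
    }
}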

Given the Nacos health check mechanism analyzed above, there is a window of up to 15 seconds after a service dies suddenly. During that window, the Nacos server does not yet know the service is down and keeps handing the instance out to clients.

During that window, some requests are bound to be routed to the dead instance. How should this be handled? How do we keep it from disrupting normal business?

You can customize the heartbeat period

The easiest solution to the problem above is to shorten the default health check times.

Instead of taking 15 seconds to detect a failed instance and mark it unhealthy, can we shorten the detection window? The smaller the window, the smaller and more controllable the impact.

Nacos 1.1.0 introduced support for a custom heartbeat period for exactly this purpose. On the client side, you can set the heartbeat interval, heartbeat timeout, and instance deletion timeout in the instance's metadata when creating the instance.

The following is an example:

String serviceName = randomDomainName();

Instance instance = new Instance();
instance.setIp("1.1.1.1");
instance.setPort(9999);

Map<String, String> metadata = new HashMap<String, String>();
// Heartbeat interval, in milliseconds
metadata.put(PreservedMetadataKeys.HEART_BEAT_INTERVAL, "3000");
// Heartbeat timeout, in milliseconds: if the server receives no heartbeat
// from the client within 6 seconds, it marks the client's registered instance as unhealthy
metadata.put(PreservedMetadataKeys.HEART_BEAT_TIMEOUT, "6000");
// Instance deletion timeout, in milliseconds: if the server receives no heartbeat
// within 9 seconds, it removes the client's registered instance
metadata.put(PreservedMetadataKeys.IP_DELETE_TIMEOUT, "9000");
instance.setMetadata(metadata);

naming.registerInstance(serviceName, instance);

If the project is based on Spring Cloud Alibaba, it can be configured in the following ways:

spring:
  application:
    name: user-service-provider
  cloud:
    nacos:
      discovery:
        server-addr: 127.0.0.1:8848
        heart-beat-interval: 1000 # Heartbeat interval, in milliseconds
        heart-beat-timeout: 3000 # Heartbeat timeout, in milliseconds
        ip-delete-timeout: 6000 # Instance deletion timeout, in milliseconds

In some Spring Cloud releases, the configuration above may not take effect. Alternatively, you can configure the metadata directly, as follows:

spring:
  application:
    name: user-service-provider
  cloud:
    nacos:
      discovery:
        server-addr: 127.0.0.1:8848
        metadata:
          preserved.heart.beat.interval: 1000 # Heartbeat interval, in milliseconds
          preserved.heart.beat.timeout: 3000 # Heartbeat timeout, in milliseconds: if the server receives no heartbeat from the client within 3 seconds, the client's registered instance is marked unhealthy
          preserved.ip.delete.timeout: 6000 # Instance deletion timeout, in milliseconds: if the server receives no heartbeat from the client within 6 seconds, the client's registered instance is removed

For the first configuration style, interested readers can look at how the related beans are instantiated in NacosServiceRegistryAutoConfiguration. In some versions the configuration does not take effect because of the order in which NacosRegistration and NacosDiscoveryProperties are instantiated; in that case, use the second style.

Whichever style you use, the configuration items are eventually packaged into the Instance when NacosServiceRegistry registers it, via the getNacosInstanceFromRegistration method:

private Instance getNacosInstanceFromRegistration(Registration registration) {
    Instance instance = new Instance();
    instance.setIp(registration.getHost());
    instance.setPort(registration.getPort());
    instance.setWeight(nacosDiscoveryProperties.getWeight());
    instance.setClusterName(nacosDiscoveryProperties.getClusterName());
    instance.setEnabled(nacosDiscoveryProperties.isInstanceEnabled());
    // The metadata, including the heartbeat settings above, is carried over here
    instance.setMetadata(registration.getMetadata());
    instance.setEphemeral(nacosDiscoveryProperties.isEphemeral());
    return instance;
}

The setMetadata call is where the heartbeat metadata configured above is attached to the instance being registered.

With the heartbeat period configuration that Nacos provides, combined with your own business scenario, you can choose the most suitable heartbeat detection settings and minimize the impact on the business.

It may seem that shorter is always better, but shorter periods put more strain on the Nacos server. If the server can handle the load, keep the period as short as practical.

Nacos protection threshold

Alongside the configuration above, you should also consider the Nacos protection threshold in light of your own project.

Nacos has a protection threshold configuration item for each registered service. Its value is a floating-point number between 0 and 1.

Essentially, the protection threshold is a ratio: the number of currently healthy instances of the service divided by the total number of instances.

Normally, instances have a healthy/unhealthy state, and when a consumer asks Nacos for available instances, only the healthy ones are returned.

However, this causes problems under high concurrency and heavy traffic. Suppose service A has 100 instances and 98 of them are unhealthy: if Nacos returns only the two healthy instances, a traffic peak could overwhelm them both, setting off a wider avalanche.

This is where the protection threshold matters: when the ratio of healthy instances to total instances for service A drops below the threshold, it means too few healthy instances remain, and the protection is triggered (its state becomes true).

Once triggered, Nacos returns all of the service's instances (healthy and unhealthy) to consumers. A consumer may then hit an unhealthy instance and fail, but that is better than an avalanche: some requests are sacrificed to keep the whole system available.
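
The selection rule boils down to a few lines. The following is only a conceptual sketch of the behavior described above, not the Nacos server source; the minimal Instance interface is a stand-in:

import java.util.List;
import java.util.stream.Collectors;

class ProtectionThresholdSketch {

    interface Instance { boolean isHealthy(); }

    // Returns the instances a consumer receives for one service
    static List<Instance> selectInstances(List<Instance> all, float protectThreshold) {
        List<Instance> healthy = all.stream().filter(Instance::isHealthy).collect(Collectors.toList());
        // Ratio of currently healthy instances to total instances
        float healthyRatio = all.isEmpty() ? 0f : (float) healthy.size() / all.size();
        if (healthyRatio < protectThreshold) {
            // Threshold triggered: hand back everything (healthy + unhealthy)
            // so the few remaining healthy nodes are not overwhelmed
            return all;
        }
        return healthy;
    }
}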

The solution above described customizing the heartbeat period, where an instance moves through the states healthy, unhealthy, and removed. When choosing those parameter values, also take the protection threshold into account, to avoid setting off an avalanche.

Spring Cloud request retries

Even with the heartbeat periods tuned as above, there is still a brief window after an instance dies during which Nacos has not yet weeded it out. If a consumer requests that instance during the window, the request still fails.

To build more robust applications, we want a strategic retry when a request fails, rather than simply returning the failure. This is where a retry mechanism comes in.

In a microservices architecture, load balancing is usually handled by the Ribbon or Spring Cloud LoadBalancer. Beyond the request retry and failover features that the Ribbon and the Feign framework themselves support, Spring Cloud also provides a standard load balancer retry configuration.

We won't dwell on the Ribbon here; let's focus on what Spring Cloud itself provides.

Simulating the failure

Let's first simulate the failure. To make it easier to observe, lengthen the heartbeat-related times described above so the dead instance lingers in the registry.

Then start two provider instances and one consumer, with load balancing handled by Spring Cloud LoadBalancer. Send requests through the consumer and you will see the LoadBalancer rotate them evenly between the two providers (check the printed logs).
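
For reference, the consumer in such a demo can be as small as the sketch below. The service name user-service-provider matches the provider configuration shown earlier; the /hello and /test endpoints are assumptions made up for illustration:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.loadbalancer.LoadBalanced;
import org.springframework.context.annotation.Bean;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

@SpringBootApplication
@RestController
public class ConsumerApplication {

    @Bean
    @LoadBalanced // resolve the service name through the registry and load-balance calls
    RestTemplate restTemplate() {
        return new RestTemplate();
    }

    @Autowired
    private RestTemplate restTemplate;

    @GetMapping("/test")
    public String test() {
        // "user-service-provider" is resolved to one of the registered instances
        return restTemplate.getForObject("http://user-service-provider/hello", String.class);
    }

    public static void main(String[] args) {
        SpringApplication.run(ConsumerApplication.class, args);
    }
}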

Now kill one provider with kill -9. Requests through the consumer will then alternate: one succeeds, the next fails, and so on.

The solution

The retry mechanism is configured through the configuration items defined in Spring Cloud's LoadBalancerProperties class; see that class's properties for the full list of options.

Add the retry configuration to the consumer's application configuration:

spring:
  application:
    name: user-service-consumer
  cloud:
    nacos:
      discovery:
        server-addr: 127.0.0.1:8848
    loadbalancer:
      retry:
        enabled: true
        # Maximum number of retries on the same instance
        max-retries-on-same-service-instance: 1
        # Maximum number of retries on other instances
        max-retries-on-next-service-instance: 2
        # Retry all operations (use with caution, especially POST; requires idempotency)
        retry-on-all-operations: true

In the configuration above, retry is enabled by default (enabled defaults to true).

max-retries-on-same-service-instance is the number of retries on the current instance, not counting the initial request (the default is 0, in which case a failed request moves straight to another instance). It is set to 1 here, so the current instance gets one more attempt before the request is transferred; a larger value retries the current instance more times before failing over.

max-retries-on-next-service-instance is the maximum number of retry attempts transferred to other instances.

retry-on-all-operations defaults to false, meaning only GET requests are retried; setting it to true enables retries for all operations. Because retries are involved, the business must be kept idempotent.
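
Putting the three properties together, a single request conceptually flows as follows. This is a plain-Java sketch of the decision logic only, not the actual Spring Cloud implementation:

import java.util.List;
import java.util.function.Function;

class RetryFlowSketch {

    // Conceptual retry flow for one request; 'call' performs the actual HTTP call
    static <T> T callWithRetry(List<String> instances,
                               int maxRetriesOnSame,
                               int maxRetriesOnNext,
                               Function<String, T> call) {
        RuntimeException last = null;
        // Initial request plus maxRetriesOnSame retries on the chosen instance
        String first = instances.get(0);
        for (int attempt = 0; attempt <= maxRetriesOnSame; attempt++) {
            try {
                return call.apply(first);
            } catch (RuntimeException e) {
                last = e;
            }
        }
        // Fail over: up to maxRetriesOnNext more attempts on other instances
        for (int next = 1; next <= maxRetriesOnNext && next < instances.size(); next++) {
            try {
                return call.apply(instances.get(next));
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last; // all attempts failed
    }
}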

With this configuration in place, run the failure simulation again: even though the dead instance still appears in Nacos, requests are retried and transferred, and business is processed normally.

The Ribbon and similar components have comparable solutions, which you can research on your own.

A pitfall in the solution

There is a pitfall when using Spring Cloud LoadBalancer: you may find that the configuration above simply does not take effect. Why?

Spring Cloud LoadBalancer's retry support is built on spring-retry. If that dependency is not introduced, the configuration silently does nothing, and the official documentation does not make this explicit. Add the dependency:

<dependency>
    <groupId>org.springframework.retry</groupId>
    <artifactId>spring-retry</artifactId>
</dependency>

In addition, the example above is based on Spring Cloud 2020.0.0; other versions may require different configuration.

Summary

Integrating Spring Cloud components is not the end of the story when building microservices. As this article shows, even with Nacos integrated there are trade-offs imposed by the heartbeat mechanism, such as tuning the heartbeat frequency.

At the same time, even after the heartbeat parameters are tuned, other components are needed to retry failed requests and prevent a system avalanche. Follow along; more hands-on articles in this microservices series are on the way.

About the blogger: Author of the technology book SpringBoot Inside Technology, loves to delve into technology and writes technical articles.

Public account: “program new vision”, the blogger’s public account, welcome to follow ~

Technical exchange: Please contact the weibo user at Zhuan2quan