Preface

In a distributed system, the unavailability of one basic service often leads to the unavailability of the whole system. This phenomenon is called the service avalanche effect. A common response to a service avalanche is manual service degradation; Hystrix offers us another option.

Definition of service avalanche effect

A service avalanche effect is a process in which the unavailability of a service provider leads to the unavailability of its service callers and gradually amplifies that unavailability, as shown in the figure below:

In the figure above, A is a service provider, B is a service caller of A, and C and D are service callers of B. A service avalanche occurs when the unavailability of A causes the unavailability of B, which in turn makes C and D unavailable, amplifying the effect.

The causes of the service avalanche effect

I simplify the participants in a service avalanche to service providers and service callers, and divide the avalanche process into the following three stages to analyze its causes:

  1. The service provider becomes unavailable

  2. Retries increase traffic

  3. The service caller becomes unavailable

Each stage of a service avalanche can have different causes. For example, the service provider can become unavailable because of:

  • Hardware failure

  • Program bugs

  • Cache breakdown

  • A surge of user requests

A hardware failure may take the server host down, or a network hardware failure may make the service provider unreachable. Cache breakdown typically happens when a cache application restarts and all cached entries are lost, or when a large number of entries expire within a short period. The resulting cache misses send requests straight through to the back end, overloading the service provider until it becomes unavailable. A flood of user requests can likewise make the service provider unavailable if capacity is not prepared before a flash sale or promotional push begins.

Retries increase traffic for the following reasons:

  • User retries

  • Retry logic in the calling code

Once the service provider becomes unavailable, users grow impatient with the long wait, keep refreshing the page, and even resubmit forms. On the calling side, retry logic kicks in after service exceptions. These retries further increase request traffic.

Finally, the main cause of service caller unavailability is:

  • Resource exhaustion caused by synchronous waiting

When the service caller invokes the provider synchronously, a large number of waiting threads consume system resources. Once its thread resources are exhausted, the services provided by the caller itself also become unavailable, and the service avalanche effect takes hold.
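A minimal sketch of this failure mode, assuming a hypothetical slowProviderCall() that hangs while the provider is down: the caller serves requests from a fixed-size worker pool, and once every worker thread is stuck in a synchronous wait, the caller itself can no longer serve anyone.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SynchronousWaitDemo {

    // The caller's worker pool: 10 threads serve all incoming requests.
    private static final ExecutorService WORKERS = Executors.newFixedThreadPool(10);

    // Hypothetical downstream call that never returns while the provider is down.
    private static String slowProviderCall() throws InterruptedException {
        TimeUnit.MINUTES.sleep(10);   // simulates an unbounded synchronous wait
        return "response";
    }

    public static void main(String[] args) {
        // Once 10 requests are in flight, every worker thread is blocked inside
        // slowProviderCall(), and request #11 can only sit in the queue:
        // the caller itself is now effectively unavailable.
        for (int i = 0; i < 11; i++) {
            WORKERS.submit(() -> slowProviderCall());
        }
    }
}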

Service avalanche coping strategies

Different strategies can be used for different causes of the service avalanche:

  1. Flow control

  2. Improved caching mode

  3. Automatic expansion of service capacity

  4. The service caller degrades the service

Specific measures of flow control include:

  • Gateway rate limiting

  • User interaction rate limiting

  • Disabling retries

Because of NGINX's high performance, many front-line Internet companies use NGINX + Lua gateways for traffic control, which is also why OpenResty keeps gaining popularity.
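The gateway itself is usually NGINX + Lua, but the core of gateway rate limiting is just a counter or a token bucket. Here is a conceptual sketch in plain Java (not an OpenResty configuration; class and parameter names are illustrative):

public class TokenBucketLimiter {

    private final long capacity;         // maximum burst size
    private final long refillPerSecond;  // steady-state requests allowed per second
    private long tokens;
    private long lastRefillMillis = System.currentTimeMillis();

    public TokenBucketLimiter(long capacity, long refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
    }

    /** Returns true if the request may pass the gateway, false if it should be rejected. */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long refill = (now - lastRefillMillis) * refillPerSecond / 1000;
        if (refill > 0) {
            tokens = Math.min(capacity, tokens + refill);   // add tokens for the elapsed time
            lastRefillMillis = now;
        }
        if (tokens > 0) {
            tokens--;
            return true;
        }
        return false;   // bucket empty: reject instead of forwarding to the back end
    }
}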

Specific measures for limiting user-interaction traffic include: 1. Use loading animations to make users more willing to wait. 2. Add a forced wait interval to submit buttons so they cannot be clicked repeatedly.

Measures to improve the caching pattern include:

  • Cache preloading

  • Refreshing the cache asynchronously instead of synchronously (a sketch of both measures follows this list)
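As a rough illustration of both measures (not from the original article): the cache is warmed before traffic arrives, and a background task refreshes entries so requests never reload the back end synchronously. loadFromBackend() is a hypothetical loader.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AsyncRefreshCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();

    public AsyncRefreshCache() {
        // Asynchronous refresh: a background task reloads entries every 30 seconds,
        // so requests never hit the back end synchronously when an entry goes stale.
        refresher.scheduleAtFixedRate(this::refreshAll, 30, 30, TimeUnit.SECONDS);
    }

    // Cache preloading: warm the cache before traffic (e.g. a flash sale) arrives.
    public void preload(Iterable<String> keys) {
        for (String key : keys) {
            cache.put(key, loadFromBackend(key));
        }
    }

    // Requests only ever read the in-memory copy.
    public String get(String key) {
        return cache.get(key);
    }

    private void refreshAll() {
        for (String key : cache.keySet()) {
            cache.put(key, loadFromBackend(key));
        }
    }

    // Hypothetical back-end loader.
    private String loadFromBackend(String key) {
        return "value-of-" + key;
    }
}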

The measures for automatic expansion of service capacity mainly include:

  • AWS’s auto scaling

Measures taken by a service caller to degrade the service include:

  • Resource isolation

  • Classification of dependent services

  • Failing fast when invoking unavailable services

Resource isolation mainly means isolating the thread pools used to call each dependent service.

Based on the specific business, we classify dependent services as strong dependencies or weak dependencies. The unavailability of a strongly dependent service terminates the current business flow, while the unavailability of a weakly dependent service does not.

Failing fast when calling an unavailable service is usually achieved through timeout mechanisms, fuses (circuit breakers), and the degradation logic that runs after the fuse opens.
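A minimal sketch of the timeout part, assuming hypothetical callService() and fallbackResponse() methods: the call is bounded by a deadline, and on timeout the caller fails fast to a degraded result instead of waiting indefinitely.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailFastCall {

    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    public static String callWithTimeout() {
        Future<String> future = POOL.submit(FailFastCall::callService);
        try {
            // Wait at most 500 ms instead of blocking indefinitely.
            return future.get(500, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);          // give up on the slow call
            return fallbackResponse();    // degrade instead of piling up synchronous waits
        } catch (Exception e) {
            return fallbackResponse();
        }
    }

    // Hypothetical remote call that is currently slow.
    private static String callService() throws InterruptedException {
        TimeUnit.SECONDS.sleep(2);
        return "real response";
    }

    // Hypothetical degraded result.
    private static String fallbackResponse() {
        return "cached or default response";
    }
}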

Preventing service avalanches with Hystrix

Hystrix is named after the porcupine, which protects itself with the quills on its back. Netflix's Hystrix is a library that provides timeout handling and fault tolerance for interactions between distributed services, and it likewise protects the system that uses it.

Hystrix’s design principles include:

  • Resource isolation

  • Fuse

  • Command mode

Resource isolation

To keep a leak or a fire from spreading, a cargo ship divides its hull into several isolated compartments, as shown in the figure below:

This way of isolating resources to limit risk is called the Bulkheads pattern. Hystrix applies the same pattern to service callers.

In a highly service-oriented system, a piece of business logic usually depends on multiple services. For example, a product detail service may depend on the product service, the price service, and the product review service, as shown in the figure:

The calls to these three dependent services share the product detail service's thread pool. If the product review service becomes unavailable, every thread in the pool ends up blocked waiting for its response, and a service avalanche follows. As shown in the figure:

Hystrix avoids service avalanches by giving each dependent service its own thread pool, isolating their resources. As shown in the figure below, when the product review service is unavailable, even if all 20 threads allocated to it are stuck in synchronous waits, calls to the other dependent services are unaffected.
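Outside Hystrix, the same bulkhead idea can be sketched with one fixed-size pool per dependency (class names, pool names, and the placeholder calls here are illustrative, not the article's code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ProductDetailBulkheads {

    // One isolated pool per dependency: exhausting one pool cannot starve the others.
    private final ExecutorService productPool = Executors.newFixedThreadPool(20);
    private final ExecutorService pricePool   = Executors.newFixedThreadPool(20);
    private final ExecutorService reviewPool  = Executors.newFixedThreadPool(20);

    public void renderDetailPage() {
        Future<String> product = productPool.submit(() -> "result of product service call");
        Future<String> price   = pricePool.submit(() -> "result of price service call");
        // Even if every thread in reviewPool is blocked on a hung review service,
        // the product and price calls above still have their own threads available.
        Future<String> reviews = reviewPool.submit(() -> "result of product review service call");
    }
}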

Fuse mode

The fuse pattern defines the logic by which the fuse switch transitions between its states:

Failure rate of the service = number of failed requests / total number of requests. The fuse decides whether to switch from closed to open by comparing the current failure rate against a configured threshold. For example, if 30 of the last 50 requests failed, the failure rate is 60%, which reaches a 60% threshold and opens the fuse.

  1. When the fuse switch is closed, requests are allowed through the fuse. If the current failure rate is below the configured threshold, the switch stays closed. If the failure rate reaches the threshold, the switch flips to the open state.

  2. When the fuse switch is open, requests are rejected immediately.

  3. After the fuse has been open for a period of time, it automatically enters the half-open state and lets exactly one request through. If that request succeeds, the fuse returns to the closed state; if it fails, the fuse goes back to the open state and subsequent requests keep being rejected.

The fuse switch ensures that when a service caller invokes an unhealthy service, a result is returned quickly, avoiding a pile-up of synchronous waits. After a period of time the fuse probes the service again, providing a path for the service invocation to recover.
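A minimal sketch of this state machine (not Hystrix's actual implementation; the counting, threshold handling, and sleep window are simplified):

public class SimpleFuse {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final double failureRateThreshold;  // e.g. 0.6 for 60%
    private final long sleepWindowMillis;       // how long to stay open before letting a probe through

    private State state = State.CLOSED;
    private long openedAt;
    private long total;
    private long failed;

    public SimpleFuse(double failureRateThreshold, long sleepWindowMillis) {
        this.failureRateThreshold = failureRateThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    /** Decide whether the next request may pass through the fuse. */
    public synchronized boolean allowRequest() {
        if (state == State.OPEN && System.currentTimeMillis() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN;   // sleep window has passed: allow exactly one probe request
            return true;
        }
        return state == State.CLOSED;
    }

    /** Report the outcome of a request that was allowed through. */
    public synchronized void record(boolean success) {
        if (state == State.HALF_OPEN) {
            // The single probe decides whether the fuse closes again or reopens.
            state = success ? State.CLOSED : State.OPEN;
            openedAt = System.currentTimeMillis();
            total = 0;
            failed = 0;
            return;
        }
        total++;
        if (!success) {
            failed++;
        }
        if (state == State.CLOSED && (double) failed / total >= failureRateThreshold) {
            state = State.OPEN;        // failure rate reached the threshold: open the fuse
            openedAt = System.currentTimeMillis();
        }
    }
}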

Command mode

Hystrix uses the command pattern (by extending the HystrixCommand class) to wrap the specific service invocation logic (the run method) and adds degradation logic (getFallback) to be executed when the invocation fails. The command's constructor is also where we configure the thread pool and fuse parameters for the service. The code looks like this:

import com.netflix.hystrix.*;

public class Service1HystrixCommand extends HystrixCommand<Response> {

    private Service1 service;
    private Request request;

    public Service1HystrixCommand(Service1 service, Request request) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("ServiceGroup"))
                .andCommandKey(HystrixCommandKey.Factory.asKey("service1query"))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("service1ThreadPool"))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(20))                                    // size of the thread pool for this service
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withCircuitBreakerErrorThresholdPercentage(60)       // failure rate at which the fuse flips from closed to open
                        .withCircuitBreakerSleepWindowInMilliseconds(3000))); // how long the fuse stays open before probing again
        this.service = service;
        this.request = request;
    }

    @Override
    protected Response run() {
        return service.call(request);
    }

    @Override
    protected Response getFallback() {
        return Response.dummy();
    }
}

Once the service call is wrapped in a command this way, it gains fuse and thread-pool protection.
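A command instance is used for a single invocation. A minimal usage sketch (service1 and request stand for objects the caller already has; Future comes from java.util.concurrent):

// Synchronous: execute() blocks until run() returns, or returns getFallback() on failure.
Response response = new Service1HystrixCommand(service1, request).execute();

// Asynchronous: queue() returns immediately with a Future for the result.
Future<Response> futureResponse = new Service1HystrixCommand(service1, request).queue();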

Internal processing logic for Hystrix

Below is the internal logic of the Hystrix service invocation:

  1. Build Hystrix’s Command object and call the execution method.

  2. Hystrix checks whether the fuse switch for the current service is open; if it is, the degradation method getFallback is executed.

  3. If the fuse switch is closed, Hystrix checks whether the thread pool for the current service can accept new requests; if the pool is full, the degradation method getFallback is executed.

  4. If the thread pool accepts the request, Hystrix executes the run method containing the specific service logic.

  5. If the service invocation fails, the degradation method getFallback is executed, and the result is reported to Metrics to update the service's health.

  6. If the service invocation times out, the degradation method getFallback is executed, and the result is reported to Metrics to update the service's health (a small example of this case follows the list).

  7. If the service executed successfully, the normal result is returned.

  8. If the service degradation method getFallback is successful, the degradation result is returned.

  9. If the service degradation method getFallback fails, an exception is thrown.
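To make steps 6 and 8 concrete, here is a small hypothetical command (not from the original article) whose run method always exceeds its configured timeout, so the caller receives the fallback result:

import com.netflix.hystrix.*;

// A toy command whose run() outlives its timeout, so Hystrix executes getFallback() instead.
public class SlowCommand extends HystrixCommand<String> {

    public SlowCommand() {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("DemoGroup"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(100)));  // time the run() call out after 100 ms
    }

    @Override
    protected String run() throws Exception {
        Thread.sleep(1000);          // far longer than the 100 ms timeout
        return "never returned";
    }

    @Override
    protected String getFallback() {
        return "degraded result";    // what the caller actually receives; the timeout is reported to Metrics
    }
}

Calling new SlowCommand().execute() returns "degraded result" after roughly 100 ms instead of blocking for the full second.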

Implementation of Hystrix Metrics

Hystrix’s Metrics store the current health of the service, including the total number of service calls and the number of service calls that failed. Based on the Metrics count, the fuse can then calculate the failure rate of the current service invocation, which can be compared with the set threshold to determine the fuse’s state switching logic. Therefore, the implementation of Metrics is very important.

Sliding window implementation before 1.4

In versions before 1.4, Hystrix uses its own sliding-window data structure to record counts of various events (success, failure, timeout, thread-pool rejection, etc.) within the current time window. When an event occurs, the data structure decides, based on the current time, whether to count it in an existing bucket or to create a new one, and then updates the counters in that bucket. These updates happen concurrently across many threads, the code contains many locking operations, and the logic is fairly complex.
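The bucket idea behind that structure can be sketched roughly as follows (a simplified illustration, not Hystrix's code): the window is divided into fixed-length buckets, each event is counted in the bucket for the current time slot, and the window total is the sum of the buckets that are still fresh.

import java.util.concurrent.atomic.AtomicLongArray;

public class BucketedSlidingWindow {

    private final int bucketCount;            // e.g. 10 buckets of 1 second each
    private final long bucketSizeMillis;
    private final AtomicLongArray counts;     // event count per bucket
    private final AtomicLongArray bucketIds;  // which time slot each bucket currently holds

    public BucketedSlidingWindow(int bucketCount, long bucketSizeMillis) {
        this.bucketCount = bucketCount;
        this.bucketSizeMillis = bucketSizeMillis;
        this.counts = new AtomicLongArray(bucketCount);
        this.bucketIds = new AtomicLongArray(bucketCount);
    }

    /** Record one event (e.g. a failure) in the bucket for the current time. */
    public void increment() {
        long slot = System.currentTimeMillis() / bucketSizeMillis;  // global index of the current time slot
        int index = (int) (slot % bucketCount);                     // physical bucket it maps to
        if (bucketIds.get(index) != slot) {
            synchronized (this) {               // the pre-1.4 style: locking around bucket rollover
                if (bucketIds.get(index) != slot) {
                    counts.set(index, 0);       // the bucket held an older slot's data: reset it
                    bucketIds.set(index, slot);
                }
            }
        }
        counts.incrementAndGet(index);
    }

    /** Sum of all buckets that are still inside the sliding window. */
    public long sum() {
        long currentSlot = System.currentTimeMillis() / bucketSizeMillis;
        long total = 0;
        for (int i = 0; i < bucketCount; i++) {
            if (currentSlot - bucketIds.get(i) < bucketCount) {  // bucket is still fresh
                total += counts.get(i);
            }
        }
        return total;
    }
}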

Sliding window implementation since 1.5

In versions 1.5 and later, Hystrix implements sliding windows with RxJava's Observable.window(). RxJava's window operator uses a background thread to create new buckets, which avoids the problem of creating buckets concurrently, and its single-threaded, lock-free style keeps the count updates thread-safe, making the code much cleaner. Below is a simple sliding-window Metrics I implemented with RxJava's window method; the statistics take only a few lines of code, which is enough to demonstrate the power of RxJava:

@Test
public void timeWindowTest() throws Exception {
    // Emit a random 0 or 1 every 50 ms to simulate success/failure events.
    Observable<Integer> source = Observable.interval(50, TimeUnit.MILLISECONDS)
            .map(i -> RandomUtils.nextInt(2));
    // Group the events into 1-second windows and count each type per window.
    source.window(1, TimeUnit.SECONDS).subscribe(window -> {
        int[] metrics = new int[2];
        window.subscribe(i -> metrics[i]++,
                InternalObservableUtils.ERROR_NOT_IMPLEMENTED,
                () -> System.out.println("window metrics: " + JSON.toJSONString(metrics)));
    });
    TimeUnit.SECONDS.sleep(3);
}

Conclusion

By using Hystrix, we can easily prevent avalanche effects while enabling the system to automatically degrade and automatically restore service.