1. Introduction



Hystrix is Netflix's open-source latency and fault-tolerance library. It isolates access to remote services and third-party libraries and prevents cascading failures.

In a distributed system, some dependencies will inevitably fail: timeouts, exceptions, and so on. Hystrix's job is to ensure that the failure of one dependency does not bring down the entire service. To that end it provides circuit breaking (fusing), isolation, fallbacks, caching, monitoring, and more, keeping the system usable even when one or more dependencies fail at the same time.

2. Avalanche problem

In a microservice architecture, the invocation relationships between services are complex. A single request may need to call several microservice interfaces, forming a long invocation chain:



As shown in the figure, a business request needs to call four services, A, P, H, and I, each of which may in turn call other services.

Now suppose one of these services fails:



For example, if microservice I fails and requests to it block, the user gets no response and the Tomcat thread serving the request is never released. As more user requests arrive, more and more threads pile up, blocked:



A server supports only a limited number of threads and a limited level of concurrency. With requests blocked indefinitely, the server eventually exhausts its resources and every other service on it becomes unavailable too, producing the avalanche effect.

This is like a car production line that builds different models from different parts. If one part becomes unavailable for whatever reason, the cars that need it cannot be assembled and sit waiting until the part arrives. If many models need that part, the whole factory ends up waiting and production is paralyzed: the impact of a single part keeps spreading.

Hystrix tackles the avalanche problem in two ways:

• Thread isolation
• Service circuit breaking (fusing)

2.1 Thread isolation and service degradation

1. Principle: Schematic diagram of thread isolation ☟



Hystrix allocates a small thread pool for each dependent service. If the pool is full, calls are rejected immediately rather than queued (there is no queue by default), which shortens the time needed to determine failure.

A user request no longer accesses the service directly; instead it is executed by a free thread from the pool. If the pool is full, or the request times out, the service is degraded.

Service degradation: prioritize core services; non-core services are made unavailable or only weakly available.

When a request fails, the user is not blocked: instead of waiting endlessly or seeing the system crash, the user at least sees an execution result (such as a friendly prompt).

Service degradation causes the request to fail, but it does not cause blocking. At worst it consumes resources in the thread pool of the dependent service, and it does not affect other services.

Hystrix service degradation is triggered when:

• the thread pool is full, or
• the request times out.
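
As an illustration, here is a minimal sketch of a thread-isolated command with a fallback, written against the plain Hystrix Java API. The group key, the lookup helper, and all names are hypothetical stand-ins rather than a real service:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetUserCommand extends HystrixCommand<String> {

    private final long userId;

    public GetUserCommand(long userId) {
        // Commands in the same group share one small, isolated thread pool
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected String run() throws Exception {
        // Runs on a pool thread, not the Tomcat request thread
        return remoteUserLookup(userId);
    }

    @Override
    protected String getFallback() {
        // Degraded result: returned on pool rejection, timeout, or any error in run()
        return "anonymous";
    }

    // Hypothetical stand-in for the real HTTP/RPC call to the user service
    private String remoteUserLookup(long id) {
        return "user-" + id;
    }
}

Calling new GetUserCommand(42L).execute() blocks the caller only up to the command's timeout; if the pool rejects the call or run() fails, the caller receives "anonymous" instead of tying up a Tomcat thread.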

2.2 Service circuit breaking (fusing)

1. Principle: a fuse is also called a circuit breaker. The circuit-breaker mechanism is a microservice link-protection mechanism against the avalanche effect. When a microservice on the fan-out link becomes unavailable or its response time grows too long, the service is degraded: calls to that node's microservice are "fused" and an error response is returned quickly. Once calls to that microservice are detected to respond normally again, the call link is restored.

In the Spring Cloud framework, circuit breaking is implemented with Hystrix. Hystrix monitors calls between microservices, and when the failure ratio reaches a threshold (by default, 50% of at least 20 calls within a 10-second window), the circuit breaker opens. The annotation for the circuit-breaker mechanism is @HystrixCommand.
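
For illustration, here is a hedged sketch of the annotation style in Spring Cloud. It assumes spring-cloud-starter-netflix-hystrix is on the classpath and the application class carries @EnableCircuitBreaker; the service URL and method names are made up:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class OrderService {

    private final RestTemplate restTemplate = new RestTemplate();

    // Failures and timeouts are counted by Hystrix; once enough of them
    // accumulate, the breaker opens and calls go straight to the fallback
    @HystrixCommand(fallbackMethod = "orderFallback")
    public String findOrder(long orderId) {
        // Illustrative remote call to another microservice
        return restTemplate.getForObject(
                "http://order-service/orders/" + orderId, String.class);
    }

    // The fallback must have a signature compatible with the guarded method
    public String orderFallback(long orderId) {
        return "order " + orderId + " is temporarily unavailable";
    }
}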

The fusing state machine has three states:

• Closed: all requests pass through normally.
• Open: all requests are degraded. Hystrix counts requests, and when the percentage of failed requests within a time window reaches the threshold, the fuse trips and the breaker opens fully. The default failure threshold is 50%, over a minimum of 20 requests.
• Half Open: the open state is not permanent. After the breaker opens, a sleep window begins (5 s by default); when it expires, the breaker automatically enters half-open mode and lets some requests through. If they succeed, the breaker closes completely; otherwise it stays open and the sleep timer restarts.

The circuit-breaker policy can be modified through configuration:

circuitBreaker.requestVolumeThreshold=10 // The minimum number of requests to trigger a fuse. The default is 20
circuitBreaker.sleepWindowInMilliseconds=10000 // Sleep duration, default is 5000 ms
circuitBreaker.errorThresholdPercentage=50 // The minimum proportion of failed requests that trigger a fuse, 50% by default
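The same knobs can also be set programmatically when a command is constructed. A minimal sketch using Hystrix's property setters, with an arbitrary group key and example values:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class PingCommand extends HystrixCommand<String> {

    public PingCommand() {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("PingGroup"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withCircuitBreakerRequestVolumeThreshold(10)        // min requests before fusing is considered
                        .withCircuitBreakerSleepWindowInMilliseconds(10000)  // how long the breaker stays open
                        .withCircuitBreakerErrorThresholdPercentage(50)));   // failure ratio that trips the fuse
    }

    @Override
    protected String run() {
        return "pong";  // placeholder for the real dependency call
    }

    @Override
    protected String getFallback() {
        return "degraded";
    }
}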

3. Traffic limiting

In flash-sale ("seckill") and other high-concurrency scenarios, requests must not all rush in at once and crowd the service; they should queue up and be admitted in an orderly way, N per second.
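
Hystrix itself is not a per-second rate limiter, but its semaphore isolation mode caps how many requests may touch a dependency concurrently, which gives a crude form of traffic limiting. A sketch using the javanica annotations; the limit of 10 and all names are illustrative:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.stereotype.Service;

@Service
public class SeckillService {

    @HystrixCommand(
        fallbackMethod = "seckillFallback",
        commandProperties = {
            // Execute on the caller's thread, but admit at most 10 callers at a time
            @HystrixProperty(name = "execution.isolation.strategy", value = "SEMAPHORE"),
            @HystrixProperty(name = "execution.isolation.semaphore.maxConcurrentRequests", value = "10")
        })
    public String seckill(long itemId) {
        // Illustrative order-placing logic for the flash-sale item
        return "order placed for item " + itemId;
    }

    public String seckillFallback(long itemId) {
        // The 11th concurrent caller lands here immediately instead of queueing
        return "too many buyers, please try again";
    }
}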

3.1 Steps on the official website

3.2 When does the circuit breaker start to work



There are three important parameters related to the circuit breaker: the snapshot time window, the total request threshold, and the error percentage threshold. A configuration sketch follows the list below.

1) Snapshot time window: the circuit breaker needs to collect request and error statistics to decide whether to open. The time range over which statistics are collected is the snapshot time window; by default, the most recent 10 seconds.

2) Total request threshold: within the snapshot time window, the number of requests must meet this threshold, 20 by default. If a Hystrix command is invoked fewer than 20 times within the 10 seconds, the breaker will not open even if every request times out or fails for other reasons.

3) Error percentage threshold: when the total number of requests within the snapshot window exceeds the threshold, say 30 calls, and 15 of those 30 (half) fail with timeout exceptions, the error percentage reaches 50%, and under the default 50% threshold the breaker opens.
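
These three parameters map directly onto Hystrix command properties. A minimal sketch wiring them up with the javanica annotations; the values shown are the documented defaults, and the payment logic is invented for illustration:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    @HystrixCommand(
        fallbackMethod = "payFallback",
        commandProperties = {
            // 1) snapshot time window: statistics are rolled over the last 10 s
            @HystrixProperty(name = "metrics.rollingStats.timeInMilliseconds", value = "10000"),
            // 2) total request threshold: fewer than 20 calls can never trip the breaker
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "20"),
            // 3) error percentage threshold: trip once 50% of calls in the window fail
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50")
        })
    public String pay(long orderId) {
        if (orderId < 0) {  // simulated dependency failure for the example
            throw new IllegalStateException("payment backend error");
        }
        return "paid order " + orderId;
    }

    public String payFallback(long orderId) {
        return "payment degraded for order " + orderId;
    }
}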

3.3 Conditions for opening or closing the circuit breaker

• 1 A request-volume threshold is met (by default, at least 20 requests within 10 seconds).
• 2 The failure rate reaches the threshold (by default, 50% or more of the requests within those 10 seconds fail).
• 3 When both thresholds are reached, the circuit breaker opens.
• 4 While it is open, no requests are forwarded; all of them are degraded.
• 5 After a period of time (5 seconds by default), the breaker becomes half-open and lets a single request through. If it succeeds, the breaker closes; if not, it stays open. Steps 4 and 5 then repeat; the sketch after this list shows the cycle in code.
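
The open → half-open → closed cycle can be observed from code. Below is a rough, self-contained sketch against the plain Hystrix API, with thresholds lowered so the breaker trips quickly; the command and its failure mode are made up:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class BreakerDemo {

    static class FlakyCommand extends HystrixCommand<String> {
        private final boolean fail;

        FlakyCommand(boolean fail) {
            super(Setter
                    .withGroupKey(HystrixCommandGroupKey.Factory.asKey("FlakyGroup"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withCircuitBreakerRequestVolumeThreshold(5)
                            .withCircuitBreakerSleepWindowInMilliseconds(5000)));
            this.fail = fail;
        }

        @Override
        protected String run() {
            if (fail) throw new RuntimeException("boom");
            return "ok";
        }

        @Override
        protected String getFallback() {
            return "fallback";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Enough failures to cross the volume (5) and error (50%) thresholds
        for (int i = 0; i < 10; i++) new FlakyCommand(true).execute();
        Thread.sleep(1000);  // let Hystrix refresh its health snapshot

        FlakyCommand probe = new FlakyCommand(false);
        System.out.println(probe.execute());               // "fallback": short-circuited
        System.out.println(probe.isCircuitBreakerOpen());  // true: breaker is open

        Thread.sleep(6000);  // wait out the 5 s sleep window
        System.out.println(new FlakyCommand(false).execute());  // half-open trial succeeds: "ok"
        System.out.println(new FlakyCommand(false).isCircuitBreakerOpen());  // false: closed again
    }
}

Hystrix refreshes its health snapshot only every 500 ms by default, which is why the sketch sleeps briefly before probing the breaker state.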

3.4 After the circuit breaker opens

1) While the breaker is open, incoming calls skip the main logic and go directly to the degraded fallback. The circuit breaker thus detects faults automatically and switches to the degraded logic, reducing response latency.

2) How is the original main logic restored? Hystrix provides automatic recovery. When the breaker opens and fuses the main logic, Hystrix starts a sleep time window; within that window the degraded logic temporarily serves as the main logic. When the sleep window expires, the breaker enters the half-open state and releases one request to the original main logic. If that request returns normally, the breaker closes and the main logic resumes; if the request still fails, the breaker stays open and the sleep window restarts.