Why Hystrix?

Medium and large distributed systems usually have many dependencies, as shown in the following figure:

Under highly concurrent access, the stability of these dependencies has a great impact on the system. Dependencies, however, suffer from many problems outside our control, such as slow network connections, busy resources, temporary unavailability, and services going offline, as shown in the following figure:

Under high traffic, latency in a single back-end dependency can saturate all resources on all servers within seconds.

(PS: this means the service immediately becomes unable to handle subsequent requests.)

For example, consider an application that depends on 30 services, each with 99.99% uptime (52.56 minutes of downtime per year). The expected availability is as follows:

0.9999^30 = 99.7% available
0.3% of 100 million requests = 300,000 failed requests
Even if everything else goes well, the service is unavailable for 26.2 hours a year

When the cluster depends on 50 services:
0.9999^50 = 99.5% available, that is, unavailable for 43.8 hours a year
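The arithmetic above can be reproduced with a few lines of Java. This is only a sketch: it assumes the dependencies fail independently and that a year has 365 days (8,760 hours), and the class and method names are made up for illustration. Because the figures in the text are computed from rounded percentages, the unrounded results come out slightly lower.

```java
public class AvailabilityEstimate {
    // Assumption: a 365-day year, i.e. 8,760 hours.
    private static final double HOURS_PER_YEAR = 365 * 24;

    /**
     * Availability of a system that needs every dependency to be up,
     * assuming the dependencies fail independently of each other.
     */
    public static double compositeAvailability(double perServiceAvailability, int dependencies) {
        return Math.pow(perServiceAvailability, dependencies);
    }

    /** Expected hours of unavailability per year for a given availability. */
    public static double downtimeHoursPerYear(double availability) {
        return (1.0 - availability) * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        for (int n : new int[] {30, 50}) {
            double availability = compositeAvailability(0.9999, n);
            System.out.printf("%d dependencies: %.2f%% available, %.1f hours of downtime per year%n",
                    n, availability * 100, downtimeHoursPerYear(availability));
        }
    }
}
```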

In a distributed system, dependencies like these between services are very common, and a single business invocation usually depends on several underlying services, as shown in the following figure:

With synchronous invocation, when the member service is unavailable, the order service's request threads block while waiting for it. When a large number of requests invoke the member service, the resources of the entire order service may eventually be exhausted, leaving it unable to serve further requests. This unavailability can then propagate up the request invocation chain, a phenomenon known as the avalanche effect.

Common scenarios of the avalanche effect

Hardware faults: disk failures, server downtime, network jitter, power failure in the machine room, severed optical fiber, etc.

Traffic surge: abnormal traffic, or retries that amplify traffic.

Cache penetration: when a large number of cache entries expire within a short period, the resulting cache misses send requests straight to the back-end service, overloading the service provider and making the service unavailable.

Program bugs: memory leaks caused by program logic, long JVM Full GC pauses, scheduled tasks running during traffic peaks.

Synchronous wait: services are invoked synchronously, and resources are exhausted while callers wait.

Coping strategies for the avalanche effect

Different strategies apply to the different scenarios that cause the avalanche effect:

Hardware faults: disaster recovery across multiple machine rooms, and geographically distributed active-active deployment.

Traffic surge: automatic service scaling and traffic control (rate limiting and disabling retries).

Cache penetration: cache preloading, asynchronous cache loading, etc.

Program bugs: fix the bugs, release resources promptly, move scheduled tasks away from traffic peaks, etc.

Synchronous wait: resource isolation, decoupling through a message queue (MQ), failing fast on invocations of unavailable services, etc. Resource isolation usually means using a separate thread pool for each service being invoked. Failing fast on unavailable services is generally implemented with the circuit breaker pattern combined with a timeout mechanism.
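To make the synchronous-wait strategy more concrete, here is a minimal sketch that combines a dedicated thread pool (resource isolation), a timeout, and a simple circuit breaker using only the JDK. It is not how Hystrix itself is implemented: the class name SimpleCircuitBreaker, its parameters, and the simulated member service call are all hypothetical, and a real breaker would typically track failure rates over a time window rather than consecutive failures.

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

public class SimpleCircuitBreaker {
    private final ExecutorService pool;   // dedicated pool: one dependency cannot use up caller threads
    private final long timeoutMillis;     // how long a caller waits for the dependency
    private final int failureThreshold;   // consecutive failures before the breaker opens
    private final long openMillis;        // how long the breaker stays open before a trial call
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means the breaker is closed

    public SimpleCircuitBreaker(int poolSize, long timeoutMillis, int failureThreshold, long openMillis) {
        // No queue (SynchronousQueue): when every thread is busy, new work is rejected at once.
        this.pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), r -> {
                    Thread t = new Thread(r, "dependency-pool");
                    t.setDaemon(true); // do not keep the JVM alive in this demo
                    return t;
                });
        this.timeoutMillis = timeoutMillis;
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis();
        }
    }

    /** Runs the call in the isolated pool; returns the fallback on open breaker, timeout, or error. */
    public <T> T call(Callable<T> remoteCall, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // fail fast without touching the dependency
        }
        Future<T> future;
        try {
            future = pool.submit(remoteCall);
        } catch (RejectedExecutionException e) {
            recordFailure(); // pool saturated: treat as a failure
            return fallback.get();
        }
        try {
            T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            recordSuccess();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback.get();
        } catch (ExecutionException | TimeoutException e) {
            future.cancel(true); // interrupt the worker so its thread is freed
            recordFailure();
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker memberService = new SimpleCircuitBreaker(2, 100, 3, 5_000);
        for (int i = 0; i < 5; i++) {
            String result = memberService.call(
                    () -> { Thread.sleep(1_000); return "member data"; }, // simulated slow dependency
                    () -> "default member");
            System.out.println(result + " (breaker open: " + memberService.isOpen() + ")");
        }
    }
}
```

In this sketch the order service would wrap each member service call in call(...). Once the breaker opens, callers receive the fallback immediately instead of blocking a thread, which is what stops the failure from spreading up the invocation chain.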

Hystrix means porcupine, an animal that can protect itself because its back is covered with quills. The Hystrix discussed in this article is Netflix's open-source fault-tolerance framework, which likewise provides fault tolerance and self-protection.

Netflix Hystrix is a tool/framework that provides service isolation, circuit breaking, and degradation mechanisms in SOA and microservice architectures. It is a circuit breaker implementation for highly available microservice architectures and a weapon against service avalanches.
