Singles' Day is approaching. The first Singles' Day sale in 2009 had a transaction volume of only 50 million yuan; by 2019 it had reached 268.4 billion yuan. This year brings the 12th Double 11, which is exciting just to think about.

People at Alibaba like to treat Double 11 as team building. As the popular saying goes: war is the best team building. Those who have not been through Double 11 are colleagues, while those who have are comrades-in-arms.

In my last post, I explained service avalanches and circuit breaking through a story from the Three Kingdoms, and I reinvented the wheel myself by writing a circuit breaker. This article covers the two traffic-protection components used by major tech companies, Sentinel and Hystrix, and compares them side by side.

The main topics of this article are: circuit breaking, downgrading, rate limiting and isolation; Hystrix; Sentinel; and a comparison of the two.

This article has been included on GitHub: github.com/Jackson0714…

1. Circuit breaking, downgrading, rate limiting and isolation

In the face of highly concurrent traffic, we usually rely on four techniques (circuit breaking, downgrading, rate limiting and isolation) to protect the system from sudden traffic spikes. The two traffic-protection components introduced today are built for exactly this purpose. First, a quick primer on the concepts.

  • What is a circuit breaker?

Key words: circuit breaker protection. Suppose service A calls service B, and requests take too long because of network problems, a crash of service B, or slow processing in service B. If this happens several times within a certain period, the circuit to service B can be opened (service A stops sending requests to service B). Calls to service B then return degraded data immediately instead of waiting for service B to execute. This way, the problems of service B do not cascade to service A.

  • What is a downgrade?

Key words: return degraded data. When a website is at peak traffic and server pressure rises sharply, some services and pages are strategically degraded according to the current business situation and traffic (the service is stopped and every call returns degraded data directly). This relieves the pressure on server resources, keeps the core business running normally, and keeps responses correct for most customers. Degraded data can be thought of as a fast failure response, for example the front-end page telling the user “the server is busy, please try again later.”

  • What is current limiting?

    Request traffic is controlled so that only part of the requests are let through, keeping the load within what the service can bear.

  • What are the similarities between a circuit breaker and a downgrade?

    • Circuit breaking and downgrading are both used to ensure the availability and reliability of most services in a cluster and to prevent core services from crashing.
    • To the end user, both appear as a feature being unavailable.
  • What’s the difference between a circuit breaker and a downgrade?

    • Circuit breaking is triggered automatically when a fault occurs in a downstream service.
    • Downgrading is a deliberate, system-wide decision: some normally working services are stopped to free up resources.
  • What is isolation?

    • Each service is treated as an independent system, so a problem in one service does not affect the others.
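
The circuit breaker described above (open the circuit after repeated failures within a period, then return degraded data directly) can be sketched roughly as follows. This is a minimal single-threaded illustration in Python, not the Hystrix or Sentinel implementation. The class name, the parameters, and the trial call allowed after the open period are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Counts failures in a time window; opens and returns degraded data."""

    def __init__(self, max_failures=3, window_s=10.0, open_s=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s      # period in which failures are counted
        self.open_s = open_s          # how long to stay open before a trial call
        self.clock = clock            # injectable clock (hypothetical, eases testing)
        self.failures = []            # timestamps of recent failures
        self.opened_at = None         # None means the breaker is closed

    def call(self, func, fallback):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.open_s:
            return fallback()         # open: do not touch service B at all
        # Closed, or the open period has elapsed and this is a trial call.
        try:
            result = func()
        except Exception:
            self._record_failure(now)
            return fallback()
        # A success closes the breaker and clears the failure history.
        self.opened_at = None
        self.failures.clear()
        return result

    def _record_failure(self, now):
        if self.opened_at is not None:
            self.opened_at = now      # the trial call failed: open again
            return
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.opened_at = now
```

Service A would wrap each call as breaker.call(call_service_b, degraded_data), where both function names are placeholders. A production breaker also needs thread safety, which this sketch leaves out.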

2. Hystrix

What is Hystrix

Hystrix: a framework for ensuring high availability, produced by Netflix (Netflix is a video streaming site, comparable to iQiyi in China).

The history of Hystrix

  • In 2011, the Netflix API team developed the Hystrix framework to improve system availability and stability.

  • In 2012, Hystrix became mature and stable, and other teams started to use it as well.

  • In November 2018, Hystrix announced on its GitHub page that it would no longer develop new features, and recommended that developers use other open source projects that are still active. But Hystrix is still valuable and powerful, and is used by many top-tier Internet companies in China.

Hystrix design philosophy

  • Prevent the avalanche effect from spreading across services.
  • Fail fast and recover quickly.
  • Degrade gracefully.
  • Use resource isolation techniques such as bulkheads, swimlanes, and circuit breakers.
  • Provide near real-time monitoring, alerting, and operational control.

Hystrix thread pool isolation technology

With thread pool isolation, each service gets its own thread pool. For example, three services A, B and C are each allocated a pool of 10 threads, 30 threads in total. When all 10 threads in service A's pool are in use, further calls to service A cannot obtain a thread even if the request volume keeps growing, because service A's allocation is exhausted. Service A does not take threads from the pools of other services, so the other services are not affected. Hystrix uses thread pool isolation by default.
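
To make the A/B/C example concrete, here is a small Python sketch of the idea. It is not Hystrix code: ServicePool and demo are hypothetical names, and rejecting a call with None when the pool is exhausted is a simplification chosen for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ServicePool:
    """A dedicated, bounded thread pool for calls to one downstream service."""

    def __init__(self, name, size):
        self.name = name
        self.size = size
        self._executor = ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
        self._in_flight = 0
        self._lock = threading.Lock()

    def submit(self, func, *args):
        # Reject instead of queueing when every thread of this pool is busy.
        with self._lock:
            if self._in_flight >= self.size:
                return None
            self._in_flight += 1
        future = self._executor.submit(func, *args)
        future.add_done_callback(self._release)
        return future

    def _release(self, _future):
        with self._lock:
            self._in_flight -= 1

    def shutdown(self):
        self._executor.shutdown(wait=True)


def demo():
    # Three services, 10 threads each, 30 threads in total.
    pools = {name: ServicePool(name, 10) for name in ("A", "B", "C")}
    release = threading.Event()
    # Occupy all 10 threads of pool A with calls that hang until released.
    hung = [pools["A"].submit(release.wait) for _ in range(10)]
    rejected = pools["A"].submit(lambda: "a")     # pool A is exhausted
    ok = pools["B"].submit(lambda: "b").result()  # pool B is unaffected
    release.set()
    for future in hung:
        future.result()
    for pool in pools.values():
        pool.shutdown()
    return rejected, ok
```

In demo, the eleventh call to service A is rejected while a call to service B still succeeds, which is the isolation property the paragraph describes.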

Benefits of thread pool isolation

  • Each dependent service has its own isolated thread pool, so even if that pool fills up, calls to other services are not affected.
  • The health of each thread pool is reported, and the call configuration of a dependent service can be changed in near real time.
  • Thread pools are asynchronous, so an asynchronous invocation layer can be built on top of them.
  • It has a mechanism for timeout detection, which is especially useful for inter-service calls.

Disadvantages of thread pool isolation technology

  • Thread pools bring their own costs, such as thread switching and thread management, which increase CPU overhead.
  • If the threads in a pool are rarely used, the pool is a waste of resources.

Hystrix semaphore isolation technology

In simple terms, a pool holds a fixed number of semaphores (permits). Before service A calls service B, it must acquire a permit from the pool, and only then may it call service B.
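
A rough Python sketch of this acquire-then-call pattern follows. The class name and the fallback argument are hypothetical, and the choice to degrade immediately when no permit is free (rather than wait) is an assumption of the example, not a statement about Hystrix internals.

```python
import threading


class SemaphoreIsolation:
    """Limits concurrent calls to one dependency with a fixed number of permits."""

    def __init__(self, permits):
        self._sem = threading.BoundedSemaphore(permits)

    def call(self, func, fallback):
        # Non-blocking acquire: if no permit is left, degrade immediately.
        if not self._sem.acquire(blocking=False):
            return fallback()
        try:
            return func()  # runs on the caller's own thread, no extra pool
        finally:
            self._sem.release()
```

Because the call runs on the caller's thread, there is no thread switching, but a slow call also cannot be cut off from outside.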

Scenario comparison of thread pool isolation and semaphore isolation

  • Thread pool isolation suits most scenarios, but a timeout must be configured for the service call.
  • Semaphore isolation suits complex internal business logic that does not involve network requests.

3. Sentinel

3.1. What is Sentinel

Sentinel: a traffic control component for distributed service architectures. It takes traffic as its entry point and helps developers keep microservices stable along several dimensions, including rate limiting, traffic shaping, circuit breaking and downgrading, system load protection, and hotspot protection.

3.2 History of Sentinel

  • Sentinel was born in 2012, and its main function was inbound traffic control.
  • From 2013 to 2017, Sentinel developed rapidly within Alibaba Group and became a basic technology module covering all core scenarios. Along the way it accumulated a large number of traffic consolidation scenarios and production practices.
  • Sentinel became open source in 2018 and continues to evolve.
  • In 2019, Sentinel continued to explore multilingual extensions with the release of a C++ native version and Envoy cluster traffic control support for Service Mesh scenarios to address multilingual traffic limiting issues under the Service Mesh architecture.
  • In 2020, Sentinel Go was released, continuing its evolution toward cloud native.

3.3. Characteristics of Sentinel

  • Rich application scenarios. Sentinel has supported the core scenarios of Alibaba's Double 11, including seckill (flash sales), message peak clipping, cluster flow control, and real-time circuit breaking of unavailable downstream services.
  • Complete real-time monitoring. You can see per-second data for each machine connected to the application, as well as a summary for the cluster.
  • Extensive open source ecosystem. Sentinel integrates with Spring Cloud, Dubbo, and gRPC.
  • Complete SPI extension points. Implement the extension interfaces to quickly customize logic.


3.4. Composition of Sentinel

  • The core library (Java client) does not depend on any framework/library, can run in all Java runtime environments, and has good support for Spring Cloud, Dubbo and other frameworks.
  • The Console (Dashboard) is based on Spring Boot and can be packaged to run directly without the need for additional application containers such as Tomcat.

3.5. Sentinel resources

A resource is a core concept in Sentinel. It can be anything in a Java application: a service the application provides, or even a piece of code.

Any code defined through the Sentinel API is a resource and can be protected by Sentinel. Resources can be identified in the following ways:

  • Method signature.
  • URL.
  • Service name, etc.

3.6 Design concept of Sentinel

As a flow controller, Sentinel can shape randomly arriving requests into a suitable pattern as needed.

4. Comparison

4.1 Comparison of isolation design

  • Hystrix

Hystrix provides two isolation strategies, thread pool isolation and semaphore isolation.

Thread pool isolation is the most recommended and most commonly used strategy in Hystrix. Its advantage is a high degree of isolation: one resource cannot affect another. But threads have costs of their own. Context switching consumes CPU, which hurts noticeably when low latency is required, and every thread needs memory, so the more threads are created, the more memory must be allocated. If a thread pool is created for every resource, the switching cost grows even further.

Hystrix's semaphore isolation limits the number of concurrent calls to a resource. It is lightweight and does not explicitly create a thread pool, but it has the drawback that it cannot automatically degrade slow calls. It can only wait for the client side to time out, so cascading blocking is still possible.

  • Sentinel

Sentinel provides semaphore-style isolation through flow control in concurrent thread count mode. It also offers circuit breaking based on response time, which prevents too many slow calls from filling up the concurrency quota and affecting the whole system.

4.2 Comparison of fuse downgrading

Sentinel and Hystrix both follow the circuit breaker pattern. Both support circuit breaking based on the exception ratio, but Sentinel is more capable: it can break the circuit based on response time, exception ratio, or exception count.
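
The three trigger conditions can be pictured as three small predicates over the statistics of the current window. The sketch below is only an illustration in Python: WindowStats, the function names, and the min_calls guard are hypothetical and are not taken from the Sentinel API.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    total: int   # calls seen in the current statistics window
    errors: int  # calls that raised an exception
    slow: int    # calls slower than the response-time threshold


def trip_on_slow_ratio(s, max_ratio, min_calls=5):
    # Response-time based: too large a share of slow calls.
    return s.total >= min_calls and s.slow / s.total > max_ratio


def trip_on_error_ratio(s, max_ratio, min_calls=5):
    # Exception-ratio based: too large a share of failed calls.
    return s.total >= min_calls and s.errors / s.total > max_ratio


def trip_on_error_count(s, max_errors, min_calls=5):
    # Exception-count based: too many failed calls in absolute terms.
    return s.total >= min_calls and s.errors > max_errors
```

The min_calls guard is an assumption added so that one slow call out of two does not open the circuit.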

4.3 Comparison of real-time statistics

Sentinel and Hystrix both collect real-time statistics with sliding windows. Hystrix uses an event-driven model based on RxJava: it emits an event when a service call succeeds, fails, or times out, and a series of transformations and aggregations turns these events into a stream of real-time metrics that circuit breakers or dashboards can consume. Sentinel implements its sliding window with LeapArray.
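
For readers who have not seen a bucketed sliding window before, here is a simplified Python sketch of the general technique: a ring of fixed-length buckets where a stale bucket is reset when its slot is reused. It is written from the general idea only and does not reproduce LeapArray itself; the class name, bucket sizes, and injectable clock are assumptions.

```python
import time


class SlidingWindowCounter:
    """Ring of fixed-length buckets; stale buckets are reset when reused."""

    def __init__(self, bucket_ms=500, buckets=2,
                 clock_ms=lambda: int(time.time() * 1000)):
        self.bucket_ms = bucket_ms
        self.n = buckets
        self.clock_ms = clock_ms
        self.starts = [-1] * buckets  # start time of the bucket held in each slot
        self.counts = [0] * buckets

    def _current_slot(self, now):
        idx = (now // self.bucket_ms) % self.n
        start = now - now % self.bucket_ms
        if self.starts[idx] != start:  # slot holds an old bucket: recycle it
            self.starts[idx] = start
            self.counts[idx] = 0
        return idx

    def add(self, n=1):
        self.counts[self._current_slot(self.clock_ms())] += n

    def total(self):
        # Sum only the buckets that still fall inside the window.
        now = self.clock_ms()
        window = self.bucket_ms * self.n
        return sum(c for s, c in zip(self.starts, self.counts)
                   if s >= 0 and now - s < window)
```

With two 500 ms buckets, total() returns the number of events recorded in roughly the last second, which is the kind of figure a QPS rule or a circuit breaker would read.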

5. Outstanding characteristics of Sentinel

In addition to the three comparisons mentioned above, Sentinel has some features that Hystrix does not.

5.1 Flow control

Flow control monitors application traffic metrics such as QPS or the number of concurrent threads, and limits traffic once a metric reaches a specified threshold. This keeps the application from being overwhelmed by instantaneous traffic peaks and keeps it highly available.

Sentinel can control flow based on QPS, concurrency, or call relationship.

There are several ways of flow control based on QPS:

  • Direct rejection: when QPS exceeds the threshold, the excess requests are rejected immediately. This suits cases where the processing capacity of the system is well known.
  • Warm-up (slow start): when the system has been at a low water level for a long time and traffic suddenly surges, pulling it straight to a high water level may overwhelm it instantly. With a “cold start”, the permitted traffic rises gradually to the threshold over a set period, giving the cold system time to warm up instead of being overwhelmed.

  • Uniform queuing: requests pass at an even rate, corresponding to the leaky bucket algorithm.
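
The three QPS strategies above can be sketched in a few lines of Python. These are deliberately simplified illustrations, not Sentinel's controllers: all class names and parameters are hypothetical, the warm-up here is a plain linear ramp starting when the limiter is created, and the cold_factor default of 3 is only an illustrative value.

```python
import time


class RejectLimiter:
    """Direct rejection: at most `qps` requests per one-second window."""

    def __init__(self, qps, clock=time.monotonic):
        self.qps, self.clock = qps, clock
        self.window, self.count = None, 0

    def threshold(self, now):
        return self.qps

    def try_pass(self):
        now = self.clock()
        if self.window != int(now):  # a new second started: reset the counter
            self.window, self.count = int(now), 0
        if self.count >= self.threshold(now):
            return False
        self.count += 1
        return True


class WarmUpLimiter(RejectLimiter):
    """Slow start: the threshold ramps linearly from qps/cold_factor to qps."""

    def __init__(self, qps, warmup_s, cold_factor=3, clock=time.monotonic):
        super().__init__(qps, clock)
        self.warmup_s, self.cold_factor = warmup_s, cold_factor
        self.started = clock()

    def threshold(self, now):
        low = self.qps / self.cold_factor
        progress = min(1.0, (now - self.started) / self.warmup_s)
        return low + (self.qps - low) * progress


class PacingLimiter:
    """Uniform queuing: requests are spaced 1/qps apart (leaky bucket)."""

    def __init__(self, qps, max_wait_s, clock=time.monotonic):
        self.interval, self.max_wait_s, self.clock = 1.0 / qps, max_wait_s, clock
        self.next_free = None  # earliest time the next request may pass

    def wait_time(self):
        """Return the seconds the caller should sleep, or None if rejected."""
        now = self.clock()
        slot = now if self.next_free is None else max(now, self.next_free)
        if slot - now > self.max_wait_s:
            return None
        self.next_free = slot + self.interval
        return slot - now
```

With PacingLimiter(qps=2, max_wait_s=1.0), a burst of four simultaneous requests gets waits of 0.0, 0.5 and 1.0 seconds, and the fourth is rejected because it would have to queue longer than the allowed maximum.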

Flow control based on call relationship:

  • Limit traffic by caller.
  • Limit traffic by call chain entry: chain flow limiting.
  • Limit traffic by related resource: associated flow limiting.
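
As an illustration of the first item, limiting by caller, the sketch below keeps a separate concurrency budget per caller of one resource. The class, its methods, and the caller names in the usage note are hypothetical; Sentinel configures this through rules rather than through code like this.

```python
from collections import defaultdict


class CallerLimiter:
    """Per-caller concurrency limits on one resource, with a default limit."""

    def __init__(self, limits, default_limit):
        self.limits = limits                # e.g. {"order-service": 2}
        self.default_limit = default_limit  # used for callers without a rule
        self.in_flight = defaultdict(int)

    def try_enter(self, caller):
        limit = self.limits.get(caller, self.default_limit)
        if self.in_flight[caller] >= limit:
            return False
        self.in_flight[caller] += 1
        return True

    def exit(self, caller):
        self.in_flight[caller] -= 1
```

A noisy caller such as the made-up "order-service" can then exhaust only its own budget, while other callers of the same resource keep theirs.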

5.2 System adaptive flow limiting

Sentinel's system adaptive flow limiting controls inbound traffic at the level of the whole application. Borrowing the idea of TCP BBR, it combines several monitored metrics (system load, CPU usage, overall average RT, inbound QPS, and the number of concurrent threads) and uses an adaptive flow control strategy to balance inbound traffic against system load, so that the system runs as close to maximum throughput as possible while staying stable overall.

Imagine request processing as a water pipe: incoming requests are water poured into the pipe. When the system is processing smoothly, a request does not need to queue and passes straight through the pipe, and its RT is the shortest possible. When requests pile up, the processing time of a request becomes: queue time + minimum processing time.

  • Corollary 1: if we keep the amount of water in the pipe under control so that it flows smoothly, the number of queued requests does not grow; that is, the system load does not deteriorate further.

  • Corollary 2: the processing capacity of the pipe is used to the fullest when the inbound flow is kept equal to the maximum flow that can come out of the pipe.
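
The two corollaries suggest a simple admission check: the amount of "water" the pipe can hold without queuing is roughly the maximum throughput multiplied by the minimum RT. The Python sketch below illustrates only that idea; the function names are hypothetical and the real Sentinel logic takes more metrics into account.

```python
def estimated_capacity(max_qps, min_rt_ms):
    """Requests the system can hold in flight without queuing."""
    return max_qps * min_rt_ms / 1000.0


def should_admit(in_flight, max_qps, min_rt_ms):
    # Admit while the pipe is not yet full; beyond that, requests only queue.
    return in_flight < estimated_capacity(max_qps, min_rt_ms)
```

For example, a system that has been observed to handle at most 200 QPS with a minimum RT of 50 ms can hold about 200 * 50 / 1000 = 10 requests in flight, so an eleventh concurrent request would only add queue time.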

5.3. Real-time monitoring and control panel

Sentinel provides a lightweight open source console that offers machine discovery, health management, monitoring (single machine and cluster), and rule management and push.

5.4 Development and ecosystem

Sentinel has adapters for Spring Cloud, Dubbo and gRPC, and can be integrated quickly by adding a dependency and a little configuration. I believe Sentinel will be a powerful tool for traffic protection in the future. I’m betting on Sentinel.

5.5. Comparison of Sentinel and Hystrix

Final words

Some readers asked me how to design a seckill (flash sale) system. In a previous article, I described the architecture of a seckill system. Here are the eight points to pay attention to:

  • Single-responsibility services, deployed independently
  • Inventory pre-warming and fast deduction
  • Seckill link encryption
  • Dynamic and static separation
  • Malicious request interception
  • Traffic peak
  • Rate limiting, circuit breaking and downgrading
  • Queue peak clipping

I am Goku, trying to become a super Saiyan.

See you next time!