Author: Zhao Yihao (alias Su He), Sentinel open source project lead | Source: Alibaba Cloud Native official account

Preface

The stability of microservices has always been a topic of great concern to developers. As systems evolve from monolithic to distributed architectures and deployment modes change, the dependencies between services grow increasingly complex, and service systems face major high availability challenges.

In a production environment, you may encounter a variety of unstable situations, such as:

  • During a big promotion, an instantaneous traffic peak pushes the system beyond its maximum load; load skyrockets, the system crashes, and users cannot place orders.
  • An unexpected "dark horse" hot item penetrates the cache, overwhelms the DB, and crowds out normal traffic.
  • A caller is dragged down by an unstable third-party service; its thread pool fills up, calls pile up, and the entire call link freezes.

These unstable scenarios can have serious consequences, yet the high availability protection around traffic and dependencies is easy to overlook. You may be wondering: how can we prevent the impact of these destabilizing factors? How do we implement high availability protection for traffic? How do we keep services "rock solid"? This is where Sentinel, the high availability protection middleware behind Alibaba's Double 11, comes in. In this year's Tmall Double 11 promotion, Sentinel safeguarded the stability of thousands of Alibaba services at peak traffic. Sentinel Go also officially announced its GA recently. Here's a look at Sentinel Go's core scenarios and the community's exploration of cloud native.

Introducing Sentinel

Sentinel is an open source flow control component from Alibaba, oriented to distributed service architectures. Taking traffic as its entry point, it helps developers guarantee the stability of microservices across multiple dimensions such as flow limiting, traffic shaping, circuit breaking and degradation, and system adaptive protection. Sentinel has handled the core scenarios of Alibaba's Double 11 promotions over the past 10 years, such as flash sales, cold start, message peak shaving and valley filling, cluster flow control, and real-time circuit breaking of unavailable downstream services. It is a powerful tool for ensuring the high availability of microservices, supports Java, Go, C++ and other languages, and also provides Istio/Envoy global flow control support to bring high availability protection to the Service Mesh.

Earlier this year, the Sentinel community announced the release of the Sentinel Go version, bringing native high availability protection and fault tolerance to Go microservices and basic components, and marking a new step in Sentinel's multi-language and cloud native evolution. In the past six months, the community has shipped nearly 10 versions, gradually aligning with the core high availability protection and fault tolerance capabilities of the Java version while continuously expanding the open source ecosystem through collaboration with the Dubbo-Go, Ant Group MOSN and other open source communities.

Sentinel Go 1.0 GA was released recently, marking the Go version's readiness for production. Version 1.0 aligns with the core high availability protection and fault tolerance capabilities of the Java version, including flow limiting, traffic shaping, concurrency control, circuit breaking and degradation, system adaptive protection, and hotspot protection. The Go version also covers the mainstream open source ecosystem, providing adapters for commonly used microservice frameworks such as Gin, gRPC, Go-Micro and Dubbo-Go, and dynamic data source extensions for etcd, Nacos and Consul. Sentinel Go is evolving in the cloud native direction as well; version 1.0 includes some initial cloud native explorations, such as the Kubernetes CRD data source and Kubernetes HPA integration.

For Sentinel Go, the flow control scenarios we envision are not limited to microservice applications themselves. Go occupies a large share of cloud native basic components, yet these components often lack fine-grained, adaptive protection and fault tolerance mechanisms; Sentinel Go can be combined with a component's extension mechanisms to protect the component's own stability. Internally, Sentinel uses a high performance sliding window for second-level call metric statistics, combined with token bucket, leaky bucket and adaptive flow control algorithms, to deliver its core high availability protection capabilities.

So how do we use Sentinel Go to ensure the stability of our microservices? Let’s look at some typical application scenarios.

Core scenarios for high availability protection

1. Flow control

Traffic is highly random and unpredictable. One second may be calm, and the next may bring a flood peak (for example, midnight on Double 11). Yet system capacity is always limited: if sudden traffic exceeds what the system can bear, requests may go unprocessed, accumulated requests are handled slowly, CPU/load spikes, and the system finally crashes. We therefore need to limit such traffic bursts, handling as many requests as possible while ensuring the service is not overwhelmed; this is flow control. Flow control is a very general scenario and applies to situations such as pulse traffic.

Typically, for a Web portal or a service provider, we need to protect the provider itself from being overwhelmed by traffic peaks. In this case, traffic is usually controlled based on the service provider's capacity, or restricted for a particular caller. We can evaluate the capacity of core interfaces through preliminary load testing and configure flow control rules in QPS mode: when the number of requests per second exceeds the threshold, excess requests are automatically rejected.

The following is the simplest configuration example of a Sentinel traffic limiting rule:

_, err = flow.LoadRules([]*flow.Rule{
	{
		Resource: "some-service", // resource name of the embedded point
		Count:    10,             // threshold of 10; defaults to second-level statistics, i.e. at most 10 requests per second on a single machine
	},
})
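
For the rule to take effect, Sentinel must be initialized and the guarded logic wrapped with an entry on the same resource name. Below is a minimal sketch of that wiring, assuming the sentinel-golang v1.0 api package; the resource name and the fallback logic are placeholders:

package main

import (
	"log"

	sentinel "github.com/alibaba/sentinel-golang/api"
	"github.com/alibaba/sentinel-golang/core/base"
)

func main() {
	// Initialize Sentinel with the default configuration before loading any rules.
	if err := sentinel.InitDefault(); err != nil {
		log.Fatalf("failed to init sentinel: %v", err)
	}
	// ... load the flow rule shown above ...

	// Each request passes through an entry on the guarded resource.
	e, b := sentinel.Entry("some-service", sentinel.WithTrafficType(base.Inbound))
	if b != nil {
		// Blocked by the flow rule: serve a fallback or return an error (e.g. HTTP 429).
	} else {
		// Business logic protected by the rule goes here.
		e.Exit() // Exit the entry so statistics are recorded.
	}
}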

2. Warm-up flow control

When a system has been at a low water level for a long time and traffic suddenly surges, pulling the system straight up to a high water level can crush it instantly. For example, when a service has just started, its database connection pool may not yet be initialized and its caches may be empty; a traffic surge at this moment can easily bring the service down. If you use traditional limiting without smoothing or peak clipping, there is still a risk of being overwhelmed (for instance by a sudden burst of high concurrency). For this scenario we can use Sentinel's warm-up flow control mode, which lets the allowed traffic increase slowly and ramp up to the threshold over a configured period, rather than releasing all traffic at once. Combined with the control effect of request pacing and queueing, this prevents a large number of requests from being processed at the same time, giving a cold system time to warm up instead of collapsing.
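
As a rough illustration, a warm-up rule in Go might look like the sketch below. The field names mirror the warm-up entry in the CRD YAML example later in this article (tokenCalculateStrategy, warmUpPeriodSec, and so on); exact names can vary between versions, so treat this as an assumption to verify against the release you use:

_, err := flow.LoadRules([]*flow.Rule{
	{
		Resource:               "something-to-warmup",
		TokenCalculateStrategy: flow.WarmUp, // ramp the permitted rate up gradually
		ControlBehavior:        flow.Reject, // reject requests beyond the current limit
		Threshold:              200,         // target QPS once fully warmed up
		WarmUpPeriodSec:        30,          // reach the target over 30 seconds
		WarmUpColdFactor:       3,           // the colder the system, the lower the initial limit
	},
})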

3. Concurrency control and circuit breaking

A service often calls other modules: another remote service, a database, a third-party API, and so on. When making a payment, for example, you may need to call the remote API provided by UnionPay; querying the price of an item may require a database query. However, the stability of these dependencies is not guaranteed. If a dependency becomes unstable and request response times grow longer, the response time of the calling method grows too, threads pile up, and eventually the business's own thread pool may be exhausted, making the service itself unavailable.

Modern microservice architectures are distributed and composed of a very large number of services that call each other, forming complex call links. The problems above are amplified in link calls: if one hop in a complex link is unstable, it can cascade and render the whole link unavailable. Sentinel Go provides the following capabilities to avoid unavailability caused by unstable factors such as slow calls:

  • Concurrency control (isolation module): a lightweight isolation mechanism that limits the number of concurrent calls in flight for a given resource, preventing too many slow calls from crowding out normal ones (see the sketch after this list).
  • Circuit breaking (circuitbreaker module): automatically degrades unstable, weakly dependent calls, temporarily cutting them off to keep local instability from snowballing into an overall avalanche (a rule sketch follows the next paragraph).
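
For concurrency control, a hedged sketch of an isolation rule might look like the following; the resource name and threshold are illustrative, and the field names are assumptions based on the v1.0 isolation module:

import "github.com/alibaba/sentinel-golang/core/isolation"

// Allow at most 32 in-flight calls to this dependency; further calls are
// rejected immediately instead of piling up and exhausting the caller's resources.
_, err := isolation.LoadRules([]*isolation.Rule{
	{
		Resource:   "query-product-price",
		MetricType: isolation.Concurrency,
		Threshold:  32,
	},
})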

Sentinel Go's circuit breaking feature is based on the circuit breaker pattern: when unstable factors appear (for example, response times grow or the error rate rises), the service call is temporarily cut off and retried after a configured period. This both avoids "adding insult to injury" for the already unstable service and protects the caller from being dragged down. Sentinel supports two circuit breaking strategies, based on response time (slow call ratio) and on errors (error ratio / error count), which effectively protect against a variety of unstable scenarios.
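
As an illustration, a slow-call-ratio breaker might be configured as in the sketch below; the numbers are placeholders and the field names are assumptions based on the v1.0 circuitbreaker module, so verify them against your release:

import "github.com/alibaba/sentinel-golang/core/circuitbreaker"

_, err := circuitbreaker.LoadRules([]*circuitbreaker.Rule{
	{
		Resource:         "remote-pay-api",
		Strategy:         circuitbreaker.SlowRequestRatio, // trip on the ratio of slow calls
		RetryTimeoutMs:   5000,  // stay open for 5s, then let a probe request through
		MinRequestAmount: 10,    // never trip on fewer than 10 requests in the window
		StatIntervalMs:   10000, // statistics window of 10s
		MaxAllowedRtMs:   100,   // calls slower than 100ms count as slow
		Threshold:        0.5,   // trip when more than half the calls are slow
	},
})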

Note that circuit breaking is generally applied to weakly dependent calls, that is, calls whose degraded fallback does not affect the main business flow; developers need to design the fallback logic and return values themselves. Note also that even with circuit breaking on the caller side, we still need to configure request timeouts on the HTTP or RPC client as a safety net.

4. Hotspot protection

Traffic is random and unpredictable, so to avoid being overwhelmed we usually configure limiting rules for core interfaces; but in some scenarios ordinary flow control rules are not enough. Consider the peak of a big promotion, when many "hot" items receive extremely high instantaneous traffic. In general we can anticipate a wave of hot items in advance and warm up the cache with their data, so that heavy access can be served quickly without all of it hitting the DB. But every big promotion also produces unpredictable "dark horse" items that were never warmed up. When traffic to these "dark horse" items surges, large numbers of requests punch through the cache straight to the DB layer; DB access slows, the resource pool for normal item requests is crowded out, and the system may break down. In this situation, Sentinel's hotspot parameter flow control can automatically identify hot parameter values and limit the access QPS or concurrency of each one, effectively preventing overly hot values from crowding out normal calls.

Another example: suppose we want to limit how often each user may call an API. Using the API name + userId as the resource name for the embedded point is clearly inappropriate. Instead, we can pass the userId as a parameter to the embedded point via WithArgs(xxx) and then configure hotspot rules to limit the call frequency per user. Sentinel also supports configuring separate limits for specific parameter values for fine-grained flow control. Like other rules, hotspot rules support dynamic configuration through dynamic data sources. The RPC framework integration modules provided by Sentinel Go (such as Dubbo and gRPC) automatically attach the parameter list of an RPC call to the embedded point, so users can configure hotspot rules directly against the corresponding parameter position. Note that if you need per-value limits, only basic types and strings are supported.
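
Putting the pieces together, a per-user limit might be wired up as in the sketch below; my-api, userId and the thresholds are placeholders, and the rule fields are assumptions based on the v1.0 hotspot module:

import (
	sentinel "github.com/alibaba/sentinel-golang/api"
	"github.com/alibaba/sentinel-golang/core/hotspot"
)

// Limit each distinct userId to 10 calls per second on this API.
_, err := hotspot.LoadRules([]*hotspot.Rule{
	{
		Resource:        "my-api",
		MetricType:      hotspot.QPS,
		ControlBehavior: hotspot.Reject,
		ParamIndex:      0,  // rate-limit on the first attached argument (the userId)
		Threshold:       10, // per-value QPS threshold
		DurationInSec:   1,
	},
})

// Attach the userId to the embedded point so the hotspot rule can see it.
e, b := sentinel.Entry("my-api", sentinel.WithArgs(userId))
if b != nil {
	// This user has exceeded their quota; reject or degrade the request.
} else {
	// Handle the request normally.
	e.Exit()
}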

Sentinel Go's hotspot flow control is based on a cache eviction mechanism plus a token bucket: an eviction policy (such as LRU, LFU or ARC) identifies the hot parameters, and a token bucket controls the traffic volume allowed for each hot value. The current version of Sentinel Go uses the LRU policy to track hot parameters; a PR optimizing the eviction mechanism has already been submitted, and future versions will introduce more eviction policies to suit different scenarios.

5. System adaptive protection

With the flow protection scenarios above, is everything OK? Not quite. In many cases we cannot accurately evaluate an interface's capacity in advance, or even predict the traffic characteristics of core interfaces (such as whether there will be pulses). Pre-configured rules may then fail to protect the current service node effectively. Sometimes we suddenly find machine load and CPU usage spiking, with no way to quickly identify the cause or handle the problem. What we actually need at that moment is to stop the bleeding fast, using automatic protection to pull the nearly collapsing microservice back from the brink. For these situations, Sentinel Go provides system adaptive protection rules that dynamically adjust traffic based on system metrics and service capacity.

Sentinel's system adaptive protection strategy draws on the idea of the TCP BBR algorithm: it combines the system's load, CPU usage, QPS, response time and concurrency metrics, and uses an adaptive flow control strategy to balance inbound traffic against system load, letting the system run as close as possible to maximum throughput while staying stable overall. System rules can serve as a last line of defense for the whole service to keep it running, and are particularly effective in CPU-intensive scenarios. Meanwhile, the community is combining automatic control theory, reinforcement learning and other techniques to keep improving the effectiveness and applicability of adaptive flow control, and will introduce more experimental adaptive strategies in future releases.
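
A hedged sketch of a system adaptive rule is shown below; the trigger value is a placeholder and the field names are assumptions based on the v1.0 system module:

import "github.com/alibaba/sentinel-golang/core/system"

// Engage the BBR-style adaptive strategy once system load exceeds 8; other
// metric types include CPU usage, average RT, concurrency and inbound QPS.
_, err := system.LoadRules([]*system.Rule{
	{
		MetricType:   system.Load,
		TriggerCount: 8.0,
		Strategy:     system.BBR,
	},
})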

Cloud Native Exploration

Cloud native is the most important part of Sentinel Go's evolution. On the way to GA, the Sentinel Go community also made some explorations in scenarios such as Kubernetes and Service Mesh.

1. Kubernetes CRD data-source

In a production environment we typically need a configuration center to manage rule configurations dynamically. In a Kubernetes cluster, it is natural to manage an application's Sentinel rules via Kubernetes CRDs. In Sentinel Go 1.0, the community provides a basic CRD abstraction for Sentinel rules and a corresponding data source implementation. Users only need to import the CRD definition file for Sentinel rules and register the corresponding data source when integrating Sentinel, then write YAML in the format the CRD defines and kubectl apply it to the appropriate namespace to configure rules dynamically. Here is an example of a flow control rule:

apiVersion: datasource.sentinel.io/v1alpha1
kind: FlowRules
metadata:
  name: foo-sentinel-flow-rules
spec:
  rules:
    - resource: simple-resource
      threshold: 500
    - resource: something-to-smooth
      threshold: 100
      controlBehavior: Throttling
      maxQueueingTimeMs: 500
    - resource: something-to-warmup
      threshold: 200
      tokenCalculateStrategy: WarmUp
      controlBehavior: Reject
      warmUpPeriodSec: 30
      warmUpColdFactor: 3
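For example, assuming the YAML above is saved as flow-rules.yaml, applying it to the application's namespace would look like this (the namespace name is a placeholder):

kubectl apply -f flow-rules.yaml -n your-namespace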

Kubernetes CRD data-source module address: github.com/sentinel-gr…

The community will further refine the Rule CRD definition and work with other communities to explore standard abstractions related to high availability protection.

2. Service Mesh

Service Mesh is one of the trends in the evolution of microservices toward cloud native. In a Service Mesh architecture, some service governance and policy control capabilities gradually sink into the data plane. Last year, in version 1.7.0 of the Java edition, the Sentinel community experimented with implementing the Envoy global rate limiting gRPC service, the Sentinel RLS Token Server, using Sentinel's cluster flow control Token Server to provide cluster flow control for the Envoy service mesh. With the release of Sentinel Go this year, the community is collaborating and integrating with more Service Mesh products. We worked with the Ant Group MOSN community to support Sentinel Go's flow control and degradation capabilities in MOSN, which is already in use inside Ant Group. The community is also exploring more general solutions, such as implementing a Sentinel plug-in via Istio/Envoy's WASM extension mechanism, so that Istio/Envoy service meshes can leverage Sentinel's native flow control, degradation and adaptive protection capabilities to ensure the stability of the whole cluster.

3. Kubernetes HPA based on Sentinel metrics

There are many ways to ensure service stability. Besides using rules to "control" traffic, "elasticity" is another line of thinking. For applications deployed on Kubernetes, services can be scaled horizontally using Kubernetes HPA. By default, HPA supports multiple system metrics as well as custom metric statistics. On Alibaba Cloud Container Service for Kubernetes, combined with AHAS Sentinel, we have already implemented elastic scaling triggered by service-level average QPS and response time. The community is also working to adapt some of Sentinel's service-level metric statistics (pass count, block count, response time, and so on) to Kubernetes HPA via Prometheus or OpenTelemetry.

Of course, Sentinel-based elastic solutions are not a panacea. They suit only certain scenarios, for example Serverless workloads with fast startup times. For slow-starting services, or when the bottleneck is not service capacity (such as insufficient DB capacity), elastic scaling does not solve the stability problem well and may even worsen service degradation.

Let’s start hacking!

Having seen the high availability protection scenarios above and Sentinel's explorations in the cloud native direction, you should now have a fresh understanding of fault tolerance and stability for microservices. You are welcome to try the demos and integrate your microservices with Sentinel to enjoy its high availability protection and fault tolerance, making your services "rock solid". The Sentinel Go 1.0 GA release was made possible by the community's contributions; thank you to all who contributed.

With GA we also welcomed two new committers, @Sanxun0325 and @LuckyXiaoqiang, who contributed Warm-Up flow control, the Nacos dynamic data source, and a series of feature improvements and performance optimizations during the 1.0 evolution, and who have been very active in helping the community answer questions and review code. Congratulations to both! In future versions the community will continue to explore and evolve toward cloud native and adaptive intelligence, and we welcome more contributors to join in shaping Sentinel's future and creating infinite possibilities. We encourage contributions of any kind, including but not limited to:

  • bug fix
  • new features/improvements
  • dashboard
  • document/website
  • test cases

Developers can pick interesting issues from the Good First Issues list on GitHub to join the discussion and contribute. We pay close attention to actively contributing developers, and core contributors are nominated as committers to help lead the community. Questions and suggestions are also welcome via GitHub issues, Gitter, or the DingTalk group (30150716). Now start hacking!

  • Sentinel Go Repo: github.com/alibaba/sen…
  • Corporate users are welcome to register: github.com/alibaba/Sen…
  • Sentinel Ali Cloud Enterprise Edition: ahas.console.aliyun.com/