Traffic limiting is one of the ways to ensure the high availability of services, especially in the microservice architecture. Traffic limiting of interfaces or resources can effectively ensure the availability and stability of services.
Previous projects mainly used Guava’s RateLimiter for rate limiting. RateLimiter is based on the token bucket algorithm; it is very simple to use, but has relatively few features.
Now we have a new option: Alibaba’s Sentinel.
Sentinel is a rate limiting and circuit breaking middleware provided by Alibaba. Compared with RateLimiter, Sentinel offers much richer rate limiting and circuit breaking features. It supports configuring rate limiting and circuit breaking rules from a console, cluster-wide rate limiting, and visualization of the corresponding service invocations.
Many projects have already integrated Sentinel. This article analyzes Sentinel’s rate limiting function in detail; its other capabilities are not studied in depth.
First, the overall process
Let’s take a look at the overall process:
(Figure: overall process diagram. The image is from the official Sentinel website.)
In terms of design patterns, this is a typical chain of responsibility. After an external request comes in, it is processed by each node in the responsibility chain in turn, and Sentinel’s rate limiting and circuit breaking are implemented by these nodes.
In terms of the rate limiting algorithm, Sentinel uses a sliding window. To understand how it works, we need to start with the source code, so let’s go straight to the Sentinel source.
Second, read the source code
1. Source code reading entry and overall process
To read the source code, first find the entry point. We often use @SentinelResource to annotate a method; you can think of a method annotated with @SentinelResource as a Sentinel resource. So let’s use @SentinelResource as the entry point, find its aspect, and see what it does to get a sense of how Sentinel works. Look directly at the aspect code for the @SentinelResource annotation, SentinelResourceAspect.
You can clearly see how Sentinel behaves. Once inside the @SentinelResource aspect, the SphU.entry method is executed, which performs the rate limiting and circuit breaking logic for the intercepted method.
If circuit breaking or rate limiting is triggered, a BlockException is thrown. We can specify a blockHandler method to handle the BlockException. For business exceptions, we can also configure a fallback method to handle exceptions thrown by the intercepted method.
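The control flow described here can be sketched without the Sentinel library. The following is a minimal model under stated assumptions: `BlockedSketchException`, `ResourceGuardSketch`, and `Gate` are hypothetical names invented for illustration, and the sketch is not the code of the real aspect.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for Sentinel's BlockException; not the real class. */
class BlockedSketchException extends Exception {
    BlockedSketchException(String message) {
        super(message);
    }
}

/** Simplified model of what an aspect around an annotated method does. */
class ResourceGuardSketch {

    /** Plays the role of the SphU.entry check in this sketch. */
    interface Gate {
        void tryEnter(String resource) throws BlockedSketchException;
    }

    static String invoke(String resource, Gate gate, Supplier<String> businessCall,
                         Supplier<String> blockHandler, Supplier<String> fallback) {
        try {
            gate.tryEnter(resource);     // may reject the call
            return businessCall.get();   // the intercepted method
        } catch (BlockedSketchException blocked) {
            return blockHandler.get();   // rate limited or circuit broken
        } catch (RuntimeException businessError) {
            return fallback.get();       // business exception from the method
        }
    }
}
```

A rejected call never reaches the business method and is answered by the block handler, while an exception thrown by the business method itself is answered by the fallback.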
Therefore, Sentinel’s circuit breaking and rate limiting are mainly handled in the SphU.entry method. The main processing logic is shown in the source code below.
In the SphU.entry method, the process by which Sentinel implements rate limiting and circuit breaking can be summarized as follows:
- Get Sentinel Context;
- Obtain the corresponding responsibility chain of resources;
- Generate Entry for resource invocation;
- Execute each node in the chain of responsibility.
Next, we walk through how Sentinel works, one step at a time.
2. Get Sentinel Context
Context, as the name implies, is the context in which Sentinel’s circuit breaking and rate limiting are executed. It contains the Node and Entry information for resource calls.
Let’s look at the characteristics of Context:
- Context is thread-owned and is bound to the current thread using ThreadLocal.
- Context holds the entrance node, the current node of the call link, and the origin of the call.
Here, three important concepts of Sentinel are introduced: Context, Node and Entry. These three classes are the core classes of Sentinel, providing resource call paths, resource call statistics, and more.
Context
Context is the Sentinel context held by the current thread.
On entering Sentinel’s logic, the current thread’s Context is retrieved first, and a new one is created if there is none. When the call is complete, the Context of the current thread is cleared. A Context represents the context of a call link and spans all entries in that call link.
Context maintains information such as the entrance node (entranceNode), the current node (curNode) of the call link, and the origin of the call. The Context name is the name of the call link entrance.
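The get-or-create and clear-on-finish behaviour of a thread-bound context can be illustrated with a plain ThreadLocal. This is a minimal sketch, not Sentinel’s implementation: `ContextHolderSketch` and `Ctx` are hypothetical names, and the real Context carries more state than a name.

```java
/** Simplified, hypothetical model of a per-thread call context. */
class ContextHolderSketch {

    static final class Ctx {
        final String name; // name of the call link entrance
        Ctx(String name) {
            this.name = name;
        }
    }

    // One slot per thread, so contexts never leak between threads.
    private static final ThreadLocal<Ctx> HOLDER = new ThreadLocal<>();

    /** Returns the current thread's context, creating it on first use. */
    static Ctx enter(String name) {
        Ctx ctx = HOLDER.get();
        if (ctx == null) {
            ctx = new Ctx(name);
            HOLDER.set(ctx);
        }
        return ctx;
    }

    static Ctx current() {
        return HOLDER.get();
    }

    /** Clears the binding once the call link has finished. */
    static void exit() {
        HOLDER.remove();
    }
}
```

Calling `enter` twice on the same thread returns the same object, and after `exit` the thread has no context until the next `enter`.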
Node
Node is a statistical wrapper for a resource annotated with @SentinelResource.
The entrance node in Context records the resource calls made by the current thread.
We can trace resource invocations through the childList of the entrance node. Each node corresponds to a resource annotated with @SentinelResource and holds its statistics, such as passQps, blockQps, and RT.
Entry
Entry is the credential Sentinel uses to indicate whether a call has passed the rate limit check. If it is returned normally, you can access the backend service protected by Sentinel. Otherwise, Sentinel throws a BlockException.
In addition, Entry saves some basic information about the current entry() call, including the resource’s Context, Node, and corresponding responsibility chain. After the resource invocation, the obtained entry is also needed for follow-up operations: exiting the responsibility chain corresponding to the entry, completing some node statistics updates, and clearing the current thread’s Context information.
3. Get the chain of responsibility for the @SentinelResource annotated resource
The responsibility chain corresponding to the resource is where the rate limiting logic is implemented; it follows the typical chain of responsibility pattern.
Let’s look at the default chain of responsibility:
The default processing nodes in the responsibility chain include NodeSelectorSlot, ClusterBuilderSlot, StatisticSlot, FlowSlot, and DegradeSlot. The call chain (ProcessorSlotChain) and all the slots it contains implement the ProcessorSlot interface. Each node executes its own processing logic and then calls the next node, in chain of responsibility fashion.
Each node has a role to play, and we’ll see what they do later.
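To make the "do your own work, then call the next node" structure concrete, here is a minimal chain of responsibility sketch. The class names (`SlotSketch`, `NamedSlotSketch`, `SlotChainSketch`) are hypothetical, and the slots only record their names; it models the shape of the chain and is not Sentinel’s ProcessorSlotChain.

```java
import java.util.ArrayList;
import java.util.List;

/** One link of the chain; illustrative only. */
abstract class SlotSketch {
    private SlotSketch next;

    void setNext(SlotSketch next) {
        this.next = next;
    }

    /** Each slot does its own work, then hands over to the next one. */
    abstract void entry(String resource, List<String> trace);

    protected void fireEntry(String resource, List<String> trace) {
        if (next != null) {
            next.entry(resource, trace);
        }
    }
}

/** A slot whose only "work" is to record that it ran. */
class NamedSlotSketch extends SlotSketch {
    private final String name;

    NamedSlotSketch(String name) {
        this.name = name;
    }

    @Override
    void entry(String resource, List<String> trace) {
        trace.add(name);
        fireEntry(resource, trace);
    }
}

/** Holds the head and tail of the linked slots. */
class SlotChainSketch {
    private SlotSketch first;
    private SlotSketch last;

    void addLast(SlotSketch slot) {
        if (first == null) {
            first = slot;
        } else {
            last.setNext(slot);
        }
        last = slot;
    }

    /** Runs the chain and returns the order in which slots executed. */
    List<String> entry(String resource) {
        List<String> trace = new ArrayList<>();
        if (first != null) {
            first.entry(resource, trace);
        }
        return trace;
    }
}
```

Adding slots named after the default nodes and calling `entry` returns their names in insertion order, which is the order in which a request would visit them.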
In addition, the same resource (annotated with @SentinelResource) always uses the same chain of responsibility. In other words, each resource has its own separate responsibility chain. You can see this in the source code that obtains a resource’s chain: it first looks in the cache, and creates a new chain only if none is found.
4. Generate the call certificate Entry
The generated Entry is a CtEntry. Its construction parameters include a ResourceWrapper, the responsibility chain for the resource, and the Context for the current thread.
As you can see, the new CtEntry records the responsibility chain of the current resource and the Context, and updates the Context by setting the Context’s current Entry to itself. CtEntry instances also form a doubly linked list, which builds the call link for Sentinel resources.
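The linking step can be sketched as follows. This is a simplified model with hypothetical names (`EntryLinkSketch`, `Ctx`, `LinkedEntry`); it shows only how a new entry attaches itself to the context’s current entry, not the rest of what CtEntry stores.

```java
/** Hypothetical, simplified model of entries linked through a context. */
class EntryLinkSketch {

    static final class Ctx {
        LinkedEntry curEntry; // the entry currently executing in this context
    }

    static final class LinkedEntry {
        final String resource;
        final Ctx context;
        final LinkedEntry parent;
        LinkedEntry child;

        LinkedEntry(String resource, Ctx context) {
            this.resource = resource;
            this.context = context;
            this.parent = context.curEntry;  // the previous current entry becomes the parent
            if (parent != null) {
                parent.child = this;         // link downward as well
            }
            context.curEntry = this;         // the new entry becomes the current one
        }
    }
}
```

Creating entry "a" and then entry "b" in the same context leaves "b" as the current entry, with `b.parent == a` and `a.child == b`, so nested resource calls form a list that can be walked in both directions.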
5. Implementation of the chain of responsibility
Then comes the execution of the chain of responsibility. The responsibility chain and the slots in it implement ProcessorSlot. The entry method of the responsibility chain executes each Slot in the chain in turn, so next we step into each Slot. To stay focused, this article only examines the slots related to rate limiting.
5.1 NodeSelectorSlot — Get the Node corresponding to the current resource and build the Node call tree
This slot is responsible for obtaining or building the Node corresponding to the current resource. That Node is used for the statistics of subsequent resource calls and for evaluating rate limiting and circuit breaking conditions. NodeSelectorSlot also builds the call link. Look at the source code:
The code style is familiar. We know that each resource has its own chain of responsibility, and each chain has its own NodeSelectorSlot. The node cache map in NodeSelectorSlot is a non-static field, so the map is shared only within the current resource. Different resources have different NodeSelectorSlot instances and node caches. The relationship between resources and node cache maps is shown in the following figure.
So the function of NodeSelectorSlot is:
- When the resource’s call chain is executed, get the Node corresponding to the current context, which represents the resource’s call.
- Set the obtained node to the current node and add it to the previous node to form a tree-like call path. (Via current Entry in Context)
- Triggers the execution of the next Slot.
An interesting question is why we use the Context name instead of the SentinelResource name to fetch the Node corresponding to the resource in the NodeSelectorSlot of the responsibility chain.
First, we know that a resource corresponds to one chain of responsibility, but the Context from which a resource call enters may differ. If the resource name were used as the key, calls made from different contexts would all get the same Node. Using the Context name as the key separates the nodes of the same resource by Context.
For example, Sentinel can be used not only through the @SentinelResource annotation, but also by introducing sentinel-dubbo-adapter, which uses Dubbo’s Filter mechanism to protect Dubbo interfaces directly. Let’s compare how the Context is generated with @SentinelResource and with Dubbo:
@SentinelResource
The generated context’s name is sentinel_default_context. Every resource entered this way gets a Context with this name.
Dubbo Filter way
The name of the generated context is the interface qualified name or method qualified name of Dubbo.
If other @SentinelResource resources are called in a nested fashion under the Dubbo Filter approach, those resource calls will run in different Contexts.
So the following situation can arise: requests come in through different Dubbo interfaces, all of them call the same @SentinelResource annotated method, and the Context in which that method’s resource is executed differs from call to call.
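The effect of keying the node cache by Context name can be sketched like this. The names `NodeSelectorSketch` and `StatNode` are hypothetical, and `computeIfAbsent` on a ConcurrentHashMap is a simplification chosen for brevity; it is an assumption and does not claim to mirror how the real NodeSelectorSlot guards its map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model: one selector instance exists per resource chain. */
class NodeSelectorSketch {

    static final class StatNode {
        final String resource;
        final String contextName;

        StatNode(String resource, String contextName) {
            this.resource = resource;
            this.contextName = contextName;
        }
    }

    private final String resource;

    // Instance field, so this cache belongs to a single resource's chain.
    private final Map<String, StatNode> nodesByContext = new ConcurrentHashMap<>();

    NodeSelectorSketch(String resource) {
        this.resource = resource;
    }

    /** The same context name yields the same node; a new context name yields a new node. */
    StatNode select(String contextName) {
        return nodesByContext.computeIfAbsent(contextName, c -> new StatNode(resource, c));
    }
}
```

With one selector for a resource, `select("ctxA")` always returns the same node, while `select("ctxB")` returns a different one, so calls arriving from two entrances are counted separately.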
Another question follows: since a resource’s statistics are split across different Nodes by Context, what do we do if we want the total statistics for the resource? This brings us to ClusterNode and ClusterBuilderSlot.
5.2 ClusterBuilderSlot – A Node that aggregates the same resource in different contexts
This slot aggregates the Nodes that the same resource has in different contexts, for use in subsequent rate limiting.
As you can see, the ClusterNode is looked up with the resource name as the key. The ClusterNode becomes an attribute of the current node and is mainly used to aggregate the multiple nodes that the same resource has in different contexts. By default, rate limiting conditions are evaluated against the ClusterNode statistics.
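The aggregation idea can be sketched as follows: per-context counters that also forward every update to one shared counter per resource. All names here (`ClusterAggregationSketch`, `ClusterStats`, `ContextStats`) are hypothetical, and only a pass count is modelled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical model of per-context nodes sharing one per-resource aggregate. */
class ClusterAggregationSketch {

    static final class ClusterStats {
        final LongAdder pass = new LongAdder();
    }

    static final class ContextStats {
        final LongAdder pass = new LongAdder();
        final ClusterStats cluster;

        ContextStats(ClusterStats cluster) {
            this.cluster = cluster;
        }

        /** Updates this node and the shared aggregate together. */
        void addPass(int n) {
            pass.add(n);
            cluster.pass.add(n);
        }
    }

    // Keyed by resource name only, so all contexts share one aggregate.
    private final Map<String, ClusterStats> clusterByResource = new ConcurrentHashMap<>();
    // Keyed by context and resource, so each context keeps its own node.
    private final Map<String, ContextStats> nodeByKey = new ConcurrentHashMap<>();

    ContextStats nodeFor(String resource, String contextName) {
        ClusterStats cluster = clusterByResource.computeIfAbsent(resource, r -> new ClusterStats());
        return nodeByKey.computeIfAbsent(contextName + "/" + resource, k -> new ContextStats(cluster));
    }

    long totalPass(String resource) {
        ClusterStats cluster = clusterByResource.get(resource);
        return cluster == null ? 0L : cluster.pass.sum();
    }
}
```

If the node for context A records 2 passes and the node for context B records 3, each node keeps its own count while the total for the resource is 5, which is the figure a resource-level limit would be checked against.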
5.3 StatisticSlot — Resource invocation statistics
This slot is responsible for calculating and updating the statistics of resource calls. Unlike the slots before and after it, StatisticSlot first triggers the execution of the following slots and runs its own logic only after they have completed.
This is easy to understand: as a statistics component, it can only record results after the circuit breaking and rate limiting checks have been carried out. Let’s take a look at the specific statistical process.
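The "run the later checks first, then record the outcome" ordering can be sketched like this. `StatisticOrderSketch`, `Downstream`, and `RejectedSketchException` are hypothetical names standing in for the slot, the rest of the chain, and a blocking exception.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical model of a statistics step that wraps the checks after it. */
class StatisticOrderSketch {

    /** Stand-in for a blocking exception raised by a later check. */
    static class RejectedSketchException extends Exception {
    }

    /** Stand-in for the remaining slots in the chain. */
    interface Downstream {
        void check() throws RejectedSketchException;
    }

    final LongAdder pass = new LongAdder();
    final LongAdder block = new LongAdder();

    void entry(Downstream next) throws RejectedSketchException {
        try {
            next.check();       // later slots (for example the rate limit check) run first
            pass.increment();   // only then is the request recorded as passed
        } catch (RejectedSketchException e) {
            block.increment();  // a rejection is recorded as blocked
            throw e;            // and still propagated to the caller
        }
    }
}
```

One accepted call followed by one rejected call leaves `pass` at 1 and `block` at 1, which is only possible because the counters are updated after the downstream result is known.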
The above chart describes the StatisticSlot statistics process. Note that in both the no-exception case and the BlockException case, the updates mainly cover the thread count, the number of passed requests, and the number of blocked requests. Both DefaultNode and ClusterNode inherit from StatisticNode, so Node data updates end up in StatisticNode.
Referring to the block diagram of Sentinel’s statistics data, the general process of updating Node statistics is as follows:
Taking passQps as an example, we start from StatisticNode’s addPassRequest() method to explore how StatisticNode updates the QPS count for each request.
The counters rollingCounterInSecond and rollingCounterInMinute are of type Metric, with time dimensions of seconds and minutes, respectively. Both use the Metric implementation class ArrayMetric.
Trace back to ArrayMetric:
The statistics are stored in ArrayMetric’s data field, which is a LeapArray.
LeapArray is an array of time windows. Its basic information includes the time window length (ms, windowLengthInMs), the sample count (that is, the number of time windows, sampleCount), the time interval (ms, intervalInMs), and the time window array (array). The time window length, sample count, and time interval are related as follows:
windowLengthInMs = intervalInMs / sampleCount
In this code, the intervalInMs used by rollingCounterInSecond is 1000 (ms), that is, 1s, and sampleCount = 2, so the window length is windowLengthInMs = 500ms. The intervalInMs used by rollingCounterInMinute is 60 x 1000 (ms), that is, 60s, and sampleCount = 60, so windowLengthInMs = 1000ms, that is, 1s.
The time window array (array) is of type AtomicReferenceArray, an array of references that supports atomic operations. The array element type is WindowWrap, a wrapper around a time window that includes the window start time (windowStart), the window length, and the window counter (value, of type MetricBucket). The actual counting is performed by MetricBucket, which stores its counters (of type LongAdder). Take a look at the block diagram of the counting component below:
Back in StatisticNode.addPassRequest, let’s take rollingCounterInSecond.addPass(count) as an example to explore how Sentinel counts with a sliding window.
5.3.1 Obtaining the Current Time Window
(1) Take the array subscript corresponding to the current timestamp
long timeId = time / windowLength;
int idx = (int)(timeId % array.length());
Here, time is the current time and windowLength is the time window length, which is 500ms for rollingCounterInSecond. array.length() is the number of time windows in the interval; for rollingCounterInSecond there are 2 time windows per interval (1s). timeId is the integer quotient of the current time divided by the time window length. Each time the current time advances by windowLength, timeId increases by 1 and the time window slides forward by one.
(2) Calculate the window start time
Window start time = Current time (ms) – Current time (ms) % Time window length (ms)
The obtained window start time is an integer multiple of the time window length.
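The two calculations from steps (1) and (2) can be checked with a few concrete numbers. The helper class `WindowMathSketch` and its method names are hypothetical; the arithmetic follows the formulas given in the text.

```java
/** Hypothetical helper that reproduces the index and start-time arithmetic. */
class WindowMathSketch {

    /** Array subscript of the window that the timestamp falls into. */
    static int indexFor(long timeMillis, int windowLengthInMs, int sampleCount) {
        long timeId = timeMillis / windowLengthInMs;
        return (int) (timeId % sampleCount);
    }

    /** Start time of that window: the timestamp rounded down to a window boundary. */
    static long windowStart(long timeMillis, int windowLengthInMs) {
        return timeMillis - timeMillis % windowLengthInMs;
    }
}
```

With a 500ms window and 2 samples, a timestamp of 1700ms gives timeId 3, index 1, and window start 1500ms; a timestamp of 2100ms gives timeId 4, index 0, and window start 2000ms. The index wraps around the array while the start time keeps growing, which is how the array can tell a current window from an expired one occupying the same slot.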
(3) Obtain the time window
First, get the time window from LeapArray’s array based on the array subscript.
- If the obtained time window is empty, create a new time window (using CAS).
- If the obtained time window is not empty and the start time of the time window is equal to the calculated start time, it means that the current time is in this time window, and the time window is directly returned.
- If the obtained time window is not empty and its start time is earlier than the start time we calculated, the time window has expired (this happens when the last access was a long time ago). The time window needs to be reset (under a lock): its start time is set to the calculated start time and its counters are reset to zero.
- If the obtained time window is not empty and its start time is later than the start time we calculated, a new time window is created. Normally this branch is not reached, because it would mean that the current time is behind the stored time window, that is, the stored window lies in the future, which does not make sense.
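The four branches above can be put together in a compact sketch. This is a simplified model, not Sentinel’s LeapArray: `LeapArraySketch` and `Window` are hypothetical names, each window holds only a pass counter, and the timestamp is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical, simplified sliding window array. */
class LeapArraySketch {

    static final class Window {
        volatile long start;
        final LongAdder pass = new LongAdder();

        Window(long start) {
            this.start = start;
        }
    }

    private final int windowLengthInMs;
    private final AtomicReferenceArray<Window> array;
    private final ReentrantLock resetLock = new ReentrantLock();

    LeapArraySketch(int sampleCount, int intervalInMs) {
        this.windowLengthInMs = intervalInMs / sampleCount;
        this.array = new AtomicReferenceArray<>(sampleCount);
    }

    Window currentWindow(long timeMillis) {
        int idx = (int) ((timeMillis / windowLengthInMs) % array.length());
        long start = timeMillis - timeMillis % windowLengthInMs;
        while (true) {
            Window old = array.get(idx);
            if (old == null) {
                // Empty slot: install a new window with CAS; retry if another thread won.
                Window fresh = new Window(start);
                if (array.compareAndSet(idx, null, fresh)) {
                    return fresh;
                }
                Thread.yield();
            } else if (old.start == start) {
                return old;                   // still inside this window
            } else if (old.start < start) {
                // Expired window: reuse the slot, resetting it under a lock.
                if (resetLock.tryLock()) {
                    try {
                        old.pass.reset();
                        old.start = start;
                        return old;
                    } finally {
                        resetLock.unlock();
                    }
                }
                Thread.yield();
            } else {
                return new Window(start);     // stored window is in the future; not expected
            }
        }
    }
}
```

With 2 samples over 1000ms, timestamps 1700 and 1900 return the same window starting at 1500. Timestamp 2600 maps to the same slot, finds that window expired, and reuses the object with its start moved to 2500 and its count reset to zero.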
5.3.2 Accumulate the counters in the time window
The time window counter is a LongAdder array that stores the number of passed requests, exception requests, blocked requests, and so on, as shown in the diagram below:
The pass count, block count, and exception count are updated when the entry method of StatisticSlot is executed. The success count and response time are updated when the exit method of StatisticSlot is executed. In essence, the corresponding counts are updated before and after the intercepted method runs. addPass itself simply adds to the first element of the counter array.
The element type of the counter array is LongAdder. LongAdder was added to java.util.concurrent (JUC) in JDK 8. It is a thread-safe counter that performs better than the Atomic* classes under high contention.
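A counter array of this kind can be sketched with an enum used as the index. `BucketSketch` and its `Event` enum are hypothetical; the event list and its order are assumptions made for this sketch, with PASS placed first to match the remark that addPass updates the first element.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical model of a per-window bucket of counters. */
class BucketSketch {

    /** Assumed event order for this sketch; PASS is index 0. */
    enum Event { PASS, BLOCK, EXCEPTION, SUCCESS, RT }

    private final LongAdder[] counters = new LongAdder[Event.values().length];

    BucketSketch() {
        for (int i = 0; i < counters.length; i++) {
            counters[i] = new LongAdder();
        }
    }

    /** Adds to the counter selected by the event's position in the enum. */
    void add(Event event, long n) {
        counters[event.ordinal()].add(n);
    }

    long get(Event event) {
        return counters[event.ordinal()].sum();
    }

    /** Zeroes every counter, as needed when an expired window is reused. */
    void reset() {
        for (LongAdder counter : counters) {
            counter.reset();
        }
    }
}
```

Each event type gets its own LongAdder, so concurrent requests updating different statistics do not contend on a single shared variable.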
5.4 FlowSlot: Flow limiting judgment
FlowSlot is the node for determining flow limiting conditions. The statistics previously performed by StatisticSlot on related resource calls will be used in FlowSlot limiting judgment.
Let’s go straight to the core logic of rate limiting, the flow rule checker:
The main processes include:
- Obtain the traffic limiting rule corresponding to resources
- Check whether traffic is restricted according to traffic limiting rules
If the request is rate limited, a FlowException is thrown. FlowException extends BlockException.
How does FlowSlot check for limiting flow?
By default, the ClusterNode of the current node is used for rate limiting, and the main rate limiting mode is QPS based. Let’s look at the key code of the default rate limiting check (DefaultController):
- Get the node’s current QPS count;
- Check whether the threshold would be exceeded after adding the new count;
- If the threshold would be exceeded, false is returned, indicating that the request is rate limited, and a FlowException is thrown. Otherwise, true is returned and the request is not limited.
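The three steps above reduce to a single comparison. The following sketch uses a hypothetical class `QpsCheckSketch` and takes the current QPS as a parameter instead of reading it from a node; it illustrates the comparison only.

```java
/** Hypothetical helper showing the threshold comparison. */
class QpsCheckSketch {

    /**
     * Returns true when the request may pass: the QPS already counted
     * plus the permits requested now must not exceed the threshold.
     */
    static boolean canPass(double currentPassQps, int acquireCount, double threshold) {
        return currentPassQps + acquireCount <= threshold;
    }
}
```

With a threshold of 10, a request for 1 permit passes while the current count is 9 and is rejected once the count has reached 10.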
As you can see, the limiting judgment is very simple, just need to check the QPS count. This is thanks to StatisticSlot statistics.
5.5 Summary of responsibility chain
From the above explanation, let’s look at the following picture, isn’t it clear?
(Sentinel website)
NodeSelectorSlot fetches the Node corresponding to the resource and builds the Node call tree, organizing the call links of @SentinelResource resources into a tree of Nodes. ClusterBuilderSlot creates a ClusterNode for the current Node to aggregate the Nodes that the same resource has in different contexts. This ClusterNode is used for rate limiting.
ClusterNode inherits from StatisticNode and records statistics about resource calls. StatisticSlot updates the counts of resource calls for subsequent rate limiting. FlowSlot decides whether to rate limit based on the call counts of the Node corresponding to the resource. At this point, Sentinel’s chain of responsibility execution logic is complete.
6. Finishing touches in Sentinel
Take a look at the entry.exit() method, which is executed whether the call succeeds, fails, or is blocked.
- Check whether the entry to exit is the current entry of the current context.
- If the entry to exit is not the current entry of the current context, this entry is not exited; instead, the context’s current entry and all of its parent entries are exited, and an exception is thrown.
- If the entry to exit is the current entry of the current context (the normal case), exit all slots in the responsibility chain corresponding to the current entry. At this step, StatisticSlot updates the node’s success count and RT;
- Set the current entry of the context to the parent entry of the exiting entry;
- If the parent entry is empty and context is the default context, the default context is automatically exited (ThreadLocal is cleared).
- Clear the exited entry’s reference to the context.
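The exit steps listed above can be sketched as follows. This is a simplified model with hypothetical names (`EntryExitSketch`, `Ctx`, `Item`); the slot chain’s own exit logic is reduced to a comment, and clearing the context is represented by a `released` flag.

```java
/** Hypothetical, simplified model of exiting nested entries. */
class EntryExitSketch {

    static final class Ctx {
        Item curEntry;     // current entry of this context
        boolean released;  // set once the last entry has exited
    }

    static final class Item {
        final String name;
        final Item parent;
        Ctx context;

        Item(String name, Ctx context) {
            this.name = name;
            this.context = context;
            this.parent = context.curEntry;
            context.curEntry = this;
        }

        void exit() {
            Ctx ctx = context;
            if (ctx == null) {
                return;                    // already exited
            }
            if (ctx.curEntry != this) {
                // Mismatched exit: unwind the current entry and its parents, then fail.
                Item e = ctx.curEntry;
                while (e != null) {
                    Item up = e.parent;
                    e.exit();
                    e = up;
                }
                throw new IllegalStateException("exit order mismatch at " + name);
            }
            // A real implementation would run the slot chain's exit logic here.
            ctx.curEntry = parent;         // the parent becomes the current entry again
            if (parent == null) {
                ctx.released = true;       // nothing left, so the context can be cleared
            }
            context = null;                // drop this entry's reference to the context
        }
    }
}
```

Exiting in reverse order of creation walks the current entry back up to the parent and finally releases the context. Exiting an outer entry while an inner one is still current unwinds the whole stack and throws.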
7. To summarize
Sentinel’s current limiting process can be clearly understood by reading the source code of Sentinel.
- Context, Entry and Node are the core components of Sentinel. All kinds of information and resource invocations are held by these three classes.
- Sentinel’s statistics, circuit breaking and rate limiting are all carried out through the chain of responsibility pattern.
- NodeSelectorSlot in the responsibility chain is responsible for selecting the Node corresponding to the current resource and constructing the Node call tree.
- ClusterBuilderSlot in the responsibility chain builds the ClusterNode corresponding to the current Node and aggregates the Nodes that the same resource has in different contexts.
- StatisticSlot in the responsibility chain collects statistics on calls to the current resource and updates the statistics of the Node and its corresponding ClusterNode.
- FlowSlot in the responsibility chain limits traffic based on the ClusterNode (default) statistics of the current Node.
- Resource call statistics (such as PassQps) are counted using a sliding time window.
- When all work is done, exit the process, add some statistics, and clean up the Context.
Third, references
Github.com/alibaba/Sen…
By Sun Yi