
After a system is split into microservices, the call relationships between services become complex, the overall entropy of the platform increases, and both the probability of errors and the difficulty of debugging rise by orders of magnitude. Service governance therefore becomes a technical focus of microservices. Service governance is itself a broad topic, covering authentication, traffic limiting, degradation, circuit breaking, monitoring and alerting, and more. This article focuses on traffic limiting and shares some thoughts on microservice interface traffic limiting based on the author's practical experience.

This article attempts to answer the following questions. If you have similar questions, or are interested in any of these topics, please read on.

  1. What problems can arise in a microservices architecture without interface traffic limiting?
  2. How do we select an appropriate traffic limiting algorithm for a microservice interface?
  3. How do we choose between single-node and distributed traffic limiting based on the scenario and performance requirements?
  4. How do we flexibly select different fusing policies according to service requirements?
  5. How do we select an appropriate time granularity and maximum limit value for an interface?
  6. How do we verify that the traffic limiting function of a microservice interface is valid and correct?
  7. How do we build a traffic limiting framework with high fault tolerance, high TPS, and low latency?

At the end of the article, I also introduce the author's open source traffic limiting framework, Ratelimiter4j. Everyone is welcome to try it and exchange ideas.

Background of microservice interface traffic limiting

When handling high-pressure scenarios such as flash sales (seckill), big promotions, Double 11, and 618, traffic limiting has become a standard technical solution and plays a key role in keeping the system running smoothly. Regardless of the application scenario, traffic limiting means selectively fusing certain requests based on preset rules, so that traffic exceeding expectations is capped. Because of limited space and the author's experience, this article focuses on traffic limiting for service interfaces under a microservices architecture.

For microservices, especially middle-tier microservices, interface requests may come from many systems. For example, the interfaces of a user service are called by many internal systems, such as the CRM and the promotion system. For microservices that serve many calling systems and handle a large volume of interface requests, traffic limiting plays a significant role in the following scenarios, beyond the flash-sale scenario mentioned above.

As the provider of a microservice interface, we cannot control how callers use it. We have encountered callers that ran concurrent jobs against our interfaces, as well as caller code bugs and sudden bursts of business traffic. In each case, the number of requests from that caller spiked, service thread resources were excessively contended, and requests from other callers queued without response, so the overall response time of the microservice grew or even timed out. To prevent an interface from being over-invoked, fine-grained traffic limiting per caller is required.

In addition to limiting the access frequency of callers, we sometimes need to limit the access frequency of specific interfaces. Take slow interfaces as an example: their processing time is long, often because of complex logic. If the access frequency of a slow interface is unrestricted, too many slow requests will hold service thread resources without releasing them, making the service unable to respond to other interfaces, degrading the overall throughput and response time of the microservice, and even causing large numbers of interface timeouts. Beyond slow interfaces, some core interfaces need traffic limiting against unexpected abnormal traffic on top of invocation authentication, because abnormal access to them has a large impact on the business.

To summarize, we need not only coarse-grained, service-level traffic limiting for big-promotion and flash-sale scenarios, such as limiting the number of requests per second for each node of a microservice cluster, but also fine-grained limiting per caller and even per interface, such as limiting the maximum number of requests per second that caller A may send to a particular interface of a service.

Defining the "flow" in interface traffic limiting

How should we interpret the word "flow" in traffic limiting? Which metric should be limited? Different scenarios define flow differently: network traffic, bandwidth, transactions per second (TPS), hits per second, concurrent requests, or even a business indicator, such as the maximum number of SMS verification codes a user may request in a given period.

From the point of view of system stability and availability, the best traffic limiting indicator for a microservice system is the number of concurrent requests. By limiting the number of requests being processed concurrently, we limit the number of requests consuming resources at any moment. For example, if the number of servlet worker threads in the web container is set to 200, at most 200 requests are being processed at any time, and excess requests are blocked and queued.
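As an illustration, here is a minimal Java sketch of concurrency limiting, using a counting semaphore in place of the container's thread pool; the class name and the reject-on-full policy are our own choices, not from any particular framework.

```java
import java.util.concurrent.Semaphore;

// A minimal sketch: cap the number of requests being processed at any moment.
// The limit mirrors the 200-servlet-worker-thread example above.
public class ConcurrencyLimiter {
    private final Semaphore permits;

    public ConcurrencyLimiter(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    public boolean tryHandle(Runnable request) {
        if (!permits.tryAcquire()) {
            return false; // over the concurrency limit: reject (or queue) the request
        }
        try {
            request.run();
            return true;
        } finally {
            permits.release(); // free the slot once processing completes
        }
    }
}
```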

In the previous section, to solve the problem of callers over-contending for service resources, we concluded that fine-grained limits per caller and even per interface are needed. So besides limiting the number of concurrent requests for the system as a whole, we also need to limit the number of concurrent requests per caller and even per interface. However, the maximum number of concurrent requests for a single caller is hard to set properly: it is difficult to derive from monitoring statistics, a value that is too small falsely rejects requests, and one that is too large provides no protection. We therefore need other traffic limiting indicators.

Comparing TPS and hits per second, we choose hits per second as the traffic limiting indicator, because limiting TPS is actually infeasible. TPS counts transactions per second, where a transaction starts when the interface request is received and ends when processing completes and the response is returned, so each transaction spans a time interval. If we increment the counter at transaction start and decrement it at transaction end, we are back to concurrency limiting. If we count at the moment the request is received, the limit degenerates to hits per second. If we count at transaction end, the counter value no longer reflects the current and upcoming pressure on the system.

Is hits per second an effective traffic limiting indicator? Yes. The value is observable and easy to collect statistics on, which makes traffic limiting rules convenient to configure. Moreover, it reflects, to some extent, the current and upcoming performance pressure on the system, so limiting this indicator does limit the use of system resources.

With the definition of flow settled, let's look at several commonly used traffic limiting algorithms: fixed time window, sliding time window, token bucket, and leaky bucket, along with their improved versions.

Fixed and sliding time window traffic limiting algorithms

The traffic limiting algorithm based on a fixed time window is very simple. Choose a starting point in time, then increment a counter for each incoming interface request. If the accumulated count within the current time window exceeds the limit set by the traffic limiting rule (for example, the maximum number of interface requests per second), the fuse rejects further requests. When the next time window begins, the counter is reset.
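To make this concrete, here is a minimal, thread-safe Java sketch of the algorithm just described; the class and parameter names are our own, not taken from any framework.

```java
// A minimal sketch of the fixed time window algorithm; synchronized keeps the
// check-and-increment atomic under concurrent requests.
public class FixedWindowRateLimiter {
    private final int maxHitsPerWindow;   // the traffic limiting rule, e.g. 100
    private final long windowMillis;      // window size, e.g. 1000 ms
    private int counter = 0;
    private long windowStart = System.currentTimeMillis();

    public FixedWindowRateLimiter(int maxHitsPerWindow, long windowMillis) {
        this.maxHitsPerWindow = maxHitsPerWindow;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // enter the next window and reset the counter
            counter = 0;
        }
        if (counter < maxHitsPerWindow) {
            counter++;
            return true;  // within the limit: let the request through
        }
        return false;     // over the limit: the fuse rejects the request
    }
}
```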

The disadvantage of the fixed time window algorithm is that the strategy is too coarse to handle burst traffic at the boundary between two windows. Consider an example: suppose the rule allows no more than 100 interface requests per second. In the first 1 s window, 100 requests arrive concentrated in the last 10 ms; in the second 1 s window, 100 requests arrive concentrated in the first 10 ms. Although the traffic in each window satisfies the limit (<= 100 requests), 200 requests are concentrated in the 20 ms straddling the two windows. Without further limiting, those 200 requests concentrated in 20 ms may overwhelm the system, as shown in Figure 1:

The sliding time window algorithm is an improvement on the fixed time window algorithm. After shaping by a sliding window, the traffic in any time window is guaranteed not to exceed the maximum allowed limit, so the traffic curve is smoother, which solves the window-boundary problem described above. Compared with the fixed window algorithm, the window of the sliding window algorithm keeps moving forward. Besides a counter for the number of requests in the window, the arrival time of each request within the window must also be recorded, so the algorithm consumes more memory. The algorithm model of the sliding time window is as follows:

Maintain a list of request timestamps List = (t_1, t_2, ..., t_k), with a window size of 1 second whose starting point is the smallest time point in the list. When a new request arrives at time t_m, we update the sliding window and decide whether to fuse the request through the following steps:

STEP 1: Check whether the request time t_m falls within the current time window [t_start, t_start + 1 s]. If yes, go to STEP 3; otherwise, go to STEP 2.

STEP 2: Slide the time window forward: update its starting point t_start to the second-smallest time point in the list and delete the smallest time point from the list. Then go back to STEP 1.

STEP 3: Check whether the number of requests in the current time window is below the maximum allowed limit, that is, whether List.size < max_hits_limit. If so, the limit is not exceeded, the request is allowed, and its timestamp is appended to the list.
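Here is a minimal Java sketch of these three steps, using a deque of timestamps as the list; the names are ours, not Ratelimiter4j's.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal sketch of the sliding time window: keep request timestamps and
// slide the window forward on each new request.
public class SlidingWindowRateLimiter {
    private final int maxHitsPerWindow;              // max_hits_limit in the text
    private final long windowMillis;                 // window size, e.g. 1000 ms
    private final Deque<Long> hitTimestamps = new ArrayDeque<>();

    public SlidingWindowRateLimiter(int maxHitsPerWindow, long windowMillis) {
        this.maxHitsPerWindow = maxHitsPerWindow;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // STEP 1 + STEP 2: slide the window until all recorded hits fall inside it
        while (!hitTimestamps.isEmpty()
                && now - hitTimestamps.peekFirst() >= windowMillis) {
            hitTimestamps.pollFirst();
        }
        // STEP 3: compare the count in the window against the limit
        if (hitTimestamps.size() < maxHitsPerWindow) {
            hitTimestamps.addLast(now);
            return true;
        }
        return false;
    }
}
```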

The sliding time window algorithm partially solves the boundary problem of fixed windows. After shaping the example above with a sliding window, all 100 requests in the first 1 s window pass, while the 100 requests in the first 10 ms of the second window are fused.

Although the sliding window algorithm guarantees that the number of requests in any time window never exceeds the limit, it still cannot prevent access from being overly concentrated at finer time granularity, for example all 100 requests of a 1 s window arriving in its last 10 ms. In other words, traffic limiting algorithms based on time windows, whether fixed or sliding, only cap traffic at the chosen time granularity; they do not limit access frequency at any finer granularity within it.

To deal with this problem, there are many improved versions of the time-window algorithms, such as:

Multi-layer traffic limiting. You can set multiple rules for the same interface: besides the per-second limit (say, at most 100 requests per second), also limit each 100 ms to at most 20 requests (deliberately larger than the average of 10), as sketched below. There are also improved sliding-window algorithms with higher space complexity, which we will not elaborate on here for reasons of space.
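A sketch of multi-layer limiting follows, composing two of the FixedWindowRateLimiter instances sketched earlier; for brevity it ignores the subtlety that a request rejected by one layer has already been counted by the other.

```java
// A sketch of multi-layer traffic limiting: a request passes only if every
// layer admits it. Reuses the FixedWindowRateLimiter sketched earlier.
public class MultiLayerRateLimiter {
    private final FixedWindowRateLimiter perSecond = new FixedWindowRateLimiter(100, 1000);
    private final FixedWindowRateLimiter per100Millis = new FixedWindowRateLimiter(20, 100);

    public boolean tryAcquire() {
        // Both rules must hold: <=100 hits per second AND <=20 hits per 100 ms
        return perSecond.tryAcquire() && per100Millis.tryAcquire();
    }
}
```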

Token bucket and leaky bucket traffic limiting algorithms

Above we described two traffic limiting algorithms based on time windows: the fixed window and the sliding window. Neither handles burst traffic at fine time granularity well, and their shaping effect at fine granularity is not smooth. This section describes two smoother algorithms: the token bucket and the leaky bucket. In some scenarios these two are preferable to the time window algorithms. The ideas behind the token bucket and leaky bucket are essentially similar, and the leaky bucket can be seen as an improved variant of the token bucket, so we mainly introduce the token bucket algorithm.

Let's first look at the most basic, unimproved token bucket algorithm:

  1. If at most n accesses are allowed within t seconds, add one token to the bucket every t/n seconds.
  2. The bucket holds at most b tokens. If the bucket is full when a token arrives, the token is discarded.
  3. An interface request first takes a token from the bucket. If it gets one, the request is processed; if not, the request is fused.

The token bucket algorithm looks complicated because tokens must be put into the bucket at regular intervals, but no dedicated thread is needed for this. Before taking a token, compute from the previous timestamp and the current timestamp how many tokens should have accrued in the meantime, and add them all at once. This lazy refill makes the implementation straightforward, as sketched below.
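A minimal Java sketch of the lazy-refill approach follows; the fractional-token bookkeeping is one possible implementation choice, not the only one.

```java
// A minimal sketch of the lazy-refill token bucket: instead of a dedicated
// refill thread, tokens owed since the last request are added on demand.
public class TokenBucketRateLimiter {
    private final double refillTokensPerMilli; // n per t seconds => n/(t*1000) per ms
    private final double capacity;             // b, the bucket size
    private double tokens;
    private long lastRefillTime = System.currentTimeMillis();

    public TokenBucketRateLimiter(double tokensPerSecond, double capacity) {
        this.refillTokensPerMilli = tokensPerSecond / 1000.0;
        this.capacity = capacity;
        this.tokens = capacity; // start with a full bucket
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Lazy refill: add the tokens accrued since the last call, capped at b
        tokens = Math.min(capacity, tokens + (now - lastRefillTime) * refillTokensPerMilli);
        lastRefillTime = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;  // got a token: process the request
            return true;
        }
        return false;       // no token: fuse the request
    }
}
```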

The leaky bucket algorithm differs slightly from the token bucket in that it also limits the rate at which tokens are fetched: tokens must be taken at the fixed rate of one every t/n seconds. The leaky bucket therefore shapes traffic better and more smoothly, and any burst is restricted, whereas the token bucket, with its bucket size of b, can absorb a certain amount of burst traffic. Of course, there are many other improvements to the token bucket algorithm, such as:

  1. A warm-up bucket
  2. Adding multiple tokens at a time
  3. Fetching multiple tokens at a time

Compared with the time-window algorithms, the token bucket and leaky bucket shape traffic much better, but a better shaping effect is not always more appropriate. For a token bucket without warm-up, rejection-style limiting falsely kills many requests. In the algorithm above, when n is small, say 50, a token is added only every 20 ms, while interface access within one second can be very random. The result: although the limit on maximum access frequency holds on the curve and traffic is smooth at fine time granularity, many interface requests that should not have been rejected are falsely killed.

Therefore, the token bucket and leaky bucket algorithms are well suited to blocking-style limiting, for example background jobs that, upon exceeding the maximum access frequency, are not rejected but blocked until a token becomes available. For response-time-sensitive scenarios such as microservice interface limiting, rejection-style algorithms based on time windows are more suitable. Among them, the sliding window has higher space complexity and a larger memory footprint; by contrast, although the fixed window handles boundary bursts poorly, it is simple to implement, and that simplicity brings good performance and few bugs, so the fixed time window algorithm is also a fine choice for microservice interface traffic limiting.

Making the traffic limiting algorithm distributed

In contrast to a single-node traffic limiting algorithm, a distributed traffic limiting algorithm can be deployed across multiple machines that cooperate to limit traffic on the same interface or service. The biggest difference from the single-node case is that the interface request counter needs centralized storage. For example, our open source project Ratelimiter4j implements a distributed traffic limiting algorithm based on a Redis central counter.

Once the distributed algorithm introduces Redis as an independent central counter, the complexity of the system increases considerably, because several technical problems common to distributed systems must be solved:

1. Data consistency problem

The interface traffic limiting process consists of three steps:

Step 1: read the current interface access count n;

Step 2: judge whether the request should be limited;

Step 3: write back the count n+1 if the request passes the limit check.

Under concurrency, this read-judge-write sequence is not atomic and has race conditions. In a multi-threaded environment, it can be made safe by locking or by the atomic classes in the java.util.concurrent package. In the distributed case the idea is similar: a distributed lock could ensure that only one process accesses the counter at a time. However, introducing distributed locks means introducing a new system plus the code to maintain the locks, which is costly. Redis's single-threaded execution model plus a Lua script supports the atomicity of these operations perfectly. Space does not permit a full walkthrough of the code; see the open source project Ratelimiter4j for details.
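Below is a minimal sketch (not Ratelimiter4j's actual code) of this idea using the Jedis client: a fixed-window counter whose read-judge-write happens inside one Lua script, which Redis executes atomically, so no distributed lock is needed.

```java
import java.util.Arrays;
import java.util.Collections;
import redis.clients.jedis.Jedis;

// A sketch of a Redis-backed fixed-window counter made atomic with Lua.
public class RedisFixedWindowLimiter {
    // INCR the window counter, set its expiry on the first hit, and compare
    // against the limit, all in one atomic step. Returns 1 = allow, 0 = reject.
    private static final String SCRIPT =
        "local current = redis.call('INCR', KEYS[1]) " +
        "if current == 1 then redis.call('EXPIRE', KEYS[1], ARGV[2]) end " +
        "if current > tonumber(ARGV[1]) then return 0 end " +
        "return 1";

    private final Jedis jedis;

    public RedisFixedWindowLimiter(Jedis jedis) {
        this.jedis = jedis;
    }

    // key identifies the rule, e.g. "limit:crm:/v1/user/query" (made-up format)
    public boolean tryAcquire(String key, int maxHits, int windowSeconds) {
        Object result = jedis.eval(SCRIPT,
                Collections.singletonList(key),
                Arrays.asList(String.valueOf(maxHits), String.valueOf(windowSeconds)));
        return Long.valueOf(1L).equals(result);
    }
}
```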

2. Timeout problem

Handling the various Redis exceptions is not hard: catch them, wrap them in a uniform exception and rethrow, or swallow them. However, if a Redis access hangs, the response time of the interface is seriously affected or the interface itself times out, which is an unacceptable side effect. We therefore need to set a reasonable timeout for Redis access: once it expires, the limiter is judged to have lost its limiting effect for that request, and the interface logic continues to execute. The timeout should be neither too large nor too small: too large affects the response time of the interface, while too small loses too much limiting effect. We can obtain the distribution of Redis access times through load testing or online monitoring, and then set a reasonable timeout based on the limiting latency the service interface can tolerate.
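Here is a sketch of this fail-open behavior, building on the RedisFixedWindowLimiter sketched above: the socket timeout bounds the latency the limiter can add, and any Redis error or timeout lets the request through rather than failing the interface.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisException;

// A sketch of fail-open limiting: losing a little limiting effect is cheaper
// than degrading the interface itself.
public class FailOpenLimiter {
    private final RedisFixedWindowLimiter delegate;

    public FailOpenLimiter(String redisHost, int redisPort, int timeoutMillis) {
        // e.g. timeoutMillis = 10; tune it from the measured Redis latency distribution
        this.delegate = new RedisFixedWindowLimiter(
                new Jedis(redisHost, redisPort, timeoutMillis));
    }

    public boolean tryAcquire(String key, int maxHits, int windowSeconds) {
        try {
            return delegate.tryAcquire(key, maxHits, windowSeconds);
        } catch (JedisException e) {
            // Timeout or Redis failure: judge the limiting effect lost and fail open
            return true;
        }
    }
}
```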

3. Performance problems

The performance bottleneck of the distributed algorithm lies mainly in the Redis central counter. According to the load test data of Ratelimiter4j, without Redis sharding, the distributed algorithm based on a single Redis instance performs far below the single-node, in-memory algorithm: in our test environment, the single-node algorithm reaches 2 million TPS, while the distributed algorithm reaches only 50 thousand TPS. So when applying the distributed algorithm, consider whether its performance meets the application scenario: if the TPS of the microservice interface exceeds the TPS of the limiting framework, the limiting function itself becomes a bottleneck that degrades the interface.

Besides TPS, network latency deserves special consideration. In particular, if the central counter and the services being limited are deployed across server rooms or cities, the network latency between them can be very large and seriously affect the response time of the microservice interface.

How to choose single-node or distributed traffic limiting

First, a clarification: the single-node and distributed traffic limiting discussed here are not the same concepts as the single-node and distributed traffic limiting algorithms discussed above. To improve performance and availability, microservices are deployed as multi-instance clusters. Single-node traffic limiting means each instance in the cluster limits its own interfaces independently, for example limiting each instance to 1,000 requests per second. Distributed traffic limiting means limiting at the service level, across the whole cluster, for example limiting caller A to at most 10,000 requests per minute to the user service. Distributed traffic limiting can be implemented with either the single-node algorithm or the distributed algorithm.

The original purpose of single-node limiting is to prevent burst traffic from overwhelming the server, so it suits limiting the number of concurrent requests. Distributed limiting suits fine-grained limits or access quotas, where different callers get different rules on different interfaces, so hits-per-second limiting fits better. From the point of view of system availability, single-node limiting has the advantage; distributed limiting is better at preventing one caller from over-contending for service resources.

The common deployment architectures combining distributed traffic limiting with microservices are as follows:

1. Integrate the traffic limiting function at the access layer (API gateway)

This is the most natural architecture when the microservice system already has an API gateway. If the API gateway is deployed as a single instance, the single-node limiting algorithm suffices. If the API gateway is deployed with multiple instances, a distributed limiting algorithm must be used to achieve service-level limiting.

2. Encapsulate the traffic limiting function as an RPC service

After receiving an interface request, the microservice calls the RPC interface exposed by the limiting service to check whether the request exceeds the limit. This architecture requires deploying a separate limiting service, which adds operations cost, and the performance bottleneck shifts to the RPC communication between the microservice and the limiting service: even if the single-node limiting algorithm itself reaches 2 million TPS, achieving 100,000 TPS for limit checks is already good once they go through the RPC framework.

3. Integrate the traffic limiting function into the microservice itself

This architecture needs no independently deployed service, which lowers operations cost, but the limiting code becomes somewhat coupled with the business code. The coupling can be reduced by integrating the limiting function in the aspect (AOP) layer, decoupled from business code as far as possible, as sketched below. For service-level distributed limiting, the distributed algorithm must be used; for single-node limiting per microservice instance, the single-node algorithm suffices.
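As an illustration of keeping the limiting code out of business logic, here is a sketch using a servlet filter (an AOP aspect would be integrated the same way); it reuses the FixedWindowRateLimiter sketched earlier, the threshold is arbitrary, and a distributed limiter could be swapped in for service-level rules.

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// A sketch of in-process limiting decoupled from business code via a filter.
// init() and destroy() use their Servlet 4.0 default implementations.
public class RateLimitFilter implements Filter {
    private final FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(1000, 1000);

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        if (limiter.tryAcquire()) {
            chain.doFilter(req, resp); // within the limit: run the business logic
        } else {
            // Over the limit: fuse the request before it reaches business code
            ((HttpServletResponse) resp).setStatus(429);
        }
    }
}
```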

Different fusing policies are used for different services

The fusing policy here means how interface requests are handled once the traffic limit is reached. Some policies have been mentioned already: rejection-style limiting rejects requests that exceed the maximum allowed access frequency, for example by returning HTTP status code 429; blocking-style limiting queues such requests instead. There are also other fusing actions, such as logging, sending alarms, and service degradation.

The same system may apply different fusing policies to different callers. For callers sensitive to response time, we may reject directly; for callers insensitive to response time, such as background jobs, we may block and queue.
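Guava's RateLimiter, a token-bucket variant, shows the two policies side by side: tryAcquire() implements rejection and acquire() implements blocking. The rate below is just an example.

```java
import com.google.common.util.concurrent.RateLimiter;

// A sketch contrasting the rejection and blocking fusing policies.
public class FusingPolicyDemo {
    public static void main(String[] args) {
        RateLimiter limiter = RateLimiter.create(100.0); // 100 permits per second

        // Rejection policy, for latency-sensitive callers: fail fast, e.g. HTTP 429.
        if (!limiter.tryAcquire()) {
            System.out.println("rejected: over the limit");
        }

        // Blocking policy, for latency-insensitive callers such as background jobs:
        // acquire() queues the caller until a permit becomes available.
        double waitedSeconds = limiter.acquire();
        System.out.println("proceeded after waiting " + waitedSeconds + " s");
    }
}
```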

Other scenarios also influence the choice of fusing policy. For example, when the limiting function has just been launched, in order to verify the validity of the algorithm and the reasonableness of the rules, and to make sure no request is falsely killed, we can start with a log-and-alarm fusing policy, confirm through log analysis that the limiter works correctly, and only then upgrade to a stricter fusing policy.

Different fusing policies also influence the choice of limiting algorithm. For example, token bucket and leaky bucket algorithms fit blocking-style fusing scenarios, while time-window algorithms fit rejection-style fusing scenarios.

How to configure a proper traffic limiting rule

Traffic limiting rules consist of three parts: time granularity, interface granularity, and the maximum limit value. Whether the rules are set properly directly determines whether limiting is reasonable and effective.

For the time granularity, we could limit to no more than 1,000 requests in 1 second, no more than 10 requests in 10 milliseconds, or no more than 60,000 requests in 1 minute. Although these rules look equivalent, too large a time granularity fails to achieve the limiting effect: limiting to 60,000 requests per minute still allows all 60,000 to be concentrated in a single second. Conversely, too small a granularity falsely kills many requests that should not be limited, because interface access is very random at fine time granularity. So although finer granularity gives a better shaping effect and a smoother traffic curve, finer is not always more appropriate.

For heavily loaded interfaces, such as flash sales and Double 11, traffic may be concentrated within a few seconds and TPS can reach tens or even hundreds of thousands, so a relatively small time granularity should be chosen. Conversely, if the TPS of an interface is small, a larger granularity is recommended. For example, limiting an interface to no more than 1,000 calls per minute, converted to roughly "no more than 16 requests per second," becomes an unreasonable rule: there is no reason to reject requests above 16 per second, since that rate is far below what the system can handle, and even if all 1,000 requests were concentrated in one second of that minute, the stability of the system would not be affected. A limit of 16 requests per second is therefore meaningless.

Besides the time granularity, the interface granularity must also be chosen according to the limiting requirements, for example (a sketch of such rules follows the list):

1) Limit the call frequency of each instance of a microservice interface

2) Limit the overall access frequency of the microservice cluster

3) Limit how often a caller may call a service

4) Limit how often a caller may call a particular interface of a service

5) Limit the access frequency of a particular interface of a service

6) Limit the access frequency of a particular class of interfaces of a service
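As a concrete illustration of these granularities, here is a hypothetical rule representation; it is not Ratelimiter4j's actual configuration format, and the caller IDs and URLs are made up.

```java
import java.util.Arrays;
import java.util.List;

// A hypothetical sketch of rules at different granularities.
public class RateLimitRules {
    // callerId == null means the rule applies to all callers of the interface;
    // url == null means the rule applies to the caller across the whole service.
    record Rule(String callerId, String url, int maxHits, int windowSeconds) {}

    static final List<Rule> RULES = Arrays.asList(
        new Rule(null,  null,             10_000, 1), // whole-service cap
        new Rule("crm", null,             1_000,  1), // one caller, whole service
        new Rule("crm", "/v1/user/query", 100,    1), // one caller, one interface
        new Rule(null,  "/v1/user/query", 2_000,  1)  // one interface, all callers
    );
}
```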

The maximum allowed access frequency must be set with reference to performance load test data, the expected business traffic, and online monitoring data: it should not exceed the maximum TPS the system sustained under load testing, it should be no smaller than the expected business traffic, and it should be cross-checked against online monitoring data.

How to judge whether the traffic limiting function is correct and effective

Validity here covers two aspects: the validity of the limiting algorithm and the validity of the limiting rules. Before big promotions, flash sales, or other sources of abnormal traffic arrive, we need to verify the effectiveness of the limiting function through experiments in advance, and prove with data that it intercepts unexpected abnormal traffic. Otherwise, an ill-chosen algorithm or ill-set rules may mean that when the anticipated traffic actually comes, limiting fails to protect the service, and unexpected abnormal traffic crushes the system.

How do we test that the limiting function works correctly? It can be tested with simulated traffic or by replaying online traffic, but the most effective method is to divert traffic so that it concentrates on a small group of machines. For the test results, we need to record at least the following for each request: the interface, the request time, and the limiting outcome (passed or fused), and then plot a graph from the recorded data:

From the graph, the traffic before and after limiting can be seen at a glance, along with whether the rules and algorithm shape the traffic reasonably and effectively.

Besides verification in advance, we also need to monitor the working status of the limiter continuously, so that we know in real time whether it is functioning properly. In case of a limiting anomaly, we should be able to hot-update the limiting configuration without restarting the service: enabling or disabling the limiting function, adjusting the rules, or swapping the algorithm.

A high-fault-tolerance, high-performance open source traffic limiting framework: Ratelimiter4j

Ratelimiter4j is a traffic limiting framework with high performance, high fault tolerance, and easy integration. Functionally, implementing traffic limiting is not complicated; the non-functional requirements are the technical difficulties of the system:

1) Low latency: little or no impact on the response time of the interface itself

Every microservice interface request must be checked against the configured access frequency, which inevitably adds to the response time of the interface. Response time is a very important performance metric for microservice interfaces, so keeping the limiting latency as small as possible was a special consideration when we developed the Ratelimiter4j framework.

2) High fault tolerance: failures of the limiting framework must not affect the availability of microservices

Traffic limiting exists to improve system availability and stability, so a failure of the limiter itself must not degrade microservice availability; that would be an unacceptable side effect. For example, if the Redis that the distributed algorithm depends on goes down and limiting cannot be performed, the service interface must continue to work normally.

3) High TPS: the TPS of the limiting framework must exceed the TPS of the microservice interfaces themselves

Large-scale services have high interface access frequencies, tens or even hundreds of thousands of TPS. The TPS supported by the limiting framework must be at least higher than the TPS of the service itself, or the limiter's own performance will drag the service down.

At present, Ratelimiter4j organizes the traffic limiting rules in a trie tree data structure, which allows the rule matching a request's interface to be looked up quickly. Experiments show that a trie suits interfaces like URLs, which have hierarchical directories with a high degree of prefix repetition. A sketch of the idea appears below.
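The following sketch illustrates the idea only, not the framework's actual code: rules are indexed in a trie keyed by URL path segment, so a lookup costs one map hop per segment no matter how many rules are configured.

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of indexing traffic limiting rules in a trie keyed by URL segment.
public class UrlRuleTrie {
    private static class Node {
        final Map<String, Node> children = new HashMap<>();
        Integer maxHitsPerSecond; // non-null if a rule ends at this node
    }

    private final Node root = new Node();

    public void addRule(String url, int maxHitsPerSecond) {
        Node node = root;
        for (String segment : url.split("/")) {
            if (segment.isEmpty()) continue;
            node = node.children.computeIfAbsent(segment, s -> new Node());
        }
        node.maxHitsPerSecond = maxHitsPerSecond;
    }

    public Integer findRule(String url) {
        Node node = root;
        for (String segment : url.split("/")) {
            if (segment.isEmpty()) continue;
            node = node.children.get(segment);
            if (node == null) return null; // no rule configured for this path
        }
        return node.maxHitsPerSecond;
    }
}
```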

For distributed limiting, the current load test results of Ratelimiter4j show a maximum of 50,000 TPS within an acceptable response time range; TPS is not very sensitive to the level of concurrency, and the bottleneck lies mainly in the Redis central counter. The next steps are to optimize performance by improving the algorithm and by making the central counter support sharding.

Ratelimiter4j GitHub address: https://github.com/wangzheng0822/ratelimiter4j

Conclusion

Traffic limiting is mentioned in many technical articles on microservices and service governance, though often, due to limited space and scope, not in depth. Based on concrete practical experience, this article has focused on analyzing traffic limiting for microservice interfaces. I hope readers come away with a deeper understanding of the topic.

Author: Wang Zheng. Link: https://mp.weixin.qq.com/s?__biz=MzI4MTY5NTk4Ng==&mid=2247488993&idx=1&sn=4b9d5deedd0e626c456744f04b499bbb&source=41#wechat_redirect. All rights reserved.

