Before Spring Cloud Gateway came out, Netflix Zuul was the gateway of choice in the Spring Cloud microservices world. However, Zuul 1.x was criticized for its blocking API and lack of WebSocket support, and upgrading depended on Netflix releasing a new version of Zuul. After several delays, the Spring community decided to build its own gateway component as an alternative to Netflix Zuul.

Since the Finchley release of Spring Cloud in June 2018, Spring Cloud Gateway has gradually gained prominence. Built on Spring Framework 5.0, Spring Boot 2.0, and Project Reactor, it not only provides a reactive, non-blocking API, but also supports WebSockets and integrates tightly with the Spring ecosystem. Although Zuul has since released a 2.x version with an asynchronous, non-blocking core that greatly improves its performance, Spring has no plans to integrate it.

According to the official website, the main features of Spring Cloud Gateway are as follows:

  • Built on Spring Framework 5, Project Reactor and Spring Boot 2.0
  • Able to match routes on any request attribute
  • Predicates and filters are specific to routes
  • Hystrix Circuit Breaker integration
  • Spring Cloud DiscoveryClient integration
  • Easy to write Predicates and Filters
  • Request Rate Limiting
  • Path Rewriting

As you can see, Spring Cloud Gateway integrates easily with other components in the Spring Cloud ecosystem (such as circuit breakers and service discovery), and provides a set of simple, easy-to-write predicates and filters for customized request processing on each route.

I recently used Spring Cloud Gateway in a project and built some advanced features on top of it, such as rate limiting and request auditing. I ran into quite a few challenges along the way, so I took some time to study the topic systematically and summarize it. First, I will introduce some common rate limiting scenarios and algorithms. Then I will look at some open source rate limiting projects to learn how others implement it. Finally, I will describe how I implemented rate limiting in the gateway and share some experience and pitfalls encountered along the way.

1. Common traffic limiting scenarios

Caching, degradation, and rate limiting are known as the troika of high-concurrency distributed systems. As the first checkpoint of the whole distributed system, the rate limiting function is naturally essential. Rate limiting controls the rate of service requests, improving the system's ability to cope with sudden bursts of traffic and making it more resilient. There are plenty of real-world scenarios, such as the "Double Eleven" shopping festival or snatching train tickets on 12306.

1.1 Traffic Limiting Objects

From the introduction above, the concept of rate limiting may still feel vague. What exactly is being limited? As the name implies, limiting traffic means restricting traffic, but "traffic" here is a fairly general concept. Across different scenarios, rate limiting can become very complex and is closely tied to specific business rules. Consider the following common scenarios:

  • An interface may be called at most 100 times per minute
  • A user's download speed is limited to 100 KB/s
  • A user may have at most five concurrent requests to an interface
  • A specific source IP address is denied access to all requests

As the examples above show, different rate limiting rules can be combined depending on the requester and the requested resource. You can limit by the requester's IP address, by the user behind the request, or by a specific request parameter. The object being limited can be the request frequency, the transmission rate, or the concurrency. The two most common objects are request frequency and concurrency; the corresponding methods are known as request rate limiting and concurrent request limiting. Limiting the transmission rate is common in download scenarios: some download sites throttle ordinary users, and only paying members get full speed. It is essentially similar to request rate limiting, except that one limits the number of requests and the other limits the size of the data transferred. This article focuses on request rate limiting and concurrent request limiting.

1.2 Traffic Limiting Handling Methods

When designing a rate limiting scheme, it is worth carefully considering what to return to a requester that is intercepted by the rate limiting rule. In general, there are three ways to handle a limited request:

  • Denial of service
  • Waiting in line
  • Service degradation

The simplest approach is to deny service: throw an exception and return an error (such as HTTP status code 429 Too Many Requests), or send the front end a 302 redirecting the user to an error page saying that the resource is unavailable or asking them to try again later. However, for some important interfaces, such as flash sales and order placement, we cannot simply reject requests: we do not want users to request too fast, but we do not want their requests to fail either. In this case, requests can be put into a message queue and processed later, which shaves the peak and limits the rate at the same time. The third approach is service degradation: when the rate limiting condition is triggered, return fallback data directly. For example, an interface that queries product inventory could simply return "in stock" by default.

1.3 Traffic Limiting Architectures

Different system architectures require different rate limiting schemes. As shown in the figure below, service deployment can be divided into single-machine mode and cluster mode:

Single-machine rate limiting is very simple and can be implemented directly in memory, while cluster-mode rate limiting must rely on some "centralized" component, such as a gateway or Redis. This leads to two different rate limiting architectures: gateway-layer rate limiting and middleware rate limiting.

As the entrance of the whole distributed system, the gateway receives all user requests, so the gateway is the most natural place to limit traffic. Gateway rate limiting is sometimes also called access-layer rate limiting. Besides the Spring Cloud Gateway used here, the most common gateway-layer component is Nginx, whose modules such as ngx_http_limit_req_module and ngx_http_limit_conn_module, together with directives such as limit_req_zone, limit_conn_zone, and limit_rate, make it easy to implement request rate limiting, concurrency limiting, and transmission rate limiting. I will not go into Nginx here; refer to the official Nginx documentation for details on these directives.

The other architecture is middleware rate limiting, which pushes the rate limiting logic down to the service layer. However, every service in the cluster must report its traffic information to a common place that other services can read. Redis is used most often: its key expiration and Lua script execution make it very suitable for rate limiting. Besides Redis, there are many similar distributed caching systems available, such as Hazelcast, Apache Ignite, and Infinispan.

We can extend the architecture further by running the gateway itself in cluster mode. Although this is still a gateway-layer architecture, once the gateway is clustered it must rely on middleware for rate limiting, which is no different from the middleware rate limiting discussed above.

2. Common traffic limiting algorithms

From the above we know that rate limiting can be divided into request rate limiting and concurrency limiting, and, depending on the system architecture, into gateway-layer rate limiting and distributed rate limiting. Different application scenarios call for different algorithms. This section introduces some of the mainstream rate limiting algorithms.

It is worth noting that pooling techniques, such as thread pools or connection pools, can also be used to limit load, but they are not the focus of this article.

2.1 Fixed Window Algorithm

The fixed window algorithm is one of the simplest rate limiting algorithms. It maps the request time to a time window according to the rate limiting rule, and uses a counter to accumulate the number of accesses in that window. For example, if the rule is 5 requests per minute, the time is mapped to one-minute windows: a request arriving at 11:00:45 falls into the window 11:00:00 ~ 11:00:59. A counter is kept for this window and incremented for each request; once the counter exceeds 5, the rate limit is triggered. When a request falls into the next window (11:01:00 ~ 11:01:59), the previous window's counter becomes irrelevant, the counter for the new window starts from zero, and counting begins again.

The counter algorithm is very easy to implement. In a single-machine scenario, AtomicLong, LongAdder, or Semaphore can be used for counting; in a distributed scenario, it can be implemented with Redis commands such as INCR and EXPIRE, combined with EVAL or Lua scripts (the Redis website provides several simple implementations). The algorithm can be used for both request rate limiting and concurrency limiting.
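To make this concrete, here is a minimal single-machine sketch of the fixed window counter (the class name and parameters are illustrative, not taken from any library). A synchronized method is used for simplicity; in practice AtomicLong or LongAdder can make the counting lock-free:

// A minimal single-machine fixed window limiter: at most "limit" requests per "windowMillis".
public class FixedWindowRateLimiter {

    private final int limit;          // max requests allowed per window
    private final long windowMillis;  // window size, e.g. 60_000 for one minute
    private long currentWindow;       // index of the current window
    private int counter;              // number of requests seen in the current window

    public FixedWindowRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.currentWindow = System.currentTimeMillis() / windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long window = System.currentTimeMillis() / windowMillis;  // map the request time to a window
        if (window != currentWindow) {
            currentWindow = window;   // a new window begins: the old counter is discarded
            counter = 0;
        }
        if (counter < limit) {
            counter++;
            return true;
        }
        return false;                 // limit reached within this window
    }
}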

However, the defect of this algorithm is also fairly obvious: it suffers from a serious window boundary problem. Because the counter is reset at each new window, the limiting effect is not smooth, and a malicious user can exploit this to bypass our rules. As shown in the figure below, suppose the limit is five requests per minute. A malicious user makes five requests in the last half of the 11:00:00 ~ 11:00:59 window and another five in the first half of the 11:01:00 ~ 11:01:59 window. Within a single minute the system has actually handled 10 requests.

2.2 Sliding Window Algorithm (Rolling Window)

To solve the boundary problem of the fixed window algorithm, the time window can be divided into smaller windows, and the window then slides forward one small window at a time instead of jumping a whole large window at once; this is the sliding window algorithm. We keep a counter for each small window, and the total for the large window is the sum of the small-window counters. As shown in the figure below, the time window is 5 seconds, divided into 5 small windows of one second each; every second, the window slides forward by one second:

Each time a request is processed, the sum of all small-window counters must be computed. For performance reasons, the number of small windows should not be too large: for example, if the rule is N requests per hour, divide it into 60 one-minute windows rather than 3600 one-second windows. Of course, if performance is not a concern, the finer the granularity, the smoother the limiting effect; conversely, the coarser the granularity, the less accurate the limiting and the greater the chance of boundary problems. When there is only one partition, the sliding window algorithm degenerates into the fixed window algorithm. Because both use counters, they are also collectively called counter algorithms.

Thinking further: if the granularity is the coarsest, that is, there is only one small window, the sliding window algorithm degenerates into the fixed window algorithm. What if we go in the other direction and make the granularity as fine as possible? When each window is small enough to hold at most one request, we can drop the counters altogether, simply record the time of each request, and count how many requests fall within the last period. This is the sliding window log approach; for concrete implementations see the Redis sorted set technique and the Sliding Window Log algorithm.
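As a rough illustration of the sliding window log just described, here is a minimal in-memory sketch that records the timestamp of each allowed request and counts how many fall inside the window (names are illustrative; a distributed version would keep the same log in a Redis sorted set instead):

import java.util.ArrayDeque;
import java.util.Deque;

// A minimal sliding window log limiter: at most "limit" requests in any "windowMillis" span.
public class SlidingWindowLogRateLimiter {

    private final int limit;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>(); // times of allowed requests

    public SlidingWindowLogRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Evict requests that have slid out of the window
        while (!timestamps.isEmpty() && now - timestamps.peekFirst() >= windowMillis) {
            timestamps.pollFirst();
        }
        if (timestamps.size() < limit) {
            timestamps.addLast(now);   // record this request's time
            return true;
        }
        return false;                  // too many requests inside the sliding window
    }
}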

2.3 Leaky Bucket Algorithm

Besides counter-based algorithms, another natural idea for limiting traffic is to cache all requests in a queue and process them slowly at a fixed rate, which is essentially the Leaky Bucket algorithm. The leaky bucket algorithm assumes that requests are poured into a bucket of capacity M and discarded when the bucket is full. There is a hole at the bottom of the bucket, and requests leak out like water at a constant rate of r per second. The leaky bucket algorithm can be depicted as follows:

At the top of the bucket is a faucet, and our requests flow from the faucet into the bucket, sometimes fast and sometimes slow; this uneven inflow is called bursty traffic. If the bucket is full, the excess water overflows and those requests are discarded. The rate at which water leaks from the bottom is fixed, so the leaky bucket algorithm smooths the request rate.

The leaky bucket algorithm can be implemented with a queue, as shown in the following figure:

When a request arrives, it is not processed immediately but placed in a queue; another thread then reads requests from the queue and processes them at a fixed rate, achieving rate limiting. Note that the queue can be implemented in different ways, for example by giving requests a lifetime, or by replacing it with a PriorityQueue that sorts requests by priority rather than FIFO. If the queue is full, requests must be discarded. The leaky bucket algorithm has one defect: it handles burst traffic very inefficiently, which is what led people to the token bucket algorithm described next.
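Before moving on, here is a minimal sketch of the queue-based leaky bucket described above: requests are offered to a bounded queue (the bucket), and a worker thread drains it at a fixed rate. The class name and parameters are illustrative:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A minimal leaky bucket: a bounded queue of capacity M drained at a fixed rate.
public class LeakyBucket {

    private final BlockingQueue<Runnable> bucket;   // the bucket holding pending requests
    private final long leakIntervalMillis;          // one request leaks out per interval

    public LeakyBucket(int capacity, long leakIntervalMillis) {
        this.bucket = new ArrayBlockingQueue<>(capacity);
        this.leakIntervalMillis = leakIntervalMillis;
        Thread worker = new Thread(this::leak, "leaky-bucket-worker");
        worker.setDaemon(true);
        worker.start();
    }

    // Offer a request to the bucket; if the bucket is full, the request is discarded.
    public boolean tryOffer(Runnable request) {
        return bucket.offer(request);
    }

    // Drain the bucket at a constant rate, no matter how bursty the inflow is.
    private void leak() {
        try {
            while (true) {
                Runnable request = bucket.take();
                request.run();
                Thread.sleep(leakIntervalMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}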

2.4 Token Bucket Algorithm (Token Bucket)

The Token Bucket algorithm is one of the most widely used rate limiting algorithms. Its basic idea has two parts: generating tokens and consuming tokens.

  • Generating tokens: assume a bucket that holds at most M tokens, into which tokens are added at a fixed rate (r tokens per second). When the bucket is full, no more tokens are added.
  • Consuming tokens: each request must take a token from the bucket before it is allowed through. When there are no tokens in the bucket, the rate limit is triggered; the request can then be placed in a buffer queue to wait, or rejected outright.

The token bucket algorithm is shown as follows:

In the figure above, requests are placed in a buffer queue. You can see that this part of the logic is almost identical to the leaky bucket algorithm, except that the leaky bucket processes requests at a fixed rate, while here a request is processed as soon as it obtains a token from the bucket.

If you think about it carefully, the token bucket algorithm has one key parameter: the bucket size. It is this parameter that gives the token bucket the ability to handle burst traffic. For example, set the bucket size to 100 and the token generation rate to 10 per second. After the system has been idle for a while (no tokens are consumed, so the bucket slowly fills up), a sudden burst of 50 requests can be processed immediately at a rate of 50 per second. As the tokens in the bucket are quickly used up, the processing rate slows down again and settles back to the token generation rate. This is the biggest difference from the leaky bucket algorithm, which only ever processes 10 requests per second no matter how many arrive. Of course, handling burst traffic improves responsiveness but also puts pressure on the system; if the bucket size is set carelessly, a sudden flood of traffic may overwhelm the system.

Based on the above analysis, there are generally two ways to implement a token bucket. The first starts an internal thread that continuously adds tokens to the bucket, and tokens are taken from the bucket when requests are processed, as shown in the figure above. The second does not rely on an internal thread: before each request is processed, it calculates in real time how many tokens need to be refilled, refills them, and then takes tokens from the bucket. Below is a classic implementation of the second approach, where capacity is the bucket size, refillTokensPerOneMillis is the refill rate (how many tokens per millisecond), availableTokens is how many tokens remain in the bucket, and lastRefillTimestamp is the time of the last refill.

public class TokenBucket {

    private final long capacity;
    private final double refillTokensPerOneMillis;
    private double availableTokens;
    private long lastRefillTimestamp;

    public TokenBucket(long capacity, long refillTokens, long refillPeriodMillis) {
        this.capacity = capacity;
        this.refillTokensPerOneMillis = (double) refillTokens / (double) refillPeriodMillis;
        this.availableTokens = capacity;
        this.lastRefillTimestamp = System.currentTimeMillis();
    }

    synchronized public boolean tryConsume(int numberTokens) {
        refill();
        if (availableTokens < numberTokens) {
            return false;
        } else {
            availableTokens -= numberTokens;
            return true;
        }
    }

    private void refill() {
        long currentTimeMillis = System.currentTimeMillis();
        if (currentTimeMillis > lastRefillTimestamp) {
            long millisSinceLastRefill = currentTimeMillis - lastRefillTimestamp;
            double refill = millisSinceLastRefill * refillTokensPerOneMillis;
            this.availableTokens = Math.min(capacity, availableTokens + refill);
            this.lastRefillTimestamp = currentTimeMillis;
        }
    }
}

You can create a token bucket (with a capacity of 100 and a refill rate of 100 tokens per second) like this:

TokenBucket limiter = new TokenBucket(100, 100, 1000);

As you can see from the code above, the token bucket implementation is simple and efficient: a complete rate limiting function is achieved with just a few variables. The core logic lies in the refill() method. When a token is consumed, refill() calculates the elapsed time since the last refill and, based on the refill rate, how many tokens should be added. After refilling, if the requested number of tokens is not available it returns false; otherwise it subtracts the tokens and returns true.

In practice, the original token bucket algorithm is rarely used as-is; some improvements are generally made on top of it, such as dynamically adjusting the refill rate, overdrawing the token total, or Redis-based distributed rate limiting, but these generally stay within the overall framework of the token bucket algorithm. We will see more of this when we look at some open source projects below.

3. Some open source projects

Many open source projects implement rate limiting. In this section we study a few of them to learn how rate limiting is implemented in practice.

3.1 Guava RateLimiter

Google Guava is a powerful core library that contains many useful utility classes for collections, caching, concurrency, string processing, I/O, and more. In its concurrency package, Guava provides two classes related to rate limiting: RateLimiter and SmoothRateLimiter. Guava's RateLimiter is based on the token bucket algorithm, but it improves on the traditional algorithm and supports two modes: SmoothBursty and SmoothWarmingUp.

A SmoothBursty limiter is created as follows:

RateLimiter limiter = RateLimiter.create(5);

RateLimiter.create(5) creates a limiter with a capacity of 5 that generates 5 tokens per second, i.e. one every 200 milliseconds. We consume tokens with limiter.acquire(), which returns 0 if there are enough tokens in the bucket, or blocks and returns the time waited if there are not. Let's acquire several times in a row:

System.out.println(limiter.acquire());
System.out.println(limiter.acquire());
System.out.println(limiter.acquire());
System.out.println(limiter.acquire());

The following output is displayed:

0.0
0.198239
0.196083
0.200609

As you can see, once the limiter is created it starts with one token and then generates one every 200 milliseconds, so the first request returns 0 and subsequent requests block for about 200 milliseconds each. In addition, SmoothBursty can handle bursts and even allows future tokens to be consumed in advance, as in the following example:

RateLimiter limiter = RateLimiter.create(5);
System.out.println(limiter.acquire(10));
System.out.println(limiter.acquire(1));
System.out.println(limiter.acquire(1));

You get output similar to the following:

0.0
1.997428
0.192273

After the limiter was created there was only one initial token, yet our request for 10 tokens succeeded immediately; only by looking at the subsequent requests do we see the cost: the second request waited about 2 seconds to pay back the previously overdrawn tokens.

The other kind of limiter Guava supports is SmoothWarmingUp, which is created like this:

RateLimiter limiter = RateLimiter.create(2, 3, TimeUnit.SECONDS);
System.out.println(limiter.acquire(1));
System.out.println(limiter.acquire(1));
System.out.println(limiter.acquire(1));
System.out.println(limiter.acquire(1));
System.out.println(limiter.acquire(1));

The first parameter is again the number of tokens generated per second, in this case two per second, i.e. one every 500 milliseconds. The second parameter is the warm-up period: the time it takes to go from the cold-start rate to the average rate. Let's look at the output:

0.0
1.329289
0.994375
0.662888
0.501287

The first request still gets a token immediately, but the subsequent requests behave completely differently from the smooth bursty limiter above. Normally a token would be generated every 500 milliseconds, yet the second request waited 1.3 seconds instead of 0.5, and the third and fourth requests also waited longer than usual. The wait times gradually approach 0.5 s, and only from the fifth request does the wait become normal. The interval from the first request to the fifth is the warm-up phase, and adding up the waits we can see that the warm-up period was indeed about 3 seconds.

3.2 Bucket4j

Bucket4j is a powerful rate limiting library based on the token bucket algorithm. It supports not only single-machine rate limiting, but also distributed rate limiting through distributed caches such as Hazelcast, Ignite, Coherence, and Infinispan, or any other caching provider compliant with the JCache API (JSR 107) specification.

Before using Bucket4j, it is important to understand some of the core concepts in Bucket4j:

  • Bucket
  • Bandwidth
  • Refill

The Bucket interface represents the concrete token bucket and is the entry point of our operations. It provides consumption methods such as tryConsume and tryConsumeAndReturnRemaining. A bucket can be created with the builder as follows:

Bucket bucket = Bucket4j.builder().addLimit(limit).build();
if (bucket.tryConsume(1)) {
    System.out.println("ok");
} else {
    System.out.println("error");
}

Bandwidth means bandwidth and can be understood as the rate limiting rule. Bucket4j provides two ways to create a Bandwidth: simple and classic. Below is a Bandwidth created with simple, meaning the bucket size is 10 and the refill rate is 10 tokens per minute:

Bandwidth limit = Bandwidth.simple(10, Duration.ofMinutes(1));

With the simple method, the bucket size and the refill amount are the same; the classic method is more flexible and lets you customize the refill. The following example creates a bucket of size 10 with a refill rate of 5 tokens per minute:

Refill filler = Refill.greedy(5, Duration.ofMinutes(1));
Bandwidth limit = Bandwidth.classic(10, filler);

Here Refill describes how the token bucket is refilled and defines the refill rate. Bucket4j has two refill strategies: intervally and greedy. The example above uses the greedy strategy; to use the intervally strategy, create the Refill like this:

Refill filler = Refill.intervally(5, Duration.ofMinutes(1));

The intervally strategy refills all tokens at once at each interval; in the example above, 5 tokens are added every minute, as shown below:

The greedy strategy refills tokens as greedily as possible; for the same example, it divides one minute into 5 smaller units and adds one token every 12 seconds, as shown below:

Now that you’ve looked at some of the core concepts in Bucket4j, let’s take a look at some of the features on the website:

  • Based on token bucket algorithm
  • High-performance, lock-free implementation
  • There are no precision issues, and all calculations are based on integers
  • Supports distributed traffic limiting through distributed cache systems that comply with JCache API specifications
  • Multiple bandwidths are supported for each Bucket
  • Support for synchronous and asynchronous APIs
  • Support for pluggable listener APIs for integrating monitoring and logging
  • It can be used not only for traffic limiting, but also for simple scheduling

Bucket4j provides a wealth of documentation, and it is recommended that you read through the official docs for basic usage and advanced features before using it. The article "Rate Limiting Spring MVC endpoints with Bucket4j" explains in detail how to combine Spring MVC interceptors with Bucket4j to build a rate limiting solution that does not intrude on business code, and also shows how to implement distributed rate limiting with Hazelcast. "Rate Limiting a Spring API Using Bucket4j" is another good introduction to the basics of Bucket4j; at the end it covers integration via a Spring Boot starter, and combined with the Spring Boot Actuator it is easy to feed the rate limiting metrics into a monitoring system.
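As a rough, single-machine illustration of the interceptor approach described in those articles (not a copy of their code), a Spring MVC interceptor built on the Bucket4j builder API used in this article might look like the sketch below; the limit of 10 requests per minute and the class name are my own assumptions:

import java.time.Duration;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.springframework.web.servlet.HandlerInterceptor;

import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.Bucket4j;

// A sketch of a rate limiting interceptor: 10 requests per minute for the whole application.
public class RateLimitInterceptor implements HandlerInterceptor {

    private final Bucket bucket = Bucket4j.builder()
            .addLimit(Bandwidth.simple(10, Duration.ofMinutes(1)))
            .build();

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response,
            Object handler) throws Exception {
        if (bucket.tryConsume(1)) {
            return true;                       // token available, let the request through
        }
        response.setStatus(429);               // 429 Too Many Requests
        response.getWriter().write("Too Many Requests");
        return false;                          // stop the handler chain
    }
}

The interceptor would still have to be registered through a WebMvcConfigurer, and a real implementation would typically keep one bucket per user or per IP rather than a single global bucket.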

Bucket4j's capabilities clearly go beyond Guava's rate limiter, which is positioned as a general-purpose utility class rather than a dedicated rate limiting solution. Bucket4j meets most of our requirements: it supports both single-machine and distributed rate limiting as well as integrated monitoring, and it works well with Prometheus and Grafana. It is worth noting that many open source projects, such as the JHipster API Gateway, use Bucket4j for rate limiting.

Bucket4j's only downside is that it supports request rate limiting but not concurrency limiting. Also, while Bucket4j supports distributed rate limiting, it relies on distributed caching systems such as Hazelcast and cannot use Redis. This is a problem for the many projects that already use Redis as their cache, so we need to keep exploring the open source world.

3.3 Resilience4j

Resilience4j is a lightweight, easy-to-use high-availability framework. Anyone who has used earlier versions of Spring Cloud has no doubt heard of Netflix Hystrix, which inspired the design of Resilience4j. Since Hystrix went into maintenance mode, Netflix itself has recommended Resilience4j as the replacement.

Under the hood, Resilience4j uses Vavr, a lightweight functional library for Java, which makes it well suited to functional programming. Resilience4j wraps functional interfaces and lambda expressions in the decorator pattern and provides a whole set of high-availability mechanisms: Retry, Circuit Breaker, Rate Limiter, Time Limiter, Bulkhead, Cache, and Fallback. Here we focus on two of them: Rate Limiter, which limits request frequency, and Bulkhead, which limits concurrency.

Resilience4j provides two rate limiter implementations: SemaphoreBasedRateLimiter and AtomicRateLimiter. SemaphoreBasedRateLimiter is based on semaphores: each request tries to acquire a permit, recording the time of the attempt; if the acquisition succeeds the request is allowed, and if it fails the request is limited, while an internal thread periodically refreshes the permits. This is essentially the token bucket algorithm. AtomicRateLimiter is similar to the classic implementation shown earlier: it needs no extra thread, and before each request it refills permits based on the time elapsed since the last request and the permit generation rate. For the differences between the two, see the article Rate Limiter Internals in Resilience4j.

Resilience4j also provides two bulkhead implementations, SemaphoreBulkhead and ThreadPoolBulkhead, which control the number of concurrent requests through a semaphore or a thread pool respectively. See the official documentation for details.

Here is an example of using both limiting and isolating:

// Create a Bulkhead with a maximum of 150 concurrent calls
BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
        .maxConcurrentCalls(150)
        .maxWaitTime(100)
        .build();
Bulkhead bulkhead = Bulkhead.of("backendName", bulkheadConfig);

// Create a RateLimiter that allows 1 request per second
RateLimiterConfig rateLimiterConfig = RateLimiterConfig.custom()
        .timeoutDuration(Duration.ofMillis(100))
        .limitRefreshPeriod(Duration.ofSeconds(1))
        .limitForPeriod(1)
        .build();
RateLimiter rateLimiter = RateLimiter.of("backendName", rateLimiterConfig);

// Decorate the business logic with the Bulkhead and the RateLimiter
Supplier<String> supplier = () -> backendService.doSomething();
Supplier<String> decoratedSupplier = Decorators.ofSupplier(supplier)
        .withBulkhead(bulkhead)
        .withRateLimiter(rateLimiter)
        .decorate();

// Execute the decorated supplier
Try<String> result = Try.ofSupplier(decoratedSupplier);
assertThat(result.isSuccess()).isTrue();

Resilience4j is more feature-rich than Bucket4j and supports concurrency limiting as well. Its biggest shortcoming, however, is that it does not support distributed rate limiting.

3.4 Others

There are far more open source rate limiting projects than can be covered here; the following are just the tip of the iceberg:

  • Github.com/mokies/rate…
  • Github.com/wangzheng08…
  • Github.com/wukq/rate-l…
  • Github.com/marcosbarbe…
  • Github.com/onblog/Snow…
  • Gitee.com/zhanghaiyan…
  • Github.com/Netflix/con…

As you can see, rate limiting is widely used in real projects, and people are always happy to implement their own rate limiting algorithms; new algorithms and implementations emerge in an endless stream. But looking around, I still have not found an open source project that fully meets my needs.

In fact, my requirements are simple: I need to cover two different rate limiting scenarios, request rate limiting and concurrency limiting, under two different architectures, single-machine and distributed. Next, let's implement these kinds of rate limiting in the Spring Cloud Gateway. For the first three combinations we can draw on the projects above and build on mature techniques, but for the last one, distributed concurrency limiting, there is no ready-made solution online. After several evenings of discussion with my colleagues, I came up with a new rate limiting algorithm based on double-window sliding. I offer it here as a brick thrown out to attract jade; criticism and corrections are welcome, and if you have a better method, I would be happy to discuss it.

4. Implementing traffic limiting in the gateway

When introducing the features of Spring Cloud Gateway at the beginning of this article, we noticed the item Request Rate Limiting, which means the gateway has built-in rate limiting. However, the built-in support has many limitations: it does not support single-machine rate limiting or concurrency limiting, and even its request rate limiting leaves something to be desired. These gaps we have to fill ourselves.

4.1 Single-machine request rate limiting

Spring Cloud Gateway defines a rate limiting interface, RateLimiter, as follows:

public interface RateLimiter<C> extends StatefulConfigurable<C> {
    Mono<RateLimiter.Response> isAllowed(String routeId, String id);
}

This interface has only one method, isAllowed. The first parameter, routeId, is the ID of the requested route, from which the rate limiting configuration can be loaded. The second parameter, id, is the object being limited: it can be a user ID, an IP address, or anything else obtained from the ServerWebExchange. We can see how isAllowed is called in RequestRateLimiterGatewayFilterFactory:

@Override
public GatewayFilter apply(Config config) {
    // Get the KeyResolver from the configuration
    KeyResolver resolver = getOrDefault(config.keyResolver, defaultKeyResolver);
    // Get the RateLimiter from the configuration
    RateLimiter<Object> limiter = getOrDefault(config.rateLimiter, defaultRateLimiter);
    boolean denyEmpty = getOrDefault(config.denyEmptyKey, this.denyEmptyKey);
    HttpStatusHolder emptyKeyStatus = HttpStatusHolder
            .parse(getOrDefault(config.emptyKeyStatus, this.emptyKeyStatusCode));

    return (exchange, chain) -> resolver.resolve(exchange).defaultIfEmpty(EMPTY_KEY)
            .flatMap(key -> {
                // Handle an empty key before calling the isAllowed() method
                if (EMPTY_KEY.equals(key)) {
                    if (denyEmpty) {
                        setResponseStatus(exchange, emptyKeyStatus);
                        return exchange.getResponse().setComplete();
                    }
                    return chain.filter(exchange);
                }
                // Get the route ID
                String routeId = config.getRouteId();
                if (routeId == null) {
                    Route route = exchange
                            .getAttribute(ServerWebExchangeUtils.GATEWAY_ROUTE_ATTR);
                    routeId = route.getId();
                }
                return limiter.isAllowed(routeId, key).flatMap(response -> {

                    for (Map.Entry<String, String> header : response.getHeaders()
                            .entrySet()) {
                        exchange.getResponse().getHeaders().add(header.getKey(),
                                header.getValue());
                    }
                    // If the request is allowed, continue down the filter chain
                    if (response.isAllowed()) {
                        return chain.filter(exchange);
                    }
                    // Otherwise set the configured status code and complete the response
                    setResponseStatus(exchange, config.getStatusCode());
                    return exchange.getResponse().setComplete();
                });
            });
}

The object to be limited can be customized through the resolve method of the KeyResolver interface:

public interface KeyResolver {
    Mono<String> resolve(ServerWebExchange exchange);
}

For example, HostAddrKeyResolver can limit traffic by IP address:

public class HostAddrKeyResolver implements KeyResolver {
    @Override
    public Mono<String> resolve(ServerWebExchange exchange) {
        return Mono.just(exchange.getRequest().getRemoteAddress().getAddress().getHostAddress());
    }
}

The RateLimiter interface provides only one implementation class, RedisRateLimiter:

It is obviously based on Redis. Although Redis could also be used to limit traffic on a single machine, that always feels like overkill, and it is unfriendly to environments without Redis. So we would like to implement true local rate limiting.

Among the Spring Cloud Gateway pull requests we found a new feature, feature/local-rate-limiter, and judging from the commit history this feature is likely to be merged into the 3.0.0 release. Its LocalRateLimiter.java shows that the local rate limiting is implemented with Resilience4j:

public Mono<Response> isAllowed(String routeId, String id) {
    Config routeConfig = loadConfiguration(routeId);

    // How many requests per second do you want a user to be allowed to do?
    int replenishRate = routeConfig.getReplenishRate();

    // How many seconds for a token refresh?
    int refreshPeriod = routeConfig.getRefreshPeriod();

    // How many tokens are requested per request?
    int requestedTokens = routeConfig.getRequestedTokens();

    final io.github.resilience4j.ratelimiter.RateLimiter rateLimiter = RateLimiterRegistry
            .ofDefaults()
            .rateLimiter(id, createRateLimiterConfig(refreshPeriod, replenishRate));

    final boolean allowed = rateLimiter.acquirePermission(requestedTokens);
    final Long tokensLeft = (long) rateLimiter.getMetrics().getAvailablePermissions();

    Response response = new Response(allowed, getHeaders(routeConfig, tokensLeft));
    return Mono.just(response);
}

Interestingly, there is also an earlier version of this class that is implemented based on Bucket4j:

public Mono<Response> isAllowed(String routeId, String id) {

    Config routeConfig = loadConfiguration(routeId);

    // How many requests per second do you want a user to be allowed to do?
    int replenishRate = routeConfig.getReplenishRate();

    // How much bursting do you want to allow?
    int burstCapacity = routeConfig.getBurstCapacity();

    // How many tokens are requested per request?
    int requestedTokens = routeConfig.getRequestedTokens();

    final Bucket bucket = bucketMap.computeIfAbsent(id,
            (key) -> createBucket(replenishRate, burstCapacity));

    final boolean allowed = bucket.tryConsume(requestedTokens);

    Response response = new Response(allowed,
            getHeaders(routeConfig, bucket.getAvailableTokens()));
    return Mono.just(response);
}

The two implementations are similar. Bucket4j and Resilience4j were introduced in detail above, so I will not repeat them here. It does suggest, however, that the Spring ecosystem favors Resilience4j, and we can just as well introduce it into our own projects.

4.2 Distributed request rate limiting

The previous section covered single-machine request rate limiting; now let's look at distributed request rate limiting. Spring Cloud Gateway ships with a rate limiting implementation, RedisRateLimiter, that can be used for distributed rate limiting. Its core is a Lua script, request_rate_limiter.lua, located under src/main/resources/META-INF/scripts:

local tokens_key = KEYS[1]
local timestamp_key = KEYS[2]

local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local fill_time = capacity/rate
local ttl = math.floor(fill_time*2)

local last_tokens = tonumber(redis.call("get", tokens_key))
if last_tokens == nil then
  last_tokens = capacity
end

local last_refreshed = tonumber(redis.call("get", timestamp_key))
if last_refreshed == nil then
  last_refreshed = 0
end

local delta = math.max(0, now-last_refreshed)
local filled_tokens = math.min(capacity, last_tokens+(delta*rate))
local allowed = filled_tokens >= requested
local new_tokens = filled_tokens
local allowed_num = 0
if allowed then
  new_tokens = filled_tokens - requested
  allowed_num = 1
end

if ttl > 0 then
  redis.call("setex", tokens_key, ttl, new_tokens)
  redis.call("setex", timestamp_key, ttl, now)
end

return { allowed_num, new_tokens }

This script is almost identical to the classic Java token bucket implementation shown earlier. A Lua script is used mainly to take advantage of Redis's single-threaded model and the atomicity of Lua script execution, which prevents concurrent access from pushing the request count over the limit. Imagine there is one token left in the bucket and two requests arrive at the same time: the checks for remaining tokens also happen at the same time, both requests conclude that one token is left, and both are allowed.

There are two ways to configure Spring Cloud Gateway's built-in rate limiting. The first is through the configuration file; for example, the following snippet limits the traffic of a single route:

spring:
  cloud:
    gateway:
      routes:
      - id: test
        uri: http://httpbin.org:80/get
        filters:
        - name: RequestRateLimiter
          args:
            key-resolver: '#{@hostAddrKeyResolver}'
            redis-rate-limiter.replenishRate: 1
            redis-rate-limiter.burstCapacity: 3

Here key-resolver uses the SpEL expression #{@beanName} to look up the hostAddrKeyResolver bean in the Spring container; burstCapacity is the size of the token bucket; and replenishRate is the number of tokens added to the bucket per second.
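For the #{@hostAddrKeyResolver} expression to resolve, the HostAddrKeyResolver shown earlier must be registered as a bean with that name, for example (a minimal sketch):

@Bean
public HostAddrKeyResolver hostAddrKeyResolver() {
    return new HostAddrKeyResolver();
}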

The second way is through the following code:

@Bean
public RouteLocator myRoutes(RouteLocatorBuilder builder) {
  return builder.routes()
    .route(p -> p
      .path("/get")
      .filters(filter -> filter.requestRateLimiter()
        .rateLimiter(RedisRateLimiter.class, rl -> rl.setBurstCapacity(3).setReplenishRate(1)).and())
      .uri("http://httpbin.org:80"))
    .build();
}

This also limits the traffic of a single route. Note, however, that Spring Cloud Gateway's replenishRate has a big pitfall: it does not accept fractional values, so tokens are added to the bucket at a rate of at least one per second. If my rule is 10 requests per minute (i.e. one token every 6 seconds, or 1/6 of a token per second), Spring Cloud Gateway cannot limit traffic correctly in this case. There is an open issue about this, "Support greater than a second resolution for the rate limiter", but it has not been resolved yet.

4.3 Single-machine concurrency limiting

When studying Resilience4j above, we mentioned one of its features: Bulkhead. A bulkhead is the partition wall inside a ship that separates the hull into watertight compartments, so that if one compartment is breached and flooded, only that compartment is lost and the others are unaffected. The software industry borrowed this idea from shipbuilding and calls it the bulkhead pattern. The bulkhead pattern is generally used for service isolation: for important system resources such as CPU, memory, and connection counts, each service can have its own quota, so that an abnormal service cannot exhaust all the resources of the system. This idea of service isolation can also be used for concurrency limiting.

As mentioned above, Resilience4j provides two Bulkhead implementations, SemaphoreBulkhead and ThreadPoolBulkhead, which correspond to the two common implementations of the bulkhead pattern: a counting semaphore and a fixed-size thread pool. Considering the cost of thread switching in multi-threaded scenarios, the semaphore-based implementation is recommended by default.

In an operating systems course we learn two terms: Mutex and Semaphore. A mutex is used for mutual exclusion between threads, similar to a critical section: only the thread holding the mutex object may access the resource. Because there is only one mutex object, at most one thread can access the shared resource at any time, which allows multiple threads to access and operate on it safely. A semaphore is different: it allows multiple threads to use the shared resource at the same time, but sets an upper limit on how many threads may access it simultaneously, which is exactly concurrency control.

Here is a simple example of using semaphores to limit concurrent access:

public class SemaphoreTest {

    private static ExecutorService threadPool = Executors.newFixedThreadPool(100);
    private static Semaphore semaphore = new Semaphore(10);

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            threadPool.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        semaphore.acquire();
                        System.out.println("Request processing ...");
                        semaphore.release();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        threadPool.shutdown();
    }
}

Here 100 threads are created and executed at the same time, but because the semaphore count is 10, only 10 threads are processing requests at any moment. Speaking of counting, besides Semaphore there are many other classes in Java that can count, such as AtomicLong or LongAdder, which are also very common in concurrency limiting, although they do not provide the blocking capability that a semaphore does:

public class AtomicLongTest {

    private static ExecutorService threadPool = Executors.newFixedThreadPool(100);
    private static AtomicLong atomic = new AtomicLong();

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            threadPool.execute(new Runnable() {
                @Override
                public void run() {
                    if (atomic.incrementAndGet() > 10) {
                        atomic.decrementAndGet();
                        System.out.println("Request rejected ...");
                        return;
                    }
                    System.out.println("Request processing ...");
                    atomic.decrementAndGet();
                }
            });
        }
        threadPool.shutdown();
    }
}

4.4 Implement distributed concurrent traffic limiting

By implementing single-machine concurrency limiting, we have covered several common approaches: semaphores, thread pools, and counters, all of which are single-machine concepts. If we can build distributed semaphores, distributed thread pools, and distributed counters, then distributed concurrency limiting becomes easy.

"Distributed thread pool" is a term I coined myself; I could not find a similar concept online. The closest thing is resource scheduling and allocation, but that does not quite feel like the same idea.

Distributed semaphores, however, do exist. For example, Apache Ignite provides IgniteSemaphore for creating distributed semaphores in a very similar way. Redis's ZSET can also be used to implement distributed semaphores, and the book Redis in Action shows how to build counting semaphores this way. In addition, Redisson implements a Redis-based distributed semaphore, RSemaphore, whose usage is similar to Semaphore. With a distributed semaphore, distributed concurrency limiting can be implemented in almost the same way as the single-machine version above.
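For example, with Redisson's RSemaphore the single-machine semaphore example above carries over almost unchanged. The following is only a sketch; the Redis address, semaphore name, and permit count are arbitrary assumptions:

import org.redisson.Redisson;
import org.redisson.api.RSemaphore;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class DistributedSemaphoreTest {

    public static void main(String[] args) throws InterruptedException {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        // All gateway instances share the same semaphore stored in Redis
        RSemaphore semaphore = redisson.getSemaphore("concurrentRequests");
        semaphore.trySetPermits(10);   // at most 10 requests processed concurrently

        semaphore.acquire();           // P operation: take a permit, blocking if none is left
        try {
            System.out.println("Request processing ...");
        } finally {
            semaphore.release();       // V operation: give the permit back
        }

        redisson.shutdown();
    }
}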

Finally, there are many ways to implement a distributed counter; for example, it is easy with Redis INCR, and it can even be done with a MySQL database. When using a counter, pay attention to the atomicity of the operations: each request must go through three steps, namely read the current counter value, check whether it exceeds the threshold and reject if so, and otherwise increment the counter. This is essentially the semaphore's P operation, and decrementing the counter when the request ends corresponds to the V operation.
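To make the atomicity requirement concrete, here is a small sketch using Jedis and an embedded Lua script (the key name and the script are illustrative): the read, the threshold check, and the increment all happen inside one script, so two concurrent requests cannot both slip past the threshold:

import java.util.Collections;

import redis.clients.jedis.Jedis;

public class DistributedCounterLimiter {

    // Increment only if the current value is below the limit; return 1 if allowed, 0 otherwise.
    private static final String SCRIPT =
            "local current = tonumber(redis.call('get', KEYS[1]) or '0') " +
            "if current >= tonumber(ARGV[1]) then return 0 end " +
            "redis.call('incr', KEYS[1]) " +
            "return 1";

    private final Jedis jedis;
    private final String key;
    private final int limit;

    public DistributedCounterLimiter(Jedis jedis, String key, int limit) {
        this.jedis = jedis;
        this.key = key;
        this.limit = limit;
    }

    // The P operation: check the threshold and increment in a single atomic script.
    public boolean tryAcquire() {
        Object allowed = jedis.eval(SCRIPT,
                Collections.singletonList(key),
                Collections.singletonList(String.valueOf(limit)));
        return Long.valueOf(1L).equals(allowed);
    }

    // The V operation: the request has finished, decrement the counter.
    public void release() {
        jedis.decr(key);
    }
}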

So, is distributed concurrency limiting as simple as using a distributed semaphore or counter? Of course not. In fact, the single-machine snippets above that limit concurrency with a semaphore or a counter contain a serious bug:

semaphore.acquire();
System.out.println("Request processing ...");
semaphore.release();

Imagine what happens if an exception is thrown while the request is being processed. The semaphore has been acquired by the thread but is never released, and if this happens often enough, the semaphore eventually fills up and no more requests can get in. In a single-machine scenario this is easy to fix with a finally block:

try {
    semaphore.acquire();
    System.out.println("Request processing ...");
} catch (InterruptedException e) {
    e.printStackTrace();
} finally {
    semaphore.release();
}

Because the code in the finally block runs no matter what exception occurs, the semaphore is guaranteed to be released. In a distributed system, however, it is not as simple as adding a finally, because the failures that occur are not necessarily exceptions the code can catch: a service may crash, the machine may go down unexpectedly, and even a normal service restart can leave a distributed semaphore unreleased.

After several evenings of discussion with my colleagues, I came up with two solutions to this problem: the first uses counters with a TTL, and the second is a somewhat tricky algorithm based on double-window sliding.

The first method is easier to understand. We give each request a unique ID and write a key-value pair into Redis, where the key is requests_xxx (xxx is the request ID) and the value is 1, and we set a TTL on the key (if your application has very long requests, such as WebSocket requests that can last hours, you may need a thread that periodically refreshes the key's TTL). When a request arrives, we count the existing requests_* keys; if the count exceeds the upper limit, the request is rejected, otherwise the key is written and the request is processed, and when the request ends the key is deleted. This handles service crashes and restarts nicely: because every key has a TTL, the keys disappear by themselves after a while, so no "semaphore" stays occupied forever. The problem is that counting keys with the KEYS command is very inefficient, and with a large number of requests it seriously hurts gateway performance. Replacing KEYS with SCAN gives a slight improvement, but overall the performance is still poor.
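A sketch of this first approach with Jedis might look like the following; the key prefix, the TTL, and the use of KEYS are all illustrative, and as noted above the KEYS/SCAN call is exactly where the performance problem lies (the count-then-set here is also not atomic; a Lua script could combine the two steps):

import java.util.UUID;

import redis.clients.jedis.Jedis;

public class TtlKeyConcurrencyLimiter {

    private final Jedis jedis;
    private final int limit;          // max concurrent requests across all instances
    private final int ttlSeconds;     // longer than the longest expected request

    public TtlKeyConcurrencyLimiter(Jedis jedis, int limit, int ttlSeconds) {
        this.jedis = jedis;
        this.limit = limit;
        this.ttlSeconds = ttlSeconds;
    }

    // Returns the request id if the request is admitted, or null if it is rejected.
    public String tryAcquire() {
        // Count the in-flight requests; in production SCAN should replace KEYS
        if (jedis.keys("requests_*").size() >= limit) {
            return null;
        }
        String requestId = UUID.randomUUID().toString();
        jedis.setex("requests_" + requestId, ttlSeconds, "1");
        return requestId;
    }

    public void release(String requestId) {
        jedis.del("requests_" + requestId);   // normally removed here; otherwise the TTL cleans it up
    }
}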

The first approach can be optimized further: instead of writing one key per request, we give each instance in the distributed system a unique ID and write one key per instance, with key instances_xxx (xxx is the instance ID) and the instance's current concurrency as the value. Again we set a TTL on the key and start a thread that refreshes it periodically. Each time a request is accepted, the counter is incremented; when the request ends, it is decremented, exactly as in the single-machine scenario, except that to check the concurrency we still need KEYS or SCAN to fetch all instance keys and sum them. Since the number of instances is limited, however, the performance is much better than the previous approach.

The second method, which I call the double-window sliding algorithm, combines TTL counters with the sliding window idea (a simplified sketch of the Redis bookkeeping is given after the list of details below). We use one-minute time windows; each window corresponds to a key in Redis, such as 202009051130, whose value is a counter representing the number of requests. When a request is accepted, the counter of the current window is incremented; when the request ends, the counter of the current window is decremented. Note that the window in which a request is accepted and the window in which it ends may differ. In addition, each instance keeps a local list of its in-flight requests together with the window in which each was accepted, and runs a timed thread with an interval smaller than the window (say 30 seconds) that migrates expired requests, that is, requests whose recorded window differs from the current window. How does the migration work? We first count how many requests in the list have expired, update their recorded window to the current window, and then move the corresponding amount in Redis from the old window to the current window, i.e. subtract X from the previous window and add X to the current window. Because the migration thread runs periodically, expired requests are always moved forward to the current window, so at any time only the previous window and the current window in Redis hold data, and data in older windows has already been migrated away; we can therefore give each key a TTL of 3 or 5 minutes. To determine the concurrency there are only two keys to consider, so a single MGET of the two values and an addition is enough. The following flow chart describes the algorithm in detail:

There are a few details to note:

  1. When a request ends, simply decrement the counter of the current time window in Redis, even if the result goes negative. The request does not need to be removed from the local list immediately; it can be marked as finished and cleaned up by the migration thread (if the request starts and ends within the same window, it can of course be deleted directly);
  2. The migration interval must be shorter than the time window; 30 seconds is a reasonable choice;
  3. The keys in Redis must have a TTL of at least two time windows, usually set to 3 minutes;
  4. The migration involves two operations: subtracting X from the previous time window and adding X to the current time window;
  5. MGET can read the values of both time windows in one call instead of issuing two GETs;
  6. When reading the concurrency and deciding whether it exceeds the limit, the atomicity of the operations must also be considered.
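To make the bookkeeping easier to picture, below is a deliberately simplified sketch of the Redis side of the algorithm (admission check, accept, and release); the local request list and the migration thread described above are omitted, the key format is illustrative, and this is by no means a complete implementation:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;

import redis.clients.jedis.Jedis;

public class DoubleWindowSketch {

    private static final int TTL_SECONDS = 3 * 60;   // at least two time windows
    private final Jedis jedis;
    private final int limit;

    public DoubleWindowSketch(Jedis jedis, int limit) {
        this.jedis = jedis;
        this.limit = limit;
    }

    // Current one-minute window, e.g. "202009051130"
    private String currentWindow() {
        return new SimpleDateFormat("yyyyMMddHHmm").format(new Date());
    }

    private String previousWindow() {
        return new SimpleDateFormat("yyyyMMddHHmm").format(new Date(System.currentTimeMillis() - 60_000));
    }

    public boolean tryAcquire() {
        // Concurrency = previous window counter + current window counter (one MGET of two keys)
        List<String> values = jedis.mget(previousWindow(), currentWindow());
        long concurrent = 0;
        for (String v : values) {
            if (v != null) {
                concurrent += Long.parseLong(v);
            }
        }
        if (concurrent >= limit) {
            return false;                 // note: in a real implementation the check and the
        }                                 // increment should be combined atomically
        String window = currentWindow();
        jedis.incr(window);               // request accepted in the current window
        jedis.expire(window, TTL_SECONDS);
        return true;
    }

    public void release() {
        // When the request ends, decrement the current window, even if it goes negative
        jedis.decr(currentWindow());
    }
}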

Conclusion

As an important part of a microservice architecture, the gateway occupies a "one man guards the pass, ten thousand cannot get through" position, so the stability and performance requirements on gateway services are very high. Generation after generation of programmers have come up with all kinds of techniques to keep the gateway stable: rate limiting, circuit breaking, isolation, caching, degradation, and so on. Starting from rate limiting, this article described the common scenarios and algorithms in detail, along with source-level implementations and the pitfalls you may step into. Although rate limiting is only a small feature of the gateway, it touches every aspect of it and matters a great deal in system architecture design.

Although I have tried to cover rate limiting from different angles, this is still only a glimpse, and much has been left out. For example, Alibaba's open source Sentinel component can also be used for rate limiting. And Netflix no longer maintains Hystrix, focusing instead on its concurrency-limits project, which aims to build an adaptive, flexible rate limiting component; it borrows from TCP congestion control algorithms to achieve automatic rate limiting. If you are interested, visit its project homepage for more information.

This article is quite long, so omissions are hard to avoid; if you have any questions, please feel free to leave a comment.

Source: www.aneasystone.com/archives/20…
